About Stein's estimators: the original result and extensions


Author: Demaret, Tom
Supervisor: Swan, Yvik
Faculty: Faculty of Sciences

Degree: Master in mathematical sciences, with a specialized focus in statistics
Academic year: 2018-2019

URI/URL: http://hdl.handle.net/2268.2/6981

Notice to users:

All documents placed in open access on the MatheO site are protected by copyright. In accordance with the principles set out by the "Budapest Open Access Initiative" (BOAI, 2002), users of the site may read, download, copy, transmit, print, search, or link to the full text of these documents, crawl them for indexing, use them as data for software, or use them for any other lawful purpose (or any purpose provided for by copyright regulations). Any use of the document for commercial purposes is strictly forbidden.

Furthermore, users undertake to respect the moral rights of the author, chiefly the right to the integrity of the work and the right of attribution, in any use they make of it. For example, when reproducing a document in part or in full, users shall cite the sources completely, as mentioned above. Any use not explicitly authorized above (such as modifying the document or summarizing it) requires the prior and express authorization of the authors or their rights holders.


Master’s thesis

About Stein’s estimators

The original result and extensions

Author:

Tom Demaret

Supervisor:

Yvik Swan

A thesis submitted in fulfillment of the requirements for the Master’s degree in mathematics

Department of mathematics Faculty of Sciences

Academic year 2018-2019


I would like to thank Pr. Yvik Swan for proposing this topic to me and for his help and patience along the way.

Thanks also to the readers: in the hope they will have an enjoyable time and find this work interesting.


1 Stein’s original result
2 First generalization: elliptically symmetric laws
3 Second generalization: high dimension
   3.1 Reminders on the Moore-Penrose inverse
   3.2 The main result
   3.3 Technical results
4 Simulations
A Appendix: Technical results
   A.1 Results on Wishart matrices
   A.2 Results on the Moore-Penrose Inverse
B Appendix: R code

Suppose you are interested in estimating the mean θ of a Gaussian law and have at your disposal, to this end, a single Gaussian random variable X (which, as the Gaussian law is sum-stable, could itself be the sample mean of a large number of independent Gaussian random variables). Given the symmetry of the normal distribution, our intuition tells us that looking at X itself is the best way to estimate θ. To assess the quality of this intuitive estimator, the squared error loss function and its associated risk, the mean square error, are the most common evaluation criteria. There are many reasons behind this popularity: convenience, elegance, mathematical tractability... In our case, an additional reason is its intrinsic link with the Gaussian law. Indeed, for the normal distribution, taking the maximum likelihood estimator is the same as minimizing the squared error, and the estimator obtained this way is X, confirming our intuition. There exist other ways to further assess the quality of an estimator: two important ones, which set up the premises of this work, are given in the following definitions.

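Definition. An estimator $\delta^*$ of a parameter $\theta \in \Theta$ is minimax with respect to a risk function $R(\theta, \delta)$ if

\[ \sup_{\theta\in\Theta} R(\theta, \delta^*) = \inf_{\delta}\, \sup_{\theta\in\Theta} R(\theta, \delta). \]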

Intuitively, an estimator is minimax if it is "the best in the worst case".

Definition. An estimator $\delta^*$ of a parameter $\theta \in \Theta$ is admissible with respect to a risk function $R(\theta, \delta)$ if no other estimator dominates it, meaning there does not exist an estimator $\delta$ such that $R(\theta, \delta) \le R(\theta, \delta^*)$ for all $\theta$ and $R(\theta, \delta) < R(\theta, \delta^*)$ for at least one $\theta$.

In one dimension, the estimator $X$ is minimax and admissible (see [12] and references therein), and at first glance, nothing seems to indicate that it should be any different in higher dimensions. In fact, the estimator $X$, for a variable whose mean is the vector $\theta$, is minimax in any dimension, and if the dimension is $p$ and the covariance is $\sigma^2 I$ (we look at this case for simplicity, as it is equivalent to looking at $p$ independent one-dimensional Gaussian variables and it makes computations easy), then the mean square error, $E[\|X-\theta\|^2]$, is equal to $p\sigma^2$. It has also been shown to be admissible for $p = 2$ (see [12]), but in 1956, Charles Stein found estimators that dominate $X$ as soon as $p$ is greater than or equal to 3. More precisely, he exhibited a biased estimator of the form $X + g(X)$, with $g$ a certain function, that is also minimax: taking $\sigma^2 = 1$, its risk is $p$ minus a term proportional to $E[1/\|X\|^2]$, so it becomes equal to that of $X$ when the norm of $\theta$ goes to infinity, but is lower for all finite $\theta$. One could say it exploits the weakness in the definition of a minimax estimator. This somewhat paradoxical result, stating in essence that combining information on unrelated events can bring better results overall than looking at each individually, came as a big surprise when first presented, but soon led to a plethora of research. This thesis aims to explore some of this research.

In Chapter 1, the first result by Stein is presented: starting from the simple case of identity covariance matrix and working our way step by step to the most direct generalizations.

This is mostly taken from [12, 13, 14].

In Chapter 2, the estimators are extended to different probability laws. The link between mean square error and the Gaussian law comes from the presence of a (x−θ)² term in the density function. This term appears in other laws, called elliptically symmetric laws, so it is natural that those laws behave similarly. This is taken from [6].

In our world filled with data, the case of high dimension and low sample size is getting more and more common and brings new kinds of problems. This is explored in Chapter 3.

The main result is from [2].

Finally, the results are illustrated through simulations in Chapter 4 and some technical side results used are proved in the Appendix.


Stein’s original result

The James-Stein estimator is an estimator for the mean $\theta$ of a normal distribution which, at the price of a bias, dominates the usual estimator under mean square error, $E[\|\hat\theta-\theta\|^2]$, where $\hat\theta$ is our estimator.

The result is based on a simple lemma, proved using Fubini’s theorem.

Lemma 1.1. Let $X$ be a real random variable following a standard Gaussian law $N(0,1)$ and $g : \mathbb{R} \to \mathbb{R}$ an absolutely continuous function such that $g'$ is integrable. Then, if $E|g'(X)| < \infty$,

\[ E[g'(X)] = E[X g(X)]. \]

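Proof. The density function $\varphi(x)$ of the standard Gaussian law, as it will be denoted from now on, is such that $\varphi'(x) = -x\varphi(x)$.

Note also that, using $\int_{\mathbb{R}} y\varphi(y)\,dy = 0$,

\[ \int_{-\infty}^{x} -y\varphi(y)\,dy = \int_{x}^{+\infty} y\varphi(y)\,dy \qquad \forall x \in \mathbb{R}. \]

We then have

\[
\begin{aligned}
E[g'(X)] &= \int_{\mathbb{R}} g'(x)\varphi(x)\,dx \\
&= \int_{\mathbb{R}} g'(x) \left( \int_{-\infty}^{x} (-y\varphi(y))\,dy \right) dx \\
&= \int_{0}^{+\infty} g'(x) \left( \int_{x}^{+\infty} y\varphi(y)\,dy \right) dx - \int_{-\infty}^{0} g'(x) \left( \int_{-\infty}^{x} y\varphi(y)\,dy \right) dx \\
&= \int_{0}^{+\infty} y\varphi(y) \left( \int_{0}^{y} g'(x)\,dx \right) dy - \int_{-\infty}^{0} y\varphi(y) \left( \int_{y}^{0} g'(x)\,dx \right) dy \\
&= \int_{\mathbb{R}} y\varphi(y)\,(g(y)-g(0))\,dy \\
&= E[Xg(X)] - g(0)E[X] = E[Xg(X)].
\end{aligned}
\]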

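The result can be extended to any Gaussian random variable $Y \sim N(\mu, \sigma^2)$ by writing $Y = \sigma X + \mu$, where $X \sim N(0,1)$, and $h(y) = g\left(\frac{y-\mu}{\sigma}\right)$. We then obtain

\[
\begin{aligned}
E[h'(Y)] &= \frac{1}{\sigma} E\left[ g'\left(\frac{Y-\mu}{\sigma}\right) \right] = \frac{1}{\sigma} E[g'(X)] = \frac{1}{\sigma} E[Xg(X)] \\
&= \frac{1}{\sigma} E\left[ \frac{Y-\mu}{\sigma}\, g\left(\frac{Y-\mu}{\sigma}\right) \right] = E\left[ \frac{Y-\mu}{\sigma^2}\, h(Y) \right].
\end{aligned}
\]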
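As a quick sanity check of Lemma 1.1, the identity can be verified by simulation. The sketch below is only an illustration under stated assumptions: the test function g = sin, the sample size and the seed are arbitrary choices that do not come from the thesis, and Python's standard library is used for convenience (the thesis's own code, in the appendix, is in R). For g = sin, both sides have the closed form $e^{-1/2}$.

```python
import math
import random


def stein_lemma_check(n_samples=200_000, seed=0):
    """Monte Carlo estimates of E[g'(X)] and E[X g(X)] for g = sin, X ~ N(0, 1)."""
    rng = random.Random(seed)
    lhs = 0.0  # running sum of g'(X) = cos(X)
    rhs = 0.0  # running sum of X * g(X) = X * sin(X)
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)
        lhs += math.cos(x)
        rhs += x * math.sin(x)
    return lhs / n_samples, rhs / n_samples


lhs, rhs = stein_lemma_check()
exact = math.exp(-0.5)  # closed form of E[cos X] when X ~ N(0, 1)
print(f"E[g'(X)] ~ {lhs:.4f}, E[X g(X)] ~ {rhs:.4f}, exact value {exact:.4f}")
```

Both estimates should agree with each other and with the exact value up to Monte Carlo error.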

The result then needs to be extended to any dimension. To remain as general as possible, we will use the following definition.

Definition 1.1. A function $h : \mathbb{R}^p \to \mathbb{R}$ is called almost differentiable if there exists a function $\nabla h : \mathbb{R}^p \to \mathbb{R}^p$ such that, for all $a \in \mathbb{R}^p$,

\[ h(x+a) = h(x) + \int_0^1 a \cdot \nabla h(x+ta)\,dt. \]

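The function $\nabla h$ is essentially the vector of partial derivatives, which is why we will use the following notations from now on. For a function $f : \mathbb{R}^p \to \mathbb{R}$ and a vector-valued function $g = (g_1, \dots, g_p) : \mathbb{R}^p \to \mathbb{R}^p$,

• $\nabla f = (\partial_1 f, \dots, \partial_p f)'$

• $\mathrm{div}(g) = \sum_{i=1}^p \partial_i g_i$

• $\Delta f = \sum_{i=1}^p \partial_i^2 f$.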

Lemma 1.2. Let $X$ be a $p$-dimensional random variable following a Gaussian law with mean $\theta$ and the identity as covariance matrix, and $h : \mathbb{R}^p \to \mathbb{R}$ an almost differentiable function. If $E\|\nabla h(X)\| < \infty$, then

\[ E[\nabla h(X)] = E[(X-\theta)h(X)]. \]

Proof. For all $i \in \{1, \dots, p\}$, write $X = (X_i, X_{-i})$, meaning that we decompose $X$ between its $i$th component and all the others. Because $X$ is normal with the identity as covariance matrix, $X_i$ and $X_{-i}$ are independent and therefore, using Lemma 1.1 (in its extension to a Gaussian variable with mean $\theta_i$ and unit variance), we find that

\[ E[\partial_i h(X) \mid X_{-i}] = E[(X_i-\theta_i)h(X) \mid X_{-i}], \]

and then, taking the expectation,

\[ E[\partial_i h(X)] = E[(X_i-\theta_i)h(X)], \]

whence the conclusion.

For what follows, we first consider $X$ to be a $p$-dimensional random variable following a Gaussian law with mean $\theta$ and the identity as covariance matrix.

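Theorem 1.1. For an almost differentiable function $g : \mathbb{R}^p \to \mathbb{R}^p$ (meaning that all its components are almost differentiable), such that

\[ E\left[ \sum_{i=1}^p |\partial_i g_i(X)| \right] < \infty, \tag{A} \]

we have

\[ E\left[ \|X+g(X)-\theta\|^2 \right] = p + E\left[ \|g(X)\|^2 + 2\,\mathrm{div}(g(X)) \right]. \tag{1.1} \]

Proof. Using Lemma 1.2, we have

\[
\begin{aligned}
E\left[ (X_i+g_i(X)-\theta_i)^2 \right] &= E\left[ (X_i-\theta_i)^2 + 2(X_i-\theta_i)g_i(X) + g_i^2(X) \right] \\
&= 1 + 2E[\partial_i g_i(X)] + E\left[ g_i^2(X) \right]
\end{aligned}
\]

and we get the result by summing over $i$.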

This theorem gives an expression for the MSE of any estimator of the form $X+g(X)$ by decomposing it between the error of the usual estimator, $E[\|X-\theta\|^2] = p$, and a term that depends on the function $g$. The goal will be to make this term negative. With this in mind, we concentrate in (1.1) on functions $g : \mathbb{R}^p \to \mathbb{R}^p$ of the form

\[ g = \nabla \log f = \frac{\nabla f}{f} \]

with $f$ such that this is well defined. This leads to a modified version of Theorem 1.1.

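Theorem 1.2. Let $f : \mathbb{R}^p \to \mathbb{R}^+_0$ be an almost differentiable function such that $\nabla f$ is also almost differentiable,

\[ E\left[ \frac{1}{f(X)} \sum_{i=1}^p |\partial_i^2 f(X)| \right] < \infty \tag{A'} \]

and

\[ E\left[ \|\nabla \log f(X)\|^2 \right] < \infty. \]

Then

\[
E\left[ \|X+\nabla \log f(X)-\theta\|^2 \right]
= p + E\left[ \frac{2\Delta f(X)}{f(X)} - \frac{\|\nabla f(X)\|^2}{f^2(X)} \right]
= p + 4E\left[ \frac{\Delta \sqrt{f(X)}}{\sqrt{f(X)}} \right].
\]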

Remark. Note that condition A’ is simply the rewriting of condition A with our particular choice of g and the other conditions ensure that the final expression is finite.

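Proof. Using

\[ \mathrm{div}(\nabla \log f) = \frac{\Delta f}{f} - \frac{\|\nabla f\|^2}{f^2}, \]

we directly get from (1.1)

\[
\begin{aligned}
E\left[ \|X+\nabla \log f(X)-\theta\|^2 \right]
&= p + E\left[ \frac{\|\nabla f\|^2}{f^2} + 2\left( \frac{\Delta f}{f} - \frac{\|\nabla f\|^2}{f^2} \right) \right] \\
&= p + E\left[ \frac{2\Delta f}{f} - \frac{\|\nabla f\|^2}{f^2} \right].
\end{aligned}
\]

Finally, as

\[ \Delta \sqrt{f} = \mathrm{div}(\nabla \sqrt{f}) = \mathrm{div}\left( \frac{\nabla f}{2\sqrt{f}} \right) = \frac{\Delta f}{2\sqrt{f}} - \frac{\|\nabla f\|^2}{4 f^{3/2}}, \]

we can replace $\|\nabla f\|^2$ by $2f(\Delta f) - 4f^{3/2}(\Delta \sqrt{f})$ to get the desired result.

From this, it follows that if we can find a function $f$ satisfying the assumptions of the theorem and such that $\Delta \sqrt{f(x)} \le 0$, then we have an estimator, $X + \nabla \log f(X)$, that dominates the usual estimator $X$. Indeed, in that case,

\[ E\left[ \|X+\nabla \log f(X)-\theta\|^2 \right] \le p = E\left[ \|X-\theta\|^2 \right]. \]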

The famous James-Stein estimator is obtained by choosing $f(x) = \left( \frac{1}{\|x\|^2} \right)^b$ (see [15] for more details).

We have

\[ \nabla f(X) = -b \left( \|X\|^2 \right)^{-(b+1)} 2X \]

so

\[ \nabla \log f(X) = \frac{\nabla f(X)}{f(X)} = \frac{-2b}{\|X\|^2} X \]

and the estimator is $\left( 1 - \frac{2b}{\|X\|^2} \right) X$. As we have

\[ \Delta \sqrt{f(X)} = \frac{-b(p-2-b)}{\|X\|^{b+2}}, \]

this estimator dominates $X$ for $0 \le b \le p-2$ (when $p > 2$). Its risk is equal to $p - 4E\left[ \frac{b(p-2-b)}{\|X\|^2} \right]$. The improvement is therefore maximal for $b = \frac{p-2}{2}$, whence the following definition.

Definition 1.2. The usual James-Stein estimator for $X \sim N_p(\theta, I)$ is

\[ \left( 1 - \frac{p-2}{\|X\|^2} \right) X. \tag{1.2} \]

Its risk is equal to $p - (p-2)^2 E\left[ \frac{1}{\|X\|^2} \right]$.
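The closed form of the Laplacian of $\sqrt{f}$ that leads to this estimator, namely $-b(p-2-b)/\|x\|^{b+2}$ for $\sqrt{f(x)} = \|x\|^{-b}$, can be checked numerically. The following is a minimal sketch using central finite differences; the dimension, the exponent b, the evaluation point and the step size are arbitrary illustrative choices, not values from the thesis.

```python
import math


def sqrt_f(x, b):
    """sqrt(f)(x) = ||x||^(-b) for f(x) = (1 / ||x||^2)^b."""
    return sum(v * v for v in x) ** (-b / 2.0)


def laplacian_fd(func, x, h=1e-3):
    """Central finite-difference approximation of the Laplacian of func at x."""
    total = 0.0
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += h
        down[i] -= h
        total += (func(up) - 2.0 * func(x) + func(down)) / (h * h)
    return total


p, b = 5, 1.5                      # illustrative values with 0 <= b <= p - 2
x = [1.0, 0.5, -0.3, 2.0, 0.7]     # arbitrary evaluation point in R^5
norm = math.sqrt(sum(v * v for v in x))

numeric = laplacian_fd(lambda y: sqrt_f(y, b), x)
closed_form = -b * (p - 2 - b) / norm ** (b + 2)
print(f"finite differences: {numeric:.6f}, closed form: {closed_form:.6f}")
```

The two values should coincide up to discretization error, and both are negative for this choice of b, in line with the domination condition.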

It is interesting to note that the degree of the improvement depends on the value of $\|X\|^2$: the closer $X$ is to 0, the bigger the improvement will be. On the other hand, if $\|X\|$ is large, the James-Stein estimator will be very close to the usual estimator $X$. This means that the estimator is much more useful when $X \sim N_p(\theta, I)$ with $\|\theta\|$ small. It may therefore be useful to modify the estimator by centering the data first. This will be studied through simulations in Chapter 4.
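The dependence of the improvement on $\|\theta\|$ can be illustrated with a small Monte Carlo experiment. This is a sketch under assumptions that are not taken from the thesis (p = 10, two arbitrary values of θ, 20,000 replications, a fixed seed); it is independent of the simulations of Chapter 4. For θ = 0, $\|X\|^2 \sim \chi^2_p$ and $E[1/\chi^2_p] = 1/(p-2)$, so the James-Stein risk should be close to $p - (p-2) = 2$.

```python
import random


def simulate_risks(theta, n_rep=20_000, seed=1):
    """Monte Carlo risks of X and of the James-Stein estimator (1.2), X ~ N_p(theta, I)."""
    rng = random.Random(seed)
    p = len(theta)
    loss_x = 0.0
    loss_js = 0.0
    for _ in range(n_rep):
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        shrink = 1.0 - (p - 2) / sum(v * v for v in x)
        # Both losses use the same draw, so the comparison is paired.
        loss_x += sum((v - t) ** 2 for v, t in zip(x, theta))
        loss_js += sum((shrink * v - t) ** 2 for v, t in zip(x, theta))
    return loss_x / n_rep, loss_js / n_rep


p = 10
risk_x_0, risk_js_0 = simulate_risks([0.0] * p)       # theta at the origin
risk_x_far, risk_js_far = simulate_risks([10.0] * p)  # theta far from the origin

print(f"theta = 0      : risk(X) ~ {risk_x_0:.2f}, risk(JS) ~ {risk_js_0:.2f}")
print(f"theta = 10 * 1 : risk(X) ~ {risk_x_far:.2f}, risk(JS) ~ {risk_js_far:.2f}")
```

The gain is large near the origin and nearly vanishes when θ is far from it, although the James-Stein risk stays below that of X in both cases.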

We will now work by steps to generalize this result to any covariance matrix.

Let’s first look at the case where the covariance matrix is a multiple of the identity, i.e. X ∼ N_p(θ, σ²I). This is the case if we consider a sample of p independent normally distributed variables as a vector of size p.

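If $\sigma^2$ is known, looking at $\frac{X}{\sigma}$ and applying previous results, the James-Stein estimator for $X$ takes the form

\[ \left( 1 - \frac{(p-2)\sigma^2}{\|X\|^2} \right) X \tag{1.3} \]

and its risk is

\[ \sigma^2 \left( p - (p-2)^2 E\left[ \frac{\sigma^2}{\|X\|^2} \right] \right), \]

the expectation being that of $1/\|X/\sigma\|^2$ for the standardized variable $X/\sigma \sim N_p(\theta/\sigma, I)$.

If $\sigma^2$ is unknown, it needs to be estimated and we assume we have at our disposal a variable $s \sim \sigma^2\chi^2_n$, independent of $X$. This would typically be obtained through a sample of $n+1$ independent normally distributed variables $v_i$ ($v_i \sim N(\mu, \sigma^2)$), by taking $s = \sum_{i=1}^{n+1} (v_i - \bar{v})^2$.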

Remark. The notation $s$ is used for readability purposes, but note that it does not denote the standard deviation. In our example, it denotes $n$ times the sample variance.

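Setting

\[ Y = \frac{X}{\sigma}, \qquad \eta = \frac{\theta}{\sigma}, \qquad s^* = \frac{s}{\sigma^2} \tag{1.4} \]

and looking at estimators of the form $\left( 1 - \frac{c(p-2)s}{\|X\|^2} \right) X$, with $c$ a constant to be determined, we get

\[
\begin{aligned}
E\left[ \left\| X - \frac{c(p-2)s}{\|X\|^2} X - \theta \right\|^2 \right]
&= \sigma^2 E\left[ \left\| Y - \frac{c(p-2)s^*}{\|Y\|^2} Y - \eta \right\|^2 \right] \\
&= \sigma^2 E\left[ \left( (Y-\eta) - \frac{c(p-2)s^*}{\|Y\|^2} Y \right)' \left( (Y-\eta) - \frac{c(p-2)s^*}{\|Y\|^2} Y \right) \right] \\
&= \sigma^2 E\left[ \|Y-\eta\|^2 - 2c(p-2)s^* \frac{(Y-\eta)'Y}{\|Y\|^2} + c^2(p-2)^2 s^{*2} \frac{1}{\|Y\|^2} \right] \\
&= \sigma^2 \left( p - 2c(p-2)E[s^*]\,E\left[ \frac{(Y-\eta)'Y}{\|Y\|^2} \right] + c^2(p-2)^2 E\left[ s^{*2} \right] E\left[ \frac{1}{\|Y\|^2} \right] \right) \\
&= \sigma^2 \left( p - 2c(p-2)\,n\,E\left[ \frac{(Y-\eta)'Y}{\|Y\|^2} \right] + c^2(p-2)^2\, n(n+2)\, E\left[ \frac{1}{\|Y\|^2} \right] \right)
\end{aligned}
\]

using the independence of $Y$ and $s^*$ and the expression for the moments of a $\chi^2_n$ law.

The middle term can easily be computed by integration by parts, but, because $Y \sim N_p(\eta, I)$, we can also write

\[
\begin{aligned}
E\left[ \left\| Y - \frac{p-2}{\|Y\|^2} Y - \eta \right\|^2 \right]
&= E\left[ \|Y-\eta\|^2 - 2(p-2)\frac{(Y-\eta)'Y}{\|Y\|^2} + (p-2)^2 \frac{1}{\|Y\|^2} \right] \\
&= p + (p-2)^2 E\left[ \frac{1}{\|Y\|^2} \right] - 2(p-2) E\left[ \frac{(Y-\eta)'Y}{\|Y\|^2} \right]
\end{aligned}
\]

and, as we know that the left-hand side is also equal to $p - (p-2)^2 E\left[ \frac{1}{\|Y\|^2} \right]$, we find that

\[ E\left[ \frac{(Y-\eta)'Y}{\|Y\|^2} \right] = (p-2) E\left[ \frac{1}{\|Y\|^2} \right]. \]

Wrapping up, this gives us

\[ E\left[ \left\| X - \frac{c(p-2)s}{\|X\|^2} X - \theta \right\|^2 \right] = \sigma^2 \left( p - n(p-2)^2 \left( 2c - c^2(n+2) \right) E\left[ \frac{1}{\|Y\|^2} \right] \right). \]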

Since $c$ is arbitrary, we can choose it so as to maximize $2c - c^2(n+2)$, which minimizes the risk; this is done by taking $c = \frac{1}{n+2}$.

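Definition 1.3. The James-Stein estimator for $X \sim N_p(\theta, \sigma^2 I)$, with $\sigma^2$ unknown and estimated through $s \sim \sigma^2\chi^2_n$, is

\[ \left( 1 - \frac{(p-2)s}{(n+2)\|X\|^2} \right) X. \tag{1.5} \]

Its risk is equal to $\sigma^2 \left( p - \frac{n}{n+2}(p-2)^2 E\left[ \frac{\sigma^2}{\|X\|^2} \right] \right)$, where $\frac{\sigma^2}{\|X\|^2} = \frac{1}{\|Y\|^2}$ with $Y = X/\sigma$ as in (1.4).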
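The effect of the constant $c$ can be illustrated by simulation. The sketch below is hedged: p = 10, n = 5, σ = 2, θ = 0, the number of replications and the seed are all arbitrary assumptions made for the example. With θ = 0 the risk formula derived above gives $\sigma^2\bigl(p - n(p-2)(2c - c^2(n+2))\bigr)$, that is about 17.1 for $c = 1/(n+2)$ and 20.8 for $c = 1/n$, against 40 for $X$ itself.

```python
import random


def risk_unknown_variance(p, n, sigma, c, n_rep=20_000, seed=2):
    """Monte Carlo risks of X and of (1 - c (p-2) s / ||X||^2) X for theta = 0."""
    rng = random.Random(seed)
    loss_x = 0.0
    loss_js = 0.0
    for _ in range(n_rep):
        x = [rng.gauss(0.0, sigma) for _ in range(p)]           # X ~ N_p(0, sigma^2 I)
        s = sum(rng.gauss(0.0, sigma) ** 2 for _ in range(n))   # s ~ sigma^2 chi^2_n
        sq_norm = sum(v * v for v in x)
        shrink = 1.0 - c * (p - 2) * s / sq_norm
        loss_x += sq_norm                    # ||X - 0||^2
        loss_js += shrink * shrink * sq_norm  # ||shrink * X - 0||^2
    return loss_x / n_rep, loss_js / n_rep


p, n, sigma = 10, 5, 2.0
risk_x, risk_opt = risk_unknown_variance(p, n, sigma, c=1.0 / (n + 2))
_, risk_plugin = risk_unknown_variance(p, n, sigma, c=1.0 / n)

print(f"risk(X) ~ {risk_x:.2f}")
print(f"c = 1/(n+2): risk ~ {risk_opt:.2f}")
print(f"c = 1/n    : risk ~ {risk_plugin:.2f}")
```

Because both calls use the same seed, the two choices of $c$ are compared on identical draws.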

Note how estimating $\sigma^2$ only caused a loss of precision by a proportion of $\frac{2}{n+2}$ compared to the case of $\sigma^2$ known.

Taking $c = \frac{1}{n}$ is another possibility. While less optimal, it yields the estimator

\[ \left( 1 - \frac{(p-2)s}{n\|X\|^2} \right) X, \tag{1.6} \]

where $\frac{s}{n}$ appears. As $\frac{s}{n}$ is such that $E\left[ \frac{s}{n} \right] = \sigma^2$, this is simply the estimator (1.3) where $\sigma^2$ has been replaced by its estimate.

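The same kind of transformation as in (1.4) can be done if the covariance is of the form $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_p^2)$. If $X \sim N_p(\theta, \Sigma)$ and $S = \mathrm{diag}(s_1, \dots, s_p)$ with $s_i \sim \sigma_i^2\chi^2_{n_i}$ for $i \in \{1, \dots, p\}$, denoting $\sqrt{\Sigma} = \mathrm{diag}(\sigma_1, \dots, \sigma_p)$, we would define

\[ Y = \sqrt{\Sigma}^{-1}X, \qquad \eta = \sqrt{\Sigma}^{-1}\theta, \qquad s^* = \Sigma^{-1}S. \]

Each component of our James-Stein estimator would then be

\[ \left( 1 - \frac{(p-2)s_i}{n_i\|X\|^2} \right) X_i. \]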

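As an example, imagine $p$ independent samples $V_{i1}, \dots, V_{i(n_i+1)}$ are available, where $V_{ij} \sim N(\mu_i, \sigma_i^2)$, for $i \in \{1, \dots, p\}$ and $j \in \{1, \dots, n_i+1\}$. Compute $\bar{V}_i = \frac{1}{n_i+1} \sum_{j=1}^{n_i+1} V_{ij}$ and $s_i^2 = \frac{1}{n_i} \sum_{j=1}^{n_i+1} (V_{ij} - \bar{V}_i)^2$, the sample means and variances of each sample. As we are in the normal case, $\bar{V}_i$ and $s_i^2$ are independent for all $i$ and $s_i^2 \sim \frac{\sigma_i^2}{n_i}\chi^2_{n_i}$. We now consider $X = (\bar{V}_1, \dots, \bar{V}_p)'$ and $S = \mathrm{diag}(s_1^2, \dots, s_p^2)$. Clearly, $X \sim N_p(\theta, \Sigma)$ with $\theta = (\mu_1, \dots, \mu_p)'$ and $\Sigma = \mathrm{diag}\left( \frac{\sigma_1^2}{n_1}, \dots, \frac{\sigma_p^2}{n_p} \right)$. So if we take the estimator $\hat{X}^{JS}$ with

\[ \hat{X}^{JS}_i = \left( 1 - \frac{(p-2)s_i^2}{n_i\|X\|^2} \right) \bar{V}_i, \qquad i \in \{1, \dots, p\}, \]

we know that

\[ E\left[ \|\hat{X}^{JS} - \theta\|^2 \right] \le E\left[ \|X - \theta\|^2 \right]. \]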

For more applied examples, the article [4], "Stein’s Paradox in Statistics" by Efron and Morris is a great read.

Finally, we consider the general case $X \sim N_p(\theta, \Sigma)$ with $\Sigma$ an unknown positive semi-definite matrix. Here we assume having at our disposal $S$, a Wishart matrix with $n$ degrees of freedom ($S \sim W_p(n, \Sigma)$). Similarly as before, this means for example that a sample of size $n$, $V_1, \dots, V_n$ ($V_i \sim N_p(0, \Sigma)$), is available and $S = \sum_{i=1}^n V_iV_i'$. The loss function we use in this case takes the slightly different form of

\[ E\left[ (\hat\theta-\theta)'\Sigma^{-1}(\hat\theta-\theta) \right]. \]

We assume $S$ is invertible and consider estimators of the form

\[ \left( 1 - \frac{c(p-2)}{X'S^{-1}X} \right) X. \tag{1.7} \]

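We have

\[
\begin{aligned}
&E_{\theta,\Sigma}\left[ \left( \left( 1 - \frac{c(p-2)}{X'S^{-1}X} \right) X - \theta \right)' \Sigma^{-1} \left( \left( 1 - \frac{c(p-2)}{X'S^{-1}X} \right) X - \theta \right) \right] \\
&\qquad = E_{\theta^*,I}\left[ \left( \left( 1 - \frac{c(p-2)}{Y'S^{*-1}Y} \right) Y - \theta^* \right)' \left( \left( 1 - \frac{c(p-2)}{Y'S^{*-1}Y} \right) Y - \theta^* \right) \right],
\end{aligned}
\]

by applying the transformation $X \to PDX = Y$, where $D$ is a matrix such that $D\Sigma D' = I$, and $P$ is an orthogonal matrix with its first row proportional to $D\theta$, so that $PDX = Y \sim N_p(\theta^*, I)$, with $\theta^* = (\sqrt{\theta'\Sigma^{-1}\theta}, 0, \dots, 0)'$ and $S^* = PDSD'P'$ (so that $S^{*-1} = P'^{-1}D'^{-1}S^{-1}D^{-1}P^{-1}$) following a Wishart $W_p(n, I)$.

As the conditional distribution of $Y'S^{*-1}Y$ given $Y$ is that of $\frac{Y'Y}{Q}$ where $Q \sim \chi^2_{n-p+1}$ (see the appendix for more details), we find ourselves in the same situation as for the case of $\sigma^2$ unknown. Thus, the optimal choice for $c$ is $\frac{1}{n-p+3}$ and the James-Stein estimator is

\[ \left( 1 - \frac{p-2}{(n-p+3)\,X'S^{-1}X} \right) X, \tag{1.8} \]

its risk being

\[ p - \frac{n-p+1}{n-p+3}(p-2)^2 E\left[ \frac{1}{X'\Sigma^{-1}X} \right], \tag{1.9} \]

where $X'\Sigma^{-1}X = \|Y\|^2$ is the squared norm of the transformed variable.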

First generalization: elliptically symmetric laws

The next natural step is to see if similar estimators exist for probability laws other than the Gaussian. We will imagine having at our disposal $n$ random $p$-dimensional variables, $X$ and $V_1, \dots, V_{n-1}$, whose joint density is of the form

\[ f\left( (X-\theta)'\Sigma^{-1}(X-\theta) + \sum_{j=1}^{n-1} V_j'\Sigma^{-1}V_j \right), \tag{2.1} \]

with $f : \mathbb{R} \to \mathbb{R}^+$ a Lebesgue integrable function.

Both $\theta$ and $\Sigma$ are unknown. The $V_j$ will be used to estimate $\Sigma$ through

\[ S = \sum_{j=1}^{n-1} V_jV_j', \]

so that we can look at estimators of $\theta$ of the form $\delta(X, S) = X + g(X, S)$ under the loss

\[ E\left[ (\delta(X, S)-\theta)'\Sigma^{-1}(\delta(X, S)-\theta) \right]. \]

The matrix $S$ is again assumed to be invertible.

All the distributions defined this way are elliptically symmetric. Taking $f(R^2)$ proportional to $\exp(-\frac{1}{2}R^2)$ yields the multivariate normal, while taking it proportional to $\left( 1 + \frac{1}{\nu}R^2 \right)^{-(\nu+p)/2}$ gives the multivariate t-distribution.

As before, we will look at

\[
\begin{aligned}
&E\left[ (X+g(X, S)-\theta)'\Sigma^{-1}(X+g(X, S)-\theta) \right] \\
&\qquad = E\left[ (X-\theta)'\Sigma^{-1}(X-\theta) \right] + E\left[ 2g(X, S)'\Sigma^{-1}(X-\theta) \right] + E\left[ g(X, S)'\Sigma^{-1}g(X, S) \right],
\end{aligned}
\]

and make the sum of the last two terms negative for the estimator to dominate $X$.

The result depends on two lemmas allowing us to express the two terms we are interested in using the expectation $E^*$, which will denote the expectation with respect to the distribution

\[ C^{-1} F\left( (X-\theta)'\Sigma^{-1}(X-\theta) + \sum_{j=1}^{n-1} V_j'\Sigma^{-1}V_j \right), \]

where $F$ is defined by

\[ F(t) = \frac{1}{2} \int_t^{+\infty} f(s)\,ds \tag{2.2} \]

and $C^{-1}$ is a normalizing constant, i.e.

\[ C = \int_{\mathbb{R}^p \times \dots \times \mathbb{R}^p} F\left( (x-\theta)'\Sigma^{-1}(x-\theta) + \sum_{j=1}^{n-1} v_j'\Sigma^{-1}v_j \right) dx\,dv_1 \dots dv_{n-1}. \]

This is in contrast to the standard expectation $E$, with respect to the $f$ in (2.1). The factor $\frac{1}{2}$ in (2.2) is the one produced by the change of variable $r = R^2$ at the end of the proof of Lemma 2.1.

Before jumping into the results, we can try to get some intuition as to where this $F$ comes from. In the multinormal case, i.e. $f(R^2) \propto \exp(-\frac{1}{2}R^2)$, we have $F = f$ and therefore $E^* = E$. Considering dimension 1 for simplicity, the result in Lemma 1.1 in Chapter 1 is based on the fact that $\varphi'(x) = -x\varphi(x)$ for the normal density function. For variables with density of the form $f(x^2)$, this can be generalized by writing

\[ (F(x^2))' = -x f(x^2). \]
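The claim that $F = f$ in the multinormal case can be checked numerically. The sketch below assumes the normalization $F(t) = \frac{1}{2}\int_t^{+\infty} f(s)\,ds$ (the one under which $F = f$ holds for the Gaussian generator) and takes the unnormalized generator $f(t) = e^{-t/2}$; the truncation point of the integral and the number of trapezoid steps are arbitrary choices made for the illustration.

```python
import math


def f_normal(t):
    """Unnormalized Gaussian generator f(t) = exp(-t / 2)."""
    return math.exp(-t / 2.0)


def big_f(f, t, upper=80.0, steps=200_000):
    """F(t) = (1/2) * integral of f over [t, upper], by the trapezoidal rule.

    The infinite upper limit is truncated at `upper`, an assumption that is
    harmless only when f decays quickly, as it does here.
    """
    h = (upper - t) / steps
    total = 0.5 * (f(t) + f(upper))
    for k in range(1, steps):
        total += f(t + k * h)
    return 0.5 * h * total


for t in (0.0, 1.0, 4.0):
    print(f"t = {t}: F(t) ~ {big_f(f_normal, t):.6f}, f(t) = {f_normal(t):.6f}")
```

For a heavier-tailed generator such as that of the multivariate t-distribution, $F$ is no longer proportional to $f$, which is why $E^*$ and $E$ differ in general.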

The following lemma is therefore simply a generalization of Stein’s Lemma 1.2 in Chapter 1, coinciding with it in the multinormal case.

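Lemma 2.1. If $g(x, \cdot)$ is a differentiable function, then

\[ E\left[ g(X, S)'\Sigma^{-1}(X-\theta) \right] = C E^*[\mathrm{div}_X(g(X, S))]. \tag{2.3} \]

Lemma 2.2. Let $T(X, S)$ be a function from $\mathbb{R}^p \times \mathbb{R}^{p \times p}$ to $\mathbb{R}^{p \times p}$. Then we have

\[ E\left[ \mathrm{tr}\left( T(X, S)\Sigma^{-1} \right) \right] = 2C E^*\left[ D^*_{1/2}(T(X, S)) \right] + C(n-p-2) E^*\left[ \mathrm{tr}\left( S^{-1}T(X, S) \right) \right] \]

with

\[ D^*_{1/2}(T(X, S)) = \sum_{i=1}^p \frac{\partial T_{ii}(X, S)}{\partial S_{ii}} + \frac{1}{2} \sum_{i \ne j} \frac{\partial T_{ij}(X, S)}{\partial S_{ij}}. \]

This $D^*_{1/2}$ can be seen as a generalization of divergence for matrices, with the $\frac{1}{2}$ being a symptom of the symmetry of $S$.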

Remark. The notation $\mathrm{tr}(A)$ is used for the trace of a $p \times p$ matrix,

\[ \mathrm{tr}(A) = \sum_{i=1}^p A_{ii}, \]

and $\frac{\partial}{\partial A_{ij}}$ denotes the derivative with respect to the component $(i, j)$ of the matrix $A$ (or vector in some cases).

Before proving these lemmas, we will see how they can be used, as it is quite direct.

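Theorem 2.1. If $E[X'X] < \infty$ and $E[g'(X, S)g(X, S)] < \infty$, then

\[
\begin{aligned}
&E\left[ (X+g(X, S)-\theta)'\Sigma^{-1}(X+g(X, S)-\theta) \right] - E\left[ (X-\theta)'\Sigma^{-1}(X-\theta) \right] \\
&\qquad = E\left[ 2g(X, S)'\Sigma^{-1}(X-\theta) \right] + E\left[ g(X, S)'\Sigma^{-1}g(X, S) \right] \\
&\qquad = C E^*\left[ 2\,\mathrm{div}_X(g(X, S)) + (n-p-2)\,g'(X, S)S^{-1}g(X, S) + 2D^*_{1/2}(g(X, S)g'(X, S)) \right].
\end{aligned}
\tag{2.4}
\]

Proof. Using Lemma 2.1 on the first term of the difference gives

\[ E\left[ 2g(X, S)'\Sigma^{-1}(X-\theta) \right] = C E^*[2\,\mathrm{div}_X(g(X, S))]. \]

Lemma 2.2, with $T(X, S) = g(X, S)g'(X, S)$, on the second gives

\[
\begin{aligned}
E\left[ g(X, S)'\Sigma^{-1}g(X, S) \right]
&= E\left[ \mathrm{tr}\left( g(X, S)'\Sigma^{-1}g(X, S) \right) \right] \\
&= E\left[ \mathrm{tr}\left( \Sigma^{-1}g(X, S)g(X, S)' \right) \right] \\
&= 2C E^*\left[ D^*_{1/2}(g(X, S)g(X, S)') \right] + C(n-p-2) E^*\left[ \mathrm{tr}\left( S^{-1}g(X, S)g(X, S)' \right) \right].
\end{aligned}
\]

The result follows immediately.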

In the light of equation (2.4), we are in a similar position as after Theorem 1.1 in Chapter 1. Following the same heuristic, our objective will be to identify functions $g : \mathbb{R}^p \times \mathbb{R}^{p \times p} \to \mathbb{R}^p$ such that (2.4) is negative, thereby yielding domination in terms of MSE. This will be done in Corollary 2.2, at the end of this chapter.

The proofs of the two lemmas rely on an integration-by-slices result, as well as a corollary involving Stokes’ theorem, that can be derived from [5] (Theorem 3.2.12). This divides $\mathbb{R}^p$ into ellipsoids and allows the same kind of "integration by parts" generalization as in Stein’s Lemma 1.1.

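Lemma 2.3. For any $r \in \mathbb{R}$ and any continuously differentiable function $\varphi$ defined on $\mathbb{R}^p$, let $[\varphi = r]$ be the submanifold in $\mathbb{R}^p$ associated with $\varphi$. Then, for any Lebesgue integrable function $f$, we have

\[ \int_{\mathbb{R}^p} f(x)\,dx = \int_{\{r \in \mathbb{R} \,\mid\, [\varphi=r] \ne \emptyset\}} \int_{[\varphi=r]} \frac{f(x)}{\|\nabla\varphi(x)\|}\,d\sigma_r\,dr, \]

where $\sigma_r$ is the area measure defined on $[\varphi = r]$.

Corollary 2.1. If $g$ is a function defined on $\mathbb{R}^p$ such that $\nabla\varphi \cdot g$ is integrable, then

\[ \int_{\mathbb{R}^p} \nabla\varphi(x) \cdot g(x)\,dx = \int_{\{r \in \mathbb{R} \,\mid\, [\varphi=r] \ne \emptyset\}} \int_{B_r} \mathrm{div}(g(x))\,dx\,dr, \]

where $B_r$ is the set with boundary $[\varphi = r]$ corresponding, for any $x \in [\varphi = r]$, to the outward normal vector $\nabla\varphi(x)$.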

We now prove the two lemmas.

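Proof of Lemma 2.1. We want to compute

\[
E\left[ g(X, S)'\Sigma^{-1}(X-\theta) \right]
= \int_{\mathbb{R}^p \times \dots \times \mathbb{R}^p} \int_{\mathbb{R}^p} g(x, s)'\Sigma^{-1}(x-\theta)\, f\left( (x-\theta)'\Sigma^{-1}(x-\theta) + \sum_{j=1}^{n-1} v_j'\Sigma^{-1}v_j \right) dx\,dv_1 \dots dv_{n-1}.
\]

We define $\varphi(x) = \sqrt{(x-\theta)'\Sigma^{-1}(x-\theta)}$ to use Lemma 2.3 and Corollary 2.1 on the inner integral. We have

\[ \nabla\varphi(x) = \frac{\Sigma^{-1}(x-\theta)}{\sqrt{(x-\theta)'\Sigma^{-1}(x-\theta)}} \]

and therefore, setting $R = \sqrt{(x-\theta)'\Sigma^{-1}(x-\theta)}$ and writing $\kappa = \sum_{j=1}^{n-1} v_j'\Sigma^{-1}v_j$ for brevity,

\[
\begin{aligned}
&\int_{\mathbb{R}^p} g(x, s)'\Sigma^{-1}(x-\theta)\, f\left( (x-\theta)'\Sigma^{-1}(x-\theta) + \kappa \right) dx \\
&\qquad = \int_0^{+\infty} f\left( R^2 + \kappa \right) \int_{[\varphi=R]} \frac{g(x, s)'\Sigma^{-1}(x-\theta)}{\|\nabla\varphi(x)\|}\,d\sigma_R\,dR \\
&\qquad = \int_0^{+\infty} f\left( R^2 + \kappa \right) \int_{[\varphi=R]} g(x, s)' \sqrt{(x-\theta)'\Sigma^{-1}(x-\theta)}\, \frac{\nabla\varphi(x)}{\|\nabla\varphi(x)\|}\,d\sigma_R\,dR \\
&\qquad = \int_0^{+\infty} R\, f\left( R^2 + \kappa \right) \int_{[\varphi=R]} g(x, s)' \frac{\nabla\varphi(x)}{\|\nabla\varphi(x)\|}\,d\sigma_R\,dR \\
&\qquad = \int_{\mathbb{R}^p} R\, f\left( R^2 + \kappa \right) \nabla\varphi(x) \cdot g(x, s)\,dx \\
&\qquad\quad \text{(using Lemma 2.3 backwards so that we can now use Corollary 2.1)} \\
&\qquad = \int_0^{+\infty} R\, f\left( R^2 + \kappa \right) \int_{[\varphi \le R]} \mathrm{div}_x(g(x, s))\,dx\,dR \\
&\qquad = \int_{\mathbb{R}^p} \mathrm{div}_x(g(x, s)) \int_{\sqrt{(x-\theta)'\Sigma^{-1}(x-\theta)}}^{+\infty} R\, f\left( R^2 + \kappa \right) dR\,dx \\
&\qquad = \int_{\mathbb{R}^p} \mathrm{div}_x(g(x, s))\, \frac{1}{2} \int_{(x-\theta)'\Sigma^{-1}(x-\theta)}^{+\infty} f\left( r + \kappa \right) dr\,dx \\
&\qquad = \int_{\mathbb{R}^p} \mathrm{div}_x(g(x, s))\, F\left( (x-\theta)'\Sigma^{-1}(x-\theta) + \kappa \right) dx.
\end{aligned}
\]

Replacing this into the first expression gives the desired result in (2.3).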

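Proof of Lemma 2.2. First develop the term in the expectation:

\[
\begin{aligned}
\mathrm{tr}\left( T(X, S)\Sigma^{-1} \right)
&= \mathrm{tr}\left( T(X, S)\Sigma^{-1}SS^{-1} \right) \\
&= \mathrm{tr}\left( T(X, S)\Sigma^{-1} \sum_{i=1}^{n-1} V_iV_i'S^{-1} \right) \\
&= \sum_{i=1}^{n-1} \mathrm{tr}\left( V_i'S^{-1}T(X, S)\Sigma^{-1}V_i \right) \\
&= \sum_{i=1}^{n-1} V_i'S^{-1}T(X, S)\Sigma^{-1}V_i.
\end{aligned}
\]

Then, mimicking the proof of Lemma 2.1 done just before, with $V_i$ instead of $X-\theta$ and $g(V_i, S) = T(X, S)S^{-1}V_i$, we get (with $S^{lm}$ denoting $(S^{-1})_{lm}$)

\[
\begin{aligned}
E\left[ \mathrm{tr}\left( T(X, S)\Sigma^{-1} \right) \right]
&= C \sum_{i=1}^{n-1} E^*\left[ \mathrm{div}_{V_i}\left( T(X, S)S^{-1}V_i \right) \right] \\
&= C \sum_{i=1}^{n-1} E^*\left[ \sum_{j=1}^p \frac{\partial}{\partial V_{ij}} \left( \sum_{m=1}^p \sum_{l=1}^p T_{jl}(X, S)\,S^{lm}\,V_{im} \right) \right] \\
&= C E^*[A_1 + A_2 + A_3],
\end{aligned}
\tag{2.5}
\]

where

\[ A_1 = \sum_{i=1}^{n-1} \sum_{j=1}^p \sum_{m=1}^p \sum_{l=1}^p \left( \frac{\partial}{\partial V_{ij}} V_{im} \right) T_{jl}(X, S)\,S^{lm}, \]

\[ A_2 = \sum_{i=1}^{n-1} \sum_{j=1}^p \sum_{m=1}^p \sum_{l=1}^p V_{im} \left( \frac{\partial}{\partial V_{ij}} T_{jl}(X, S) \right) S^{lm}, \]

and

\[ A_3 = \sum_{i=1}^{n-1} \sum_{j=1}^p \sum_{m=1}^p \sum_{l=1}^p V_{im}\,T_{jl}(X, S) \left( \frac{\partial}{\partial V_{ij}} S^{lm} \right). \]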

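We then compute $A_1$, $A_2$ and $A_3$ separately. First,

\[
\begin{aligned}
A_1 &= \sum_{i=1}^{n-1} \sum_{j=1}^p \sum_{m=1}^p \sum_{l=1}^p \delta_{jm}\,T_{jl}(X, S)\,S^{lm} \\
&= (n-1) \sum_{j=1}^p \sum_{l=1}^p T_{jl}(X, S)\,S^{lj} \\
&= (n-1)\,\mathrm{tr}\left( T(X, S)S^{-1} \right).
\end{aligned}
\]

For $A_2$, because $S$ is symmetric, we first get

\[ A_2 = \sum_{i=1}^{n-1} \sum_{j=1}^p \sum_{m=1}^p \sum_{l=1}^p V_{im}\,S^{lm} \left( \sum_{q \le r} \frac{\partial T_{jl}(X, S)}{\partial S_{qr}} \frac{\partial S_{qr}}{\partial V_{ij}} \right). \]

Using the definition of $S = \sum_{i=1}^{n-1} V_iV_i'$, we have

\[ \frac{\partial S_{qr}}{\partial V_{ij}} = \frac{\partial}{\partial V_{ij}}(V_{iq}V_{ir}) = V_{iq}\delta_{jr} + V_{ir}\delta_{jq} \]

and then

\[ \sum_{i=1}^{n-1} (V_{iq}\delta_{jr} + V_{ir}\delta_{jq})V_{im} = S_{mq}\delta_{jr} + S_{mr}\delta_{jq}. \]

The following is immediate from the definition of the inverse, but we recall it because it will be used several times:

\[ \sum_{m=1}^p S^{am}S_{mb} = \delta_{ab}. \tag{2.6} \]

We now get

\[
\begin{aligned}
A_2 &= \sum_{j=1}^p \sum_{m=1}^p \sum_{l=1}^p S^{lm} \left( \sum_{q \le r} \frac{\partial T_{jl}(X, S)}{\partial S_{qr}} \left( S_{mq}\delta_{jr} + S_{mr}\delta_{jq} \right) \right) \\
&= \sum_{j=1}^p \sum_{l=1}^p \sum_{q \le r} \frac{\partial T_{jl}(X, S)}{\partial S_{qr}} \left( \delta_{lq}\delta_{jr} + \delta_{lr}\delta_{jq} \right) \qquad \text{(thanks to (2.6))} \\
&= \sum_{j=1}^p \sum_{l=1}^p \sum_{q \le r} \left( \frac{\partial T_{jl}(X, S)}{\partial S_{lj}}\,\delta_{lq}\delta_{jr} + \frac{\partial T_{jl}(X, S)}{\partial S_{jl}}\,\delta_{lr}\delta_{jq} \right) \\
&= \sum_{j=1}^p \sum_{l=1}^p \sum_{q \le r} \frac{\partial T_{jl}(X, S)}{\partial S_{lj}} \left( \delta_{lq}\delta_{jr} + \delta_{lr}\delta_{jq} \right).
\end{aligned}
\]

Finally noting that

\[ \sum_{q \le r} (\delta_{lq}\delta_{jr} + \delta_{lr}\delta_{jq}) = \begin{cases} 2 & \text{if } j = l, \\ 1 & \text{if } j \ne l, \end{cases} \]

we get

\[
A_2 = 2 \left( \sum_{j=1}^p \frac{\partial T_{jj}(X, S)}{\partial S_{jj}} + \frac{1}{2} \sum_{j \ne l} \frac{\partial T_{jl}(X, S)}{\partial S_{jl}} \right) = 2D^*_{1/2}(T(X, S)).
\]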

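The last term left is $A_3$. Starting the same way as for $A_2$, we can get

\[ A_3 = \sum_{j=1}^p \sum_{l=1}^p T_{jl}(X, S) \sum_{m=1}^p \sum_{q \le r} \frac{\partial S^{lm}}{\partial S_{qr}} \left( S_{mq}\delta_{jr} + S_{mr}\delta_{jq} \right). \tag{2.7} \]

To compute $\frac{\partial S^{lm}}{\partial S_{qr}}$, we use (2.6) again and take the derivative on both sides to find

\[ \sum_{j=1}^p \left( \frac{\partial S^{lj}}{\partial S_{qr}} S_{jn} + S^{lj} \frac{\partial S_{jn}}{\partial S_{qr}} \right) = 0. \]

As $S$ is symmetric, if $q \ne r$, $\frac{\partial S_{jn}}{\partial S_{qr}} = \delta_{jq}\delta_{nr} + \delta_{jr}\delta_{nq}$, which leads to

\[
\sum_{j=1}^p \frac{\partial S^{lj}}{\partial S_{qr}} S_{jn} = -\sum_{j=1}^p S^{lj} \left( \delta_{jq}\delta_{nr} + \delta_{jr}\delta_{nq} \right) = -S^{lq}\delta_{nr} - S^{lr}\delta_{nq}.
\]

We now multiply this by $S^{nm}$ and sum over $n$ to get

\[ \sum_{n=1}^p \sum_{j=1}^p \frac{\partial S^{lj}}{\partial S_{qr}} S_{jn}S^{nm} = \sum_{n=1}^p \left( -S^{lq}\delta_{nr} - S^{lr}\delta_{nq} \right) S^{nm}. \]

Eliminating a sum thanks to (2.6) again, this gives us

\[ \sum_{j=1}^p \frac{\partial S^{lj}}{\partial S_{qr}}\,\delta_{jm} = \frac{\partial S^{lm}}{\partial S_{qr}} = -S^{lq}S^{rm} - S^{lr}S^{qm}. \]

The case $q = r$ is treated similarly, as $\frac{\partial S_{jn}}{\partial S_{qq}} = \delta_{jq}\delta_{nq}$, to get

\[
\frac{\partial S^{lm}}{\partial S_{qr}} =
\begin{cases}
-S^{lq}S^{rm} - S^{lr}S^{qm} & \text{if } q \ne r, \\
-S^{lq}S^{qm} & \text{if } q = r.
\end{cases}
\tag{2.8}
\]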
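Formula (2.8) can be checked numerically on a small example. The sketch below is an illustration only: it uses an arbitrary $2 \times 2$ symmetric matrix, indices starting at 0, and central finite differences in which, for $q \ne r$, the two symmetric entries $S_{qr}$ and $S_{rq}$ are perturbed together (the convention behind the two-term case). None of the function names come from the thesis.

```python
def inverse_2x2(a, b, c):
    """Inverse of the symmetric matrix [[a, b], [b, c]] as a nested list."""
    det = a * c - b * b
    return [[c / det, -b / det], [-b / det, a / det]]


def d_inverse_fd(a, b, c, q, r, h=1e-6):
    """Finite-difference derivative of S^{-1} with respect to S_qr (q <= r)."""
    def shifted(eps):
        aa, bb, cc = a, b, c
        if (q, r) == (0, 0):
            aa += eps
        elif (q, r) == (1, 1):
            cc += eps
        else:
            bb += eps  # moves S_01 and S_10 together, since S is symmetric
        return inverse_2x2(aa, bb, cc)

    plus, minus = shifted(h), shifted(-h)
    return [[(plus[l][m] - minus[l][m]) / (2 * h) for m in range(2)] for l in range(2)]


def d_inverse_formula(a, b, c, q, r):
    """Right-hand side of (2.8), entry by entry."""
    inv = inverse_2x2(a, b, c)
    out = [[0.0, 0.0], [0.0, 0.0]]
    for l in range(2):
        for m in range(2):
            if q == r:
                out[l][m] = -inv[l][q] * inv[q][m]
            else:
                out[l][m] = -inv[l][q] * inv[r][m] - inv[l][r] * inv[q][m]
    return out


a, b, c = 2.0, 0.5, 1.0  # arbitrary positive definite example
max_gap = 0.0
for q, r in ((0, 0), (0, 1), (1, 1)):
    fd = d_inverse_fd(a, b, c, q, r)
    exact = d_inverse_formula(a, b, c, q, r)
    gap = max(abs(fd[l][m] - exact[l][m]) for l in range(2) for m in range(2))
    max_gap = max(max_gap, gap)
print(f"largest discrepancy between (2.8) and finite differences: {max_gap:.2e}")
```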

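In (2.7), the second half becomes

\[
\begin{aligned}
&\sum_{m=1}^p \sum_{q \le r} \frac{\partial S^{lm}}{\partial S_{qr}} \left( S_{mq}\delta_{jr} + S_{mr}\delta_{jq} \right) \\
&\qquad = -\sum_{m=1}^p \left( \sum_{q = r} 2S_{mq}\delta_{jq}S^{lq}S^{qm} + \sum_{q < r} \left( S_{mq}\delta_{jr} + S_{mr}\delta_{jq} \right) \left( S^{lq}S^{rm} + S^{lr}S^{qm} \right) \right) \\
&\qquad = -\sum_{m=1}^p \left( 2S_{mj}S^{lj}S^{jm} + \sum_{q < r} \left( S_{mq}\delta_{jr} + S_{mr}\delta_{jq} \right) \left( S^{lq}S^{rm} + S^{lr}S^{qm} \right) \right).
\end{aligned}
\]

Placing this last expression into the one for $A_3$ in (2.7), after using once again that $S$ and $S^{-1}$ are symmetric and (2.6), we obtain

\[
\begin{aligned}
A_3 &= -\sum_{j=1}^p \sum_{l=1}^p T_{jl}(X, S) \left( 2S^{lj} + \sum_{q < r} \left( \delta_{jr}S^{lr} + \delta_{jq}S^{lq} \right) \right) \\
&= -\sum_{j=1}^p \sum_{l=1}^p T_{jl}(X, S)\,S^{lj} \left( 2 + \sum_{q < r} \left( \delta_{jr} + \delta_{jq} \right) \right) \\
&= -\sum_{j=1}^p \sum_{l=1}^p T_{jl}(X, S)\,S^{lj}\,(2 + (p-1)) \\
&= -(p+1)\,\mathrm{tr}\left( T(X, S)S^{-1} \right).
\end{aligned}
\]

Substituting the expressions found for $A_1$, $A_2$ and $A_3$ in (2.5), we get the announced result.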

As anticipated, we can now find estimators that dominate X.

Corollary 2.2. Let $r : \mathbb{R} \to \mathbb{R}$ be a nondecreasing positive function bounded by $\frac{2(p-2)}{n-p+2}$. If $E[X'X] < \infty$ and $E\left[ \frac{X'X}{(X'S^{-1}X)^2} \right] < \infty$, then the estimator

\[ \left( 1 - \frac{r(X'S^{-1}X)}{X'S^{-1}X} \right) X \tag{2.9} \]

dominates $X$ under mean square error.

Remark. The function $r$ replaces the constants in the estimators from Chapter 1 (the $c$ in (1.7) more specifically) and the bounds on $r$ are reminiscent of those on these constants.

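Proof. We apply Theorem 2.1 with $g(X, S) = -\frac{r(X'S^{-1}X)}{X'S^{-1}X} X$ and show that

\[ C E^*\left[ 2\,\mathrm{div}_X(g(X, S)) + (n-p-2)\,g(X, S)'S^{-1}g(X, S) + 2D^*_{1/2}(g(X, S)g(X, S)') \right] \le 0. \tag{2.10} \]

Throughout the proof, we write $t = X'S^{-1}X$, $r = r(t)$ and $r' = r'(t)$ to lighten the notation.

For the first term of (2.10), we need to compute

\[ \mathrm{div}_X(g(X, S)) = -\mathrm{div}_X\left( \frac{r}{t} X \right) = -\left( \mathrm{div}_X(X)\,\frac{r}{t} + X'\nabla_X\left( \frac{r}{t} \right) \right). \]

As $\mathrm{div}_X(X) = p$ and

\[ \nabla_X\left( \frac{r}{t} \right) = 2\,\frac{r'\,t - r}{t^2}\,S^{-1}X, \]

this gives the expression

\[ \mathrm{div}_X(g(X, S)) = -\left( \frac{(p-2)\,r}{t} + 2r' \right). \]

The second term in (2.10) is direct:

\[ g(X, S)'S^{-1}g(X, S) = \frac{r^2}{t}. \]

Finally, for the last term, we have

\[
D^*_{1/2}(g(X, S)g(X, S)') = D^*_{1/2}\left( \frac{r^2}{t^2}\,XX' \right)
= \sum_{i=1}^p \frac{\partial}{\partial S_{ii}}\left( \frac{r^2}{t^2} \right) X_i^2 + \frac{1}{2} \sum_{i \ne j} \frac{\partial}{\partial S_{ij}}\left( \frac{r^2}{t^2} \right) X_iX_j.
\]

As

\[ \frac{\partial}{\partial S_{ij}}\left( \frac{r^2}{t^2} \right) = \frac{2t^2\,r\,r' - 2t\,r^2}{t^4}\,\frac{\partial t}{\partial S_{ij}} \]

and, using (2.8),

\[ \frac{\partial t}{\partial S_{ij}} = \sum_{l,m} X_l\,\frac{\partial S^{lm}}{\partial S_{ij}}\,X_m = -(2-\delta_{ij})\,(X'S^{-1})_i\,(X'S^{-1})_j, \]

it follows that

\[
\begin{aligned}
D^*_{1/2}(g(X, S)g'(X, S))
&= \frac{2t^2\,r\,r' - 2t\,r^2}{t^4} \times \left( \sum_{i=1}^p \frac{\partial t}{\partial S_{ii}}\,X_i^2 + \frac{1}{2} \sum_{i \ne j} \frac{\partial t}{\partial S_{ij}}\,X_iX_j \right) \\
&= -2\,\frac{t^2\,r\,r' - t\,r^2}{t^4} \times \underbrace{\left( \sum_{i=1}^p (X'S^{-1})_i^2\,X_i^2 + \frac{1}{2} \sum_{i \ne j} 2\,(X'S^{-1})_i\,(X'S^{-1})_j\,X_iX_j \right)}_{=\,(X'S^{-1}X)^2\,=\,t^2} \\
&= -2\left( r\,r' - \frac{r^2}{t} \right).
\end{aligned}
\]

Finally putting everything back together in the expression (2.10), we obtain

\[
\begin{aligned}
&C E^*\left[ 2\,\mathrm{div}_X(g(X, S)) + (n-p-2)\,g(X, S)'S^{-1}g(X, S) + 2D^*_{1/2}(g(X, S)g(X, S)') \right] \\
&\qquad = C E^*\left[ -2\left( \frac{(p-2)\,r}{t} + 2r' \right) + (n-p-2)\,\frac{r^2}{t} - 4\left( r\,r' - \frac{r^2}{t} \right) \right] \\
&\qquad = C E^*\left[ \frac{r}{t}\left( -2(p-2) + (n-p+2)\,r \right) - 4r'\,(1 + r) \right]
\end{aligned}
\]

which, given the assumptions on $r$, is nonpositive as required: $r' \ge 0$ because $r$ is nondecreasing, and $0 < r \le \frac{2(p-2)}{n-p+2}$ makes the first term nonpositive.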

Choosing $r$ constant and equal to $\frac{p-2}{n-p+2}$ ($n$ became $n-1$ given how we considered the problem in this section) gives the James-Stein estimator. This means that the estimator (1.8) found in the normal case also works in the multivariate Student case: simulations will be used to compare them in Chapter 4.
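To see why this particular constant is the natural one, the risk difference of Theorem 2.1 can be specialized to a constant function $r \equiv c$, so that $r' = 0$. The short derivation below is a sketch under that assumption; it collects the $r^2/(X'S^{-1}X)$ terms of the proof of Corollary 2.2, whose coefficients $(n-p-2)$ and $4$ add up to $n-p+2$, and then minimizes the resulting quadratic in $c$.

```latex
\begin{aligned}
\text{risk difference}
  &= C\,E^*\!\left[\frac{c\,\bigl((n-p+2)\,c-2(p-2)\bigr)}{X'S^{-1}X}\right], \\
\frac{d}{dc}\Bigl[c\,\bigl((n-p+2)\,c-2(p-2)\bigr)\Bigr]
  &= 2(n-p+2)\,c-2(p-2)=0
  \iff c=\frac{p-2}{n-p+2}, \\
c\,\bigl((n-p+2)\,c-2(p-2)\bigr)\Big|_{c=\frac{p-2}{n-p+2}}
  &= -\frac{(p-2)^2}{n-p+2}.
\end{aligned}
```

The minimizing constant sits exactly in the middle of the admissible interval $\bigl(0, \frac{2(p-2)}{n-p+2}\bigr]$, mirroring the choice $b = \frac{p-2}{2}$ made in Chapter 1.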


Second generalization: high dimension

An implicit assumption for the previous results was that the dimension $p$ is smaller than the sample size $n$, so that the estimated covariance matrix is non-singular. Indeed, if, given a sample $Y_1, \dots, Y_n$, we compute a matrix $S$ as before by $S = \sum_{j=1}^n Y_jY_j'$, then $\mathrm{rk}(S) \le \min(n, p)$. This is easier to see by writing $S = Y'Y$, with $Y$ the $n \times p$ matrix whose rows are our observations $Y_i$. As the $Y_i$ are usually independent and, in the cases that interest us, from a continuous distribution, the rank of $S$ is actually equal to $\min(n, p)$ almost surely. Therefore, when $p > n$, $S$ is a $p \times p$ matrix of rank $n$ and is almost surely singular. To illustrate this, let’s look at 3 observations (rounded to the nearest hundredth for readability) from a 4-dimensional normal with mean 0 and covariance matrix the diagonal matrix with 1, 2, 3, 4 on the diagonal. The 3 vectors are

\[
X_1 = (0.43, -0.45, -1.59, -2.96)', \qquad
X_2 = (-0.78, -0.89, -3.46, 3.15)', \qquad
X_3 = (-1.29, -0.15, -0.47, -1.91)'
\]

and the matrix $S$ obtained, along with its singular value decomposition to show its singularity and also anticipating what is to come, is

\[
S =
\begin{pmatrix}
1.56 & -0.12 & -0.41 & -2.19 \\
-0.12 & 0.28 & 1.12 & -2.05 \\
-0.41 & 1.12 & 4.56 & -8.46 \\
-2.19 & -2.05 & -8.46 & 21.39
\end{pmatrix}
= U
\begin{pmatrix}
25.27 & 0 & 0 & 0 \\
0 & 2.52 & 0 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0
\end{pmatrix}
V'
\]

where $U$ and $V$ are two unitary matrices. For the exact values, the R code used is in the appendix.
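The singularity of $S$ can be reproduced from the three rounded vectors. The sketch below is hedged on one point: the displayed entries of $S$ (for instance 1.56 and −0.12) match the scatter matrix of the observations centered at their sample mean, which is an inference from the numbers and not something stated in the text. Under that assumption the rank is 2, consistent with the two nonzero singular values. Python with a hand-written Gaussian elimination is used for illustration; the thesis's own code, in the appendix, is in R.

```python
def centered_scatter(rows):
    """Return sum_i (x_i - mean)(x_i - mean)' for a list of equal-length vectors."""
    n, p = len(rows), len(rows[0])
    mean = [sum(row[j] for row in rows) / n for j in range(p)]
    dev = [[row[j] - mean[j] for j in range(p)] for row in rows]
    return [[sum(d[a] * d[b] for d in dev) for b in range(p)] for a in range(p)]


def matrix_rank(matrix, tol=1e-9):
    """Numerical rank by Gaussian elimination with partial pivoting."""
    m = [list(row) for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda i: abs(m[i][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for i in range(rank + 1, n_rows):
            factor = m[i][col] / m[rank][col]
            for j in range(col, n_cols):
                m[i][j] -= factor * m[rank][j]
        rank += 1
    return rank


observations = [
    [0.43, -0.45, -1.59, -2.96],
    [-0.78, -0.89, -3.46, 3.15],
    [-1.29, -0.15, -0.47, -1.91],
]
S = centered_scatter(observations)
rank_S = matrix_rank(S)
print(f"S[0][0] = {S[0][0]:.2f}, S[0][1] = {S[0][1]:.2f}, rank(S) = {rank_S}")
```

Because the inputs are rounded to two decimals, some entries of the recomputed matrix differ slightly from the displayed ones.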

When matrices are singular, as here, generalizations of the inverse exist, with diverse properties. One of the most popular, due to its uniqueness, is the Moore-Penrose pseudoinverse, and the focus of this section will be to show how to use it to obtain new forms of the James-Stein estimator.

We put ourselves back in the situation where $X$ is a $p$-dimensional variable normally distributed with mean $\theta$ and covariance matrix $\Sigma$, and we would like to estimate $\theta$, knowing
