About Stein's estimators: the original result and extensions
Author: Demaret, Tom. Supervisor(s): Swan, Yvik. Faculty: Faculty of Sciences
Degree: Master in mathematical sciences, specialized focus in statistics. Academic year: 2018-2019
URI/URL : http://hdl.handle.net/2268.2/6981
Notice to users:
All documents placed in open access on the MatheO site are protected by copyright. In accordance with the principles set out by the "Budapest Open Access Initiative" (BOAI, 2002), users of the site may read, download, copy, transmit, print, search, or link to the full text of these documents, crawl them for indexing, use them as data for software, or use them for any other lawful purpose (or any purpose provided for by copyright regulations). Any use of the document for commercial purposes is strictly prohibited.
Furthermore, users undertake to respect the moral rights of the author, chiefly the right to the integrity of the work and the right of attribution, in any use they make of it. Thus, for example, when reproducing a document in part or in full, users shall cite the sources completely, as mentioned above. Any use not explicitly authorized above (such as, for example, modifying the document or summarizing it) requires the prior and express authorization of the authors or their rights holders.
Master’s thesis
About Stein’s estimators
The original result and extensions
Author:
Tom Demaret
Supervisor:
Yvik Swan
A thesis submitted in fulfillment of the requirements for the Master’s degree in mathematics
Department of mathematics Faculty of Sciences
Academic year 2018-2019
I would like to thank Pr. Yvik Swan for proposing this topic to me and for his help and patience along the way.
Thanks also to the readers: in the hope they will have an enjoyable time and find this work interesting.
1 Stein's original result
2 First generalization: elliptically symmetric laws
3 Second generalization: high dimension
3.1 Reminders on the Moore-Penrose inverse
3.2 The main result
3.3 Technical results
4 Simulations
A Appendix: Technical results
A.1 Results on Wishart matrices
A.2 Results on the Moore-Penrose inverse
B Appendix: R code
Suppose you are interested in estimating the mean θ of a Gaussian law and have at your disposal, to this end, a single Gaussian random variable X (which, as the Gaussian law is sum-stable, could itself be the sample mean of a large number of independent Gaussian random variables). Given the symmetry of the normal distribution, intuition tells us that looking at X itself is the best way to estimate θ. To assess the quality of this intuitive estimator, the square error loss function and its associated risk, the mean square error, are the most common evaluation criteria. There are many reasons behind this popularity: convenience, elegance, mathematical tractability... In our case, an additional reason is its intrinsic link with the Gaussian law. Indeed, for the normal distribution, taking the maximum likelihood estimator is the same as minimizing the square error, and the estimator obtained this way is X, confirming our intuition. There exist other ways to further assess the quality of an estimator: two important ones, which set up the premises of this work, are given in the following definitions.
Definition. An estimator δ∗ of a parameter θ ∈ Θ is minimax with respect to a risk function R(θ, δ) if
\[ \sup_{\theta\in\Theta} R(\theta,\delta^*) = \inf_{\delta}\,\sup_{\theta\in\Theta} R(\theta,\delta). \]
Intuitively, an estimator is minimax if it is "the best in the worst case".
Definition. An estimator δ∗ of a parameter θ ∈ Θ is admissible with respect to a risk function R(θ, δ) if no other estimator dominates it, meaning there does not exist an estimator δ such that R(θ, δ) ≤ R(θ, δ∗) for all θ and R(θ, δ) < R(θ, δ∗) for at least some θ.
In one dimension, the estimator X is minimax and admissible (see [12] and references therein), and at first glance, nothing seems to indicate that it should be any different in higher dimensions. In fact, the estimator X, for a variable whose mean is the vector θ, is minimax in any dimension, and if the dimension is p and the covariance is σ²I (we look at this case for simplicity, as it is equivalent to looking at p independent one-dimensional Gaussian variables and it makes computations easy), then the mean square error, E[‖X−θ‖²], is equal to pσ². It has also been shown to be admissible for p = 2 (see [12]), but in 1956, Charles Stein found estimators that dominate X as soon as p is greater than or equal to 3. More precisely, he exhibited a biased estimator of the form X + g(X), with g a certain function, that is also minimax: its risk (for σ² = 1) is p minus a term proportional to E[1/‖X‖²], so it becomes equal to that of X when the norm of θ goes to infinity, but is lower for all finite θ. One could say it exploits the weakness in the definition of a minimax estimator. This somewhat paradoxical result, stating in essence that combining information on unrelated events could bring better results overall than looking at each individually, came as a big surprise when first presented, but soon led to a plethora of research. This thesis aims to explore some of this research.
In Chapter 1, the first result by Stein is presented: starting from the simple case of identity covariance matrix and working our way step by step to the most direct generalizations.
This is mostly taken from [12, 13, 14].
In Chapter 2, the estimators are extended to different probability laws. The link between mean square error and the Gaussian law comes from the presence of a (x−θ)² term in the density function. This term appears in other laws, called elliptically symmetric laws, so it is natural that those laws behave similarly. This is taken from [6].
In our world filled with data, the case of high dimension and low sample size is getting more and more common and brings new kinds of problems. This is explored in Chapter 3.
The main result is from [2].
Finally, the results are illustrated through simulations in Chapter 4 and some technical side results used are proved in the Appendix.
1 Stein's original result
The James-Stein estimator is an estimator for the mean θ of a normal distribution which, at the price of a bias, dominates the usual estimator under mean square error, $E[\|\hat\theta-\theta\|^2]$, where $\hat\theta$ is our estimator.
The result is based on a simple lemma, proved using Fubini’s theorem.
Lemma 1.1. Let X be a real random variable following a standard Gaussian law N(0,1) and g : R → R an absolutely continuous function such that g′ is integrable. Then, if E|g′(X)| < ∞,
\[ E[g'(X)] = E[X g(X)]. \]
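Proof. The density function $\phi(x)$ of the standard Gaussian law, as it will be denoted from now on, is such that $\phi'(x) = -x\phi(x)$.
Note also how, using $\int_{\mathbb{R}} y\phi(y)\,dy = 0$,
\[ \int_{-\infty}^{x} -y\phi(y)\,dy = \int_{x}^{+\infty} y\phi(y)\,dy \qquad \forall x\in\mathbb{R}. \]
We then have
\[
\begin{aligned}
E[g'(X)] &= \int_{\mathbb{R}} g'(x)\phi(x)\,dx \\
&= \int_{\mathbb{R}} g'(x)\left(\int_{-\infty}^{x} (-y\phi(y))\,dy\right)dx \\
&= \int_{0}^{+\infty} g'(x)\left(\int_{x}^{+\infty} y\phi(y)\,dy\right)dx - \int_{-\infty}^{0} g'(x)\left(\int_{-\infty}^{x} y\phi(y)\,dy\right)dx \\
&= \int_{0}^{+\infty} y\phi(y)\left(\int_{0}^{y} g'(x)\,dx\right)dy - \int_{-\infty}^{0} y\phi(y)\left(\int_{y}^{0} g'(x)\,dx\right)dy \\
&= \int_{\mathbb{R}} y\phi(y)\,(g(y)-g(0))\,dy \\
&= E[Xg(X)] - g(0)E[X] = E[Xg(X)].
\end{aligned}
\]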
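The result can be extended to any Gaussian random variable $Y\sim N(\mu,\sigma^2)$ by writing $Y=\sigma X+\mu$ where $X\sim N(0,1)$ and $h(y)=g\left(\frac{y-\mu}{\sigma}\right)$. We then obtain
\[
\begin{aligned}
E[h'(Y)] &= \frac{1}{\sigma}E\left[g'\left(\frac{Y-\mu}{\sigma}\right)\right] = \frac{1}{\sigma}E[g'(X)] = \frac{1}{\sigma}E[Xg(X)] \\
&= \frac{1}{\sigma}E\left[\frac{Y-\mu}{\sigma}\,g\left(\frac{Y-\mu}{\sigma}\right)\right] = E\left[\frac{Y-\mu}{\sigma^2}\,h(Y)\right].
\end{aligned}
\]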
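The identity $E[h'(Y)] = E[(Y-\mu)h(Y)/\sigma^2]$ can be checked numerically. The sketch below is only an illustration: the choice h = tanh, the values of µ and σ, and the helper name `stein_identity_sides` are assumptions made for the example, not part of the thesis.

```python
import math
import random

def stein_identity_sides(h, h_prime, mu, sigma, n_draws=200_000, seed=0):
    """Monte Carlo estimates of E[h'(Y)] and E[(Y - mu) / sigma^2 * h(Y)], Y ~ N(mu, sigma^2)."""
    rng = random.Random(seed)
    lhs = 0.0
    rhs = 0.0
    for _ in range(n_draws):
        y = rng.gauss(mu, sigma)
        lhs += h_prime(y)
        rhs += (y - mu) / sigma**2 * h(y)
    return lhs / n_draws, rhs / n_draws

# Illustrative choice (an assumption): h = tanh, smooth with a bounded derivative.
lhs, rhs = stein_identity_sides(
    math.tanh, lambda y: 1.0 / math.cosh(y) ** 2, mu=1.0, sigma=2.0
)
print(f"E[h'(Y)] ~ {lhs:.4f}   E[(Y - mu) h(Y) / sigma^2] ~ {rhs:.4f}")
```

The two averages should agree up to Monte Carlo error, which shrinks like the inverse square root of the number of draws.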
The result then needs to be extended to any dimension. To remain as general as possible, we will use the following definition.
Definition 1.1. A function h : R^p → R is called almost differentiable if there exists a function ∇h : R^p → R^p such that, for all a ∈ R^p,
\[ h(x+a) = h(x) + \int_0^1 a\cdot\nabla h(x+ta)\,dt. \]
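The function ∇h is essentially the vector of partial derivatives, which is why we will use the following notations from now on. For a function f : R^p → R and a function g = (g_1, ..., g_p)′ : R^p → R^p,
• $\nabla f = (\partial_1 f, \dots, \partial_p f)'$
• $\operatorname{div}(g) = \sum_{i=1}^{p}\partial_i g_i$
• $\Delta f = \sum_{i=1}^{p}\partial_i^2 f$.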
Lemma 1.2. Let X be a p-dimensional random variable following a Gaussian law with mean θ and the identity as covariance matrix, and h : R^p → R an almost differentiable function. If E‖∇h(X)‖ < ∞, then
\[ E[\nabla h(X)] = E[(X-\theta)h(X)]. \]
Proof. For all $i\in\{1,\dots,p\}$, write $X=(X_i, X_{-i})$, meaning that we decompose X between its ith component and all the others. Because X is normal with identity covariance, $X_i$ and $X_{-i}$ are independent and therefore, using Lemma 1.1, we find that
\[ E[\partial_i h(X)\mid X_{-i}] = E[(X_i-\theta_i)h(X)\mid X_{-i}], \]
and then, taking the expectation,
\[ E[\partial_i h(X)] = E[(X_i-\theta_i)h(X)], \]
whence the conclusion.
For what follows, we first consider X to be a p-dimensional random variable following a standard Gaussian law with mean θ and the identity as covariance matrix.
Theorem 1.1. For an almost differentiable function g : R^p → R^p (meaning that all its components are almost differentiable), such that
\[ E\left[\sum_{i=1}^{p}|\partial_i g_i(X)|\right] < \infty, \tag{A} \]
we have
\[ E\left[\|X+g(X)-\theta\|^2\right] = p + E\left[\|g(X)\|^2 + 2\operatorname{div}(g(X))\right]. \tag{1.1} \]
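Proof. Using Lemma 1.2, we have
\[
\begin{aligned}
E\left[(X_i+g_i(X)-\theta_i)^2\right] &= E\left[(X_i-\theta_i)^2 + 2(X_i-\theta_i)g_i(X) + g_i^2(X)\right] \\
&= 1 + 2E[\partial_i g_i(X)] + E\left[g_i^2(X)\right]
\end{aligned}
\]
and we get the result by summing over i.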
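As a sanity check on (1.1), one can take the linear shrinkage $g(x) = -ax$, for which both sides have closed forms. This choice of g is an illustrative assumption (it is not an estimator studied in the text), and the function names are hypothetical. Here $\operatorname{div}(g) = -ap$ and $E\|g(X)\|^2 = a^2(p+\|\theta\|^2)$, while the direct computation splits the risk of $(1-a)X$ into variance plus squared bias.

```python
from fractions import Fraction

def risk_via_theorem(p, a, theta_sq):
    """p + E[||g(X)||^2 + 2 div g(X)] for g(x) = -a x, using E||X||^2 = p + ||theta||^2."""
    return p + a**2 * (p + theta_sq) - 2 * a * p

def risk_direct(p, a, theta_sq):
    """E||(1 - a) X - theta||^2 computed as variance plus squared bias."""
    return (1 - a) ** 2 * p + a**2 * theta_sq

# Hypothetical values: dimension 5, shrinkage 1/4, ||theta||^2 = 3.
p, a, theta_sq = 5, Fraction(1, 4), Fraction(3)
print(risk_via_theorem(p, a, theta_sq), risk_direct(p, a, theta_sq))
```

Exact rational arithmetic is used so that the two expressions can be compared for equality rather than up to rounding.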
This theorem gives an expression for the MSE of any estimator of the form X + g(X) by decomposing it between the error of the usual estimator, E[‖X−θ‖²] = p, and a term that depends on the function g. The goal will be to make this term negative. With this in mind, we concentrate in (1.1) on functions g : R^p → R^p of the form
\[ g = \nabla\log f = \frac{\nabla f}{f} \]
with f such that this is well defined. This leads to a modified version of Theorem 1.1.
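Theorem 1.2. Let $f : \mathbb{R}^p \to \mathbb{R}^+_0$ be an almost differentiable function such that ∇f is also almost differentiable,
\[ E\left[\frac{1}{f(X)}\sum_{i=1}^{p}|\partial_i^2 f(X)|\right] < \infty \tag{A'} \]
and
\[ E\left[\|\nabla\log f(X)\|^2\right] < \infty. \]
Then
\[ E\left[\|X+\nabla\log f(X)-\theta\|^2\right] = p + E\left[\frac{2\Delta f(X)}{f(X)} - \frac{\|\nabla f(X)\|^2}{f^2(X)}\right] = p + 4E\left[\frac{\Delta\sqrt{f(X)}}{\sqrt{f(X)}}\right]. \]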
Remark. Note that condition A’ is simply the rewriting of condition A with our particular choice of g and the other conditions ensure that the final expression is finite.
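Proof. Using
\[ \operatorname{div}(\nabla\log f) = \frac{\Delta f}{f} - \frac{\|\nabla f\|^2}{f^2}, \]
we directly get from (1.1)
\[ E\left[\|X+\nabla\log f(X)-\theta\|^2\right] = p + E\left[\frac{\|\nabla f\|^2}{f^2} + 2\left(\frac{\Delta f}{f} - \frac{\|\nabla f\|^2}{f^2}\right)\right] = p + E\left[\frac{2\Delta f}{f} - \frac{\|\nabla f\|^2}{f^2}\right]. \]
Finally, as
\[ \Delta\sqrt{f} = \operatorname{div}\left(\nabla\sqrt{f}\right) = \operatorname{div}\left(\frac{\nabla f}{2\sqrt{f}}\right) = \frac{\Delta f}{2\sqrt{f}} - \frac{\|\nabla f\|^2}{4f^{3/2}}, \]
we can replace $\|\nabla f\|^2$ by
\[ 2f(\Delta f) - 4f^{3/2}\left(\Delta\sqrt{f}\right) \]
to get the desired result.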
From this, it follows that if we can find a function f satisfying the assumptions of the theorem and such that $\Delta\sqrt{f(x)} \le 0$, then we have an estimator, X + ∇log f(X), that dominates the usual estimator X. Indeed, in that case,
\[ E\left[\|X+\nabla\log f(X)-\theta\|^2\right] \le p = E\left[\|X-\theta\|^2\right]. \]
The famous James-Stein estimator is obtained by choosing $f(x) = \left(\frac{1}{\|x\|^2}\right)^{b}$ (see [15] for more details).
We have
\[ \nabla f(X) = -b\left(\|X\|^2\right)^{-(b+1)} 2X \]
so
\[ \nabla\log f(X) = \frac{\nabla f(X)}{f(X)} = \frac{-2b}{\|X\|^2}X \]
and the estimator is $\left(1-\frac{2b}{\|X\|^2}\right)X$. As we have
\[ \Delta\sqrt{f(X)} = \frac{-b(p-2-b)}{\|X\|^{b+2}}, \]
this estimator dominates X for 0 ≤ b ≤ p − 2 (when p > 2). Its risk is equal to $p - 4E\left[\frac{b(p-2-b)}{\|X\|^2}\right]$. The improvement is therefore maximal for $b = \frac{p-2}{2}$, whence the following definition.
Definition 1.2. The usual James-Stein estimator for $X\sim N_p(\theta, I)$ is
\[ \left(1-\frac{p-2}{\|X\|^2}\right)X. \tag{1.2} \]
Its risk is equal to $p-(p-2)^2E\left[\frac{1}{\|X\|^2}\right]$.
It is interesting to note that the degree of the improvement depends on the value of ‖X‖²: the closer X is to 0, the bigger the improvement will be. On the other hand, if ‖X‖ is large, the James-Stein estimator will be very close to the observed value X. This means that the estimator is much more useful when X ∼ N_p(θ, I) with ‖θ‖ small. It may therefore be useful to modify the estimator, by centering the data first. This will be studied by simulations, in Chapter 4.
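The dependence of the improvement on ‖θ‖ can be seen in a small Monte Carlo experiment. This is a minimal sketch and not the R code of the thesis appendix: the dimension p = 10, the values of ‖θ‖, the number of replications and the function names are all assumptions made for illustration.

```python
import random

def james_stein(x):
    """Return (1 - (p - 2) / ||x||^2) x, the estimator (1.2)."""
    p = len(x)
    norm_sq = sum(v * v for v in x)
    shrink = 1.0 - (p - 2) / norm_sq
    return [shrink * v for v in x]

def simulated_risks(theta, n_rep=20_000, seed=0):
    """Monte Carlo mean square errors of X and of James-Stein for X ~ N_p(theta, I)."""
    rng = random.Random(seed)
    mse_x = 0.0
    mse_js = 0.0
    for _ in range(n_rep):
        x = [rng.gauss(t, 1.0) for t in theta]
        js = james_stein(x)
        mse_x += sum((v - t) ** 2 for v, t in zip(x, theta))
        mse_js += sum((v - t) ** 2 for v, t in zip(js, theta))
    return mse_x / n_rep, mse_js / n_rep

for norm_theta in (0.0, 2.0, 10.0):  # hypothetical values of ||theta||
    theta = [norm_theta] + [0.0] * 9  # p = 10
    risk_x, risk_js = simulated_risks(theta)
    print(f"||theta|| = {norm_theta:4.1f}: risk(X) ~ {risk_x:.2f}, risk(JS) ~ {risk_js:.2f}")
```

At θ = 0 the risk formula of Definition 1.2 gives p − (p−2)² E[1/χ²_p] = p − (p−2) = 2, so the simulated James-Stein risk should be close to 2 there and approach p as ‖θ‖ grows.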
We will now work by steps to generalize this result to any covariance matrix.
Let's first look at the case where the covariance matrix is a multiple of the identity, i.e. X ∼ N_p(θ, σ²I). This is the case if we consider a sample of p independent normally distributed variables with common variance as a vector of size p.
If σ² is known, looking at X/σ and applying previous results, the James-Stein estimator for X takes the form
\[ \left(1-\frac{(p-2)\sigma^2}{\|X\|^2}\right)X \tag{1.3} \]
and its risk is
\[ \sigma^2\left(p-(p-2)^2E\left[\frac{\sigma^2}{\|X\|^2}\right]\right). \]
If σ² is unknown, it needs to be estimated and we assume we have at our disposal a variable $s\sim\sigma^2\chi^2_n$, independent of X. This would typically be obtained through a sample of n + 1 independent normally distributed variables $v_i$ ($v_i\sim N(\mu,\sigma^2)$), by taking $s=\sum_{i=1}^{n+1}(v_i-\bar v)^2$.
Remark. The notation s is used for readability purposes, but be careful that it does not denote the standard deviation. In our example, it denotes n times the sample variance.
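Setting
\[ Y=\frac{X}{\sigma},\qquad \eta=\frac{\theta}{\sigma},\qquad s^*=\frac{s}{\sigma^2} \tag{1.4} \]
and looking at estimators of the form $\left(1-\frac{c(p-2)s}{\|X\|^2}\right)X$, with c a constant to be determined, we get
\[
\begin{aligned}
E\left[\left\|X-\frac{c(p-2)s}{\|X\|^2}X-\theta\right\|^2\right]
&= \sigma^2E\left[\left\|Y-\frac{c(p-2)s^*}{\|Y\|^2}Y-\eta\right\|^2\right] \\
&= \sigma^2E\left[\left((Y-\eta)-\frac{c(p-2)s^*}{\|Y\|^2}Y\right)'\left((Y-\eta)-\frac{c(p-2)s^*}{\|Y\|^2}Y\right)\right] \\
&= \sigma^2E\left[\|Y-\eta\|^2-2c(p-2)s^*\frac{(Y-\eta)'Y}{\|Y\|^2}+c^2(p-2)^2s^{*2}\frac{1}{\|Y\|^2}\right] \\
&= \sigma^2\left(p-2c(p-2)E[s^*]\,E\left[\frac{(Y-\eta)'Y}{\|Y\|^2}\right]+c^2(p-2)^2E\left[s^{*2}\right]E\left[\frac{1}{\|Y\|^2}\right]\right) \\
&= \sigma^2\left(p-2c(p-2)\,n\,E\left[\frac{(Y-\eta)'Y}{\|Y\|^2}\right]+c^2(p-2)^2\,n(n+2)\,E\left[\frac{1}{\|Y\|^2}\right]\right)
\end{aligned}
\]
using the independence of Y and s∗ and the expression for the moments of a χ²_n law.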
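The middle term can easily be computed by integration by parts, but, because $Y\sim N_p(\eta, I)$, we can also write
\[
\begin{aligned}
E\left[\left\|Y-\frac{p-2}{\|Y\|^2}Y-\eta\right\|^2\right]
&= E\left[\|Y-\eta\|^2-2(p-2)\frac{(Y-\eta)'Y}{\|Y\|^2}+(p-2)^2\frac{1}{\|Y\|^2}\right] \\
&= p+(p-2)^2E\left[\frac{1}{\|Y\|^2}\right]-2(p-2)E\left[\frac{(Y-\eta)'Y}{\|Y\|^2}\right]
\end{aligned}
\]
and, as we know that the left-hand side is also equal to $p-(p-2)^2E\left[\frac{1}{\|Y\|^2}\right]$, we find that
\[ E\left[\frac{(Y-\eta)'Y}{\|Y\|^2}\right] = (p-2)E\left[\frac{1}{\|Y\|^2}\right]. \]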
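Wrapping up, this gives us
\[ E\left[\left\|X-\frac{c(p-2)s}{\|X\|^2}X-\theta\right\|^2\right] = \sigma^2\left(p-n(p-2)^2\left(2c-c^2(n+2)\right)E\left[\frac{1}{\|Y\|^2}\right]\right). \]
Since c is arbitrary, we can choose it so as to maximize $2c-c^2(n+2)$, and hence minimize the risk, which is done by taking $c=\frac{1}{n+2}$.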
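Definition 1.3. The James-Stein estimator for $X\sim N_p(\theta,\sigma^2I)$, with σ² unknown and estimated through $s\sim\sigma^2\chi^2_n$, is
\[ \left(1-\frac{(p-2)s}{(n+2)\|X\|^2}\right)X. \tag{1.5} \]
Its risk is equal to $\sigma^2\left(p-\frac{n}{n+2}(p-2)^2E\left[\frac{\sigma^2}{\|X\|^2}\right]\right)$.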
Note how estimating σ² only caused a loss of precision by a proportion of $\frac{2}{n+2}$ compared to the case of σ² known.
Taking $c=\frac{1}{n}$ is another possibility. While less optimal, it yields the estimator
\[ \left(1-\frac{(p-2)s}{n\|X\|^2}\right)X, \tag{1.6} \]
where $\frac{s}{n}$ appears. As $\frac{s}{n}$ is such that $E\left[\frac{s}{n}\right]=\sigma^2$, this is simply the estimator (1.3) where σ² has been replaced by its estimate.
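The two choices of c can be compared through the factor $n\,(2c-c^2(n+2))$ that multiplies $(p-2)^2E[1/\|Y\|^2]$ in the risk reduction derived above. The snippet below is a small illustration; the value n = 10 and the function name `improvement_factor` are assumptions made for the example.

```python
from fractions import Fraction

def improvement_factor(c, n):
    """n (2c - c^2 (n + 2)): the multiplier of (p - 2)^2 E[1/||Y||^2] in the risk reduction."""
    return n * (2 * c - c**2 * (n + 2))

n = 10  # hypothetical number of degrees of freedom
best = improvement_factor(Fraction(1, n + 2), n)  # c = 1/(n+2) gives n/(n+2)
plug_in = improvement_factor(Fraction(1, n), n)   # c = 1/n gives (n-2)/n
print(best, plug_in)  # prints "5/6 4/5"
```

In general the gap between the two factors is $\frac{n}{n+2}-\frac{n-2}{n}=\frac{4}{n(n+2)}$, so the plug-in choice c = 1/n loses little once n is moderately large.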
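The same kind of transformation as in (1.4) can be done if the covariance is of the form $\Sigma=\operatorname{diag}(\sigma_1^2,\dots,\sigma_p^2)$. If $X\sim N_p(\theta,\Sigma)$ and $S=\operatorname{diag}(s_1,\dots,s_p)$ with $s_i\sim\sigma_i^2\chi^2_{n_i}$ for $i\in\{1,\dots,p\}$, denoting $\sqrt{\Sigma}=\operatorname{diag}(\sigma_1,\dots,\sigma_p)$, we would define
\[ Y=\sqrt{\Sigma}^{-1}X,\qquad \eta=\sqrt{\Sigma}^{-1}\theta,\qquad s^*=\Sigma^{-1}S. \]
Each component of our James-Stein estimator would then be
\[ \left(1-\frac{(p-2)s_i}{n_i\|X\|^2}\right)X_i. \]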
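As an example, imagine p independent samples $V_{i1},\dots,V_{i(n_i+1)}$ are available, where $V_{ij}\sim N(\mu_i,\sigma_i^2)$, for $i\in\{1,\dots,p\}$ and $j\in\{1,\dots,n_i+1\}$. Compute
\[ \bar V_i=\frac{1}{n_i+1}\sum_{j=1}^{n_i+1}V_{ij} \qquad\text{and}\qquad s_i^2=\frac{1}{n_i}\sum_{j=1}^{n_i+1}(V_{ij}-\bar V_i)^2, \]
the sample means and variances of each sample. As we are in the normal case, $\bar V_i$ and $s_i^2$ are independent for all i and $s_i^2\sim\frac{\sigma_i^2}{n_i}\chi^2_{n_i}$. We now consider $X=(\bar V_1,\dots,\bar V_p)'$ and $S=\operatorname{diag}(s_1^2,\dots,s_p^2)$. Clearly, $X\sim N_p(\theta,\Sigma)$ with $\theta=(\mu_1,\dots,\mu_p)'$ and $\Sigma=\operatorname{diag}\left(\frac{\sigma_1^2}{n_1},\dots,\frac{\sigma_p^2}{n_p}\right)$. So if we take the estimator $\hat X^{JS}$ with
\[ \hat X^{JS}_i=\left(1-\frac{(p-2)s_i^2}{n_i\|X\|^2}\right)\bar V_i,\qquad i\in\{1,\dots,p\}, \]
we know that
\[ E\left[\|\hat X^{JS}-\theta\|^2\right]\le E\left[\|X-\theta\|^2\right]. \]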
For more applied examples, the article [4], "Stein’s Paradox in Statistics" by Efron and Morris is a great read.
Finally, we consider the general case $X\sim N_p(\theta,\Sigma)$ with Σ an unknown positive semi-definite matrix. Here we assume having at our disposal S, a Wishart matrix with n degrees of freedom ($S\sim W_p(n,\Sigma)$). As before, this means for example that a sample of size n, $V_1,\dots,V_n$ ($V_i\sim N_p(0,\Sigma)$), is available and $S=\sum_i V_iV_i'$. The loss function we use in this case takes the slightly different form of
\[ E\left[(\hat\theta-\theta)'\Sigma^{-1}(\hat\theta-\theta)\right]. \]
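We assume S is invertible and consider estimators of the form
\[ \left(1-\frac{c(p-2)}{X'S^{-1}X}\right)X. \tag{1.7} \]
We have
\[
\begin{aligned}
&E_{\theta,\Sigma}\left[\left(\left(1-\frac{c(p-2)}{X'S^{-1}X}\right)X-\theta\right)'\Sigma^{-1}\left(\left(1-\frac{c(p-2)}{X'S^{-1}X}\right)X-\theta\right)\right] \\
&\quad= E_{\theta^*,I}\left[\left(\left(1-\frac{c(p-2)}{Y'S^{*-1}Y}\right)Y-\theta^*\right)'\left(\left(1-\frac{c(p-2)}{Y'S^{*-1}Y}\right)Y-\theta^*\right)\right],
\end{aligned}
\]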
by applying the transformation $X\to PDX=Y$, where D is a matrix such that $D\Sigma D'=I$, and P is an orthogonal matrix with its first row proportional to $D\theta$, so that $PDX=Y\sim N_p(\theta^*,I)$, with $\theta^*=(\sqrt{\theta'\Sigma^{-1}\theta},0,\dots,0)'$ and $S^*=PDSD'P'$ (so that $S^{*-1}=P'^{-1}D'^{-1}S^{-1}D^{-1}P^{-1}$) following a Wishart $W_p(n,I)$.
As the conditional distribution of $Y'S^{*-1}Y$ given Y is that of $\frac{Y'Y}{Q}$ where $Q\sim\chi^2_{n-p+1}$ (see the appendix for more details), we find ourselves in the same situation as for the case of σ² unknown. Thus, the optimal choice for c is $\frac{1}{n-p+3}$ and the James-Stein estimator is
\[ \left(1-\frac{p-2}{(n-p+3)\,X'S^{-1}X}\right)X, \tag{1.8} \]
its risk being
\[ p-\frac{n-p+1}{n-p+3}(p-2)^2E\left[\frac{1}{X'\Sigma^{-1}X}\right]. \tag{1.9} \]
2 First generalization: elliptically symmetric laws
The next natural step is to see if similar estimators exist for probability laws other than the Gaussian. We will imagine having at our disposal n random p-dimensional variables, X and $V_1,\dots,V_{n-1}$, whose joint density is of the form
\[ f\left((X-\theta)'\Sigma^{-1}(X-\theta)+\sum_{j=1}^{n-1}V_j'\Sigma^{-1}V_j\right), \tag{2.1} \]
with f :R→R+ a Lebesgue integrable function.
Both θ and Σ are unknown. The $V_j$ will be used to estimate Σ through
\[ S=\sum_{j=1}^{n-1}V_jV_j', \]
so that we can look at estimators of θ of the form δ(X, S) = X + g(X, S) under the loss
\[ E\left[(\delta(X,S)-\theta)'\Sigma^{-1}(\delta(X,S)-\theta)\right]. \]
The matrix S is again assumed to be invertible.
All the distributions defined this way are elliptically symmetric. Taking $f(R^2)$ proportional to $\exp\left(-\frac{1}{2}R^2\right)$ yields the multivariate normal, while taking it proportional to $\left(1+\frac{1}{\nu}R^2\right)^{-(\nu+p)/2}$ gives the multivariate t-distribution.
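As before, we will look at
\[
\begin{aligned}
&E\left[(X+g(X,S)-\theta)'\Sigma^{-1}(X+g(X,S)-\theta)\right] \\
&\quad= E\left[(X-\theta)'\Sigma^{-1}(X-\theta)\right]+E\left[2g(X,S)'\Sigma^{-1}(X-\theta)\right]+E\left[g(X,S)'\Sigma^{-1}g(X,S)\right],
\end{aligned}
\]
and make it so that the sum of the last two terms is negative for the estimator to dominate X.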
The result depends on two lemmas allowing us to express the two terms we're interested in using the expectation E∗, which will denote the expectation with respect to the distribution
\[ C^{-1}F\left((X-\theta)'\Sigma^{-1}(X-\theta)+\sum_{j=1}^{n-1}V_j'\Sigma^{-1}V_j\right) \]
where F is defined by
\[ F(t)=\frac{1}{2}\int_t^{+\infty}f(s)\,ds \tag{2.2} \]
and $C^{-1}$ is a normalizing constant, i.e.
\[ C=\int_{\mathbb{R}^p\times\dots\times\mathbb{R}^p}F\left((x-\theta)'\Sigma^{-1}(x-\theta)+\sum_{j=1}^{n-1}v_j'\Sigma^{-1}v_j\right)dx\,dv_1\dots dv_{n-1}. \]
This is in contrast to the standard expectation E, with respect to the f in (2.1).
Before jumping into the results, we can try to get some intuition as to where this F comes from. In the multinormal case, i.e. $f(R^2)\propto\exp\left(-\frac{1}{2}R^2\right)$, we have F = f and therefore E∗ = E. Considering dimension 1 for simplicity, the result in Lemma 1.1 in Chapter 1 is based on the fact that φ′(x) = −xφ(x) for the normal density function. For variables with density of the form f(x²), this can be generalized, with $F(t)=\frac{1}{2}\int_t^{+\infty}f(s)\,ds$, by writing
\[ \left(F(x^2)\right)'=-x\,f(x^2). \]
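The relation between f and F can be explored numerically. The sketch below assumes the normalization $F(t)=\frac{1}{2}\int_t^{+\infty}f(s)\,ds$, which is the one that makes F = f in the Gaussian case; the truncation point of the integral, the Student parameters ν = 5 and p = 3, and all function names are assumptions made for illustration.

```python
import math

def tail_half_integral(f, t, upper=200.0, steps=200_000):
    """Trapezoidal approximation of F(t) = (1/2) * integral of f over [t, upper].

    The finite bound `upper` stands in for +infinity (an approximation).
    """
    h = (upper - t) / steps
    total = 0.5 * (f(t) + f(upper))
    for k in range(1, steps):
        total += f(t + k * h)
    return 0.5 * h * total

def f_gauss(s):
    """Gaussian generator, up to a constant: f(R^2) = exp(-R^2 / 2)."""
    return math.exp(-s / 2.0)

def f_student(s, nu=5.0, p=3):
    """Student generator, up to a constant; nu and p are hypothetical values."""
    return (1.0 + s / nu) ** (-(nu + p) / 2.0)

for t in (0.5, 2.0):
    ratio_gauss = tail_half_integral(f_gauss, t) / f_gauss(t)
    ratio_student = tail_half_integral(f_student, t) / f_student(t)
    print(f"t = {t}: F/f Gaussian ~ {ratio_gauss:.4f}, Student ~ {ratio_student:.4f}")
```

The Gaussian ratio stays at 1 up to quadrature error, whereas for the Student generator direct integration gives F(t)/f(t) = (ν + t)/(ν + p − 2), which depends on t: this is why E∗ and E differ outside the Gaussian case.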
The following lemma is therefore simply a generalization of Stein's Lemma 1.2 in Chapter 1, coinciding with it in the multinormal case.
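Lemma 2.1. If g(x, ·) is a differentiable function, then
\[ E\left[g(X,S)'\Sigma^{-1}(X-\theta)\right]=C\,E^*\left[\operatorname{div}_X(g(X,S))\right]. \tag{2.3} \]
Lemma 2.2. Let T(X, S) be a function from $\mathbb{R}^p\times\mathbb{R}^{p\times p}$ to $\mathbb{R}^{p\times p}$. Then we have
\[ E\left[\operatorname{tr}\left(T(X,S)\Sigma^{-1}\right)\right]=2C\,E^*\left[D^*_{1/2}(T(X,S))\right]+C(n-p-2)\,E^*\left[\operatorname{tr}\left(S^{-1}T(X,S)\right)\right] \]
with
\[ D^*_{1/2}(T(X,S))=\sum_{i=1}^{p}\frac{\partial T_{ii}(X,S)}{\partial S_{ii}}+\frac{1}{2}\sum_{i\ne j}\frac{\partial T_{ij}(X,S)}{\partial S_{ij}}. \]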
This $D^*_{1/2}$ can be seen as a generalization of divergence for matrices, with the $\frac{1}{2}$ being a symptom of the symmetry of S.
Remark. The notation tr(A) is used for the trace of a p × p matrix,
\[ \operatorname{tr}(A)=\sum_{i=1}^{p}A_{ii}, \]
and $\frac{\partial}{\partial A_{ij}}$ denotes the derivative with respect to the component (i, j) of the matrix A (or vector in some cases).
Before proving these lemmas, we will see how they can be used, as it is quite direct.
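Theorem 2.1. If E[X′X] < ∞ and E[g′(X, S)g(X, S)] < ∞, then
\[
\begin{aligned}
&E\left[(X+g(X,S)-\theta)'\Sigma^{-1}(X+g(X,S)-\theta)\right]-E\left[(X-\theta)'\Sigma^{-1}(X-\theta)\right] \\
&\quad= E\left[2g(X,S)'\Sigma^{-1}(X-\theta)\right]+E\left[g(X,S)'\Sigma^{-1}g(X,S)\right] \\
&\quad= C\,E^*\left[2\operatorname{div}_X(g(X,S))+(n-p-2)\,g'(X,S)S^{-1}g(X,S)+2D^*_{1/2}\left(g(X,S)g'(X,S)\right)\right].
\end{aligned} \tag{2.4}
\]
Proof. Using Lemma 2.1 on the first term of the difference gives
\[ E\left[2g(X,S)'\Sigma^{-1}(X-\theta)\right]=C\,E^*\left[2\operatorname{div}_X(g(X,S))\right]. \]
Lemma 2.2, with T(X, S) = g(X, S)g′(X, S), on the second gives
\[
\begin{aligned}
E\left[g(X,S)'\Sigma^{-1}g(X,S)\right] &= E\left[\operatorname{tr}\left(g(X,S)'\Sigma^{-1}g(X,S)\right)\right] \\
&= E\left[\operatorname{tr}\left(\Sigma^{-1}g(X,S)g(X,S)'\right)\right] \\
&= 2C\,E^*\left[D^*_{1/2}\left(g(X,S)g(X,S)'\right)\right]+C(n-p-2)\,E^*\left[\operatorname{tr}\left(S^{-1}g(X,S)g(X,S)'\right)\right].
\end{aligned}
\]
The result follows immediately.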
In the light of equation (2.4), we are in a similar position as after Theorem 1.1 in Chapter 1. Following the same heuristic, it will be our objective to identify functions g : R^p × R^{p×p} → R^p such that (2.4) is negative, thereby yielding domination in terms of MSE. This will be performed in Corollary 2.2, at the end of this chapter.
The proofs of the two lemmas rely on an integration-by-slices result, as well as a corollary involving Stokes' theorem, that can be derived from [5] (Theorem 3.2.12). This divides R^p into ellipsoids and allows the same kind of "integration by parts" generalization as in Stein's Lemma 1.1.
Lemma 2.3. For any r ∈ R and any continuously differentiable function φ defined on R^p, let [φ = r] be the submanifold in R^p associated with φ. Then, for any Lebesgue integrable function f, we have
\[ \int_{\mathbb{R}^p}f(x)\,dx=\int_{\{r\in\mathbb{R}\,\mid\,[\phi=r]\neq\emptyset\}}\int_{[\phi=r]}\frac{f(x)}{\|\nabla\phi(x)\|}\,d\sigma_r\,dr, \]
where $\sigma_r$ is the area measure defined on [φ = r].
Corollary 2.1. If g is a function defined on R^p such that ∇φ · g is integrable, then
\[ \int_{\mathbb{R}^p}\nabla\phi(x)\cdot g(x)\,dx=\int_{\{r\in\mathbb{R}\,\mid\,[\phi=r]\neq\emptyset\}}\int_{B_r}\operatorname{div}(g(x))\,dx\,dr, \]
where $B_r$ is the set with boundary [φ = r] corresponding, for any x ∈ [φ = r], to the outward normal vector ∇φ(x).
We now prove the two lemmas.
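Proof of Lemma 2.1. We want to compute
\[ E\left[g(X,S)'\Sigma^{-1}(X-\theta)\right]=\int_{\mathbb{R}^p\times\dots\times\mathbb{R}^p}\int_{\mathbb{R}^p}g(x,s)'\Sigma^{-1}(x-\theta)\,f\left((x-\theta)'\Sigma^{-1}(x-\theta)+\sum_{j=1}^{n-1}v_j'\Sigma^{-1}v_j\right)dx\,dv_1\dots dv_{n-1}. \]
We define $\phi(x)=\sqrt{(x-\theta)'\Sigma^{-1}(x-\theta)}$ to use Lemma 2.3 and Corollary 2.1 on the inner integral. We have
\[ \nabla\phi(x)=\frac{\Sigma^{-1}(x-\theta)}{\sqrt{(x-\theta)'\Sigma^{-1}(x-\theta)}} \]
and therefore, setting $R=\sqrt{(x-\theta)'\Sigma^{-1}(x-\theta)}$,
\[
\begin{aligned}
&\int_{\mathbb{R}^p}g(x,s)'\Sigma^{-1}(x-\theta)\,f\left((x-\theta)'\Sigma^{-1}(x-\theta)+\sum_{j=1}^{n-1}v_j'\Sigma^{-1}v_j\right)dx \\
&\quad=\int_0^{+\infty}f\left(R^2+\sum_{j=1}^{n-1}v_j'\Sigma^{-1}v_j\right)\int_{[\phi=R]}\frac{g(x,s)'\Sigma^{-1}(x-\theta)}{\|\nabla\phi(x)\|}\,d\sigma_R\,dR \\
&\quad=\int_0^{+\infty}f\left(R^2+\sum_{j=1}^{n-1}v_j'\Sigma^{-1}v_j\right)\int_{[\phi=R]}g(x,s)'\sqrt{(x-\theta)'\Sigma^{-1}(x-\theta)}\,\frac{\nabla\phi(x)}{\|\nabla\phi(x)\|}\,d\sigma_R\,dR \\
&\quad=\int_0^{+\infty}R\,f\left(R^2+\sum_{j=1}^{n-1}v_j'\Sigma^{-1}v_j\right)\int_{[\phi=R]}g(x,s)'\frac{\nabla\phi(x)}{\|\nabla\phi(x)\|}\,d\sigma_R\,dR \\
&\quad=\int_{\mathbb{R}^p}R\,f\left(R^2+\sum_{j=1}^{n-1}v_j'\Sigma^{-1}v_j\right)\nabla\phi(x)\cdot g(x)\,dx \\
&\qquad\text{(using Lemma 2.3 backwards so that we can now use Corollary 2.1)} \\
&\quad=\int_0^{+\infty}R\,f\left(R^2+\sum_{j=1}^{n-1}v_j'\Sigma^{-1}v_j\right)\int_{[\phi\le R]}\operatorname{div}_x(g(x,s))\,dx\,dR \\
&\quad=\int_{\mathbb{R}^p}\operatorname{div}_x(g(x,s))\int_{\sqrt{(x-\theta)'\Sigma^{-1}(x-\theta)}}^{+\infty}R\,f\left(R^2+\sum_{j=1}^{n-1}v_j'\Sigma^{-1}v_j\right)dR\,dx \\
&\quad=\int_{\mathbb{R}^p}\operatorname{div}_x(g(x,s))\,\frac{1}{2}\int_{(x-\theta)'\Sigma^{-1}(x-\theta)}^{+\infty}f\left(r+\sum_{j=1}^{n-1}v_j'\Sigma^{-1}v_j\right)dr\,dx \\
&\quad=\int_{\mathbb{R}^p}\operatorname{div}_x(g(x,s))\,F\left((x-\theta)'\Sigma^{-1}(x-\theta)+\sum_{j=1}^{n-1}v_j'\Sigma^{-1}v_j\right)dx.
\end{aligned}
\]
Replacing this into the first expression gives the desired result in (2.3).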
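Proof of Lemma 2.2. First develop the term in the expectation:
\[
\begin{aligned}
\operatorname{tr}\left(T(X,S)\Sigma^{-1}\right) &= \operatorname{tr}\left(T(X,S)\Sigma^{-1}SS^{-1}\right) \\
&= \operatorname{tr}\left(T(X,S)\Sigma^{-1}\sum_{i=1}^{n-1}V_iV_i'S^{-1}\right) \\
&= \sum_{i=1}^{n-1}\operatorname{tr}\left(V_i'S^{-1}T(X,S)\Sigma^{-1}V_i\right) \\
&= \sum_{i=1}^{n-1}V_i'S^{-1}T(X,S)\Sigma^{-1}V_i.
\end{aligned}
\]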
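Then, mimicking the proof of Lemma 2.1 done just before, with $V_i$ instead of $X-\theta$ and $g(V_i,S)=T(X,S)S^{-1}V_i$, we get (with $S^{lm}$ denoting $(S^{-1})_{lm}$)
\[
\begin{aligned}
E\left[\operatorname{tr}\left(T(X,S)\Sigma^{-1}\right)\right] &= C\sum_{i=1}^{n-1}E^*\left[\operatorname{div}_{V_i}\left(T(X,S)S^{-1}V_i\right)\right] \\
&= C\sum_{i=1}^{n-1}E^*\left[\sum_{j=1}^{p}\frac{\partial}{\partial V_{ij}}\left(\sum_{m=1}^{p}\sum_{l=1}^{p}T_{jl}(X,S)\,S^{lm}\,V_{im}\right)\right] \\
&= C\,E^*\left[A_1+A_2+A_3\right],
\end{aligned} \tag{2.5}
\]
where
\[ A_1=\sum_{i=1}^{n-1}\sum_{j=1}^{p}\sum_{m=1}^{p}\sum_{l=1}^{p}\left(\frac{\partial}{\partial V_{ij}}V_{im}\right)T_{jl}(X,S)\,S^{lm}, \]
\[ A_2=\sum_{i=1}^{n-1}\sum_{j=1}^{p}\sum_{m=1}^{p}\sum_{l=1}^{p}V_{im}\left(\frac{\partial}{\partial V_{ij}}T_{jl}(X,S)\right)S^{lm}, \]
and
\[ A_3=\sum_{i=1}^{n-1}\sum_{j=1}^{p}\sum_{m=1}^{p}\sum_{l=1}^{p}V_{im}\,T_{jl}(X,S)\left(\frac{\partial}{\partial V_{ij}}S^{lm}\right). \]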
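We then compute $A_1$, $A_2$ and $A_3$ separately. First,
\[
\begin{aligned}
A_1 &= \sum_{i=1}^{n-1}\sum_{j=1}^{p}\sum_{m=1}^{p}\sum_{l=1}^{p}\delta_{jm}\,T_{jl}(X,S)\,S^{lm} \\
&= (n-1)\sum_{j=1}^{p}\sum_{l=1}^{p}T_{jl}(X,S)\,S^{lj} \\
&= (n-1)\operatorname{tr}\left(T(X,S)S^{-1}\right).
\end{aligned}
\]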
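For $A_2$, because S is symmetric, we first get
\[ A_2=\sum_{i=1}^{n-1}\sum_{j=1}^{p}\sum_{m=1}^{p}\sum_{l=1}^{p}V_{im}\,S^{lm}\left(\sum_{q\le r}\frac{\partial T_{jl}(X,S)}{\partial S_{qr}}\frac{\partial S_{qr}}{\partial V_{ij}}\right). \]
Using the definition of $S=\sum_{i=1}^{n-1}V_iV_i'$, we have
\[ \frac{\partial S_{qr}}{\partial V_{ij}}=\frac{\partial}{\partial V_{ij}}\left(V_{iq}V_{ir}\right)=V_{iq}\delta_{jr}+V_{ir}\delta_{jq} \]
and then
\[ \sum_{i=1}^{n-1}\left(V_{iq}\delta_{jr}+V_{ir}\delta_{jq}\right)V_{im}=S_{mq}\delta_{jr}+S_{mr}\delta_{jq}. \]
The following is immediate from the definition of the inverse, but we recall it because it will be used several times:
\[ \sum_{m=1}^{p}S^{am}S_{mb}=\delta_{ab}. \tag{2.6} \]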
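We now get
\[
\begin{aligned}
A_2 &= \sum_{j=1}^{p}\sum_{m=1}^{p}\sum_{l=1}^{p}S^{lm}\left(\sum_{q\le r}\frac{\partial T_{jl}(X,S)}{\partial S_{qr}}\left(S_{mq}\delta_{jr}+S_{mr}\delta_{jq}\right)\right) \\
&= \sum_{j=1}^{p}\sum_{l=1}^{p}\sum_{q\le r}\frac{\partial T_{jl}(X,S)}{\partial S_{qr}}\left(\delta_{lq}\delta_{jr}+\delta_{lr}\delta_{jq}\right)\qquad\text{(thanks to (2.6))} \\
&= \sum_{j=1}^{p}\sum_{l=1}^{p}\sum_{q\le r}\left(\frac{\partial T_{jl}(X,S)}{\partial S_{lj}}\delta_{lq}\delta_{jr}+\frac{\partial T_{jl}(X,S)}{\partial S_{jl}}\delta_{lr}\delta_{jq}\right) \\
&= \sum_{j=1}^{p}\sum_{l=1}^{p}\sum_{q\le r}\frac{\partial T_{jl}(X,S)}{\partial S_{lj}}\left(\delta_{lq}\delta_{jr}+\delta_{lr}\delta_{jq}\right).
\end{aligned}
\]
Finally noting that
\[ \sum_{q\le r}\left(\delta_{lq}\delta_{jr}+\delta_{lr}\delta_{jq}\right)=\begin{cases}2 & \text{if } j=l,\\ 1 & \text{if } j\ne l,\end{cases} \]
we get
\[ A_2=2\left(\sum_{j=1}^{p}\frac{\partial T_{jj}(X,S)}{\partial S_{jj}}+\frac{1}{2}\sum_{j\ne l}\frac{\partial T_{jl}(X,S)}{\partial S_{jl}}\right)=2D^*_{1/2}(T(X,S)). \]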
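The last term left is $A_3$. Starting the same way as for $A_2$, we can get
\[ A_3=\sum_{j=1}^{p}\sum_{l=1}^{p}T_{jl}(X,S)\sum_{m=1}^{p}\sum_{q\le r}\frac{\partial S^{lm}}{\partial S_{qr}}\left(S_{mq}\delta_{jr}+S_{mr}\delta_{jq}\right). \tag{2.7} \]
To compute $\frac{\partial S^{lm}}{\partial S_{qr}}$, we use (2.6) again and take the derivative on both sides to find
\[ \sum_{j=1}^{p}\left(\frac{\partial S^{lj}}{\partial S_{qr}}S_{jn}+S^{lj}\frac{\partial S_{jn}}{\partial S_{qr}}\right)=0. \]
As S is symmetric, if q ≠ r, $\frac{\partial S_{jn}}{\partial S_{qr}}=\delta_{jq}\delta_{nr}+\delta_{jr}\delta_{nq}$, which leads to
\[ \sum_{j=1}^{p}\frac{\partial S^{lj}}{\partial S_{qr}}S_{jn}=-\sum_{j=1}^{p}S^{lj}\left(\delta_{jq}\delta_{nr}+\delta_{jr}\delta_{nq}\right)=-S^{lq}\delta_{nr}-S^{lr}\delta_{nq}. \]
We now multiply this by $S^{nm}$ and sum over n to get
\[ \sum_{n=1}^{p}\sum_{j=1}^{p}\frac{\partial S^{lj}}{\partial S_{qr}}S_{jn}S^{nm}=\sum_{n=1}^{p}\left(-S^{lq}\delta_{nr}-S^{lr}\delta_{nq}\right)S^{nm}. \]
Killing a sum thanks to (2.6) again, this gives us
\[ \sum_{j=1}^{p}\frac{\partial S^{lj}}{\partial S_{qr}}\delta_{jm}=\frac{\partial S^{lm}}{\partial S_{qr}}=-S^{lq}S^{rm}-S^{lr}S^{qm}. \]
The case q = r is treated similarly, as $\frac{\partial S_{jn}}{\partial S_{qq}}=\delta_{jq}\delta_{nq}$, to get
\[ \frac{\partial S^{lm}}{\partial S_{qr}}=\begin{cases}-S^{lq}S^{rm}-S^{lr}S^{qm} & \text{if } q\ne r,\\ -S^{lq}S^{qm} & \text{if } q=r.\end{cases} \tag{2.8} \]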
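Formula (2.8) can be checked by finite differences on a 2 × 2 symmetric matrix, where the inverse has a closed form. The entries of the matrix, the step size and the function names below are assumptions made for the example. Perturbing the off-diagonal entry moves $S_{12}$ and $S_{21}$ together, which is exactly the convention behind the case q ≠ r.

```python
def inverse_2x2(a, b, d):
    """Inverse of the symmetric matrix [[a, b], [b, d]]."""
    det = a * d - b * b
    return [[d / det, -b / det], [-b / det, a / det]]

def formula_2_8(inv, l, m, q, r):
    """Right-hand side of (2.8), with inv[l][m] playing the role of S^{lm}."""
    if q == r:
        return -inv[l][q] * inv[q][m]
    return -inv[l][q] * inv[r][m] - inv[l][r] * inv[q][m]

a, b, d, eps = 2.0, 0.5, 3.0, 1e-6  # hypothetical entries and step size
inv = inverse_2x2(a, b, d)
plus_b, minus_b = inverse_2x2(a, b + eps, d), inverse_2x2(a, b - eps, d)
plus_a, minus_a = inverse_2x2(a + eps, b, d), inverse_2x2(a - eps, b, d)

max_gap = 0.0
for l in range(2):
    for m in range(2):
        # Central differences with respect to S_12 (= S_21) and S_11.
        d_offdiag = (plus_b[l][m] - minus_b[l][m]) / (2 * eps)
        d_diag = (plus_a[l][m] - minus_a[l][m]) / (2 * eps)
        max_gap = max(max_gap, abs(d_offdiag - formula_2_8(inv, l, m, 0, 1)))
        max_gap = max(max_gap, abs(d_diag - formula_2_8(inv, l, m, 0, 0)))
print(f"largest discrepancy: {max_gap:.2e}")
```

The largest discrepancy should be at the level of finite-difference error, far below the size of the derivatives themselves.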
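In (2.7), the second half becomes
\[
\begin{aligned}
&\sum_{m=1}^{p}\sum_{q\le r}\frac{\partial S^{lm}}{\partial S_{qr}}\left(S_{mq}\delta_{jr}+S_{mr}\delta_{jq}\right) \\
&\quad=-\sum_{m=1}^{p}\left(\sum_{q=r}2S_{mq}\delta_{jq}S^{lq}S^{qm}+\sum_{q<r}\left(S_{mq}\delta_{jr}+S_{mr}\delta_{jq}\right)\left(S^{lq}S^{rm}+S^{lr}S^{qm}\right)\right) \\
&\quad=-\sum_{m=1}^{p}\left(2S_{mj}S^{lj}S^{jm}+\sum_{q<r}\left(S_{mq}\delta_{jr}+S_{mr}\delta_{jq}\right)\left(S^{lq}S^{rm}+S^{lr}S^{qm}\right)\right).
\end{aligned}
\]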
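Placing this last expression into the one for $A_3$ in (2.7), after using once again that S and $S^{-1}$ are symmetric and (2.6), we obtain
\[
\begin{aligned}
A_3 &= -\sum_{j=1}^{p}\sum_{l=1}^{p}T_{jl}(X,S)\left(2S^{lj}+\sum_{q<r}\left(\delta_{jr}S^{lr}+\delta_{jq}S^{lq}\right)\right) \\
&= -\sum_{j=1}^{p}\sum_{l=1}^{p}T_{jl}(X,S)\,S^{lj}\left(2+\sum_{q<r}\left(\delta_{jr}+\delta_{jq}\right)\right) \\
&= -\sum_{j=1}^{p}\sum_{l=1}^{p}T_{jl}(X,S)\,S^{lj}\left(2+(p-1)\right) \\
&= -(p+1)\operatorname{tr}\left(T(X,S)S^{-1}\right).
\end{aligned}
\]
Substituting the expressions found for $A_1$, $A_2$ and $A_3$ in (2.5), we get the announced result.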
As anticipated, we can now find estimators that dominate X.
Corollary 2.2. Let r : R → R be a nondecreasing positive function bounded by $\frac{2(p-2)}{n-p+2}$. If E[X′X] < ∞ and $E\left[\frac{X'X}{(X'S^{-1}X)^2}\right]<\infty$, then the estimator
\[ \left(1-\frac{r(X'S^{-1}X)}{X'S^{-1}X}\right)X \tag{2.9} \]
dominates X under mean square error.
Remark. The function r replaces the constants in the estimators from Chapter 1 (the c in (1.7) more specifically) and the bounds on r are reminiscent of the bounds on those constants.
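Proof. We apply Theorem 2.1 with $g(X,S)=-\frac{r(X'S^{-1}X)}{X'S^{-1}X}X$ and show that
\[ C\,E^*\left[2\operatorname{div}_X(g(X,S))+(n-p-2)\,g(X,S)'S^{-1}g(X,S)+2D^*_{1/2}\left(g(X,S)g(X,S)'\right)\right]\le 0. \tag{2.10} \]
For the first term of (2.10), we need to compute
\[ \operatorname{div}_X(g(X,S))=-\operatorname{div}_X\left(\frac{r(X'S^{-1}X)}{X'S^{-1}X}X\right)=-\left(\operatorname{div}_X(X)\frac{r(X'S^{-1}X)}{X'S^{-1}X}+X'\nabla_X\left(\frac{r(X'S^{-1}X)}{X'S^{-1}X}\right)\right). \]
As $\operatorname{div}_X(X)=p$ and
\[ \nabla_X\left(\frac{r(X'S^{-1}X)}{X'S^{-1}X}\right)=2\,\frac{r'(X'S^{-1}X)\,X'S^{-1}X-r(X'S^{-1}X)}{(X'S^{-1}X)^2}\,S^{-1}X, \]
this gives the expression
\[ \operatorname{div}_X(g(X,S))=-\left(\frac{(p-2)\,r(X'S^{-1}X)}{X'S^{-1}X}+2r'(X'S^{-1}X)\right). \]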
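The second term in (2.10) is direct:
\[ g(X,S)'S^{-1}g(X,S)=\frac{r^2(X'S^{-1}X)}{X'S^{-1}X}. \]
Finally, for the last term, we have
\[
\begin{aligned}
D^*_{1/2}\left(g(X,S)g(X,S)'\right) &= D^*_{1/2}\left(\frac{r^2(X'S^{-1}X)}{(X'S^{-1}X)^2}XX'\right) \\
&= \sum_{i=1}^{p}\frac{\partial}{\partial S_{ii}}\left(\frac{r^2(X'S^{-1}X)}{(X'S^{-1}X)^2}\right)X_i^2+\frac{1}{2}\sum_{i\ne j}\frac{\partial}{\partial S_{ij}}\left(\frac{r^2(X'S^{-1}X)}{(X'S^{-1}X)^2}\right)X_iX_j.
\end{aligned}
\]
As
\[ \frac{\partial}{\partial S_{ij}}\left(\frac{r^2(X'S^{-1}X)}{(X'S^{-1}X)^2}\right)=\frac{2(X'S^{-1}X)^2\,r(X'S^{-1}X)\,r'(X'S^{-1}X)-2(X'S^{-1}X)\,r^2(X'S^{-1}X)}{(X'S^{-1}X)^4}\;\frac{\partial}{\partial S_{ij}}\left(X'S^{-1}X\right) \]
and, using (2.8),
\[ \frac{\partial}{\partial S_{ij}}\left(X'S^{-1}X\right)=\sum_{l,m}X_l\frac{\partial S^{lm}}{\partial S_{ij}}X_m=-(2-\delta_{ij})\,(X'S^{-1})_i\,(X'S^{-1})_j, \]
it follows that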
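\[
\begin{aligned}
&D^*_{1/2}\left(g(X,S)g'(X,S)\right) \\
&\quad=\frac{2(X'S^{-1}X)^2\,r(X'S^{-1}X)\,r'(X'S^{-1}X)-2(X'S^{-1}X)\,r^2(X'S^{-1}X)}{(X'S^{-1}X)^4} \\
&\qquad\times\left(\sum_{i=1}^{p}\frac{\partial}{\partial S_{ii}}\left(X'S^{-1}X\right)X_i^2+\frac{1}{2}\sum_{i\ne j}\frac{\partial}{\partial S_{ij}}\left(X'S^{-1}X\right)X_iX_j\right) \\
&\quad=-2\,\frac{(X'S^{-1}X)^2\,r(X'S^{-1}X)\,r'(X'S^{-1}X)-(X'S^{-1}X)\,r^2(X'S^{-1}X)}{(X'S^{-1}X)^4} \\
&\qquad\times\underbrace{\left(\sum_{i=1}^{p}(X'S^{-1})_i^2X_i^2+\frac{1}{2}\sum_{i\ne j}2\,(X'S^{-1})_i(X'S^{-1})_jX_iX_j\right)}_{=(X'S^{-1}X)^2} \\
&\quad=-2\left(r(X'S^{-1}X)\,r'(X'S^{-1}X)-\frac{r^2(X'S^{-1}X)}{X'S^{-1}X}\right).
\end{aligned}
\]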
Finally putting everything back together in the expression (2.10), we obtain
\[
\begin{aligned}
&C\,E^*\left[2\operatorname{div}_X(g(X,S))+(n-p-2)\,g(X,S)'S^{-1}g(X,S)+2D^*_{1/2}\left(g(X,S)g(X,S)'\right)\right] \\
&\quad=C\,E^*\left[-2\left(\frac{(p-2)\,r(X'S^{-1}X)}{X'S^{-1}X}+2r'(X'S^{-1}X)\right)+(n-p-2)\frac{r^2(X'S^{-1}X)}{X'S^{-1}X}-4\left(r(X'S^{-1}X)\,r'(X'S^{-1}X)-\frac{r^2(X'S^{-1}X)}{X'S^{-1}X}\right)\right] \\
&\quad=C\,E^*\left[\frac{r(X'S^{-1}X)}{X'S^{-1}X}\left(-2(p-2)+(n-p+2)\,r(X'S^{-1}X)\right)-4r'(X'S^{-1}X)\left(1+r(X'S^{-1}X)\right)\right]
\end{aligned}
\]
which, given the assumptions on r (positive, nondecreasing and bounded by $\frac{2(p-2)}{n-p+2}$), is nonpositive as required.
Choosing r constant and equal to $\frac{p-2}{n-p+2}$ (n became n − 1 given how we considered the problem in this section) gives the James-Stein estimator. This means that the estimator (1.8) found in the normal case also works in the multivariate Student case: simulations will be used to compare them in Chapter 4.
3 Second generalization: high dimension
An implicit assumption for the previous results was for the dimension p to be smaller than the sample size n in order for the estimated covariance matrix to be non-singular. Indeed, if, given a sample $Y_1,\dots,Y_n$, we compute a matrix S as before by $S=\sum_{j=1}^{n}Y_jY_j'$, then rk(S) ≤ min(n, p). It is easier to see by writing S = Y′Y, with Y the n × p matrix whose rows are our observations $Y_i$. As the $Y_i$ are usually independent and, in the cases that interest us, from a continuous distribution, the rank of S is actually equal to min(n, p) almost surely. Therefore, when p > n, S is a p × p matrix of rank n and is almost surely singular. To illustrate this, let's look at 3 observations (rounded to the nearest hundredth for readability) from a 4-dimensional normal with mean 0 and covariance matrix the diagonal matrix with 1, 2, 3, 4 on the diagonal. The 3 vectors are
\[
\begin{aligned}
X_1 &= (0.43, -0.45, -1.59, -2.96)' \\
X_2 &= (-0.78, -0.89, -3.46, 3.15)' \\
X_3 &= (-1.29, -0.15, -0.47, -1.91)'
\end{aligned}
\]
and the matrix S obtained, along with its singular value decomposition to show its singularity and also anticipating what is to come, is
\[
S=\begin{pmatrix}
1.56 & -0.12 & -0.41 & -2.19\\
-0.12 & 0.28 & 1.12 & -2.05\\
-0.41 & 1.12 & 4.56 & -8.46\\
-2.19 & -2.05 & -8.46 & 21.39
\end{pmatrix}
=U\begin{pmatrix}
25.27 & 0 & 0 & 0\\
0 & 2.52 & 0 & 0\\
0 & 0 & 0 & 0\\
0 & 0 & 0 & 0
\end{pmatrix}V'
\]
where U and V are two unitary matrices. For the exact values, the R code used is in the appendix.
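The rank deficiency can be reproduced from the three rounded vectors alone. The sketch below is not the R code of the appendix; the helper names `scatter` and `rank` are hypothetical. It computes the scatter matrix both without and with centering by the sample mean: the uncentered version $\sum_j X_jX_j'$ has rank 3, while the entries displayed above agree, up to the rounding of the inputs, with the centered version, whose rank is 2 (this is an observation from the arithmetic, not something stated in the text, and it is consistent with the two nonzero singular values shown).

```python
xs = [
    [0.43, -0.45, -1.59, -2.96],
    [-0.78, -0.89, -3.46, 3.15],
    [-1.29, -0.15, -0.47, -1.91],
]

def scatter(rows, center):
    """Sum of outer products of the rows, optionally after subtracting the column means."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n if center else 0.0 for j in range(p)]
    dev = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(d[i] * d[j] for d in dev) for j in range(p)] for i in range(p)]

def rank(matrix, tol=1e-9):
    """Numerical rank by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    rk = 0
    for col in range(n_cols):
        pivot = max(range(rk, n_rows), key=lambda i: abs(m[i][col]), default=None)
        if pivot is None or abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(rk + 1, n_rows):
            factor = m[i][col] / m[rk][col]
            for j in range(col, n_cols):
                m[i][j] -= factor * m[rk][j]
        rk += 1
    return rk

s_raw = scatter(xs, center=False)
s_centered = scatter(xs, center=True)
print("rank without centering:", rank(s_raw))
print("rank with centering:", rank(s_centered))
print("first row, centered:", [round(v, 2) for v in s_centered[0]])
```

In both cases the 4 × 4 matrix is singular, which is the point of the illustration: with fewer observations than dimensions, the ordinary inverse is unavailable.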
When matrices are singular like here, generalizations of the inverse exist, with diverse properties. One of the most popular ones, due to its uniqueness, is the Moore-Penrose pseudoinverse, and the focus of this section will be to show how to use it to obtain new forms of the James-Stein estimator.
We put ourselves back in the situation where X is a p-dimensional variable normally distributed with mean θ and covariance matrix Σ and we would like to estimate θ, knowing