HAL Id: hal-00524623
https://hal.archives-ouvertes.fr/hal-00524623
Preprint submitted on 8 Oct 2010
HAL
is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or
L’archive ouverte pluridisciplinaire
HAL, estdestinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires
Technical appendix to “Adaptive estimation of covariance matrices via Cholesky decomposition”
Nicolas Verzelen
To cite this version:
Nicolas Verzelen. Technical appendix to “Adaptive estimation of covariance matrices via Cholesky
decomposition”. 2010. �hal-00524623�
ISSN: 1935-7524
Technical appendix to “Adaptive estimation of covariance matrices via Cholesky decomposition”
Nicolas Verzelen∗
INRA, UMR 729 MISTEA, F-34060 Montpellier, France SUPAGRO, UMR 729 MISTEA,
F-34060 Montpellier, France e-mail:[email protected] Abstract:
This is a technical appendix to “Adaptive estimation of covariance matrices via Cholesky decomposition (arXiv:1010.1445).
AMS 2000 subject classifications:Primary 62H12; secondary 62F35, 62J05.
Keywords and phrases:Covariance matrix, banding, Cholesky decomposition, directed graphical models, penalized criterion, minimax rate of estimation.
For the sake of clarity, we emphasize the references to [7] in bold . For instance Eq.(12) refers to Eq.(12) in [7], while Eq.(12) stands for Eq.(12) in this paper.
Contents
1 Additional simulations . . . 2
2 Proof of the risk upper bounds . . . 2
2.1 Proof of Lemma10.4 . . . 3
2.2 Proof of Lemma10.5 . . . 4
2.3 Proof of Lemma10.6 . . . 5
2.4 Proof of Lemma10.7 . . . 7
2.5 Proof of Lemma10.8 . . . 8
2.6 Proof of Lemma10.9 . . . 9
2.7 Proof of Corollary6.1 . . . 11
2.8 Proof of Proposition4.5 . . . 11
3 Proofs of the minimax bounds . . . 12
3.1 Main lemma . . . 12
3.2 Adaptive banding . . . 14
3.2.1 Proof of Proposition5.3 . . . 15
3.2.2 Proof of Proposition3.5 . . . 16
3.3 Complete graph selection (proof of Proposition6.2) . . . 18
3.3.1 First case: minimax rate overU1(k, p) . . . 18
3.3.2 Second case: minimax rate overU2(k, p) . . . 21
3.3.3 Technical lemmas . . . 23
4 Proof of the Frobenius bounds . . . 24
4.1 Proof of Corollary5.4 . . . 24
4.2 Proof of Corollary6.3 . . . 26
∗Research mostly carried out at Univ Paris-Sud (Laboratoire de Mat´ematiques, CNRS-UMR 8628) 1
5 Proof of Lemma10.2 . . . 27 6 Proof of Proposition7.2 . . . 28 References . . . 31 1. Additional simulations
We provide here additional simulation results for the complete graph selection problem. We use the same notations and compute the same estimators as in Section8.2.
In the third simulation scheme, we consider the case where the ”good” ordering is completely unknown. We first sample a precision matrix Ωc1 according to the first simulation scheme. Then, we sample uniformly a permutation of {1, . . . , p} and reorder the variables according to this permutation to get the precision matrixΩc3. Its Cholesky factor is generally far less sparse than the Cholesky factor of Ω1, while Ωc1 is as sparse as Ωc3. The purpose of this scheme is to illustrate the limits of procedures based on the Cholesky factor when no suitable ordering is known. As in the second scheme, we only perform the simulations forp= 200, Esp= 1,3,5, andn= 100.
Method Ledoit GLasso Lasso ChoSelectf Kullback discrepancyK(Ω;Ω)b
Esp=1 20.1±0.2 9.1±0.2 8.2±0.1 7.6±0.1 Esp=3 42.8±1.8 22.0±0.2 24.0±0.4 25.3±0.5 Esp=5 52.6±1.0 35.8±0.3 42.7±0.4 49.1±0.5
Operator distancekΩb−Ωk
Esp=1 6.3±0.1 5.6±0.1 4.7±0.1 4.5±0.2 Esp=3 10.0±0.1 10.1±0.1 8.8±0.2 8.0±0.2 Esp=5 14.3±0.2 15.3±0.2 13.0±0.2 12.6±0.2
Operator distancekΩb−1−Σk Esp=1 2.7±0.1 1.7±0.1 2.1±0.1 1.7±0.1 Esp=3 8.4±0.5 6.4±0.3 11.5±0.8 10.8±0.8 Esp=5 16.3±0.8 14.7±0.8 26.5±1.6 25.0±1.4
Table 1: Comparison between the procedures for the third covariance model Ωc3 withp= 200.
We observe different results for the Kullback risk depending on the sparsity. When Esp=1, ChoSelectf still performs better than the other methods. However, the GLasso provides better results than the two other results for Esp=3 and Esp=5. This is not really surprising since the Glasso has been introduced to handle the ”unordered” situation. It seems from the case Esp=5, that the Lasso procedure is more robust to a ”bad” ordering than ChoSelectf. ChoSelectf still performs better than the other procedures in terms of the operator distance between preci- sion matrices. Nevertheless, the differences of performance are less obvious than in the previous schemes. Finally, the Glasso and Ledoit and Wolf’s method exhibit a smaller operator distance between covariance matrices than the Lasso and ChoSelectf.
2. Proof of the risk upper bounds
Lemma 2.1. Let V be a χ2 random variable with N >2 degrees of freedom and let k be some positive integer such that N >2k, then
E 1
Vk
= 1
(N−2). . .(N−2k) and E Vk
=N(N+ 2). . .(N+ 2(k−1)). We refer to Lemma 5 in [1] for the proof of slightly more general version of this lemma.
2.1. Proof of Lemma 10.4 Using expression (31) ofK
t, s;btm,sbm
, we derive 2(1−κ0)K t, s;et,es
= 2K
t, s;btm,bsm
+ (1−κ0) log es
b sm
+ (1−κ0)s+l(et, t) e s
− s+l(btm, t) b
sm +κ0+κ0log s
b sm
.
By definition ofm, log (b es/sbm)≤pen(m)−pen(m). Hence,b 2(1−κ0)K t, s;et,se
≤ 2K t, s;btm,bsm
+ (1−κ0) [pen(m)−pen(m)]b + κ0 s
b sm
+κ0
− s b sm
+ 1 + log s
b sm
−s+l(btm, t)− kΠ⊥m(ǫ+ǫm)k2n
b sm
+ l(et, t)(1−κ0) +s(1−κ0)− kΠ⊥mb(ǫ+ǫmb)k2n
e
s ,
sincebsm=kΠ⊥m(ǫ+ǫm)k2n andes=kΠ⊥mb(ǫ+ǫmb)k2n. As the functionx−logx−1 is non-negative, the term [−s/bsm+ 1 + log (s/bsm)] is non-positive. SinceX<it∗mis the best predictor ofXigiven Xm, it follows thatl(btm, t) =l(btm, tm) +l(tm, t). Hence,
κ0 s b
sm −s+l(btm, t)− kΠ⊥m(ǫ+ǫm)k2n
b
sm ≤ −(1−κ0) s b
sm +kΠ⊥m(ǫ+ǫm)k2n−l(tm, t) b
sm .
In the proof of Lemma 7.5 in [8], we state that l(btm′, tm′)≤ϕmax
nZ∗m′Zm′)−1
kΠm′(ǫ+ǫm′)k2n . This yields
(1−κ0)l(et, t) +s e
s ≤ (1−κ0)s+l(tmb, t) +κ2ϕmax
n(Z∗mbZmb)−1
kΠmb(ǫ+ǫmb)k2n
e s + (1−κ0)(1−κ2)l(et, tmb)
e
s .
Let us gather all these bounds 2(1−κ0)K
t, s;et,se
≤ 2K
t, s;btm,bsm
+ (1−κ0) [pen(m)−pen(m)]b + (1−κ0)l(tmb, t) + (1−κ2)l(et, tmb) +κ2ϕmax
n(Z∗mbZmb)−1
kΠmb(ǫ+ǫmb)k2n
e s
− kΠ⊥mb(ǫ+ǫmb)k2n
e
s +s(1−κ0) 1
e s − 1
b sm
+kΠ⊥m(ǫ+ǫm)k2n−l(tm, t) b
sm
≤ 2K
t, s;btm,bsm
+ (1−κ0) [pen(m)−pen(m)]b + (1−κ0)l(tmb, t) + (1−κ2)l(et, tmb) +κ2ϕmax
n(Z∗mbZmb)−1
kΠmb(ǫ+ǫmb)k2n
e s + kǫk2n−s(1−κ0) 1
b sm −1
e s
+kΠmbǫk2n
e
s + 2hΠ⊥mbǫ,Π⊥mbǫmbin e s
− kΠ⊥mbǫmbk2n
e
s + 2hΠ⊥mǫ,Π⊥mǫmin b sm
+kΠ⊥mǫmk2n−l(tm, t) b
sm
.
We then use Condition (39) onpen(m) and we apply the inequalityb 2hΠ⊥mbǫ,Π⊥mbǫmbin
e
s ≤κ1
l(tmb, t) e
s +κ−11s e s
hΠ⊥mbǫ,Π⊥mbǫmbi2n
sl(tmb, t) . Hence, we conclude that
2(1−κ0)K
t, s;et,es
≤ 2K
t, s;btm,bsm
+ (1−κ0)pen(m) + l(et, t)
e
s [R1(m)b ∨(1−κ2)(1−κ0)] +R2(m) +s e
sR3(m) +b R4(m,m)b . 2.2. Proof of Lemma 10.5
This proof follows the same sketch as the proof of Lemma 7.10 in [8]. The main difference lies in the fact thatκ0is zero in [8]. Letxbe a positive number that we shall fix later. For anyk >0, let us define
δk :=
rπ
2k+ exp(−k/16).
We shall first control the deviations of the random variables involved inR1(m). Applying devi-b ation inequality for χ2 random variables and largest values of Standard Wishart matrices (see e.g. Lemmas 7.2, 7.3, and 7.4 in [8]) to all models m∈ Mensures that there exists an eventB2
such thatP(Bc2)≤4nexp(−nx) and under B2 it holds that kΠ⊥mbǫmbk2n
l(tmb, t) ≥ n− |mb| n
"
1−δn−|mb|−
s2|mb|H(|mb|) n− |mb| −
s 2xn n− |mb|
!
∨0
#2 ,
kΠmb(ǫ+ǫmb)k2n
s+l(tmb, t) ≤ 2|mb| n
h1 +p
H(|mb|) +H(|mb|)i
+ 3x , (1)
kΠ⊥mb(ǫ+ǫmb)k2n
s+l(tmb, t) ≥ n− |mb| n
"
1−δn−|mb|−
s2|mb|H(|mb|) n− |mb| −
s 2xn n− |mb|
!
∨0
#2 ,
nϕmax
h(Z∗mbZmb)−1i
≤
"
1− 1 +p
2H(|mb|) r|mb| n −√
2x
!
∨0
#−2
.
By Assumption (HiK,η), the expression (1 +p
2H(m))b p
|mb|/nis bounded by √η. Moreover, (HiK,η) also ensures that|mb| ≤n/2. Henceδn−|m|b ≤δn/2≤ν(K) fornlarger than some quantity n0(K). Sinceν(K)≤1−√η, we derive that
kΠ⊥mbǫmbk2n
l(tmb, t) ≥
1−|mb| n
[1−ν(K)−√η]2−2√
2x , (2)
kΠ⊥mb(ǫ+ǫmb)k2n
s+l(tmb, t) ≥
1−|mb| n
[1−ν(K)−√η]2−2√
2x , (3)
ϕmax
hn(Z∗mbZmb)−1i
≤ h
1−√η−√ 2x
∨0i−2
.
Constrainingxto be smaller than 1−√η2
/8 ensures that
κ2ϕmax
hn(Z∗mbZmb)−1i
1B1 ≤(K−1)(1−√η−ν(K))2
4 . (4)
Gathering the definition ofR1(.), the inequalities (1), (2), (3), and (4), we get R1(m)b ≤ κ1+ 1−[1−√η−ν(K)]2+|mb|
n (1−√η−ν(K))2U1+√
xU2+xU3 , whereU1,U2, andU3are respectively defined by
U1 := −(1−κ0)Kh 1 +p
2H(|mb|)i2
+ 1 + (1−κ0)(K−1)/2h 1 +p
H(|mb|)i2
≤0 U2 := 2√
2 [1 +Kη]
U3 := 34(K−1)
1−√η−ν(K)2 .
SinceU1 is non-positive, we obtain an upper bound ofR1(m) that does not depend anymore onb the model m. By assumption (Hb iK,η), we know that η <(1−ν(K)−(3/(K+ 2))1/6)2. Hence, coming back to the definition ofκ1allows to prove thatκ1is strictly smaller than [1−√η−ν(K)]2. Setting
x:=
"
1−√η−ν(K)2
−κ1
4U2
#2
∧
1−√η−ν(K)2
−κ1
4U3 ∧ 1−√η2
8 ,
we get
R1(m)b ≤1−1 2
h(1−√η−ν(K))2−κ1
i<1 , under the eventB2.
This is enough to prove Lemma 10.5if we take B1 = B2. In fact, we shall define an event B1 slightly more restrictive in order to simplify the proof of Lemma 10.6. Let B3 be the event defined by
kǫk2n/s≤κ−11 . (5) Since κ−11 is strictly larger than one and since κ1 only depends on K and η, it follows that P(Bc3)≤exp(−nLK,η) withLK,η>0. Finally we take,B1:=B2∩B3.
2.3. Proof of Lemma 10.6
The sketch of this proof is similar to the proof of Lemma 7.11 in [8]. First, under the eventB1, it holds that
kΠ⊥mb(ǫ+ǫmb)k2n
s+l(tmb, t) ≥[1−ν(K)−√η]2/4>0 , κ2ϕmax
h
n(Z∗mbZmb)−1i
≤(K−1)(1−√η−ν(K))2
4 .
This is a consequence of (3), of the choice of x in the previous proof, and of the assumption (HiK,η). Since es = |Π⊥mb(ǫ+ǫmb)k2n, it follows that s/es is upper bounded under the event B1. Hence, we only have to upper bound the expectation ofR3(m) onb B1.
Ehs e
sR3(m1b B1)i
≤LK,ηE[R3(m1b B1)] . (6) Let us consider the random variablesEmb defined by
Emb :=κ−11hΠ⊥mbǫ,Π⊥mbǫmbi2n
sl(tmb, t) +kΠmbǫk2n
s .
By (5), the random variableEmb is upper bounded underB1 by Emb ≤κ−21 hΠ⊥mbǫ/kΠ⊥mbǫkn,Π⊥mbǫmbi2n
sl(tmb, t) +kΠmbǫk2n
s
This upper bound follows the distribution of a linear combinations ofχ2random variables. More details about this observation are given in the proof of Lemma 7.7 in [8]. We shall simultaneously control the deviations of the random variablesEmb,kΠmb(ǫ+ǫmb)k2n/[l(tmb, t) +s], andkΠ⊥mb(ǫ+ ǫmb)k2n/[s+l(tmb, t)] by applying Lemma 1 in [5] and Lemmas 7.2 and 7.3 in [8]. For anyx >0, we define an eventF(x) such that conditionally onF(x)∩B1,
Emb ≤ |mb|+κn −21 +n2q
|mb|+κ−14
[|mb|(ξ+H(|mb|)) +x]
+ 2κ−12[ξ(|mb|+H(|mb|)) +x]/n ,
kΠcm(ǫ+ǫ
c m)k2n
s+l(tmc,t) ≤ n1
h|mb|+ 2p
|mb|[|mb|(1/16 +H(|mb|)) +x] + 2 [|mb|(1/16 +H(|mb|)) +x]i ,
kΠ⊥cmǫcm+ǫk2n
s+l(tmc,t) ≥ n−|nmb|
h1−δn−|mb|−q
|bm|(1+2H(|bm|)) n−|mb| −q
2x n−|mb|
∨0i2 ,
whereδk is defined in the previous proof. Then, the probability ofF(x) satisfies P[F(x)c] ≤ e−x
"
X
m∈M
exp [−|mb|H(|mb|)]
e−ξ|mb|+e−|c16m| +e−|cm|2 #
≤ e−x 1
1−e−ξ + 1
1−e−1/16 + 1 1−e−1/2
.
Let us expand the three deviation bounds thanks to the inequality 2ab≤τ a2+τ−1b2: Emb ≤ |mb|
n
h1 + 2p
ξ+ 2κ−12ξ+τ1ξ+τ2
i+x n
2κ−12+τ2−1+τ1
+ κ−12 n
1 +τ1−1κ−12
+|mb|H(|mb|) n
2κ−12+τ1
+ 2|mb|p H(|mb|) n
≤ |mb| n
1 +p
2H(|mb|)2h
κ−12+ 2p
ξ+ 2κ−12ξ+τ1ξ+τ2
i
+ x
n
2κ−12+τ2−1+τ1 +κ−12
n
1 +τ1−1κ−12 .
Similarly, we get
kΠmb(ǫ+ǫmb)k2n
l(tmb, t) +s ≤ 2|mb| n
h1 +p
2H(|mb|)i2
+ 5x n .
We recall that |m| ≤n/2 by Assumption (HiK,η). If nis larger than some quantityn0(K), then δn−|m|≤δn/2≤ν(K). Applying again Assumption (HiK,η), we get
−K |mb| n− |mb|
1 +p
2H(|mb|)2kΠ⊥mb(ǫ+ǫmb)k2n
l(tmb, t) +s
≤ −K|mb| n
1 +p
2H(|mb|)2"
1−√η−ν(K)−
s 2x n− |mb|
!
∨0
#2
≤ −K|mb| n
1 +p
2H(|mb|)2h
(1−√η−ν(K))2−τ3
i+ 2Kητ3−1x n .
Let us combine these three bounds with the definitions ofR3(m),b κ1, andκ2, and the bound (4).
Hence, under the eventB1∩F(x), it holds that R3(m)b ≤|mb|
n
h1 +p
2H(|mb|)i2
U1+x
nU2+LK,η
n U3 , where
U1 := −K−110 1−√η−ν(K)2
+Kτ3+ 2√
ξ+ 2κ−12ξ+τ1ξ+τ2 , U2 := τ2−1+τ1+LK,η(1 +τ3−1),
U3 := 1 +τ1−1 .
Since K >1, there exists a suitable choice of the constantsξ, τ1, andτ2, only depending onK andη that constrainsU1to be non positive. Hence, under the eventB1∩F(x),
Bmb ≤ LK,η
n +L′(K, η)x n .
SinceP[F(x)c]≤e−xLK,η, we conclude by integrating the last expression with respect tox.
2.4. Proof of Lemma 10.7 Let us assume thatn≥17.
R2(m) = 1−l(tm, t) +kΠ⊥mǫk2n
kΠ⊥mǫ+ǫmk2n
.
We shall first upper bound the expectation ofR2(m).
E
l(tm, t) kΠ⊥mǫ+ǫmk2n
= n
n− |m| −2
l(tm, t)
[s+l(tm, t)] ≥ n− |m| n− |m| −2
l(tm, t)
[s+l(tm, t)] (7) Let us to the second term.
E
kΠ⊥mǫ+ǫmk2n
kΠ⊥mǫk2n
= 1 +l(tm, t) s
n− |m|
n− |m| −2 ≤s+l(tm, t) s
n− |m| n− |m| −2 Applying a convexity argument, we derive that
E
kΠ⊥mǫk2n
kΠ⊥mǫ+ǫmk2n
≥ s
s+l(tm, t)
n− |m| −2
n− |m| . (8)
Gathering the inequalities (7) and (8) allows to conclude that E[R2(m)]≤ 2
n− |m| . We get the upper bound
E[R2(m)1B1] = E[R2(m)]−E
R2(m)1Bc
1
≤ 2
n− |m|+q P(Bc1)
q
E[R22(m)]. (9)
Thus, we need to upper bound the second moment ofR2(m).
E R22(m)
≤ 3 (
1 +E
l2(tm, t) kΠ⊥mǫ+ǫmk4n
+
s
E[kΠ⊥mǫk8n]E
1
kΠ⊥mǫ+ǫmk8n
)
≤ 3
"
1 +
l(tm, t)n
(s+l(t, m, t))(n− |m| −4) 2
+
s(n− |m|+ 6) (s+l(t, m, t))(n− |m| −8)
2# ,
by Lemma 2.1. By Assumption (HiK,η), we have |m| ≤ n/2. If n ≥ 32, we can upper bound E
R22(m)
by a universal constant. Combining this result with (9) enables to conclude that E[R2(m)1B1]≤LK,η/n.
2.5. Proof of Lemma 10.8
We bound the quantityR4(m,m) using the same arguments as in the proof of Theorem 3 in [1].b We first split this quantity into a sum of two terms:
R4(m,m) =b kǫk2n−s(1−κ0)
+
1 b sm −1
e s
+ s(1−κ0)− kǫk2n
+
− 1 b sm
+1 e s
≤ R4,1(m,m) +b R4,2(m)b ,
whereR4,1(m,m) andb R4,2(m) are respectively defined byb R4,1(m,m)b := kǫk2n−s(1−κ0)
+
1 b sm−1
e s
R4,2(m)b := s(1−κ0)− kǫk2n
+
1 e s .
By definition, we know that log (es/bsm) is smaller thanpen(m)−pen(m).b R4,1(m,m)b ≤ kǫk2n−s(1−κ0)
+
1 b sm
log es
b sm
≤ kǫk2n−s(1−κ0)
+
1 b sm
pen(m).
Applying Cauchy-Schwarz inequality yields E[R4,1(m,m)1b B1] ≤ E
"
kǫk2n−s(1−κ0)
+
b sm
#
pen(m)
≤ s
Eh
(kǫk2n−s(1−κ0))2i E
1 b s2m
pen(m)
≤ s sm
s κ20+2
n
n2
(n− |m| −2)(n− |m| −4)pen(m)
≤ Lpen(m),
since |m| ≤ n/2 by Assumption HK,η and since n ≥ 17. Let us turn to R4,2(m). We applyb H¨older’s inequality withv:=⌊n/8⌋andu=v/(v−1).
E[R4,2(m)1b B1] ≤ E
s(1−κ0)− kǫk2n
+
1 e s
≤ Eh
1s(1−κ0)≥kǫk2n
s e s i
≤ P
kǫk2n≤s(1−κ0)1/uh Es
e s
vi1/v
≤ P
kǫk2n≤s(1−κ0)1/u"
X
m∈M
E s
b sm
v#1/v
.
Sincevis smaller thann/8 and since|m|is smaller thann/2 it follows thatn− |m| −2vis larger thann/4. Hence, we can apply Lemma2.1to any modelm∈ M.
E[R4,2(m)b 1B1] ≤ exp
−nκ20 4u
" X
m∈M
nv
(n− |m| −2). . .(n− |m| −2v)
#1/v
≤ n
n/2−2vexp
−nκ20 4u
|M|1/v
≤ nexp
−nκ20 4u
|M|1/v .
Let us bound the cardinality of the collection M. We recall that the dimension of any model m∈ M is assumed to be smaller thann/2 by (HK,η). Besides, for any d∈ {1, . . . , n/2}, there are less than exp(dH(d)) models of dimensiond. Hence,
log (|M|)≤log(n) + sup
d=1,...,n/2
dH(d).
By assumption (HK,η),dH(d) is smaller thann/2. Thus, log(|M|)≤log(n) +n/2 and it follows that|M|1/v is smaller than an universal constant providing thatnis larger than 8. All in all, we get
E[R4,2(m)1b B1]≤Lnexp
−nκ20 4u
.
2.6. Proof of Lemma 10.9
For anyx >0, the following inequality holds x−1−log(x)≤ 9
64
x− 1 x
2
.
This statement is easy to establish by studying the derivative of the associated function. Hence, we upper bound the Kullback divergence
K
t, s;btm,bsm
= s
b sm
+ 1−log s
b sm
+l(btm, t) b sm
≤ 9 64
s2 b s2m +bs2m
s2
+l(tm, t) b
sm +l(btm, tm) b sm .
Thanks to Cauchy-Schwarz inequality, we obtain E
K t, s;et,es 1Bc
1
≤ E
"
9 64
s2 e s2 +se2
s2
+l(tm, t) e
s +l(et, tmb) e s
! 1Bc
1
#
≤ Lq P[Bc1]
vu utX
m∈M
E
"
1m=mb s4 b s4m+bs4m
s4 +l2(tm, t) b
s2m +l2(btm, tm) b s2m
!#
.
As in the proof of Lemma10.8, we apply H¨older’s inequality withv=⌊n/16⌋andu=v/(v−1).
Again, we check that for any modelm∈ M, n− |m| −8v≥1.
E
"
1m=mb s4 b s4m+bs4m
s4 +l2(tm, t) b
s2m +l2(btm, tm
b s2m
!#
≤P[m=m]b 1u
Es4v b s4vm
1v
+E bs4vm
s4v v1
+E
l2v(tm, t) b s2vm
1v
+E l2v(btm, tm) b s2vm
!v1
. We bound the first two terms applying Lemma 2.1 or computing the v-th moment of χ2 random variable.
E s4v
b s4vm
v1
≤ n4
(n− |m| −8v)4 , E
bs4vm s4v
v1
=
(n− |m|)(n− |m|+ 2). . .(n− |m|+ 2(4v−1))(sm)4v (ns)4v
1v
≤ (n− |m|+ 8v)4(s+l(0, t))4
n4s4 .
AskΠ⊥m(ǫ+ǫm)k2n is independent of the couple (kΠm(ǫ+ǫm)k2n,Xm), the random variablesbsm
andl(btm, tm) are independent. We bound the thel2v-risk ofl(btm, t) thanks to Proposition 7.8 in [8].
E l2v(btm, tm) b s2vm
!1v
=
E
l2v(btm, tm) E
1 b s2vm
1v
≤ Lv2|m|2n4
(n− |m| −4v)2 ≤Lv2|m|2n2 n2
(n− |m| −4v)2 . Combining these upper bounds and noting thatn− |m| −8v≥1 and|m| ≤n/2 yields
E
K t, s;et,es 1Bc1
≤
"
2n2
(n− |m| −8v)2 + Lv|m|n2
n− |m| −4v +(n− |m|+ 8v)2 n2
1 +l(0, t) s
2#
× L q
P[Bc1]|M|2v1
≤ LK,ηn5/2
1 + l(0, t) s
exp [−nLK,η] ,
since |M|1/2v is smaller than than an universal constant as explained in the proof of Lemma 10.8. Finally,l(0, t)/sis smaller thanK(t, s; 0,1).
2.7. Proof of Corollary 6.1
First, we claim that the penalties (21) are lower bounded by penalties defined in (7). Suppose that K > e−1. Since log(1 +Kx)≥log(1 +K)xfor anyxbetween 0 and 1, it follows that
peni(mi)≥log(1 +K) |mi| n− |mi|
1 +
q
2 [1 + log ((i−1)/|mi|)]2
, (10)
if |mi|/(n− |mi|){1 +p
2[1 + log((i−1)/|mi|))]2} ≤ 1. If K ≤ e−1, there exists a positive constantζ(K) such that log(1 +Kx)≥√
Kx, for allx≤ζ(K). Hence, we get peni(mi)≥√
K |mi| n− |mi|
1 +
q
2 [1 + log ((i−1)/|mi|)]2
, (11)
ifK≤e−1 and if|mi|/(n− |mi|)[1 +p
2[1 + log((i−1)/|mi|)]2]≤ζ(K).
Gathering (10) and (11), we get that for any K > 1, there exists some K′ > 1 and some ζ(K)>0 such that:
peni(mi)≥K′ |mi| n− |mi|
1 +
q
2 [1 + log ((i−1)/|mi|)]2
,
if|mi|/(n− |mi|){1 +p
2[1 + log((i−1)/|mi|)]2} ≤ζ(K).
For any 2≤i≤pand any 1≤k≤(i−1)∧d,Hi(k) is smaller than 1 + log((i−1)/k). Hence, peni(mi) is lower bounded by a penalty of the form (7) with someK′>1. Assuming that
|mi| n− |mi|
1 +
q
2 [1 + log ((i−1)/|mi|)]2
≤ζ(K)∧η(K′), (12) we derive that (HK′,η) is fulfilled and that the risk bound (23) holds.
We conclude by observing that Condition (12) is satisfied if d[1 + log(p/d)∨0]≤nη′(K), for some suitable functionη′(K).
2.8. Proof of Proposition 4.5
We apply the same arguments as in the proof of Theorem4.4, except that we replaceH(|m|) by lm. Then, Lemmas10.4and10.7are still true.
In the proof of Lemmas 10.8and 10.9, the only difference with the previous case concerns the upper bound of log (|M|). By definition oflm,
|M| −1≤ sup
m∈M\{∅}
exp(|m|lm).
Hence, log(|M|)≤1 + supm∈M\{∅}|m|lm, which is smaller than 1 +n/2 by Assumption (HbayK,η).
Lemmas10.5and10.6also hold whenH(|m|) is replaced bylmas explained in the proof of Proposition 3.5 in [8].
3. Proofs of the minimax bounds
We note dH(., .) the Hamming distance between two vectors. The Hamming distance between two matrices of sizepis defined as the Hamming distance between their vectorialized version of size p2. It is also noteddH(., .).
3.1. Main lemma
We first state two useful lemmas for proving the minimax lower bounds. The first one is known as Varshamov-Gilbert’s lemma, whereas the second one is a modified version of Birg´e’s lemma for covariance estimation.
Lemma 3.1(Varshamov-Gilbert’s lemma). Let{0,1}Dbe equipped with Hamming distancedH. There exists some subset Θof{0,1}D with the following properties
dH(θ, θ′)> D/4 for every(θ, θ′)∈Θ2 with θ6=θ′ and log|Θ| ≥D/8 . We notektkl2 the Euclidean norm of a vectort.
Lemma 3.2. LetA be a subset of{1, . . . , p}. For any positive matricesΩandΩ′, we define the function d(Ω,Ω′)by
d(Ω,Ω′) :=X
i∈A
log
"
1 +kti−t′ik2l2
4
#
+ X
i∈Ac
si
s′i + log si
s′i
−1 . (13)
Let Υbe a subset of square matrices of sizepwhich satisfies the following assumptions:
1. For allΩ∈Υ, ϕmax(Ω)≤2 andϕmin(Ω)≥1/2.
2. There exists(s1,s2)∈[1; 2]2 such that∀Ω∈Υ, ∀1≤i≤p,si∈ {s1,s2}.
Setting δ = minΩ,Ω′∈Υ,Ω6=Ω′d(Ω,Ω′), provided that maxΩ,Ω′∈ΥK(P⊗Ωn;P⊗Ω′n) ≤ κ1log|Υ|, the following lower bound holds
infΩb
sup
Ω∈ΥEΩ
hK Ω;Ωbi
≥κ2δ .
The numerical constants κ1 andκ2 are made explicit in the proof.
Proof of Lemma 3.2. This lemma is mainly based on an application of Birg´e’s version of Fano’s lemma [3]. We provide a statement of the result that is taken from [6] Sect.2.4.
Lemma 3.3. Let(Pi)0≤i≤N be some family of probability distributions and(Ai)0≤i≤N be some family of disjoint events. Let a= min0≤i≤NPi(Ai), then
a≤κ∨
max1≤i≤NK(Pi;P0) log(1 +N)
,
whereκ= 2e/(2e+ 1).
LetΩ be an estimator of Ω. Letb Ω be an estimator that takes its values in Υ and satisfiese d(eΩ,Ω) = minb
Ω′∈Υd(Ω′,Ω)b .
We note (T ,e S) and (e T ,b S) the Cholesky decompositions ofb Ω ande Ω. Letb i∈ {1, . . . , p}. By the triangle inequality,
kti−etik2l2
4 ≤2
"
kti−btik2l2
4 +kbti−etik2l2
4
# .
For any positive numbers a and b, log(1 +a+b)≤log(1 +a) + log(1 +b). Moreover, for any positive numbera, we have log(1 + 2a)≤2 log(1 +a) because the log function is concave. Hence, we get
log
"
1 + kti−etik2l2
4
#
≤2 log
"
1 + kti−btik2l2
4
# + 2 log
"
1 + kbti−etik2l2
4
#
. (14)
Let us define the functionf byf(x) :=x−log(x)−1 for anyx >0. We state that there exists some numerical constantL such that
f si
e si
≤L
f si
b si
+f
esi
b si
. (15)
If si =esi, this inequality holds for anyL >0 sincef(1) = 0 and f is non negative. Ifsi 6=esi, there are two possibilities: either si = s1 and esi = s2 or si = s2 and esi = s1. By deriving f(s1/x) +f(s2/x), one observes that this sum is minimized for x= (s1+s2)/2 and that this minimum equalsf[2/(1 +s1/s2)] +f[2/(1 +s2/s1)]. Hence, we obtain that
f si
e si
≤ f
s1
s2
∨f
s2
s1
fh
2/
1 + ss12i +fh
2/
1 + ss21i
f si
b si
+f
esi
b si
.
Sinces1 ands2lie between one and two, it follows that f
si
e si
≤ sup
1/2≤x≤2
f(x) f[2/(1 +x)] +f
2/(1 + 1x)
f si
b si
+f
esi
b si
. (16)
The ratiof(x)/(f[2/(1 +x)] +f[2/(1 + 1/x)]) is positive and continuous on [1/2; 1[ and ]1; 2]. By studying the Taylor series off(x) atxequals one, we observe thatf(x) = (x−1)2/2 +o[(x−1)2], f(2/(1 +x)) = (x−1)2/8 +o[(x−1)2], andf(2/(1 + 1/x)) = (x−1)2/8 +o[(x−1)2]. Hence, there exists a continuation of the ratio f(x)/(f[2/(1 +x)] +f[2/(1 + 1/x)]) around one. The supremum in (16) is therefore finite and the upper bound (15) holds.
Combining the upper bounds (14) and (15) with the definition ofΩ yieldse d(Ω,Ω)e ≤ 2X
i∈A
( log
"
1 +kti−btik2l2
4
# + log
"
1 + kbti−etik2l2
4
#)
+LX
i∈Ac
f
si
b si
+f
si
e si
≤ Lh
d(Ω,Ω) +b d(eΩ,Ω)b i
≤Ld(Ω,Ω)b . Hence, one can lower bound the risk ofΩ as followsb
sup
Ω∈ΥEΩ
hd Ω,Ωbi
≥L−1δsup
Ω∈ΥPΩ
hΩ6=Ωei
=L−1δ
1−min
Ω∈ΥPΩ
hΩ =Ωbi .
Applying Lemma3.3, we conclude that infΩb
sup
Ω∈ΥEΩ
hd Ω,Ωbi
≥L−1(1−κ)δ , (17)
if maxΩ,Ω′∈ΥK(P⊗Ωn,P⊗Ω′n)≤κlog|Υ|.
Let us now express this minimax lower bound in term of Kullback divergence. Thanks to the chain rule and Lemma 10.1, the Kullback divergence between two positive matrices Ω and Ω′ decomposes as
K(Ω; Ω′) = Xp i=1
1 2
logs′i si
+si
s′i −1 +li(ti, t′i) s′i
.
Straightforward computations allow to prove that the function log(s′i/si) +si/s′i−1 +li(ti, t′i)/s′i is minimized with respect tos′i whens′i=si+li(ti, t′i). This leads to the lower bound
logs′i si +si
s′i −1 + li(ti, t′i) s′i ≥log
1 +li(ti, t′i) si
.
By Definition (29) ofli(., .) the quantityli(ti, t′i) is lower bounded by [ϕmax(Ω)]−1kti−t′ik2l2. By assumption, [ϕmax(Ω)]−1 is larger than 1/2 for any Ω∈Υ. Moreover,si is smaller than 2. We conclude that for any Ω∈Υ and any positive matrix Ω′, the following lower bound holds
2K(Ω; Ω′)≥X
i∈A
log 1 +kti−t′ik2l2
4
!
+X
i∈Ac
si
s′i −log si
s′i
−1 =d(Ω,Ω′).
We conclude by gathering this last bound with (17) and setting κ1 := κand κ2 := L−1(1− κ)/2.
3.2. Adaptive banding
In order to compute the minimax rates of estimation over ellipsoids, we first need to consider a lower bound over the setsTord[k1, . . . , kp, r] andUord[k1, . . . , kp, r].
Definition 3.4. Let (k1, . . . , kp)∈Np such thatki ≤i−1 and let r be a positive number. We respectively define the setsTord[k1, . . . , kp, r] andUord[k1, . . . , kp, r] as
Tord[k1, . . . , kp, r] :=
T ∈T rig(p)s.t. ∀2≤i≤p, T[i, j] = 0 if 1≤j≤i−ki−1 0 orr if i−ki≤j≤i−1
,(18)
Uord[k1, . . . , kp, r] :=
T∗S−1T , T ∈ Tord[k1, . . . , kp, r] andS∈Diag(p) . (19) The setTord[k1, . . . , kp, r] contains lower triangular matrices with unit diagonal such that for each lineibetween 2 andp, the support of the vector (T[i, j])1≤j≤i−1is included in{i−ki, i−ki+ 1, . . . , i−1}. We are able to lower bound the minimax rates of estimation overUord[(k1, . . . , kp), r].
Proposition 3.5. Assume that k:= 1∨max1≤i≤pki is smaller than √n. Then, infbΩ
sup
Ω∈Uord[(k1,...,kp),r]Eh K
Ω;Ωbi
≥L
" p
X
i=2
ki+p
# r2∧1
n
These minimax rates of estimation are not really surprising, since they correspond to the minimax rates of estimation ofpdifferent parametric regression problems whose minimax rates is known to be of the orderki(r2∧1/n). We refer for instance to [6] Prop. 4.8. Moreover, the termp/nis due to the diagonal matricesS in Ω =T∗S−1T. We believe that the assumption “k is smaller than √n” is not necessary but we do not know how to remove it.
3.2.1. Proof of Proposition 5.3
The lower bound (17) is a consequence of Proposition3.5. Letkbe a positive integer smaller than
⌊√n⌋∧(p−1). Given some 0< r <p
a2kR2/k, we consider the setUord[0,1, . . . , k−1, k, . . . , k, r].
Let (T, S) refer to the Cholesky decomposition of a matrix Ω belonging to this set. By definition ofUord,
i−1
X
j=1
T[i, i−j]2 a2j =
Xk j=1
T[i, i−j]2
a2j ≤kr2/a2k≤R2.
Hence, the set Uord[0,1, . . . , k−1, k, . . . , k, r] is included in E(a, R, p). By Proposition 3.5, we obtain the minimax lower bound.
infbΩ
sup
Ω∈E(a,R,p)Eh K
Ω;Ωbi
≥ Lp(k+ 1) a2kR2
k ∧ 1 n
≥ Lp
a2kR2∧k+ 1 n
.
Similarly ifk = 0, the setUord[0, . . . ,0,+∞] is included inE(a, R, p) and the minimax rates of estimation over E(a, R, p) is lower bounded by Lp(a0R2∧1/n) with the conventiona0 = +∞. Taking the infimum for all non-positive integersksmaller than⌊√
n⌋∧(p−1) yields the first result.
Let us now turn to the second part of the proposition. We fix some γ >2. The matrices Ω considered in the proof of Proposition3.5for bounding the minimax rates of estimation over sets of the typeUord[0,1, . . . , k−1, k, . . . , k, r] have their eigenvalues between 1/2 and 2. Hence, the previous lower bound still holds and we get
infΩb
sup
Ω∈E(a,R,p)∩Bop(γ)Eh K
Ω;Ωbi
≥ Lp sup
k=0,...,⌊√
n⌋∧(p−1)
a2kR2∧k+ 1 n
. (20) Letkbe a non-negative integer smaller or equal tod∧(p−1), wheredis the maximal dimension of the models defining the estimatorΩedord. We consider the modelm∈ Mdord defined by
m:= (∅,{1}, . . . ,{i−1, . . . , i−k}, . . . ,{p−1, . . . , p−k}).
This model corresponds to estimating ak-th banded Cholesky factor. By Corollary5.1, the risk ofΩedord is upper bounded by
Eh K
Ω;Ωedordi
≤LK,η
p(k+ 1)
n +K(Ω; Ωm)
+τn(Ω, K, η). (21) Let us upper bound the bias termK(Ω; Ωm). By Equation (31), it decomposes as
2K(Ω; Ωm) = Xp i=1
si
si,mi
−log si
si,mi
−1 + li(ti, ti,mi) si,mi
= Xp i=1
log
1 +li(ti, ti,mi) si
≤ Xp
i=1
li(ti, ti,mi) si
,