Technical appendix to ``Adaptive estimation of covariance matrices via Cholesky decomposition''

(1)

HAL Id: hal-00524623

https://hal.archives-ouvertes.fr/hal-00524623

Preprint submitted on 8 Oct 2010

HAL

is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or

L’archive ouverte pluridisciplinaire

HAL, est

destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires

Technical appendix to “Adaptive estimation of covariance matrices via Cholesky decomposition”

Nicolas Verzelen

To cite this version:

Nicolas Verzelen. Technical appendix to “Adaptive estimation of covariance matrices via Cholesky

decomposition”. 2010. �hal-00524623�

(2)

ISSN: 1935-7524

Technical appendix to “Adaptive estimation of covariance matrices via Cholesky decomposition”

Nicolas Verzelen^∗

INRA, UMR 729 MISTEA, F-34060 Montpellier, France SUPAGRO, UMR 729 MISTEA,

F-34060 Montpellier, France e-mail:[email protected] Abstract:

This is a technical appendix to “Adaptive estimation of covariance matrices via Cholesky decomposition (arXiv:1010.1445).

AMS 2000 subject classifications:Primary 62H12; secondary 62F35, 62J05.

Keywords and phrases:Covariance matrix, banding, Cholesky decomposition, directed graphical models, penalized criterion, minimax rate of estimation.

For the sake of clarity, we emphasize the references to [7] in bold . For instance Eq.(12) refers to Eq.(12) in [7], while Eq.(12) stands for Eq.(12) in this paper.

Contents

1 Additional simulations . . . 2

2 Proof of the risk upper bounds . . . 2

2.1 Proof of Lemma10.4 . . . 3

2.7 Proof of Corollary6.1 . . . 11

2.8 Proof of Proposition4.5 . . . 11

3 Proofs of the minimax bounds . . . 12

3.1 Main lemma . . . 12

3.2 Adaptive banding . . . 14

3.2.1 Proof of Proposition5.3 . . . 15

3.2.2 Proof of Proposition3.5 . . . 16

3.3 Complete graph selection (proof of Proposition6.2) . . . 18

3.3.1 First case: minimax rate overU¹(k, p) . . . 18

3.3.2 Second case: minimax rate overU²(k, p) . . . 21

3.3.3 Technical lemmas . . . 23

4 Proof of the Frobenius bounds . . . 24

∗Research mostly carried out at Univ Paris-Sud (Laboratoire de Mat´ematiques, CNRS-UMR 8628) 1

(3)

5 Proof of Lemma10.2 . . . 27 6 Proof of Proposition7.2 . . . 28 References . . . 31 1. Additional simulations

We provide here additional simulation results for the complete graph selection problem. We use the same notations and compute the same estimators as in Section8.2.

In the third simulation scheme, we consider the case where the ”good” ordering is completely unknown. We first sample a precision matrix Ω^c₁ according to the first simulation scheme. Then, we sample uniformly a permutation of {1, . . . , p} and reorder the variables according to this permutation to get the precision matrixΩ^c₃. Its Cholesky factor is generally far less sparse than the Cholesky factor of Ω1, while Ω^c₁ is as sparse as Ω^c₃. The purpose of this scheme is to illustrate the limits of procedures based on the Cholesky factor when no suitable ordering is known. As in the second scheme, we only perform the simulations forp= 200, Esp= 1,3,5, andn= 100.

Method Ledoit GLasso Lasso ChoSelect^f Kullback discrepancyK(Ω;Ω)b

Esp=1 20.1±0.2 9.1±0.2 8.2±0.1 7.6±0.1 Esp=3 42.8±1.8 22.0±0.2 24.0±0.4 25.3±0.5 Esp=5 52.6±1.0 35.8±0.3 42.7±0.4 49.1±0.5

Operator distancekΩb−Ωk

Esp=1 6.3±0.1 5.6±0.1 4.7±0.1 4.5±0.2 Esp=3 10.0±0.1 10.1±0.1 8.8±0.2 8.0±0.2 Esp=5 14.3±0.2 15.3±0.2 13.0±0.2 12.6±0.2

Operator distancekΩb⁻¹−Σk Esp=1 2.7±0.1 1.7±0.1 2.1±0.1 1.7±0.1 Esp=3 8.4±0.5 6.4±0.3 11.5±0.8 10.8±0.8 Esp=5 16.3±0.8 14.7±0.8 26.5±1.6 25.0±1.4

Table 1: Comparison between the procedures for the third covariance model Ω^c₃ withp= 200.

We observe different results for the Kullback risk depending on the sparsity. When Esp=1, ChoSelect^f still performs better than the other methods. However, the GLasso provides better results than the two other results for Esp=3 and Esp=5. This is not really surprising since the Glasso has been introduced to handle the ”unordered” situation. It seems from the case Esp=5, that the Lasso procedure is more robust to a ”bad” ordering than ChoSelect^f. ChoSelect^f still performs better than the other procedures in terms of the operator distance between precision matrices. Nevertheless, the differences of performance are less obvious than in the previous schemes. Finally, the Glasso and Ledoit and Wolf’s method exhibit a smaller operator distance between covariance matrices than the Lasso and ChoSelect^f.

2. Proof of the risk upper bounds

Lemma 2.1. Let V be a χ² random variable with N >2 degrees of freedom and let k be some positive integer such that N >2k, then

E 1

V^k

= 1

(N−2). . .(N−2k) and E V^k

=N(N+ 2). . .(N+ 2(k−1)). We refer to Lemma 5 in [1] for the proof of slightly more general version of this lemma.

(4)

2.1. Proof of Lemma 10.4 Using expression (31) ofK

t, s;btm,sbm

, we derive 2(1−κ0)K t, s;et,es

= 2K

t, s;btm,bsm

+ (1−κ0) log es

b sm

+ (1−κ0)s+l(et, t) e s

− s+l(btm, t) b

sm +κ0+κ0log s

b sm

.

By definition ofm, log (b es/sbm)≤pen(m)−pen(m). Hence,b 2(1−κ0)K t, s;et,se

≤ 2K t, s;btm,bsm

+ (1−κ0) [pen(m)−pen(m)]b + κ0 s

b sm

+κ0

− s b sm

+ 1 + log s

b sm

−s+l(btm, t)− kΠ^⊥_m(ǫ+ǫ_m)k²n

b sm

+ l(et, t)(1−κ0) +s(1−κ0)− kΠ^⊥_m_b(ǫ+ǫ_m_b)k²n

e

s ,

sincebsm=kΠ^⊥_m(ǫ+ǫ_m)k²n andes=kΠ^⊥_m_b(ǫ+ǫ_m_b)k²n. As the functionx−logx−1 is non-negative, the term [−s/bsm+ 1 + log (s/bsm)] is non-positive. SinceX<it^∗_mis the best predictor ofXigiven Xm, it follows thatl(btm, t) =l(btm, tm) +l(tm, t). Hence,

κ0 s b

sm −s+l(btm, t)− kΠ^⊥_m(ǫ+ǫ_m)k²n

b

sm ≤ −(1−κ0) s b

sm +kΠ^⊥_m(ǫ+ǫ_m)k²n−l(tm, t) b

sm .

In the proof of Lemma 7.5 in [8], we state that l(btm^′, tm^′)≤ϕmax

nZ^∗_m′Z_m^′)⁻¹

kΠm^′(ǫ+ǫ_m′)k²n . This yields

(1−κ0)l(et, t) +s e

s ≤ (1−κ0)s+l(tm_b, t) +κ2ϕmax

n(Z^∗_m_bZ_m_b)⁻¹

kΠm_b(ǫ+ǫ_m_b)k²n

e s + (1−κ0)(1−κ2)l(et, tm_b)

e

s .

Let us gather all these bounds 2(1−κ0)K

t, s;et,se

≤ 2K

t, s;btm,bsm

+ (1−κ0) [pen(m)−pen(m)]b + (1−κ0)l(tm_b, t) + (1−κ2)l(et, tm_b) +κ2ϕmax

e s

− kΠ^⊥_m_b(ǫ+ǫ_m_b)k²n

e

s +s(1−κ0) 1

e s − 1

b sm

+kΠ^⊥_m(ǫ+ǫ_m)k²n−l(tm, t) b

sm

≤ 2K

t, s;btm,bsm

+ (1−κ0) [pen(m)−pen(m)]b + (1−κ0)l(tm_b, t) + (1−κ2)l(et, tm_b) +κ2ϕmax

e s + kǫk²n−s(1−κ0) 1

b sm −1

e s

+kΠm_bǫk²n

e

s + 2hΠ^⊥_m_bǫ,Π^⊥_m_bǫ_m_biⁿ e s

− kΠ^⊥_m_bǫ_m_bk²n

e

s + 2hΠ^⊥_mǫ,Π^⊥_mǫ_miⁿ b sm

+kΠ^⊥_mǫ_mk²n−l(tm, t) b

sm

.

(5)

We then use Condition (39) onpen(m) and we apply the inequalityb 2hΠ^⊥_m_bǫ,Π^⊥_m_bǫ_m_biⁿ

e

s ≤κ1

l(tm_b, t) e

s +κ⁻₁¹s e s

hΠ^⊥_m_bǫ,Π^⊥_m_bǫ_m_bi²n

sl(tm_b, t) . Hence, we conclude that

2(1−κ0)K

t, s;et,es

≤ 2K

t, s;btm,bsm

+ (1−κ0)pen(m) + l(et, t)

e

s [R1(m)b ∨(1−κ2)(1−κ0)] +R2(m) +s e

sR3(m) +b R4(m,m)b . 2.2. Proof of Lemma 10.5

This proof follows the same sketch as the proof of Lemma 7.10 in [8]. The main difference lies in the fact thatκ0is zero in [8]. Letxbe a positive number that we shall fix later. For anyk >0, let us define

δk :=

rπ

2k+ exp(−k/16).

We shall first control the deviations of the random variables involved inR1(m). Applying devi-b ation inequality for χ² random variables and largest values of Standard Wishart matrices (see e.g. Lemmas 7.2, 7.3, and 7.4 in [8]) to all models m∈ Mensures that there exists an eventB2

such thatP(B^c2)≤4nexp(−nx) and under B2 it holds that kΠ^⊥_m_bǫ_m_bk²n

l(tm_b, t) ≥ n− |mb| n

"

1−δn−|mb|−

s2|mb|H(|mb|) n− |mb| −

s 2xn n− |mb|

!

∨0

#² ,

s+l(tm_b, t) ≤ 2|mb| n

h1 +p

H(|mb|) +H(|mb|)i

+ 3x , (1)

kΠ^⊥_m_b(ǫ+ǫ_m_b)k²n

s+l(tm_b, t) ≥ n− |mb| n

"

1−δn−|mb|−

s2|mb|H(|mb|) n− |mb| −

s 2xn n− |mb|

!

∨0

#² ,

nϕmax

h(Z^∗_m_bZ_m_b)⁻¹i

≤

"

1− 1 +p

2H(|mb|) r|mb| n −√

2x

!

∨0

#−2

.

By Assumption (HⁱK,η), the expression (1 +p

2H(m))b p

|mb|/nis bounded by √η. Moreover, (HⁱK,η) also ensures that|mb| ≤n/2. Henceδ_n−|_m|_b ≤δn/2≤ν(K) fornlarger than some quantity n0(K). Sinceν(K)≤1−√η, we derive that

kΠ^⊥_m_bǫ_m_bk²n

l(tm_b, t) ≥

1−|mb| n

[1−ν(K)−√η]²−2√

2x , (2)

s+l(tm_b, t) ≥

1−|mb| n

[1−ν(K)−√η]²−2√

2x , (3)

ϕmax

hn(Z^∗_m_bZ_m_b)⁻¹i

≤ h

1−√η−√ 2x

∨0i−2

.

Constrainingxto be smaller than 1−√η2

/8 ensures that

κ2ϕmax

hn(Z^∗_m_bZ_m_b)⁻¹i

1_B₁ ≤(K−1)(1−√η−ν(K))²

4 . (4)

(6)

Gathering the definition ofR1(.), the inequalities (1), (2), (3), and (4), we get R1(m)b ≤ κ1+ 1−[1−√η−ν(K)]²+|mb|

n (1−√η−ν(K))²U1+√

xU2+xU3 , whereU1,U2, andU3are respectively defined by









U1 := −(1−κ0)Kh 1 +p

2H(|mb|)i2

+ 1 + (1−κ0)(K−1)/2h 1 +p

H(|mb|)i2

≤0 U2 := 2√

2 [1 +Kη]

U3 := ³₄(K−1)

1−√η−ν(K)² .

SinceU1 is non-positive, we obtain an upper bound ofR1(m) that does not depend anymore onb the model m. By assumption (Hb ⁱK,η), we know that η <(1−ν(K)−(3/(K+ 2))^1/6)². Hence, coming back to the definition ofκ1allows to prove thatκ1is strictly smaller than [1−√η−ν(K)]². Setting

x:=

"

1−√η−ν(K)²

−κ1

4U2

#²

∧

1−√η−ν(K)²

−κ1

4U3 ∧ 1−√η²

8 ,

we get

R1(m)b ≤1−1 2

h(1−√η−ν(K))²−κ1

i<1 , under the eventB2.

This is enough to prove Lemma 10.5if we take B1 = B2. In fact, we shall define an event B1 slightly more restrictive in order to simplify the proof of Lemma 10.6. Let B3 be the event defined by

kǫk²n/s≤κ⁻₁¹ . (5) Since κ⁻¹₁ is strictly larger than one and since κ1 only depends on K and η, it follows that P(B^c3)≤exp(−nLK,η) withLK,η>0. Finally we take,B1:=B2∩B3.

2.3. Proof of Lemma 10.6

The sketch of this proof is similar to the proof of Lemma 7.11 in [8]. First, under the eventB1, it holds that

s+l(tm_b, t) ≥[1−ν(K)−√η]²/4>0 , κ2ϕmax

h

n(Z^∗_m_bZ_m_b)⁻¹i

≤(K−1)(1−√η−ν(K))²

4 .

This is a consequence of (3), of the choice of x in the previous proof, and of the assumption (HⁱK,η). Since es = |Π^⊥_m_b(ǫ+ǫ_m_b)k²n, it follows that s/es is upper bounded under the event B1. Hence, we only have to upper bound the expectation ofR3(m) onb B1.

Ehs e

sR3(m1b _B1)i

≤LK,ηE[R3(m1b _B1)] . (6) Let us consider the random variablesEm_b defined by

Em_b :=κ⁻₁¹hΠ^⊥_m_bǫ,Π^⊥_m_bǫ_m_bi²n

sl(tm_b, t) +kΠm_bǫk²n

s .

(7)

By (5), the random variableEm_b is upper bounded underB1 by Em_b ≤κ⁻²₁ hΠ^⊥_m_bǫ/kΠ^⊥_m_bǫkⁿ,Π^⊥_m_bǫ_m_bi²n

sl(tm_b, t) +kΠm_bǫk²n

s

This upper bound follows the distribution of a linear combinations ofχ²random variables. More details about this observation are given in the proof of Lemma 7.7 in [8]. We shall simultaneously control the deviations of the random variablesEm_b,kΠm_b(ǫ+ǫ_m_b)k²n/[l(tm_b, t) +s], andkΠ^⊥_m_b(ǫ+ ǫ_m_b)k²n/[s+l(tm_b, t)] by applying Lemma 1 in [5] and Lemmas 7.2 and 7.3 in [8]. For anyx >0, we define an eventF(x) such that conditionally onF(x)∩B1,











Em_b ≤ ^|^m^b^|^+κn ⁻²¹ +_n²q

|mb|+κ⁻₁⁴

[|mb|(ξ+H(|mb|)) +x]

+ 2κ⁻₁²[ξ(|mb|+H(|mb|)) +x]/n ,

kΠ_c_m(ǫ+ǫ

c m)k²n

s+l(tm_c,t) ≤ n¹

h|mb|+ 2p

|mb|[|mb|(1/16 +H(|mb|)) +x] + 2 [|mb|(1/16 +H(|mb|)) +x]i ,

kΠ^⊥_c_mǫ_cm+ǫk²n

s+l(tm_c,t) ≥ ⁿ^−|n^m^b^|

h1−δn−|mb|−q

|bm|(1+2H(|bm|)) n−|mb| −q

2x n−|mb|

∨0i² ,

whereδk is defined in the previous proof. Then, the probability ofF(x) satisfies P[F(x)^c] ≤ e⁻^x

"

X

m∈M

exp [−|mb|H(|mb|)]

e⁻^ξ^|^m^b^|+e⁻^|c¹⁶^m| +e⁻^|c^m|² #

≤ e^−x 1

1−e⁻^ξ + 1

1−e⁻^1/16 + 1 1−e⁻^1/2

.

Let us expand the three deviation bounds thanks to the inequality 2ab≤τ a²+τ⁻¹b²: Em_b ≤ |mb|

n

h1 + 2p

ξ+ 2κ⁻₁²ξ+τ1ξ+τ2

i+x n

2κ⁻₁²+τ₂⁻¹+τ1

+ κ⁻₁² n

1 +τ₁⁻¹κ⁻₁²

+|mb|H(|mb|) n

2κ⁻₁²+τ1

+ 2|mb|p H(|mb|) n

≤ |mb| n

1 +p

2H(|mb|)2h

κ⁻₁²+ 2p

ξ+ 2κ⁻₁²ξ+τ1ξ+τ2

i

+ x

n

2κ⁻₁²+τ₂⁻¹+τ1 +κ⁻₁²

n

1 +τ₁⁻¹κ⁻₁² .

Similarly, we get

l(tm_b, t) +s ≤ 2|mb| n

h1 +p

2H(|mb|)i2

+ 5x n .

We recall that |m| ≤n/2 by Assumption (HⁱK,η). If nis larger than some quantityn0(K), then δn−|m|≤δn/2≤ν(K). Applying again Assumption (HⁱK,η), we get

−K |mb| n− |mb|

1 +p

2H(|mb|)2kΠ^⊥_m_b(ǫ+ǫ_m_b)k²n

l(tm_b, t) +s

≤ −K|mb| n

1 +p

2H(|mb|)2"

1−√η−ν(K)−

s 2x n− |mb|

!

∨0

#²

≤ −K|mb| n

1 +p

2H(|mb|)2h

(1−√η−ν(K))²−τ3

i+ 2Kητ₃⁻¹x n .

(8)

Let us combine these three bounds with the definitions ofR3(m),b κ1, andκ2, and the bound (4).

Hence, under the eventB1∩F(x), it holds that R3(m)b ≤|mb|

n

h1 +p

2H(|mb|)i2

U1+x

nU2+LK,η

n U3 , where





U1 := −^K−110 1−√η−ν(K)2

+Kτ3+ 2√

ξ+ 2κ⁻₁²ξ+τ1ξ+τ2 , U2 := τ₂⁻¹+τ1+LK,η(1 +τ₃⁻¹),

U3 := 1 +τ₁⁻¹ .

Since K >1, there exists a suitable choice of the constantsξ, τ1, andτ2, only depending onK andη that constrainsU1to be non positive. Hence, under the eventB1∩F(x),

Bm_b ≤ LK,η

n +L^′(K, η)x n .

SinceP[F(x)^c]≤e⁻^xLK,η, we conclude by integrating the last expression with respect tox.

2.4. Proof of Lemma 10.7 Let us assume thatn≥17.

R2(m) = 1−l(tm, t) +kΠ^⊥_mǫk²n

kΠ^⊥_mǫ+ǫ_mk²n

.

We shall first upper bound the expectation ofR2(m).

E

l(tm, t) kΠ^⊥_mǫ+ǫ_mk²n

= n

n− |m| −2

l(tm, t)

[s+l(tm, t)] ≥ n− |m| n− |m| −2

l(tm, t)

[s+l(tm, t)] (7) Let us to the second term.

E

kΠ^⊥_mǫk²n

= 1 +l(tm, t) s

n− |m|

n− |m| −2 ≤s+l(tm, t) s

n− |m| n− |m| −2 Applying a convexity argument, we derive that

E

kΠ^⊥_mǫk²n

≥ s

s+l(tm, t)

n− |m| −2

n− |m| . (8)

Gathering the inequalities (7) and (8) allows to conclude that E[R2(m)]≤ 2

n− |m| . We get the upper bound

E[R2(m)1_B₁] = E[R2(m)]−E

R2(m)1_B^c

1

≤ 2

n− |m|+q P(B^c1)

q

E[R²₂(m)]. (9)

(9)

Thus, we need to upper bound the second moment ofR2(m).

E R²₂(m)

≤ 3 (

1 +E

l²(tm, t) kΠ^⊥_mǫ+ǫ_mk⁴n

+

s

E[kΠ^⊥_mǫk⁸n]E

1

kΠ^⊥_mǫ+ǫ_mk⁸n

)

≤ 3

"

1 +

l(tm, t)n

(s+l(t, m, t))(n− |m| −4) 2

+

s(n− |m|+ 6) (s+l(t, m, t))(n− |m| −8)

2# ,

by Lemma 2.1. By Assumption (HⁱK,η), we have |m| ≤ n/2. If n ≥ 32, we can upper bound E

R²₂(m)

by a universal constant. Combining this result with (9) enables to conclude that E[R2(m)1_B₁]≤LK,η/n.

We bound the quantityR4(m,m) using the same arguments as in the proof of Theorem 3 in [1].b We first split this quantity into a sum of two terms:

R4(m,m) =b kǫk²n−s(1−κ0)

+

1 b sm −1

e s

+ s(1−κ0)− kǫk²n

+

− 1 b sm

+1 e s

≤ R4,1(m,m) +b R4,2(m)b ,

whereR4,1(m,m) andb R4,2(m) are respectively defined byb R4,1(m,m)b := kǫk²n−s(1−κ0)

+

1 b sm−1

e s

R4,2(m)b := s(1−κ0)− kǫk²n

+

1 e s .

By definition, we know that log (es/bsm) is smaller thanpen(m)−pen(m).b R4,1(m,m)b ≤ kǫk²n−s(1−κ0)

+

1 b sm

log es

b sm

≤ kǫk²n−s(1−κ0)

+

1 b sm

pen(m).

Applying Cauchy-Schwarz inequality yields E[R4,1(m,m)1b _B1] ≤ E

"

kǫk²n−s(1−κ0)

+

b sm

#

pen(m)

≤ s

Eh

(kǫk²n−s(1−κ0))²i E

1 b s²_m

pen(m)

≤ s sm

s κ²₀+2

n

n²

(n− |m| −2)(n− |m| −4)pen(m)

≤ Lpen(m),

(10)

since |m| ≤ n/2 by Assumption HK,η and since n ≥ 17. Let us turn to R4,2(m). We applyb H¨older’s inequality withv:=⌊n/8⌋andu=v/(v−1).

E[R4,2(m)1b _B1] ≤ E

s(1−κ0)− kǫk²n

+

1 e s

≤ Eh

1_s(1₋_κ₀₎_≥kǫk²n

s e s i

≤ P

kǫk²n≤s(1−κ0)1/uh Es

e s

^vi^1/v

≤ P

kǫk²n≤s(1−κ0)^1/u"

X

m∈M

E s

b sm

v#1/v

.

Sincevis smaller thann/8 and since|m|is smaller thann/2 it follows thatn− |m| −2vis larger thann/4. Hence, we can apply Lemma2.1to any modelm∈ M.

E[R4,2(m)b 1_B₁] ≤ exp

−nκ²₀ 4u

" X

m∈M

n^v

(n− |m| −2). . .(n− |m| −2v)

#^1/v

≤ n

n/2−2vexp

−nκ²₀ 4u

|M|^1/v

≤ nexp

−nκ²₀ 4u

|M|^1/v .

Let us bound the cardinality of the collection M. We recall that the dimension of any model m∈ M is assumed to be smaller thann/2 by (HK,η). Besides, for any d∈ {1, . . . , n/2}, there are less than exp(dH(d)) models of dimensiond. Hence,

log (|M|)≤log(n) + sup

d=1,...,n/2

dH(d).

By assumption (HK,η),dH(d) is smaller thann/2. Thus, log(|M|)≤log(n) +n/2 and it follows that|M|^1/v is smaller than an universal constant providing thatnis larger than 8. All in all, we get

E[R4,2(m)1b _B1]≤Lnexp

−nκ²₀ 4u

.

For anyx >0, the following inequality holds x−1−log(x)≤ 9

64

x− 1 x

2

.

This statement is easy to establish by studying the derivative of the associated function. Hence, we upper bound the Kullback divergence

K

t, s;btm,bsm

= s

b sm

+ 1−log s

b sm

+l(btm, t) b sm

≤ 9 64

s² b s²_m +bs²_m

s²

+l(tm, t) b

sm +l(btm, tm) b sm .

(11)

Thanks to Cauchy-Schwarz inequality, we obtain E

K t, s;et,es 1_B^c

1

≤ E

"

9 64

s² e s² +se²

s²

+l(tm, t) e

s +l(et, tm_b) e s

! 1_B^c

1

#

≤ Lq P[B^c₁]

vu utX

m∈M

E

"

1_m=_m_b s⁴ b s⁴_m+bs⁴_m

s⁴ +l²(tm, t) b

s²_m +l²(btm, tm) b s²_m

!#

.

As in the proof of Lemma10.8, we apply H¨older’s inequality withv=⌊n/16⌋andu=v/(v−1).

Again, we check that for any modelm∈ M, n− |m| −8v≥1.

E

"

1_m=_m_b s⁴ b s⁴_m+bs⁴_m

s⁴ +l²(tm, t) b

s²_m +l²(btm, tm

b s²_m

!#

≤P[m=m]b ¹^u



Es^4v b s^4v_m

¹v

+E bs^4v_m

s^4v v¹

+E

l^2v(tm, t) b s^2v_m

¹v

+E l^2v(btm, tm) b s^2v_m

!_v¹

 . We bound the first two terms applying Lemma 2.1 or computing the v-th moment of χ² random variable.

E s^4v

b s^4vm

v¹

≤ n⁴

(n− |m| −8v)⁴ , E

bs^4v_m s^4v

_v¹

=

(n− |m|)(n− |m|+ 2). . .(n− |m|+ 2(4v−1))(sm)^4v (ns)^4v

¹_v

≤ (n− |m|+ 8v)⁴(s+l(0, t))⁴

n⁴s⁴ .

AskΠ^⊥_m(ǫ+ǫ_m)k²n is independent of the couple (kΠm(ǫ+ǫ_m)k²n,X_m), the random variablesbsm

andl(btm, tm) are independent. We bound the thel^2v-risk ofl(btm, t) thanks to Proposition 7.8 in [8].

E l^2v(btm, tm) b s^2v_m

!¹_v

=

E

l^2v(btm, tm) E

1 b s^2v_m

¹_v

≤ Lv²|m|²n⁴

(n− |m| −4v)² ≤Lv²|m|²n² n²

(n− |m| −4v)² . Combining these upper bounds and noting thatn− |m| −8v≥1 and|m| ≤n/2 yields

E

K t, s;et,es 1_B^c₁

≤

"

2n²

(n− |m| −8v)² + Lv|m|n²

n− |m| −4v +(n− |m|+ 8v)² n²

1 +l(0, t) s

²#

× L q

P[B^c1]|M|^2v¹

≤ LK,ηn^5/2

1 + l(0, t) s

exp [−nLK,η] ,

since |M|^1/2v is smaller than than an universal constant as explained in the proof of Lemma 10.8. Finally,l(0, t)/sis smaller thanK(t, s; 0,1).

(12)

2.7. Proof of Corollary 6.1

First, we claim that the penalties (21) are lower bounded by penalties defined in (7). Suppose that K > e−1. Since log(1 +Kx)≥log(1 +K)xfor anyxbetween 0 and 1, it follows that

peni(mi)≥log(1 +K) |mi| n− |mi|

1 +

q

2 [1 + log ((i−1)/|mi|)]²

, (10)

if |mi|/(n− |mi|){1 +p

2[1 + log((i−1)/|mi|))]²} ≤ 1. If K ≤ e−1, there exists a positive constantζ(K) such that log(1 +Kx)≥√

Kx, for allx≤ζ(K). Hence, we get peni(mi)≥√

K |mi| n− |mi|

1 +

q

2 [1 + log ((i−1)/|mi|)]²

, (11)

ifK≤e−1 and if|mi|/(n− |mi|)[1 +p

2[1 + log((i−1)/|mi|)]²]≤ζ(K).

Gathering (10) and (11), we get that for any K > 1, there exists some K^′ > 1 and some ζ(K)>0 such that:

peni(mi)≥K^′ |mi| n− |mi|

1 +

q

2 [1 + log ((i−1)/|mi|)]²

,

if|mi|/(n− |mi|){1 +p

2[1 + log((i−1)/|mi|)]²} ≤ζ(K).

For any 2≤i≤pand any 1≤k≤(i−1)∧d,Hi(k) is smaller than 1 + log((i−1)/k). Hence, peni(mi) is lower bounded by a penalty of the form (7) with someK^′>1. Assuming that

|mi| n− |mi|

1 +

q

2 [1 + log ((i−1)/|mi|)]²

≤ζ(K)∧η(K^′), (12) we derive that (HK^′,η) is fulfilled and that the risk bound (23) holds.

We conclude by observing that Condition (12) is satisfied if d[1 + log(p/d)∨0]≤nη^′(K), for some suitable functionη^′(K).

2.8. Proof of Proposition 4.5

We apply the same arguments as in the proof of Theorem4.4, except that we replaceH(|m|) by lm. Then, Lemmas10.4and10.7are still true.

In the proof of Lemmas 10.8and 10.9, the only difference with the previous case concerns the upper bound of log (|M|). By definition oflm,

|M| −1≤ sup

m∈M\{∅}

exp(|m|lm).

Hence, log(|M|)≤1 + sup_m∈M\{∅}|m|lm, which is smaller than 1 +n/2 by Assumption (H^bayK,η).

Lemmas10.5and10.6also hold whenH(|m|) is replaced bylmas explained in the proof of Proposition 3.5 in [8].

(13)

3. Proofs of the minimax bounds

We note dH(., .) the Hamming distance between two vectors. The Hamming distance between two matrices of sizepis defined as the Hamming distance between their vectorialized version of size p². It is also noteddH(., .).

3.1. Main lemma

We first state two useful lemmas for proving the minimax lower bounds. The first one is known as Varshamov-Gilbert’s lemma, whereas the second one is a modified version of Birg´e’s lemma for covariance estimation.

Lemma 3.1(Varshamov-Gilbert’s lemma). Let{0,1}^Dbe equipped with Hamming distancedH. There exists some subset Θof{0,1}^D with the following properties

dH(θ, θ^′)> D/4 for every(θ, θ^′)∈Θ² with θ6=θ^′ and log|Θ| ≥D/8 . We notektk^l² the Euclidean norm of a vectort.

Lemma 3.2. LetA be a subset of{1, . . . , p}. For any positive matricesΩandΩ^′, we define the function d(Ω,Ω^′)by

d(Ω,Ω^′) :=X

i∈A

log

"

1 +kti−t^′_ik²l2

4

#

+ X

i∈A^c

si

s^′_i + log si

s^′_i

−1 . (13)

Let Υbe a subset of square matrices of sizepwhich satisfies the following assumptions:

1. For allΩ∈Υ, ϕmax(Ω)≤2 andϕmin(Ω)≥1/2.

2. There exists(s1,s₂)∈[1; 2]² such that∀Ω∈Υ, ∀1≤i≤p,si∈ {s₁,s₂}.

Setting δ = minΩ,Ω^′∈Υ,Ω6=Ω^′d(Ω,Ω^′), provided that maxΩ,Ω^′∈ΥK(P^⊗_Ωⁿ;P^⊗_Ω^′ⁿ) ≤ κ1log|Υ|, the following lower bound holds

infΩb

sup

Ω∈ΥEΩ

hK Ω;Ωbi

≥κ2δ .

The numerical constants κ1 andκ2 are made explicit in the proof.

Proof of Lemma 3.2. This lemma is mainly based on an application of Birg´e’s version of Fano’s lemma [3]. We provide a statement of the result that is taken from [6] Sect.2.4.

Lemma 3.3. Let(Pi)_0≤i≤N be some family of probability distributions and(Ai)_0≤i≤N be some family of disjoint events. Let a= min0≤i≤NPi(Ai), then

a≤κ∨

max_1≤i≤NK(Pi;P0) log(1 +N)

,

whereκ= 2e/(2e+ 1).

LetΩ be an estimator of Ω. Letb Ω be an estimator that takes its values in Υ and satisfiese d(eΩ,Ω) = minb

Ω^′∈Υd(Ω^′,Ω)b .

(14)

We note (T ,e S) and (e T ,b S) the Cholesky decompositions ofb Ω ande Ω. Letb i∈ {1, . . . , p}. By the triangle inequality,

kti−etik²l2

4 ≤2

"

kti−btik²l2

4 +kbti−etik²l2

4

# .

For any positive numbers a and b, log(1 +a+b)≤log(1 +a) + log(1 +b). Moreover, for any positive numbera, we have log(1 + 2a)≤2 log(1 +a) because the log function is concave. Hence, we get

log

"

1 + kti−etik²l2

4

#

≤2 log

"

1 + kti−btik²l2

4

# + 2 log

"

1 + kbti−etik²l2

4

#

. (14)

Let us define the functionf byf(x) :=x−log(x)−1 for anyx >0. We state that there exists some numerical constantL such that

f si

e si

≤L

f si

b si

+f

esi

b si

. (15)

If si =esi, this inequality holds for anyL >0 sincef(1) = 0 and f is non negative. Ifsi 6=esi, there are two possibilities: either si = s₁ and esi = s₂ or si = s₂ and esi = s₁. By deriving f(s₁/x) +f(s₂/x), one observes that this sum is minimized for x= (s₁+s₂)/2 and that this minimum equalsf[2/(1 +s₁/s₂)] +f[2/(1 +s₂/s₁)]. Hence, we obtain that

f si

e si

≤ f

s1

s2

∨f

s2

s1

fh

2/

1 + ^s_s¹₂i +fh

2/

1 + ^s_s²₁i

f si

b si

+f

esi

b si

.

Sinces₁ ands₂lie between one and two, it follows that f

si

e si

≤ sup

1/2≤x≤2

f(x) f[2/(1 +x)] +f

2/(1 + ¹_x)

f si

b si

+f

esi

b si

. (16)

The ratiof(x)/(f[2/(1 +x)] +f[2/(1 + 1/x)]) is positive and continuous on [1/2; 1[ and ]1; 2]. By studying the Taylor series off(x) atxequals one, we observe thatf(x) = (x−1)²/2 +o[(x−1)²], f(2/(1 +x)) = (x−1)²/8 +o[(x−1)²], andf(2/(1 + 1/x)) = (x−1)²/8 +o[(x−1)²]. Hence, there exists a continuation of the ratio f(x)/(f[2/(1 +x)] +f[2/(1 + 1/x)]) around one. The supremum in (16) is therefore finite and the upper bound (15) holds.

Combining the upper bounds (14) and (15) with the definition ofΩ yieldse d(Ω,Ω)e ≤ 2X

i∈A

( log

"

1 +kti−btik²l2

4

# + log

"

1 + kbti−etik²l2

4

#)

+LX

i∈A^c

f

si

b si

+f

si

e si

≤ Lh

d(Ω,Ω) +b d(eΩ,Ω)b i

≤Ld(Ω,Ω)b . Hence, one can lower bound the risk ofΩ as followsb

sup

Ω∈ΥEΩ

hd Ω,Ωbi

≥L⁻¹δsup

Ω∈ΥPΩ

hΩ6=Ωei

=L⁻¹δ

1−min

Ω∈ΥPΩ

hΩ =Ωbi .

(15)

Applying Lemma3.3, we conclude that infΩb

sup

Ω∈ΥEΩ

hd Ω,Ωbi

≥L⁻¹(1−κ)δ , (17)

if maxΩ,Ω^′∈ΥK(P^⊗_Ωⁿ,P^⊗_Ω^′ⁿ)≤κlog|Υ|.

Let us now express this minimax lower bound in term of Kullback divergence. Thanks to the chain rule and Lemma 10.1, the Kullback divergence between two positive matrices Ω and Ω^′ decomposes as

K(Ω; Ω^′) = Xp i=1

1 2

logs^′_i si

+si

s^′_i −1 +li(ti, t^′_i) s^′_i

.

Straightforward computations allow to prove that the function log(s^′_i/si) +si/s^′_i−1 +li(ti, t^′_i)/s^′_i is minimized with respect tos^′_i whens^′_i=si+li(ti, t^′_i). This leads to the lower bound

logs^′_i si +si

s^′_i −1 + li(ti, t^′_i) s^′_i ≥log

1 +li(ti, t^′_i) si

.

By Definition (29) ofli(., .) the quantityli(ti, t^′_i) is lower bounded by [ϕmax(Ω)]⁻¹kti−t^′_ik²l2. By assumption, [ϕmax(Ω)]⁻¹ is larger than 1/2 for any Ω∈Υ. Moreover,si is smaller than 2. We conclude that for any Ω∈Υ and any positive matrix Ω^′, the following lower bound holds

2K(Ω; Ω^′)≥X

i∈A

log 1 +kti−t^′_ik²l2

4

!

+X

i∈A^c

si

s^′_i −log si

s^′_i

−1 =d(Ω,Ω^′).

We conclude by gathering this last bound with (17) and setting κ1 := κand κ2 := L⁻¹(1− κ)/2.

3.2. Adaptive banding

In order to compute the minimax rates of estimation over ellipsoids, we first need to consider a lower bound over the setsT^ord[k1, . . . , kp, r] andU^ord[k1, . . . , kp, r].

Definition 3.4. Let (k1, . . . , kp)∈N^p such thatki ≤i−1 and let r be a positive number. We respectively define the setsT^ord[k1, . . . , kp, r] andU^ord[k1, . . . , kp, r] as

T^ord[k1, . . . , kp, r] :=

T ∈T rig(p)s.t. ∀2≤i≤p, T[i, j] = 0 if 1≤j≤i−ki−1 0 orr if i−ki≤j≤i−1

,(18)

U^ord[k1, . . . , kp, r] :=

T^∗S⁻¹T , T ∈ Tôrd[k1, . . . , kp, r] andS∈Diag(p) . (19) The setTôrd[k1, . . . , kp, r] contains lower triangular matrices with unit diagonal such that for each lineibetween 2 andp, the support of the vector (T[i, j])1≤j≤i−1is included in{i−ki, i−ki+ 1, . . . , i−1}. We are able to lower bound the minimax rates of estimation overUôrd[(k1, . . . , kp), r].

Proposition 3.5. Assume that k:= 1∨max_1≤i≤pki is smaller than √n. Then, infbΩ

sup

Ω∈Uord[(k1,...,kp),r]Eh K

Ω;Ωbi

≥L

" p

X

i=2

ki+p

# r²∧1

n

(16)

These minimax rates of estimation are not really surprising, since they correspond to the minimax rates of estimation ofpdifferent parametric regression problems whose minimax rates is known to be of the orderki(r²∧1/n). We refer for instance to [6] Prop. 4.8. Moreover, the termp/nis due to the diagonal matricesS in Ω =T^∗S⁻¹T. We believe that the assumption “k is smaller than √n” is not necessary but we do not know how to remove it.

3.2.1. Proof of Proposition 5.3

The lower bound (17) is a consequence of Proposition3.5. Letkbe a positive integer smaller than

⌊√n⌋∧(p−1). Given some 0< r <p

a²_kR²/k, we consider the setU^ord[0,1, . . . , k−1, k, . . . , k, r].

Let (T, S) refer to the Cholesky decomposition of a matrix Ω belonging to this set. By definition ofU^ord,

i−1

X

j=1

T[i, i−j]² a²_j =

Xk j=1

T[i, i−j]²

a²_j ≤kr²/a²_k≤R².

Hence, the set U^ord[0,1, . . . , k−1, k, . . . , k, r] is included in E(a, R, p). By Proposition 3.5, we obtain the minimax lower bound.

infbΩ

sup

Ω∈E(a,R,p)Eh K

Ω;Ωbi

≥ Lp(k+ 1) a²_kR²

k ∧ 1 n

≥ Lp

a²_kR²∧k+ 1 n

.

Similarly ifk = 0, the setU^ord[0, . . . ,0,+∞] is included inE(a, R, p) and the minimax rates of estimation over E(a, R, p) is lower bounded by Lp(a0R²∧1/n) with the conventiona0 = +∞. Taking the infimum for all non-positive integersksmaller than⌊√

n⌋∧(p−1) yields the first result.

Let us now turn to the second part of the proposition. We fix some γ >2. The matrices Ω considered in the proof of Proposition3.5for bounding the minimax rates of estimation over sets of the typeU^ord[0,1, . . . , k−1, k, . . . , k, r] have their eigenvalues between 1/2 and 2. Hence, the previous lower bound still holds and we get

infΩb

sup

Ω∈E(a,R,p)∩Bop(γ)Eh K

Ω;Ωbi

≥ Lp sup

k=0,...,⌊√

n⌋∧(p−1)

a²_kR²∧k+ 1 n

. (20) Letkbe a non-negative integer smaller or equal tod∧(p−1), wheredis the maximal dimension of the models defining the estimatorΩe^d_ord. We consider the modelm∈ M^dord defined by

m:= (∅,{1}, . . . ,{i−1, . . . , i−k}, . . . ,{p−1, . . . , p−k}).

This model corresponds to estimating ak-th banded Cholesky factor. By Corollary5.1, the risk ofΩe^d_ord is upper bounded by

Eh K

Ω;Ωe^d_ordi

≤LK,η

p(k+ 1)

n +K(Ω; Ωm)

+τn(Ω, K, η). (21) Let us upper bound the bias termK(Ω; Ωm). By Equation (31), it decomposes as

2K(Ω; Ωm) = Xp i=1

si

si,mi

−log si

si,mi

−1 + li(ti, ti,mi) si,mi

= Xp i=1

log

1 +li(ti, ti,mi) si

≤ Xp

i=1

li(ti, ti,mi) si

,