The Annals of Statistics
2016, Vol. 44, No. 2, 455–488
DOI: 10.1214/13-AOS1171
© Institute of Mathematical Statistics, 2016

ESTIMATING SPARSE PRECISION MATRIX: OPTIMAL RATES OF CONVERGENCE AND ADAPTIVE ESTIMATION

BY T. TONY CAI1, WEIDONG LIU2 AND HARRISON H. ZHOU3

University of Pennsylvania, Shanghai Jiao Tong University and Yale University

Precision matrix is of significant importance in a wide range of applications in multivariate analysis. This paper considers adaptive minimax estimation of sparse precision matrices in the high dimensional setting. Optimal rates of convergence are established for a range of matrix norm losses. A fully data-driven estimator based on adaptive constrained $\ell_1$ minimization is proposed and its rate of convergence is obtained over a collection of parameter spaces. The estimator, called ACLIME, is easy to implement and performs well numerically.

A major step in establishing the minimax rate of convergence is the derivation of a rate-sharp lower bound. A "two-directional" lower bound technique is applied to obtain the minimax lower bound. The upper and lower bounds together yield the optimal rates of convergence for sparse precision matrix estimation and show that the ACLIME estimator is adaptively minimax rate optimal for a collection of parameter spaces and a range of matrix norm losses simultaneously.

1. Introduction. Precision matrix plays a fundamental role in many high-dimensional inference problems. For example, knowledge of the precision matrix is crucial for classification and discriminant analyses. Furthermore, precision matrix is critically useful for a broad range of applications such as portfolio optimization, speech recognition and genomics. See, for example, Lauritzen (1996), Yuan and Lin (2007) and Saon and Chien (2011). Precision matrix is also closely connected to the graphical models which are a powerful tool to model the relationships among a large number of random variables in a complex system and are used in a wide array of scientific applications. It is well known that recovering the structure of an undirected Gaussian graph is equivalent to the recovery of the support of the precision matrix. See, for example, Lauritzen (1996), Meinshausen

Received January 2013; revised June 2013.

1 Supported in part by NSF FRG Grant DMS-08-54973 and NSF Grant DMS-12-08982.

2 Supported by NSFC Grants 11201298, 11322107 and 11431006, the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning, Shanghai Pujiang Program, Shanghai Shuguang Program, Foundation for the Author of National Excellent Doctoral Dissertation of PR China and 973 Program (2015CB856004).

3 Supported in part by NSF Career Award DMS-06-45676 and NSF FRG Grant DMS-08-54975.

MSC2010 subject classifications. Primary 62H12; secondary 62F12, 62G09.

Key words and phrases. Constrained $\ell_1$-minimization, covariance matrix, graphical model, minimax lower bound, optimal rate of convergence, precision matrix, sparsity, spectral norm.



and Bühlmann (2006) and Cai, Liu and Luo (2011). Liu, Lafferty and Wasserman (2009) extended the result to a more general class of distributions called nonparanormal distributions.

The problem of estimating a large precision matrix and recovering its support has drawn considerable recent attention and a number of methods have been introduced. Meinshausen and Bühlmann (2006) proposed a neighborhood selection method for recovering the support of a precision matrix. Penalized likelihood methods have also been introduced for estimating sparse precision matrices. Yuan and Lin (2007) proposed an $\ell_1$ penalized normal likelihood estimator and studied its theoretical properties. See also Friedman, Hastie and Tibshirani (2008), d'Aspremont, Banerjee and El Ghaoui (2008), Rothman et al. (2008), Lam and Fan (2009) and Ravikumar et al. (2011). Yuan (2010) applied the Dantzig selector method to estimate the precision matrix and gave the convergence rates for the estimator under the matrix $\ell_1$ norm and spectral norm. Cai, Liu and Luo (2011) introduced an estimator called CLIME using a constrained $\ell_1$ minimization approach and obtained the rates of convergence for estimating the precision matrix under the spectral norm and Frobenius norm.

Although many methods have been proposed and various rates of convergence have been obtained, it is unclear which estimator is optimal for estimating a sparse precision matrix in terms of convergence rate. This is due to the fact that the minimax rates of convergence, which can serve as a fundamental benchmark for the evaluation of the performance of different procedures, are still unknown. The goals of the present paper are to establish the optimal minimax rates of convergence for estimating a sparse precision matrix under a class of matrix norm losses and to introduce a fully data-driven adaptive estimator that is simultaneously rate optimal over a collection of parameter spaces for each loss in this class.

Let $X_1, \ldots, X_n$ be a random sample from a $p$-variate distribution with a covariance matrix $\Sigma = (\sigma_{ij})_{1 \le i,j \le p}$. The goal is to estimate the inverse of $\Sigma$, the precision matrix $\Omega = (\omega_{ij})_{1 \le i,j \le p}$. It is well known that in the high-dimensional setting structural assumptions are needed in order to consistently estimate the precision matrix. The class of sparse precision matrices, where most of the entries in each row/column are zero or negligible, is of particular importance as it is related to sparse graphs in the Gaussian case. For a matrix $A$ and a number $1 \le w \le \infty$, the matrix $\ell_w$ norm is defined as $\|A\|_w = \sup_{|x|_w \le 1} |Ax|_w$. The sparsity of a precision matrix can be modeled by the $\ell_q$ balls with $0 \le q < 1$. More specifically, we define the parameter space $\mathcal{G}_q(c_{n,p}, M_{n,p})$ by

$$\mathcal{G}_q(c_{n,p}, M_{n,p}) = \left\{ \Omega = (\omega_{ij})_{1\le i,j\le p} : \max_{j} \sum_{i=1}^p |\omega_{ij}|^q \le c_{n,p},\ \|\Omega\|_1 \le M_{n,p},\ \frac{\lambda_{\max}(\Omega)}{\lambda_{\min}(\Omega)} \le M_1,\ \Omega \succ 0 \right\}, \tag{1.1}$$

where $0 \le q < 1$, $M_{n,p}$ and $c_{n,p}$ are positive, bounded away from 0 and allowed to grow as $n$ and $p$ grow, $M_1 > 0$ is a given constant, $\lambda_{\max}(\Omega)$ and $\lambda_{\min}(\Omega)$ are the largest and smallest eigenvalues of $\Omega$, respectively, and $c_1 n^{\beta} \le p \le \exp(\gamma n)$ for some constants $\beta > 1$, $c_1 > 0$ and $\gamma > 0$. The notation $A \succ 0$ means that $A$ is symmetric and positive definite. In the special case of $q = 0$, a matrix in $\mathcal{G}_0(c_{n,p}, M_{n,p})$ has at most $c_{n,p}$ nonzero elements on each row/column.
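For readers who want to experiment, the following small Python check (our own sketch; the function name and the use of NumPy are ours, not the paper's) tests the three defining conditions of $\mathcal{G}_q(c_{n,p}, M_{n,p})$ for a given symmetric matrix.

```python
import numpy as np

def in_G_q(Omega, q, c_np, M_np, M1):
    """Check the defining conditions of G_q(c_np, M_np) for a symmetric Omega."""
    if q > 0:
        lq_cols = np.max(np.sum(np.abs(Omega) ** q, axis=0))
    else:
        lq_cols = np.max(np.sum(Omega != 0, axis=0))  # q = 0: count nonzeros
    l1_norm = np.max(np.sum(np.abs(Omega), axis=0))   # matrix l1 norm
    eigs = np.linalg.eigvalsh(Omega)                  # ascending eigenvalues
    return (lq_cols <= c_np and l1_norm <= M_np
            and eigs[0] > 0 and eigs[-1] / eigs[0] <= M1)
```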

Our analysis establishes the minimax rates of convergence for estimating the precision matrices over the parameter space $\mathcal{G}_q(c_{n,p}, M_{n,p})$ under the matrix $\ell_w$ norm losses for $1 \le w \le \infty$. We shall first introduce a new method using an adaptive constrained $\ell_1$ minimization approach for estimating the sparse precision matrices. The estimator, called ACLIME, is fully data-driven and easy to implement. The properties of the ACLIME are then studied in detail under the matrix $\ell_w$ norm losses. In particular, we establish the rates of convergence for the ACLIME estimator which provide upper bounds for the minimax risks.

A major step in establishing the minimax rates of convergence is the derivation of rate-sharp lower bounds. As in the case of estimating sparse covariance matrices, conventional lower bound techniques, which are designed and well suited for problems with parameters that are scalar or vector-valued, fail to yield good results for estimating sparse precision matrices under the spectral norm. In the present paper, we apply the "two-directional" lower bound technique first developed in Cai and Zhou (2012) for estimating sparse covariance matrices. This lower bound method can be viewed as a simultaneous application of Assouad's lemma along the row direction and Le Cam's method along the column direction. The lower bounds match the rates in the upper bounds for the ACLIME estimator, and thus yield the minimax rates.

By combining the minimax lower and upper bounds developed in later sections, the main results on the optimal rates of convergence for estimating a sparse precision matrix under various norms can be summarized in the following theorem. We focus here on the exact sparse case of $q = 0$; the optimal rates for the general case of $0 \le q < 1$ are given at the end of Section 4. Here, for two sequences of positive numbers $a_n$ and $b_n$, $a_n \asymp b_n$ means that there exist positive constants $c$ and $C$ independent of $n$ such that $c \le a_n/b_n \le C$.

THEOREM 1.1. Let $X_i \stackrel{\text{i.i.d.}}{\sim} N_p(\mu, \Sigma)$, $i = 1, 2, \ldots, n$, and let $1 \le c_{n,p} = o\big(n^{1/2}(\log p)^{-3/2}\big)$. The minimax risk of estimating the precision matrix $\Omega = \Sigma^{-1}$ over the class $\mathcal{G}_0(c_{n,p}, M_{n,p})$ based on the random sample $\{X_1, \ldots, X_n\}$ satisfies

$$\inf_{\hat{\Omega}} \sup_{\mathcal{G}_0(c_{n,p}, M_{n,p})} \mathbb{E}\big\|\hat{\Omega} - \Omega\big\|_w^2 \asymp M_{n,p}^2\, c_{n,p}^2\, \frac{\log p}{n} \tag{1.2}$$

for all $1 \le w \le \infty$.

In view of Theorem 1.1, the ACLIME estimator to be introduced in Section 2, which is fully data-driven, attains the optimal rates of convergence simultaneously for all $k$-sparse precision matrices in the parameter spaces $\mathcal{G}_0(k, M_{n,p})$ with $k = o\big(n^{1/2}(\log p)^{-3/2}\big)$ under the matrix $\ell_w$ norm for all $1 \le w \le \infty$. The commonly used spectral norm coincides with the matrix $\ell_2$ norm. For a symmetric matrix $A$, it is known that the spectral norm $\|A\|_2$ is equal to the largest magnitude of the eigenvalues of $A$. When $w = 1$, the matrix $\ell_1$ norm is simply the maximum absolute column sum of the matrix. As will be seen in Section 4, the adaptivity holds for the general $\ell_q$ balls $\mathcal{G}_q(c_{n,p}, M_{n,p})$ with $0 \le q < 1$. The ACLIME procedure is thus rate optimally adaptive to both the sparsity patterns and the loss functions.
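As a quick numerical illustration of these norms (ours, not from the paper): for a symmetric matrix, the spectral norm equals the largest eigenvalue magnitude and is dominated by the matrix $\ell_1$ norm.

```python
import numpy as np

A = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])            # symmetric tridiagonal example

l1 = np.max(np.abs(A).sum(axis=0))          # max absolute column sum: ||A||_1
linf = np.max(np.abs(A).sum(axis=1))        # max absolute row sum:  ||A||_inf
l2 = np.max(np.abs(np.linalg.eigvalsh(A)))  # spectral norm (A symmetric)

assert np.isclose(l2, np.linalg.norm(A, 2)) # agrees with the SVD-based norm
assert l2 <= l1                             # ||A||_2 <= ||A||_1 for symmetric A
print(l1, l2, linf)                         # here l1 = linf = 4, l2 = 2 + sqrt(2)
```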

In addition to its theoretical optimality, the ACLIME estimator is computationally easy to implement for high-dimensional data. It can be computed column by column via linear programming and the algorithm is easily scalable. A simulation study is carried out to investigate the numerical performance of the ACLIME estimator. The results show that the procedure performs favorably in comparison to CLIME.

Our work on optimal estimation of precision matrix given in the present paper is closely connected to a growing literature on estimation of large covariance matrices. Many regularization methods have been proposed and studied. For example, Bickel and Levina (2008a, 2008b) proposed banding and thresholding estimators for estimating bandable and sparse covariance matrices, respectively, and obtained rates of convergence for the two estimators. See also El Karoui (2008) and Lam and Fan (2009). Cai, Zhang and Zhou (2010) established the optimal rates of convergence for estimating bandable covariance matrices. Cai and Yuan (2012) introduced an adaptive block thresholding estimator which is simultaneously rate optimal over large collections of bandable covariance matrices. Cai and Zhou (2012) obtained the minimax rate of convergence for estimating sparse covariance matrices under a range of losses including the spectral norm loss. In particular, a new general lower bound technique was developed. Cai and Liu (2011) introduced an adaptive thresholding procedure for estimating sparse covariance matrices that automatically adjusts to the variability of individual entries.

The rest of the paper is organized as follows. The ACLIME estimator is introduced in detail in Section 2 and its theoretical properties are studied in Section 3. In particular, a minimax upper bound for estimating sparse precision matrices is obtained. Section 4 establishes a minimax lower bound which matches the minimax upper bound derived in Section 3 in terms of the convergence rate. The upper and lower bounds together yield the optimal minimax rate of convergence. A simulation study is carried out in Section 5 to compare the performance of the ACLIME with that of the CLIME estimator. Section 6 gives the optimal rate of convergence for estimating sparse precision matrices under the Frobenius norm and discusses connections and differences of our work with other related problems. The proofs are given in Section 7.

2. Methodology. In this section, we introduce an adaptive constrained $\ell_1$ minimization procedure, called ACLIME, for estimating a precision matrix. The properties of the estimator are then studied in Section 3 under the matrix $\ell_w$ norm losses for $1 \le w \le \infty$, and a minimax upper bound is established. The upper bound together with the lower bound given in Section 4 will show that the ACLIME estimator is adaptively rate optimal.

We begin with basic notation and definitions. For a vector $a = (a_1, \ldots, a_p)^T \in \mathbb{R}^p$, define $|a|_1 = \sum_{j=1}^p |a_j|$ and $|a|_2 = \sqrt{\sum_{j=1}^p a_j^2}$. For a matrix $A = (a_{ij}) \in \mathbb{R}^{p \times q}$, we define the elementwise $\ell_w$ norm by $|A|_w = (\sum_{i,j} |a_{ij}|^w)^{1/w}$. The Frobenius norm of $A$ is the elementwise $\ell_2$ norm. $I$ denotes a $p \times p$ identity matrix. For any two index sets $T$ and $T'$ and a matrix $A$, we use $A_{TT'}$ to denote the $|T| \times |T'|$ matrix with rows and columns of $A$ indexed by $T$ and $T'$, respectively.

For an i.i.d. random sample $\{X_1, \ldots, X_n\}$ of $p$-variate observations drawn from a population $X$, let the sample mean $\bar{X} = \frac{1}{n}\sum_{k=1}^n X_k$ and the sample covariance matrix

$$\Sigma^* = \big(\sigma^*_{ij}\big)_{1\le i,j \le p} = \frac{1}{n-1} \sum_{l=1}^n (X_l - \bar{X})(X_l - \bar{X})^T, \tag{2.1}$$

which is an unbiased estimate of the covariance matrix $\Sigma = (\sigma_{ij})_{1\le i,j\le p}$.
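In code, (2.1) is the familiar unbiased sample covariance; a short NumPy check (ours) confirms the explicit formula against the built-in routine.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))           # n = 50 observations in R^4

Xc = X - X.mean(axis=0)                    # center at the sample mean
Sigma_star = Xc.T @ Xc / (X.shape[0] - 1)  # equation (2.1)

assert np.allclose(Sigma_star, np.cov(X, rowvar=False))  # NumPy agrees
```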

It is well known that in the high-dimensional setting, the inverse of the sample covariance matrix either does not exist or is not a good estimator of $\Omega$. As mentioned in the Introduction, a number of methods for estimating $\Omega$ have been introduced in the literature. In particular, Cai, Liu and Luo (2011) proposed an estimator called CLIME by solving the following optimization problem:

$$\min |\Omega|_1 \quad \text{subject to:} \quad \big|\Sigma^*\Omega - I\big|_\infty \le \tau_n,\quad \Omega \in \mathbb{R}^{p\times p}, \tag{2.2}$$

where $\tau_n = C M_{n,p}\sqrt{\log p/n}$ for some constant $C$. The convex program (2.2) can be further decomposed into $p$ vector-minimization problems. Let $e_i$ be a standard unit vector in $\mathbb{R}^p$ with 1 in the $i$th coordinate and 0 in all other coordinates. For $1 \le i \le p$, let $\hat\omega_i$ be the solution of the following convex optimization problem:

$$\min |\omega|_1 \quad \text{subject to} \quad \big|\Sigma^*\omega - e_i\big|_\infty \le \tau_n, \tag{2.3}$$

where $\omega$ is a vector in $\mathbb{R}^p$. The final CLIME estimator of $\Omega$ is obtained by putting the columns $\hat\omega_i$ together and applying an additional symmetrization step. This estimator is easy to implement and possesses a number of desirable properties, as shown in Cai, Liu and Luo (2011).
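To make the column-wise program (2.3) concrete, here is a sketch (our illustrative code, not the authors' implementation) that casts it as a linear program via the standard split $\omega = u - v$ with $u, v \ge 0$, solved with SciPy's `linprog`.

```python
import numpy as np
from scipy.optimize import linprog

def clime_column(S, i, tau):
    """Solve min |w|_1 s.t. |S @ w - e_i|_inf <= tau as a linear program.

    S is the sample covariance, i the column index, tau the constraint level.
    Writing w = u - v with u, v >= 0 makes the objective sum(u) + sum(v)
    linear, the standard LP reformulation of an l1 problem.
    """
    p = S.shape[0]
    e_i = np.zeros(p)
    e_i[i] = 1.0
    c = np.ones(2 * p)                       # objective: |w|_1
    # Two-sided constraint: S(u - v) - e_i <= tau and -(S(u - v) - e_i) <= tau.
    A = np.vstack([np.hstack([S, -S]), np.hstack([-S, S])])
    b = np.concatenate([tau + e_i, tau - e_i])
    res = linprog(c, A_ub=A, b_ub=b, bounds=[(0, None)] * (2 * p),
                  method="highs")
    u, v = res.x[:p], res.x[p:]
    return u - v
```

The $p$ columns are independent problems, which is what makes the procedure easy to parallelize and scale.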

The CLIME estimator has, however, two drawbacks. One is that the estimator is not rate optimal, as will be shown later. Another drawback is that the procedure is not adaptive in the sense that the tuning parameter $\tau_n$ is not fully specified and needs to be chosen through an empirical method such as cross-validation.

To overcome these drawbacks of CLIME, we now introduce an adaptive constrained $\ell_1$-minimization for inverse matrix estimation (ACLIME). The estimator is fully data-driven and adaptive to the variability of individual entries. A key technical result which provides the motivation for the new procedure is the following fact.


LEMMA 2.1. Let $X_1, \ldots, X_n \stackrel{\text{i.i.d.}}{\sim} N_p(\mu, \Sigma)$ with $\log p = O(n^{1/3})$. Set $S = (s_{ij})_{1\le i,j\le p} = \Sigma^*\Omega - I_{p\times p}$, where $\Sigma^*$ is the sample covariance matrix defined in (2.1). Then

$$\operatorname{Var}(s_{ij}) = \begin{cases} n^{-1}(1 + \sigma_{ii}\omega_{ii}), & \text{for } i = j, \\ n^{-1}\sigma_{ii}\omega_{jj}, & \text{for } i \ne j, \end{cases}$$

and for all $\delta \ge 2$,

$$P\left(\Big|\big(\Sigma^*\Omega - I_{p\times p}\big)_{ij}\Big| \le \delta\sqrt{\frac{\sigma_{ii}\omega_{jj}\log p}{n}},\ \forall\, 1\le i,j\le p\right) \ge 1 - O\big((\log p)^{-1/2}p^{-\delta^2/4+1}\big). \tag{2.4}$$
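The variance formula in Lemma 2.1 is easy to sanity-check by simulation; in the following quick Monte Carlo experiment (ours) with $\Sigma = \Omega = I$, the two cases give $n\operatorname{Var}(s_{ij}) \approx 1$ and $n\operatorname{Var}(s_{ii}) \approx 1 + 1 = 2$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, reps = 100, 5, 20000
s_off, s_diag = [], []
for _ in range(reps):
    X = rng.standard_normal((n, p))          # Sigma = Omega = I
    S = np.cov(X, rowvar=False) - np.eye(p)  # S = Sigma* Omega - I
    s_off.append(S[0, 1])
    s_diag.append(S[0, 0])
print(np.var(s_off) * n)    # ~ sigma_11 * omega_22 = 1
print(np.var(s_diag) * n)   # ~ 1 + sigma_11 * omega_11 = 2
```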

REMARK 2.1. The condition $\log p = O(n^{1/3})$ in Lemma 2.1 can be relaxed to $\log p = o(n)$. Under the condition $\log p = o(n)$, Theorem 1 in Chapter VIII of Petrov (1975) implies that equation (2.4) still holds by replacing the probability bound $1 - O\big((\log p)^{-1/2}p^{-\delta^2/4+1}\big)$ with $1 - O\big((\log p)^{-1/2}p^{-\delta^2/4+1+o(1)}\big)$. We then need the constant $\delta > 2$ so that $(\log p)^{-1/2}p^{-\delta^2/4+1+o(1)} = o(1)$.

A major step in the construction of the adaptive data-driven procedure is to make the constraint in (2.2) and (2.3) adaptive to the variability of individual entries based on Lemma 2.1, instead of using a single upper bound $\lambda_n$ for all the entries.

In order to apply Lemma 2.1, we need to estimate the diagonal elements of $\Sigma$ and $\Omega$, that is, $\sigma_{ii}$ and $\omega_{jj}$, $i, j = 1, \ldots, p$. Note that $\sigma_{ii}$ can be easily estimated by the sample variances $\sigma^*_{ii}$, but $\omega_{jj}$ are harder to estimate. Hereafter, $(A)_{ij}$ denotes the $(i,j)$th entry of the matrix $A$ and $(a)_j$ denotes the $j$th element of the vector $a$. Denote $b_j = (b_{1j}, \ldots, b_{pj})$.

The ACLIME procedure has two steps: the first step is to estimate $\omega_{jj}$ and the second step is to apply a modified version of the CLIME procedure to take into account the variability of individual entries.

Step 1: Estimating $\omega_{jj}$. Note that $\sigma_{ii}\omega_{jj} \le (\sigma_{ii}\sigma_{jj})\omega_{jj}^2$, since $\sigma_{jj}\omega_{jj} \ge 1$; this implies $\sqrt{\sigma_{ii}\omega_{jj}} \le \sqrt{\sigma_{ii}\sigma_{jj}}\,\omega_{jj}$, that is,

$$2\sqrt{\frac{\sigma_{ii}\omega_{jj}\log p}{n}} \le 2\sqrt{\sigma_{ii}\sigma_{jj}}\,\omega_{jj}\sqrt{\frac{\log p}{n}}.$$

From equation (2.4), we consider

$$\Big|\big(\Sigma^*\Omega - I_{p\times p}\big)_{ij}\Big| \le 2\sqrt{\sigma_{ii}\sigma_{jj}}\,\omega_{jj}\sqrt{\frac{\log p}{n}},\qquad 1 \le i,j \le p. \tag{2.5}$$

Let $\hat\Omega_1 := (\hat\omega^1_{ij}) = (\hat\omega^1_{\cdot 1}, \ldots, \hat\omega^1_{\cdot p})$ be a solution to the following optimization problem:

$$\hat\omega^1_{\cdot j} = \arg\min_{b_j \in \mathbb{R}^p}\Big\{|b_j|_1 : \big|\big(\hat\Sigma b_j - e_j\big)_i\big| \le \lambda_n\sqrt{\hat\sigma_{ii}\hat\sigma_{jj}} \times b_{jj},\ 1 \le i \le p,\ b_{jj} > 0\Big\}, \tag{2.6}$$

where $b_j = (b_{1j}, \ldots, b_{pj})$, $1 \le j \le p$, $\hat\Sigma = (\hat\sigma_{ij}) = \Sigma^* + n^{-1}I_{p\times p}$ and

$$\lambda_n = \delta\sqrt{\frac{\log p}{n}}. \tag{2.7}$$

Here, $\delta$ is a constant which can be taken as 2. The estimator $\hat\Omega_1$ yields estimates of the conditional variance $\omega_{jj}$, $1 \le j \le p$. More specifically, we define the estimates of $\omega_{jj}$ by

$$\breve\omega_{jj} = \hat\omega^1_{jj}\, I\left\{\hat\sigma_{jj} \le \sqrt{\frac{n}{\log p}}\right\} + \sqrt{\frac{\log p}{n}}\, I\left\{\hat\sigma_{jj} > \sqrt{\frac{n}{\log p}}\right\}.$$

Step 2: Adaptive estimation. Given the estimates $\breve\omega_{jj}$, the final estimator $\hat\Omega$ of $\Omega$ is constructed as follows. First, we obtain $\tilde\Omega_1 = (\tilde\omega^1_{ij})$ by solving $p$ optimization problems: for $1 \le j \le p$,

$$\tilde\omega^1_{\cdot j} = \arg\min_{b \in \mathbb{R}^p}\Big\{|b|_1 : \big|\big(\hat\Sigma b - e_j\big)_i\big| \le \lambda_n\sqrt{\hat\sigma_{ii}\breve\omega_{jj}},\ 1 \le i \le p\Big\}, \tag{2.8}$$

where $\lambda_n$ is given in (2.7). We then obtain the estimator $\hat\Omega$ by symmetrizing $\tilde\Omega_1$:

$$\hat\Omega = (\hat\omega_{ij}),\qquad \hat\omega_{ij} = \hat\omega_{ji} = \tilde\omega^1_{ij}\, I\big\{|\tilde\omega^1_{ij}| \le |\tilde\omega^1_{ji}|\big\} + \tilde\omega^1_{ji}\, I\big\{|\tilde\omega^1_{ij}| > |\tilde\omega^1_{ji}|\big\}. \tag{2.9}$$

We shall call the estimator $\hat\Omega$ adaptive CLIME, or ACLIME. The estimator adapts to the variability of individual entries by using an entry-dependent threshold for each individual $\omega_{ij}$. Note that the optimization problem (2.6) is convex and can be cast as a linear program. The constant $\delta$ in (2.7) can be taken as 2, and the resulting estimator will be shown to be adaptively minimax rate optimal for estimating sparse precision matrices.
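Putting the two steps together, the following compact end-to-end sketch (our illustrative code; `aclime` and `_l1_min` are our names, and the strict constraint $b_{jj} > 0$ in (2.6) is left implicit rather than enforced) mirrors (2.6)-(2.9), with SciPy's `linprog` doing the $\ell_1$ minimization.

```python
import numpy as np
from scipy.optimize import linprog

def _l1_min(A_ub, b_ub, p):
    """Minimize |w|_1 over w = u - v with u, v >= 0, s.t. A_ub @ [u; v] <= b_ub."""
    res = linprog(np.ones(2 * p), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * p), method="highs")
    return res.x[:p] - res.x[p:]

def aclime(X, delta=2.0):
    """A minimal sketch of the two-step ACLIME procedure of Section 2."""
    n, p = X.shape
    lam = delta * np.sqrt(np.log(p) / n)
    Sig = np.cov(X, rowvar=False) + np.eye(p) / n   # Sigma-hat = Sigma* + I/n
    s = np.diag(Sig)
    I = np.eye(p)

    # Step 1: solve (2.6) for each column. The bound lam*sqrt(s_i s_j)*b_jj
    # is linear in b, so it is folded into the constraint matrix.
    Omega1 = np.zeros((p, p))
    for j in range(p):
        c = lam * np.sqrt(s * s[j])
        E = np.outer(c, I[j])                       # row i equals c_i * e_j^T
        A = np.vstack([np.hstack([Sig - E, -Sig + E]),
                       np.hstack([-Sig - E, Sig + E])])
        b = np.concatenate([I[j], -I[j]])
        Omega1[:, j] = _l1_min(A, b, p)

    # Truncate the diagonal estimates as in the definition of breve-omega_jj.
    breve = np.where(s <= np.sqrt(n / np.log(p)),
                     np.diag(Omega1), np.sqrt(np.log(p) / n))

    # Step 2: re-solve each column with the entry-dependent thresholds (2.8).
    Omega_t = np.zeros((p, p))
    base = np.vstack([np.hstack([Sig, -Sig]), np.hstack([-Sig, Sig])])
    for j in range(p):
        c = lam * np.sqrt(s * breve[j])
        b = np.concatenate([c + I[j], c - I[j]])
        Omega_t[:, j] = _l1_min(base, b, p)

    # Symmetrize by keeping the entry of smaller magnitude, as in (2.9).
    keep = np.abs(Omega_t) <= np.abs(Omega_t.T)
    return np.where(keep, Omega_t, Omega_t.T)
```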

REMARK 2.2. Note that $\delta = 2$ used in the constraint sets is tight; it cannot be further reduced in general. If one chooses the constant $\delta < 2$, then with probability tending to 1, the true precision matrix will no longer belong to the feasible sets. To see this, consider $\Sigma = \Omega = I_{p\times p}$ for simplicity. It follows from Liu, Lin and Shao (2008) and Cai and Jiang (2011) that

$$\sqrt{\frac{n}{\log p}}\,\max_{1\le i<j\le p}|\hat\sigma_{ij}| \to 2$$

in probability. Thus, $P\big(|\hat\Sigma\Omega - I_{p\times p}|_\infty > \lambda_n\big) \to 1$, which means that if $\delta < 2$, the true $\Omega$ lies outside of the feasible set with high probability, and solving the corresponding minimization problem cannot lead to a good estimator of $\Omega$.
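The limit in Remark 2.2 is easy to observe numerically; a quick check (ours) with $\Sigma = I$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 500
X = rng.standard_normal((n, p))          # Sigma = Omega = I
S = np.cov(X, rowvar=False)
off = np.abs(S[np.triu_indices(p, k=1)]).max()
print(np.sqrt(n / np.log(p)) * off)      # concentrates near 2 as n, p grow
```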


REMARK 2.3. The CLIME estimator uses a universal tuning parameter $\tau_n = CM_{n,p}\sqrt{\log p/n}$ which does not take into account the variations in the variances $\sigma_{ii}$ and the conditional variances $\omega_{jj}$. It will be shown that the convergence rate of CLIME obtained by Cai, Liu and Luo (2011) is not optimal. The quantity $M_{n,p}$ is the upper bound on the matrix $\ell_1$ norm, which is unknown in practice. The cross-validation method can be used to choose the tuning parameter in CLIME. However, the estimator obtained through CV can be variable and its theoretical properties are unclear. In contrast, the ACLIME procedure proposed in the present paper does not depend on any unknown parameters, and it will be shown that the estimator is minimax rate optimal.

3. Properties of ACLIME and minimax upper bounds. We now study the properties of the ACLIME estimator $\hat\Omega$ proposed in Section 2. We shall begin with the Gaussian case where $X \sim N(\mu, \Sigma)$. Extensions to non-Gaussian distributions will be discussed later. The following result shows that the ACLIME estimator adaptively attains the convergence rate of

$$M_{n,p}^{1-q}\, c_{n,p} \left(\frac{\log p}{n}\right)^{(1-q)/2}$$

over the class of sparse precision matrices $\mathcal{G}_q(c_{n,p}, M_{n,p})$ defined in (1.1) under the matrix $\ell_w$ norm losses for all $1 \le w \le \infty$. The lower bound given in Section 4 shows that this rate is indeed optimal, and thus ACLIME adapts to both sparsity patterns and this class of loss functions.

THEOREM 3.1. Suppose we observe a random sample $X_1, \ldots, X_n \stackrel{\text{i.i.d.}}{\sim} N_p(\mu, \Sigma)$. Let $\Omega = \Sigma^{-1}$ be the precision matrix. Let $\delta \ge 2$, $\log p = O(n^{1/3})$ and

$$c_{n,p} = O\big(n^{1/2-q/2}/(\log p)^{3/2-q/2}\big). \tag{3.1}$$

Then for some constant $C > 0$,

$$\inf_{\mathcal{G}_q(c_{n,p},M_{n,p})} P\left(\big\|\hat\Omega - \Omega\big\|_w \le C M_{n,p}^{1-q}\, c_{n,p}\left(\frac{\log p}{n}\right)^{(1-q)/2}\right) \ge 1 - O\big((\log p)^{-1/2}p^{-\delta^2/4+1}\big)$$

for all $1 \le w \le \infty$.

For $q = 0$, a sufficient condition for estimating $\Omega$ consistently under the spectral norm is

$$M_{n,p}\, c_{n,p}\sqrt{\frac{\log p}{n}} = o(1),\qquad\text{i.e.,}\qquad M_{n,p}\, c_{n,p} = o\left(\sqrt{\frac{n}{\log p}}\right).$$

This implies that the total number of nonzero elements on each column needs to be of smaller order than $\sqrt{n}$ in order for the precision matrix to be estimated consistently over $\mathcal{G}_0(c_{n,p}, M_{n,p})$. In Theorem 4.1 we show that the upper bound $M_{n,p}c_{n,p}\sqrt{\frac{\log p}{n}}$ is indeed rate optimal over $\mathcal{G}_0(c_{n,p}, M_{n,p})$.

REMARK 3.1. Following Remark 2.1, the condition $\log p = O(n^{1/3})$ in Theorem 3.1 can be relaxed to $\log p = o(n)$. In Theorem 3.1, the constant $\delta$ then needs to be strictly larger than 2, and the probability bound $1 - O\big((\log p)^{-1/2}p^{-\delta^2/4+1}\big)$ is replaced by $1 - O\big((\log p)^{-1/2}p^{-\delta^2/4+1+o(1)}\big)$. By a similar argument, in the following Theorems 3.2 and 3.3, we only need to assume $\log p = o(n)$.

We now consider the rate of convergence under the expectation. For technical reasons, we require the constant $\delta \ge 3$ in this case.

THEOREM 3.2. Suppose we observe a random sample $X_1, \ldots, X_n \stackrel{\text{i.i.d.}}{\sim} N_p(\mu, \Sigma)$. Let $\Omega = \Sigma^{-1}$ be the precision matrix. Let $\log p = o(n)$ and $\delta \ge 3$. Suppose that $p \ge n^{13/(\delta^2-8)}$ and

$$c_{n,p} = o\big((n/\log p)^{1/2-q/2}\big).$$

The ACLIME estimator $\hat\Omega$ satisfies, for all $1 \le w \le \infty$ and $0 \le q < 1$,

$$\sup_{\mathcal{G}_q(c_{n,p},M_{n,p})} \mathbb{E}\big\|\hat\Omega - \Omega\big\|_w^2 \le C M_{n,p}^{2-2q}\, c_{n,p}^2 \left(\frac{\log p}{n}\right)^{1-q}$$

for some constant $C > 0$.

Theorem 3.2 can be extended to non-Gaussian distributions. Let $Z = (Z_1, Z_2, \ldots, Z_p)$ be a $p$-variate random variable with mean $\mu$ and covariance matrix $\Sigma = (\sigma_{ij})_{1\le i,j\le p}$. Let $\Omega = (\omega_{ij})_{1\le i,j\le p}$ be the precision matrix. Define $Y_i = (Z_i - \mu_i)/\sigma_{ii}^{1/2}$, $1 \le i \le p$, and $(W_1, \ldots, W_p) := \Omega(Z - \mu)$. Assume that there exist some positive constants $\eta$ and $M$ such that for all $1 \le i \le p$,

$$\mathbb{E}\exp\big(\eta Y_i^2\big) \le M,\qquad \mathbb{E}\exp\big(\eta W_i^2/\omega_{ii}\big) \le M. \tag{3.2}$$

Then we have the following result.

THEOREM 3.3. Suppose we observe an i.i.d. sample $X_1, \ldots, X_n$ with the precision matrix $\Omega$ satisfying condition (3.2). Let $\log p = o(n)$ and $p \ge n^\gamma$ for some $\gamma > 0$. Suppose that

$$c_{n,p} = o\big((n/\log p)^{1/2-q/2}\big).$$

Then there is a $\delta$ depending only on $\eta$, $M$ and $\gamma$ such that the ACLIME estimator $\hat\Omega$ satisfies, for all $1 \le w \le \infty$ and $0 \le q < 1$,

$$\sup_{\mathcal{G}_q(c_{n,p},M_{n,p})} \mathbb{E}\big\|\hat\Omega - \Omega\big\|_w^2 \le C M_{n,p}^{2-2q}\, c_{n,p}^2 \left(\frac{\log p}{n}\right)^{1-q}$$

for some constant $C > 0$.


REMARK 3.2. Under condition (3.2), it can be shown that an analogous result to Lemma 2.1 in Section 2 holds with some $\delta$ depending only on $\eta$ and $M$. Thus, it can be proved that, under condition (3.2), Theorem 3.3 holds. The proof is similar to that of Theorem 3.2. A practical way to choose $\delta$ is using cross-validation.

REMARK 3.3. Theorems 3.1, 3.2 and 3.3 follow mainly from the convergence rate under the elementwise $\ell_\infty$ norm and the inequality $\|M\|_w \le \|M\|_1$ for any symmetric matrix $M$ from Lemma 7.2. The convergence rate under the elementwise $\ell_\infty$ norm plays an important role in graphical model selection and in establishing the convergence rate under other matrix norms, such as the Frobenius norm $\|\cdot\|_F$. Indeed, from the proof, Theorems 3.1, 3.2 and 3.3 hold under the matrix $\ell_1$ norm.

More specifically, under the conditions of Theorems 3.2 and 3.3 we have

$$\sup_{\mathcal{G}_q(c_{n,p},M_{n,p})} \mathbb{E}\big|\hat\Omega - \Omega\big|_\infty^2 \le C M_{n,p}^2\, \frac{\log p}{n},$$

$$\sup_{\mathcal{G}_q(c_{n,p},M_{n,p})} \mathbb{E}\big\|\hat\Omega - \Omega\big\|_1^2 \le C M_{n,p}^{2-2q}\, c_{n,p}^2 \left(\frac{\log p}{n}\right)^{1-q},$$

$$\sup_{\mathcal{G}_q(c_{n,p},M_{n,p})} \frac{1}{p}\,\mathbb{E}\big\|\hat\Omega - \Omega\big\|_F^2 \le C M_{n,p}^{2-q}\, c_{n,p} \left(\frac{\log p}{n}\right)^{1-q/2}.$$

REMARK 3.4. The results in this section can be easily extended to the weak $\ell_q$ ball with $0 \le q < 1$ to model the sparsity of the precision matrix. A weak $\ell_q$ ball of radius $c$ in $\mathbb{R}^p$ is defined as follows:

$$B_q(c) = \big\{\xi \in \mathbb{R}^p : |\xi|_{(k)}^q \le c\,k^{-1},\ \text{for all } k = 1, \ldots, p\big\},$$

where $|\xi|_{(1)} \ge |\xi|_{(2)} \ge \cdots \ge |\xi|_{(p)}$. Let

$$\mathcal{G}_q^*(c_{n,p}, M_{n,p}) = \left\{\Omega = (\omega_{ij})_{1\le i,j\le p} : \omega_{\cdot j} \in B_q(c_{n,p}),\ \|\Omega\|_1 \le M_{n,p},\ \frac{\lambda_{\max}(\Omega)}{\lambda_{\min}(\Omega)} \le M_1,\ \Omega \succ 0\right\}. \tag{3.3}$$

Theorems 3.1, 3.2 and 3.3 hold with the parameter space $\mathcal{G}_q(c_{n,p}, M_{n,p})$ replaced by $\mathcal{G}_q^*(c_{n,p}, M_{n,p})$, by a slight extension of Lemma 7.1 from the $\ell_q$ ball to the weak $\ell_q$ ball similar to equation (51) in Cai and Zhou (2012).

4. Minimax lower bounds. Theorem 3.2 shows that the ACLIME estimator adaptively attains the rate of convergence

$$M_{n,p}^{2-2q}\, c_{n,p}^2 \left(\frac{\log p}{n}\right)^{1-q} \tag{4.1}$$

under the squared matrix $\ell_w$ norm loss for $1 \le w \le \infty$ over the collection of the parameter spaces $\mathcal{G}_q(c_{n,p}, M_{n,p})$. In this section, we shall show that the rate of convergence given in (4.1) cannot be improved by any other estimator and thus is indeed optimal among all estimators, by establishing minimax lower bounds for estimating sparse precision matrices under the squared matrix $\ell_w$ norm.


THEOREM 4.1. Let $X_1, \ldots, X_n \stackrel{\text{i.i.d.}}{\sim} N_p(\mu, \Sigma)$ with $p > c_1 n^\beta$ for some constants $\beta > 1$ and $c_1 > 0$. Assume that

$$c\, M_{n,p}^q \left(\frac{\log p}{n}\right)^{q/2} \le c_{n,p} = o\big(M_{n,p}^q\, n^{(1-q)/2}(\log p)^{-(3-q)/2}\big) \tag{4.2}$$

for some constant $c > 0$. The minimax risk for estimating the precision matrix $\Omega = \Sigma^{-1}$ over the parameter space $\mathcal{G}_q(c_{n,p}, M_{n,p})$ under the condition (4.2) satisfies

$$\inf_{\hat\Omega}\sup_{\mathcal{G}_q(c_{n,p},M_{n,p})} \mathbb{E}\big\|\hat\Omega - \Omega\big\|_w^2 \ge C M_{n,p}^{2-2q}\, c_{n,p}^2 \left(\frac{\log p}{n}\right)^{1-q}$$

for some constant $C > 0$ and for all $1 \le w \le \infty$.

The proof of Theorem 4.1 is involved. We shall discuss the key technical tools and outline the important steps in the proof of Theorem 4.1 in this section. The detailed proof is given in Section 7.

4.1. A general technical tool. We use a lower bound technique introduced in Cai and Zhou (2012), which is particularly well suited for treating "two-directional" problems such as matrix estimation. The technique can be viewed as a generalization of both Le Cam's method and Assouad's lemma, two classical lower bound arguments. Let $X$ be an observation from a distribution $P_\theta$ where $\theta$ belongs to a parameter set $\Theta$ which has a special tensor structure. For a given positive integer $r$ and a finite set $B \subset \mathbb{R}^p \setminus \{0_{1\times p}\}$, let $\Gamma = \{0, 1\}^r$ and $\Lambda \subseteq B^r$. Define

$$\Theta = \Gamma \otimes \Lambda = \{\theta = (\gamma, \lambda) : \gamma \in \Gamma \text{ and } \lambda \in \Lambda\}. \tag{4.3}$$

In comparison, the standard lower bound arguments work with either $\Gamma$ or $\Lambda$ alone. For example, Assouad's lemma considers only the parameter set $\Gamma$ and Le Cam's method typically applies to a parameter set like $\Lambda$ with $r = 1$. Cai and Zhou (2012) gives a lower bound for the maximum risk over the parameter set $\Theta$ for the problem of estimating a functional $\psi(\theta)$, belonging to a metric space with metric $d$.

We need to introduce some notation before formally stating the lower bound. For two distributions $P$ and $Q$ with densities $p$ and $q$ with respect to any common dominating measure $\mu$, the total variation affinity is given by $\|P \wedge Q\| = \int p \wedge q \, d\mu$. For a parameter $\gamma = (\gamma_1, \ldots, \gamma_r)$ where $\gamma_i \in \{0, 1\}$, define

$$H(\gamma, \gamma') = \sum_{i=1}^r |\gamma_i - \gamma_i'| \tag{4.4}$$

to be the Hamming distance on $\{0, 1\}^r$.

Let $D = \operatorname{Card}(\Lambda)$. For a given $a \in \{0, 1\}$ and $1 \le i \le r$, we define the mixture distribution $\bar{P}_{a,i}$ by

$$\bar{P}_{a,i} = \frac{1}{2^{r-1}D}\sum_{\theta : \gamma_i(\theta) = a} P_\theta. \tag{4.5}$$

So $\bar{P}_{a,i}$ is the mixture distribution over all $P_\theta$ with $\gamma_i(\theta)$ fixed to be $a$ while all other components of $\theta$ vary over all possible values. In our construction of the parameter set for establishing the minimax lower bound, $r$ is the number of possibly nonzero rows in the upper triangle of the covariance matrix, and $\Lambda$ is the set of matrices with $r$ rows to determine the upper triangle matrix.

LEMMA 4.1. For any estimator $T$ of $\psi(\theta)$ based on an observation from the experiment $\{P_\theta,\ \theta \in \Theta\}$, and any $s > 0$,

$$\max_{\Theta} 2^s\, \mathbb{E}_\theta\, d^s\big(T, \psi(\theta)\big) \ge \alpha\, \frac{r}{2}\, \min_{1\le i\le r} \big\|\bar{P}_{0,i} \wedge \bar{P}_{1,i}\big\|, \tag{4.6}$$

where $\bar{P}_{a,i}$ is defined in equation (4.5) and $\alpha$ is given by

$$\alpha = \min_{\{(\theta,\theta') : H(\gamma(\theta),\gamma(\theta')) \ge 1\}} \frac{d^s\big(\psi(\theta), \psi(\theta')\big)}{H\big(\gamma(\theta), \gamma(\theta')\big)}. \tag{4.7}$$

We introduce some new notation to study the affinity $\|\bar{P}_{0,i} \wedge \bar{P}_{1,i}\|$ in equation (4.6). Denote the projection of $\theta \in \Theta$ to $\Gamma$ by $\gamma(\theta) = (\gamma_i(\theta))_{1\le i\le r}$ and to $\Lambda$ by $\lambda(\theta) = (\lambda_i(\theta))_{1\le i\le r}$. More generally we define $\gamma_A(\theta) = (\gamma_i(\theta))_{i\in A}$ for a subset $A \subseteq \{1, 2, \ldots, r\}$, a projection of $\theta$ to a subset of $\Gamma$. A particularly useful example of the set $A$ is

$$\{-i\} = \{1, \ldots, i-1, i+1, \ldots, r\},$$

for which $\gamma_{-i}(\theta) = (\gamma_1(\theta), \ldots, \gamma_{i-1}(\theta), \gamma_{i+1}(\theta), \ldots, \gamma_r(\theta))$. $\lambda_A(\theta)$ and $\lambda_{-i}(\theta)$ are defined similarly. We denote the set $\{\lambda_A(\theta) : \theta \in \Theta\}$ by $\Lambda_A$. For $a \in \{0, 1\}$, $b \in \{0, 1\}^{r-1}$, and $c \in \Lambda_{-i} \subseteq B^{r-1}$, let

$$D_i(a, b, c) = \operatorname{Card}\big\{\theta \in \Theta : \gamma_i(\theta) = a,\ \gamma_{-i}(\theta) = b \text{ and } \lambda_{-i}(\theta) = c\big\}$$

and define

$$\bar{P}_{(a,i,b,c)} = \frac{1}{D_i(a,b,c)}\sum_{\theta : \gamma_i(\theta) = a,\ \gamma_{-i}(\theta) = b,\ \lambda_{-i}(\theta) = c} P_\theta. \tag{4.8}$$

In other words, $\bar{P}_{(a,i,b,c)}$ is the mixture distribution over all $P_\theta$ with $\lambda_i(\theta)$ varying over all possible values while all other components of $\theta$ remain fixed.

The following lemma gives a lower bound for the affinity in equation (4.6). See Section 2 of Cai and Zhou (2012) for more details.

LEMMA 4.2. Let $\bar{P}_{a,i}$ and $\bar{P}_{(a,i,b,c)}$ be defined in equations (4.5) and (4.8), respectively. Then

$$\big\|\bar{P}_{0,i} \wedge \bar{P}_{1,i}\big\| \ge \operatorname*{Average}_{\gamma_{-i},\lambda_{-i}} \big\|\bar{P}_{(0,i,\gamma_{-i},\lambda_{-i})} \wedge \bar{P}_{(1,i,\gamma_{-i},\lambda_{-i})}\big\|,$$

where the average over $\gamma_{-i}$ and $\lambda_{-i}$ is induced by the uniform distribution over $\Theta$.


4.2. Lower bound for estimating sparse precision matrix. We now apply the lower bound technique developed in Section 4.1 to establish rate-sharp results under the matrix $\ell_w$ norm. Let $X_1, \ldots, X_n \stackrel{\text{i.i.d.}}{\sim} N_p(\mu, \Omega^{-1})$ with $p > c_1 n^\beta$ for some $\beta > 1$ and $c_1 > 0$, where $\Omega \in \mathcal{G}_q(c_{n,p}, M_{n,p})$. The proof of Theorem 4.1 contains four major steps. We first reduce the minimax lower bound under the general matrix $\ell_w$ norm, $1 \le w \le \infty$, to a lower bound under the spectral norm. In the second step, we construct in detail a subset $\mathcal{F}$ of the parameter space $\mathcal{G}_q(c_{n,p}, M_{n,p})$ such that the difficulty of estimation over $\mathcal{F}$ is essentially the same as that of estimation over $\mathcal{G}_q(c_{n,p}, M_{n,p})$. The third step is the application of Lemma 4.1 to the carefully constructed parameter set, and finally, in the fourth step, we calculate the factor $\alpha$ defined in (4.7) and the total variation affinity between two multivariate normal mixtures. We outline the main ideas of the proof here and leave the detailed proofs of some technical results to Section 7.

PROOF OF THEOREM 4.1. We shall divide the proof into four major steps.

Step 1: Reducing the general problem to the lower bound under the spectral norm. The following lemma implies that the minimax lower bound under the spectral norm yields a lower bound under the general matrix $\ell_w$ norm, up to a constant factor of 4.

LEMMA 4.3. Let $X_1, \ldots, X_n \stackrel{\text{i.i.d.}}{\sim} N_p(\mu, \Omega^{-1})$, and let $\mathcal{F}$ be any parameter space of precision matrices. The minimax risk for estimating the precision matrix $\Omega$ over $\mathcal{F}$ satisfies

$$\inf_{\hat\Omega}\sup_{\mathcal{F}} \mathbb{E}\big\|\hat\Omega - \Omega\big\|_w^2 \ge \frac{1}{4}\inf_{\hat\Omega}\sup_{\mathcal{F}} \mathbb{E}\big\|\hat\Omega - \Omega\big\|_2^2 \tag{4.9}$$

for all $1 \le w \le \infty$.

Step 2: Constructing the parameter set. Let $r = \lceil p/2 \rceil$ and let $B$ be the collection of all vectors $(b_j)_{1\le j\le p}$ such that $b_j = 0$ for $1 \le j \le p - r$ and $b_j = 0$ or $1$ for $p - r + 1 \le j \le p$, under the constraint $\|b\|_0 = k$ (to be defined later). For each $b \in B$ and each $1 \le m \le r$, define a $p \times p$ matrix $\lambda_m(b)$ by making the $m$th row of $\lambda_m(b)$ equal to $b$ and the rest of the entries 0. It is clear that $\operatorname{Card}(B) = \binom{r}{k}$. Set $\Gamma = \{0, 1\}^r$. Note that each component $b_i$ of $\lambda = (b_1, \ldots, b_r) \in \Lambda$ can be uniquely associated with a $p \times p$ matrix $\lambda_i(b_i)$. $\Lambda$ is the set of all matrices $\lambda$ with every column sum less than or equal to $2k$. Define $\Theta = \Gamma \otimes \Lambda$ and let $\varepsilon_{n,p} \in \mathbb{R}$ be fixed. (The exact value of $\varepsilon_{n,p}$ will be chosen later.) For each $\theta = (\gamma, \lambda) \in \Theta$ with $\gamma = (\gamma_1, \ldots, \gamma_r)$ and $\lambda = (b_1, \ldots, b_r)$, we associate $\theta$ with a precision matrix $\Omega(\theta)$ by

$$\Omega(\theta) = \frac{M_{n,p}}{2}\left(I_p + \varepsilon_{n,p}\sum_{m=1}^r \gamma_m\lambda_m(b_m)\right).$$


Finally, we define a collection $\mathcal{F}$ of precision matrices as

$$\mathcal{F} = \left\{\Omega(\theta) : \Omega(\theta) = \frac{M_{n,p}}{2}\left(I_p + \varepsilon_{n,p}\sum_{m=1}^r \gamma_m\lambda_m(b_m)\right),\ \theta = (\gamma, \lambda) \in \Theta\right\}.$$
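The map $\theta = (\gamma, \lambda) \mapsto \Omega(\theta)$ is mechanical and may be easier to read as code; a short sketch (ours) of the construction:

```python
import numpy as np

def omega_theta(gamma, rows, p, M, eps):
    """Build Omega(theta) = (M/2) (I_p + eps * sum_m gamma_m * lambda_m(b_m)).

    gamma: 0/1 vector of length r; rows: list of the r vectors b_m, each a
    0/1 vector of length p supported on the last r coordinates with k ones.
    """
    P = np.zeros((p, p))
    for m, (g, b) in enumerate(zip(gamma, rows)):
        if g:
            P[m, :] += b        # lambda_m(b_m): b in row m, zeros elsewhere
    return (M / 2.0) * (np.eye(p) + eps * P)
```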

We now specify the values of $\varepsilon_{n,p}$ and $k$. Set

$$\varepsilon_{n,p} = \upsilon\sqrt{\frac{\log p}{n}}\qquad\text{for some } 0 < \upsilon < \min\left\{\left(\frac{c}{2}\right)^{1/q},\ \sqrt{\frac{\beta-1}{8\beta}}\right\}, \tag{4.10}$$

and

$$k = \left\lceil \tfrac{1}{2}\,c_{n,p}(M_{n,p}\varepsilon_{n,p})^{-q}\right\rceil - 1, \tag{4.11}$$

which is at least 1 from equation (4.10). Now we show $\mathcal{F}$ is a subset of the parameter space $\mathcal{G}_q(c_{n,p}, M_{n,p})$. From the definition of $k$ in (4.11), note that

$$\max_{j}\sum_{i\ne j}|\omega_{ij}|^q \le 2\cdot 2^{-1}c_{n,p}(M_{n,p}\varepsilon_{n,p})^{-q}\cdot\left(\frac{M_{n,p}}{2}\varepsilon_{n,p}\right)^q \le c_{n,p}. \tag{4.12}$$

From equation (4.2), we have $c_{n,p} = o\big(M_{n,p}^q\, n^{(1-q)/2}(\log p)^{-(3-q)/2}\big)$, which implies

$$2k\varepsilon_{n,p} \le c_{n,p}\,\varepsilon_{n,p}^{1-q}M_{n,p}^{-q} = o(1/\log p), \tag{4.13}$$

then

$$\max_i\sum_j|\omega_{ij}| \le \frac{M_{n,p}}{2}(1 + 2k\varepsilon_{n,p}) \le M_{n,p}. \tag{4.14}$$

Since $\|A\|_2 \le \|A\|_1$, we have

$$\left\|\varepsilon_{n,p}\sum_{m=1}^r\gamma_m\lambda_m(b_m)\right\|_2 \le \left\|\varepsilon_{n,p}\sum_{m=1}^r\gamma_m\lambda_m(b_m)\right\|_1 \le 2k\varepsilon_{n,p} = o(1),$$

which implies that every $\Omega(\theta)$ is diagonally dominant and positive definite, and

$$\lambda_{\max}(\Omega) \le \frac{M_{n,p}}{2}(1 + 2k\varepsilon_{n,p})\qquad\text{and}\qquad \lambda_{\min}(\Omega) \ge \frac{M_{n,p}}{2}(1 - 2k\varepsilon_{n,p}), \tag{4.15}$$

which immediately implies

$$\frac{\lambda_{\max}(\Omega)}{\lambda_{\min}(\Omega)} < M_1. \tag{4.16}$$

Equations (4.12), (4.14), (4.15) and (4.16) all together imply $\mathcal{F} \subset \mathcal{G}_q(c_{n,p}, M_{n,p})$.


Step 3: Applying the general lower bound argument. Let $X_1, \ldots, X_n \stackrel{\text{i.i.d.}}{\sim} N_p\big(0, (\Omega(\theta))^{-1}\big)$ with $\theta \in \Theta$ and denote the joint distribution by $P_\theta$. Applying Lemmas 4.1 and 4.2 to the parameter space $\Theta$, we have

$$\inf_{\hat\Omega}\max_{\theta\in\Theta} 2^2\,\mathbb{E}_\theta\big\|\hat\Omega - \Omega(\theta)\big\|_2^2 \ge \alpha\cdot\frac{p}{4}\cdot\min_i \operatorname*{Average}_{\gamma_{-i},\lambda_{-i}} \big\|\bar P_{(0,i,\gamma_{-i},\lambda_{-i})} \wedge \bar P_{(1,i,\gamma_{-i},\lambda_{-i})}\big\|, \tag{4.17}$$

where

$$\alpha = \min_{\{(\theta,\theta'):H(\gamma(\theta),\gamma(\theta'))\ge 1\}}\frac{\big\|\Omega(\theta) - \Omega(\theta')\big\|_2^2}{H\big(\gamma(\theta), \gamma(\theta')\big)} \tag{4.18}$$

and $\bar P_{0,i}$ and $\bar P_{1,i}$ are defined as in (4.5).

Step 4: Bounding the per-comparison loss $\alpha$ defined in (4.18) and the affinity $\min_i \operatorname{Average}_{\gamma_{-i},\lambda_{-i}} \|\bar P_{(0,i,\gamma_{-i},\lambda_{-i})} \wedge \bar P_{(1,i,\gamma_{-i},\lambda_{-i})}\|$ in (4.17). This is done separately in the next two lemmas, which are proved in detail in Section 7.

LEMMA 4.4. The per-comparison loss $\alpha$ defined in (4.18) satisfies

$$\alpha \ge \frac{(M_{n,p}\,k\,\varepsilon_{n,p})^2}{4p}.$$

LEMMA 4.5. Let $X_1, \ldots, X_n \stackrel{\text{i.i.d.}}{\sim} N_p\big(0, (\Omega(\theta))^{-1}\big)$ with $\theta \in \Theta$ and denote the joint distribution by $P_\theta$. For $a \in \{0, 1\}$ and $1 \le i \le r$, define $\bar P_{(a,i,b,c)}$ as in (4.8). Then there exists a constant $c_1 > 0$ such that

$$\min_i \operatorname*{Average}_{\gamma_{-i},\lambda_{-i}} \big\|\bar P_{(0,i,\gamma_{-i},\lambda_{-i})} \wedge \bar P_{(1,i,\gamma_{-i},\lambda_{-i})}\big\| \ge c_1.$$

Finally, the minimax lower bound for estimating a sparse precision matrix over the collection $\mathcal{G}_q(c_{n,p}, M_{n,p})$ is obtained by putting together (4.17) and Lemmas 4.4 and 4.5:

$$\inf_{\hat\Omega}\sup_{\mathcal{G}_q(c_{n,p},M_{n,p})}\mathbb{E}\big\|\hat\Omega - \Omega\big\|_2^2 \ge \inf_{\hat\Omega}\max_{\Omega(\theta)\in\mathcal{F}}\mathbb{E}_\theta\big\|\hat\Omega - \Omega(\theta)\big\|_2^2 \ge \frac{(M_{n,p}\,k\,\varepsilon_{n,p})^2}{4p}\cdot\frac{p}{16}\cdot c_1 \ge \frac{c_1}{64}(M_{n,p}\,k\,\varepsilon_{n,p})^2 = c_2\, M_{n,p}^{2-2q}\,c_{n,p}^2\left(\frac{\log p}{n}\right)^{1-q}$$

for some constant $c_2 > 0$.
