• Aucun résultat trouvé

ASYMPTOTIC DISTRIBUTION OF JIVE IN A HETEROSKEDASTIC IV REGRESSION WITH MANY INSTRUMENTS

N/A
N/A
Protected

Academic year: 2021

Partager "ASYMPTOTIC DISTRIBUTION OF JIVE IN A HETEROSKEDASTIC IV REGRESSION WITH MANY INSTRUMENTS"

Copied!
46
0
0

Texte intégral

(1)ASYMPTOTIC DISTRIBUTION OF JIVE IN A HETEROSKEDASTIC IV REGRESSION WITH MANY INSTRUMENTS. The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters.. Citation. Chao, John C., Norman R. Swanson, Jerry A. Hausman, Whitney K. Newey, and Tiemen Woutersen. “ASYMPTOTIC DISTRIBUTION OF JIVE IN A HETEROSKEDASTIC IV REGRESSION WITH MANY INSTRUMENTS.” Econometric Theory 28, no. 01 (February 13, 2012): 42-86. © Cambridge University Press 2011. As Published. http://dx.doi.org/10.1017/s0266466611000120. Publisher. Cambridge University Press. Version. Final published version. Citable link. http://hdl.handle.net/1721.1/82651. Terms of Use. Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use..

(2) Econometric Theory, 28, 2012, 42–86. doi:10.1017/S0266466611000120. ASYMPTOTIC DISTRIBUTION OF JIVE IN A HETEROSKEDASTIC IV REGRESSION WITH MANY INSTRUMENTS JOHN C. CHAO. University of Maryland. NORMAN R. SWANSON Rutgers University. JERRY A. HAUSMAN AND WHITNEY K. NEWEY MIT. TIEMEN WOUTERSEN. Johns Hopkins University. This paper derives the limiting distributions of alternative jackknife instrumental variables (JIV) estimators and gives formulas for accompanying consistent standard errors in the presence of heteroskedasticity and many instruments. The asymptotic framework includes the many instrument sequence of Bekker (1994, Econometrica 62, 657–681) and the many weak instrument sequence of Chao and Swanson (2005, Econometrica 73, 1673–1691). We show that JIV estimators are √ asymptotically normal and that standard errors are consistent provided that K n /rn → 0 as n → ∞, where K n and rn denote, respectively, the number of instruments and the concentration parameter. This is in contrast to the asymptotic behavior of such classical instrumental variables estimators as limited information maximum likelihood, bias-corrected two-stage least squares, and two-stage least squares, all of which are inconsistent in the presence of heteroskedasticity, unless K n /rn → 0. We also show that the rate of convergence and the form of the asymptotic covariance matrix of the JIV estimators will in general depend on the strength of the instruments as measured by the relative orders of magnitude of rn and K n .. Earlier versions of this paper were presented at the NSF/NBER conference on weak and/or many instruments at MIT in 2003 and at the 2004 winter meetings of the Econometric Society in San Diego, where conference participants provided many useful comments and suggestions. Particular thanks are owed to D. Ackerberg, D. Andrews, J. Angrist, M. Caner, M. Carrasco, P. Guggenberger, J. Hahn, G. Imbens, R. Klein, N. Lott, M. Moriera, G.D.A. Phillips, P.C.B. Phillips, J. Stock, J. Wright, two anonymous referees, and a co-editor for helpful comments and suggestions. Address correspondence to: Whitney K. Newey, Department of Economics, MIT, E52-262D, Cambridge, MA 02142-1347, USA; e-mail: wnewey@mit.edu.. 42. c Cambridge University Press 2011 .

(3) JIVE WITH HETEROSKEDASTICITY. 43. 1. INTRODUCTION It has long been known that the two-stage least squares (2SLS) estimator is biased with many instruments (see, e.g., Sawa, 1968; Phillips, 1983; and the references cited therein). In large part because of this problem, various approaches have been proposed in the literature to reduce the bias of the 2SLS estimator. In recent years, there has been interest in developing procedures that use “delete-one” fitted values in lieu of the usual first-stage ordinary least squares fitted values as the instruments employed in the second stage of the estimation. A number of different versions of these estimators, referred to as jackknife instrumental variables (JIV) estimators, have been proposed and analyzed by Phillips and Hale (1977), Angrist, Imbens, and Krueger (1999), Blomquist and Dahlberg (1999), Ackerberg and Devereux (2009), Davidson and MacKinnon (2006), and Hausman, Newey, Woutersen, Chao, and Swanson (2007). The JIV estimators are consistent with many instruments and heteroskedasticity of unknown form, whereas other estimators, including limited information maximum likelihood (LIML) and bias-corrected 2SLS (B2SLS) estimators are not (see, e.g., Bekker and van der Ploeg, 2005; Ackerberg and Devereux, 2009; Chao and Swanson, 2006; Hausman et al., 2007). The main objective of this paper is to develop asymptotic theory for the JIV estimators in a setting that includes the many instrument sequence of Kunitomo (1980), Morimune (1983), and Bekker (1994) and the many weak instrument sequence of Chao and Swanson (2005). To be precise, √ we show that JIV estimators are consistent and asymptotically normal when K n /rn → 0 as n → ∞, where K n and rn denote the number of instruments and the so-called concentration parameter, respectively. In contrast, consistency of LIML and B2SLS generally requires that Krnn → 0 as n → ∞, meaning that the number of instruments is small relative to the identification strength. We show that both the rate of convergence of the JIV estimator and the form of its asymptotic covariance matrix depend on how weak the available instruments are, as measured by the relative order of magnitude of rn vis-`a-vis K n . We also show consistency of the standard errors under heteroskedasticity and many instruments. Hausman et al. (2007) also consider a jackknife form of LIML that is slightly more difficult to compute but is asymptotically efficient relative to JIV under many weak instruments and homoskedasticity. With heteroskedasticity, any of the estimators may outperform the others, as shown by Monte Carlo examples in Hausman et al. Hausman et al. also propose a jackknife version of the Fuller (1977) estimator that has fewer outliers. This paper is a substantially altered and revised version of Chao and Swanson (2004), in which we now allow for the many instrument sequence of Kunitomo (1980), Morimune (1983), and Bekker (1994). In the process of showing the asymptotic normality of JIV, this paper gives a central limit theorem for quadratic (and, more generally, bilinear) forms associated with an idempotent matrix. This theorem can be used to study estimators other than JIV. For example, it has already been used in Hausman et al. (2007) to derive the asymptotic properties of the.

(4) 44. JOHN C. CHAO ET AL.. jackknife versions of the LIML and Fuller (1977) estimators and in Chao, Hausman, Newey, Swanson, and Woutersen (2010) to derive a moment-based test. The rest of the paper is organized as follows. Section 2 sets up the model and describes the estimators and standard errors. Section 3 lays out the framework for the asymptotic theory and presents the main results of our paper. Section 4 comments on the implications of these results and concludes. All proofs are gathered in the Appendixes. 2. THE MODEL AND ESTIMATORS The model we consider is given by y = X. δ0 + ε , n×G G×1 n×1. n×1. X = ϒ + U, where n is the number of observations, G is the number of right-hand-side variables, ϒ is the reduced form matrix, and U is the disturbance matrix. For the asymptotic approximations, the elements of ϒ will implicitly be allowed to depend on n, although we suppress the dependence of ϒ on n for notational convenience. Estimation of δ0 will be based on an n × K matrix, Z , of instrumental variable observations with rank(Z ) = K . Let Z = (ϒ, Z ) and assume that E[ε|Z] = 0 and E[U |Z] = 0. This model allows for ϒ to be a linear combination of Z (i.e., ϒ = Z π, for some K × G matrix π). Furthermore, some columns of X may be exogenous, with the corresponding column of U being zero. The model also allows for Z to approximate the reduced form. For example, let X i , ϒi , and Z i denote the ith row (observation) for X, ϒ, and Z , respectively. We could let ϒi = f 0 (wi ) be a vector of unknown functions of a vector wi of underlying instruments and let Z i = ( p1K (wi ), . . . , p K K (wi )) for approximating functions pk K (w), such as power series or splines. In this case, linear combinations of Z i may approximate the unknown reduced form (e.g., Newey, 1990). To describe the estimators, let P = Z (Z  Z )−1 Z  and Pij denote the (i, j)th ele¯ −i = (Z  Z − Z i Z  )−1 (Z  X − Z i X  ) be the reduced ment of P. Additionally, let  i i form coefficients obtained by regressing X on Z using all observations except the ith. The JIV1 estimator of Phillips and Hale (1977) is obtained as  δ˜ =. n. ∑. i=1. −1 ¯ −i Z i X i . n. ∑ ¯ −i Z i yi .. i=1. Using standard results on recursive residuals, it follows that   ¯ −i Z i = X  Z (Z  Z )−1 Z i − Pii X i  (1 − Pii ) = ∑ Pij X j /(1 − Pii ). j=i.

(5) JIVE WITH HETEROSKEDASTICITY. 45. Then, we have that δ˜ = H˜ −1 ∑ X i Pij (1 − Pj j )−1 y j , i= j. H˜ =. ∑ X i Pij (1 − Pj j )−1 X j ,. i= j. where i= j denotes the double sum ∑i ∑ j=i . The JIV2 estimator proposed by Angrist et al. (1999), JIVE2, has a similar form, except that −i = (Z  Z )−1 ¯ −i . It is given by (Z  X − Z i X i ) is used in place of  δˆ = Hˆ −1 ∑ X i Pij y j , i= j. Hˆ =. ∑ X i Pij X j .. i= j. To explain why JIV2 is a consistent estimator, it is helpful to consider JIV2 as a minimizer of an objective function. As usual, the limit of the minimizer will be the minimizer of the limit under appropriate regularity conditions. We focus on δˆ ˆ to simplify the discussion. The estimator δˆ satisfies δˆ = arg minδ Q(δ), where ˆ Q(δ) =. ∑ ( yi − X i δ)Pij ( yj − X j δ).. i= j. Note that the difference between the 2SLS objective function ( y − X  δ)P( y − n ˆ X  δ) and Q(δ) is ∑i=1 Pii ( yi − X i δ)2 . This is a weighted least squares object that is a source of bias in 2SLS because its expectation is not minimized at δ0 when X i and εi are correlated. This object does not vanish asymptotically relative ˆ to E[ Q(δ)] under many (or many weak) instruments, leading to inconsistency of 2SLS. When observations are mutually independent, the inconsistency is caused ˆ by this term, so removing it to form Q(δ) makes δˆ consistent. ˆ To explain further, consider the JIV2 objective function Q(δ). Note that for  ˜ Ui (δ) = εi − Ui (δ − δ0 ) ˆ Q(δ) = Qˆ 1 (δ) + Qˆ 2 (δ) + Qˆ 3 (δ), Qˆ 2 (δ) = −2 ∑ U˜ i (δ)Pij ϒ j (δ − δ0 ), i= j. Qˆ 1 (δ) =. ∑ (δ − δ0 ) ϒi Pij ϒj (δ − δ0 ),. i= j. Qˆ 3 (δ) =. ∑ U˜ i (δ)PijU˜ j (δ).. i= j. Then by the assumptions E[U˜ i (δ)] = 0 and independence of observations, ˆ we have E[ Q(δ)|Z] = Q 1 (δ). Under the regularity conditions in Section 3, ∑i= j ϒi Pij ϒ j is positive definite asymptotically, so Q 1 (δ) is minimized at δ0 . Thus, ˆ the expectation Q 1 (δ) of Q(δ) is minimized at the true parameter δ0 ; in the terminology of Han and Phillips (2006), the many instrument “noise” term in the expected objective function is identically zero. ˆ ˆ it is also necessary that the stochastic components of Q(δ) For consistency of δ, do not dominate asymptotically. The size of Qˆ 1 (δ) (for δ = δ0 ) is proportional to the concentration parameter that we denote by rn . It√turns out that Qˆ 2 (δ) has size smaller than Qˆ 1 (δ) asymptotically but Qˆ 3 (δ) is O p ( K n ) (Lemma A1 shows that the variance of Qˆ 3 (δ) is proportional to K n ). Thus, to ensure that the expectation.

(6) 46. JOHN C. CHAO ET AL.. ˆ ˆ of √ Q(δ) dominates the stochastic part of Q(δ), it suffices to impose the restriction K n /rn → 0, which we do throughout the asymptotic theory. This condition was formulated in Chao and Swanson (2005). The estimators δ˜ and δˆ are consistent and asymptotically normal √ with heteroskedasticity under the regularity conditions we impose, including K n /rn → 0. In contrast, consistency of LIML and Fuller (1977) require K n /rn → 0 when Pii is asymptotically correlated with E[X i εi |Z]/E[εi2 |Z], as discussed in Chao and Swanson (2004) and Hausman et al. (2007). This condition is also required for consistency of the bias-corrected 2SLS estimator of Donald and Newey (2001) when Pii is asymptotically correlated with E[X i εi |Z], as discussed in Ackerberg and Devereux (2009). Thus, JIV estimators are robust to heteroskedasticity and many instruments (when K n grows as fast as rn ), whereas LIML, Fuller (1977), or B2SLS estimators are not. Hausman et al. (2007) also consider a JIV form of LIML, which is obtained by ˆ minimizing Q(δ)/[( y − X δ) ( y − X δ)]. The sum of squared residuals in the denominator makes computation somewhat more complicated; however, like LIML, it has an explicit form in terms of the smallest eigenvalue of a matrix. This JIV form of LIML is asymptotically efficient relative to δˆ and δ˜ under many weak instruments and homoskedasticity. With heteroskedasticity, δˆ and δ˜ may perform better than this estimator, as shown by Monte Carlo examples in Hausman et al.; they also propose a jackknife version of the Fuller (1977) estimator that has fewer outliers than the JIV form of LIML. ˜ note that for ξi = To motivate the form of the variance estimator for δˆ and δ,  −1 (1 − Pii ) εi , substituting yi = X i δ0 + εi in the equation for δ˜ gives δ˜ = δ0 + H˜ −1 ∑ X i Pij ξ j .. (1). i= j. After appropriate normalization, the matrix H˜ −1 will converge and a central limit theorem will apply to ∑i= j X i Pij ξ j ,which leads to a sandwich form for the asymptotic variance. Here H˜ −1 can be used to estimate the outside terms in the sandwich. The inside term, which is the variance of ∑i= j X i Pij ξ j , can be estimated by dropping terms that are zero from the variance,  the expectation, and  removing replacing ξi with an estimate, ξ˜i = (1 − Pii )−1 yi − X i δ˜ . Using the independence of the observations, E[εi |Z] = 0, and the exclusion of the i = j terms in the double sums, it follows that     E ∑ X i Pij ξ j ∑ X i Pij ξ j |Z i= j. =E. . i= j. ∑ ∑. i, j k ∈{i, / j}.  Pik Pjk X i X j ξk2 + ∑ Pij2 X i ξi X j ξ j |Z . i= j. Removing the expectation and replacing ξi with ξ˜i gives ˜ =∑ . ∑. i, j k ∈{i, / j}. Pik Pjk X i X j ξ˜k2 + ∑ Pij2 X i ξ˜i X j ξ˜ j . i= j.

(7) JIVE WITH HETEROSKEDASTICITY. 47. The estimator of the asymptotic variance of δ˜ is then given by ˜ H˜ −1 . V˜ = H˜ −1  This estimator is robust to heteroskedasticity, as it allows Var(ξi |Z) and E[X i ξi |Z] to vary over i. A vectorized form of V˜ is easier to compute. Note that for X˜ i = X i /(1 − Pii ), we have H˜ = X  P X˜ − ∑i X i Pii X˜ i . Also, let X¯ = P X, Z˜ = Z (Z  Z )−1 , and Z i and Z˜ i equal the ith row of Z and Z˜ , respectively. Then, as shown in the proof of Theorem 4, we have ˜ = . n. ∑ ( X¯ i X¯ i − X i Pii X¯ i − X¯ i Pii X i )ξˆi2. i=1. K. +∑. K. ∑. k=1 =1. . n. . ∑ Z˜ ik Z˜ i X i ξˆi. n. ∑. i=1.  Z jk Z j X j ξˆ j. .. j=1. This formula can be computed quickly by software with fast vector operations, even when n is large. An asymptotic variance estimator for δˆ can be formed in an analogous way. ˆ we can estimate the Note that Hˆ = X  P X − ∑i X i Pii X i . Also for εˆ i = yi − X i δ, middle matrix of the sandwich by ˆ = . n. ∑ ( X¯ i X¯ i − X i Pii X¯ i − X¯ i Pii X i )ˆεi2. i=1. K. +∑. K. . n. ∑ ∑ Z˜ ik Z˜ i X i εˆi. k=1 =1. i=1. . n. ∑ Z jk Z j X j εˆ j.  .. j=1. The variance estimator for δˆ is then given by ˆ Hˆ −1 . Vˆ = Hˆ −1  Here Hˆ is symmetric because P is symmetric, so a transpose is not needed for the third matrix in Vˆ . 3. MANY INSTRUMENT ASYMPTOTICS Our asymptotic theory combines the many instrument asymptotics of Kunitomo (1980), Morimune (1983), and Bekker (1994) with the many weak instrument asymptotics of Chao and Swanson (2005). All of our regularity conditions are conditional on Z = (ϒ, Z ). To state the regularity conditions, let Z i , εi ,Ui , and ϒi denote the ith row of Z , ε,U, and ϒ, respectively. Also let a.s. denote almost surely (i.e., with probability one) and a.s.n denote a.s. for n large enough (i.e., with probability one for all n large enough)..

(8) 48. JOHN C. CHAO ET AL.. Assumption 1. K = K n → ∞, Z includes among its columns a vector of ones, for some C < 1, rank(Z ) = K , and Pii ≤ C, (i = 1, . . . , n) a.s.n. In this paper, C is a generic notation for a positive constant that may be bigger or less than 1. Hence, although in Assumption 1 C is taken to be less than 1, in other parts of the paper it might not be. The restriction that rank(Z ) = K is a normalization that requires excluding redundant columns from Z . It can be verified in particular cases. For instance, when wi is a continuously distributed scalar, Z i = p K (wi ), and pk K (w) = wk−1 , it can be shown that Z  Z is nonsingular with probability one for K < n.1 The condition Pii ≤ C < 1 implies that K /n ≤ C n because K /n = ∑i=1 Pii /n ≤ C. Now, let λmin (A) denote the √ smallest eigenvalue of a symmetric matrix A and for any matrix B, let B = tr(B  B). √ Assumption 2. ϒi = Sn z i / n where Sn = S˜n diag (μ1n , . . . , μGn ), S˜n is G × G and bounded, and the smallest eigenvalue of S˜n S˜n is bounded away from zero.  2 √ √ min μ jn → Also, for each j, either μ jn = n or μ jn / n → 0, rn = 1≤ j≤G n. √  /n ≤ C and. ∞, and K /r → 0. Also, there is C > 0 such that z z ∑ n i i i=1. n. λmin ∑i=1 z i z i /n ≥ 1/C a.s.n. This condition is similar to Assumption 2 of Hansen, Hausman, and Newey (2008). It accommodates linear models where included instruments (e.g., a constant) have fixed reduced form coefficients and excluded instruments have coefficients that can shrink as the sample size grows. A leading example of such a model is a linear structural equation with one endogenous variable of the form  δ01 + δ0G X iG + εi , yi = Z i1. (2). where Z i1 is a G 1 × 1 vector of included instruments (e.g., including a constant) and X iG is an endogenous variable. Here the number of right-hand-side variables is G 1 + 1 = G. Let the reduced form be partitioned conformably with δ, as ϒi =  , ϒ ) and U = (0,U ) . Here the disturbances for the reduced form for (Z i1 iG i iG Z i1 are zero because Z i1 is taken to be exogenous. Suppose that the reduced form for X iG depends linearly on the included instrumental variables Z i1 and on an excluded instrument z iG as in 

(9)  X iG = ϒiG + UiG , rn /n z iG . ϒiG = π1 Z i1 + Here we normalize z iG so that rn determines how strongly δG is identified, and we absorb into z iG any other terms, such as unknown coefficients. For Assump , z ) and require that the second moment matrix of z is tion 2, we let z i = (Z i1 iG i bounded and bounded away from zero. This normalization allows rn to determine the strength of identification of δG . For example, if rn = n, then the coefficient on z iG does not shrink, which corresponds to strong identification of δG . If rn grows √ more slowly than n, then δG will be more weakly identified. Indeed, 1/ rn will.

(10) JIVE WITH HETEROSKEDASTICITY. 49. be the convergence rate for estimators of δG . We require rn → ∞ to avoid the weak instrument setting of Staiger and Stock (1997), where δG is not asymptotically identified. For this model, the reduced form is    . I 0 I√0 Z i1 Z√ i1 = . ϒi = π1 1 0 rn /n π1 Z i1 + rn /nz iG z iG This reduced form is as specified in Assumption 2 with . √ √ I 0 1 ≤ j ≤ G 1, , μ jn = n, μGn = rn . S˜n = π1 1 Note how this somewhat complicated specification is needed to accommodate fixed reduced form coefficients for included instrumental variables and excluded instruments with identifying power that depend on n. We have been unable to simplify Assumption 2 while maintaining the generality needed for such important cases. We will not require that z iG be known, only that it be approximated by a lin , Z  ) . Implicitly, Z ear combination of the instrumental variables Z i = (Z i1 i1 i2 and z iG are allowed to depend on n. One important case is where the excluded instrument z iG is an unknown linear combination of the instrumental variables  , Z  ) . For example, the many weak instrument setting of Chao and Z i = (Z i1 i2 Swanson (2005) is one where the reduced form is given by √ ϒiG = π1 Z i1 + (π2 / n) Z i2 for a K − G 1 dimensional vector Z i2 of excluded instrumental variables. This model can be folded into our framework by specifying that 

(11) K − G1, rn = K − G 1 . z iG = π2 Z i2 Assumption 2 will then require that 2 /n = (K − G 1 )−1 ∑(π2 Z i2 )2 ∑ ziG i.  n. i. is bounded and bounded away from zero. Thus, the second moment ∑i (π2 Z i2 )2 /n of the term in the reduced form that identifies δ0G must grow linearly √ in K , just as in Chao and Swanson (2005), leading to a convergence rate of 1/ K − G 1 = √ 1/ rn . In another important case, the excluded instrument z iG could be an unknown function that can be approximated by a linear combination of Z i . For instance, suppose that z iG = f 0 (wi ) for an unknown function f 0 (wi ) of variables wi . In this def case, the instrumental variables could include a vector p K (wi ) = ( p1K (wi ), . . . ,  p K −G 1 ,K (wi )) of approximating functions, such as polynomials or splines. Here.

(12) 50. JOHN C. CHAO ET AL..  , p K (w ) ) . For r = n, the vector of instrumental variables would be Z i = (Z i1 i n this example is like Newey (1990) where Z i includes approximating functions for the reduced form but the number of instruments can grow as fast as the sample size. Alternatively, if rn /n → 0, it is a modified version where δG is more weakly identified. Assumption 2 also allows for multiple endogenous variables with a different strength of identification for each one, i.e., for different convergence rates. In the preceding example, we maintained the scalar endogenous variable for simplicity. The rn can be thought of as a version of the concentration parameter; it determines the convergence rate of estimators of δ0G just as the concentration √ parameter does in other settings. For rn = n, the convergence rate will be n where Assumptions 1 and 2 permit K to grow as fast as the sample size. This corresponds to a many instrument asymptotic approximation like Kunitomo (1980), Morimune (1983), and Bekker (1994).√For rn growing more slowly than n, the convergence rate will be slower than 1/ n, which leads to an asymptotic approximation like that of Chao and Swanson (2005).. Assumption 3. There is a constant, C, such that conditional on Z = (ϒ, Z ), the observations (ε1 ,U1 ), . . . , (εn ,Un ) are independent, with E[εi |Z] = 0 for all i, E[Ui |Z] = 0 for all i, supi E[εi2 |Z] < C, and supi E[ Ui 2 |Z] ≤ C, a.s. In other words, Assumption 3 requires the second conditional moments of the disturbances to be bounded. 2 n z Assumption 4. There is a π K such that ∑i=1 i − π K Z i /n → 0 a.s.. This condition allows an unknown reduced form that is approximated by a linear combination of the instrumental variables. These four assumptions give the consistency result presented in Theorem 1. −1/2  ˜ Sn (δ −. THEOREM 1. Suppose that Assumptions 1–4 are satisfied. Then, rn p p p p −1/2  ˆ δ0 ) → 0, δ˜ → δ0 , rn Sn (δ − δ0 ) → 0, and δˆ → δ0 .. The following additional condition is useful for establishing asymptotic normality and the consistency of the asymptotic variance. n z 4 Assumption 5. There is a constant, C > 0, such that ∑i=1 /n 2 → 0, i 4 4 supi E[εi |Z] < C, and supi E[ Ui |Z] ≤ C a.s.. To give asymptotic normality results, we need to describe the asymptotic variances. We will outline results that do not depend on the convergence of various moment matrices, so we writethe asymptotic variances as a function of n (rather  than as a limit). Let σi2 = E εi2 |Z where, for notational simplicity, we have suppressed the possible dependence of σi2 on Z. Moreover, let H¯ n =. n. ∑ zi zi /n,. i=1. ¯n =

(13). n. ∑ zi zi σi2 /n,. i=1.

(14) JIVE WITH HETEROSKEDASTICITY. 51.  ¯ n = Sn−1 ∑ Pij2 E[Ui Ui |Z]σ j2 (1 − Pj j )−2. i= j. Hn =.  + E[Ui εi |Z](1 − Pii )−1 E[ε j U j |Z](1 − Pj j )−1 Sn−1 ,. n. ∑ (1 − Pii )zi zi /n,. i=1.

(15) n =. n. ∑ (1 − Pii )2 zi zi σi2 /n,. i=1.  n = Sn−1 ∑ Pij2 E[Ui Ui |Z]σ j2 + E[Ui εi |Z]E[ε j U j |Z] Sn−1 . . i= j. When K /rn is bounded, the conditional asymptotic variance given Z of Sn (δ˜ −δ0 ) is ¯ n + ¯ n ) H¯ n−1 , V¯n = H¯ n−1 (

(16) and the conditional asymptotic variance of Sn (δˆ − δ0 ) is Vn = Hn−1 (

(17) n + n )Hn−1 . To state our asymptotic normality results, let A1/2 denote a square root matrix for a positive semidefinite matrix A, satisfying A1/2 A1/2 = A. Also, for nonsingular A, let A−1/2 = (A1/2 )−1 . THEOREM 2. Suppose that Assumptions 1–5 are satisfied, σi2 ≥ C > 0 a.s., and K /rn is bounded. Then V¯n and Vn are nonsingular a.s.n, and d V¯n−1/2 Sn (δ˜ − δ0 ) → N (0, IG ),. Vn−1/2 Sn (δˆ − δ0 ) → N (0, IG ). d. The entire Sn matrix in Assumption 2 determines the convergence rate of the estimators, where Sn (δˆ − δ0 ) = diag (μ1n , . . . , μGn ) S˜n (δˆ − δ0 ) is asymptotically normal. The convergence rate of the linear combination ej S˜n (δˆ − δ0 ) will be 1/μ jn , where e j is the jth unit vector. Note that yi = X i δ0 + u i = z i diag (μ1n , . . . , μGn ) S˜n δ0 + Ui δ0 + εi . The expression following the second equality is the reduced form for yi . Thus, the linear combination of structural parameters ej S˜n δ0 is the jth reduced form. √ coefficient for yi that corresponds to the variable μ jn / n z ij . This reduced form coefficient is estimated at the rate 1/μ jn by the linear combination ej S˜n δˆ of the ˆ The minimum rate is 1/√rn , which is the instrumental variables (IV) estimator δ. inverse square root of the rate of growth of the concentration parameter. These rates will change when K grows faster than rn ..

(18) 52. JOHN C. CHAO ET AL.. The rate of convergence in Theorem 2 corresponds to the rate found by Stock and Yogo (2005) for LIML, Fuller’s modified LIML, and B2SLS when rn grows at the same rate as K and more slowly than n under homoskedasticity. ¯ n in the asymptotic variance of δ˜ and the term n in the asymptotic The term ˆ variance of δ account for the presence of many instruments. The order of these terms is K /rn , so if K /rn → 0, dropping these terms does not affect the asymptotic variance. When K /rn is bounded but does not go to zero, these terms have the same order as the other terms, and it is important to account for their presence in the standard errors. If K /rn → ∞, then these terms dominate and slow down the convergence rate√of the estimators. In this case, the conditional asymptotic variance given Z of rn /K Sn (δ˜ − δ0 ) is ¯ n H¯ n−1 , V¯n∗ = H¯ n−1 (rn /K ) and the conditional asymptotic variance of. √ rn /K Sn (δˆ − δ0 ) is. Vn∗ = Hn−1 (rn /K ) n Hn−1 . When K /rn → ∞, the (conditional) asymptotic variance matrices, V¯n∗ and Vn∗ , may be singular, especially when some components of X i are exogenous or when different identification strengths are present. To allow for this singularity, our asymptotic normality results are stated in terms of a linear combination of the estimator. Let L n be a sequence of × G matrices. THEOREM 3. Suppose that Assumptions 1–5 are satisfied and K. /rn → ∞. If L n is bounded and there is a C > 0 such that λmin L n V¯n∗ L n ≥ C a.s.n then −1/2

(19). d L n rn /K Sn (δ˜ − δ0 ) → N (0, I ). L n V¯n∗ L n. Also, if there is a C > 0 such that λmin L n Vn∗ L n ≥ C a.s.n, then. L n Vn∗ L n. −1/2. Ln.

(20) d rn /K Sn (δˆ − δ0 ) → N (0, I ).. √ Here the convergence rate is related to the size of rn /K Sn . In the simple √ case where δ is a scalar, we can take Sn = √rn , which gives a convergence rate of √. K /rn . Then the theorem states that rn / K (δ˜ − δ0 ) is asymptotically normal. √ It is interesting that K /rn → 0 is a condition for consistency in this setting and also in the context of Theorem 1. From Theorems 2 and 3, it is clear that the rates of convergence of both JIV estimators depend in general on the strength of the available instruments relative to their number, as reflected in the relative orders of magnitude of rn vis-`a-vis K . Note also that, whenever rn grows √ at a slower rate than n, the rate of convergence is slower than the conventional n rate of convergence. In this case, the available.

(21) JIVE WITH HETEROSKEDASTICITY. 53. instruments are weaker than assumed in the conventional strongly identified case, where the concentration parameter is taken to grow at the rate n. When Pii = Z i (Z  Z )−1 Z i goes to zero uniformly in i, the asymptotic variances n of the two JIV estimators will get close in large samples. Because ∑i=1 Pii = tr(P) = K , Pii goes to zero when K grows more slowly than n, though precise conditions for this convergence depend on the nature of Z i . As a practical matter, Pii will generally be very close to zero in applications where K is very small relative to n, making the jackknife estimators very close to each other. Under homoskedasticity, we can compare the asymptotic variances of the two JIV estimators. In this case, the asymptotic variance of δ˜ is V¯n = V¯n1 + V¯n2 ,. V¯n1 = σ 2 H¯ n−1 ,. V¯n2 = Sn−1 σ 2 E[Ui Ui ] ∑ Pij2 /(1− Pj j )2 Sn−1 i= j. + Sn−1 E[Ui εi ]E[Ui εi ]Sn−1. ∑ Pij2 (1 − Pii )−1 (1 − Pj j )−1 .. i= j. Also, the asymptotic variance of δˆ is  Vn = Vn1 + Vn2 , Vn2. =. Sn−1. Vn1. =σ. 2. Hn−1. . n. ∑ (1 − Pii ). i=1. . σ. 2. . E[Ui Ui ] + E[Ui εi ]E[Ui εi ]. 2. z i z i /n. Hn−1 ,. Sn−1 ∑ Pij2 . i= j. By the fact that (1 − Pii )−1 > 1, we have that V¯n2 ≥ Vn2 in the positive semidefinite sense. Also, note that Vn1 is the variance of an IV estimator with instruments z i (1 − Pii ) whereas V¯n1 is the variance of the corresponding least squares estimator, so V¯n1 ≤ Vn1 . Thus, it appears that in general we cannot rank the asymptotic variances of the two estimators. Next, we turn to results pertaining to the consistency of the asymptotic variance estimators and to the use of these estimators in hypothesis testing. We impose the following additional conditions. Assumption 6. There exist πn and C > 0 such that a.s. maxi≤n z i −πn Z i → 0 and supi z i ≤ C. The next result shows that our estimators of the asymptotic variance are consistent after normalization. THEOREM 4. Suppose that Assumptions 1–6 are satisfied. If K /rn is bounded, p p then Sn V˜ Sn − V¯n → 0 and Sn Vˆ Sn − Vn → 0. Also, if K /rn → ∞, then p p rn Sn V˜ Sn /K − V¯n∗ → 0 and rn Sn Vˆ Sn /K − Vn∗ → 0. A primary use of asymptotic variance estimators is conducting approximate inference concerning coefficients. To that end, we introduce Theorem 5..

(22) 54. JOHN C. CHAO ET AL.. THEOREM 5. Suppose that Assumptions 1–6 are satisfied and that a(δ) is an × 1 vector of functions such that (i) a(δ) is continuously differentiable in a neighborhood of δ0 ; (ii) there is a square matrix, Bn , such that for A = ∂a(δ0 )/∂δ  , Bn ASn−1 is bounded; and p  , we ¯ ¯ . . . , ∂a (δ)/∂δ] (iii) for any δ¯k → δ0 , (k = 1, . . . , ) and A¯ = [∂a1 (δ)/∂δ, p −1 have Bn ( A¯ − A)Sn → 0. Also suppose that there is C > 0 such that λmin (Bn ASn−1 V¯n Sn−1 A Bn ) ≥ C if K /rn is bounded or λmin (Bn ASn−1 V¯n∗ Sn−1 A Bn ) ≥ C if K /rn → ∞ a.s.n. Then ˜ for A˜ = ∂a(δ)/∂δ,   d ˜ − a(δ0 ) → N (0, I ). ( A˜ V˜ A˜  )−1/2 a(δ) If there is C ≥ 0 such that λmin (Bn ASn−1 V¯n Sn−1 A Bn ) ≥ C if K /rn is bounded or ˆ λmin (Bn ASn−1 V¯n∗ Sn−1 A Bn ) ≥ C if K /rn → ∞ a.s.n, then for Aˆ = ∂a(δ)/∂δ,   d ˆ − a(δ0 ) → ( Aˆ Vˆ Aˆ  )−1/2 a(δ) N (0, I ). Perhaps the most important special case of this result is a single linear combination. This case will lead to t-statistics based on the consistent variance estimator having the usual standard normal limiting distribution. The following result considers such a case. COROLLARY 1. Suppose that Assumptions 1–6 are satisfied and c and bn are such that bn c Sn−1 is bounded. If there is a C > 0 such that bn2 c Sn−1 V¯n Sn−1 c ≥ C if K /rn is bounded or bn2 c Sn−1 V¯n∗ Sn−1 c ≥ C if K /rn → ∞ a.s.n, then c (δ˜ − δ0 ) d

(23) → N (0, 1). c V˜ c Also if there is a C ≥ 0 such that bn2 c Sn−1 Vn Sn−1 c ≥ C if K /rn is bounded or bn2 c Sn−1 Vn∗ Sn−1 c ≥ C if K /rn → ∞ a.s.n, then c (δˆ − δ0 ) d

(24) → N (0, 1). c Vˆ c To show how the conditions of this result can be checked, we return to the previous example with one right-hand-side endogenous variable. The following result gives primitive conditions in that example for the conclusion of Corollary 1, i.e., for the asymptotic normality of a t-ratio. COROLLARY 2. If equation (2) holds, Assumptions 1–6 are satisfied for z i =  , z ), c = 0 is a constant vector, either (Z i1 iG.

(25) JIVE WITH HETEROSKEDASTICITY. 55. (i) rn = n or (ii) K /rn is bounded and (−π1 , 1)c = 0 or 2 |Z] is bounded away from zero, and the (iii) K /rn → ∞, (−π1 , 1)c = 0, E[UiG sign of E[εi UiG |Z] is constant a.s., then c (δˆ − δ0 ) d c (δ˜ − δ0 ) d

(26) → N (0, 1),

(27) → N (0, 1). c V˜ c c Vˆ c The proof of this result shows how the hypotheses concerning bn in Corollary 1 can be checked. The conditions of Corollary 2 are quite primitive. We have previously described how Assumption 2 is satisfied in the model of equation (2). Assumptions 1 and 3–6 are also quite primitive. This result can be applied to show that t-ratios are asymptotically correct when the many instrument robust variance estimators are used. For the coefficient δG of the endogenous variable, note that c = eG , so (−π1 , 1)c = 1 = 0. Therefore, 2 |Z] is bounded away from zero and the sign of E[ε U |Z] is constant, it if E[UiG i iG follows from Corollary 2 that δˆG − δ0G d  → N (0, 1). VˆGG Thus, the t-ratio for the coefficient of the endogenous variable is asymptotically correct across a wide range of different growth rates for rn and K . The analogous result holds for each coefficient δ j , j ≤ G 1 , of an included instrument as long as π1 j = 0 is not zero. If π1 j = 0, then the asymptotics are more complicated. For brevity, we will not discuss this unusual case here. The analogous results also hold for δ˜G . 4. CONCLUDING REMARKS In this paper, we derived limiting distribution results for two alternative JIV estimators. These estimators are both consistent and asymptotically normal in the presence of many instruments under heteroskedasticity of unknown form. In the same setup, LIML, 2SLS, and B2SLS are inconsistent. In the process of showing the asymptotic normality of JIV, this paper gives a central limit theorem for quadratic (and, more generally, bilinear) forms associated with an idempotent matrix. This central limit theorem has already been used in Hausman et al. (2007) to derive the asymptotic properties of the jackknife versions of the LIML and Fuller (1977) estimators and in Chao et al. (2010) to derive a moment-based test that allows for heteroskedasticity and many instruments. Moreover, this new central limit theorem is potentially useful for other analyses involving many instruments..

(28) 56. JOHN C. CHAO ET AL.. NOTE 1. The observations w1 , . . . , wn are distinct with probability one and therefore, by K < n, cannot all be roots of a K th degree polynomial. It follows that for any nonzero a there must be some i with a  Z i = a  p K (wi ) = 0, implying a  Z  Z a > 0.. REFERENCES Abadir, K.M. & J.R. Magnus (2005) Matrix Algebra. Cambridge University Press. Ackerberg, D.A. & P. Devereux (2009) Improved JIVE estimators for overidentified models with and without heteroskedasticity. Review of Economics and Statistics 91, 351–362. Angrist, J.D., G.W. Imbens, & A. Krueger (1999) Jackknife instrumental variables estimation. Journal of Applied Econometrics 14, 57–67. Bekker, P.A (1994) Alternative approximations to the distributions of instrumental variable estimators. Econometrica 62, 657–681. Bekker, P.A. & J. van der Ploeg (2005) Instrumental variable estimation based on grouped data. Statistica Neerlandica 59, 506–508. Billingsley, P. (1986) Probability and Measure, 2nd ed. Wiley. Blomquist, S. and M. Dahlberg (1999) Small sample properties of LIML and jackknife IV estimators: Experiments with weak instruments. Journal of Applied Econometrics 14, 69–88. Chao, J.C., J.A. Hausman, W.K. Newey, N.R. Swanson, & T. Woutersen (2010) Testing Overidentifying Restrictions with Many Instruments and Heteroskedasticity. Working paper, MIT. Chao, J.C. & N.R. Swanson (2004) Estimation and Testing Using Jackknife IV in Heteroskedastic Regressions with Many Weak Instruments. Working paper, Rutgers University. Chao, J.C. & N.R. Swanson (2005) Consistent estimation with a large number of weak instruments. Econometrica 73, 1673–1692. Chao, J.C. & N.R. Swanson (2006) Asymptotic normality of single-equation estimators for the case with a large number of weak instruments. In D. Corbae, S.N. Durlauf, & B.E. Hansen (eds.), Frontiers of Analysis and Applied Research: Essays in Honor of Peter C. B. Phillips, pp. 82–124. Cambridge University Press. Davidson, R. & J.G. MacKinnon (2006) The case against JIVE. Journal of Applied Econometrics 21, 827–833. Donald, S.G. & W.K. Newey (2001) Choosing the number of instruments. Econometrica 69, 1161–1191. Fuller, W.A. (1977) Some properties of a modification of the limited information estimator. Econometrica 45, 939–954. Han, C. & P.C.B. Phillips (2006) GMM with many moment conditions. Econometrica 74, 147–192. Hansen, C., J.A. Hausman, & W.K. Newey (2008) Estimation with many instrumental variables. Journal of Business & Economic Statistics 26, 398–422. Hausman, J.A., W.K. Newey, T. Woutersen, J. Chao, & N.R. Swanson (2007) IV Estimation with Heteroskedasticity and Many Instruments. Working paper, MIT. Kunitomo, N. (1980) Asymptotic expansions of distributions of estimators in a linear functional relationship and simultaneous equations. Journal of the American Statistical Association 75, 693–700. Magnus, J.R. & H. Neudecker (1988) Matrix Differential Calculus with Applications in Statistics and Econometrics. Wiley. Morimune, K. (1983) Approximate distributions of k-class estimators when the degree of overidentifiability is large compared with the sample size. Econometrica 51, 821–841. Newey, W.K. (1990) Efficient instrumental variable estimation of nonlinear models. Econometrica 58, 809–837. Phillips, G.D.A. & C. Hale (1977) The bias of instrumental variable estimators of simultaneous equation systems. International Economic Review 18, 219–228..

(29) JIVE WITH HETEROSKEDASTICITY. 57. Phillips, P.C.B. (1983) Small sample distribution theory in econometric models of simultaneous equations. In Z. Griliches & M.D. Intriligator (eds.), Handbook of Econometrics, vol. 1, pp. 449–516. North-Holland. Sawa, T. (1968) The exact sampling distribution of ordinary least squares and two-stage least squares estimators. Journal of the American Statistical Association 64, 923–937. Staiger, D. & J.H. Stock (1997) Instrumental variables regression with weak instruments. Econometrica 65, 557–586. Stock, J.H. and M. Yogo (2005) Asymptotic distributions of instrumental variables statistics with many weak instruments. In D.W.K. Andrews & J.H. Stock (eds.), Identification and Inference for Econometric Models: Essays in Honor of Thomas J. Rothenberg, pp. 109–120. Cambridge University Press.. APPENDIX A: Proofs of Theorems We define a number of notations and abbreviations that will be used in Appendixes A and B. Let C denote a generic positive constant and let M, CS, and T denote the Markov inequality, the Cauchy–Schwarz inequality, and the triangle inequality, respectively. Also, for random variables Wi , Yi , and ηi and for Z = (ϒ, Z ), let w¯ i = E[Wi |Z], W˜ i = Wi − w¯ i , y¯i = E[Yi |Z], Y˜i = Yi − y¯i , η¯ i = E[ηi |Z], η˜ i = ηi − η¯ i , y¯ = ( y¯1 , . . . ., y¯n ) , w¯ = (w¯ 1 , . . . , w¯ n ) , μ¯ Y = max | y¯i | , μ¯ η = max |η¯ i | , μ¯ W = max |w¯ i | , 1≤i≤n 1≤i≤n 1≤i≤n     2 2 σ¯ Y = max Var Yi |Z , and σ¯ W = max Var Wi |Z , i ≤n. i ≤n.   σ¯ η2 = max Var ηi |Z , i ≤n. where, to simplify notation, we have suppressed dependence on Z for the various quanti2 , σ¯ 2 , and σ¯ 2 ) defined previously. Furthermore, ties (w¯ i , W˜ i , y¯i , Y˜i , η¯ i , η˜ i , μ¯ W , μ¯ Y , μ¯ η , σ¯ W η  Y  for random variable X , define X L 2 ,Z = E X 2 |Z . We first give four lemmas that are useful in the proofs of consistency, asymptotic normality, and consistency of the asymptotic variance estimator. We group them together here for ease of reference because they are also used in Hausman et al. (2007). LEMMA A1. If, conditional on Z = (ϒ, Z ), (Wi , Yi )(i = 1, . . . , n) are independent matrix of rank K , then a.s., Wi and  Yi are scalars, and P is a symmetric, idempotent  for w¯ = E (W1 , . . . , Wn ) |Z , y¯ = E (Y1 , . . . , Yn ) |Z , σ¯ Wn = maxi≤n Var (Wi |Z)1/2 , 2 σ¯ 2 + σ¯ 2 y¯  y¯ + σ¯ 2 w  ¯ there exists a σ¯ Yn = maxi≤n Var (Yi |Z)1/2 , and Dn = K σ¯ W Wn Yn ¯ w, n Yn positive constant C such that 2. ∑ Pij Wi Y j − ∑ Pij w¯ i y¯ j . i= j i= j. ≤ C Dn. a.s.. L 2 ,Z. Proof. Let W˜ i = Wi − w¯ i and Y˜i = Yi − y¯i . Note that. ∑ Pij Wi Y j − ∑ Pij w¯ i y¯ j = ∑ Pij W˜ i Y˜j + ∑ Pij W˜ i y¯ j + ∑ Pij w¯ i Y˜j .. i= j. i= j. i= j. i= j. i= j.

(30) 58. JOHN C. CHAO ET AL..   2 σ¯ 2 . Note that for i = j and k = , E W ˜ i Y˜ j W˜ k Y˜ |Z is zero unless i = k Let D1n = σ¯ W n Yn and j = or i = and j = k. Then by CS and ∑ j Pij2 = Pii ,    2  E ∑i= j Pij Y˜i W˜ j |Z = ∑ ∑ Pij Pk E W˜ i Y˜ j W˜ k Y˜ |Z i= j k=. =. ∑ Pij2. i= j. ≤ 2D1n. . E[W˜ i2 |Z]E[Y˜ j2 |Z] + E[W˜ i Y˜i |Z]E[W˜ j Y˜ j |Z]. . ∑ Pij2 ≤ 2D1n ∑ Pii = 2D1n K .. i= j. i. Also, for W˜ = (W˜ 1 , . . . , W˜ n ) , we have ∑i= j Pij W˜ i y¯ j = W˜ P y¯ − ∑i Pii y¯i W˜ i . By   2 I a.s., so independence across i conditional on Z, we have E W˜ W˜  |Z ≤ σ¯ W n n 2 y¯  P y¯ ≤ σ¯ 2 y¯  y¯ , E[( y¯  P W˜ )2 |Z] = y¯  P E[W˜ W˜  |Z]P y¯ ≤ σ¯ W Wn n  2 ∑i Pii y¯i W˜ i |Z = ∑ Pii2 E[W˜ i2 |Z] y¯i2 ≤ σ¯ W2 n y¯  y¯ ..  E. i. Then by T we have 2. ∑i= j Pij W˜ i y¯ j . L 2 ,Z. 2. ≤ y¯  P W˜ . L 2 ,Z. 2. + ∑i Pii y¯i W˜ i . L 2 ,Z. 2 y¯  y¯ ≤ C σ¯ W n. a.s. PZ .. 2. ≤ C σ¯ Y2n w¯  w¯ a.s. The Interchanging the roles of Yi and Wi gives ∑i= j Pij w¯ i Y˜ j L 2 ,Z n conclusion then follows by T. LEMMA A2. Suppose that, conditional on Z, the following conditions hold a.s.: (i) P = P(Z) is a symmetric, idempotent matrix with rank(P) = K and Pii ≤ C < 1;   n E W W  |Z (ii) (W1n ,U1 , ε1 ), . . . , (Wnn ,Un , εn ) are independent, and Dn = ∑i=1 in in satisfies Dn ≤ C a.s.n;    |Z = 0, E[Ui |Z] = 0, E[εi |Z] = 0, and there exists a constant C such that (iii) E Win E[ Ui 4 |Z] ≤ C and E[εi4 |Z] ≤ C;   n E W 4 |Z a.s. (iv) ∑i=1 → 0; and in (v) K → ∞ as n → ∞. Then for ¯ n def = . ∑ Pij2. i= j.  E[Ui Ui |Z]E[ε 2j |Z] + E[Ui εi |Z]E[ε j U j |Z] /K. . and any sequences c1n and c2n depending on Z of conformable vectors with c1n ≤ C,  D c + c  ¯. c2n ≤ C, and n = c1n n 1n 2n n c2n > 1/C a.s.n, it follows that   √ n d −1/2   Yn =  n c1n ∑ Win + c2n ∑ Ui Pij ε j K → N (0, 1) , a.s.; i=1. a.s.. i= j. i.e., Pr(Yn ≤ y|Z) → ( y) for all y..

(31) JIVE WITH HETEROSKEDASTICITY. 59. Proof. The proof of Lemma A2 is long and is deferred to Appendix B. The next two results are helpful in proving consistency of the variance estimator. They use the same notation as Lemma A1. LEMMA A3. If, conditional on Z , (Wi , Yi )(i = 1, . . . , n) are independent and Wi and Yi are scalars, then there exists a positive constant C such that.   2. ∑i= j Pij2 Wi Y j − E ∑i= j Pij2 Wi Y j |Z . L 2 ,Z. ≤ C Bn. a.s.,.   2 σ¯ 2 + σ¯ 2 μ 2 +μ 2 σ¯ 2 . where Bn = K σ¯ W ¯ ¯ Y W Y W Y. Proof. Using the notation of the proof of Lemma A1, we have. ∑ Pij2 Wi Y j − ∑ Pij2 w¯ i y¯ j = ∑ Pij2 W˜ i Y˜j + ∑ Pij2 W˜ i y¯ j + ∑ Pij2 w¯ i Y˜j .. i= j. i= j. i= j. i= j. i= j. . . As before, for i = j and k = , E W˜ i Y˜ j W˜ k Y˜ |Z is zero unless i = k and j = or i =   and j = k. Also,  Pij  ≤ Pii < 1 by CS and Assumption 1, so Pij4 ≤ Pij2 . Also, ∑ j Pij2 = Pii , so    2  2E W ˜ i Y˜ j W˜ k Y˜ |Z E ∑i= j Pij2 W˜ i Y˜ j |Z = ∑ ∑ Pij2 Pk i= j k=. =. ∑ Pij4. i= j.          E W˜ i2 |Z E Y˜ j2 |Z +E W˜ i Y˜i |Z E W˜ j Y˜ j |Z. 2 σ¯ 2 ≤ 2σ¯ W Y. ∑ Pij4 ≤ 2K σ¯ W2 σ¯ Y2. a.s.. i= j. Also, ∑i= j Pij2 W˜ i y¯ j = W˜  P˜ y¯ − ∑i Pii2 y¯i W˜ i where P˜ij = Pij2 . By independence across i 2 I , so conditional on Z, we have E[W˜ W˜  |Z] ≤ σ¯ W n n ˜ W˜ W˜  |Z] P˜ y¯ ≤ σ¯ 2 y¯  P˜ 2 y¯ E[( y¯  P˜ W˜ )2 |Z] = y¯  PE[ Wn 2 = σ¯ W n. ∑. 2μ y¯i Pik2 Pkj2 y¯ j ≤ σ¯ W ¯ 2Y. i, j,k. 2μ = σ¯ W ¯ 2Y ∑ k. . . ∑ Pik2 i. . ∑ Pkj2 j.  2  2μ E ∑i Pii2 y¯i W˜ i |Z =∑ Pii4 E[W˜ i2 |Z] y¯i2 ≤ K σ¯ W ¯ 2Y. ∑. Pik2 Pkj2. i, j,k. 2μ 2 ≤ K σ¯ 2 μ 2 = σ¯ W ¯ 2Y ∑ Pkk W ¯Y. a.s.,. k. a.s.. i. 2. Then by T, we have ∑i= j Pij2 W˜ i y¯ j . 2. + ∑i Pii2 y¯i W˜ i L 2 ,Z L 2 ,Z L ,Z. 2 2. 2 2 2 ˜ ≤ C K σ¯ W μ¯ Y a.s. Interchanging the roles of Yi and Wi gives ∑i= j Pij w¯ i Y j L 2 ,Z 2. ≤ W˜  P˜ y¯ . ≤ C K μ¯ 2W σ¯ Y2 a.s. The conclusion then follows by T.. As a notational convention, let ∑i= j=k denote ∑i ∑ j=i ∑k ∈{i, / j} .. n.

(32) 60. JOHN C. CHAO ET AL.. LEMMA A4. Suppose that there is C > 0 such that, conditional on Z,(W1 ,Y1 , η1 ) , . . . , √ √ (Wn , Yn , ηn ) are independent with E[Wi |Z] = ai / n, E[Yi |Z] = bi / n, |ai | ≤ C, |bi | ≤ C, E[ηi2 |Z] ≤ C, Var(Wi |Z) ≤ C/rn , and Var(Yi |Z) ≤ C/rn and there exists πn such   a.s. √ that maxi≤n ai − Z  πn  → 0 and K /rn → 0. Then i. . ∑. An = E. i= j=k. . Wi Pik ηk Pkj Y j |Z = O p (1),. ∑. i= j=k. p. Wi Pik ηk Pkj Y j − An → 0.. n. Proof. Given in Appendix B. LEMMA A5. If Assumptions 1–3 are satisfied, then (i) Sn−1 H˜ Sn−1 = ∑ z i Pij (1 − Pj j )−1 z j /n + o p (1), i= j. √ (ii) Sn−1 ∑ X i Pij (1 − Pj j )−1 ε j = O p (1 + K /rn ), i= j (iii) Sn−1 Hˆ Sn−1 = ∑ z i Pij z j /n + o p (1), i= j. √ (iv) Sn−1 ∑ X i Pij ε j = O p (1 + K /rn ). i= j Proof. Let ek denote the kth unit vector and apply Lemma A1 with Yi = ek Sn−1 X i = √ −1 for some k and . By Assumption 2, z ik / n + ek Sn−1 Ui and Wi = e  Sn−1 X i (1 − Pii ) √ √. λmin (Sn ) ≥ C rn , implying Sn−1 ≤ C/ rn . Therefore a.s. √ E[Yi |Z] = z ik / n, Var(Yi |Z) ≤ C/rn , √ Var(Wi |Z) ≤ C/rn . E[Wi |Z] = z i / n(1 − Pii ), Note that a.s. √ √ K σ¯ Wn σ¯ Yn ≤ C K /rn → 0, σ¯ Yn. . √. −1/2 w¯  w¯ ≤ Crn. ∑ i. σ¯ Wn.

(33). −1/2. y¯  y¯ ≤ Crn. . ∑ zik2 /n → 0,. i  −1/2 2 −2 z i (1− Pii ) /n ≤ Crn (1−max Pii )−2 i. ∑ zi 2 /n →0. i. Because ek Sn−1 H˜ Sn−1 e = ek Sn−1 ∑i= j X i Pij X j Sn−1 e /(1 − Pj j ) = ∑i= j Yi Pij W j and Pij w¯ i y¯ j = Pij z ik z j /n(1 − Pj j ), applying Lemma A1 and the conditional version of M,  we deduce that for any υ > 0 and An = ek Sn−1 H˜ Sn−1 e − ∑i= j ek z i Pij (1 − Pj j )−1   a.s. z j e /n  ≥ υ , P (An |Z) → 0. By the dominated convergence theorem, P (An ) = E [P (An |Z)] → 0. The preceding argument establishes the first conclusion for the (k, )th element. Doing this for every element completes the proof of the first conclusion. For the second conclusion, apply Lemma A1 with Yi = ek Sn−1 X i as before and Wi = εi /(1 − Pii ). Note that w¯ i = 0 and σ¯ Wn ≤ C. Then by Lemma A1, E[{ek Sn−1. ∑ X i Pij (1 − Pj j )−1 ε j }2 |Z] ≤ C K /rn + C.. i= j. The conclusion then follows from the fact that E[An |Z] ≤ C implies An = O p (1)..

(34) JIVE WITH HETEROSKEDASTICITY. 61. For the third conclusion, apply Lemma A1 with Yi = ek Sn−1 X i as before and Wi =. e  Sn−1 X i , so a.s. √. √ K σ¯ Wn σ¯ Yn ≤ C K /rn → 0,. σ¯ Wn.

(35). y¯  y¯ ≤ Crn−1/2. . ∑ zik2 /n → 0,. √ σ¯ Yn w¯  w¯ → 0.. n. The fourth conclusion follows similarly to the second conclusion. Let H¯ n = ∑i z i z i /n and Hn = ∑i (1 − Pii )z i z i /n. LEMMA A6. If Assumptions 1–4 are satisfied, then Sn−1 H˜ Sn−1 = H¯ n + o p (1),. Sn−1 Hˆ Sn−1 = Hn + o p (1).. Proof. We use Lemma A5 and approximate the right-hand-side terms in Lemma A5 by H¯ n and Hn . Let z¯ i = ∑nj=1 Pij z j be the ith element of Pz and note that n. ∑ zi − z¯i 2 /n = (I − P)z 2 /n = tr(z  (I − P)z/n)= tr[(z− Z π K n ) (I − P)(z − Z π K n )/n]. i=1. ≤ tr[(z − Z π K n ) (z − Z π K n )/n] =. n. ∑ zi − π K n Z i 2 /n → 0. a.s. PZ .. i=1. It follows that a.s.. −1  ∑(¯z i − z i )(1 − Pii ) z i /n ≤ ∑ ¯z i − z i (1 − Pii )−1 z i /n. i i  . 2 2 ≤ ∑ ¯z i − z i /n ∑ (1 − Pii )−1 z i /n → 0. i. i. Then. ∑ zi Pij (1 − Pj j )−1 z j /n = ∑ zi Pij (1 − Pj j )−1 z j /n − ∑ zi Pii (1 − Pii )−1 zi /n. i= j. i, j. =∑. i. z¯ i (1 − Pii )−1 z i /n −. i. ∑ zi Pii (1 − Pii )−1 zi /n i. = H¯ n + ∑(¯z i − z i )(1 − Pii )−1 z i /n = H¯ n + oa.s. (1). i. The first conclusion then follows from Lemma A5 and T. Also, as in the last equation, we have. ∑ zi Pij z j /n = ∑ zi Pij z j /n − ∑ Pii zi zi /n = ∑ z¯i zi /n − ∑ Pii zi zi /n. i= j. i, j. i. i  (¯z i − z i )z i /n = Hn + oa.s. (1),. i. = Hn + ∑ i. n so the second conclusion follows similarly to the first.  . Proof of Theorem 1. First, note that by λmin Sn Sn /rn ≥ λmin S˜ S˜  ≥ C, we have. √  ˜. Sn (δ − δ0 )/ rn ≥ λmin (Sn Sn /rn )1/2 δ˜ − δ0 ≥ C δ˜ − δ0 ..

(36) 62. JOHN C. CHAO ET AL.. p p √ Therefore, Sn (δ˜ − δ0 )/ rn → 0 implies δ˜ → δ0 . Note that by Assumption 2, H¯ n is bounded and λmin ( H¯ n ) ≥ C a.s.n. For H˜ from Section 2, it follows from Lemma A6 and Assumption 2 that with probability approaching one λmin (Sn−1 H˜ Sn−1 ) ≥ C as the sample −1  size grows. Hence Sn−1 H˜ Sn−1 = O p (1). By equation (1) and Lemma A5, −1/2  ˜ Sn (δ − δ0 ) = (Sn−1 H˜ Sn−1 )−1 Sn−1. √ p rn = O p (1)o p (1) → 0.. ∑ X i Pij ξ j /. rn. i= j. All of the previous statements are conditional on Z = (ϒ, Z ) for a given sample size n, −1/2  ˜ Sn (δ −δ0 ), we have shown that for any constant v > so for the random variable Rn = rn 0, a.s. Pr( Rn ≥ v|Z) → 0. Then by the dominated convergence theorem, Pr( Rn ≥ v) = E[Pr( Rn ≥ v|Z)] → 0. Therefore, because v is arbitrary, it follows that Rn = p −1/2  ˜ Sn (δ − δ0 ) → 0. rn Next note that Pii ≤ C < 1, so in the positive semidefinite sense in large enough samples a.s., Hn = ∑(1 − Pii )z i z i /n ≥ (1 − C) H¯ n . Thus, by Assumption 2, Hn is bounded and bounded away from singularity a.s.n. Then the n rest of the conclusion follows analogously with δˆ replacing δ˜ and Hn replacing H¯ n . We now turn to the asymptotic normality results. In what follows, let ξi = εi when considering the JIV2 estimator and let ξi = εi /(1 − Pii ) when considering JIV1. Proof of Theorem 2. Define √ Yn = ∑ z i (1 − Pii )ξi n + Sn−1 i. ∑ Ui Pij ξ j .. i= j. By Assumptions 2–4,  √ 2 n E ∑i=1 (z i − z¯ i ) ξi / n |Z   a.s. n n = ∑i=1 z i − z¯ i 2 E ξi2 |Z n ≤ C ∑i=1 z i − z¯ i 2 /n → 0. Therefore, by M, Sn−1. n. √. ∑ X i Pij ξ j − Yn = ∑ (zi − z¯i ) ξi /. i= j. p. n → 0.. i=1. We now apply Lemma A2 to establish asymptotic normality of Yn conditional on Z. Let n = Var (Yn |Z), so n =. n. ∑ zi zi (1 − Pii )2 E[ξi2 |Z]/n + Sn−1 ∑ Pij2. i=1. . i= j. . × E[Ui Ui |Z]E[ξ j2 |Z] + E[Ui ξi |Z]E[U j ξ j |Z] Sn−1 ..

(37) JIVE WITH HETEROSKEDASTICITY Note that. 63. √ rn Sn−1 is bounded by Assumption 2 and that ∑i= j Pij2 /K ≤ 1, so by bounded-. ness of K /rn and Assumption 3, it follows that n ≤ C a.s.n. Also, E[ξi2 |Z] ≥ C > 0, so n ≥. n. n. i=1. i=1. ∑ zi zi (1 − Pii )2 E[ξi2 |Z]/n ≥ C ∑ zi zi /n.. Therefore, by Assumption 2, λmin ( n ) ≥ C > 0 a.s.n (for generic C that may be different. from before). It follows that n−1 ≤ C a.s.n. Let α be a G × 1 nonzero vector. Let Ui be defined as in Lemma A2 and ξi be √ −1/2 α, defined as εi in Lemma A2. In addition, let Win = z i (1 − Pii )ξi / n, c1n = n √ −1 −1/2 α. Note that condition (i) of Lemma A2 is satisfied. Also, by and c2n = K Sn n the boundedness of ∑i z i z i /n and E[ξi2 |Z] a.s.n, condition (ii) of Lemma A2 is satisfied; condition (iii) is satisfied by Assumptions 3 and 5. Also, by (1 − Pii )−1 ≤ C   n E W 4 |Z ≤ C n z 4 /n 2 a.s. → 0, so condition (iv) is and Assumption 5, ∑i=1 ∑i=1 i in −1/2. satisfied. Finally, condition (v) is satisfied by hypothesis. Note also that c1n = n α and √. √ −1/2 K /rn rn Sn−1 n α satisfy c1n ≤ C and c2n ≤ C a.s.n. This follows c2n = √ √ from the boundedness of K /rn , rn Sn−1 , and n−1 . Moreover, the n of Lemma A2 is  n = Var(c1n. n.  ∑ Win + c2n ∑ Ui Pij ξ j / i= j. i=1. √. −1/2. K |Z) = Var(α  n. Yn |Z) = α  α. by construction. Then, applying Lemma A2, we have   n √.  −1/2  −1/2 d −1/2   αα α n Y n = n ∑ c1n Win + c2n ∑ Ui Pij ξ j / K → N (0, 1) i=1. −1/2. It follows that α  n. a.s.. i= j. d −1/2 Yn → N 0, α  α a.s., so by the Cram´er–Wold device, n. d. Yn → N (0, IG ) a.s. Consider now the JIV1 estimator where ξi = εi /(1 − Pii ). Plugging this into the ex¯ n + ¯ n for

(38) ¯ n and ¯ n defined according to Assumption pression for n , we find n =

(39) −1/2 ¯ −1 1/2 ¯ Hn n 5. Let Vn also be as defined following Assumption 5 and note that Bn = V¯n −1/2 −1/2  ¯ ¯ ¯ is an orthogonal matrix because B B = V = I. Also, B is a function of V V n n n n n n. ¯ −1/2 1/2 ¯ only Z, Vn ≤ C a.s.n because λmin (Vn ) ≥ C > 0 a.s.n, and n ≤ C a.s.n. By Lemma A6, (Sn−1 H˜ Sn−1 )−1 = H¯ n−1 + o p (1). Note that if a random variable Wn sata.s.. isfies Wn ≤ C a.s.n, then Wn = O p (1) (note that 1( Wn > C) → 0 implies that E[1( Wn > C)] = Pr( Wn > C) → 0). Therefore, we have −1/2 −1 ˜ −1 −1 1/2 −1/2 ¯ −1 1/2 (Sn H Sn ) n = V¯n ( Hn + o p (1))n = Bn + o p (1). V¯n −1/2. Note that because n. d. Yn → N (0, IG ) a.s. and Bn is orthogonal to and a function d −1/2 Yn → N (0, IG ). Then by the Slutsky lemma and δ˜ = δ0 + only of Z, we have Bn n.

(40) 64. JOHN C. CHAO ET AL.. H˜ −1 ∑i= j X i Pij ξ j , for ξ j = (1 − Pj j )−1 ε j , we have −1/2  ˜ −1/2 −1 ˜ −1 −1 −1 −1 V¯n Sn (δ − δ0 ) = V¯n (Sn H Sn ) Sn. ∑ X i Pij ξ j. i= j. −1/2 −1 ˜ −1 −1 = V¯n (Sn H Sn ) [Yn + o p (1)] −1/2. = [Bn + o p (1)][n. −1/2. Yn +o p (1)] = Bn n. d. Yn +o p (1) → N (0, IG ),. which gives the first conclusion. The conclusion for JIV2 follows by a similar argument n for ξi = εi . Proof of Theorem 3. Under the hypotheses of Theorem 3, rn /K → 0, so following √ p n z (1 − P )ξ /√n → the proof of Theorem 2, we have rn /K ∑i=1 0. Then similar to i ii i √ √ √ the proof of Theorem 2, for Yn = rn Sn−1 ∑i= j Ui Pij ξ j / K , we have rn /K Sn−1 ∑i= j X i Pij ξ j = Yn + o p (1). Here let   n = Var (Yn |Z) = rn Sn−1 ∑ Pij2 E[Ui Ui |Z]E[ξ j2 |Z] + E[Ui ξi |Z]E[U j ξ j |Z] Sn−1/K . i= j. Note that by Assumptions 2 and 3, n ≤ C a.s.n. Let L¯ n be any sequence of bounded −1/2. matrices with λmin ( L¯ n n L¯ n ) ≥ C > 0 a.s.n and let Y¯n = L¯ n n L¯ n L¯ n Yn . Now let α be a nonzero vector and apply Lemma A2with Win = 0, εi = ξi , c 1n = 0, and √. −1/2 √  L¯ n rn Sn−1 . We have Var c2n c2n = α  L¯ n n L¯ n ∑i= j Ui Pij ξ j / K |Z = α  α > 0 by construction, and the other hypotheses of Lemma A2 can be verified as in the proof of d Theorem 2. Then by the conclusion of Lemma A2, it follows that α  Y¯n → N (0, α  α) a.s. d. By the Cram´er–Wold device, a.s. Y¯n → N (0, I ). Consider now the JIV1 estimator and let L n be specified as in the statement of the result such that λmin L n V¯n∗ L n ≥ C > 0 a.s.n. Let L¯ n = L n H¯ n−1 , so L n V¯n∗ L n = L¯ n n L¯ n . −1/2 . 1/2 Note that L¯ n n L¯ n ≤ C and n ≤ C a.s.n. By Lemma A6, (Sn−1 H˜ Sn−1 )−1 = H¯ n−1 + o p (1). Therefore, we have. −1/2 −1/2 L¯ n n L¯ n L n (Sn−1 H˜ Sn−1 )−1 = L¯ n n L¯ n L n ( H¯ n−1 + o p (1)). −1/2 ¯ = L¯ n n L¯ n L n + o p (1).. Note also that. √ rn /K Sn−1 ∑i= j X i Pij (1 − Pj j )−1 ε j = O p (1). Then we have. −1/2

(41) L n rn /K Sn (δ˜ − δ0 ) L n V¯n∗ L n

(42) −1/2. = L¯ n n L¯ n L n (Sn−1 H˜ Sn−1 )−1 rn /K Sn−1 =. . ∑ X i Pij (1 − Pj j )−1 ε j. i= j.  −1/2 d L¯ n n L¯ n L¯ n + o p (1) [Yn + o p (1)] = Y¯n + o p (1) → N (0, I ) .. The conclusion for JIV2 follows by a similar argument for ξi = εi .. n.

Next, we turn to the proof of Theorem 4. Let $\tilde\xi_i = (y_i - X_i'\tilde\delta)/(1-P_{ii})$ and $\xi_i = \varepsilon_i/(1-P_{ii})$ for JIV1, and let $\hat\xi_i = y_i - X_i'\hat\delta$ and $\xi_i = \varepsilon_i$ for JIV2; in what follows, $\hat\xi_i$ is understood to mean $\tilde\xi_i$ when JIV1 is being treated. Also, let $\dot X_i = S_n^{-1}X_i$ and

$$\hat\Sigma_1 = \sum_{i\neq j\neq k}\dot X_iP_{ik}\hat\xi_k^2P_{kj}\dot X_j', \qquad \hat\Sigma_2 = \sum_{i\neq j}P_{ij}^2\big(\dot X_i\dot X_i'\hat\xi_j^2 + \dot X_i\hat\xi_i\hat\xi_j\dot X_j'\big),$$
$$\dot\Sigma_1 = \sum_{i\neq j\neq k}\dot X_iP_{ik}\xi_k^2P_{kj}\dot X_j', \qquad \dot\Sigma_2 = \sum_{i\neq j}P_{ij}^2\big(\dot X_i\dot X_i'\xi_j^2 + \dot X_i\xi_i\xi_j\dot X_j'\big).$$

LEMMA A7. If Assumptions 1–6 are satisfied, then $\hat\Sigma_1 - \dot\Sigma_1 = o_p(1)$ and $\hat\Sigma_2 - \dot\Sigma_2 = o_p(K/r_n)$.

Proof. To show the first conclusion, we use Lemma A4. Note that for $\dot\delta = \tilde\delta$ and $X_i^P = X_i/(1-P_{ii})$ in the JIV1 case, and $\dot\delta = \hat\delta$ and $X_i^P = X_i$ in the JIV2 case, we have $\dot\delta \xrightarrow{p} \delta_0$ and

$$\hat\xi_i^2 - \xi_i^2 = -2\xi_iX_i^{P\prime}(\dot\delta - \delta_0) + \big(X_i^{P\prime}(\dot\delta - \delta_0)\big)^2.$$

Let $\eta_i$ be any element of $-2\xi_iX_i^P$ or $X_i^PX_i^{P\prime}$. Note that $\|S_n\|/\sqrt n$ is bounded, so by CS, $\|\Upsilon_i\| = \|S_nz_i/\sqrt n\| \le C$. Then

$$E[\eta_i^2|Z] \le CE[\xi_i^2|Z] + CE[\|X_i\|^2|Z] \le C + C\|\Upsilon_i\|^2 + CE[\|U_i\|^2|Z] \le C.$$

Let $\hat\gamma_n$ denote a sequence of random variables converging to zero in probability. By Lemma A4,

$$\hat\gamma_n\sum_{i\neq j\neq k}\dot X_iP_{ik}\eta_kP_{kj}\dot X_j' = o_p(1)O_p(1) \xrightarrow{p} 0.$$

From the preceding expression for $\hat\xi_i^2 - \xi_i^2$, we see that $\hat\Sigma_1 - \dot\Sigma_1$ is a sum of terms of the form $\hat\gamma_n\sum_{i\neq j\neq k}\dot X_iP_{ik}\eta_kP_{kj}\dot X_j'$, so by T, $\hat\Sigma_1 - \dot\Sigma_1 \xrightarrow{p} 0$.

Let $d_i = C + |\varepsilon_i| + \|U_i\|$, and let $\hat A = 1 + \|\tilde\delta\|$ and $\hat B = \|\tilde\delta - \delta_0\|$ for JIV1, and $\hat A = 1 + \|\hat\delta\|$ and $\hat B = \|\hat\delta - \delta_0\|$ for JIV2. By the conclusion of Theorem 1, we have $\hat A = O_p(1)$ and $\hat B \xrightarrow{p} 0$. Also, because $P_{ii}$ is bounded away from 1, $(1-P_{ii})^{-1} \le C$ a.s. Hence, for both JIV1 and JIV2,

$$\|X_i\| \le C + \|U_i\| \le d_i, \qquad \|\dot X_i\| \le Cr_n^{-1/2}d_i, \qquad |\hat\xi_i - \xi_i| \le C\|X_i\|\,\|\dot\delta - \delta_0\| \le Cd_i\hat B,$$
$$|\hat\xi_i| \le C\|X_i\|\,\|\delta_0 - \dot\delta\| + |\xi_i| \le Cd_i\hat A, \qquad |\hat\xi_i^2 - \xi_i^2| \le \big(|\xi_i| + |\hat\xi_i|\big)|\hat\xi_i - \xi_i| \le Cd_i(1+\hat A)d_i\hat B \le Cd_i^2\hat A\hat B,$$
$$\|\dot X_i(\hat\xi_i - \xi_i)\| \le Cr_n^{-1/2}d_i^2\hat B, \qquad \|\dot X_i\hat\xi_i\| \le Cr_n^{-1/2}d_i^2\hat A, \qquad \|\dot X_i\xi_i\| \le Cr_n^{-1/2}d_i^2.$$

Also note that because $E[d_i^2|Z] \le C$,

$$E\Big[\sum_{i\neq j}P_{ij}^2d_i^2d_j^2r_n^{-1}\,\Big|\,Z\Big] \le Cr_n^{-1}\sum_{i,j}P_{ij}^2 = Cr_n^{-1}\sum_iP_{ii} = CK/r_n,$$

so $\sum_{i\neq j}P_{ij}^2d_i^2d_j^2r_n^{-1} = O_p(K/r_n)$ by M. Then it follows that

$$\Big\|\sum_{i\neq j}P_{ij}^2\dot X_i\dot X_i'\big(\hat\xi_j^2 - \xi_j^2\big)\Big\| \le \sum_{i\neq j}P_{ij}^2\|\dot X_i\|^2\big|\hat\xi_j^2 - \xi_j^2\big| \le Cr_n^{-1}\sum_{i\neq j}P_{ij}^2d_i^2d_j^2\hat A\hat B = o_p(K/r_n).$$

We also have

$$\Big\|\sum_{i\neq j}P_{ij}^2\big(\dot X_i\hat\xi_i\hat\xi_j\dot X_j' - \dot X_i\xi_i\xi_j\dot X_j'\big)\Big\| \le \sum_{i\neq j}P_{ij}^2\Big(\|\dot X_i\hat\xi_i\|\,\|\dot X_j(\hat\xi_j - \xi_j)\| + \|\dot X_j\xi_j\|\,\|\dot X_i(\hat\xi_i - \xi_i)\|\Big) \le Cr_n^{-1}\sum_{i\neq j}P_{ij}^2d_i^2d_j^2\hat A\hat B = o_p(K/r_n).$$

The second conclusion then follows from T. ■
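The counting step used above, $\sum_{i,j}P_{ij}^2 = \sum_iP_{ii} = K$, is just idempotency of $P$ together with the fact that the trace of a projection equals its rank; a closely related count drives the $\sqrt K/r_n$ rates in Lemma A8 below. A quick numerical confirmation (ours; any full-column-rank instrument matrix works):

import numpy as np

rng = np.random.default_rng(0)
n, K = 200, 15
Z = rng.standard_normal((n, K))
P = Z @ np.linalg.solve(Z.T @ Z, Z.T)
# P idempotent and symmetric: sum_j P_ij^2 = (PP)_ii = P_ii
print(np.allclose((P**2).sum(axis=1), np.diag(P)))   # True
# hence sum_{i,j} P_ij^2 = tr(P) = K
print(np.isclose((P**2).sum(), K))                   # True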

LEMMA A8. If Assumptions 1–6 are satisfied, then

$$\dot\Sigma_1 = \sum_{i\neq j\neq k}z_iP_{ik}E[\xi_k^2|Z]P_{kj}z_j'/n + o_p(1),$$
$$\dot\Sigma_2 = \sum_{i\neq j}P_{ij}^2z_iz_i'E[\xi_j^2|Z]/n + S_n^{-1}\sum_{i\neq j}P_{ij}^2\big(E[U_iU_i'|Z]E[\xi_j^2|Z] + E[U_i\xi_i|Z]E[\xi_jU_j'|Z]\big)S_n^{-1\prime} + o_p(K/r_n).$$

Proof. To prove the first conclusion, apply Lemma A4 with $W_i$ equal to an element of $\dot X_i$, $Y_j$ equal to an element of $\dot X_j$, and $\eta_k = \xi_k^2$. Next, we use Lemma A3. Note that $\mathrm{Var}(\xi_i^2|Z) \le C$ and $r_n \le Cn$, so for $u_{ki} = e_k'S_n^{-1}U_i$,

$$E\big[(\dot X_{ik}\dot X_{i\ell})^2|Z\big] \le CE\big[\dot X_{ik}^4 + \dot X_{i\ell}^4\,|\,Z\big] \le C\big(z_{ik}^4/n^2 + E[u_{ki}^4|Z] + z_{i\ell}^4/n^2 + E[u_{\ell i}^4|Z]\big) \le C/r_n^2,$$
$$E\big[(\dot X_{ik}\xi_i)^2|Z\big] \le CE\big[z_{ik}^2\xi_i^2/n + u_{ki}^2\xi_i^2\,|\,Z\big] \le C/n + C/r_n \le C/r_n.$$

Also, if $\Lambda_i = E[U_iU_i'|Z]$, then $E[\dot X_i\dot X_i'|Z] = z_iz_i'/n + S_n^{-1}\Lambda_iS_n^{-1\prime}$ and $E[\dot X_i\xi_i|Z] = S_n^{-1}E[U_i\xi_i|Z]$. Next let $W_i$ be $\dot X_{ik}\dot X_{i\ell}$ for some $k$ and $\ell$, so

$$E[W_i|Z] = e_k'S_n^{-1}\Lambda_iS_n^{-1\prime}e_\ell + z_{ik}z_{i\ell}/n, \qquad |E[W_i|Z]| \le C/r_n, \qquad \mathrm{Var}(W_i|Z) \le E\big[(\dot X_{ik}\dot X_{i\ell})^2|Z\big] \le C/r_n^2.$$

Also let $Y_i = \xi_i^2$ and note that $|E[Y_i|Z]| \le C$ and $\mathrm{Var}(Y_i|Z) \le C$. Then in the notation of Lemma A3,

$$\sqrt K\big(\bar\sigma_{Wn}\bar\sigma_{Yn} + \bar\sigma_{Wn}\bar\mu_{Yn} + \bar\mu_{Wn}\bar\sigma_{Yn}\big) \le \sqrt K\big(C/r_n + C/r_n + C/r_n\big) \le C\sqrt K/r_n.$$

By the conclusion of Lemma A3, for this $W_i$ and $Y_i$ we have

$$\sum_{i\neq j}P_{ij}^2\dot X_{ik}\dot X_{i\ell}\,\xi_j^2 = e_k'\sum_{i\neq j}P_{ij}^2\big(z_iz_i'/n + S_n^{-1}\Lambda_iS_n^{-1\prime}\big)e_\ell\,E[\xi_j^2|Z] + O_p\big(\sqrt K/r_n\big).$$

Consider also Lemma A3 with $W_i$ and $Y_i$ equal to $\dot X_{ik}\xi_i$ and $\dot X_{i\ell}\xi_i$, respectively, so that $\bar\sigma_{Wn}\bar\sigma_{Yn} + \bar\sigma_{Wn}\bar\mu_{Yn} + \bar\mu_{Wn}\bar\sigma_{Yn} \le C/r_n$. Then, applying Lemma A3, we have

$$\sum_{i\neq j}P_{ij}^2\dot X_{ik}\xi_i\xi_j\dot X_{j\ell} = e_k'S_n^{-1}\sum_{i\neq j}P_{ij}^2E[U_i\xi_i|Z]E[\xi_jU_j'|Z]S_n^{-1\prime}e_\ell + O_p\big(\sqrt K/r_n\big).$$

Also, because $K \to \infty$, we have $O_p(\sqrt K/r_n) = o_p(K/r_n)$. The second conclusion then follows by T. ■

Proof of Theorem 4. Note that $\bar X_i = \sum_{j=1}^nP_{ij}X_j$, so

$$\sum_{i=1}^n\big(\bar X_i\bar X_i' - X_iP_{ii}\bar X_i' - \bar X_iP_{ii}X_i'\big)\hat\xi_i^2 = \sum_{i,j,k=1}^nP_{ik}P_{kj}X_iX_j'\hat\xi_k^2 - \sum_{i,j=1}^nP_{ii}P_{ij}X_iX_j'\hat\xi_i^2 - \sum_{i,j=1}^nP_{ij}P_{jj}X_iX_j'\hat\xi_j^2$$
$$= \sum_{i,j,k=1}^nP_{ik}P_{kj}X_iX_j'\hat\xi_k^2 - \sum_{i\neq j}P_{ii}P_{ij}X_iX_j'\hat\xi_i^2 - \sum_{i\neq j}P_{ij}P_{jj}X_iX_j'\hat\xi_j^2 - 2\sum_{i=1}^nP_{ii}^2X_iX_i'\hat\xi_i^2$$
$$= \sum_{i,j,\,k\notin\{i,j\}}P_{ik}P_{kj}X_iX_j'\hat\xi_k^2 - \sum_{i=1}^nP_{ii}^2X_iX_i'\hat\xi_i^2 = \sum_{i\neq j\neq k}P_{ik}P_{kj}X_iX_j'\hat\xi_k^2 + \sum_{i\neq j}P_{ij}^2X_iX_i'\hat\xi_j^2 - \sum_{i=1}^nP_{ii}^2X_iX_i'\hat\xi_i^2.$$

Also, for $Z_i'$ and $\tilde Z_i'$ equal to the $i$th rows of $Z$ and $\tilde Z = Z(Z'Z)^{-1}$, we have

$$\sum_{k=1}^K\sum_{\ell=1}^K\Big(\sum_{i=1}^n\tilde Z_{ik}\tilde Z_{i\ell}X_i\hat\xi_i\Big)\Big(\sum_{j=1}^nZ_{jk}Z_{j\ell}X_j'\hat\xi_j\Big) = \sum_{i,j=1}^n\Big(\sum_{k=1}^K\tilde Z_{ik}Z_{jk}\Big)^2X_i\hat\xi_i\hat\xi_jX_j' = \sum_{i,j=1}^n\big(\tilde Z_i'Z_j\big)^2X_i\hat\xi_i\hat\xi_jX_j' = \sum_{i,j=1}^nP_{ij}^2X_i\hat\xi_i\hat\xi_jX_j'.$$

Adding this equation to the previous one gives

$$\hat\Sigma = \sum_{i\neq j\neq k}P_{ik}P_{kj}X_iX_j'\hat\xi_k^2 + \sum_{i\neq j}P_{ij}^2X_iX_i'\hat\xi_j^2 - \sum_{i=1}^nP_{ii}^2X_iX_i'\hat\xi_i^2 + \sum_{i,j=1}^nP_{ij}^2X_i\hat\xi_i\hat\xi_jX_j' = \sum_{i\neq j\neq k}P_{ik}P_{kj}X_iX_j'\hat\xi_k^2 + \sum_{i\neq j}P_{ij}^2\big(X_iX_i'\hat\xi_j^2 + X_i\hat\xi_i\hat\xi_jX_j'\big),$$

which yields the equality in Section 2.
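The equality just derived is what makes $\hat\Sigma$ computable with $O(n^2)$ matrix operations rather than a triple sum over observations. A brute-force check on small simulated inputs (ours; the random draws merely stand in for $X$ and $\hat\xi$) confirms that the two sides agree to machine precision:

import numpy as np

rng = np.random.default_rng(1)
n, K, G = 25, 4, 2
Z = rng.standard_normal((n, K))
X = rng.standard_normal((n, G))
xi = rng.standard_normal(n)               # stands in for xi-hat
P = Z @ np.linalg.solve(Z.T @ Z, Z.T)
Xbar = P @ X                              # Xbar_i = sum_j P_ij X_j

# left side: sum_i (Xbar_i Xbar_i' - P_ii X_i Xbar_i' - P_ii Xbar_i X_i') xi_i^2
#            + sum_{i,j} P_ij^2 X_i xi_i xi_j X_j'
lhs = sum(xi[i]**2 * (np.outer(Xbar[i], Xbar[i])
                      - P[i, i] * np.outer(X[i], Xbar[i])
                      - P[i, i] * np.outer(Xbar[i], X[i])) for i in range(n))
lhs = lhs + sum(P[i, j]**2 * xi[i] * xi[j] * np.outer(X[i], X[j])
                for i in range(n) for j in range(n))

# right side: sum_{i!=j!=k} P_ik P_kj X_i X_j' xi_k^2
#             + sum_{i!=j} P_ij^2 (X_i X_i' xi_j^2 + X_i xi_i xi_j X_j')
rhs = np.zeros((G, G))
for i in range(n):
    for j in range(n):
        if i == j:
            continue
        rhs += P[i, j]**2 * (xi[j]**2 * np.outer(X[i], X[i])
                             + xi[i] * xi[j] * np.outer(X[i], X[j]))
        for k in range(n):
            if k != i and k != j:
                rhs += P[i, k] * P[k, j] * xi[k]**2 * np.outer(X[i], X[j])

print(np.allclose(lhs, rhs))              # True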

Let $\dot\sigma_i^2 = E[\xi_i^2|Z]$ and $\bar z_i = \sum_jP_{ij}z_j$, so that $\bar z_i'$ is the $i$th row of $Pz$. Then, following the same line of argument as at the beginning of this proof, with $z_i$ replacing $X_i$ and $\dot\sigma_k^2$ replacing $\hat\xi_k^2$,

$$\sum_{i\neq j\neq k}z_iP_{ik}\dot\sigma_k^2P_{kj}z_j'/n = \sum_i\dot\sigma_i^2\big(\bar z_i\bar z_i' - P_{ii}z_i\bar z_i' - P_{ii}\bar z_iz_i' + P_{ii}^2z_iz_i'\big)/n - \sum_{i\neq j}P_{ij}^2z_iz_i'\dot\sigma_j^2/n.$$

Also, as shown previously, Assumption 4 implies that $\sum_i\|z_i - \bar z_i\|^2/n \le \mathrm{tr}\big(z'(I-P)z\big)/n \to 0$ a.s. Then, by $\dot\sigma_i^2$ and $P_{ii}$ bounded a.s. $P_Z$, we have a.s.

$$\Big\|\sum_i\dot\sigma_i^2\big(\bar z_i\bar z_i' - z_iz_i'\big)/n\Big\| \le \sum_i\dot\sigma_i^2\big(2\|z_i\|\,\|z_i - \bar z_i\| + \|z_i - \bar z_i\|^2\big)/n \le C\Big(\sum_i\|z_i\|^2/n\Big)^{1/2}\Big(\sum_i\|z_i - \bar z_i\|^2/n\Big)^{1/2} + C\sum_i\|z_i - \bar z_i\|^2/n \to 0,$$
$$\Big\|\sum_i\dot\sigma_i^2P_{ii}\big(z_i\bar z_i' - z_iz_i'\big)/n\Big\| \le \Big(\sum_i\dot\sigma_i^4P_{ii}^2\|z_i\|^2/n\Big)^{1/2}\Big(\sum_i\|z_i - \bar z_i\|^2/n\Big)^{1/2} \to 0.$$

It follows that

$$\sum_{i\neq j\neq k}z_iP_{ik}\dot\sigma_k^2P_{kj}z_j'/n = \sum_i\dot\sigma_i^2(1-P_{ii})^2z_iz_i'/n - \sum_{i\neq j}P_{ij}^2z_iz_i'\dot\sigma_j^2/n + o_{a.s.}(1).$$

It then follows from Lemmas A7 and A8 and T that

$$\hat\Sigma_1 + \hat\Sigma_2 = \sum_{i\neq j\neq k}z_iP_{ik}\dot\sigma_k^2P_{kj}z_j'/n + \sum_{i\neq j}P_{ij}^2z_iz_i'\dot\sigma_j^2/n + S_n^{-1}\sum_{i\neq j}P_{ij}^2\big(E[U_iU_i'|Z]\dot\sigma_j^2 + E[U_i\xi_i|Z]E[\xi_jU_j'|Z]\big)S_n^{-1\prime} + o_p(1) + o_p(K/r_n)$$
$$= \sum_i\dot\sigma_i^2(1-P_{ii})^2z_iz_i'/n + S_n^{-1}\sum_{i\neq j}P_{ij}^2\big(E[U_iU_i'|Z]\dot\sigma_j^2 + E[U_i\xi_i|Z]E[\xi_jU_j'|Z]\big)S_n^{-1\prime} + o_p(1) + o_p(K/r_n),$$

because the $\sum_{i\neq j}P_{ij}^2z_iz_i'\dot\sigma_j^2/n$ terms cancel. Then for JIV1, where $\xi_i = \varepsilon_i/(1-P_{ii})$ and $\dot\sigma_i^2 = \sigma_i^2/(1-P_{ii})^2$, we have

$$\hat\Sigma_1 + \hat\Sigma_2 = \bar\Omega_n + \bar\Psi_n + o_p(1) + o_p(K/r_n).$$

For JIV2, where $\xi_i = \varepsilon_i$ and $\dot\sigma_i^2 = \sigma_i^2$, we have

$$\hat\Sigma_1 + \hat\Sigma_2 = \Omega_n + \Psi_n + o_p(1) + o_p(K/r_n).$$

Consider the case where $K/r_n$ is bounded, implying $o_p(K/r_n) = o_p(1)$. Then, because $\bar H_n^{-1}$, $\bar\Omega_n + \bar\Psi_n$, $H_n^{-1}$, and $\Omega_n + \Psi_n$ are all bounded a.s.n, Lemma A6 implies

$$S_n'\tilde VS_n = \big(S_n^{-1}\tilde HS_n^{-1\prime}\big)^{-1}\big(\hat\Sigma_1 + \hat\Sigma_2\big)\big(S_n^{-1}\tilde H'S_n^{-1\prime}\big)^{-1} = \big(\bar H_n^{-1} + o_p(1)\big)\big(\bar\Omega_n + \bar\Psi_n + o_p(1)\big)\big(\bar H_n^{-1} + o_p(1)\big) = \bar V_n + o_p(1),$$
$$S_n'\hat VS_n = V_n + o_p(1),$$

which gives the first conclusion.
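The first conclusion is what licenses the usual sandwich standard errors here. A sketch of the full computation for JIV2 (ours, not the authors' code; it assumes $\hat H = \sum_{i\neq j}P_{ij}X_iX_j'$ as in Section 2 and uses the $O(n^2)$ form of $\hat\Sigma$ verified above):

import numpy as np

def jive2_with_se(y, X, Z):
    P = Z @ np.linalg.solve(Z.T @ Z, Z.T)
    P_od = P - np.diag(np.diag(P))
    H = X.T @ P_od @ X                    # H-hat = sum_{i != j} P_ij X_i X_j'
    delta = np.linalg.solve(H, X.T @ P_od @ y)
    xi = y - X @ delta                    # xi-hat_i = y_i - X_i' delta-hat
    Xbar = P @ X
    d = np.diag(P)
    w = xi**2
    B = X * xi[:, None]
    # Sigma-hat in matrix form (equal to the sum form by the identity above)
    Sigma = ((Xbar.T * w) @ Xbar
             - (X.T * (w * d)) @ Xbar
             - (Xbar.T * (w * d)) @ X
             + B.T @ P**2 @ B)
    V = np.linalg.solve(H, np.linalg.solve(H, Sigma).T).T   # H^{-1} Sigma H^{-1}
    return delta, np.sqrt(np.diag(V))

A JIV1 version would presumably replace P_od by the $(1-P_{jj})^{-1}$-reweighted matrix from the earlier sketch and divide each residual by $1 - P_{ii}$, matching the definition of $\tilde\xi_i$ above.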

For the second result, consider the case where $K/r_n \to \infty$. Then for JIV1, where $\xi_i = \varepsilon_i/(1-P_{ii})$ and $\dot\sigma_i^2 = \sigma_i^2/(1-P_{ii})^2$, the almost sure boundedness of …
