Factor and factor loading augmented estimators for panel regression

(1)

1219

“Factor and factor loading augmented estimators for panel regression”

Jad Beyhum and Eric Gautier

May 2021

(2)

Factor and factor loading augmented estimators for panel regression

Jad B eyhum

^∗

Eric G autier

^†

Abstract

This paper considers linear panel data models where the dependence of the regressors and the unobservables is modelled through a factor structure. The asymptotic setting is such that the number of time periods and the sample size both go to infinity. Non-strong factors are allowed and the number of factors can grow to infinity with the sample size.

We study a class of two-step estimators of the regression coefficients. In the first step, factors and factor loadings are estimated. Then, the second step corresponds to the panel regression of the outcome on the regressors and the estimates of the factors and the factor loadings from the first step. Different methods can be used in the first step while the second step is unique. We derive sufficient conditions on the first-step estimator and the data generating process under which the two-step estimator is asymptotically normal. Assumptions under which using an approach based on principal components analysis in the first step yields an asymptotically normal estimator are also given. The two-step procedure exhibits good finite sample properties in simulations.

KEYWORDS: panel data, interactive fixed effects, factor models, flexible unobserved het- erogeneity, principal components analysis.

∗ORSTAT, KU Leuven,[email protected]. This work was undertaken at the Toulouse School of Economics, Universit´e Toulouse Capitole.

†Toulouse School of Economics, Universit´e Toulouse Capitole,[email protected].

Financial support from the European Research Council (ERC POEMH grant agreement No. 337665) and the French National Research Agency (grant ANR-17-EURE-0010) is gratefully acknowledged. The authors thank Domenico Giannone, Jihyun Kim, Pascal Lavergne, Thierry Magnac and Nour Meddahi for helpful comments and ideas.

(3)

1 Introduction

This paper considers inference on β ∈R^K in the following model:

Y_it =

K

X

k=1

β_kX_kit+

rN

X

j=1

λ_ijf_tjδ_j +E_it, , (1.1)

where the data consists of the outcome Y_it and the regressors X_kit for all k = 1, . . . , K, i = 1, . . . , N and t = 1. . . , T. The random vectors λ_i and f_t in R^r^N are factor loadings and factors, δ is a nonrandom vector in R^r^N,r_N is the number of factors.

This is a panel data model with interactive fixed effects (see?). It allows for flexible cross- section and serial correlation thanks to the factor structure in the regression error. Several techniques have been developed to estimate this model. ? proposes to estimate jointly the regression coefficient and the factors and factor loadings. ? and ? study a nuclear- norm penalized estimator. In contrast, the CCE estimator of ? and the factor-augmented regression estimator studied in ? and ? model the dependence between the regressors and the unobservables PrN

j=1λ_ijf_tjδ_j +E_it. They assume that, for k ∈ {1, . . . , K}, there exists λ_k1, . . . , λ_kN which are random vectors in R^r^N and mean-zero errors E₁, . . . , E_K which are N×T random matrices such that X_kit =PrN

r=1λ_kirf_tr+E_kit fork ∈ {1, . . . , K}.This means that the regressors have a factor structure with the same factors as the error term but possibly different factor loadings.

In the papers of?,? and?, a strong factor assumption is imposed. It means that the ratio of the singular values of Γ and √

N T has a finite deterministic limit as N, T → ∞, where Γ_it = PrN

r=1λ_ijf_tj. It holds if (1/N)PN

i=1λ_iλ^>_i (1/T)PT

t=1f_tf_t^> has a finite deterministic limit in probability. The number of factors is also assumed to be fixed with the sample size.

It is worth noting that some papers have sought to relax these assumptions in the context of the CCE (?) and factor augmented (?) estimators.

This paper proposes instead to model the dependence of the regressors with both the factors and the factor loadings by assuming that there exists δ_k ∈ R^r^N for k ∈ {1, . . . , K} and errors E₁, . . . , E_K which areN ×T random matrices such that

X_kit=

rN

X

j=1

λ_ijf_tjδ_kj+E_kit, k ∈ {1, . . . , K}. (1.2) The role of the vectors δ, ..., δ_K is to model the dependence between the regressors and the unobservables PrN

j=1λ_ijf_tjδ_j +E_it. The structure that we impose can be seen as the gener- alisation to dimension 3 (the third dimension being the one of variables) of the usual factor

(4)

models for matrices as in ?. Such a modelling was already introduced in the psychometrics literature in ? and ?. The mathematical foundations behind this approach lie in the tensor decomposition literature, see ? for a survey.

We study a class of two-step estimators of the proposed model ((1.1) and (1.2)). In the first step , the factors and the factor loadings are estimated. Then, in the second step the outcome is regressed on the covariates augmented by estimates of the factors and the factor loadings. We provide sufficient conditions on the first-step estimator under which the two-step estimator is asymptotically normal. We present assumptions under which a first-step estimator based on principal components analysis (henceforth PCA) satisfies these conditions. All the results are developed under an asymptotic regime where the sample size N goes to infinity and T is a function of N going to infinity with N. Moreover, the number of factors is unknown and allowed to grow (possibly to infinity) with the sample size. Factors are not assumed to be strong. The proposed principal components augmented estimator exhibits better finite sample properties than alternatives in Monte-Carlo simulations.

When a strong factor assumption is imposed and the number of factors is assumed to be fixed, the proposed two-step estimator is found to be asymptotically normal under weaker conditions on N and T than for the factor-augmented estimator in ?. This suggests that augmenting the panel regression with estimates of the factor loadings leads to improved estimation properties. The estimator of Section 4.7.1 of ? is a special case of the two-step procedure of this paper. In this other article, a first-step estimator based on hard-thresholding of a nuclear-norm penalized estimator is used. The procedure is pivotal in the sense that it does not require knowledge of the variance of the error terms and that the thresholding level is data-driven. The first-step estimator uses a penalty which level depend on the distribution of the operator norms of the errors while the approach with PCA that we develop here does not. It also relies on the fact that a compatibility constant is bounded away from 0 with probability approaching 1. Such an assumption is absent in the present paper.

This paper is organized as follows. The two-step estimator is introduced in Section 2.

Sufficient conditions for asymptotic normality are derived in Section 3. Section 4 is devoted to the analysis of the two-step procedure when PCA is used in the first step. Section 5 describes our simulations. All the proofs are deferred to the Appendix.

Preliminaries. The transpose of a N×T matrix Ais writtenA^> and its trace is tr(A). Its k^th singular value is σ_k(A) and rank(A) is its rank. A = Prank(A)

k=1 σ_k(A)u_k(A)v_k(A)^> is the singular value decomposition of A, where {u_k(A)}^rank(A)_k=1 is a family of orthonormal vectors ofR^N and{v_k(A)}^rank(A)_k=1 is a family of orthonormal vectors ofR^T. The scalar product in the space of N ×T matrices is hA, Bi = tr(A^>B). The nuclear norm is |A|_∗ =Prank(A)

k=1 σ_k(A), and the operator norm is |A|_op = σ₁(A) = max

h∈R^T s.t.|h|₂=1

|Ah|₂. For two integers, N and T,

(5)

N ∨T is the maximum of N and T , N ∧T is the minimum of N and T and bNc is the integer part of N. For N ∈N, I_N is the identity matrix of size N.

We consider sequences of data generating processes indexed by N. T is a function of N that goes to infinity with N. This paper studies an asymptotic whereN goes to infinity. For a probabilistic event A, its complement is denoted A^c and we write that A happens with probability approaching 1 or w.p.a 1 if P(A)→1.

2 The estimator

The model can be rewritten in matrix form asY = Π₀+E₀,X_k = Π_k+E_kfork ∈ {1, . . . , K}, where Π_kit = Pr_N

j=1λ_ijf_tjδ_kj for i ∈ {1, . . . , N} and t ∈ {1, . . . , T}, Π₀ = PK

k=1β_kΠ_k+ Γ, Γ_it =Pr_N

r=1λ_irf_trδ_r and E₀ =PK

k=1β_kE_k+E. Notice thatE₀ and E are different. E is the remainder term in (1.1), while E₀ is the remainder term in the expression of Y as the sum of a term with a statistical factor structure and a remainder. Remark also that we do not assume that the error terms E, E₀, . . . , E_K have mean zero, hence they can be the sum of an error term with mean zero and a small remainder as in ?.

Let Π_u = (Π₀, . . . ,Π_K), Π_v = ((Π₀)^>, . . . ,(Π_K)^>). For z = u, v, we denote by P_z the projector on the vector space spanned by the columns of Π_z and M_z the projector on the orthogonal of the vector space spanned by the columns of Π_z. Letr_z be the rank of Π_z. Note that r_z ≤r_N, by definition.

The proposed estimator is as follows. In a first step, one estimates M_u and M_v by estimators Mc_u and Mc_v. From there, the estimator of β is

βb∈argmin

b∈R^K

Eb₀−

K

X

k=1

b_kEb_k

2

, (2.1)

where Eb0 =McuYMcv and Ebk =McuXkMcv for k ∈ {1, . . . , K}.

As argued in the introduction, the estimator (2.1) can be seen as the regression of the outcome on the regressors and estimated factor loadings and factors as shown in the following lemma. Let us introduce rbu = rank

IN −Mcu

, brv = rank

IT −Mcv

and Xit = (X1it, . . . , XKit)^>.

Lemma 2.1 Let {λbi}^N_i=1 (resp. {fbt}^T_t=1) be a family of vectors in R^r^b^u (resp. R^b^r^v) such that {(bλ1j. . . ,bλN j)^>}^r_j=1^b^u (resp. {(fb1j. . . ,fbT j)^>}^b^r_j=1^v ) is a generating family of the orthogonal of

(6)

the null space of Mc_u (resp. Mc_v). Then, it holds that

βb∈argmin

b∈R^K

min

φ₁, . . . , φ_T ∈R^b^r^u, l₁, . . . , l_N ∈R^b^r^v

N

X

i=1 T

X

t=1

Y_it−X_it^>b−bλ^>_i φ_t−l^>_i fb_t2

.

3 Sufficient assumptions for asymptotic normality

In this section, we present sufficient conditions for asymptotic normality of βb and consis- tent estimation of its asymptotic variance. The first assumption concerns the asymptotic behaviour of the error matrices. For a N ×T matrix A, defineAe=M_uAM_v.

Assumption 3.1 The following holds:

(i) There exists a K × K positive definite matrix Σ such that, for k, l ∈ {1, . . . , K}, D

Eek,Eel

E

/(N T)−→^P Σkl; (ii) There existsσ >0such that

Ee

2

2/(N T)−→^P σ²andD

Ee_k,EeEK k=1/√

N T −→ N^d (0, σ²Σ). This assumption is similar to Assumption 9 (v) and (vi) in ?. The next lemma provides sufficient conditions for Assumption 3.1.

Lemma 3.1 Assume that (i) E[|P_uE|₂+|EP_v|₂] +PK

k=1E[|P_uE_k|₂+|E_kP_v|₂] =o_P(√ N T);

(ii) There exists a positive definite matrixΣsuch that, fork, l ∈ {1, . . . , K},hE_k, E_li/(N T)−→^P Σ_kl;

(iii) For k ∈ {1, . . . , K}, hE_k, P_uEi/|P_uE|₂ = O_P (1) and hE_k, M_uEP_vi/|M_uEP_v|₂ = O_P(1);

(iv) There exists σ >0 such that |E|²₂/(N T)−→^P σ² and (hE_k, Ei)^K_k=1/√

N T −→ N^d (0, σ²Σ).

Then, conditions (i) and (ii) in Assumption 3.1 hold.

The next corollary gives an example of data generating process under which Assumption3.1 holds.

Corollary 3.1 Let us assume that E, E₁, . . . , E_k are independent, r_u +r_v = o_P √

N ∧T and there exists σ, σ₁, . . . , σ_k >0 such that {E_it}_it are i.i.d. N(0, σ²) and {E_kit}_it are i.i.d.

N(0, σ_k²). If also (E, E₁, . . . , E_k) is independent of (Π_u,Π_v), then Assumption 3.1 holds.

(7)

The last set of conditions concerns the performance of the estimators of the projectors Mc_u and Mc_v. Let {u_N}_N and {v_N}_N be real-valued sequences such that

Mc_u−M_u

2 = O_P(u_N) and

Mc_v−M_v 2

=O_P(v_N). Let also {h_N}_N and {ρ_N}_N be real-valued sequences such that max

k∈{0,...,K}|Π_k+E_k|₂ = O_P(h_N) and max

k∈{0,...,K}|E_k|_op = O_P(ρ_N). The estimators satisfy the following assumption.

Assumption 3.2 The following holds:

(i) Mc_u and Mc_v are symmetric almost surely;

(ii) P(br_u =r_u)→1 and P(rb_v =r_v)→1;

(iii) u_N ∨v_N =o(1) and h²_N =O(N T);

(iv) √

2r_N(u_N ∨v_N)ρ²_N =o(N T);

(v) u_Nv_Nh²_N =o(N T);

(vi) √

2r_N(u_N ∨v_N)ρ_N =o√ N T

; (vii) u_Nv_Nρ_Nh_N =o√

N T

;

This assumption plays a similar role as conditions (i) to (iv) in Assumption 9 in ?. It is difficult to understand the strength of Assumption 3.2 without examples of u_N and v_N for specific first-step estimators. Hence, we discuss it in Section 4, where we derive the properties of Mc_u and Mc_v when they are estimated by a method relying on PCA. The next theorem constitutes the main result of this paper.

Theorem 3.1 (Asymptotic Normality) Under assumptions 3.1 and 3.2, we have

√

N T(βb−β)−→ N^d 0, σ²Σ⁻¹ .

Also, fork, l∈ {1, . . . , K}, Σb_kl =D

Eb_k,Eb_lE

/(N T)−→^P Σ_klandbσ² =

Eb₀−PK

k=1βb_kEb_k

2 2

/(N T)−→^P σ².

(8)

4 Estimation of the projectors using principal compo- nents analysis

4.1 Strength of the factors

In this section, we discuss the estimation of the projectors using a method based on PCA. We make assumptions regarding the asymptotic behaviour ofσ_j(Π_z) forz =u, v. The purpose of this subsection is to show that there exists data generating processes (henceforth DGP) that generate various asymptotic behaviours of the singular values of Π_z. When σ_j(Π_z)/√

N T has a finite deterministic limit in probability, then we say that the j^th factor is strong. The following lemma shows that there exists a wide variety of DGP under which such a strong factor assumption holds. Let Λ = (λ₁, . . . , λ_N), F = (f₁, . . . , f_N), and ∆ = (δ₀, . . . , δ_K), where δ₀ =δ+PK

k=1β_kδ_k.

Lemma 4.1 Assume that r_N is fixed and

(i) There exists a r_N ×r_N positive definite matrix Σ_Λ such that ΛΛ^>/N −→^P Σ_Λ; (ii) There exists a r_N ×r_N positive definite matrix Σ_F such that F F^>/T −→^P Σ_F; (iii) ∆∆^> does not depend onN.

Then, for z =u, v, the ratio of the singular values of Π_z and√

N T has a finite deterministic limit in probability.

If instead σ_j(Π_z)/α_jN has a finite deterministic limit for α_jN = o√ N T

, then the j^th factor is not strong. For a detailed discussion of the concept of non-strong factors, see ?.

The following lemma shows how to generate non-strong factors and a growing number of factors in the case where F, Λ and ∆ are nonrandom.

Lemma 4.2 Let {α_jN}_N for j ∈N be real-valued sequences with positive values. Maintain (i) Λ is nonrandom and (ΛΛ^>)_jj =I_r_N;

(ii) F is nonrandom and F F^> is equal to the rN × rN diagonal matrix with coefficients α²_1r_N, . . . , α²_jN;

(iii) ∆∆^> is a diagonal matrix such that α²_1N ∆∆^>

11≥ · · · ≥α²_r

NN ∆∆^>

rNrN. Then, for z =u, v and r ∈N, σj(Πz) =αjN

q

(∆∆^>)_jj.

Notice that the last two lemmas give sufficient conditions for Assumption 8 in ?.

(9)

4.2 Convergence results

The econometrician can use different methods to estimate the projectors M_u and M_v. The approach in ? relies on a nuclear-norm penalised estimator followed by hard-thresholding of the singular values. It has the advantage of being data-driven, in the sense that it does not use any knowledge of the number of factors or the variance of the errors. Another interesting and computationally advantageous procedure is the double IV estimator of ?. In this paper, we focus the theoretical presentation on yet another method, based on the PCA. r_u and r_v are estimated via the eigenvalue ratio estimator from ?. For z =u, v, let us define

br_z ∈ argmax

j∈{^1,...,b^√^N^∧Tc}

σ_j(Y_z)

σ_j+1(Y_z), (4.1)

where Y_u = (Y, X₁, . . . , X_K) and Y_v = (Y^>, X₁^>, . . . , X_K^>). It may be that there exists r ∈ n

1, . . . ,j√

N ∧Tko

, σ_j+1(Y_z) = 0. To ensure that the estimators are defined, throughout this section, we use the convention that the division of a positive number by 0 is equal to

∞. The estimator in ? is of the form br_z ∈ argmaxj∈{1,...,bd^∗(N∧T)c}σ_j(Y_z)/σ_j+1(Y_z), where d^∗ ∈ (0,1]. Therefore, the estimators in (4.1) correspond to the one in ? for a particular choice of d^∗. Our theoretical analysis is different from the one of ? because it allows for non-strong factors and a growing number of factors. Contrarily to the estimators in ?, the advantage of the eigenvalue ratio estimator is that it does not require to choose a penalty level.

To ensure consistency of the eigenvalue ratio estimator, we make the following assumption.

Let Eu = (E0, . . . , EK) and Ev = (E₀^>, . . . , E_K^>).

Assumption 4.1 (Eigenvalue Ratio) For z = u, v, it holds that r_z ≤ √

N ∧T almost surely, |E_z|_op =o_P (σ_r_z(Π_z)) and there exists C <1 such that

P

max

j∈{1,...,rz−1}

σ_j(Π_z) σ_j+1(Π_z)

∨ max

j∈{^r^z^+1,...,b^√^N^∧Tc}

|E_z|_op σ_r_z_+j(E_z)

!

≤C σ_r_z(Π_z) σ_2r_z₊₁(E_z)

!

→1.

Let us give sufficient conditions fo Assumption 4.1.

Lemma 4.3 For z =u, v, assume that rz ≤√

N ∧T, |Ez|_op =OP

√ N ∨T

, |Ez|²₂/(N T) has a finite deterministic limit in probability and there exists a sequence {zN}N such that σrz(Πz) = OP(zN),√

N ∨T =o(zN)andmaxj∈{1,...,rz−1}σj(Πz)/σj+1(Πz) = oP

zN/√

N ∨T

, then Assumption 4.1 holds.

This Lemma shows that our assumption allows for non-strong factors and a growing number of factors. The condition max_j∈{1,...,r_z_−1}σ_j(Π_z)/σ_j+1(Π_z) =o_P

z_N/√

N ∨T

implies that

(10)

the singular values of Π_z cannot decrease too quickly with j ∈ {1, . . . , r_z}. The assumption that |E_z|_op =O_P √

N ∨T

is standard in the panel data literature and holds under flexible cross-sectional and serial correlations. For a detailed discussion, see Appendix A.1 in ?. Let us now state the main result regarding the eigenvalue ratio estimator.

Lemma 4.4 Under Assumption 4.1, we have P(rb_u =r_u,br_v =r_v)→1.

Given the estimators rb_u and br_v, we set Mc_u = I_N − P_bru

j=1u_j(Y_u)u_j(Y_u)^> and Mc_v = I_T −Pr_bv

j=1u_j(Y_v)u_j(Y_v)^>. Then, we have the following theorem which states the rates of convergence of the estimators of the projectors.

Theorem 4.1 Forz =u, v, ifP(brz =rz)→1, we have

Mcz−Mz

2 =OP

√

rz|Ez|_op/σrz(Πz)

.

4.3 Examples

Let us now show how Assumption 3.2 can hold under different assumptions on the singular values of Πu,Πv, Eu and Ev. In both examples, we assume that, for z = u, v, |Ez|_op = O_P√

N ∨T

and |E_z|²₂/(N T) has a deterministic finite limit. The assumption on the errors implies that we can choose ρ_N =√

N ∨T because |E_k|_op ≤ |E_u|_op.

Example 1. In this first example, we assume that r_N is fixed and that the strong factor assumption holds, that is, for z = u, v and j ∈ {1, . . . , r}, σ_j(Π_z)/√

N T has a finite deterministic limit. This implies that we can choose h_N = √

N T because |Π_u|₂ = O_P √ N T and that Assumption 4.1 holds by Lemma 4.3. Theorem 4.1 yields u_N =v_N = 1/√

N ∧T. All conditions in Assumption 3.2 except (vii) are satisfied whatever the value of N and T. Condition (vii) holds if√

N ∨T /(N ∧T) =o(1). The latter correponds to the condition for asymptotic normality of the debiased estimator in ? and is weaker than the conditions for asymptotic normality in ?.

Example 2. In this case, we assume that r_u = r_v = r_N can grow with the sample size, and that, for z =u, v and j ∈ {1, . . . , r_N}, √

r_Nσ_j(Π_z)/√

N T has a finite deterministic limit.

This is a case with non-strong factors and a growing number of factors. This implies that we can choose h_N =√

N T because |Π_u|₂ =O_P√ N T

and that Assumption 4.1 holds by Lemma 4.3. From Theorem 4.1, we obtain uN =vN =rN/√

N ∧T. Conditions (i)-(iii) hold for any value of N, T and r_N. For (iv)- (vi) to hold, it is enough that r

3 2

N/(N ∧T) = o(1).

Finally, condition (vii) is satisfied if r_N²√

N ∨T /(N ∧T) = o(1).

(11)

5 Simulations

We consider a data generating process with a single regressor and two factors:

Y_it =X_1it+λ_i1f_t1+λ_i2f_t2+E_it, X1it = 1

2λi1ft1+λi2ft2+E1it,

where f_tl, λ_il, E_1it, and E_it for all indices are mutually independent, f_tl ∼ N(1/2,1), λ_il ∼ N(1,1) andE_1it, . . . , E_Kit and E_itare standard normals. The matrixX₁ has a statistical factor structure with a low-rank component of rank 2. Recall thatβb^LS ∈argminb ∈R|Y −bX₁|²₂ is the least-squares estimator of the linear regression of the outcome on the regressors.

βb^{F A} ∈ argminb ∈R

YMcv−bX1Mcv

2

2 is the factor augmented regression estimator where Mcv is computed as in Section4. βe⁽¹⁾ andβe⁽²⁾ are the two-stage estimators of Section 4.7.1 in

?. They are computed as in the simulations of that paper, without using within transforms.

βe⁽²⁾ uses Bai’s estimator as a second stage whileβe2 uses the approach of this paper with two projectors and a first-step based on hard-thresholding of a nuclear-norm penalized estimator.

Finally, βb^{P CA} is the estimator (2.1), using the procedure of Section 4 as the first-stage.

Tables1and2compare the performance of the estimators in terms of mean squared error (henceforth MSE), bias, standard error (henceforth std) and coverage of 95% confidence intervals, for different sample sizes. The coverage is not reported for βbLS because the latter is not asymptotically normal for the DGP that we consider. We use 7300 Monte-Carlo replications which allows for an accuracy of ±0.005 with 95% for the coverage probabilities of 95% confidence intervals. In this simulation exercise, our estimator exhibits better finite samples properties than the studied alternatives.

Table 1: N =T = 50

βb^LS βb^{F A} βe⁽¹⁾ βe⁽²⁾ βb^{P CA} MSE 0.884 0.004 0.14 0.13 0.004 bias 0.939 -0.011 0.321 0.275 0.012 std 0.055 0.191 0.023 0.234 0.063

coverage 0.75 0.22 0.37 0.90

(12)

Table 2: N =T = 150

βb^LS βb^{F A} βe⁽¹⁾ βe⁽²⁾ βb^{P CA} MSE 0.887 10⁻⁴ 4 10⁻⁵ 4 10⁻⁵ 4 10⁻⁵

bias 0.9414 -0.007 -5 10⁻⁵ -8 10⁻⁶ -5 10⁻⁵

std 0.031 0.007 0.007 0.007 0.007

coverage 0.79 0.95 0.95 0.95

Appendix

Proof of Lemma 2.1.

Let b ∈ R^k, Y_i = (Y_i1, . . . , Y_iT)^> and X_i = (X_i1, . . . , X_iT)^> for i ∈ {1, . . . , N}, Y_t = (Y_1t, . . . , Y_{N t})^> and X_t = (X_1t, . . . , X_{N t})^> for t∈ {1, . . . , T} and

ϕ(b) = min

φ₁, . . . , φ_T ∈R^b^r^u, l₁, . . . , l_N ∈R^b^r^v

N

X

i=1 T

X

t=1

Y_it−X_it^>b−bλ^>_i φ_t−l^>_i fb_t2

.

By algebra, we have

ϕ(b) = min

φ₁, . . . , φ_T ∈R^b^r^u, l₁, . . . , l_N ∈R^b^r^v

N

X

i=1

Y_i−X_ib−(φ₁, . . . , φ_T)^>bλ_i−

fb₁, . . . ,fb_T>

l_i

2

.

Then, by definition of Mc_v, it holds

ϕ(b) = min

φ₁, . . . , φ_T ∈R^r^b^u

N

X

i=1

Mc_v

Y_i−X_ib−(φ₁, . . . , φ_T)^>bλ_i

2 2.

Because Mc_v is symmetric, this implies

ϕ(b) = min

φ₁, . . . , φ_T ∈R^b^r^u

Y −

K

X

k=1

bkXk−

bλ1, . . . ,bλN

>

(φ1, . . . , φT)

! Mcv

2

. (5.1)

(13)

Next, by definition of Mc_u, we obtain

ϕ(b)≥

Mc_u Y −

K

X

k=1

b_kX_k

! Mc_v

2

.

Hence, because the value of

min φ₁, . . . , φ_T ∈R^b^r^u

Y −

K

X

k=1

b_kX_k−

bλ₁, . . . ,bλ_N^>

(φ₁, . . . , φ_T)

! Mc_v

2

is Mc_u

Y −PK

k=1b_kX_k Mc_v

2

2 when

λb₁, . . . ,bλ_N^>

φ_t =

I_N −Mc_u

(Y_t−X_tb), we get

ϕ(b) =

Mcu Y −

K

X

k=1

bkXk

! Mcv

2

.

Proof of Lemma 3.1.

Proof that Assumption 3.1 (i) holds. For k ∈ {1, . . . , K}, we have E_k−M_uE_kM_v =P_uE_k+M_uE_kP_v.

By Markov’s inequality and the fact that M_u is a projector, we have

|P_uE_k|₂+|M_uE_kP_v|₂ ≤ |P_uE_k|₂+|E_kP_v|₂ =O_P(E[|P_uE_k|₂+|E_kP_v|₂]). By condition (i) in Lemma 3.1, this yields |P_uE_k|₂+|M_uE_kP_v|₂ =o_P√

N T

. This implies

|M_uE_kM_v−E_k|₂ =o_P √ N T

. By the Cauchy-Schwarz inequality, we obtain hM_uE_kM_v, E_li − hE_k, E_li=hM_uE_kM_v −E_k, E_li=o_P(N T) because |E_k|₂ =O_P √

N T

. We get 1

N T D

Eek,Eel

E

= 1

N T hMuEkMv, Eli= 1

N T hEk, Eli+oP(1)−→^P Σkl, by condition (ii) in Lemma 3.1.

Proof that Assumption 3.1 (ii) holds. The proof of Ee

2

2/(N T) −→^P σ² is similar to the

(14)

proof that Assumption 3.1 (i) holds. By conditions (i) and (iii) in Lemma 3.1, we have hP_uE_k, Ei=O_P(1)|P_uE_k|₂ =O_P(1)o_P√

N T

=o_P √ N T

and similarly hMuEkPv, Ei=oP

√ N T

. Next, this yields

hE_k, Ei − hM_uE_kM_v, Ei =hP_uE_k, Ei+hM_uE_kP_v, Ei=o_P√ N T

. We obtain that

√1 N T

D

Ee_k,EeEK

k=1 = 1

√

N T (hM_uE_kM_v, Ei)^K_k=1 = 1

√

N T (hE_k, Ei)^K_k=1+o_P(1) −→ N^d 0, σ²Σ ,

by condition (iv) in Lemma 3.1.

Proof of Corollary 3.1.

Let us prove that the assumptions of Proposition3.1are satisfied. (ii) and (iv) in Proposition 3.1 are direct consequences of the weak law of large numbers and the central limit theorem.

Concerning (i) in Proposition3.1, fork ∈ {1, . . . , K}, by Lemma A.3 in ? and the fact that E, E₁, . . . , E_K are independent of Π_u and Π_v, we have

E

|P_uE_k|²₂

=

T

X

t=1

E

" _N X

i=1

(P_uE_k)²_it

#

=

T

X

t=1

E

"

E

" _N X

i=1

(P_uE_k)²_it

P_u

##

=T r_uσ_k² =o√ N T

Similarly, one can show that E

|E_kP_v|²₂

=o√ N T

. In the same manner, we obtain that E[|P_uE|₂+|EP_v|₂] = o(√

N T). To prove (iii), just notice that conditionally on P_uE, we have hP_uE, E_k,i/|P_uE|₂ ∼ N(0, σ²_k), hence, for any M ≥0, it holds that

P

|hP_uE_k, Ei|

|P_uE|₂σ_k > M

P_uE

≤2(1−Φ⁻¹(M)),

where Φ is the cumulative distribution function of a N(0,1) distribution. This implies P

|hP_uE_k, Ei|

|P_uE|₂σ_k > M

≤2(1−Φ⁻¹(M)).

Therefore, we obtain thathPuEk, Ei/|PuE|₂ =OP(1). The proof thathEk, MuEPvi/|MuEPv|₂ = OP(1) is the same.

(15)

Proof of Theorem 3.1.

Proof of asymptotic normality.

Because Mc_u and Mc_v are symmetric, a solution to (2.1) satisfies, for l = 1, . . . , K,

*

Mc_uX_lMc_v, Y −

K

X

k=1

βb_kX_k +

= 0, hence

*

M_uX_lM_v,+E+

K

X

k=1

β_k−βb_k X_k

+

=

*

M_u−Mc_u

X_lM_v, E+

K

X

k=1

β_k−βb_k X_k

+

* MuXl

Mv −Mcv

, E+

K

X

k=1

βk−βbk

Xk

+

−

*

M_u−Mc_u X_l

M_v−Mc_v

,Γ +E+

K

X

k=1

β_k−βb_k X_k

+ ,

so

K

X

k=1

β_k−βb_k

hM_uX_lM_v, X_ki −D

M_u−Mc_u

X_lM_v, X_kE

−D MuXl

Mv−Mcv

, Xk

E

+ D

Mu−Mcu

Xl

Mv−Mcv

, Xk

E

! +

D

Mu−Mcu

XlMv,+E E

+ D

MuXl

Mv −Mcv

,+E

E

−D

M_u −Mc_u X_l

M_v −Mc_v

,Γ +EE

. (5.2)

Let us show that hM_uX_lM_v, X_ki, which by Assumption 3.1 (i) diverges like N T, is the high-order term multiplying

β_k−βb_k

in (5.2). This also yields the consistency of the estimator of the covariance matrix. For a matrix M and r ∈ N, let us define |M|²_2,r = Pr

k=1σ_k(M)². By symmetry of the projectors, Theorem C.5 in ?, and Assumption 3.2 (ii)

(16)

which implies rank

M_u−Mc_u

≤2r_N w.p.a. 1

, we have

D

M_u−Mc_u

X_lM_v, X_kE

≤

M_u −Mc_u 2

X_lM_vX_k^>

2,2rN

≤ √

2r_N +o_P(1)

M_u−Mc_u 2

|X_lM_v|_op|X_kM_v|_op

=OP

√2rNuNρ²_N

=oP(N T) (by Assumption 3.2 (iv)).

We bound similarly D

M_uX_l

M_v −Mc_v , X_kE

, and, for the fourth term, use that

D

M_u−Mc_u X_l

M_v −Mc_v , X_kE

≤

Mu−Mcu

Xl

Mv −Mcv

2|Xk|₂

=O_P u_Nv_Nh²_N

=o_P(N T) (by Assumption3.2 (v)). (5.3) Let us consider now the quantities on the right-hand side in (5.2). Notice that because E =E₀−PK

k=1β_kE_k, it holds that |E|_op =O_P(ρ_N). Proceeding like above, we have

D

M_u−Mc_u

X_lM_v, EE

≤

M_u−Mc_u 2

X_lM_vE^>

2,2rN

≤ √

2r_N +o_P(1)

u_N|E_l|_op|E|_op

=O_P √

2r_Nu_Nρ²_N

=o_P(√

N T) (by Assumption3.2 (vi)).

and treat similarly D

MuXl

Mv−Mcv

, E

E

. With the same arguments as in (5.3), the absolute value of the last term of (5.2) is smaller than u_Nv_N|X_l|₂|Γ +E|₂, which is an O_P(u_Nv_Nρ_Nh_N) = o_P √

N T

because Γ +E =Y −PK

k=1β_kX_k.

Let us now look at the first terms on the left-hand side and on the right-hand side of (5.2).

By Assumption 3.2 (vi), for allk, l∈ {1, . . . , K}, we havehM_uX_lM_v, X_ki=hM_uE_lM_v, E_ki+ o_P(N T). Hence because of Assumption 3.1 (i), hM_uX_lM_v, X_ki are the high-order terms on the left-hand side of (5.2). Similarly, by Assumption 3.1 (ii), the high-order terms on the right-hand side of (5.2) are hM_uE_lM_v, Ei. As a result, βbis asymptotically equivalent to the ideal estimator β

β∈argmin

β∈R^K

M_u Y −

K

X

k=1

β_kX_k

! M_v

2

. (5.4)

(17)

Hence, we obtain by usual arguments that √

N T(βb−β)−→ N^d (0, σΣ⁻¹).

Proof of the consistency of σ.b We use N Tbσ² =

* Y −

K

X

k=1

βb_kX_k,Mc_u Y −

K

X

k=1

βb_kX_k

! Mc_v

+

=

* Y −

K

X

k=1

βb_kX_k,

Mc_u−M_u Y −

K

X

k=1

βb_kX_k

!

Mc_v −M_v +

+

* Y −

K

X

k=1

βbkXk,

Mcu−Mu

Y −

K

X

k=1

βbkXk

! Mv

+

* Y −

K

X

k=1

βb_kX_k, M_u Y −

K

X

k=1

βb_kX_k

!

Mc_v −M_v +

+

* Y −

K

X

k=1

βb_kX_k, M_u Y −

K

X

k=1

βb_kX_k

! M_v

+ .

Now, by the Cauchy-Schwarz inequality,

* Y −

K

X

k=1

βb_kX_k,

Mc_u−M_u Y −

K

X

k=1

βb_kX_k

!

Mc_v−M_v +

≤

Y −

K

X

k=1

βb_kX_k 2

Mc_u−M_u Y −

K

X

k=1

βb_kX_k

!

Mc_v−M_v 2

≤

Y −

K

X

k=1

βb_kX_k

2

Mc_u −M_u 2

Mc_v−M_v 2

=O_P(h²_Nu_Nv_N) (by the fact that βb−β =o_P(1))

=o_P(N T) (by Assumption 3.2 (iii))).

Similarly, we can show that

* Y −

K

X

k=1

βb_kX_k,

Mc_u−M_u Y −

K

X

k=1

βb_kX_k

! M_v

+

=o_P(N T)

(18)

and * Y −

K

X

k=1

βbkXk, Mu Y −

K

X

k=1

βbkXk

!

Mcv −Mv

+

=oP(N T).

Hence, we have N Tbσ²

=

* Y −

K

X

k=1

βb_kX_k, M_u Y −

K

X

k=1

βb_kX_k

! M_v

+

+o_P(N T)

=

* Y −

K

X

k=1

βb_k−β_k X_k−

K

X

k=1

β_kX_k, M_u Y −

K

X

k=1

βb_k−β_k X_k−

K+1

X

k=1

β_kX_k

! M_v

+

+o_P(N T)

=

* _K X

k=1

βb_k−β_k

X_k, M_u

K

X

k=1

βb_k−β_k X_k

! M_v

+ +

*

E₀, M_u

K

X

k=1

βb_k−β_k X_k

! M_v

+

* _K X

k=1

βb_k−β_k

X_k, M_uE_KM_v +

+ Ee

2

2+o_P(N T).

Now, by the Cauchy-Schwarz inequality, Assumption 3.2 and the fact that βb−β = o_P(1), one can show that

* _K X

k=1

βb_k−β_k

X_k, M_u

K

X

k=1

βb_k−β_k X_k

! M_v

+

=o_P(N T)

* E, M_u

K

X

k=1

βb_k−β_k X_k

! M_v

+

=o_P(N T);

* _K X

k=1

βb_k−β_k

X_k, M_uEM_v +

=o_P(N T).

We conclude the proof using Assumption 3.1.

Proof of Lemma 4.1.

Let Λ = (λ₁, . . . λ_N). For t ∈ {1, . . . , T} and k ∈ {0, . . . , K}, we use the notation ψ_tk = (f_t1δ_k1, . . . , f_tr_Nδ_kr_N)^>. We also introduce Ψ = (ψ₁₀, . . . , ψ_T₀, . . . , ψ_1K, . . . , ψ_{T K}). It holds that, forj, j⁰ ∈ {1, . . . , r_N}, ΨΨ^>

jj⁰/T = (∆∆^>)_jj⁰ F F^>

jj⁰/T. Therefore, ΛΛ^>ΨΨ^>/(N T) converges in probability to Σ_ΛΣ_∆F, where, forj, j⁰ ∈ {1, . . . , r_N}, (Σ_∆F)_jj0 = (∆∆^>)_jj⁰(Σ_F)_jj0. Next, let U = (u₁(Π_u)), . . . , u_r_N(Π_u)), V = (v₁(Π_u), . . . , v_r_N(Π_u)) and D be the r_N ×r_N diagonal matrix for which D_jj =σ_j(Π_u). We have U DV^>= Λ^>Ψ, which implies U D²U^> =