HAL Id: hal-00403939

https://hal.archives-ouvertes.fr/hal-00403939v2

Preprint submitted on 1 Sep 2011

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Nonparametric estimation in random coefficients binary choice models

Eric Gautier, Yuichi Kitamura

To cite this version:

Eric Gautier, Yuichi Kitamura. Nonparametric estimation in random coefficients binary choice models.

2011. ⟨hal-00403939v2⟩

Abstract. This paper considers random coefficients binary choice models. The main goal is to estimate the density of the random coefficients nonparametrically. This is an ill-posed inverse prob- lem characterized by an integral transform. A new density estimator for the random coefficients is developed, utilizing Fourier-Laplace series on spheres. This approach offers a clear insight on the identification problem. More importantly, it leads to a closed form estimator formula that yields a simple plug-in procedure requiring no numerical optimization. The new estimator, therefore, is easy to implement in empirical applications, while being flexible about the treatment of unobserved hetero- geneity. Extensions including treatments of non-random coefficients and models with endogeneity are discussed.

1. Introduction

Consider a binary choice model

$$Y = \mathbb{I}\{X'\beta \ge 0\} \tag{1.1}$$

where $\mathbb{I}$ denotes the indicator function and $X$ is a $d$-vector of covariates. We assume that the first element of $X$ is 1, so that the vector $X$ is of the form $X = (1, \tilde{X}')'$. The vector $\beta$ is random. The random element $(Y, X, \beta)$ is defined on some probability space $(\tilde{\Omega}, \mathcal{F}, \mathbb{P})$, and $(y_i, \tilde{x}_i, \beta_i)$, $i = 1, \dots, N$, denote its realizations. The econometrician observes $(y_i, \tilde{x}_i)$, $i = 1, \dots, N$, but $\beta_i$, $i = 1, \dots, N$, remain unobserved. The vectors $\tilde{X}$ and $\beta$ correspond to observed and unobserved heterogeneity across agents, respectively. Note that the first element of $\beta$ in this formulation absorbs the usual scalar stochastic shock term as well as a constant in a standard binary choice model with non-random coefficients.

Date: This Version: August 31, 2011.

Keywords: Inverse problems, Discrete Choice Models.

We thank Whitney Newey and two anonymous referees for comments that greatly improved this paper. We also thank seminar participants at Chicago, CREST, Harvard/MIT, the Henri Poincaré Institute, Hitotsubashi, LSE, Mannheim, Minnesota, Northwestern, NYU, Paris 6, Princeton, Rochester, Simon Fraser, Tilburg, Toulouse 1 University, UBC, UCL, UCLA, UCSD, the Tinbergen Institute and the University of Tokyo, and participants of the 2008 Cowles summer econometrics conference, EEA/ESEM, FEMES, Journées STAR, and SETA and 2009 CIRM Rencontres de Statistiques Mathématiques for helpful comments. Yuhan Fang and Xiaoxia Xi provided excellent research assistance. Kitamura acknowledges financial support from the National Science Foundation via grants SES-0241770, SES-0551271 and SES-0851759. Gautier is grateful for support from the Cowles Foundation as this research was initiated during his visit as a postdoctoral associate.

This formulation is used in Ichimura and Thompson (1998), and is convenient for the subsequent development in this paper. Our basic model maintains exogeneity of the covariates $\tilde{X}$:

Assumption 1.1. $\beta$ is independent of $\tilde{X}$.

Section 5.3 considers ways to relax this assumption. Under (1.1) and Assumption 1.1, the choice probability function is given by

$$r(x) = \mathbb{P}(Y = 1 \mid X = x) = \mathbb{E}_\beta\left[\mathbb{I}\{x'\beta \ge 0\}\right]. \tag{1.2}$$

Discrete choice models with random coefficients are useful in applied research since it is often crucial to incorporate unobserved heterogeneity in modeling the choice behavior of individuals. There is a vast and active literature on this topic. Recent contributions include Briesch, Chintagunta and Matzkin (1996), Brownstone and Train (1999), Chesher and Santos Silva (2002), Hess, Bolduc and Polak (2005), Harding and Hausman (2006), Athey and Imbens (2007), Bajari, Fox and Ryan (2007) and Train (2003). A common approach in estimating random coefficient discrete choice models is to impose parametric distributional assumptions. A leading example is the mixed Logit model, which is discussed in detail by Train (2003). If one does not impose a parametric distributional assumption, the distribution of $\beta$ itself is the structural parameter of interest. The goal for the econometrician is then to recover it nonparametrically from the information about $r(x)$ obtained from the data.

Nonparametric treatments for unobserved heterogeneity distributions have been considered in the literature for other models. Heckman and Singer (1984) study the issue of unobserved heterogeneity distributions in duration models and propose a treatment by a nonparametric maximum likelihood estimator (NPMLE). Elbers and Ridder (1982) also develop some identification results in such models. Beran and Hall (1992) and Hoderlein et al. (2007) discuss nonparametric estimation of random coefficients linear regression models. Despite the tremendous importance of random coefficient discrete choice models, as exemplified in the above references, nonparametrics in these models is relatively underdeveloped. In their important paper, Ichimura and Thompson (1998) propose an NPMLE for the CDF of $\beta$. They present sufficient conditions for identification and prove the consistency of the NPMLE. The NPMLE requires high dimensional numerical maximization and can be computationally intensive even for a moderate sample size. Berry and Haile (2008) explore nonparametric identification problems in a random coefficients multinomial choice model that often arises in empirical IO.

This paper considers nonparametric estimation of the random coefficients distribution, using a novel approach that shares some similarities with standard deconvolution techniques. This allows us to reconsider the identifiability of the model and obtain a constructive identification result. Moreover, we develop a simple plug-in estimator for the density of β that requires no numerical optimization or integration. It is easy to implement in empirical applications, while being flexible about the treatment of unobserved heterogeneity.

Since the scale of $\beta$ is not identified in the binary choice model, we normalize it so that $\beta$ is a vector of Euclidean norm 1 in $\mathbb{R}^d$. The vector $\beta$ then belongs to the $(d-1)$-dimensional sphere $S^{d-1}$. This is not a restriction as long as the probability that $\beta$ is equal to 0 is 0. Also, since only the angle between $X$ and $\beta$ matters in the binary decision $\mathbb{I}\{X'\beta \ge 0\}$, we can replace $X$ by $X/\|X\|$ without any loss of information. We therefore assume that $X$ is on the sphere $S^{d-1}$ as well in the subsequent analysis. Results from the directional data literature are thus relevant to our analysis. We aim to recover the joint probability density function $f_\beta$ of $\beta$ with respect to the uniform spherical measure $\sigma$ over $S^{d-1}$ from the random sample $(y_1, x_1), \dots, (y_N, x_N)$ of $(Y, X)$.
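The normalization described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `to_sphere` and `choice` and the numerical values are hypothetical and do not come from the paper.

```python
import math

def to_sphere(x_tilde):
    """Map non-constant covariates x_tilde to x = (1, x_tilde) / ||(1, x_tilde)||."""
    v = (1.0,) + tuple(float(c) for c in x_tilde)
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)

def choice(x, beta):
    """Binary choice I{x'beta >= 0}."""
    return 1 if sum(a * b for a, b in zip(x, beta)) >= 0 else 0

x_raw = (1.0, 3.0, -4.0)      # (1, x_tilde) before normalization
x = to_sphere((3.0, -4.0))    # same direction, unit Euclidean norm
beta = (0.5, 0.2, 0.1)        # illustrative coefficient draw (not normalized)

# Rescaling x by a positive constant leaves the sign of x'beta unchanged.
same_choice = choice(x_raw, beta) == choice(x, beta)
```

Because the first coordinate of the unnormalized vector is 1, the normalized point always has a strictly positive first coordinate, which is why the covariates end up on a single hemisphere.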

The problem considered here is a linear ill-posed inverse problem. We can write

$$r(x) = \int_{b \in S^{d-1}} \mathbb{I}\{x'b \ge 0\}\, f_\beta(b)\, d\sigma(b) = \int_{H(x)} f_\beta(b)\, d\sigma(b) := \mathcal{H}(f_\beta)(x) \tag{1.3}$$

where the set $H(x)$ is the hemisphere $\{b : x'b \ge 0\}$. The mapping $\mathcal{H}$ is called the hemispherical transformation. Inversion of this mapping was first studied by Funk (1916) and later by Rubin (1999). Groemer (1996) also discusses some of its properties. $\mathcal{H}$ is not injective without further restrictions, and conditions need to be imposed to ensure identification of $f_\beta$ from $r$. Even under a set of assumptions that guarantee identification, however, the inverse of $\mathcal{H}$ is not a continuous mapping, making the problem ill-posed. To see this, suppose we restrict $f_\beta$ to be in $L^2(S^{d-1})$. Since the kernel of $\mathcal{H}$ is square integrable by compactness of the sphere, $\mathcal{H}$ is Hilbert-Schmidt and thus compact. Therefore, if the inverse of $\mathcal{H}$ were continuous, $\mathcal{H}^{-1}\mathcal{H}$ would map the closed unit ball in $L^2(S^{d-1})$ to a compact set. But the Riesz theorem states that the unit ball is relatively compact if and only if the vector space has finite dimension. The fact that $L^2(S^{d-1})$ is an infinite dimensional space contradicts this. Therefore the inverse of $\mathcal{H}$ cannot be continuous. In order to overcome this problem, we use a one-parameter family of regularized inverses that are continuous and converge to the inverse when the parameter goes to infinity. This is a common approach to ill-posed inverse problems in statistics (see, e.g., Carrasco et al., 2007).

Due to the particular form of its kernel, which involves the scalar product $x'b$, the operator $\mathcal{H}$ is an analogue of convolution in $\mathbb{R}^d$, as illustrated in a simple example in Section A.1.1 of the Supplemental Appendix. This analogy provides a clear insight into the identification issue. In particular, our problem is closely related to the so-called boxcar deconvolution (see, e.g., Groeneboom and Jongbloed (2003) and Johnstone and Raimondo (2004)), where identifiability is often a significant problem. The connection with deconvolution is also useful in deriving an estimator based on a series expansion on the Fourier basis on $S^1$ or its extension to higher dimensional spheres, called Fourier-Laplace series. These bases are defined via the Laplacian on the sphere, and they diagonalize the operator $\mathcal{H}$ on $L^2(S^{d-1})$. Such techniques are used in Healy and Kim (1996) for nonparametric empirical Bayes estimation in the case of the sphere $S^2$. The kernel of the integral operator $\mathcal{H}$, however, does not satisfy the assumptions made by Healy and Kim. Unlike Healy and Kim (1996), we make use of so-called "condensed" harmonic expansions. The approach replaces a full expansion on a Fourier-Laplace basis by an expansion in terms of the projections on the finite dimensional eigenspaces of the Laplacian on the sphere. This is useful since an explicit expression of the kernel of the projector is available. It enables us to work in any dimension and requires neither a parametrization by hyperspherical coordinates nor the actual knowledge of an orthonormal basis. This approach, to the best of our knowledge, appears to be new in the econometrics literature.

The paper is organized as follows. Section 2 provides a practical guide for our procedure, which is easy to implement. Section 3 deals with identification while introducing basic notions used throughout the paper. We derive the convergence rates of the estimators in all the $L^q$ spaces for $q \in [1, \infty]$ and also prove a pointwise CLT in Section 4. Some extensions, such as estimation of marginals, treatments of models with non-random coefficients, and the case with endogenous regressors are presented in Section 5. Simulation results are reported in Section 6. Section 7 concludes. The Supplemental Appendix presents analysis of a toy model, technical tools used in the main text, estimators for choice probabilities that are used to construct our density estimators, and the proofs of the main results.

2. A Brief Guide for Practical Implementation

This section presents our basic estimation procedure when a random sample $\{(y_i, \tilde{x}_i)\}$ generated from the model (1.1) is available. As noted in Section 1, normalize the covariate data and define $x_i = (1, \tilde{x}_i)/\|(1, \tilde{x}_i)\| \in S^{d-1}$, $i = 1, \dots, N$. To estimate the joint density of the random vector $\beta$, use the following formula:

$$\hat{f}_\beta(b) = \max\left( \frac{2}{|S^{d-1}|} \sum_{p=0}^{T_N - 1} \frac{\chi(2p+1, 2T_N)\, h(2p+1, d)}{\lambda(2p+1, d)\, C_{2p+1}^{\nu(d)}(1)} \left( \frac{1}{N} \sum_{i=1}^{N} \frac{(2y_i - 1)\, C_{2p+1}^{\nu(d)}(x_i'b)}{\max\left(\hat{f}_X(x_i), m_N\right)} \right),\ 0 \right). \tag{2.1}$$

The factors $|S^{d-1}|$, $\chi$, $h$ and $\lambda$ are constants that do not depend on the data and are trivial to compute. The surface area $|S^{d-1}|$ of $S^{d-1}$ is given by $|S^{d-1}| = 2\pi^{d/2}/\Gamma(d/2)$, where $\Gamma$ denotes the Gamma function. The constants $h$, $\nu$ and $\lambda$ are obtained via the formulas

$$h(n, d) = \frac{(2n+d-2)\,(n+d-2)!}{n!\,(d-2)!\,(n+d-2)}, \qquad \nu(d) = \frac{d-2}{2}, \qquad \lambda(2p+1, d) = \frac{(-1)^p\, |S^{d-2}|\; 1 \cdot 3 \cdots (2p-1)}{(d-1)(d+1)\cdots(d+2p-1)},$$

respectively. The function $\chi$ is defined on $\mathbb{N} \times \mathbb{N}$ and used for smoothing. It is to be chosen by the user: see Proposition A.3 as well as the numerical example reported in Section 6 for examples of $\chi$. The truncation parameter $T_N$ needs to be chosen so that it grows with the sample size at a sufficiently slow rate. The trimming factor $m_N$ is also user-defined, and it is chosen so that it goes to zero as the sample size increases. The notation $C_n^\nu(\cdot)$ signifies the Gegenbauer polynomial (see footnote 1); these correspond, for example, to the Chebyshev polynomials of the first kind in the case of one random slope (i.e. the case with $d = 2$; see footnote 2). The only remaining factor which needs to be calculated in the above formula is the nonparametric density estimator $\hat{f}_X$ for $f_X$ on $S^{d-1}$. For example, the following nonparametric estimator can be used:

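$$\hat{f}_X(x) = \max\left( \frac{1}{|S^{d-1}|} \sum_{n=0}^{T_N'} \frac{\chi(n, T_N')\, h(n, d)}{C_n^{\nu(d)}(1)} \left( \frac{1}{N} \sum_{i=1}^{N} C_n^{\nu(d)}(x_i'x) \right),\ 0 \right) \tag{2.2}$$

where $T_N'$ is another truncation parameter, playing a role similar to $T_N$.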

Our estimator $\hat{f}_\beta$ requires neither numerical integration nor optimization. This is a clear advantage over existing estimators for random coefficient binary choice models, including many parametric estimators. This is our main proposal, on which the rest of the paper focuses. In Section 4 we explain how the formula (2.1) is derived, and investigate its asymptotic properties.
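As a rough illustration of how little machinery formulas (2.1) and (2.2) need, the following pure-Python sketch evaluates both. It is a sketch under stated assumptions rather than the authors' implementation: all function names are hypothetical, the weight function `chi` is an illustrative linear taper (the paper leaves $\chi$ to the user and refers to Proposition A.3), the Gegenbauer polynomials are computed with the standard three-term recurrence, and for $d = 2$ the ratio $C_n^{\nu}(t)/C_n^{\nu}(1)$ is replaced by the Chebyshev polynomial $\cos(n \arccos t)$, as the text indicates.

```python
import math

def surface_area(d):
    """|S^{d-1}| = 2 pi^{d/2} / Gamma(d/2)."""
    return 2.0 * math.pi ** (d / 2.0) / math.gamma(d / 2.0)

def h(n, d):
    """h(n, d): dimension of the degree-n spherical harmonics on S^{d-1}."""
    if n == 0:
        return 1
    return (2 * n + d - 2) * math.comb(n + d - 2, n) // (n + d - 2)

def lam(p, d):
    """lambda(2p+1, d) as given in Section 2."""
    value = (-1) ** p * surface_area(d - 1)   # surface_area(d-1) is |S^{d-2}|
    for j in range(1, p + 1):
        value *= 2 * j - 1                    # numerator 1*3*...*(2p-1)
    for j in range(p + 1):
        value /= d - 1 + 2 * j                # denominator (d-1)(d+1)...(d+2p-1)
    return value

def geg_ratio(n, d, t):
    """C_n^nu(t) / C_n^nu(1) with nu = (d-2)/2; Chebyshev polynomial when d = 2."""
    t = max(-1.0, min(1.0, t))
    if d == 2:
        return math.cos(n * math.acos(t))
    nu = (d - 2) / 2.0
    def geg(s):
        prev, cur = 1.0, 2.0 * nu * s
        if n == 0:
            return prev
        for k in range(2, n + 1):             # k C_k = 2s(k+nu-1) C_{k-1} - (k+2nu-2) C_{k-2}
            prev, cur = cur, (2.0 * s * (k + nu - 1) * cur - (k + 2 * nu - 2) * prev) / k
        return cur
    return geg(t) / geg(1.0)

def chi(n, T):
    """Hypothetical smoothing weights: 1 up to T/2, then a linear taper to 0 at T."""
    return 1.0 if n <= T / 2 else max(0.0, 2.0 * (T - n) / T)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def f_X_hat(x, xs, T_x, d):
    """Covariate density estimator (2.2) at x, from normalized covariates xs."""
    total = 0.0
    for n in range(T_x + 1):
        mean = sum(geg_ratio(n, d, dot(xi, x)) for xi in xs) / len(xs)
        total += chi(n, T_x) * h(n, d) * mean
    return max(total / surface_area(d), 0.0)

def f_beta_hat(b, ys, xs, T, T_x, m, d):
    """Random coefficients density estimator (2.1) at the point b."""
    weights = [(2 * y - 1) / max(f_X_hat(xi, xs, T_x, d), m) for y, xi in zip(ys, xs)]
    total = 0.0
    for p in range(T):
        n = 2 * p + 1
        mean = sum(w * geg_ratio(n, d, dot(xi, b)) for w, xi in zip(weights, xs)) / len(xs)
        total += chi(n, 2 * T) * h(n, d) / lam(p, d) * mean
    return max(2.0 * total / surface_area(d), 0.0)

# Tiny illustrative data set on the circle (d = 2); the values are made up.
angles = [-1.0, -0.3, 0.2, 0.9]
xs = [(math.cos(a), math.sin(a)) for a in angles]
ys = [0, 0, 1, 1]
estimate = f_beta_hat((0.0, 1.0), ys, xs, T=2, T_x=2, m=0.05, d=2)
```

The cost of one evaluation is of order $N \cdot T_N$ polynomial evaluations, plus the one-off cost of evaluating $\hat{f}_X$ at the $N$ sample points; in practice one would cache the trimming weights rather than recompute them for every $b$ as this sketch does.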

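Footnote 1. The Gegenbauer polynomials are given by

$$C_n^\nu(t) = \sum_{l=0}^{[n/2]} (-1)^l \frac{(\nu)_{n-l}}{l!\,(n-2l)!} (2t)^{n-2l}, \qquad \nu > -1/2,\ n \in \mathbb{N},$$

where $(a)_0 = 1$ and, for $n$ in $\mathbb{N} \setminus \{0\}$, $(a)_n = a(a+1)\cdots(a+n-1) = \Gamma(a+n)/\Gamma(a)$. See Section A.1.2 for further properties of the Gegenbauer polynomials.

Footnote 2. When $d = 2$, the following relations can be used in (2.1) and (2.2):

$$\forall p \ge 0, \quad \frac{1}{|S^{d-1}|} \frac{h(2p+1, 2)\, C_{2p+1}^{0}(x_i'b)}{\lambda(2p+1, 2)\, C_{2p+1}^{0}(1)} = \frac{(-1)^p (2p+1)}{\pi} \cos\big((2p+1) \arccos(x_i'b)\big),$$

$$\forall n \ge 0, \quad \frac{1}{|S^{d-1}|} \frac{h(n, 2)\, C_n^{0}(x_i'b)}{C_n^{0}(1)} = \frac{1}{\pi} \cos\big(n \arccos(x_i'b)\big).$$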

3. Identification Analysis

In this section we address the following two questions:

(Q1) Under what conditions is $f_\beta$ identified?

(Q2) Does the random coefficients model impose restrictions?

To answer these questions it is useful to introduce the notion of the odd and even part of a function defined on the sphere.

Definition 3.1. We denote the odd part and the even part of a function $f$ by

$$f^-(b) = \big(f(b) - f(-b)\big)/2$$

and

$$f^+(b) = \big(f(b) + f(-b)\big)/2,$$

respectively, for every $b$ in $S^{d-1}$.

Let us start with the question (Q1). As noted in Section A.1.4, applying $\mathcal{H}$ reduces the even part of a function to a constant, and therefore it is impossible to recover $f_\beta^+$ from the knowledge of $r$, which is what observations offer. Our identification strategy is therefore as follows: (Step 1) assume conditions that guarantee the identification of $f_\beta^-$; then (Step 2) show that $f_\beta$ is uniquely determined from $f_\beta^-$ under a reasonable assumption. We first consider Step 1. Define $H^+ = H(n) = \{x \in S^{d-1} : x'n \ge 0\}$, where $n = (1, 0, \dots, 0)'$, that is, the northern hemisphere of $S^{d-1}$. For later use, also define its southern hemisphere $H^- = H(-n)$. Since the model we consider has a constant as the first element of the covariate vector before normalization, the same vector after normalization is necessarily an element of $H^+$. We make the following assumption, which also appears in Ichimura and Thompson (1998), and show that it achieves Step 1.

Assumption 3.1. The support of $X$ is $H^+$.

This assumption demands that $\tilde{X}$, the vector of non-constant covariates in the original scale, is supported on the whole space $\mathbb{R}^{d-1}$. It rules out discrete or bounded covariates; see Section 5 for a potential approach to deal with regressors with limited support. In what follows we assume that the law of $X$ is absolutely continuous with respect to $\sigma$ and denote its density by $f_X$. Step 1 of our identification argument is to show that the knowledge of $r(x)$ on $H^+$, which is available under Assumption 3.1, identifies $f_\beta^-$. The problem at hand calls for solving $r = \mathcal{H} f_\beta = \frac{1}{2} + \mathcal{H} f_\beta^-$ for $f_\beta^-$, and the inversion formula derived in (4.1) is potentially useful for the purpose. A direct application of the formula to $r$ is inappropriate, however, since it requires integration of $r$ on the whole sphere $S^{d-1}$, but $r$ is defined only on $H^+$ even when $\tilde{X}$ has full support on $\mathbb{R}^{d-1}$. An appropriate extension of $r(x)$, $x \in H^+$, to the entire $S^{d-1}$ is in order. Using the random coefficients model (1.1) and Assumption 1.1, then noting that $f_\beta$ is a probability density function, conclude

$$\mathcal{H}(f_\beta)(-x) = \int_{H(-x)} f_\beta(b)\, d\sigma(b) = 1 - \mathcal{H}(f_\beta)(x) = 1 - r(x) \tag{3.1}$$

for $x$ in $H^+$. This suggests an extension $R$ of $r$ to $S^{d-1}$ as follows:

$$\forall x \in H^+,\ R(x) = r(x), \quad \text{and} \quad \forall x \in H^-,\ R(x) = 1 - r(-x) = 1 - R(-x). \tag{3.2}$$

The function $R$ is well-defined on the whole sphere under Assumption 3.1. Later we derive a formula for $f_\beta^-$ in terms of $R(x)$, $x \in S^{d-1}$, which shows the identifiability of $f_\beta^-$ under Assumption 3.1.
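The extension (3.2) is simple enough to state as code. The sketch below is illustrative only: the function names are hypothetical, and the example choice probability corresponds to an assumed $\beta$ whose direction is uniformly distributed on the open half circle $\{b : b_1 > 0\}$ (so $d = 2$), for which $r(x) = 1 - |\theta|/\pi$ at the angle $\theta$ of $x$.

```python
import math

def extend_choice_probability(r, x):
    """R(x) from (3.2): r(x) on the northern hemisphere, 1 - r(-x) on the southern one."""
    if x[0] >= 0:
        return r(x)
    return 1.0 - r(tuple(-c for c in x))

def r_example(x):
    """Choice probability on H+ when the angle of beta is uniform on (-pi/2, pi/2)."""
    return 1.0 - abs(math.atan2(x[1], x[0])) / math.pi

x_south = (math.cos(2.0), math.sin(2.0))   # first coordinate is negative
x_north = tuple(-c for c in x_south)       # antipodal point, lies in H+

R_south = extend_choice_probability(r_example, x_south)
R_north = extend_choice_probability(r_example, x_north)
```

By construction $R(x) + R(-x) = 1$ at every point, which is the property used in the derivation that follows.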

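Note that

$$\begin{aligned} R(x) &= R^+(x) + R^-(x) \\ &= \tfrac{1}{2}\left[R(x) + R(-x)\right] + R^-(x) \\ &= \tfrac{1}{2}\left[R(x) + (1 - R(x))\right] + R^-(x) \quad \text{by (3.2)} \\ &= \tfrac{1}{2} + R^-(x), \end{aligned} \tag{3.3}$$

thus $R$ is completely determined by its odd part and therefore $R(x) = \frac{1}{2} + \mathcal{H}\left(f_\beta^-\right)(x)$, or

$$R^- = \mathcal{H} f_\beta^-. \tag{3.4}$$

We can invert this equation to obtain $f_\beta^-$.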

Now we turn to Step 2 in our identification argument. Obviously $f_\beta^-$ does not uniquely determine $f_\beta$ without further assumptions. This is a fundamental identification problem in our model. We need to identify $f_\beta$ from the choice probability function $r$, but we can choose an appropriate even function $g$ so that $f_\beta + g$ is a legitimate density function (see the proof of Proposition 3.1 for such a construction). Then $r = \mathcal{H}(f_\beta + g)$, and the knowledge of $r$ identifies $f_\beta$ only up to such a function $g$. Ichimura and Thompson (1998, Theorem 1) give a set of conditions that imply the identification of the model (1.1). One of their assumptions postulates that there exists $c$ on $S^{d-1}$ such that $\mathbb{P}(c'\beta > 0) = 1$. This, in our terminology, means that:

Assumption 3.2. The support of β is a subset of some hemisphere.

As noted by Ichimura and Thompson (1998), Assumption 3.2 does not seem too stringent in many economic applications. It is often reasonable to assume that an element of the random coefficients vector, such as a price coefficient, has a known sign. If the $j$-th element of $\beta$ has a known (positive) sign, then Assumption 3.2 holds with $c$ being a unit vector with its $j$-th element being 1. This is a case in which the location of the hemisphere in Assumption 3.2 is known a priori, though the knowledge about its location is not necessary for identification. Assumption 3.2 implies the following mapping from $f_\beta^-$ to $f_\beta$ developed in (A.24):

$$f_\beta(b) = 2 f_\beta^-(b)\, \mathbb{I}\left\{ f_\beta^-(b) > 0 \right\}. \tag{3.5}$$

This is useful because it shows that Assumption 3.2 guarantees identification if $f_\beta^-$ is identified. Moreover, it will be used in the next section to develop a key formula that leads to a simple and practical estimator for $f_\beta$ that is guaranteed to be non-negative.
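A small numerical check may help to see why the mapping (3.5) works when the support of $\beta$ lies in a hemisphere. The sketch below is illustrative only: the function names and the example density $f_\beta(b) = b_1/2$ on the half circle $\{b : b_1 > 0\}$ (with $d = 2$, a density with respect to arc length) are assumptions made for the example.

```python
import math

def odd_part(f, b):
    """f^-(b) = (f(b) - f(-b)) / 2, as in Definition 3.1."""
    return 0.5 * (f(b) - f(tuple(-c for c in b)))

def density_from_odd_part(f_minus_value):
    """Mapping (3.5): f(b) = 2 f^-(b) I{f^-(b) > 0}."""
    return 2.0 * f_minus_value if f_minus_value > 0 else 0.0

def f_beta_example(b):
    """Hypothetical density on the circle, supported on the half circle b[0] > 0."""
    return b[0] / 2.0 if b[0] > 0 else 0.0

points = [(math.cos(a), math.sin(a)) for a in (-2.5, -1.0, 0.0, 0.7, 2.0, 3.0)]
recovered = [density_from_odd_part(odd_part(f_beta_example, b)) for b in points]
original = [f_beta_example(b) for b in points]
```

On the support, $f_\beta(-b) = 0$, so the odd part equals $f_\beta(b)/2$ and doubling it returns the density; off the support the odd part is negative and the indicator sets the value to zero.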

Remark 3.1. Assumption 3.2 is testable since it imposes restrictions on $f_\beta^-$, which is identified under weak conditions. For example, for values of $b$ with $f_\beta^-(b) > 0$, $f_\beta^-(-b) < 0$ must hold. Or, it implies that $f_\beta^-$ integrates to $1/(2|S^{d-1}|)$ on a hemisphere $H(x)$ for some $x$, and to $-1/(2|S^{d-1}|)$ on the other hemisphere $H(-x)$.

The subsequent result, Proposition 3.1, answers question (Q2), and a proof is given in the Supplemental Appendix.

Notation. We use the notation $L^2(S^{d-1})$ for the space of square integrable complex valued functions equipped with the hermitian product $(f, g)_{L^2(S^{d-1})} = \int_{S^{d-1}} f(x)\overline{g(x)}\, d\sigma(x)$, and more generally use $L^p(S^{d-1})$, for $p \in [1, \infty]$, for the Banach space of $p$-integrable functions and $\|\cdot\|_p$ for the corresponding norm. We also use the notation $W^s_p(S^{d-1})$ (and $H^s(S^{d-1})$ for $p = 2$) to signify the corresponding Sobolev spaces with norm $\|\cdot\|_{p,s}$ defined as

$$\|f\|_{p,s} = \|f\|_p + \left\| \left(-\Delta^S\right)^{s/2} f \right\|_p$$

where $\Delta^S$ denotes the Laplacian on the sphere $S^{d-1}$. See Section A.1.3 for further discussion.

Proposition 3.1. A $[0,1]$-valued function $r$ is compatible with the random coefficients model (1.1) with $f_\beta$ in $L^2(S^{d-1})$ and Assumption 1.1 if and only if $r$ is homogeneous of degree 0 and its extension $R$ according to (3.2) belongs to $H^{d/2}(S^{d-1})$.

The global smoothness assumption that $R$ belongs to $H^{d/2}(S^{d-1})$ imposes a substantial restriction on the property of observables, that is, the behavior of the choice probability function $r$. Note that the smoothness condition in this proposition is stated in terms of $R$, and even if the choice probability function $r$ is sufficiently smooth on the support of $X$, which is $H^+$, it is not necessarily consistent with the random coefficients binary choice model (1.1) unless its extension is smooth globally on $S^{d-1}$. In particular, the Sobolev embedding of $H^s(S^{d-1})$ into the space of continuous functions for $s > (d-1)/2$ implies that if the extension $R$ is in $H^{d/2}(S^{d-1})$, it has to be continuous on $S^{d-1}$. This, in turn, means that the corresponding $r$ has to satisfy certain matching conditions at a boundary point $x$ of $H^+$ (i.e. $x'n = 0$) and its opposite point $-x$.

4. Nonparametric Estimation of fβ

4.1. Derivation of the closed form estimation formula. This section discusses how the closed form estimation formula (2.1) is derived. Suppose an odd function $f^-$ defined on $S^{d-1}$ satisfies an integral equation $f^- = \mathcal{H} g$ with $g$ square integrable with respect to the spherical measure. In Section A.1.4 we show that the solution to this equation is given by:

$$\mathcal{H}^{-1}\left(f^-\right)(y) = \sum_{p=0}^{\infty} \frac{1}{\lambda(2p+1, d)} \int_{S^{d-1}} q_{2p+1,d}(x, y)\, f^-(x)\, d\sigma(x) \tag{4.1}$$

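where expressions for $\lambda$ and $q$ are provided in Proposition A.4 and Theorem A.1, respectively. If an appropriate estimator $\hat{R}$ of $R$ is available, an application of the inversion formula (4.1) to (3.4) suggests the following estimator for $f_\beta^-$:

$$\hat{f}_\beta^- = \mathcal{H}^{-1}\left(\hat{R}^-\right) = \sum_{p=0}^{\infty} \frac{1}{\lambda(2p+1, d)} \int_{S^{d-1}} q_{2p+1,d}(\cdot, x)\, \hat{R}(x)\, d\sigma(x). \tag{4.2}$$

Then use the mapping (3.5) to define

$$\hat{f}_\beta(b) = 2 \hat{f}_\beta^-(b)\, \mathbb{I}\left\{ \hat{f}_\beta^-(b) > 0 \right\} \tag{4.3}$$

as an estimator for $f_\beta$.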

We use the following notation in the rest of the paper:

Notation. For two sequences of positive numbers $(a_n)_{n \in \mathbb{N}}$ and $(b_n)_{n \in \mathbb{N}}$, we write $a_n \asymp b_n$ when there exists a positive $M$ such that $M^{-1} b_n \le a_n \le M b_n$ for every positive $n$.

Proposition A.6 implies that if $\hat{f}_\beta^- - f_\beta^- \in H^s(S^{d-1})$ then $\hat{R}^- - R^- \in H^\sigma(S^{d-1})$, $\sigma = s + \frac{d}{2}$, and for $v \in [0, s]$,

$$\left\| \hat{f}_\beta^- - f_\beta^- \right\|_{2,v} \asymp \left\| \hat{R}^- - R^- \right\|_{2,v+d/2}. \tag{4.4}$$

As discussed earlier, the estimation of $f_\beta^-$ is related to deconvolution in $S^{d-1}$, and the degree of ill-posedness in our model is $d/2$, which is indeed the rate at which the absolute values of the eigenvalues $\lambda(n, d)$, $n = 2p+1$, $p \in \mathbb{N}$, of $\mathcal{H}$ (cf. Proposition A.4) converge to zero as $p$ grows, as shown in (A.27). Existing results for deconvolution problems (see, for example, Fan, 1991 and Kim and Koo, 2000) then suggest that we should be able to estimate $f_\beta^-$ at the rate $N^{-s/(2s+2d-1)}$ in $L^2(S^{d-1})$ provided that $f_\beta^- \in H^s(S^{d-1})$. The relationship (4.4), evaluated at $v = 0$, implies that this can be achieved if we can estimate $R^-$ at the rate $N^{-(\sigma - d/2)/(2\sigma + d - 1)}$ in the $\|\cdot\|_{2,d/2}$ norm. The latter is the usual nonparametric rate for estimation of densities on $(d-1)$-dimensional smooth submanifolds of $\mathbb{R}^d$ (see, for example, Hendriks, 1990).
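The two rate exponents quoted above coincide once $\sigma = s + d/2$ is substituted. The short check below does this arithmetic exactly with rational numbers; the function names and the example values of $s$ and $d$ are illustrative assumptions.

```python
from fractions import Fraction

def density_rate(s, d):
    """Exponent s / (2s + 2d - 1) of the rate N^{-s/(2s+2d-1)} for the density."""
    s = Fraction(s)
    return s / (2 * s + 2 * d - 1)

def choice_probability_rate(sigma, d):
    """Exponent (sigma - d/2) / (2 sigma + d - 1) of the rate for R^-."""
    sigma = Fraction(sigma)
    return (sigma - Fraction(d, 2)) / (2 * sigma + d - 1)

s, d = 2, 3
sigma = Fraction(s) + Fraction(d, 2)            # smoothness of R^- implied by (4.4)
rate_f = density_rate(s, d)                     # 2/9 for s = 2, d = 3
rate_R = choice_probability_rate(sigma, d)      # also 2/9
```

For fixed $s$, the exponent decreases as $d$ grows, reflecting both the higher dimension and the stronger smoothing of the operator.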

The estimation formula given in (4.2) is natural and reasonable, though it typically requires numerical evaluation of integrals to implement it. Moreover, in practice one needs to evaluate the infinite sum in (4.2), for example, by truncating the series. This results in a general estimator that can be written in the following two equivalent forms

$$\hat{f}_\beta^- = \mathcal{H}^{-1}\left( P_{\tilde{T}_N} \hat{R}^- \right) = \sum_{p \ge 0:\ 2p+1 \le \tilde{T}_N} \frac{1}{\lambda(2p+1, d)} \int_{S^{d-1}} q_{2p+1,d}(\cdot, x)\, \hat{R}(x)\, d\sigma(x) \tag{4.5}$$

for suitably chosen $\tilde{T}_N$ that goes to infinity with $N$ and $P_{\tilde{T}_N}$ defined in (A.20). The sequence $\mathcal{H}^{-1} \circ P_{\tilde{T}_N}$, $N = 1, 2, \dots$, can be interpreted as regularized inverses of $\mathcal{H}$, with the spectral cut-off method often used in statistical inverse problems.

We now discuss how to obtain $\hat{R}^-$ in the calculation of (4.5). The following choice is particularly convenient:

$$\hat{R}^-(x) = \frac{1}{N} \sum_{i=1}^{N} \frac{(2y_i - 1)\, K^-_{2T_N}(x_i, x)}{\max\left(\hat{f}_X(x_i), m_N\right)} \tag{4.6}$$

where $m_N$ is a trimming factor going to 0 with the sample size, $K^-(x_i, \cdot)$ denotes the odd part (in the second argument) of the kernel function $K(x_i, \cdot)$ defined in (A.23) and $\hat{f}_X$ is a nonparametric density estimator for $f_X$. See Section A.1.5 of the Supplemental Appendix for the derivation of the above formula. Various nonparametric estimators for $f_X$ can be used in (4.6), since estimation of densities on compact manifolds has been studied by several authors, using histograms (Ruymgaart (1989)), projection estimators (see, e.g., Devroye and Gyorfi (1985) for the circle and Hendriks (1990) for general compact Riemannian manifolds) or kernel estimators (see, e.g., Devroye and Gyorfi (1985) for the case of the circle, and Hall et al. (1987) and Klemelä (2000) for higher dimensional spheres). Note also that Baldi et al. (2009) develop an adaptive density estimator on the sphere using needlet thresholding. In the simulation experiment we use

$$\hat{f}_X(x) = \max\left( \frac{1}{N} \sum_{i=1}^{N} K_{T_N'}(x_i, x),\ 0 \right) \tag{4.7}$$

for a suitably chosen $T_N'$ that depends on the sample size and the smoothness of $f_X$, where $K_{T_N'}$ is a kernel of the form (A.23) satisfying Assumption A.1. Note that its rate of convergence in sup-norm can be obtained in the same manner as in the proof of Theorem 4.1. This estimator is in the spirit of the projection estimators of Hendriks (1990), but here we are able to derive a closed form using the condensed harmonic expansions together with the Addition Formula. Note also that $K_{T_N'}$ is a smoothed projection kernel (note the factor $\chi$ in (A.23)), which is used here in order to have good approximation properties in the $L^q(S^{d-1})$ norms with arbitrary $q \in [1, \infty]$, in particular in the $L^\infty(S^{d-1})$ norm.

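Using (4.5) and (4.6) with $\tilde{T}_N = 2T_N$, define

$$\hat{f}_\beta^- = \mathcal{H}^{-1}\left(\hat{R}^-\right) = \mathcal{H}^{-1}\left( \frac{1}{N} \sum_{i=1}^{N} \frac{(2y_i - 1)\, K^-_{2T_N}(x_i, \cdot)}{\max\left(\hat{f}_X(x_i), m_N\right)} \right).$$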

Computing $\hat{f}_\beta^-$ is straightforward. First, note that the estimator (4.6) for $R^-$ resides in the finite dimensional space $\bigoplus_{p=0}^{T_N-1} H^{2p+1,d}$, therefore $P_{2T_N}\hat{R}^- = \hat{R}^-$ holds. Consequently, unlike in (4.5) where a general estimator for $R^-$ is considered, we do not need to apply any additional series truncation to $\hat{R}^-$ prior to the inversion of $\mathcal{H}$. Second, the estimator requires no numerical integration. To see this, note the formula

$$\mathcal{H}^{-1}\left( K^-_{2T_N}(x_i, \cdot) \right)(b) = \sum_{p=0}^{T_N-1} \frac{\chi(2p+1, 2T_N)}{\lambda(2p+1, d)}\, q_{2p+1,d}(x_i, b),$$

which follows from

$$\begin{aligned} \int_{S^{d-1}} q_{2p+1,d}(x, b)\, K^-_{2T_N}(x, x_i)\, d\sigma(x) &= \int_{S^{d-1}} q_{2p+1,d}(x, b) \sum_{p'=0}^{T_N-1} \chi(2p'+1, 2T_N)\, q_{2p'+1,d}(x, x_i)\, d\sigma(x) \\ &= \chi(2p+1, 2T_N)\, q_{2p+1,d}(b, x_i), \end{aligned}$$

which, in turn, can be seen from the definition of $K_T$ in (A.23), the fact that the integral operators with the $q$'s as kernels are projections, and (A.16). Thus

$$\hat{f}_\beta^-(b) = \frac{1}{N} \sum_{i=1}^{N} \frac{2y_i - 1}{\max\left(\hat{f}_X(x_i), m_N\right)} \sum_{p=0}^{T_N-1} \frac{\chi(2p+1, 2T_N)}{\lambda(2p+1, d)}\, q_{2p+1,d}(x_i, b).$$

Using (4.3) and the Addition Formula (Theorem A.1), we arrive at an estimator for $f_\beta$ with the following explicit form:

$$\hat{f}_\beta(b) = 2 \hat{f}_\beta^-(b)\, \mathbb{I}\left\{ \hat{f}_\beta^-(b) > 0 \right\}, \tag{4.8}$$

where

$$\hat{f}_\beta^-(b) = \frac{1}{|S^{d-1}|} \sum_{p=0}^{T_N-1} \frac{\chi(2p+1, 2T_N)\, h(2p+1, d)}{\lambda(2p+1, d)\, C_{2p+1}^{\nu(d)}(1)} \left( \frac{1}{N} \sum_{i=1}^{N} \frac{(2y_i - 1)\, C_{2p+1}^{\nu(d)}(x_i'b)}{\max\left(\hat{f}_X(x_i), m_N\right)} \right).$$

This is equivalent to the formula (2.1) previously presented in Section 2. Likewise, using the definition of the smoothing kernel (A.23) and the Addition Formula in the above definition (4.7) of $\hat{f}_X$, we obtain the formula (2.2) as well.

4.2. Rates of Convergence in $L^q(S^{d-1})$-norms. Now we analyze the rate of our estimator $\hat{f}_\beta^-$. The following assumption is weak and reasonable.

Assumption 4.1. $f_X \in L^\infty$.

The proofs of the following theorems and corollaries in the rest of this section are given in Section A.1.6 of the Supplemental Appendix.

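Theorem 4.1 (Upper bounds in $L^q(S^{d-1})$). Suppose Assumptions A.1, 3.1 and 4.1 hold, and choose $T_N$ that does not grow more than polynomially fast in $N$. If $f_\beta^-$ belongs to $W^s_q(S^{d-1})$ with $q$ in $[1, \infty]$ and $s > 0$, and

$$\max_{i=1,\dots,N} \left| f_X(x_i) - \hat{f}_X(x_i) \right| = O_p(m_N), \tag{4.9}$$

then, for any $1 \le r \le q$,

$$\begin{aligned} \left\| \hat{f}_\beta^- - f_\beta^- \right\|_q = O_p\Big( & m_N^{-1} N^{-1/2} T_N^{(2d-1)/2} (\log N)^{(1/2 - 1/q)\mathbb{I}\{q \ge 2\}} + T_N^{-s} \\ & + T_N^{d/2} m_N^{-2} \max_{i=1,\dots,N} \left| f_X(x_i) - \hat{f}_X(x_i) \right| \\ & + T_N^{d/2 + (d-1)(1 - 1/r)}\, \sigma(f_X < m_N)^{1/q - 1/r + 1} \Big). \end{aligned} \tag{4.10}$$

When there exists $m > 0$ such that $f_X \ge m$ $\sigma$-a.e. on $H^+$, the following holds for the estimator without the trimming factor (i.e. $m_N = 0$) when the estimator $\hat{f}_X$ is consistent in sup norm:

$$\left\| \hat{f}_\beta^- - f_\beta^- \right\|_q = O_p\left( N^{-1/2} T_N^{(2d-1)/2} (\log N)^{(1/2 - 1/q)\mathbb{I}\{q \ge 2\}} + T_N^{-s} + T_N^{d/2} \max_{i=1,\dots,N} \left| f_X(x_i) - \hat{f}_X(x_i) \right| \right).$$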

The first term in (4.10) is the stochastic error, the second term is the approximation bias, the third the plug-in error and the fourth the trimming bias. Note that Theorem 4.1 imposes the mild assumption (4.9); otherwise, we need to replace $T_N^{d/2} m_N^{-2} \max_{i=1,\dots,N} | f_X(x_i) - \hat{f}_X(x_i) |$ in (4.10) with

$$T_N^{d/2} m_N^{-2} \max_{i=1,\dots,N} \left| f_X(x_i) - \hat{f}_X(x_i) \right| \left( 1 + (\log N)^{(1/2 - 1/q)\mathbb{I}\{q \ge 2\}} N^{-1/2} T_N^{(d-1)/2} \right).$$

Since

$$\max_{i=1,\dots,N} \left| f_X(x_i) - \hat{f}_X(x_i) \right| \le \left\| f_X - \hat{f}_X \right\|_\infty,$$

this term can be made of order $O_P\left( (N/\log N)^{-v/(2v+d-1)} \right)$ when $f_X \in W^v_\infty$ with a suitably chosen parameter $T_N'$ if we take (4.7) as an estimator. The proof of the latter statement is classical and can be obtained by simplifying the proofs of Theorem 4.1 and Corollary 4.1. Equation (4.10) yields that, for proper choices of $m_N$ going to zero and $T_N$ going to infinity, $\hat{f}_\beta^-$ is consistent provided that $f_X$ has some smoothness in the Sobolev scale.

Though the additional condition of $f_X$ being bounded away from 0 in the last statement of Theorem 4.1 is convenient, it is restrictive. To see this, consider the $d = 2$ case. In polar coordinates, $f_X(\cos\theta, \sin\theta) = f_{\tilde{X}}(\tan\theta)\left(1 + \tan^2\theta\right)$; thus, assuming $f_X \ge m$ on $H^+$, which does not require trimming, yields

$$\forall x \in \mathbb{R}, \quad f_{\tilde{X}}(x) \ge \frac{m}{1 + x^2}.$$

It implies that $\tilde{X}$ has tails at least as heavy as Cauchy tails, so that its absolute moments of order one and higher are infinite. The introduction of the trimming factor $m_N$ allows us to relax the assumption $f_X \ge m$, though it introduces bias. As is clear from (4.10), the condition for the trimming bias to go to zero with $N$ depends both on $T_N$ and $m_N$. The quantity $\sigma(f_X < m_N)$ should decay to zero with $N$ sufficiently fast. We can check, for example, that when $\tilde{X}$ is standard Gaussian then $\sigma(f_X < m_N) = O\left( (-\log m_N)^{-1/2} \right)$, when it is Laplace then $\sigma(f_X < m_N) = O\left( (-\log m_N)^{-1} \right)$, and when $f_{\tilde{X}}$ is proportional to $(1 + x^2)^{-k}$ with $k > 1$ we obtain that $\sigma(f_X < m_N) = O\left( m_N^{1/(2(k-1))} \right)$. In all these cases, it is possible to adjust $T_N$ and $m_N$ adequately and to obtain rates of convergence. The upper bound on the rates becomes slower as the tail of $f_X$ becomes thinner.
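The size of the trimmed region $\sigma(f_X < m)$ in the $d = 2$ case can be explored numerically from the polar-coordinate relation above. The following sketch is illustrative only: the function names, the midpoint grid and the threshold values are assumptions made for the example, not part of the paper.

```python
import math

def f_X(theta, f_X_tilde):
    """Density of X on the half circle, f_X = f_Xtilde(tan(theta)) * (1 + tan(theta)^2)."""
    t = math.tan(theta)
    return f_X_tilde(t) * (1.0 + t * t)

def trimmed_measure(m, f_X_tilde, grid=20000):
    """Approximate sigma(f_X < m) on (-pi/2, pi/2) with a midpoint grid."""
    step = math.pi / grid
    count = 0
    for k in range(grid):
        theta = -math.pi / 2 + (k + 0.5) * step
        if f_X(theta, f_X_tilde) < m:
            count += 1
    return count * step

def gaussian(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def cauchy(t):
    return 1.0 / (math.pi * (1.0 + t * t))

gauss_coarse = trimmed_measure(0.1, gaussian)    # region where f_X < 0.1
gauss_fine = trimmed_measure(0.01, gaussian)     # shrinks only slowly as m decreases
cauchy_any = trimmed_measure(0.3, cauchy)        # f_X is constant 1/pi, nothing is trimmed
```

With a Cauchy covariate the density on the half circle is constant, so no trimming is needed, whereas with a Gaussian covariate the trimmed region shrinks only at a logarithmic rate in $m$, in line with the orders stated above.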

Nonparametric estimation of the regression function with random degenerate design, in the sense that the density of regressors can be low on its support, is a difficult issue. It has been studied for the pointwise risk in Hall et al. (1997), Gaïffas (2005), Gaïffas (2009) and Guerre (1999). Extension to the inverse problems setting is a wide open problem. We tackle this problem for our specific inverse problem. Future research includes the study of lower bounds from the minimax point of view that account for the degeneracy of the design.

Let us now return to the general case of $d - 1$ regressors. The assumptions below allow us to obtain rates that differ slightly from the rates that we would obtain in the ideal case where $f_X \ge m$ $\sigma$-a.e. for positive $m$ on $H^+$.

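Assumption 4.2. Suppose for $q$ in $[1, \infty]$, there exist positive $\tau$ and $r_X$ such that

(i) $\sigma(f_X < h) = O(h^\tau)$ and $f_X \in L^\infty$,

and either

(ii) $$\max_{i=1,\dots,N} \left| f_X(x_i) - \hat{f}_X(x_i) \right| = O_p\left( \left( \frac{N}{(\log N)^{(1 - 2/q)\mathbb{I}\{q \ge 2\}}} \right)^{-r_X} \right)$$

or,

(iii) for some constant $C$, $$\limsup_{N \to \infty} \left( \frac{N}{(\log N)^{(1 - 2/q)\mathbb{I}\{q \ge 2\}}} \right)^{r_X} \max_{i=1,\dots,N} \left| f_X(x_i) - \hat{f}_X(x_i) \right| \le C \quad \text{a.s.}$$

holds.

As seen before, Assumptions 4.2 (ii) and (iii) are very mild. Assumption 4.2 (i) holds for a reasonable class of distributions for $f_X$. In the above example where $f_{\tilde{X}}$ is proportional to $(1 + x^2)^{-k}$ with $k > 1$, we have the relation $\tau = 1/(2(k-1))$. Higher order moments of $\tilde{X}$ exist when $k$ is large.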

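Corollary 4.1. Assume that $f_\beta^-$ belongs to $W^s_q(S^{d-1})$ with $q$ in $[1, \infty]$ and $s > 0$. Let Assumptions A.1, 3.1, 4.1 and 4.2 (i) and (ii) hold, and take

$$m_N \asymp \left( \frac{N}{(\log N)^{(1 - 2/q)\mathbb{I}\{q \ge 2\}}} \right)^{-\rho}, \qquad T_N \asymp \left( \frac{N}{(\log N)^{(1 - 2/q)\mathbb{I}\{q \ge 2\}}} \right)^{\gamma(\rho)}$$

where $\rho$ yields a maximum $\gamma$ of

$$\gamma(\rho) = \min\left( \frac{1 - 2\rho}{2s + 2d - 1},\ \frac{2\rho\tau}{2s + d + 2(d-1)(1 - 1/q)},\ \frac{2r_X - 4\rho}{2s + d},\ \frac{1}{d - 1} \right).$$

We then have

$$\left\| \hat{f}_\beta^- - f_\beta^- \right\|_q = O_p\left( \left( \frac{N}{(\log N)^{(1 - 2/q)\mathbb{I}\{q \ge 2\}}} \right)^{-\gamma s} \right). \tag{4.11}$$

Moreover, if, instead of Assumption 4.2 (ii), Assumption 4.2 (iii) holds with $q = \infty$, then there exists a constant $C$ such that

$$\limsup_{N \to \infty} \left( \frac{N}{\log N} \right)^{\gamma s} \left\| \hat{f}_\beta^- - f_\beta^- \right\|_\infty \le C \quad \text{a.s.} \tag{4.12}$$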

The rate $\gamma s$ in Corollary 4.1 accounts for the dimension $d-1$, the degree of smoothing $d/2$ of the operator and features of the density of the covariates (i.e. its smoothness and tail behavior).
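The choice of $\rho$ in Corollary 4.1 can be explored numerically. The sketch below is illustrative only: it evaluates the four terms of $\gamma(\rho)$ as stated in the corollary and maximizes over a grid of $\rho$ in $[0, 1/2]$. The grid search, the function names and the parameter values in the usage lines are assumptions made for the example.

```python
import numpy as np

def gamma_of_rho(rho, s, d, q, tau, r_X):
    """gamma(rho) of Corollary 4.1: the minimum of four terms (requires d >= 2)."""
    terms = (
        (1 - 2 * rho) / (2 * s + 2 * d - 1),
        2 * rho * tau / (2 * s + d + 2 * (d - 1) * (1 - 1 / q)),
        (2 * r_X - 4 * rho) / (2 * s + d),
        1 / (d - 1),
    )
    return min(terms)

def best_rho(s, d, q, tau, r_X, n_grid=50_001):
    """Grid search for the rho maximizing gamma(rho); returns (rho, gamma)."""
    rhos = np.linspace(0.0, 0.5, n_grid)
    gammas = np.array([gamma_of_rho(r, s, d, q, tau, r_X) for r in rhos])
    i = int(np.argmax(gammas))
    return float(rhos[i]), float(gammas[i])

# Hypothetical values: s = 2, d = 3, q = 2, tau = 1, r_X = 0.4
rho_star, gamma_star = best_rho(s=2, d=3, q=2, tau=1.0, r_X=0.4)
print(rho_star, gamma_star, gamma_star * 2)  # last entry is the exponent gamma * s
```

For $q=\infty$ one can pass `float("inf")`, in which case `1 / q` evaluates to zero.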

We now make stronger assumptions on $f_X$ and its estimate that yield, up to a logarithmic term, the convergence rate $N^{-s/(2s+2d-1)}$. We need to be able to trim the estimate of $f_X$ with a term which is logarithmic in $N$: $m_N = (\log N)^{-\rho}$ for some positive $\rho$.

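Assumption 4.3. Suppose for $q$ in $[1,\infty]$, and positive $r_\sigma$ and $r_X$,

(i) $\sigma\left(f_X < (\log N)^{-\rho}\right) = O\left( \left( \frac{N}{(\log N)^{2\rho+(1-2/q) I\{q \ge 2\}}} \right)^{-r_\sigma} \right)$,

and either

(ii) $\max_{i=1,\dots,N} \left| f_X(x_i) - \hat f_X(x_i) \right| = O_p\left( (\log N)^{-2\rho} \left( \frac{N}{(\log N)^{2\rho+(1-2/q) I\{q \ge 2\}}} \right)^{-r_X} \right)$

or

(iii) for some constant $C$, $\lim_{N\to\infty} (\log N)^{2\rho} \left( \frac{N}{(\log N)^{2\rho+(1-2/q) I\{q \ge 2\}}} \right)^{r_X} \max_{i=1,\dots,N} \left| f_X(x_i) - \hat f_X(x_i) \right| \le C$ a.s.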

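Corollary 4.2. Assume that $f_\beta$ belongs to $W^s_q(S^{d-1})$ with $q$ in $[1,\infty]$ and $s > 0$. Let Assumptions A.1, 3.1, 4.1 and 4.3 (i)-(ii) hold, and take

$T_N \asymp \left( \frac{N}{(\log N)^{2\rho+(1-2/q) I\{q \ge 2\}}} \right)^{\gamma}$

where

$\gamma = \min\left( \frac{1}{2s+2d-1},\ \frac{2r_\sigma}{2s+d+2(d-1)(1-1/q)},\ \frac{2r_X}{2s+d} \right).$

Then we have

(4.13) $\left\| \hat f_\beta - f_\beta \right\|_q = O_p\left( \left( \frac{N}{(\log N)^{2\rho+(1-2/q) I\{q \ge 2\}}} \right)^{-\gamma s} \right).$

Moreover, if, instead of Assumption 4.3 (ii), Assumption 4.3 (iii) holds with $q=\infty$, then there exists a constant $C$ such that

(4.14) $\lim_{N\to\infty} \left( \frac{N}{(\log N)^{2\rho+1}} \right)^{\gamma s} \left\| \hat f_\beta - f_\beta \right\|_\infty \le C$ a.s.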

When $f_X \in W^{s+d/2+\epsilon}(S^{d-1})$ for any positive $\epsilon$ then $\frac{2r_X}{2s+d} > \frac{1}{2s+2d-1}$ and $\gamma$ in Corollary 4.2 is simply $\min\left( \frac{1}{2s+2d-1},\ \frac{2r_\sigma}{2s+d+2(d-1)(1-1/q)} \right)$. Recall that the smoothness $s+d/2$ is related to the smoothness of $R$. Indeed, we have seen in Section 3 that $R \in W_2^{s+d/2}(S^{d-1})$ if and only if $f_\beta \in W_2^s(S^{d-1})$.

Consider now the most restrictive case where $f_X \ge m$ $\sigma$-a.e. The estimator without the trimming factor (i.e. $m_N = 0$) then satisfies the following:

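Corollary 4.3. Assume that $f_\beta$ belongs to $W^s_q(S^{d-1})$ with $q$ in $[1,\infty]$ and $s > 0$. Let Assumptions A.1, 3.1 and 4.1 hold, and suppose, for positive $r_X$,

(4.15) $\max_{i=1,\dots,N} \left| f_X(x_i) - \hat f_X(x_i) \right| = O_p\left( \left( \frac{N}{(\log N)^{(1-2/q) I\{q \ge 2\}}} \right)^{-r_X} \right).$

Take

$T_N \asymp \left( \frac{N}{(\log N)^{(1-2/q) I\{q \ge 2\}}} \right)^{\gamma}$

where

$\gamma = \min\left( \frac{1}{2s+2d-1},\ \frac{2r_X}{2s+d} \right).$

Then we have

(4.16) $\left\| \hat f_\beta - f_\beta \right\|_q = O_p\left( \left( \frac{N}{(\log N)^{(1-2/q) I\{q \ge 2\}}} \right)^{-\gamma s} \right).$

Moreover, if we replace (4.15) by: for some positive $C$,

(4.17) $\left( \frac{N}{(\log N)^{(1-2/q) I\{q \ge 2\}}} \right)^{r_X} \max_{i=1,\dots,N} \left| f_X(x_i) - \hat f_X(x_i) \right| \le C$ a.s.

then

(4.18) $\lim_{N\to\infty} \left( \frac{N}{\log N} \right)^{\gamma s} \left\| \hat f_\beta - f_\beta \right\|_\infty \le C$ a.s.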

When $f_X$ belongs to $W^{s-d/2+\epsilon}$, for arbitrary positive $\epsilon$, $\gamma = \frac{1}{2s+2d-1}$ in Corollary 4.3, and we recover the $L^2$ convergence rate of $N^{-s/(2s+2d-1)}$, the rate mentioned in Section 4.1. It is in accordance with the $L^2$ rate in Healy and Kim (1996), who study deconvolution on $S^2$ for non-degenerate kernels. Kim and Koo (2000) prove that the rate in Healy and Kim (1996) is optimal in the minimax sense. Their statistical problem, however, involves neither a plug-in method nor trimming. Also, somewhat less importantly, it does not cover the case when the convolution kernel is given by an indicator function, which appears in our operator $\mathcal{H}$. Hoderlein et al. (2010) study a linear model of the form $W = X\beta$ where $\beta$ is a $d$-vector of random coefficients. They obtain a nonparametric random coefficients density estimator that has the $L^2$-rate $N^{-s/(2s+2d-1)}$ when $f_X \ge m$ $\sigma$-a.e. for positive $m$,³ that is, when $f_X$ is bounded from below and thus no trimming is required. They also consider trimming but the approach is slightly different and rates of convergence are not given. Unlike the previous results, we cover $L^q$ loss for all $q \in [1,\infty]$.
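The role of $r_X$ in Corollary 4.3 can be made concrete with a short calculation. This is a hedged sketch with hypothetical function names: it evaluates $\gamma$ as stated in the corollary and the value of $r_X$ at which the two terms of the minimum coincide, so that above this value the plug-in estimate $\hat f_X$ no longer limits the rate.

```python
def gamma_cor43(s, d, r_X):
    """gamma of Corollary 4.3: min(1 / (2s + 2d - 1), 2 r_X / (2s + d))."""
    return min(1 / (2 * s + 2 * d - 1), 2 * r_X / (2 * s + d))

def r_X_threshold(s, d):
    """Value of r_X at which both terms of the minimum are equal."""
    return (2 * s + d) / (2 * (2 * s + 2 * d - 1))

# Hypothetical smoothness and dimension
s, d = 2, 3
print(r_X_threshold(s, d))            # crossover value of r_X
print(s * gamma_cor43(s, d, 0.5))     # exponent gamma * s when r_X is above it
print(s * gamma_cor43(s, d, 0.2))     # exponent gamma * s when r_X is below it
```

When `r_X` is at least the threshold, the exponent `s * gamma` equals $s/(2s+2d-1)$.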

4.3. Pointwise Asymptotic Normality. This section discusses the asymptotic normality property of our estimator.

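Theorem 4.2 (Asymptotic normality). Suppose $f_\beta$ belongs to $W^s(S^{d-1})$ with $s > 0$, and Assumptions A.1, 3.1 and 4.1 hold. If $\hat f_X$, $f_X$, $m_N$ and $T_N$ satisfy

(4.19) $N^{1/2}\, T_N^{-(d-1)/2}\, m_N^{-2} \max_{i=1,\dots,N} \left| f_X(x_i) - \hat f_X(x_i) \right| = o_p(1),$

(4.20) $N^{-1/2}\, T_N^{(d-1)/2}\, m_N^{-(1+\epsilon)} = o(1)$ for some $\epsilon > 0$,

(4.21) $N^{1/2}\, T_N^{-\frac{2s+2d-1}{2}} = o(1),$

(4.22) $N^{1/2}\, T_N^{(d-1)/2}\, \sigma(\{f_X < m_N\}) = o(1),$

then

(4.23) $N^{1/2} s_N^{-1}(b) \left( \hat f_\beta(b) - f_\beta(b) \right) \xrightarrow{d} N(0,1)$

holds for $b$ such that $f_\beta(b) \ne 0$, where $s_N^2(b) := \mathrm{var}(Z_N(b))$ and

$Z_N(b) = \frac{2(2Y-1)\, \mathcal{H}^{-1}\left( K_{2T_N}(X,\cdot) \right)(b)}{\max(f_X(X), m_N)}.$

The standard error $s_N(b)$ is the standard deviation of

(4.24) $Z_N(b) = \frac{2}{|S^{d-1}|} \sum_{p=0}^{T_N-1} \frac{\chi(2p+1, 2T_N)\, h(2p+1, d)}{\lambda(2p+1, d)\, C_{2p+1}^{\nu(d)}(1)} \cdot \frac{(2Y-1)\, C_{2p+1}^{\nu(d)}(X'b)}{\max(f_X(X), m_N)}$

(see equation (4.8)), which can be estimated using an estimate $\hat f_X$ of $f_X$.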
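The normal limit in Theorem 4.2 suggests a pointwise confidence interval of the usual form. The following is a minimal sketch under assumptions: `z_values` is a hypothetical array holding the sample evaluations of $Z_N(b)$ (with $\hat f_X$ plugged in for $f_X$), `f_hat_b` is the value of the estimator at $b$, and $s_N(b)$ is replaced by the sample standard deviation. The function name and inputs are illustrative and not part of the paper.

```python
import math
from statistics import NormalDist

import numpy as np

def pointwise_ci(f_hat_b, z_values, level=0.95):
    """Interval f_hat(b) +/- crit * s_N(b) / sqrt(N) from the N(0, 1) limit."""
    z = np.asarray(z_values, dtype=float)
    n = z.size
    s_n = z.std(ddof=1)                              # sample analogue of s_N(b)
    crit = NormalDist().inv_cdf(0.5 + level / 2)     # two-sided normal quantile
    half_width = crit * s_n / math.sqrt(n)
    return f_hat_b - half_width, f_hat_b + half_width

# Toy numbers, for illustration only
low, high = pointwise_ci(2.5, [1.0, 2.0, 3.0, 4.0])
print(low, high)
```

The interval is only meaningful at points $b$ where the conditions of the theorem apply, in particular $f_\beta(b) \ne 0$.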

The next theorem is concerned with the restrictive case where the density of the covariates is bounded from below and hence the trimming factor mN is set at zero.

³ Note that the dimension of their estimator is $d$, whereas that of ours is $d-1$. On the other hand, in their problem $W$ is observable, and it is obviously more informative than our binary outcome $Y$, which causes difficulties both in identification and estimation.

Theorem 4.3 (Asymptotic normality when the density of the covariates is bounded from below). Suppose $f_\beta$ belongs to $W^s(S^{d-1})$ with $s > 0$, and Assumptions A.1, 3.1 and 4.1 hold. If $\hat f_X$, $f_X$ and $T_N$ satisfy

(4.25) $N^{1/2}\, T_N^{-(d-1)/2} \max_{i=1,\dots,N} \left| f_X(x_i) - \hat f_X(x_i) \right| = o_p(1),$

(4.26) $N^{-1/2}\, T_N^{(d-1)/2} = o(1),$

(4.27) $N^{1/2}\, T_N^{-\frac{2s+2d-1}{2}} = o(1),$

then

(4.28) $N^{1/2} s_N^{-1}(b) \left( \hat f_\beta(b) - f_\beta(b) \right) \xrightarrow{d} N(0,1)$

holds for $b$ such that $f_\beta(b) \ne 0$, where $s_N^2(b) := \mathrm{var}(Z_N(b))$ and

$Z_N(b) = \frac{2(2Y-1)\, \mathcal{H}^{-1}\left( K_{2T_N}(X,\cdot) \right)(b)}{f_X(X)}.$

A formula for $Z_N$ for this case is obtained by replacing $\max(f_X(X), m_N)$ with $f_X(X)$ in (4.24).

5. Discussion

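5.1. Estimation of Marginals. In Section 3 we have provided an expression for the estimator of the full joint density of $\beta$, from which an estimator for a marginal density can be obtained. Let $\sigma_k$ denote the surface measure and $\overline{\sigma}_k = \sigma_k/|S^k|$ the uniform probability measure on $S^k$. We write $\beta = (\underline{\beta}, \overline{\beta})$ and wish to obtain the density of the marginal of $\overline{\beta}$, which is a vector of dimension $k$. Also define $\overline{P}$ and $\underline{P}$ the projectors such that $\overline{\beta} = \overline{P}\beta$ and $\underline{\beta} = \underline{P}\beta$, and denote by $\overline{P}_*\overline{\sigma}_{d-1}$ and $\underline{P}_*\overline{\sigma}_{d-1}$ the direct image probability measures. One possibility is to define the marginal law of $\overline{\beta}$ as the measure $\overline{P}_* P_\beta$, where $dP_\beta = f_\beta\, d\sigma$. This may not be convenient, however, since the uniform distribution over $S^{d-1}$ would have U-shaped marginals. The U-shape becomes more pronounced as the dimension of $\beta$ increases.

In order to obtain a flat density for the marginals of the uniform joint distribution on the sphere it is enough to consider densities with respect to the dominating measure $\overline{P}_*\overline{\sigma}_{d-1}$. Notice that sampling $U$ uniformly on $S^{d-1}$ is equivalent to sampling $\overline{U}$ according to $\overline{P}_*\overline{\sigma}_{d-1}$ and then, given $\overline{U}$, forming $\rho(\overline{U})\, V$ where $V$ is a draw from the uniform distribution $\overline{\sigma}_{d-1-k}$ on $S^{d-1-k}$ and $\rho(\overline{U}) = \sqrt{1 - \|\overline{U}\|^2}$. Indeed, given $\overline{U}$, $\underline{U}/\rho(\overline{U})$ is uniformly distributed on $S^{d-1-k}$. Thus, when $g$ is an element of $L^1(S^{d-1})$ we can write for $k$ in $\{1, \dots, d-1\}$,

(5.1) $\int_{S^{d-1}} g(b)\, d\overline{\sigma}_{d-1}(b) = \int_{B^k} \left( \int_{S^{d-1-k}} g\left( \rho(\overline{b})\, u, \overline{b} \right) d\overline{\sigma}_{d-1-k}(u) \right) d\overline{P}_*\overline{\sigma}_{d-1}(\overline{b})$

where $B^k$ is the $k$-dimensional ball of radius 1. Setting $g(b) = |S^{d-1}|\, f_\beta(b)\, I\{\overline{b} \in A\}$ for $A$ a Borel set of $B^k$ shows that the marginal density of $\overline{\beta}$ with respect to the dominating measure $\overline{P}_*\overline{\sigma}_{d-1}$ is given by

(5.2) $f_{\overline{\beta}}(\overline{b}) = |S^{d-1}| \int_{S^{d-1-k}} f_\beta\left( \rho(\overline{b})\, u, \overline{b} \right) d\overline{\sigma}_{d-1-k}(u).$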

One can use deterministic methods to compute the integral (e.g., Narcowich et al. (2006) for quadrature methods on the sphere) or, for example, one may use a Monte-Carlo method, by forming

(5.3) $\hat f^M_{\overline{\beta}}(\overline{b}) = \frac{|S^{d-1}|}{M} \sum_{j=1}^{M} \hat f_\beta\left( \rho(\overline{b})\, u_j, \overline{b} \right)$

where $u_j$, $j = 1, \dots, M$, are draws from independent uniform random variables on $S^{d-1-k}$.
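The Monte-Carlo formula (5.3) can be sketched in a few lines. This is an illustrative implementation under assumptions, not the authors' code: `f_beta` stands for any joint density (or its estimate) on $S^{d-1}$ taking a length-$d$ array, the $k$-dimensional component of interest is stored in the last $k$ coordinates, and uniform draws on $S^{d-1-k}$ are obtained by normalizing Gaussian vectors in $\mathbb{R}^{d-k}$. All function names are hypothetical.

```python
import math

import numpy as np

def sphere_area(d):
    """Surface area |S^{d-1}| of the unit sphere in R^d."""
    return 2.0 * math.pi ** (d / 2) / math.gamma(d / 2)

def uniform_on_sphere(n_draws, dim, rng):
    """Draw n_draws points uniformly on the unit sphere of R^dim."""
    g = rng.standard_normal((n_draws, dim))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

def mc_marginal(f_beta, b_bar, d, M=10_000, seed=0):
    """Monte-Carlo approximation of the marginal density at b_bar, as in (5.3)."""
    b_bar = np.atleast_1d(np.asarray(b_bar, dtype=float))
    k = b_bar.size
    rho = math.sqrt(max(0.0, 1.0 - float(b_bar @ b_bar)))
    rng = np.random.default_rng(seed)
    u = uniform_on_sphere(M, d - k, rng)     # S^{d-1-k} sits in R^{d-k}
    # Points (rho * u_j, b_bar) on S^{d-1}, one per row
    points = np.hstack([rho * u, np.tile(b_bar, (M, 1))])
    return sphere_area(d) / M * sum(f_beta(p) for p in points)

# Uniform joint density on S^3: the marginal should be flat and equal to 1
area = sphere_area(4)
print(mc_marginal(lambda b: 1.0 / area, [0.2, -0.1], d=4, M=500))
```

With the uniform joint density the output is flat in `b_bar`, which matches the motivation for taking densities with respect to the dominating measure described above.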

5.2. Treatment of Non-Random Coefficients. It may be useful to develop an extension of the method described in the previous sections to models that have non-random coefficients, for at least two reasons.⁴ First, the convergence rate of our estimator of the joint density of $\beta$ slows down as the dimension $d$ of $\beta$ grows, which is a manifestation of the curse of dimensionality. Treating some coefficients as fixed parameters alleviates this problem. Second, our identification assumption in Section 3 precludes covariates with discrete or bounded support. This may not be desirable as many random coefficient discrete choice models in economics involve dummy variables as covariates. As we shall see shortly, identification is possible in a model where the coefficients on covariates with limited support are non-random, provided that at least one of the covariates with “large support” has a non-random coefficient as well. More precisely, consider the model:

(5.4) $Y_i = I\left\{ \beta_{1i} + \beta_{2i}' X_{2i} + \alpha_1 Z_{1i} + \alpha_2' Z_{2i} \ge 0 \right\}$

where $\beta_1 \in \mathbb{R}$ and $\beta_2 \in \mathbb{R}^{d_X-1}$ are random coefficients, whereas the coefficients $\alpha_1 \in \mathbb{R}$ and $\alpha_2 \in \mathbb{R}^{d_Z-1}$ are nonrandom. The covariate vector $(Z_1, Z_2)$ is in $\mathbb{R}^{d_Z}$, though the $(d_Z-1)$-subvector $Z_2$ might have limited support: for example, it can be a vector of dummies. The covariate vector $(X_2, Z_1)$ is assumed to be, among other things, continuously distributed. Normalizing the coefficients vector and the vector of covariates to be elements of the unit sphere works well for the development of our procedure, as we have seen in the previous sections. The model (5.4), however, is presented “in the original scale” to avoid confusion.
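To fix ideas, the outcome equation (5.4) can be written as a small function. This is a sketch for illustration only: the array shapes, the function name and the toy numbers are assumptions, with one row per individual $i$, random coefficients varying across rows and the non-random coefficients shared by all rows.

```python
import numpy as np

def binary_choice(beta1, beta2, X2, alpha1, alpha2, Z1, Z2):
    """Outcome Y_i = 1{beta_1i + beta_2i'X_2i + alpha_1 Z_1i + alpha_2'Z_2i >= 0}.

    beta1: (N,) random intercepts; beta2, X2: (N, d_X - 1);
    alpha1: scalar; alpha2: (d_Z - 1,); Z1: (N,); Z2: (N, d_Z - 1).
    """
    index = beta1 + np.sum(beta2 * X2, axis=1) + alpha1 * Z1 + Z2 @ alpha2
    return (index >= 0).astype(int)

# Two individuals, one random slope and one dummy covariate (toy values)
beta1 = np.array([0.5, -2.0])
beta2 = np.array([[1.0], [1.0]])
X2 = np.array([[1.0], [0.5]])
Z1 = np.array([0.0, 1.0])
Z2 = np.array([[1.0], [0.0]])          # dummy variable with limited support
y = binary_choice(beta1, beta2, X2, 1.0, np.array([-1.0]), Z1, Z2)
print(y)
```

Here the first individual has a latent index of 0.5 and the second of -0.5, so only the first outcome equals one.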

⁴ Hoderlein et al. (2010) suggest a method to deal with non-random coefficients in their treatment of random coefficient linear regression models.
