Nonparametric estimation in random coefficients binary choice models

(1)

HAL Id: hal-00403939

https://hal.archives-ouvertes.fr/hal-00403939v2

Preprint submitted on 1 Sep 2011

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Nonparametric estimation in random coeﬀicients binary choice models

Eric Gautier, Yuichi Kitamura

To cite this version:

Eric Gautier, Yuichi Kitamura. Nonparametric estimation in random coeﬀicients binary choice models.

2011. �hal-00403939v2�

(2)

CHOICE MODELS

ERIC GAUTIER AND YUICHI KITAMURA

Abstract. This paper considers random coefficients binary choice models. The main goal is to estimate the density of the random coefficients nonparametrically. This is an ill-posed inverse problem characterized by an integral transform. A new density estimator for the random coefficients is developed, utilizing Fourier-Laplace series on spheres. This approach offers a clear insight on the identification problem. More importantly, it leads to a closed form estimator formula that yields a simple plug-in procedure requiring no numerical optimization. The new estimator, therefore, is easy to implement in empirical applications, while being flexible about the treatment of unobserved heterogeneity. Extensions including treatments of non-random coefficients and models with endogeneity are discussed.

1. Introduction Consider a binary choice model

(1.1) Y =I

X^′β ≥0

where I denotes the indicator function and X is a d-vector of covariates. We assume that the first element of X is 1, therefore the vector X is of the formX = (1,X˜^′)^′. The vector β is random. The random element (Y,X, β) is defined on some probability space (Ω,˜ F,P), and (y_i,x˜_i, β_i), i = 1, ..., N denote its realizations. The econometrician observes (y_i,x˜_i), i = 1, ..., N, but β_i, i = 1, ..., N remain

Date: This Version: August 31, 2011.

Keywords: Inverse problems, Discrete Choice Models.

We thank Whitney Newey and two anonymous referees for comments that greatly improved this paper. We also thank seminar participants at Chicago, CREST, Harvard/MIT, the Henri Poincaré Institute, Hitotsubashi, LSE, Mannheim, Minnesota, Northwestern, NYU, Paris 6, Princeton, Rochester, Simon Fraser, Tilburg, Toulouse 1 University, UBC, UCL, UCLA, UCSD, the Tinbergen Institute and the University of Tokyo, and participants of the 2008 Cowles summer econometrics conference, EEA/ESEM, FEMES, Journées STAR, and SETA and 2009 CIRM Rencontres de Statistiques Mathématiques for helpful comments. Yuhan Fang and Xiaoxia Xi provided excellent research assistance. Kitamura acknowledges financial support from the National Science Foundation via grants SES-0241770, SES-0551271 and SES- 0851759. Gautier is grateful for support from the Cowles Foundation as this research was initiated during his visit as a postdoctoral associate.

1

(3)

unobserved. The vectors ˜Xandβ correspond to observed and unobserved heterogeneity across agents, respectively. Note that the first element of β in this formulation absorbs the usual scalar stochastic shock term as well as a constant in a standard binary choice model with non-random coefficients.

This formulation is used in Ichimura and Thompson (1998), and is convenient for the subsequent development in this paper. Our basic model maintains exogeneity of the covariates ˜X:

Assumption 1.1. β is independent of X,˜

Section 5.3 considers ways to relax this assumption. Under (1.1) and Assumption 1.1, the choice probability function is given by

r(x) =P(Y = 1|X =x) (1.2)

=E_β[I{x^′β >0}].

Discrete choice models with random coefficients are useful in applied research since it is often crucial to incorporate unobserved heterogeneity in modeling the choice behavior of individuals. There is a vast and active literature on this topic. Recent contributions include Briesch, Chintagunta and Matzkin (1996), Brownstone and Train (1999), Chesher and Santos Silva (2002), Hess, Bolduc and Polak (2005), Harding and Hausman (2006), Athey and Imbens (2007), Bajari, Fox and Ryan (2007) and Train (2003). A common approach in estimating random coefficient discrete choice models is to impose parametric distributional assumptions. A leading example is the mixed Logit model, which is discussed in details by Train (2003). If one does not impose a parametric distributional assumption, the distribution of β itself is the structural parameter of interest. The goal for the econometrician is then to recover it nonparametrically from the information aboutr(x) obtained from the data.

Nonparametric treatments for unobserved heterogeneity distributions have been considered in the literature for other models. Heckman and Singer (1984) study the issue of unobserved heterogeneity distributions in duration models and propose a treatment by a nonparametric maximum likelihood estimator (NPMLE). Elbers and Ridder (1982) also develop some identification results in such models.

Beran and Hall (1992) and Hoderlein et al. (2007) discuss nonparametric estimation of random coefficients linear regression models. Despite the tremendous importance of random coefficient discrete choice models, as exemplified in the above references, nonparametrics in these models is relatively underdeveloped. In their important paper, Ichimura and Thompson (1998) propose an NPMLE for the CDF of β. They present sufficient conditions for identification and prove the consistency of the NPMLE. The NPMLE requires high dimensional numerical maximization and can be computationally

(4)

intensive even for a moderate sample size. Berry and Haile (2008) explore nonparametric identification problems in a random coefficients multinomial choice model that often arises in empirical IO.

This paper considers nonparametric estimation of the random coefficients distribution, using a novel approach that shares some similarities with standard deconvolution techniques. This allows us to reconsider the identifiability of the model and obtain a constructive identification result. Moreover, we develop a simple plug-in estimator for the density of β that requires no numerical optimization or integration. It is easy to implement in empirical applications, while being flexible about the treatment of unobserved heterogeneity.

Since the scale of β is not identified in the binary choice model, we normalize it so that β is a vector of Euclidean norm 1 in R^d. The vector β then belongs to thed−1 dimensional sphere S^d−1. This is not a restriction as long as the probability that β is equal to 0 is 0. Also, since only the angle between X and β matters in the binary decision I{X^′β ≥ 0}, we can replace X by X/kXk without any loss of information. We therefore assume thatX is on the sphereS^d−1 as well in the subsequent analysis. Results from the directional data literature are thus relevant to our analysis. We aim to recover the joint probability density functionf_β ofβ with respect to the uniform spherical measure σ over S^d−1 from the random sample (y1, x1), . . . ,(yN, xN) of (Y, X).

The problem considered here is a linear ill-posed inverse problem. We can write

(1.3) r(x) =

Z

b∈S^d−1

I

x^′b≥0 f_β(b)dσ(b) = Z

H(x)

f_β(b)dσ(b) :=H(f_β) (x)

where the set H(x) is the hemisphere {b:x^′b≥0}. The mapping H is called the hemispherical transformation. Inversion of this mapping was first studied by Funk (1916) and later by Rubin (1999). Groemer (1996) also discusses some of its properties. H is not injective without further restrictions and conditions need to be imposed to ensure identification of f_β fromr. Even under a set of assumptions that guarantee identification, however, the inverse ofH is not a continuous mapping, making the problem ill-posed. To see this, suppose we restrict f_β to be in L²(S^d−1). Since the kernel of H is square integrable by compactness of the sphere, it is Hilbert-Schmidt and thus compact.

Therefore if the inverse ofHwere continuous, H⁻¹Hwould map the closed unit ball in L²(S^d−1) to a compact set. But the Riesz theorem states that the unit ball is relatively compact if and only if the vector space has finite dimension. The fact that L²(S^d−1) is an infinite dimensional space contradicts this. Therefore the inverse of H cannot be continuous. In order to overcome this problem, we use a one parameter family of regularized inverses that are continuous and converge to the inverse when

(5)

the parameter goes to infinity. This is a common approach to ill-posed inverse problems in statistics (see, e.g. Carrasco et al., 2007).

Due to the particular form of its kernel that involves the scalar productx^′b, the operatorH is an analogue of convolution inR^d, as illustrated in a simple example in Section A.1.1of Supplemental Appendix. This analogy provides a clear insight into the identification issue. In particular, our problem is closely related to the so-called boxcar deconvolution (see, e.g. Groeneboom and Jongbloed (2003) and Johnstone and Raimondo (2004)), where identifiability is often a significant problem. The connection with deconvolution is also useful in deriving an estimator based on a series expansion on the Fourier basis on S¹ or its extension to higher dimensional spheres called Fourier-Laplace series.

These bases are defined via the Laplacian on the sphere, and they diagonalize the operator H on L² S^d−1

. Such techniques are used in Healy and Kim (1996) for nonparametric empirical Bayes estimation in the case of the sphere S². The kernel of the integral operator H, however, does not satisfy the assumptions made by Healy and Kim. Unlike Healy and Kim (1996), we make use of so- called “condensed” harmonic expansions. The approach replaces a full expansion on a Fourier-Laplace basis by an expansion in terms of the projections on the finite dimensional eigenspaces of the Laplacian on the sphere. This is useful since an explicit expression of the kernel of the projector is available.

It enables us to work in any dimension and does not require a parametrization by hyperspherical coordinates nor the actual knowledge of an orthonormal basis. This approach, to the best of our knowledge, appears to be new in the econometrics literature.

The paper is organized as follows. Section 2 provides a practical guide for our procedure, which is easy to implement. Section 3 deals with identification while introducing basic notions used throughout the paper. We derive the convergence rates of the estimators in all the L^q spaces for q ∈ [1,∞] and also prove a pointwise CLT in Section4. Some extensions, such as estimation of marginals, treatments of models with non-random coefficients, and the case with endogenous regressors are presented in Section5. Simulation results are reported in Section6. Section7concludes. Supplemental Appendix presents analysis of a toy model, technical tools used in the main text, estimators for choice probabilities that are used to construct our density estimators, and the proofs of the main results.

2. A Brief Guide for Practical Implementation

This section presents our basic estimation procedure when a random sample{(y_i,x˜_i)}generated from the model (1.1) is available. As noted in Section 1, normalize covariates data and define x_i = (1,x˜^′_i)/k(1,x˜^′_i)k ∈ S^d−1, i = 1, ..., N. To estimate the joint density of the random vector β, use the

(6)

following formula:

(2.1) fˆ_β(b) = max



 2

|S^d−1|

TXN−1 p=0

χ(2p+ 1,2T_N)h(2p+ 1, d) λ(2p+ 1, d)C_2p+1^ν(d)(1)



1 N

XN i=1

(2y_i−1)C_2p+1^ν(d)(x^′_ib) max

fˆ_X(x_i), m_N



,0



.

The factors |S^d−1|, χ, h and λ are constants that do not depend on data and trivial to compute.

The surface area S^d−1 of S^d−1 is given by S^d−1 = _Γ(d/2)^2π^d/2 where Γ denotes the Gamma function.

The constants h,ν and λare obtained via the numerical formulas h(n, d) = (2n+d−2)(n+d−2)!

n!(d−2)!(n+d−2) ,ν(d) = (d−2)/2 andλ(2p+ 1, d) = ⁽⁻¹⁾^p^|^S^d−2|1·3···(2p−1)

(d−1)(d+1)···(d+2p−1), respectively. The functionχ is defined onN×Nand used for smoothing. This is to be chosen by the user: see Proposition A.3 as well as the numerical example reported in Section 6 for examples of χ. The truncation parameter T_N needs to be chosen so that it grows with the sample size with a sufficiently slow rate. The trimming factor mN is also user-defined, and it is chosen so that it goes to zero as the sample size increases. The notation C_n^ν(·) signifies the Gegenbauer polynomial¹; They, for example, correspond to the Chebychev polynomials of the first kind in the case of one random slope (i.e. the case withd= 2)². The only remaining factor which needs to be calculated in the above formula is the nonparametric density estimator ˆf_X forf_X on S^d−1. For example, the following nonparametric estimator can be used:

(2.2) fˆ_X(x) = max



 1

|S^d−1|

T_N^′

X

n=0

χ(n, T_N^′ )h(n, d) C_n^ν(d)(1)

1 N

XN i=1

C_n^ν(d)(x^′_ix)

! ,0





where T_N^′ is an another truncation parameter, playing a role similar toTN.

Our estimator ˆf_β requires neither numerical integration nor optimization. This is a clear advan- tage over existing estimators for random coefficient binary choice models, including many parametric estimators. This is our main proposal, on which the rest of the paper focuses. In Section 4we explain how the formula (2.1) is derived, and investigate its asymptotic properties.

1The Gegenbauer polynomials are given by

C_n^ν(t) =

[n/2]

X

l=0

(−1)^l(ν)n−l

l!(n−2l)! (2t)^n−2l, ν >−1/2, n∈N

where (a)0 = 1 and for n in N\ {0}, (a)n = a(a+ 1)· · ·(a+n−1) = Γ(a+n)/Γ(a). See SectionA.1.2 for further properties of the Gegenbauer polynomials.

2Whend= 2, the following relations can be used in (2.1) and (2.2)

∀p≥0, 1

|S^d−1|

h(2p+ 1,2)C_2p+1⁰ (x^′_ib)

λ(2p+ 1,2)C_2p+1⁰ (1) =(−1)^p(2p+ 1)

π cos (2p+ 1) arccos(x^′ib) ,

∀n≥0, 1

|S^d−1|

h(n,2)C_n⁰(x^′_ib) C_n⁰(1) = 1

πcos narccos(x^′_ib) .

(7)

3. Identification Analysis In this section we address the following two questions:

(Q1) Under what conditions isf_β identified?

(Q2) Does the random coefficients model impose restrictions?

To answer these questions it is useful to introduce the notion of the odd and even part of a function defined on the sphere.

Definition 3.1. We denote the odd part and the even part of a functionf by f⁻(b) = (f(b)−f(−b))/2

and

f⁺(b) = (f(b) +f(−b))/2, respectively, for every binS^d−1.

Let us start with the question (Q1). As noted in Section A.1.4, operating Hreduces the even part of a function to a constant 1 and therefore it is impossible to recover f_β⁺ from the knowledge of r, which is what observations offer. Our identification strategy is therefore as follows: (Step 1) Assume conditions that guarantee the identification of f_β⁻; then (Step 2) Show that f_β is uniquely determined from f_β⁻ under a reasonable assumption. We first consider Step 1. DefineH⁺ =H(n) = {x ∈ S^d−1 : x^′n ≥0}, where n = (1, ,0, ...,0)^′, that is, the northern hemisphere of S^d−1. For later use, also define its southern hemisphere H⁻ =H(−n). Since the model we consider has a constant as the first element of the covariate vector before normalization, the same vector after normalization is necessarily an element of H⁺. We make the following assumption, which also appears in Ichimura and Thompson (1998), and show that it achieves Step 1.

Assumption 3.1. The support of X isH⁺.

This assumption demands that ˜X, the vector of non-constant covariates in the original scale, is supported on the whole space R^d−1. It rules out discrete or bounded covariates; see Section 5 for a potential approach to deal with regressors with limited support. In what follows we assume that the law of X is absolutely continuous with respect to σ and denote its density by f_X. Step 1 of our identification argument is to show that the knowledge of r(x) on H⁺, which is available under Assumption 3.1, identifies f_β⁻. The problem at hand calls for solving r = Hf_β = ¹₂ +Hf_β⁻ for f_β⁻, and the inversion formula derived in (4.1) is potentially useful for the purpose. A direct application

(8)

of the formula to r is inappropriate, however, since it requires integration of r on the whole sphere S^d−1, butr is defined only onH⁺ even when ˜Xhas full support onR^d−1. An appropriate extension of r(x), x∈H⁺to the entireS^d−1 is in order. Using the random coefficients model (1.1) and Assumption 1.1, then noting that f_β is a probability density function, conclude

(3.1) H(f_β)(−x) = Z

H(−x)

f_β(b)dσ(b) = 1− H(f_β)(x) = 1−r(x) forx inH⁺. This suggests an extensionR of r to S^d−1 as follows:

(3.2) ∀x∈H⁺, R(x) =r(x), and ∀x∈H⁻, R(x) = 1−r(−x) = 1−R(−x).

The function Ris well-defined on the whole sphere under Assumption3.1. Later we derive a formula forf_β⁻ in terms ofR(x), x∈S^d−1, which shows the identifiability of f_β⁻ under Assumption3.1.

Note that

R(x) =R⁺(x) +R⁻(x) (3.3)

= 1

2[R(x) +R(−x)] +R⁻(x)

= 1

2[R(x) + (1−R(x))] +R⁻(x) by (3.2)

= 1

2 +R⁻(x)

thusR is completely determined by its odd part and therefore, R(x) = 1

2 +H f_β⁻

(x), or

(3.4) R⁻=Hf_β⁻.

We can invert this equation to obtain f_β⁻.

Now we turn to Step 2 in our identification argument. Obviously f_β⁻ does not uniquely de- terminef_β without further assumptions. This is a fundamental identification problem in our model.

We need to identify f_β from the choice probability function r, but we can choose an appropriate even functiong so that f_β+g is a legitimate density function (see the proof of Proposition 3.1for such a construction). Thenr=H(f_β+g), and the knowledge ofridentifies f_β only up to such a functiong.

Ichimura and Thompson (1998, Theorem 1) give a set of conditions that imply the identification of the model (1.1). One of their assumptions postulates that there existsconS^d−1such thatP(c^′β >0) = 1.

This, in our terminology, means that:

(9)

Assumption 3.2. The support of β is a subset of some hemisphere.

As noted by Ichimura and Thompson (1998), Assumption 3.2 does not seem too stringent in many economic applications. It is often reasonable to assume that an element of the random coefficients vector, such as a price coefficient, has a known sign. If thej-th element of β has a known sign (and positive), then Assumption3.2holds withcbeing a unit vector with itsj-th element being 1.

This is a case in which the location of the hemisphere in Assumption3.2is knowna priori, though the knowledge about its location is not necessary for identification. Assumption3.2 implies the following mapping fromf_β⁻ to f_β developed in (A.24):

(3.5) f_β(b) = 2f_β⁻(b)In

f_β⁻(b)>0o .

This is useful because it shows that Assumption 3.2 guarantees identification if f_β⁻ is identified.

Moreover, it will be used in the next section to develop a key formula that leads to a simple and practical estimator forf_β that is guaranteed to be non-negative.

Remark 3.1. Assumption3.2is testable since it imposes restrictions onf_β⁻, which is identified under weak conditions. For example, for values of b with f_β⁻(b) >0,f_β⁻(−b)<0 must hold. Or, it implies that f_β⁻ integrates to 1/(2|S^d−1|) on a hemisphere H(x) for some x, and −1/(2|S^d−1|) on the other H(−x).

The subsequent result, Proposition 3.1, answers question (Q2), and a proof is given in Supple- mental Appendix.

Notation. We use the notation L²(S^d−1) for the space of square integrable complex valued functions equipped with the hermitian product (f, g)_L2(S^d−1) = R

S^d−1f(x)g(x)dσ(x), and more generally use L^p(S^d−1) forp∈[1,∞] the Banach space of p-integrable functions andk · kp the corresponding norm.

We also use the notation W^s_p(S^d−1) (and H^s(S^d−1) for p = 2) to signify the corresponding Sobolev spaces with normk · k^p,s defined as

kfkp,s=kfkp+ −∆^Ss/2

f_p

where ∆^S denotes the Laplacian on the Sphere S^d−1: See Section A.1.3for further discussions.

Proposition 3.1. A [0,1]-valued function r is compatible with the random coefficients model (1.1) withf_β in L²(S^d−1) and Assumption 1.1 if and only ifr is homogeneous of degree 0 and its extension R according to (3.2) belongs to H^d/2(S^d−1).

(10)

The global smoothness assumption thatRbelongs to H^d/2(S^d−1) imposes substantial restriction on the property of observables, that is, the behavior of the choice probability function r. Note that the smoothness condition in this proposition is stated in terms ofR, and even if the choice probability functionr is sufficiently smooth on the support ofX, which isH⁺, it is not necessarily consistent with the random coefficients binary choice model (1.1) unless its extension is smooth globally onS^d−1. In particular, the Sobolev embedding of H^s(S^d−1) into the space of continuous functions fors >(d−1)/2 implies that if the extensionRis in H^d/2(S^d−1), it has to be continuous onS^d−1. This, in turn, means that the correspondingr has to satisfy certain matching conditions at a boundary pointx ofH⁺(i.e.

x^′n= 0) and its opposite point−x.

4. Nonparametric Estimation of f_β

4.1. Derivation of the closed form estimation formula. This section discusses how the closed form estimation formula (2.1) is derived. Suppose an odd function f⁻ defined on S^d−1 satisfies an integral equation f⁻=Hgwithgsquare integrable with respect to the spherical measure. In Section A.1.4we show that the solution to this equation is given by:

(4.1) H⁻¹(f⁻)(y) =

X∞ p=0

1 λ(2p+ 1, d)

Z

S^d−1

q_2p+1,d(x, y)f⁻(x)dσ(x)

where expressions for λand q are provided in PropositionA.4 and Theorem A.1, respectively. If an appropriate estimator ˆR⁻ of R⁻ is available, an application of the inversion formula (4.1) to (3.4) suggests the following estimator forf_β⁻:

fˆ_β⁻=H⁻¹ Rˆ⁻ (4.2)

= X∞ p=0

1 λ(2p+ 1, d)

Z

S^d−1

q_2p+1,d(·, x) ˆR⁻(x)dσ(x).

Then use the mapping (3.5) to define

(4.3) fˆ_β(b) = 2 ˆf_β⁻(b)In

fˆ_β⁻(b)>0o as an estimator for f_β.

We use the following notation in the rest of the paper:

Notation. For two sequences of positive numbers (a_n)_n∈Nand (b_n)_n∈N, we writea_n≍b_n when there exists a positive M such thatM⁻¹b_n≤a_n≤M b_n for every positiven.

(11)

PropositionA.6implies that if ˆf_β⁻−f_β⁻∈H^s(S^d−1) then ˆR⁻−R⁻∈H^σ(S^d−1),σ =s+^d₂ and forv∈[0, s],

(4.4) kfˆ_β⁻−f_β⁻k2,v ≍ kRˆ⁻−R⁻k2,v+d/2.

As discussed earlier, the estimation of f_β is related to deconvolution in S^d−1, and the degree of ill- posedness in our model isd/2, which is indeed the rate at which the absolute values of the eigenvalues ofH(c.f. PropositionA.4)λ(n, d), n= 2p+ 1, p∈Nconverges to zero aspgrows, as shown in (A.27).

Existing results for deconvolution problems (see, for example, Fan, 1991 and Kim and Koo, 2000) then suggest that we should be able to estimate f_β at the rate N⁻^2s+2d−1^s in the L²(S^d−1) provided that f_β ∈ H^s(S^d−1). The relationship (4.4), evaluated at v = 0, implies that this can be achieved if we can estimateR⁻at the rateN⁻

σ−d

2σ+d−12 in thek · k2,d/2 norm. The latter is the usual nonparametric rate for estimation of densities on d−1 dimensional smooth submanifolds of R^d (see, for example, Hendriks, 1990).

The estimation formula given in (4.2) is natural and reasonable, though it typically requires numerical evaluation of integrals to implement it. Moreover, in practice one needs to evaluate the infinite sum in (4.2), for example, by truncating the series. This results in a general estimator that can be written in the following two equivalent forms

fˆ_β⁻=H⁻¹ P_T_˜

N

Rˆ⁻ (4.5)

=

TN

X

p=0

1 λ(2p+ 1, d)

Z

S^d−1

q_2p+1,d(·, x) ˆR⁻(x)dσ(x) for suitably chosen ˜T_N that goes to infinity with N and P_T_˜

N defined in (A.20). The sequence H⁻¹◦ P_T_˜

N, N = 1,2, ... can be interpreted as regularized inverses of H, with the spectral cut-off method often used in statistical inverse problems.

We now discuss how to obtain ˆR⁻in the calculation of (4.5). The following choice is particularly convenient:

(4.6) Rˆ⁻(x) = 1

N XN i=1

(2y_i−1)K_2T⁻

N(x_i, x) max

fˆ_X(x_i), m_N

where m_N is a trimming factor going to 0 with the sample size, K⁻(x_i,·) denotes the odd part (of the second argument) of the kernel function K(x_i,·) defined in (A.23) and ˆf_X is a nonparametric density estimator forf_X. See SectionA.1.5of Supplemental Appendix for the derivation of the above formula. Various nonparametric estimators for f_X can be used in (4.6), since estimation of densities

(12)

on compact manifolds have been studied by several authors, using histogram (Ruymgaart (1989)), projection estimators (see, e.g. Devroye and Gyorfi (1985) for the circle and Hendriks (1990) for general compact Riemannian manifolds) or kernel estimators (see, e.g. Devroye and Gyorfi (1985) for the case of the circle, and Hall et al. (1987) and Klemel¨a (2000) for higher dimensional spheres).

Note also that Baldi et al. (2009) develops an adaptive density estimator on the sphere using needlet thresholding. In the simulation experiment we use

(4.7) fˆ_X(x) = max 1

N XN

i=1

K_T^′

N(x_i, x),0

!

for a suitably chosen T_N^′ that depends on the sample size and the smoothness of f_X and K_T^′

N is a kernel of the form (A.23) satisfying Assumption A.1. Note that its rate of convergence in sup- norm can be obtained in the same manner as the proof of Theorem 4.1. This estimator is in the spirit of the projection estimators of Hendriks (1990), but here we are able to derive a closed form using the condensed harmonic expansions together with the Addition Formula. Note also that K_T_N is a smoothed projection kernel (note the factor χ in (A.23)), which is used here in order to have good approximation properties in the L^q(S^d−1) norms with arbitrary q ∈[1,∞], in particular in the L^∞(S^d−1) norm.

Using (4.5) and (4.6) with ˜T_N = 2T_N, define fˆ_β⁻=H⁻¹

Rˆ⁻

=H⁻¹



 1 N

XN i=1

(2y_i−1)K_2T⁻

N(x_i,·) max

fˆ_X(x_i), m_N



.

Computing ˆf_β⁻ is straightforward. First, note that the estimator (4.6) for R⁻ resides in a finite dimensional space LTN

p=0H^2p+1,d, thereforeP_2T_NRˆ⁻= ˆR⁻ holds. Consequently, unlike in (4.5) where a general estimator for R⁻ is considered, we do not need to apply any additional series truncation to Rˆ⁻ prior to the inversion ofH. Second, the estimator requires no numerical integration. To see this, note the formula

H⁻¹

K_2T⁻_N(xi,·) (b) =

TXN−1 p=0

χ(2p+ 1,2TN)

λ(2p+ 1, d) q_2p+1,d(xi, b), which follows from

Z

S^d−1

q_2p+1,d(x, b)K_2T⁻

N(x, x_i)dσ(x) = Z

S^d−1

q_2p+1(x, b)

TXN−1 p^′=1

χ(2p^′+ 1,2T_N)q_2p^′_+1,d(x, x_i)dσ(x)

=χ(2p+ 1,2T_N)q_2p+1,d(b, x_i).

(13)

which, in turn, can be seen by the definition ofKT in (A.23), the fact that the integral operators with q as kernels are projections and (A.16). Thus

fˆ_β⁻(b) = 1 N

XN i=1

2y_i−1 max

fˆ_X(x_i), m_N

TXN−1 p=0

χ(2p+ 1,2T_N)

λ(2p+ 1, d) q_2p+1,d(x_i, b).

Using (4.3) and the Addition formula (Theorem A.1), we arrive at an estimator for f_β with the following explicit form:

fˆ_β(b) = 2 ˆf_β⁻(b)I{fˆ_β⁻(b)>0}, (4.8)

where ˆf_β⁻(b) = 1

|S^d−1|

TXN−1 p=0

χ(2p+ 1,2T_N)h(2p+ 1, d) λ(2p+ 1, d)C_2p+1^ν(d) (1)



 1 N

XN i=1

(2y_i−1)C_2p+1^ν(d) (x^′_ib) max

fˆ_X(x_i), m_N



.

This is equivalent to the formula (2.1) previously presented in Section2. Likewise, using the definition of the smoothing kernel (A.23) and the Addition Theorem in the above definition (4.7) of ˆf_X, we obtain the formula (2.2) as well.

4.2. Rates of Convergence inL^q(S^d−1)-norms. Now we analyze the rate of our estimator ˆf_β. The following assumption is weak and reasonable.

Assumption 4.1. fX ∈L^∞.

The proofs of the following theorems and corollaries in the rest of this section are given in SectionA.1.6 of Supplemental Appendix.

Theorem 4.1 (Upper bounds in L^q(S^d−1)). Suppose Assumptions A.1, 3.1and 4.1hold, and choose T_N that does not grow more than polynomially fast in N. If f_β⁻ belongs toW^s_q(S^d−1) with q in [1,∞] and s >0, and

(4.9) max

i=1,...,N

f_X(x_i)−fˆ_X(x_i)=O_p(m_N), then, for any 1≤r≤q,

fˆ_β −f_β

q =O_p

m⁻¹_N N^−1/2T_N^(2d−1)/2(logN)^(1/2−1/q)^I^{q≥2}

+T_N^−s+T_N^d/2m⁻²_N max

i=1,...,N

f_X(x_i)−fˆ_X(x_i) Td/2+(d−1)(1−1/r)

n σ(f_X < m_N)^1/q−1/r+1 . (4.10)

(14)

When there existsm >0such thatfX ≥m σ a.e.on H⁺, the following holds for the estimator without the trimming factor (i.e. m_N = 0) when the estimator fˆ_X which is consistent in sup norm:

fˆ_β−f_β

q=O_p

N^−1/2T_N^(2d−1)/2(logN)^(1/2−1/q)^I^{q≥2}

+T_N^−s+T_N^d/2 max

i=1,...,N

f_X(x_i)−fˆ_X(x_i)

.

The first term in (4.10) is the stochastic error, the second term is the approximation bias, the third the plug-in error and the fourth the trimming bias. Note that Theorem 4.1 imposes the mild assumption (4.9); otherwise, we need to replaceT_N^d/2m⁻²_N max_i=1,...,Nf_X(x_i)−fˆ_X(x_i) in (4.10) with T_N^d/2m⁻²_N max_i=1,...,Nf_X(x_i)−fˆ_X(x_i)

1 + (logN)^(1/2−1/q)^I^{q≥2}N^−1/2T_N^(d−1)/2 . Since

i=1,...,Nmax

f_X(x_i)−fˆ_X(x_i)≤f_X −fˆ_X

∞, this term can be made of order O_P

N logN

−v/(2v+d−1)

when f_X ∈ W^v_∞ with a suitably chosen parameter T_N^′ if we take (4.7) as an estimator. The proof of the latter statement is classical and can be obtained simplifying the proof of Theorem 4.1and Corollary 4.1. Equation (4.10) yields that, for proper choices of m_N going to zero and T_N to infinity, ˆf_β is consistent given that f_X has some smoothness in the Sobolev scales.

Though the additional condition of fX being bounded away from 0 in the last statement of Theorem4.1is convenient, it is restrictive. To see this, consider thed= 2 case. In polar coordinates, f_X(cos(θ),sin(θ)) = fX˜(tan(θ))(1 + tan²(θ), thus, assuming f_X ≥m on H⁺, which does not require trimming, yields

∀x∈R, f_X_˜(x)≥ m 1 +x².

It implies that ˜X has tails larger than Cauchy tails and all moments are infinite. The introduction of the trimming factor m_N allows us to relax the assumption f_X ≥m, though it introduces bias. As is clear from (4.10), the condition for the trimming bias to go to zero with N depends both on TN

and mN. The quantity σ(fX < mN) should decay to zero with N sufficiently fast. We can check, for example, that when ˜X is standard Gaussian thenσ(f_X < m_N) =O (−logm_N)^−1/2

, when it is Laplace thenσ(f_X < m_N) =O (−logm_N)⁻¹

and whenf_X_˜ is proportional to (1 +x²)^−k withk >1 we obtain that σ(f_X < m_N) =O

m^1/(2(k−1))_N

. In all these cases, it is possible to adjust adequately T_N and m_N and to obtain rates of convergence. The upper bound on the rates become slower as the tail of f_X becomes thinner.

(15)

Nonparametric estimation of the regression function with random degenerate design, in the sense that the density of regressors can be low on its support, is a difficult issue. It has been studied for the pointwise risk in Hall et al. 1997, Ga¨ıffas 2005, Ga¨ıffas 2009 and Guerre 1999. Extension to inverse problems setting is a widely open problem. We tackle this problem for our specific inverse problem. Future research includes the study of lower bounds from the minimax point of view that account for the degeneracy of the design.

Let us now return to the general case of d−1 regressors. The assumptions below allow us to obtain rates that differ slightly from the rates that we would obtain in the ideal case where f_X ≥m σ a.e.for positive m onH⁺.

Assumption 4.2. Suppose for q in [1,∞], there exist positive τ and rX such that (i) σ(f_X < h) =O(h^τ) and f_X ∈L^∞,

and either (ii)

i=1,...,Nmax

f_X(x_i)−fˆ_X(x_i)=O_p N

(logN)^(1−2/q)^I^{q≥2}

−rX!

or,

(iii) for some constant C, limN→∞

N

(logN)^(1−2/q)^I^{q≥2}

rX

i=1,...,Nmax

fX(xi)−fˆX(xi)≤C a.s.

holds.

As seen before, Assumption 4.2 (ii) or (iii) are very mild. (i) holds for a reasonable class of distributions for fX. In the above example where fX˜ is proportional to (1 +x²)^−k with k > 1, we have the relation τ =ρ/(2(k−1)). This allows for a higher order moment to exist for a largek.

Corollary 4.1. Assume that f_β⁻ belongs to W^s_q(S^d−1) with q in [1,∞] and s > 0. Let assumptions A.1, 3.1, 4.1 and 4.2(i) and (ii) hold, and take

m_N ≍

N

(logN)^(1−2/q)^I^{q≥2}

−ρ

, T_N ≍

N

(logN)^(1−2/q)^I^{q≥2}

γ(ρ)

where ρ yields a maximum γ of

γ(ρ) = min

1−2ρ

2s+ 2d−1, 2ρτ

2s+d+ 2(d−1)(1−1/q),2r_X −4ρ 2s+d , 1

d−1

.

(16)

We then have

(4.11) fˆ_β−f_β

q=O_p N

(logN)^(1−2/q)^I^{q≥2}

−γs! .

Moreover, if, instead of Assumption 4.2 (ii), Assumption4.2 (iii) holds with q=∞, then there exists a constant C such that

(4.12) lim_N→∞

N logN

γsfˆ_β −f_β

∞≤C a.s.

The rateγs in Corollary4.1 accounts for the dimensiond−1, the degree of smoothingd/2 of the operator and features of the density of the covariates (i.e. its smoothness and tail behavior).

We now make stronger assumptions on f_X and its estimate that yield, up to a logarithmic term, the convergence rate N⁻^2s+2d−1^s . We need to be able to trim the estimate of f_X with a term which is logarithmic in N: m_N = (logN)^−ρ for some positive ρ.

Assumption 4.3. Suppose for q in [1,∞], and positive r_σ and r_X, (i) σ(fX <(logN)^−ρ) =O

N

(logN)^{2ρ+(1−2/q)}^I^{q≥2}

−rσ , and either

(ii)

i=1,...,Nmax

f_X(x_i)−fˆ_X(x_i)=O_p (logN)^−2ρ

N

(logN)^{2ρ+(1−2/q)}^I^{q≥2}

−rX!

or,

(iii) for some constant C, lim_N→∞(logN)^2ρ

N

(logN)^{2ρ+(1−2/q)}^I^{q≥2}

rX

i=1,...,Nmax

f_X(x_i)−fˆ_X(x_i)≤C a.s.

Corollary 4.2. Assume that f_β⁻ belongs to W_q^s(S^d−1) with q in [1,∞] and s > 0. Let assumptions A.1, 3.1, 4.1 and 4.3(i)-(ii), hold, and take

T_N ≍

N

(logN)^{2ρ+(1−2/q)}^I^{q≥2}

γ

where

γ = min

1

2s+ 2d−1, 2r_σ

2s+d+ 2(d−1)(1−1/q), 2r_X 2s+d

then we have

(4.13) fˆ_β−f_β

q =O_p N

(logN)^{2ρ+(1−2/q)}^I^{q≥2}

−γs! .

(17)

Moreover, if, instead of Assumption 4.3 (ii), Assumption (iii) holds with q =∞, then there exists a constant C such that

(4.14) lim_N→∞

N (logN)^2ρ+1

γsfˆ_β−f_β_∞≤C a.s.

When f_X ∈ W^s+d/2+ǫ∞ (S^d−1) for any positive ǫ then _2s+d^2r^X > _2s+2d−1¹ and γ in Corollary 4.2 is simply min

1

2s+2d−1,2s+d+2(d−1)(1−1/q)^2r^σ

. Recall that the smoothness s+d/2 is related to the smoothness of R. Indeed, we have seen in Section 3 that R ∈ W^s+d/2₂ (S^d−1) if and only if f_β ∈ W^s₂(S^d−1).

Consider now the most restrictive case where f_X ≥ m σ a.e., then the estimator without the trimming factor (i.e. m_N = 0) satisfies the following:

Corollary 4.3. Assume that f_β⁻ belongs to W^s_q(S^d−1) with q in [1,∞] and s > 0. Let assumptions A.1, 3.1and 4.1 hold, and suppose, for positiver_X,

(4.15) max

i=1,...,N

f_X(x_i)−fˆ_X(x_i)=O_p N

(logN)^(1−2/q)^I^{q≥2}

−rX! .

Take

T_N ≍

N

(logN)^(1−2/q)^I^{q≥2}

γ

where

γ = min

1

2s+ 2d−1, 2r_X 2s+d

then we have

(4.16) fˆ_β−f_β

q=O_p N

(logN)^(1−2/q)^I^{q≥2}

−γs! .

Moreover, if we replace (4.15) by for some positive C (4.17)

N

(logN)^(1−2/q)^I^{q≥2}

rX

i=1,...,Nmax

f_X(x_i)−fˆ_X(x_i)≤C a.s.

then

(4.18) lim_N→∞

N logN

γsfˆ_β −f_β_∞≤C a.s.

When f_X belongs to W∞^s−d/2+ǫ, for arbitrary positiveǫ,γ = _2s+2d−1¹ in Corollary 4.3, and we recover the L²convergence rate ofN^2s+2d−1^s , the rate mentioned in Section4.1. It is in accordance with the L² rate in Healy and Kim (1996) who study deconvolution on S² for non-degenerate kernels. Kim and Koo (2000) prove that the rate in Healy and Kim (1996) is optimal in the minimax sense. Their

(18)

statistical problem, however, involves neither a plug-in method nor trimming. Also, somewhat less importantly, it does not cover the case when the convolution kernel is given by an indicator function, which appears in our operatorH. Hoderlein et al. (2010) study a linear model of the formW =X^′β whereβis ad-vector of random coefficients. They obtain a nonparametric random coefficients density estimator that has the L²-rate N⁻^2s+2d−1^s when f_X ≥ mσ a.e. for positive m³ when f_X is assumed to be bounded from below and thus no trimming is required. They also consider trimming but the approach is slightly different and rates of convergence are not given. Unlike the previous results, we cover L^q loss for all q∈[1,∞].

4.3. Pointwise Asymptotic Normality. This section discusses the asymptotic normality property of our estimator.

Theorem 4.2 (Asymptotic normality). Suppose f_β⁻ belongs to W^s_∞(S^d−1) with s >0, and Assump- tions A.1, 3.1 and 4.1hold. If fˆ_X, f_X, m_N andT_N satisfy

N^1/2T_N^−(d−1)/2m⁻²_N max

i=1,...,N

f_X(x_i)−fˆ_X(x_i)=o_p(1), (4.19)

N^−1/2T_N^(d−1)/2m^−(1+ǫ)_N =o(1) for someǫ >0, (4.20)

N^1/2T⁻

2s+2d−1 2

N =o(1),

(4.21)

N^1/2T_N^(d−1)/2σ({fX < mN}) =o(1) (4.22)

then

(4.23) N¹²s⁻¹_N (b)

fˆ_β(b)−f_β(b) _d

→N(0,1)

holds for b such that f_β(b)6= 0, where s²_N(b) := var(Z_N(b)), Z_N(b) = 2^(2Y^−1)H

−1 K₂⁻

TN(X,·) (b) max(fX(X),mN) . The standard errorsN(b) is the standard deviation of

(4.24) ZN(b) = 2

|S^d−1|

TXN−1 p=0

χ(2p+ 1,2T_N)h(2p+ 1, d) λ(2p+ 1, d)C_2p+1^ν(d)(1)

(2Y −1)C_2p+1^ν(d)(X^′b) max(f_X(X), m_N)

!

(see equation (4.8)), which can be estimated using an estimate ˆfX of fX.

The next theorem is concerned with the restrictive case where the density of the covariates is bounded from below and hence the trimming factor m_N is set at zero.

3Note that the dimension of their estimator isd, whereas that of ours isd−1. On the other hand, in their problem W is observable, and it is obviously more informative than our binary outcome Y, which causes difficulties both in identification and estimation.

(19)

Theorem 4.3 (Asymptotic normality when the density of the covariates is bounded from below).

Suppose f_β⁻ belongs toW^s_∞(S^d−1) with s >0, and Assumptions A.1, 3.1and 4.1hold. If fˆ_X, f_X and T_N satisfy

N^1/2T_N^−(d−1)/2 max

i=1,...,N

f_X(x_i)−fˆ_X(x_i)=o_p(1), (4.25)

N^−1/2T_N^(d−1)/2 =o(1) (4.26)

N^1/2T⁻

2s+2d−1

N 2 =o(1), (4.27)

then

(4.28) N¹²s⁻¹_N (b)

fˆ_β(b)−f_β(b) _d

→N(0,1)

holds for b such that f_β(b)6= 0, where s²_N(b) := var(Z_N(b)), Z_N(b) = 2^(2Y^−1)H

−1

K₂⁻_TN(X,·) (b)

fX(X) .

A formula forZ_N for this case is obtained by replacing max(f_X(X), m_N) withf_X(X) in (4.24).

5. Discussion

5.1. Estimation of Marginals. In Section3we have provided an expression for the estimator of the full joint density ofβ, from which an estimator for a marginal density can be obtained. Letσ_kdenote the surface measure andσ_k=σ_k/|S^k|the uniform probability measure onS^k. We writeβ =

β^′, β^′′

and wish to obtain the density of the marginal ofβwhich is a vector of dimensionk. Also defineP and P the projectors such that β =P β and β =P β and denote by P_∗σ_d−1 and P_∗σ_d−1 the direct image probability measures. One possibility is to define the marginal law of β as the measure P_∗P_β, where dP_β =f_βdσ. This may not be convenient, however, since the uniform distribution over S^d−1 would have U-shaped marginals. The U-shape becomes more pronounced as the dimension of β increases.

In order to obtain a flat density for the marginals of the uniform joint distribution on the sphere it is enough to consider densities with respect to the dominating measureP_∗σ_d−1. Notice that samplingU uniformly onS^d−1is equivalent to samplingU according toP∗σ_d−1and then givenU formingρ

U V whereV is a draw from the uniform distributionσ_d−1−k onS^d−1−k andρ

U

= r

1−U². Indeed givenU,U /ρ

U

is uniformly distributed onS^d−1−k. Thus, whengis an element of L¹(S^d−1) we can write for k in{1, . . . , d−1},

(5.1)

Z

S^d−1

g(b)dσ_d−1(b) = Z

B^k

Z

S^d−1−k

g ρ

b u, b

dσ_d−1−k(u)

dP_∗σ_d−1 b

(20)

whereB^k is thekdimensional ball of radius 1. Settingg=|S^d−1|f_β(b)In b∈Ao

forABorel set ofB^k shows that the marginal density of β with respect to the dominating measureP_∗σ_d−1 is given by

(5.2) f

β

b

=|S^d−1| Z

S^d−1−k

f_β ρ

b u, b

dσ_d−1−k(u).

One can use deterministic methods to compute the integral (e.g., Narcowich et al. (2006) for quadra- ture methods on the sphere) or for example one may use a Monte-Carlo method, by forming

(5.3) fˆ^M

β

b

= |S^d−1| M

XM j=1

fˆ_β ρ

b u_j, b

where uj, j = 1, ..., M are draws from independent uniform random variables onS^d−1−k.

5.2. Treatment of Non-Random Coefficients. It may be useful to develop an extension of the method described in the previous sections to models that have non-random coefficients, at least for two reasons.⁴ First, the convergence rate of our estimator of the joint density of β slows down as the dimension d of β grows, which is a manifestation of the curse of dimensionality. Treating some coefficients as fixed parameters alleviates this problem. Second, our identification assumption in Section3 precludes covariates with discrete or bounded support. This may not be desirable as many random coefficient discrete choice models in economics involve dummy variables as covariates. As we shall see shortly, identification is possible in a model where the coefficients on covariates with limited support are non-random, provided that at least one of the covariates with “large support” has a non-random coefficient as well. More precisely, consider the model:

(5.4) Y_i=I{β_1i+β_2i^′ X_2i+α₁Z_1i+α^′₂Z_2i≥0}

whereβ₁ ∈Randβ₂ ∈R^d^X⁻¹are random coefficients, whereas the coefficientsα₁ ∈Randα₂ ∈R^d^Z⁻¹ are nonrandom. The covariate vector (Z1, Z₂^′)^′ is in R^d^Z, though the (dZ −1)-subvector Z2 might have limited support: for example, it can be a vector of dummies. The covariate vector (X₂^′, Z₁)^′ is assumed to be, among other things, continuously distributed. Normalizing the coefficients vector and the vector of covariates to be elements of the unit sphere works well for the development of our procedure, as we have seen in the previous sections. The model (5.4), however, is presented “in the original scale” to avoid confusion.

4Hoderlein et al. (2010) suggest a method to deal with non-random coefficients in their treatment of random coefficient linear regression models.