Multivariate bandwidth selection for local linear regression

Lijian Yang

Michigan State University, East Lansing, USA

and Rolf Tschernig

Humboldt-Universität, Berlin, Germany

[Received December 1997. Final revision January 1999]

Summary. The existence and properties of optimal bandwidths for multivariate local linear regression are established, using either a scalar bandwidth for all regressors or a diagonal bandwidth vector that has a different bandwidth for each regressor. Both involve functionals of the derivatives of the unknown multivariate regression function. Estimating these functionals is difficult primarily because they contain multivariate derivatives. In this paper, an estimator of the multivariate second derivative is obtained via local cubic regression with most cross-terms left out. This estimator has the optimal rate of convergence but is simpler and uses much less computing time than the full local estimator. Using this as a pilot estimator, we obtain plug-in formulae for the optimal bandwidth, both scalar and diagonal, for multivariate local linear regression. As a simpler alternative, we also provide rule-of-thumb bandwidth selectors. All these bandwidths have satisfactory performance in our simulation study.

Keywords: Asymptotic optimality; Blocked quartic fit; Functional estimation; Partial local regression; Plug-in bandwidth; Rule-of-thumb bandwidth; Second derivatives

1. Introduction

Nonparametric estimation in general requires little a priori knowledge of the functions to be estimated. The estimation results, however, depend crucially on the choice of bandwidth. Whereas choosing too large a bandwidth may introduce a large bias, selecting too small a bandwidth may cause large estimation variance. An asymptotically optimal bandwidth usually exists and can be obtained through a bias–variance trade-off. Such an optimal bandwidth in general involves functionals of the unknown underlying functions, and the selection of the optimal bandwidth has always been a challenge. Bandwidth selection methods with good theoretical properties and practical performance exist for univariate density estimation, e.g. Jones et al. (1996a, b), Cheng (1997) and Grund and Polzehl (1998), and univariate local least squares regression, e.g. Fan and Gijbels (1995) and Ruppert et al. (1995). Wand and Jones (1993) provided an excellent analysis of bandwidth selection for bivariate density estimation, Sain et al. (1994) looked at multivariate cross-validation for density estimation bandwidth selection and Wand and Jones (1994) developed plug-in (PI) bandwidths for multivariate kernel density estimation. Herrmann et al. (1995) studied bandwidth selection for bivariate regression with fixed regular design, but it is unclear whether their method works for random designs and higher dimensions. Ruppert (1997) proposed the empirical bias bandwidth selector (EBBS) which could be applied in a multivariate setting. However, the theoretical properties of the EBBS are unknown and its practical performance has only been studied in the univariate setting. As pointed out by Ruppert (1997), the difficulty in obtaining a reliable multivariate data-driven bandwidth is essentially due to the complexity of estimating higher order multivariate derivatives. To address this difficulty, we estimate multivariate second-order derivatives by a local cubic fit, where all unnecessary terms are left out. This both solves the theoretical problem and makes a practical implementation feasible. We are not aware of any other work that makes use of the desirable properties of a partial local polynomial fit.

Address for correspondence: Lijian Yang, Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA.

E-mail: [email protected]

© 1999 Royal Statistical Society 1369–7412/99/61793

61, Part 4, pp. 793–815

Another new feature is the use of a different bandwidth for each of the regressors, in contrast with the EBBS method which uses the same bandwidth for all regressors. We have established some general assumptions that ensure the existence of asymptotically optimal bandwidths. This is not trivial because, when using a bandwidth vector, there is no simple closed formula for the solution of the bias–variance trade-off, and the existence of the solution needs to be proven implicitly. This use of a 'diagonal bandwidth' is a good compromise between flexibility and optimality on the one hand (which need more smoothing parameters, such as a bandwidth matrix) and interpretation and simplicity on the other hand (fewer parameters, such as one bandwidth for all explanatory variables). We have also studied the theoretical properties and practical performances of the latter option, which we call the 'scalar optimal bandwidth'.

The computation of our PI bandwidths requires the estimation of functionals that include derivatives up to the fourth order. The idea is to estimate unknown functionals by blocked polynomial fits, as in Härdle and Marron (1995) and Ruppert et al. (1995), but we had to refine the idea further. A quartic fit of a function of four variables could require up to 70 parameters in each block, which not only makes it impossible to have a sufficient number of blocks to adapt to local features of the function but also introduces much noise into the estimation within each block. Since the computation of one functional does not require all the derivatives included in a complete fourth-order Taylor expansion, we propose to use a partial quartic fit in the blocks, with only 23 parameters for the same function. This advantage grows drastically as the number of variables increases.

The PI bandwidths are necessarily computation intensive, so we also investigated rule-of-thumb (ROT) bandwidths as simpler alternatives.

The paper is organized as follows. In the next section, we show the existence of an asymptotically optimal bandwidth vector. Section 3 defines an ROT and a PI bandwidth selector for local linear regression, and asymptotic optimality is established for the PI bandwidth. In Section 4, we describe in detail how functionals of second derivatives are estimated via partial local cubic regression, while parallel results using a scalar bandwidth selector are summarized in Appendix A. Implementation of the methods proposed is discussed in Section 5. In Section 6 we describe the results of a rather comprehensive Monte Carlo study. Section 7 concludes, while all assumptions and proofs are contained in Appendix B.

The programs that are used in the analysis can be obtained from http://www.blackwellpublishers.co.uk/rss/

2. Background

To formulate the problem, consider a multivariate regression model

$$Y = f(X) + g(X)\varepsilon$$

where $Y$ is a scalar dependent variable, $X = (X_1, X_2, \ldots, X_d)$ is a vector of explanatory variables, $\varepsilon$ is independent of $X$, $E(\varepsilon) = 0$ and $\mathrm{var}(\varepsilon) = 1$. Let $(X_i, Y_i)$, $i = 1, 2, \ldots, n$, be an independent and identically distributed sample. Then the local linear estimator of $f$ at a given point $x = (x_1, x_2, \ldots, x_d)$ is obtained by doing a first-order Taylor expansion of the function $f$ at the point $x$ for all the data points $X_i$ and solving a least squares problem locally weighted by kernels: i.e. the estimator $\hat f(x)$ is the first element $c$ of the vector $\{c, (c_\alpha)_{1 \le \alpha \le d}\}$ that minimizes

$$\sum_{i=1}^{n} \Big\{ Y_i - c - \sum_{\alpha=1}^{d} c_\alpha (X_{i\alpha} - x_\alpha) \Big\}^2 K_{\mathbf h}(X_i - x)$$

where $K$ is a symmetric, compactly supported, univariate probability kernel (so that $K$ is non-negative and $\int K(u)\,du = 1$) and

$$K_{\mathbf h}(X_i - x) = \prod_{\alpha=1}^{d} \frac{1}{h_\alpha} K\Big( \frac{X_{i\alpha} - x_\alpha}{h_\alpha} \Big).$$
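The estimator described above can be sketched in a few lines. This is only an illustration under stated assumptions: the Epanechnikov kernel is one admissible choice of $K$ (the text fixes no particular kernel here), and the function names `epanechnikov` and `local_linear` are hypothetical, not taken from the authors' programs.

```python
import numpy as np

def epanechnikov(u):
    """One symmetric, compactly supported probability kernel (an assumed choice)."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def local_linear(x, X, Y, h, kernel=epanechnikov):
    """Local linear estimate of f at the point x.

    X is an (n, d) design matrix, Y an (n,) response vector and h a
    length-d bandwidth vector (use np.full(d, h0) for a scalar bandwidth).
    """
    x = np.asarray(x, dtype=float)
    h = np.asarray(h, dtype=float)
    diff = X - x                                    # rows are X_i - x
    w = np.prod(kernel(diff / h) / h, axis=1)       # product kernel K_h(X_i - x)
    Z = np.column_stack([np.ones(len(X)), diff])    # regressors 1 and (X_i - x)
    sw = np.sqrt(w)
    # Weighted least squares solved as ordinary least squares on scaled rows.
    coef, *_ = np.linalg.lstsq(Z * sw[:, None], Y * sw, rcond=None)
    return coef[0]                                  # intercept c estimates f(x)

# Noise-free linear surface: a local linear fit reproduces it exactly.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
Y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]
f_hat = local_linear([0.5, 0.5], X, Y, h=[0.3, 0.4])
```

Solving the weighted problem through `np.linalg.lstsq` on rows scaled by the square root of the weights avoids forming and inverting the normal equations explicitly.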

Denoting by $p(x)$ the density of $X$ and using a weight function $w(x)$, the weighted mean integrated squared error (MISE)

$$E \int \{\hat f(x) - f(x)\}^2 w(x) p(x)\,dx$$

is a function of the bandwidth vector $\mathbf h = (h_1, \ldots, h_d)^T$, and the $\mathbf h$ that minimizes this error is called the optimal bandwidth vector. An asymptotic formula of the MISE in this setting is given by

$$\mathrm{AMISE}(\mathbf h) = \frac{\sigma_K^4}{4} \sum_{\alpha,\beta=1}^{d} C_{\alpha\beta}(f)\, h_\alpha^2 h_\beta^2 + \|K\|_2^{2d} B(g) \frac{1}{n h_1 \cdots h_d}$$

in which $\sigma_K^2 = \int u^2 K(u)\,du$, $\|K\|_2^2 = \int K^2(u)\,du$ and

$$B(g) = \int g^2(x) w(x)\,dx, \qquad C_{\alpha\beta}(f) = \int f_{\alpha\alpha}(x) f_{\beta\beta}(x) p(x) w(x)\,dx, \quad \alpha, \beta = 1, \ldots, d,$$

where $f_{\alpha\alpha}(x)$ denotes the second derivative of $f(x)$ with respect to the $\alpha$th variable $x_\alpha$. For the general formula of AMISE using a bandwidth matrix, see p. 140 of Wand and Jones (1995).

We denote by $\mathbf C(f)$ the matrix whose $(\alpha, \beta)$th entry is $C_{\alpha\beta}(f)$, which is always non-negative definite. Also, we always have $B(g) > 0$ by assumption 2 in Appendix B.

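We introduce a function $Q: R_+^d \times M_+(d) \times R_+ \to R_+$,

$$Q(v, M, a) = \frac{1}{4} \sum_{\alpha,\beta=1}^{d} M_{\alpha\beta} v_\alpha v_\beta + \frac{a}{\sqrt{v_1 \cdots v_d}} = \frac{1}{4} v^T M v + \frac{a}{\sqrt{v_1 \cdots v_d}} \tag{2.1}$$

where $M_+(d)$ is a set of non-negative definite $d \times d$ matrices whose definition is given by expression (B.1) in Appendix B, $R_+ = (0, \infty)$ and $M_{\alpha\beta}$ is the $(\alpha, \beta)$th entry of a matrix $M$.

With $\mathbf h^2 = (h_1^2, \ldots, h_d^2)^T$, we then have

$$\mathrm{AMISE}(\mathbf h) = Q\Big\{ \mathbf h^2,\ \sigma_K^4 \mathbf C(f),\ \frac{\|K\|_2^{2d} B(g)}{n} \Big\}.$$

Simple algebra gives, for any constant $c > 0$,

$$Q(v, M, a) = a^{4/(d+4)} c^{d/(d+4)}\, Q\{a^{-2/(d+4)} c^{2/(d+4)} v,\ c^{-1} M,\ 1\}$$

and therefore

$$v(M, a) = a^{2/(d+4)} c^{-2/(d+4)}\, v(c^{-1} M, 1) \tag{2.2}$$

where $v(M, a)$ denotes the vector $v$ that minimizes $Q(v, M, a)$. By theorem 6, $v(M, a)$ is a well-defined and infinitely smooth function of $M$ and $a$. Using equation (2.2), we have the following theorem.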
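The rescaling identity behind equation (2.2) is easy to check numerically. The sketch below evaluates both sides for one arbitrary positive definite matrix; the matrix `M`, the vector `v` and the constants `a` and `c` are made-up illustrative inputs, not values from the paper.

```python
import numpy as np

def Q(v, M, a):
    """Q(v, M, a) = v'Mv / 4 + a / sqrt(v_1 ... v_d), as in equation (2.1)."""
    v = np.asarray(v, dtype=float)
    return 0.25 * v @ M @ v + a / np.sqrt(np.prod(v))

d = 3
M = np.array([[2.0, 0.3, 0.1],
              [0.3, 1.0, 0.2],
              [0.1, 0.2, 3.0]])      # arbitrary positive definite example
v = np.array([0.7, 1.3, 0.4])       # arbitrary point with positive entries
a, c = 0.05, 2.5                    # arbitrary positive constants

scale = a ** (-2 / (d + 4)) * c ** (2 / (d + 4))
lhs = Q(v, M, a)
rhs = a ** (4 / (d + 4)) * c ** (d / (d + 4)) * Q(scale * v, M / c, 1.0)
```

Because the two sides agree for every `v`, the minimizers are related by the same rescaling, which is what (2.2) states.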

Theorem 1. Under assumptions (1)–(4), $\mathrm{AMISE}(\mathbf h)$ is minimized by a unique vector which depends on $B(g)$ and $\mathbf C(f)$ smoothly via

$$\mathbf h_{\mathrm{opt}} = \Big\{ \frac{\|K\|_2^{2d} B(g)}{n \sigma_K^4} \Big\}^{1/(d+4)} v^{1/2}\{\mathbf C(f), 1\}. \tag{2.3}$$

The optimal bandwidth $\mathbf h_{\mathrm{opt}}$ given in formula (2.3) contains the unknown quantities $B(g)$ and $\mathbf C(f)$ and hence cannot be used directly in the estimation of $f(x)$. In Section 4, appropriate estimators of $f_{\alpha\alpha}(x)$ and $\mathbf C(f)$ are given based on a partial local cubic fit with many cross-terms left out. These estimators have very simple bias formulae, and their biases and variances are of the same order as those obtained when the cross-terms are kept in.

3. Bandwidth selection

A quick substitute for $\mathbf h_{\mathrm{opt}}$ is the simple ROT bandwidth defined as

$$\mathbf h_{\mathrm{ROT}} = \Big\{ \frac{\|K\|_2^{2d} B_{\mathrm{ROT}}(g)}{n \sigma_K^4} \Big\}^{1/(d+4)} v^{1/2}\{\mathbf C_{\mathrm{ROT}}(f), 1\} \tag{3.1}$$

in which $\mathbf C_{\mathrm{ROT}}(f)$ and $B_{\mathrm{ROT}}(g)$ are based on piecewise quartic estimation of $f(x)$, and whose formulae are given by expressions (5.1) and (5.2).

The bandwidth vector $\mathbf h_{\mathrm{ROT}}$ is of the same order as $\mathbf h_{\mathrm{opt}}$ but does not estimate $\mathbf h_{\mathrm{opt}}$ efficiently because $f$ may not always be represented by a piecewise fourth-order polynomial. The idea of such an ROT bandwidth exists in the univariate case; see Fan and Gijbels (1995) and Ruppert et al. (1995). The latter additionally used a blocking idea that we have also adopted.

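To improve on the ROT idea, we propose a PI bandwidth, so called as it approximates $\mathbf h_{\mathrm{opt}}$ by plugging in consistent estimates of the functionals $B(g)$ and $\mathbf C(f)$. Specifically, we define

$$\mathbf h_{\mathrm{PI}} = \Big\{ \frac{\|K\|_2^{2d} \hat B(g)}{n \sigma_K^4} \Big\}^{1/(d+4)} v^{1/2}\{\hat{\mathbf C}(f), 1\} \tag{3.2}$$

in which $\hat{\mathbf C}(f) = \{\hat C_{\alpha\beta}(f)\} = \{\hat C_{\alpha\beta}(f, h)\}$ where, for each $\alpha, \beta = 1, \ldots, d$,

$$\hat C_{\alpha\beta}(f) = \frac{1}{n} \sum_{i=1}^{n} \hat f_{\alpha\alpha}(X_i) \hat f_{\beta\beta}(X_i) w(X_i) = \int \hat f_{\alpha\alpha}(x) \hat f_{\beta\beta}(x) w(x) p(x)\,dx + O_p(n^{-1/2}) \tag{3.3}$$

and $\hat f_{\alpha\alpha}(x)$ is a partial local cubic estimator of $f_{\alpha\alpha}(x)$ given in equation (4.2). The estimator $\hat B(g)$ is defined in equation (5.3).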

Using standard results such as those contained in Wand and Jones (1995) or Fan and Gijbels (1996), we have $\hat B(g)/B(g) = 1 + O_p(n^{-2/(d+4)})$ as $n \to \infty$. On the basis of this, the asymptotics of $\hat{\mathbf C}(f)$ as given in corollary 2 and the smooth dependence of $\mathbf h_{\mathrm{opt}}$ on $\mathbf C(f)$ as in theorem 1, we have the following theorem.

Theorem 2. Under assumptions (1)–(4), for $n \to \infty$, the bandwidth defined in equation (3.2) is asymptotically optimal. In particular,

$$\mathbf h_{\mathrm{PI}} = \mathbf h_{\mathrm{opt}} \{ I_d + O_p(n^{-2/(d+6)}) \}.$$

Since $\mathbf h_{\mathrm{PI}}$ is a consistent substitute for the optimal bandwidth $\mathbf h_{\mathrm{opt}}$, it can be shown that $\mathbf h_{\mathrm{PI}}$ performs asymptotically similarly to $\mathbf h_{\mathrm{opt}}$ in terms of MISE. $\mathbf h_{\mathrm{PI}}$ will on average work better than $\mathbf h_{\mathrm{ROT}}$ since the latter is an inconsistent bandwidth estimator. Simulation results in Section 6 strongly support these expectations.

In Appendix A, we discuss the use of a scalar bandwidth $h$ instead of $\mathbf h$, i.e. $\mathbf h = (h, \ldots, h)$. In that case, the corresponding ROT bandwidth $h_{\mathrm{ROT}}$ and PI bandwidth $h_{\mathrm{PI}}$ are given by equations (A.2) and (A.3).

4. Partial local estimation

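In this section, we formulate an estimator $\hat f_{\alpha\alpha}(x)$ of $f_{\alpha\alpha}(x)$. In what follows, denote for any compactly supported function $L$

$$\mu_r(L) = \int_{-\infty}^{\infty} u^r L(u)\,du;$$

hence in particular $\mu_2(K) = \sigma_K^2$. We write $\mu_4(K) = \kappa \sigma_K^4$ where $\kappa$ denotes the kurtosis of the kernel $K$, which is always greater than 1. For the derivative functions, we denote

$$f_{\alpha\beta\gamma}(x) = \frac{\partial^3}{\partial x_\alpha \partial x_\beta \partial x_\gamma} f(x),$$

etc. Denote

$$Z = \big[ 1,\ \{(X_{i\beta} - x_\beta)^2\}_{1 \le \beta \le d},\ \{(X_{i\alpha} - x_\alpha)(X_{i\beta} - x_\beta)\}_{\beta \ne \alpha},\ (X_i - x)^T,\ \{(X_{i\alpha} - x_\alpha)(X_{i\beta} - x_\beta)^2\}_{\beta \ne \alpha},\ \{(X_{i\alpha} - x_\alpha)^2 (X_{i\beta} - x_\beta)\}_{\beta \ne \alpha},\ (X_{i\alpha} - x_\alpha)^3 \big]_{i=1}^{n}, \tag{4.1}$$

and then define

$$\hat f_{\alpha\alpha}(x) = 2 e_\alpha^T (Z^T W Z)^{-1} Z^T W Y \tag{4.2}$$

where $e_\alpha$ is a $(5d - 1)$-vector of 0s whose $(\alpha + 1)$-element is 1, $Y = (Y_i)_{1 \le i \le n}$ and, with a scalar bandwidth $\mathbf h = (h, \ldots, h)$,

$$W = \mathrm{diag}\Big\{ \frac{1}{n} K_{\mathbf h}(X_i - x) \Big\}_{i=1}^{n};$$

see Ruppert and Wand (1994) for a general definition of a local polynomial estimator. We then obtain the following theorem.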
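As an illustration of definitions (4.1) and (4.2), the sketch below builds the partial local cubic design matrix and reads off the second-derivative estimate. It rests on assumptions: the column grouping follows the reading of (4.1) in which the squares and linear terms run over all $d$ directions and the remaining terms over $\beta \ne \alpha$ (which gives $5d - 1$ columns), the Epanechnikov kernel is an assumed choice, and the names `partial_cubic_design` and `second_derivative` are hypothetical.

```python
import numpy as np

def epanechnikov(u):
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def partial_cubic_design(X, x, alpha):
    """Columns of Z for direction alpha (0-based), 5d - 1 columns in total."""
    diff = X - np.asarray(x, dtype=float)
    d = diff.shape[1]
    others = [b for b in range(d) if b != alpha]
    da = diff[:, [alpha]]
    return np.hstack([
        np.ones((len(X), 1)),        # constant
        diff**2,                     # squares in all d directions
        da * diff[:, others],        # cross-terms involving direction alpha
        diff,                        # linear terms in all d directions
        da * diff[:, others]**2,     # cubic terms (alpha)(beta)^2
        da**2 * diff[:, others],     # cubic terms (alpha)^2(beta)
        da**3,                       # pure cubic term in direction alpha
    ])

def second_derivative(x, X, Y, h, alpha):
    """Estimate f_{alpha alpha}(x) as in (4.2) with a scalar bandwidth h."""
    diff = X - np.asarray(x, dtype=float)
    w = np.prod(epanechnikov(diff / h) / h, axis=1) / len(X)
    Z = partial_cubic_design(X, x, alpha)
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(Z * sw[:, None], Y * sw, rcond=None)
    return 2.0 * coef[1 + alpha]     # coefficient of the squared alpha term

# Noise-free quadratic surface lying in the span of the columns of Z.
rng = np.random.default_rng(1)
X = rng.uniform(size=(400, 2))
Y = 1 + 2 * X[:, 0] + 3 * X[:, 0]**2 - X[:, 1]**2 + X[:, 0] * X[:, 1]
f11_hat = second_derivative([0.5, 0.5], X, Y, h=0.4, alpha=0)
f22_hat = second_derivative([0.5, 0.5], X, Y, h=0.4, alpha=1)
```

With noise-free data from a polynomial in the span of the columns, the fit is exact, which makes the example a convenient sanity check of the column bookkeeping.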

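Theorem 3. Under assumptions (1)–(3), for $\alpha = 1, \ldots, d$, as $h \to 0$ and $n h^{d+4} \to \infty$, we have

$$\sqrt{n h^{d+4}}\, \{ \hat f_{\alpha\alpha}(x) - f_{\alpha\alpha}(x) - b_\alpha(x) h^2 \} \to N\{0, \sigma^2(x)\}$$

where

$$b_\alpha(x) = \frac{\mu_6(K) - \kappa \sigma_K^6}{12 (\kappa - 1) \sigma_K^4} f_{\alpha\alpha\alpha\alpha}(x) + \sum_{\beta \ne \alpha} \frac{\sigma_K^2}{2} f_{\alpha\alpha\beta\beta}(x), \tag{4.3}$$

$$\sigma^2(x) = \frac{4 \{ \mu_4(K^2) - 2 \mu_2(K^2) \sigma_K^2 + \|K\|_2^2 \sigma_K^4 \}\, \|K\|_2^{2(d-1)} g^2(x)}{\sigma_K^8 (\kappa - 1)^2 p(x)}. \tag{4.4}$$

The formation of the matrix $Z$ is much simpler than that of the corresponding matrix of a full local cubic expansion. This makes the explicit mathematical derivation of $(Z^T W Z)^{-1}$ easier and the computation of expression (4.2) less costly.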
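The kernel constants entering (4.3) and (4.4) can be obtained by numerical integration. The following sketch does so for the Epanechnikov kernel, which is an assumed example rather than a kernel prescribed in this section; the helper names `trapezoid` and `kernel_constants` are hypothetical.

```python
import numpy as np

def epanechnikov(u):
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def trapezoid(y, u):
    """Composite trapezoidal rule on the grid u."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(u)) / 2.0)

def kernel_constants(kernel, support=(-1.0, 1.0), m=200_001):
    u = np.linspace(support[0], support[1], m)
    k = kernel(u)
    mu = lambda r, L: trapezoid(u**r * L, u)    # mu_r(L) = int u^r L(u) du
    sigma2 = mu(2, k)                           # sigma_K^2 = mu_2(K)
    kappa = mu(4, k) / sigma2**2                # kurtosis: mu_4(K) = kappa sigma_K^4
    return {
        "sigma2_K": sigma2,
        "kappa": kappa,
        "norm2_K": mu(0, k**2),                 # ||K||_2^2
        "mu2_K2": mu(2, k**2),                  # mu_2(K^2)
        "mu4_K2": mu(4, k**2),                  # mu_4(K^2)
        # coefficient of the pure fourth derivative in the bias (4.3)
        "bias_const": (mu(6, k) - kappa * sigma2**3)
                      / (12.0 * (kappa - 1.0) * sigma2**2),
    }

consts = kernel_constants(epanechnikov)
```

For this kernel the closed forms are $\sigma_K^2 = 1/5$, $\kappa = 15/7$ and $\|K\|_2^2 = 3/5$, and the leading bias coefficient works out to $1/18$.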

On the basis of equation (4.2), the asymptotic properties of $\hat C_{\alpha\beta}(f, h)$ are presented in the next theorem. In the following, we denote $K^*(u) = K(u)(u^2 - \sigma_K^2)$, $K * K^*$ the convolution between $K$ and $K^*$,

$$(K * K^*)(t) = \int K(t - u) K^*(u)\,du,$$

which is also $\int K(t + u) K^*(u)\,du$ by symmetry, and $K^{(2)} = K * K$, the self-convolution of $K$; likewise $K^{*(2)} = K^* * K^*$.

Also, we denote by $\delta_{\alpha\beta}$ the Kronecker index, which equals 1 or 0 depending on whether $\alpha = \beta$ or not.

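Theorem 4. Under assumptions (1)–(4), as $h \to 0$ and $n h^{d+4} \to \infty$, we have

$$\hat C_{\alpha\beta}(f, h) = C_{\alpha\beta}(f) + \{ B_{\alpha\beta}(f) + B_{\beta\alpha}(f) \} h^2 + \frac{B(g) V_{\alpha\beta}(K)}{n h^{d+4}} + \xi_{\alpha\beta} + O_p(h^4) + o_p\Big( \frac{1}{n \sqrt{h^{d+8}}} \Big)$$

in which

$$B_{\alpha\beta}(f) = \int f_{\alpha\alpha}(x) b_\beta(x) w(x) p(x)\,dx,$$

$$V_{\alpha\beta}(K) = \frac{4 \|K\|_2^{2(d-2)} \{ \delta_{\alpha\beta} \mu_4(K^2) \|K\|_2^2 + (1 - \delta_{\alpha\beta}) \mu_2^2(K^2) - 2 \sigma_K^2 \mu_2(K^2) \|K\|_2^2 + \sigma_K^4 \|K\|_2^4 \}}{\sigma_K^8 (\kappa - 1)^2},$$

and the random term $\xi_{\alpha\beta}$ satisfies

$$n \sqrt{h^{d+8}}\, \xi_{\alpha\beta} \to_D N\{0, \sigma_{\alpha\beta}^2(g, K)\}$$

where

$$\sigma_{\alpha\beta}^2(g, K) = \frac{32 \int g^4(x) w^2(x)\,dx}{\sigma_K^{16} (\kappa - 1)^4} \int F_{\alpha\beta}(t) F_{\alpha\beta}(t)\,dt$$

and where we define, for any $\alpha, \beta = 1, \ldots, d$,

$$F_{\alpha\beta}(t) = (1 - \delta_{\alpha\beta}) (K^* * K)(t_\alpha) (K^* * K)(t_\beta) \prod_{\gamma \ne \alpha, \beta} K^{(2)}(t_\gamma) + \delta_{\alpha\beta} K^{*(2)}(t_\alpha) \prod_{\gamma \ne \alpha} K^{(2)}(t_\gamma).$$

Note here that $\int g^4(x) w^2(x)\,dx > 0$ by assumption (2), and that therefore $\hat C_{\alpha\beta}(f, h)$ has a standard deviation of order $(n \sqrt{h^{d+8}})^{-1}$, which is smaller than one of the bias terms, $(n h^{d+4})^{-1}$.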

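The two bias terms point to a trade-off if both are positive or to cancellation if the $h^2$-term happens to be negative. Assuming that the $h^2$-term is never 0, we obtain the following results for estimating $C_{\alpha\beta}(f)$ optimally.

Corollary 1. Under the same assumptions as in theorem 4, the optimal bandwidth to estimate $C_{\alpha\beta}(f)$ by $\hat C_{\alpha\beta}(f, h)$ is

$$h_{C_{\alpha\beta}(f),\mathrm{opt}} = \begin{cases} \Big[ -\dfrac{B(g) V_{\alpha\beta}(K)}{\{ B_{\alpha\beta}(f) + B_{\beta\alpha}(f) \} n} \Big]^{1/(d+6)} & \text{if } B_{\alpha\beta}(f) + B_{\beta\alpha}(f) < 0, \\[2ex] \Big[ \dfrac{d+4}{2} \dfrac{B(g) V_{\alpha\beta}(K)}{\{ B_{\alpha\beta}(f) + B_{\beta\alpha}(f) \} n} \Big]^{1/(d+6)} & \text{if } B_{\alpha\beta}(f) + B_{\beta\alpha}(f) > 0, \end{cases} \tag{4.5}$$

which gives a standard error of order $\{ n \sqrt{h^{d+8}_{C_{\alpha\beta}(f),\mathrm{opt}}} \}^{-1} = O(n^{-(d+4)/2(d+6)})$ and a bias of order $h^4_{C_{\alpha\beta}(f),\mathrm{opt}} = O(n^{-4/(d+6)})$ when $B_{\alpha\beta}(f) + B_{\beta\alpha}(f) < 0$ and of order $h^2_{C_{\alpha\beta}(f),\mathrm{opt}} = O(n^{-2/(d+6)})$ when $B_{\alpha\beta}(f) + B_{\beta\alpha}(f) > 0$.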
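The two cases of formula (4.5) translate directly into a small helper. This is a sketch only: the function name `pilot_bandwidth` and its argument names are hypothetical, `B_sum` stands for $B_{\alpha\beta}(f) + B_{\beta\alpha}(f)$, and the numerical inputs in the example are made up for illustration.

```python
def pilot_bandwidth(B_g, V, B_sum, n, d):
    """Pilot bandwidth of (4.5); B_sum = B_ab(f) + B_ba(f) must be non-zero."""
    if B_sum == 0:
        raise ValueError("the h^2 bias coefficient is assumed to be non-zero")
    ratio = B_g * V / (abs(B_sum) * n)
    # Negative coefficient: the two bias terms cancel exactly.
    # Positive coefficient: their sum is minimized.
    factor = 1.0 if B_sum < 0 else (d + 4) / 2.0
    return (factor * ratio) ** (1.0 / (d + 6))

def leading_bias(h, B_g, V, B_sum, n, d):
    """Sum of the h^2 bias term and the 1/(n h^(d+4)) bias term."""
    return B_sum * h**2 + B_g * V / (n * h ** (d + 4))

# Illustrative (made-up) inputs.
h_neg = pilot_bandwidth(B_g=0.25, V=2.0, B_sum=-3.0, n=500, d=2)
h_pos = pilot_bandwidth(B_g=0.25, V=2.0, B_sum=3.0, n=500, d=2)
```

In the negative case the returned bandwidth makes `leading_bias` vanish; in the positive case it is the minimizer of `leading_bias` over `h`.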

We can now complete the definition of $\hat C_{\alpha\beta}(f)$, and therefore of $\hat{\mathbf C}(f)$. Estimate all the derivatives $f_{\alpha\alpha\beta\beta}(x)$ etc. by blocked quartic fits and obtain the estimates of $B_{\alpha\beta}(f)$ as described in Section 5. Then plug them into formula (4.5) together with $\hat B(g)$ in place of $B(g)$. Use the resulting bandwidth in the estimator $\hat C_{\alpha\beta}(f)$. Definition (3.3) and theorem 4 yield the following corollary.

Corollary 2. Under assumptions (1)–(4), for $n \to \infty$, the $\hat{\mathbf C}(f)$ defined by equation (3.3) has the consistency property

$$\hat{\mathbf C}(f) = \mathbf C(f) \{ I_d + O_p(n^{-2/(d+6)}) \}.$$

5. Implementation

In this section we describe how to estimate $\mathbf C(f)$ and $B(g)$ to obtain the ROT and PI bandwidths (3.1) and (A.2), and (3.2) and (A.3) respectively.

For the ROT estimation of $f_{\alpha\alpha}(X_i)$ and $f_{\beta\beta}(X_i)$, Fan and Gijbels (1995) suggested the use of a quartic Taylor expansion. Ruppert et al. (1995) proposed to separate the data into equal-sized blocks and to use a quartic Taylor expansion on each block in case the function is very wiggly. Since the number of parameters per block increases dramatically with the increase in dimension and since the wiggliness of the function depends on each variable differently, our flexible blocking scheme sorts the data one variable at a time. Let $N_\alpha$ denote the number of blocks in the $\alpha$th direction and collect them in $\mathbf N = (N_1, \ldots, N_d)$. Then $N = \prod_{\alpha=1}^{d} N_\alpha$ is the total number of blocks given the choices of $N_\alpha$, $\alpha = 1, 2, \ldots, d$. To choose the optimal blocks we follow Ruppert et al. (1995) and use Mallows's $C_p$,

$$C_p(\mathbf N) = \frac{\mathrm{RSS}(\mathbf N) \{ n - k(d) [n / 4k(d)] \}}{\min_{\mathbf N} \{ \mathrm{RSS}(\mathbf N) \}} - \{ n - 2 k(d) N \}$$

where $\mathrm{RSS}(\mathbf N)$ denotes the residual sum of squares based on the quartic fit with blocking $\mathbf N$,

$$k(d) = 1 + \sum_{i=1}^{4} \binom{d + i - 1}{i}$$

is the maximum number of parameters in one block and $[a]$ denotes the integer part of $a$.

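Denoting the block vector chosen with $C_p(\mathbf N)$ by $\mathbf N^*$, the corresponding total number of blocks by $N^*$, and by $\hat f_{\mathrm{ROT},\mathbf N^*}$, $\hat f_{\mathrm{ROT},\mathbf N^*,\alpha\alpha}$ the corresponding function estimators, we then estimate

$$B_{\mathrm{ROT}}(g) = \frac{1}{n - k(d) N^*} \sum_{i=1}^{n} \frac{\{ Y_i - \hat f_{\mathrm{ROT},\mathbf N^*}(X_i) \}^2 w(X_i)}{p_{\mathrm{ROT}}(X_i)}, \tag{5.1}$$

$$C_{\mathrm{ROT},\alpha\beta}(f) = \frac{1}{n} \sum_{i=1}^{n} \hat f_{\mathrm{ROT},\mathbf N^*,\alpha\alpha}(X_i)\, \hat f_{\mathrm{ROT},\mathbf N^*,\beta\beta}(X_i)\, w(X_i) \tag{5.2}$$

where $p_{\mathrm{ROT}}(x)$ is the uniform density on the range of the data. These estimates are necessarily inconsistent unless the true $f$ is a piecewise quartic function.

This problem is avoided by the PI estimation described in Section 4 and Appendix A. Now we estimate $B(g)$ by

$$\hat B(g) = \frac{1}{n} \sum_{i=1}^{n} \frac{w(X_i) \{ Y_i - \hat f_{\mathbf h_{\mathrm{ROT}}}(X_i) \}^2}{\tilde p_{h_S}(X_i)} \tag{5.3}$$

where $\mathbf h_{\mathrm{ROT}}$ is defined by equations (3.1), (5.1) and (5.2), $h_S$ is the $d$-dimensional normal reference bandwidth (Silverman (1986), pages 86–87) and $\tilde p_{h_S}(x)$ is a kernel density estimator.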

To estimate $\mathbf C(f)$ we use partial local cubic regression as described in Section 4. This is better done with a bandwidth that is slightly larger than the optimal pilot bandwidth $h_{C,\mathrm{opt}}$. The reason is that the bandwidth $h_{C,\mathrm{opt}}$ only minimizes the absolute value of the bias, whereas it completely ignores the variance in the estimation of $\mathbf C(f)$. Asymptotically this is justified since the standard deviation of $\hat{\mathbf C}(f, h)$ is of higher order than the bias. Using a bandwidth that is not equal to $h_{C,\mathrm{opt}}$ produces a larger bias in the estimate of $\mathbf C(f)$ than the minimum bias, but not significantly so if the bandwidth is larger. To see this effect we take the bandwidth $\rho h_{C,\mathrm{opt}}$ and plot in Fig. 1 the ratio of the bias against the minimum bias (achieved at $\rho = 1$) as a function $r(\rho)$ of $\rho \in [0.6, 2]$, for $d = 4$. Here, we assume that $B_{\alpha\beta} + B_{\beta\alpha} > 0$, and the ratio then is

$$r(\rho) = \frac{\{(d+4)/2\}^{2/(d+6)} \rho^2 + \{(d+4)/2\}^{-(d+4)/(d+6)} \rho^{-(d+4)}}{\{(d+4)/2\}^{2/(d+6)} + \{(d+4)/2\}^{-(d+4)/(d+6)}}.$$

The broken line has the height of $r(2)$. We can see that $r(\rho)$ is about 3 or less for $\rho > 1$, whereas there is a dramatic increase for $\rho < 1$. Also it is easy to verify that $r(2) \le 4$ for any $d$. Meanwhile, for $\rho = 2$, the ratio of decrease in the standard deviation is $2^{-(d+8)/2}$, which is $1/32$ if $d = 2$, and the decrease is greater for larger $d$. This decrease in standard deviation is drastic whereas the increase in bias is at most fourfold. We therefore use $2 h_{C,\mathrm{opt}}$, which should have very good finite sample performance.

Fig. 1. Ratio of the increase in bias when estimating $\mathbf C(f)$ with $\rho h_{C(f),\mathrm{opt}}$, for $d = 4$
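The bias ratio discussed above is simple to tabulate. The sketch below evaluates it for a bandwidth multiple `rho` of the optimal pilot bandwidth, under the same assumption of a positive $h^2$ coefficient; the names `bias_ratio` and `sd_ratio` are hypothetical.

```python
def bias_ratio(rho, d):
    """Ratio of the bias at rho * h_opt to the minimum bias (at rho = 1)."""
    s = (d + 4) / 2.0
    p = 2.0 / (d + 6)
    q = (d + 4.0) / (d + 6)
    numerator = s**p * rho**2 + s ** (-q) * rho ** (-(d + 4))
    denominator = s**p + s ** (-q)
    return numerator / denominator

def sd_ratio(rho, d):
    """Factor by which the standard deviation changes at rho * h_opt."""
    return rho ** (-(d + 8) / 2.0)

r_at_2 = bias_ratio(2.0, d=4)        # height of the broken line in Fig. 1
r_at_06 = bias_ratio(0.6, d=4)       # undersmoothing inflates the bias sharply
sd_gain = sd_ratio(2.0, d=2)         # equals 1/32 for d = 2
```

Evaluating `bias_ratio(2.0, d)` for a range of `d` gives a quick check that doubling the pilot bandwidth never inflates the bias more than fourfold, while `sd_ratio` shows how quickly the standard deviation shrinks.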

The estimation of the unknown functional $B_{\alpha\beta}$ is based on blocked quartic fits of $f$ if $d \le 3$. For larger $d$, the number of parameters can be so large that for medium sample sizes blocking or even global estimation becomes impossible. We therefore use a partial quartic fit for the estimation of each $B_{\alpha\beta}$, consisting of only those fourth-order derivatives $f_{\alpha\alpha\beta\beta}$, $\beta = 1, \ldots, d$, in equation (4.3) plus all lower order derivatives that form a subset of those. As an example, such a partial quartic polynomial centred at zero reads

$$b_0 + \sum_{\beta=1}^{d} b_\beta X_\beta + \sum_{\beta=1}^{d} b_{\alpha\beta} X_\alpha X_\beta + \sum_{\beta=1, \beta \ne \alpha}^{d} b_{\beta\beta} X_\beta^2 + \sum_{\beta=1}^{d} b_{\alpha\beta\beta} X_\alpha X_\beta^2 + \sum_{\beta=1, \beta \ne \alpha}^{d} b_{\alpha\alpha\beta} X_\alpha^2 X_\beta + \sum_{\beta=1}^{d} b_{\alpha\alpha\beta\beta} X_\alpha^2 X_\beta^2.$$

For this partial expansion the number of parameters $k(d) = 6d - 1$ grows linearly in $d$. It also includes all terms needed to estimate $f_{\alpha\alpha}$. For $d = 4$, the number of parameters drops to about a third of that of the complete expansion. The partial expansion allows not only more blocks but also lets the blocks focus more on the variable directions that have more curvature. This is a new feature that is not an issue in the univariate case.

For efficient partial quartic fits we propose to use at least $4k(d)$ observations in each block. This amounts to requiring $n \ge 4k(d) = 4(6d - 1)$. For example, if $d = 4$, partial blocking needs $n \ge 92$ observations. If $n < 4(6d - 1)$, we cannot recommend any of our methods with confidence as there are just not enough data to estimate $f$ accurately.
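The parameter counts quoted for the full and the partial quartic fits, and the resulting minimum sample size, can be reproduced as follows. The function names `k_full`, `k_partial` and `min_sample_size` are hypothetical labels for the quantities defined in this section.

```python
from math import comb

def k_full(d):
    """Parameters of a complete quartic expansion in d variables."""
    return 1 + sum(comb(d + i - 1, i) for i in range(1, 5))

def k_partial(d):
    """Parameters of the partial quartic fit, 6d - 1."""
    return 6 * d - 1

def min_sample_size(d):
    """At least 4 k(d) observations per block, with the partial fit."""
    return 4 * k_partial(d)

counts = {d: (k_full(d), k_partial(d), min_sample_size(d)) for d in range(1, 7)}
```

For four variables this gives 70 parameters for the complete fit against 23 for the partial one, and a minimum of 92 observations, in line with the figures stated in the text.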

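To use standard software for the numerical optimization problem in equations (3.1), (3.2) or (2.3), it may be convenient to reparameterize $v$ as $v = \exp(u)$, i.e. $v_1 = \exp(u_1), \ldots, v_d = \exp(u_d)$ where $u = (u_1, \ldots, u_d)^T \in R^d$, so there is no constraint on $u$. The gradient and Hessian matrices are then

$$\frac{\partial Q\{\exp(u), M, a\}}{\partial u} = \frac{1}{2} \begin{pmatrix} \exp(u_1) \sum_{\beta=1}^{d} M_{1\beta} \exp(u_\beta) \\ \vdots \\ \exp(u_d) \sum_{\beta=1}^{d} M_{d\beta} \exp(u_\beta) \end{pmatrix} - \frac{a}{2 \sqrt{\exp(u_1) \cdots \exp(u_d)}} \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix},$$

$$\frac{\partial^2 Q\{\exp(u), M, a\}}{\partial u\, \partial u^T} = \frac{1}{2} \mathrm{diag}\{\exp(u_1), \ldots, \exp(u_d)\}\, M\, \mathrm{diag}\{\exp(u_1), \ldots, \exp(u_d)\} + \frac{1}{2} \mathrm{diag}\Big\{ \exp(u_1) \sum_{\beta=1}^{d} M_{1\beta} \exp(u_\beta), \ldots, \exp(u_d) \sum_{\beta=1}^{d} M_{d\beta} \exp(u_\beta) \Big\} + \frac{a}{4 \sqrt{\exp(u_1) \cdots \exp(u_d)}} 1_{d \times d},$$

where $1_{d \times d}$ denotes the $d \times d$ matrix of 1s. These make it easy to find $u(M, a) = \ln\{v(M, a)\}$, the $u$-value that minimizes $Q\{\exp(u), M, a\}$. We then obtain the following substitute for equation (2.3):

$$\mathbf h_{\mathrm{opt}} = \Big\{ \frac{\|K\|_2^{2d} B(g)}{n \sigma_K^4} \Big\}^{1/(d+4)} \exp\Big[ \frac{u\{\mathbf C(f), 1\}}{2} \Big].$$

Similar substitutes are used to compute equations (3.1) and (3.2).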
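A minimal sketch of the unconstrained minimization in $u$ follows, using a damped Newton iteration with the gradient and Hessian obtained by differentiating $Q\{\exp(u), M, a\}$ directly. It is an illustration under assumptions, not the authors' program: the function names, the starting value $u = 0$, the step-halving safeguard and the example matrix are all choices made here.

```python
import numpy as np

def Q_u(u, M, a):
    v = np.exp(u)
    return 0.25 * v @ M @ v + a * np.exp(-0.5 * np.sum(u))

def grad_u(u, M, a):
    v = np.exp(u)
    return 0.5 * v * (M @ v) - 0.5 * a * np.exp(-0.5 * np.sum(u))

def hess_u(u, M, a):
    v = np.exp(u)
    d = len(u)
    return (0.5 * np.outer(v, v) * M               # diag(v) M diag(v) / 2
            + 0.5 * np.diag(v * (M @ v))           # diagonal chain-rule term
            + 0.25 * a * np.exp(-0.5 * np.sum(u)) * np.ones((d, d)))

def minimize_Q(M, a, tol=1e-10, max_iter=200):
    """Return v(M, a) = exp(u) minimizing Q, by Newton steps with halving."""
    M = np.asarray(M, dtype=float)
    u = np.zeros(M.shape[0])
    for _ in range(max_iter):
        g = grad_u(u, M, a)
        if np.linalg.norm(g) < tol:
            break
        step = np.linalg.solve(hess_u(u, M, a), g)
        if g @ step <= 0:                          # not a descent direction
            step = g
        t, q0 = 1.0, Q_u(u, M, a)
        while Q_u(u - t * step, M, a) > q0 + 1e-12 * abs(q0) and t > 1e-10:
            t *= 0.5
        u = u - t * step
    return np.exp(u)

M_example = np.array([[2.0, 0.5], [0.5, 1.0]])     # made-up matrix
v_small = minimize_Q(M_example, a=0.1)
v_unit = minimize_Q(M_example, a=1.0)
v_1d = minimize_Q(np.array([[3.0]]), a=1.0)        # closed form: 3 ** (-2/5)
```

For $d = 1$ the minimizer has the closed form $v = M_{11}^{-2/5}$ when $a = 1$, and for general $d$ the solutions for different $a$ are related by the rescaling (2.2); both properties give quick checks of an implementation.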

6. Simulation results

In this section we investigate the finite sample performance of the ROT bandwidths $\mathbf h_{\mathrm{ROT}}$ given by equation (3.1) and $h_{\mathrm{ROT}}$ given by equation (A.2), and of the PI bandwidths $\mathbf h_{\mathrm{PI}}$ given by equation (3.2) and $h_{\mathrm{PI}}$ given by equation (A.3), for the local linear estimation of $f(x)$. These results are compared with those obtained by using the asymptotic optimal bandwidths $\mathbf h_{\mathrm{opt}}$ as in equation (2.3) and $h_{\mathrm{opt}}$ as in equation (A.1), which we can compute since $f(x)$, $g(x)$ and $p(x)$ are explicitly given. For each pseudodata set generated, we compute the ISE

$$\frac{1}{n} \sum_{i=1}^{s} \{ \hat f_{\mathbf h}(U_i) - f(U_i) \}^2$$

for the diagonal bandwidths $\mathbf h = \mathbf h_{\mathrm{opt}}, \mathbf h_{\mathrm{PI}}, \mathbf h_{\mathrm{ROT}}$ and scalar bandwidths $h = h_{\mathrm{opt}}, h_{\mathrm{PI}}, h_{\mathrm{ROT}}$, where the $U_i$ are taken from an equally spaced grid of roughly 2500 points. In total we consider six models of varying complexity and in all cases 100 replications are conducted. For all the models the random design matrix $X$ is an independent sample from the uniform distribution on $[0, 1]^d$ with sample size $n$, and for the estimation of $\mathbf C(f)$ we screen off observations $X_i$ that are outside $[a, 1 - a]^d$ where $a = (1 - 0.9^{1/d})/2$. The sample size $n$ is 100, 250 or 500 except for model 6 where $n = 250$ and $n = 500$. The models are

$$Y = f(X) + \varepsilon, \qquad \varepsilon \sim N(0, 0.25),$$

where the regression function is

$f(x) = x_1^2 + x_2^4$ for model 1,

$f(x) = \sin\{\pi (x_1 + x_2)\}$ for model 2,

$f(x) = \sin(\pi x_1) \sin(2 \pi x_2)$ for model 3,

$f(x) = \{ \sin(\pi x_1) + \sin(4 \pi x_2) \}/2$ for model 4,

$f(x) = \{ (x_1 - 0.5)^2 + x_2^2 \} \sin(2 \pi x_3)$ for model 5 and

$f(x) = \sin\{ \pi (x_1 + 0.5 x_2 + 2 x_3 + 0.5 x_4) \}$ for model 6.

Table 1. Model 1: mean of the ISE for various bandwidth selectors

                                   Means for the following sample sizes:
Bandwidth   Kind                   100       250       500
Diagonal    Asymptotic optimal     0.0326    0.0165    0.00893
            PI                     0.0409    0.0183    0.00931
            ROT                    0.0656    0.0233    0.0112
Scalar      Asymptotic optimal     0.0338    0.0174    0.00934
            PI                     0.0401    0.0180    0.00928
            ROT                    0.0734    0.0244    0.0113

Table 2. Model 2: mean of the ISE for various bandwidth selectors

                                   Means for the following sample sizes:
Bandwidth   Kind                   100       250       500
Diagonal    Asymptotic optimal     0.0505    0.0246    0.0140
            PI                     0.0523    0.0248    0.0139
            ROT                    0.0752    0.0301    0.0168
Scalar      Asymptotic optimal     0.0505    0.0246    0.0140
            PI                     0.0503    0.0240    0.0136
            ROT                    0.0824    0.0314    0.0175

Table 3. Model 3: mean of the ISE for various bandwidth selectors

                                   Means for the following sample sizes:
Bandwidth   Kind                   100       250       500
Diagonal    Asymptotic optimal     0.0704    0.0327    0.0186
            PI                     0.0711    0.0334    0.0185
            ROT                    0.114     0.0469    0.0242
Scalar      Asymptotic optimal     0.0760    0.0361    0.0208
            PI                     0.0747    0.0354    0.0202
            ROT                    0.137     0.0560    0.0272

Table 4. Model 4: mean of the ISE for various bandwidth selectors

                                   Means for the following sample sizes:
Bandwidth   Kind                   100       250       500
Diagonal    Asymptotic optimal     0.0765    0.0344    0.0198
            PI                     0.0873    0.0378    0.0219
            ROT                    0.230     0.0525    0.0268
Scalar      Asymptotic optimal     0.147     0.0604    0.0349
            PI                     0.105     0.0536    0.0337
            ROT                    0.282     0.0823    0.0431

Table 5. Model 5: mean of the ISE for various bandwidth selectors

                                   Means for the following sample sizes:
Bandwidth   Kind                   100       250       500
Diagonal    Asymptotic optimal     0.0662    0.0324    0.0193
            PI                     0.142     0.0573    0.0297
            ROT                    0.306     0.100     0.0466
Scalar      Asymptotic optimal     0.150     0.0716    0.0425
            PI                     0.145     0.0692    0.0410
            ROT                    0.338     0.116     0.0606

Table 6. Model 6: mean of the ISE for various bandwidth selectors

                                   Means for the following sample sizes:
Bandwidth   Kind                   250       500
Diagonal    Asymptotic optimal     0.459     0.291
            PI                     0.363     0.225
            ROT                    0.641     0.502
Scalar      Asymptotic optimal     0.791     0.502
            PI                     0.447     0.308
            ROT                    0.793     0.747

Table 1 reports in its first three rows the mean of the ISE based on $\mathbf h_{\mathrm{opt}}$, $\mathbf h_{\mathrm{PI}}$ and $\mathbf h_{\mathrm{ROT}}$ for each sample size of model 1. The last three rows present the corresponding results for the scalar bandwidths $h_{\mathrm{opt}}$, $h_{\mathrm{PI}}$ and $h_{\mathrm{ROT}}$. Similarly, Tables 2–6 display the results for models 2–6. In addition, Fig. 2 presents plots of the densities of the ISE based on $\mathbf h_{\mathrm{opt}}$ (full curve), $\mathbf h_{\mathrm{PI}}$ (short broken curve) and $\mathbf h_{\mathrm{ROT}}$ (long broken curve) for all models, where the sample size is 500.

Inspecting Tables 1–6 and Fig. 2, we find that the PI bandwidth is always superior to the ROT bandwidth as it delivers function estimates that have smaller ISEs. Furthermore, the mean of the ISE declines with sample size as expected.

If we use $C(f)$ to measure complexity, the two-dimensional models 1–4 are of increasing complexity with $C(f) = 48.8, 194.8, 608.8, 3129.3$ respectively. Since the ROT bandwidth is based on piecewise quartic fits, we might expect the ROT function estimates to perform worse than the PI estimates for complex models that are far from being piecewise quartic. This is confirmed by Tables 1–4 and the ISE densities in Figs 2(a)–2(d). Fig. 3 shows the true function of model 3 and its estimates for one replication based on $\mathbf h_{\mathrm{PI}}$ and $\mathbf h_{\mathrm{ROT}}$, $n = 100$, or $\mathbf h_{\mathrm{opt}}$, $\mathbf h_{\mathrm{PI}}$ and $\mathbf h_{\mathrm{ROT}}$, $n = 500$.

Fig. 2. Density of the ISE for (a) model 1, (b) model 2, (c) model 3, (d) model 4, (e) model 5 and (f) model 6 (full curve, $\mathbf h_{\mathrm{opt}}$; short broken curve, $\mathbf h_{\mathrm{PI}}$; long broken curve, $\mathbf h_{\mathrm{ROT}}$; $n = 500$)

Fig. 3. Function $f(\cdot)$ of model 3: (a) true function; (b) estimate with $\mathbf h_{\mathrm{PI}}$, $n = 100$; (c) estimate with $\mathbf h_{\mathrm{ROT}}$, $n = 100$; (d) estimate with $\mathbf h_{\mathrm{opt}}$, $n = 500$; (e) estimate with $\mathbf h_{\mathrm{PI}}$, $n = 500$; (f) estimate with $\mathbf h_{\mathrm{ROT}}$, $n = 500$
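The complexity values quoted for models 1–4 can be reproduced numerically. The sketch below rests on assumptions: it takes $C(f) = \sum_{\alpha,\beta} C_{\alpha\beta}(f) = \int (f_{11} + f_{22})^2 p\,w\,dx$ with $p$ uniform and $w = 1$ on the whole unit square, and it uses regression functions with factors of $\pi$ inside the sines (for example $\sin\{\pi(x_1 + x_2)\}$ for model 2), a reading under which the four quoted values are matched.

```python
import numpy as np

m = 400
g = (np.arange(m) + 0.5) / m                   # midpoint rule on [0, 1]
x1, x2 = np.meshgrid(g, g, indexing="ij")
pi = np.pi

# f_11 + f_22 for each model, differentiated by hand.
laplacian = {
    1: 2.0 + 12.0 * x2**2,                                   # x1^2 + x2^4
    2: -2.0 * pi**2 * np.sin(pi * (x1 + x2)),                # sin{pi(x1 + x2)}
    3: -5.0 * pi**2 * np.sin(pi * x1) * np.sin(2 * pi * x2),
    4: -(pi**2 * np.sin(pi * x1)
         + 16.0 * pi**2 * np.sin(4 * pi * x2)) / 2.0,
}

# The mean over the midpoint grid approximates the integral over the unit square.
C = {model: float(np.mean(lap**2)) for model, lap in laplacian.items()}
```

The resulting values grow by roughly two orders of magnitude from model 1 to model 4, which is the ordering of complexity used in the discussion.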

For higher dimension, the advantage of PI over ROT estimates becomes more significant. For the three-dimensional model 5 and four-dimensional model 6, Tables 5 and 6 show a reduction in MISE of up to 60%. The ISE densities are plotted in Figs 2(e) and 2(f).

For models 1 and 2, there is practically no difference between the diagonal and scalar bandwidth, whereas for model 3 the diagonal has only a slight advantage. This is due to the form of $f$, which does not require different amounts of smoothing in each direction. For models 4–6, the improvement from the diagonal bandwidth becomes significant as the $f$-function requires very different amounts of smoothing for various directions. Overall, we clearly prefer the diagonal to the scalar bandwidth.

Although the function estimate is the ultimate goal, it is also informative to analyse how well the PI and ROT bandwidths approximate the asymptotically optimal bandwidth $\mathbf h_{\mathrm{opt}}$. When comparing the densities of ratios, we find that $\mathbf h_{\mathrm{PI}}/\mathbf h_{\mathrm{opt}}$ is always to the right of $\mathbf h_{\mathrm{ROT}}/\mathbf h_{\mathrm{opt}}$. The ratio for the PI bandwidth is always closer to 1 except for a few cases where the ROT ratio is also close to 1. The PI ratio $\mathbf h_{\mathrm{PI}}/\mathbf h_{\mathrm{opt}}$ also approaches 1 faster than the ROT ratio $\mathbf h_{\mathrm{ROT}}/\mathbf h_{\mathrm{opt}}$ as the sample size $n$ increases. All these conclusions can be drawn from inspecting the density plots. Fig. 4 presents the density plots for the complicated model 3 based on 100 (Figs 4(a)–4(c)) and 500 (Figs 4(d)–4(f)) observations. The densities of the first and second elements of $\mathbf h_{\mathrm{PI}}/\mathbf h_{\mathrm{opt}}$ and $\mathbf h_{\mathrm{ROT}}/\mathbf h_{\mathrm{opt}}$ are depicted in Figs 4(a), 4(b), 4(d) and 4(e). Figs 4(c) and 4(f) show the densities of the scalar ratios $h_{\mathrm{PI}}/h_{\mathrm{opt}}$ and $h_{\mathrm{ROT}}/h_{\mathrm{opt}}$. Increasing complexity or dimension makes the bandwidth estimation more difficult.

Fig. 4. Densities of bandwidth ratios for model 3 (short broken curve, PI; long broken curve, ROT): (a) first elements of $\mathbf h_{\mathrm{PI}}/\mathbf h_{\mathrm{opt}}$ and $\mathbf h_{\mathrm{ROT}}/\mathbf h_{\mathrm{opt}}$, $n = 100$; (b) second elements of $\mathbf h_{\mathrm{PI}}/\mathbf h_{\mathrm{opt}}$ and $\mathbf h_{\mathrm{ROT}}/\mathbf h_{\mathrm{opt}}$, $n = 100$; (c) $h_{\mathrm{PI}}/h_{\mathrm{opt}}$ and $h_{\mathrm{ROT}}/h_{\mathrm{opt}}$, $n = 100$; (d) first elements of $\mathbf h_{\mathrm{PI}}/\mathbf h_{\mathrm{opt}}$ and $\mathbf h_{\mathrm{ROT}}/\mathbf h_{\mathrm{opt}}$, $n = 500$; (e) second elements of $\mathbf h_{\mathrm{PI}}/\mathbf h_{\mathrm{opt}}$ and $\mathbf h_{\mathrm{ROT}}/\mathbf h_{\mathrm{opt}}$, $n = 500$; (f) $h_{\mathrm{PI}}/h_{\mathrm{opt}}$ and $h_{\mathrm{ROT}}/h_{\mathrm{opt}}$, $n = 500$

7. Conclusion

We have shown the existence of both a diagonal and a scalar optimal bandwidth for multivariate regression and have proposed both an ROT and a PI bandwidth selector. The ROT method is simple to use but may perform poorly because of its inconsistency. The PI method is achieved by using partial local cubic regression to estimate the unknown functionals of second derivatives. The partial local cubic expansion does not need many terms and has simple bias and variance formulae. The pilot bandwidths for the functional estimation are determined by blocked partial quartic fits. In Monte Carlo simulations we investigated the performance of these automatic bandwidth selectors for models of up to four dimensions and various sample sizes. The diagonal PI method is found to be the best in terms of the ISE, although the ROT method can be useful if the model is not too complicated. Furthermore, a bandwidth vector is always preferable to a scalar bandwidth. These conclusions are observed with a sample size as small as 100.

The results of this paper can be generalized to autoregressive time series under conditions that lead to geometric mixing; see, for instance, Härdle et al. (1998) for such conditions. The scalar plug-in bandwidth was used to obtain a data-driven asymptotic final prediction error for the non-linear time series lag selection developed by Tschernig and Yang (1999).

Acknowledgements

The authors thank Rong Chen and Michael Neumann for their many helpful discussions. The very insightful comments from the referees and the Joint Editor are also gratefully acknowledged. These comments stimulated a substantial extension of the method. This work was supported by Sonderforschungsbereich 373 'Quantifikation und Simulation Ökonomischer Prozesse' of the Deutsche Forschungsgemeinschaft, at Humboldt-Universität zu Berlin.

Appendix A

We demonstrate that a scalar bandwidth $h$ can be used instead of a bandwidth vector $\mathbf h = (h_1, \ldots, h_d)$ for estimating $f(x)$. Estimation based on a single bandwidth $h$ achieves the same convergence rate as multiple bandwidths but may be less efficient when a different amount of smoothing is appropriate in each variable direction. However, a single bandwidth is more easily computed.

When using a single bandwidth $h$ to estimate the function $f(x)$, the AMISE is the following function of $h$:

$$\mathrm{AMISE}(h) = \frac{\sigma_K^4}{4} C(f) h^4 + \|K\|_2^{2d} B(g) \frac{1}{n h^d}$$

where $C(f) = \sum_{\alpha,\beta=1}^{d} C_{\alpha\beta}(f)$. Setting the derivative with respect to $h$ to 0, the scalar bandwidth $h$ that minimizes $\mathrm{AMISE}(h)$ is

$$h_{\mathrm{opt}} = \Big\{ \frac{d \|K\|_2^{2d} B(g)}{n \sigma_K^4 C(f)} \Big\}^{1/(d+4)}. \tag{A.1}$$

Denote $\hat C(f, h) = \sum_{\alpha,\beta=1}^{d} \hat C_{\alpha\beta}(f, h)$; then we have the following theorem.

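Theorem 5. Under assumptions (1)–(4) in Appendix B, as $h \to 0$ and $n h^{d+4} \to \infty$, we have

$$\hat C(f, h) = C(f) + 2 h^2 \sum_{\alpha,\beta=1}^{d} B_{\alpha\beta}(f) + \frac{B(g) \sum_{\alpha,\beta=1}^{d} V_{\alpha\beta}(K)}{n h^{d+4}} + \xi + O_p(h^4) + o_p\Big( \frac{1}{n \sqrt{h^{d+8}}} \Big)$$

in which the random term $\xi$ satisfies

$$n \sqrt{h^{d+8}}\, \xi \to_D N\{0, \sigma^2(g, K)\}$$

where

$$\sigma^2(g, K) = \frac{32 \int g^4(x) w^2(x)\,dx}{\sigma_K^{16} (\kappa - 1)^4} \int \Big\{ \sum_{\alpha,\beta=1}^{d} F_{\alpha\beta}(t) \Big\}^2 dt.$$

The optimal bandwidth to estimate $C(f)$ by $\hat C(f, h)$ is

$$h_{C(f),\mathrm{opt}} = \begin{cases} \Big\{ -\dfrac{B(g) \sum_{\alpha,\beta=1}^{d} V_{\alpha\beta}(K)}{2 \sum_{\alpha,\beta=1}^{d} B_{\alpha\beta}(f)\, n} \Big\}^{1/(d+6)} & \text{if } \sum_{\alpha,\beta=1}^{d} B_{\alpha\beta}(f) < 0, \\[2ex] \Big\{ \dfrac{(d+4) B(g) \sum_{\alpha,\beta=1}^{d} V_{\alpha\beta}(K)}{4 \sum_{\alpha,\beta=1}^{d} B_{\alpha\beta}(f)\, n} \Big\}^{1/(d+6)} & \text{if } \sum_{\alpha,\beta=1}^{d} B_{\alpha\beta}(f) > 0. \end{cases}$$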

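By using the formulae in theorem 5, we can obtain both the ROT and the PI bandwidth

$$h_{\mathrm{ROT}} = \Big\{ \frac{d \|K\|_2^{2d} B_{\mathrm{ROT}}(g)}{n \sigma_K^4 \sum_{\alpha,\beta=1}^{d} C_{\mathrm{ROT},\alpha\beta}(f)} \Big\}^{1/(d+4)}, \tag{A.2}$$

$$h_{\mathrm{PI}} = \Big\{ \frac{d \|K\|_2^{2d} \hat B(g)}{n \sigma_K^4 \hat C(f)} \Big\}^{1/(d+4)} \tag{A.3}$$

in which $B_{\mathrm{ROT}}(g)$ and $C_{\mathrm{ROT},\alpha\beta}(f)$ are the same as used in equation (3.1). It is easy to verify that $h_{\mathrm{PI}} = h_{\mathrm{opt}} \{ 1 + O_p(n^{-2/(d+6)}) \}$.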
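The scalar bandwidth can be checked against the scalar AMISE it is meant to minimize. In the sketch below, the closed form comes from setting the derivative of $\mathrm{AMISE}(h) = (\sigma_K^4/4) C(f) h^4 + \|K\|_2^{2d} B(g)/(n h^d)$ to zero; the function names are hypothetical and the numerical inputs (Epanechnikov constants $\sigma_K^2 = 0.2$, $\|K\|_2^2 = 0.6$, and made-up values for $C(f)$, $B(g)$ and $n$) are illustrative assumptions only.

```python
def amise_scalar(h, C, B, n, d, sigma2_K, norm2_K):
    """Scalar-bandwidth AMISE; sigma2_K = sigma_K^2 and norm2_K = ||K||_2^2."""
    return 0.25 * sigma2_K**2 * C * h**4 + norm2_K**d * B / (n * h**d)

def h_opt_scalar(C, B, n, d, sigma2_K, norm2_K):
    """Stationary point of amise_scalar with respect to h."""
    return (d * norm2_K**d * B / (n * sigma2_K**2 * C)) ** (1.0 / (d + 4))

# Illustrative inputs: plug-in values of C and B would replace these.
args = dict(C=608.8, B=0.25, n=500, d=2, sigma2_K=0.2, norm2_K=0.6)
h_star = h_opt_scalar(**args)
amise_star = amise_scalar(h_star, **args)
```

Replacing `C` and `B` by their ROT or PI estimates gives the corresponding data-driven scalar bandwidths; comparing `amise_star` with the AMISE at nearby bandwidths is a cheap check on any implementation.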

Appendix B

We denote by $S$ the support of $w(x)$, and all integrals in the variable $x$ are over $S$. The following assumptions are needed to prove theorems 1, 3 and 4.

(a) The density $p(x)$ of $X$ exists, is continuously differentiable up to order 2 and is bounded below away from 0 on $S$ (assumption 1).

(b) The function $f(x)$ is continuously differentiable on $S$ up to order 4 whereas $g(x)$ is continuous on $S$ and positive on an open subset of $S$ (assumption 2).

(c) The bandwidth $h = h_n$ is a positive number depending on $n$ such that $h \to 0$ and $nh^{d+4} \to \infty$ (assumption 3).


Denote by $M_+(d)$ the set of all non-negative definite matrices $M$ that are positive definite for non-negative vectors, i.e.

$$M_+(d) = \left\{ M : M \text{ non-negative definite and } \inf_{h \in S_+^{d-1}} (h^T M h) = c(M) > 0 \right\} \qquad (B.1)$$

where $S_+^{d-1} = S^{d-1} \cap R_+^d$ denotes the portion of the $(d-1)$-dimensional unit sphere $S^{d-1}$ with non-negative co-ordinates. For the diagonal bandwidth to exist, we assume that the matrix $C(f) \in M_+(d)$ (assumption 4).

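Assumption 4 is not trivial. Take the function $f(x_1, x_2) = \exp(x_1) \sin(\pi x_2)$; together with $p(x)$ equal to the uniform density on $[0,1]^2$ and $w(x) = p(x)$, we have

$$C(f) = \int_{[0,1]^2} \exp(2x_1) \sin^2(\pi x_2)\,dx \begin{pmatrix} 1 & -\pi^2 \\ -\pi^2 & \pi^4 \end{pmatrix} = \frac{1}{4}\{\exp(2) - 1\} \begin{pmatrix} 1 & -\pi^2 \\ -\pi^2 & \pi^4 \end{pmatrix}$$

and therefore $(\pi^2, 1)\, C(f)\, (\pi^2, 1)^T = 0$ even though $(\pi^2, 1)^T \in R_+^2$.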
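The counterexample can be reproduced numerically. The sketch below assumes that $C_{\alpha\beta}(f) = \int f_{\alpha\alpha}(x) f_{\beta\beta}(x) w(x)\,dx$ with $w = p = 1$ on $[0,1]^2$, which is the form suggested by the display above for $f(x_1,x_2) = \exp(x_1)\sin(\pi x_2)$; the quadrature rule and the variable names are illustrative choices, not part of the paper.

```python
import numpy as np


# Closed-form pure second derivatives of f(x1, x2) = exp(x1) * sin(pi * x2).
def f11(x1, x2):
    return np.exp(x1) * np.sin(np.pi * x2)


def f22(x1, x2):
    return -np.pi**2 * np.exp(x1) * np.sin(np.pi * x2)


# Gauss-Legendre nodes and weights mapped from [-1, 1] to [0, 1].
nodes, weights = np.polynomial.legendre.leggauss(40)
x = 0.5 * (nodes + 1.0)
w = 0.5 * weights
X1, X2 = np.meshgrid(x, x, indexing="ij")
W = np.outer(w, w)

# C[a, b] approximates the integral of f_aa * f_bb over the unit square.
derivs = [f11(X1, X2), f22(X1, X2)]
C = np.array([[np.sum(W * da * db) for db in derivs] for da in derivs])

v = np.array([np.pi**2, 1.0])  # a vector with strictly positive coordinates
print("C(f) =\n", C)
print("quadratic form at (pi^2, 1):", v @ C @ v)
```

The quadratic form vanishes up to rounding error, so this $C(f)$ is non-negative definite but does not belong to $M_+(2)$.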

For the proof of theorem 1, we also need lemmas 1 and 2 and theorem 6.

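Lemma 1.

$$\frac{\partial Q(v,M,a)}{\partial v} = \frac{1}{2} \begin{pmatrix} \sum_{\beta=1}^{d} M_{1\beta} v_\beta \\ \vdots \\ \sum_{\beta=1}^{d} M_{d\beta} v_\beta \end{pmatrix} - \frac{a}{2\sqrt{v_1 \cdots v_d}} \begin{pmatrix} v_1^{-1} \\ \vdots \\ v_d^{-1} \end{pmatrix}, \qquad (B.2)$$

$$\frac{\partial^2 Q(v,M,a)}{\partial v\, \partial v^T} = \frac{1}{2} M + \frac{a}{4\sqrt{v_1 \cdots v_d}} \begin{pmatrix} 3v_1^{-2} & v_1^{-1}v_2^{-1} & \cdots & v_1^{-1}v_d^{-1} \\ v_2^{-1}v_1^{-1} & 3v_2^{-2} & \cdots & v_2^{-1}v_d^{-1} \\ \vdots & \vdots & \ddots & \vdots \\ v_d^{-1}v_1^{-1} & v_d^{-1}v_2^{-1} & \cdots & 3v_d^{-2} \end{pmatrix}. \qquad (B.3)$$

Proof. Direct matrix calculation plus definition (2.1) yield the results.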
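The derivative formulae of lemma 1 can be checked against finite differences. Definition (2.1) is not reproduced in this appendix, so the sketch below assumes $Q(v,M,a) = \frac{1}{4} v^T M v + a / \sqrt{v_1 \cdots v_d}$ with symmetric $M$, which is the form obtained by integrating (B.2); the helper `central_diff` and the test values are illustrative.

```python
import numpy as np


def Q(v, M, a):
    """Assumed form of Q: (1/4) v'Mv + a / sqrt(v_1 ... v_d)."""
    return 0.25 * v @ M @ v + a / np.sqrt(np.prod(v))


def grad_Q(v, M, a):
    """Gradient as in (B.2), for symmetric M."""
    return 0.5 * M @ v - a / (2.0 * np.sqrt(np.prod(v))) / v


def hess_Q(v, M, a):
    """Hessian as in (B.3): diagonal 3 v_a^-2, off-diagonal v_a^-1 v_b^-1."""
    inv = 1.0 / v
    R = np.outer(inv, inv) + 2.0 * np.diag(inv**2)
    return 0.5 * M + a / (4.0 * np.sqrt(np.prod(v))) * R


def central_diff(fun, v, eps=1e-6):
    """Central-difference derivative of fun at v, one column per coordinate."""
    cols = []
    for k in range(v.size):
        e = np.zeros_like(v)
        e[k] = eps
        cols.append((fun(v + e) - fun(v - e)) / (2.0 * eps))
    return np.array(cols).T


M = np.array([[2.0, 0.5], [0.5, 1.0]])  # arbitrary symmetric test matrix
a = 0.3
v = np.array([0.7, 1.3])

num_grad = central_diff(lambda u: Q(u, M, a), v)
num_hess = central_diff(lambda u: grad_Q(u, M, a), v)
print(np.max(np.abs(num_grad - grad_Q(v, M, a))))
print(np.max(np.abs(num_hess - hess_Q(v, M, a))))
```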

Lemma 2. For any fixed $M \in M_+(d)$ and $a > 0$, $Q(v,M,a)$ is a strictly convex function of $v \in R_+^d$ and therefore can be minimized by at most one vector $v$ in $R_+^d$.

Proof. Both terms in equation (B.3) are non-negative definite, whereas the second term is positive definite, so the Hessian of $Q(v,M,a)$ with respect to $v$ is positive definite.

Theorem 6. For any fixed $M \in M_+(d)$ and $a > 0$, $Q(v,M,a)$ has a unique minimum vector $v_{\mathrm{opt}} = v(M,a)$, which is infinitely continuously differentiable in both $M$ and $a$.

Proof. $M \in M_+(d)$ implies that

$$Q(v,M,a) \geq \frac{1}{4} \sum_{\alpha,\beta=1}^{d} M_{\alpha\beta} v_\alpha v_\beta \geq \frac{1}{4} c(M) \sum_{\alpha=1}^{d} v_\alpha^2 = \frac{1}{4} c(M) \|v\|^2$$

for all $v \in R_+^d$ and Euclidean norm $\|v\| = (\sum_{\alpha=1}^{d} v_\alpha^2)^{1/2}$, so

$$\lim_{\|v\| \to \infty} \{Q(v,M,a)\} \geq \frac{1}{4} \lim_{\|v\| \to \infty} \{c(M) \|v\|^2\} = \infty$$

while it is also clear that we always have

$$\lim_{\|v\| \to 0} \{Q(v,M,a)\} \geq \frac{1}{4} \lim_{\|v\| \to 0} \frac{a}{\sqrt{v_1 \cdots v_d}} = \infty.$$

Thus the infimum of $Q(v,M,a)$ is reached at a $v$ whose norm is bounded away from $\infty$ and 0, and by the previous lemma this $v$ is unique. Equation (B.2) and the implicit function theorem show that $v(M,a)$ is an infinitely continuously differentiable function of $M$ and $a$.
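Because the Hessian (B.3) is positive definite on the positive orthant, the unique minimizer of theorem 6 can be located by a damped Newton iteration. The paper does not prescribe an algorithm, so the following is only one possible sketch: `minimize_Q` is a hypothetical name, Newton's method with backtracking is a swapped-in numerical technique, and $Q$ is again assumed to be $\frac{1}{4} v^T M v + a/\sqrt{v_1 \cdots v_d}$ with symmetric $M$.

```python
import numpy as np


def Q(v, M, a):
    """Assumed form of Q: (1/4) v'Mv + a / sqrt(v_1 ... v_d)."""
    return 0.25 * v @ M @ v + a / np.sqrt(np.prod(v))


def grad_Q(v, M, a):
    """Gradient as in (B.2)."""
    return 0.5 * M @ v - a / (2.0 * np.sqrt(np.prod(v))) / v


def hess_Q(v, M, a):
    """Hessian as in (B.3)."""
    inv = 1.0 / v
    return 0.5 * M + a / (4.0 * np.sqrt(np.prod(v))) * (np.outer(inv, inv) + 2.0 * np.diag(inv**2))


def minimize_Q(M, a, tol=1e-10, max_iter=100):
    """Damped Newton iteration started at v = (1, ..., 1)."""
    v = np.ones(M.shape[0])
    for _ in range(max_iter):
        g = grad_Q(v, M, a)
        if np.linalg.norm(g) < tol:
            break
        step = np.linalg.solve(hess_Q(v, M, a), g)
        t, accepted = 1.0, False
        while t > 1e-12:
            cand = v - t * step
            # Accept only positive candidates with sufficient decrease (Armijo rule).
            if np.all(cand > 0) and Q(cand, M, a) <= Q(v, M, a) - 1e-4 * t * (g @ step):
                accepted = True
                break
            t *= 0.5
        if not accepted:
            break  # no further decrease is resolvable in floating point
        v = cand
    return v


M = np.array([[2.0, 0.5], [0.5, 1.0]])  # arbitrary symmetric test matrix
a = 0.3
v_opt = minimize_Q(M, a)
print("minimizer:", v_opt, "gradient norm:", np.linalg.norm(grad_Q(v_opt, M, a)))
```

For $d = 1$ the first-order condition can be solved by hand, $v = (a/M_{11})^{2/5}$, which gives a simple check of the iteration.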

To prove theorem 3, we need the dispersion matrix of the partial local cubic estimator.

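Lemma 3. Let $Z_\alpha$ be as in equation (4.1) and

$$H = \mathrm{diag}(1,\; h^{-2} I_{2d-1},\; h^{-1} I_d,\; h^{-3} I_{2d-2},\; h^{-3}).$$

As $n \to \infty$,

$$H^T Z_\alpha^T W Z_\alpha H = \mathrm{diag}\left[ \begin{pmatrix} 1 & \sigma_K^2 1_{1 \times d} \\ \sigma_K^2 1_{d \times 1} & \sigma_K^4 \{(\nu - 1) I_d + 1_{d \times d}\} \end{pmatrix},\; \sigma_K^4 I_{d-1},\; D \right] \{p(x) I_{5d-1} + o_p(1)\} \qquad (B.4)$$

and

$$(H^T Z_\alpha^T W Z_\alpha H)^{-1} = \mathrm{diag}\left[ \begin{pmatrix} (\nu - 1 + d)(\nu - 1)^{-1} & -\sigma_K^{-2} (\nu - 1)^{-1} 1_{1 \times d} \\ -\sigma_K^{-2} (\nu - 1)^{-1} 1_{d \times 1} & \sigma_K^{-4} (\nu - 1)^{-1} I_d \end{pmatrix},\; \sigma_K^{-4} I_{d-1},\; D^{-1} \right] \left\{ \frac{I_{5d-1}}{p(x)} + o_p(1) \right\} \qquad (B.5)$$

uniformly in a compact neighbourhood of $x$. Here

$$D = \begin{pmatrix} \sigma_K^2 I_d & A & B & C \\ A^T & \sigma_K^6 I_{d-1} & 0_{(d-1) \times (d-1)} & 0_{(d-1) \times 1} \\ B^T & 0_{(d-1) \times (d-1)} & \sigma_K^6 \{(\nu - 1) I_{d-1} + 1_{(d-1) \times (d-1)}\} & \sigma_K^6 1_{(d-1) \times 1} \\ C^T & 0_{1 \times (d-1)} & \sigma_K^6 1_{1 \times (d-1)} & \mu_6(K) \end{pmatrix}$$

in which

$$A = \sigma_K^4 \begin{pmatrix} I_{\alpha-1} & 0_{(\alpha-1) \times (d-\alpha)} \\ 0_{1 \times (\alpha-1)} & 0_{1 \times (d-\alpha)} \\ 0_{(d-\alpha) \times (\alpha-1)} & I_{d-\alpha} \end{pmatrix}, \quad B = \sigma_K^4 \begin{pmatrix} 0_{(\alpha-1) \times (d-1)} \\ 1_{1 \times (d-1)} \\ 0_{(d-\alpha) \times (d-1)} \end{pmatrix}, \quad C = \sigma_K^4 \begin{pmatrix} 0_{(\alpha-1) \times 1} \\ 1 \\ 0_{(d-\alpha) \times 1} \end{pmatrix}.$$

Proof. Equation (B.4) follows by the usual array-type central limit theorem, $K$ being a symmetric compact product kernel; see, for example, Wand and Jones (1995). Equation (B.5) follows by direct verification. The exact inverse of $D$ can be obtained by using the tools in Lütkepohl (1996).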
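The "direct verification" of (B.5) can be illustrated numerically for the leading $(d+1) \times (d+1)$ block. The sketch below builds the block $\begin{pmatrix} 1 & \sigma_K^2 1_{1 \times d} \\ \sigma_K^2 1_{d \times 1} & \sigma_K^4\{(\nu-1)I_d + 1_{d \times d}\} \end{pmatrix}$ and its claimed inverse and multiplies them. The function names are hypothetical, and the symbol $\nu$ is interpreted here, as an assumption, as the ratio $\int u^4 K(u)\,du / \sigma_K^4$, which equals 3 for the Gaussian kernel and 15/7 for the Epanechnikov kernel.

```python
import numpy as np


def leading_block(d, sigma2, nu):
    """Leading (d+1) x (d+1) block of the limit in (B.4)."""
    one = np.ones((d, 1))
    return np.block([
        [np.array([[1.0]]), sigma2 * one.T],
        [sigma2 * one, sigma2**2 * ((nu - 1.0) * np.eye(d) + one @ one.T)],
    ])


def leading_block_inverse(d, sigma2, nu):
    """Claimed inverse of the leading block, as in (B.5)."""
    one = np.ones((d, 1))
    return np.block([
        [np.array([[(nu - 1.0 + d) / (nu - 1.0)]]), -one.T / (sigma2 * (nu - 1.0))],
        [-one / (sigma2 * (nu - 1.0)), np.eye(d) / (sigma2**2 * (nu - 1.0))],
    ])


# Kernel constants (sigma_K^2, nu): Gaussian and Epanechnikov, used as examples.
kernels = {"gaussian": (1.0, 3.0), "epanechnikov": (0.2, 15.0 / 7.0)}

for name, (sigma2, nu) in kernels.items():
    for d in (1, 2, 5):
        product = leading_block(d, sigma2, nu) @ leading_block_inverse(d, sigma2, nu)
        err = np.max(np.abs(product - np.eye(d + 1)))
        print(f"{name:13s} d={d}: max deviation from identity = {err:.2e}")
```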

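Proof of theorem 3. Given any $\alpha = 1, \ldots, d$ and denoting $Z_\alpha$ by $Z$, we first note that

$$e_\alpha^T H (H^T Z^T W Z H)^{-1} H^T Z^T W Z e_0 f(x) = e_\alpha^T (Z^T W Z)^{-1} Z^T W Z e_0 f(x) = 0,$$

$$e_\alpha^T H (H^T Z^T W Z H)^{-1} H^T Z^T W Z e_\alpha f_{\alpha\alpha}(x) = f_{\alpha\alpha}(x),$$

whereas for any $\alpha' = 1, \ldots, 5d-1$, $\alpha' \neq \alpha$,

$$e_\alpha^T H (H^T Z^T W Z H)^{-1} H^T Z^T W Z e_{\alpha'} = e_\alpha^T (Z^T W Z)^{-1} Z^T W Z e_{\alpha'} = 0.$$

Combining these with equation (B.5) of lemma 3, we have

$$\hat{f}_{\alpha\alpha}(x) - f_{\alpha\alpha}(x) = 2 e_\alpha^T H (H^T Z^T W Z H)^{-1} H^T Z^T W Y - f_{\alpha\alpha}(x) = T_1 + T_2 + T_3$$

where

$$T_1 = \frac{2 + o_p(1)}{\sigma_K^4 (\nu - 1) p(x) n h^2} \sum_{i=1}^{n} K_h(X_i - x) \left\{ \frac{(X_{i\alpha} - x_\alpha)^2}{h^2} - \sigma_K^2 \right\} \Bigg[ f(X_i) - f(x) - (X_i - x)^T \nabla f(x) - \frac{1}{2} (X_i - x)^T \nabla^2 f(x) (X_i - x) - \frac{1}{3!} f_{\alpha\alpha\alpha}(x) (X_{i\alpha} - x_\alpha)^3 - \sum_{\beta \neq \alpha} \frac{3}{3!} f_{\alpha\alpha\beta}(x) (X_{i\alpha} - x_\alpha)^2 (X_{i\beta} - x_\beta) - \sum_{\beta \neq \alpha} \frac{3}{3!} f_{\alpha\beta\beta}(x) (X_{i\alpha} - x_\alpha) (X_{i\beta} - x_\beta)^2 \Bigg],$$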
