Multivariate bandwidth selection for local linear regression

(1)

Multivariate bandwidth selection for local linear regression

Lijian Yang

Michigan State University, East Lansing, USA and Rolf Tschernig

Humboldt-Universitat, Berlin, Germany

[Received December 1997. Final revision January 1999]

Summary. The existence and properties of optimal bandwidths for multivariate local linear regression are established, using either a scalar bandwidth for all regressors or a diagonal bandwidth vector that has a different bandwidth for each regressor. Both involve functionals of the derivatives of the unknown multivariate regression function. Estimating these functionals is dif®cult primarily because they contain multivariate derivatives. In this paper, an estimator of the multivariate second derivative is obtained via local cubic regression with most cross-terms left out. This estimator has the optimal rate of convergence but is simpler and uses much less computing time than the full local estimator. Using this as a pilot estimator, we obtain plug-in formulae for the optimal bandwidth, both scalar and diagonal, for multivariate local linear regression. As a simpler alternative, we also provide rule-of-thumb bandwidth selectors. All these bandwidths have satisfactory performance in our simulation study.

Keywords: Asymptotic optimality; Blocked quartic ®t; Functional estimation; Partial local regression; Plug-in bandwidth; Rule-of-thumb bandwidth; Second derivatives

1. Introduction

Nonparametric estimation in general requires littlea prioriknowledge on the functions to be estimated. The estimation results, however, depend crucially on the choice of bandwidth.

Whereas choosing too large a bandwidth may introduce a large bias, selecting too small a bandwidth may cause large estimation variance. An asymptotically optimal bandwidth usually exists and can be obtained through a bias±variance trade-o. Such an optimal bandwidth in general involves functionals of the unknown underlying functions, and the selection of the optimal bandwidth has always been a challenge. Bandwidth selection methods with good theoretical properties and practical performance exist for univariate density estimation, e.g.

Joneset al. (1996a, b), Cheng (1997) and Grund and Polzehl (1998), and univariate local least squares regression, e.g. Fan and Gijbels (1995) and Ruppertet al. (1995). Wand and Jones (1993) provided an excellent analysis of bandwidth selection for bivariate density estimation, Sainet al. (1994) looked at multivariate cross-validation for density estimation bandwidth selection and Wand and Jones (1994) developed plug-in (PI) bandwidths for multivariate kernel density estimation. Herrmannet al. (1995) studied bandwidth selection for bivariate regression with ®xed regular design, but it is unclear whether their method works for random

Address for correspondence: Lijian Yang, Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA.

E-mail: [email protected]

&1999 Royal Statistical Society 1369±7412/99/61793

61,Part4,pp.793^815

(2)

designs and higher dimensions. Ruppert (1997) proposed the empirical bias bandwidth selector (EBBS) which could be applied in a multivariate setting. However, the theoretical properties of the EBBS are unknown and its practical performance has only been studied in the univariate setting. As pointed out by Ruppert (1997), the diculty in obtaining a reliable multivariate data-driven bandwidth is essentially due to the complexity of estimating higher order multivariate derivatives. To address this diculty, we estimate multivariate second- order derivatives by a local cubic ®t, where all unnecessary terms are left out. This both solves the theoretical problem and makes a practical implementation feasible. We are not aware of any other work that makes use of the desirable properties of a partial local polynomial ®t.

Another new feature is the use of a dierent bandwidth for each of the regressors, in contrast with the EBBS method which uses the same bandwidth for all regressors. We have established some general assumptions that ensure the existence of asymptotic optimal bandwidths. This is not trivial because, when using a bandwidth vector, there is no simple closed formula for the solution of the bias±variance trade-o, and the existence of the solution needs to be implicitly proven. This use of a `diagonal bandwidth' is a good compromise between

¯exibility and optimality on the one hand (which need more smoothing parameters, such as a bandwidth matrix) and interpretation and simplicity on the other hand (fewer parameters, such as one bandwidth for all explanatory variables). We have also studied the theoretical properties and practical performances of the latter option, which we call the `scalar optimal bandwidth'.

The computation of our PI bandwidths requires the estimation of functionals that include derivatives up to the fourth order. The idea is to estimate unknown functionals by blocked polynomial ®ts, as in HaÈrdle and Marron (1995) and Ruppert et al. (1995), but we had further to re®ne the idea. To do a quartic ®t of a function of four variables could require up to 70 parameters in each block, which not only makes it impossible to have a sucient number of blocks to adapt to local features of the function but also introduces much noise for the estimation within each block. Since the computation of one functional does not require all the derivatives included in a complete fourth-order Taylor expansion, we propose to use a partial quartic ®t in the blocks, with only 23 parameters for the same function. This advantage in number grows drastically if the number of variables becomes even more.

The PI bandwidths are necessarily computation intensive, so we also investigated rule-of- thumb (ROT) bandwidths as simpler alternatives.

The paper is organized as follows. In the next section, we show the existence of an asymptotically optimal bandwidth vector. Section 3 de®nes an ROT and a PI bandwidth selector for local linear regression, and asymptotic optimality is established for the PI bandwidth. In Section 4, we describe in detail how functionals of second derivatives are estimated via partial local cubic regression, while parallel results using a scalar bandwidth selector are summarized in Appendix A. Implementation of the methods proposed is discussed in Section 5. In Section 6 we describe the results of a rather comprehensive Monte Carlo study. Section 7 concludes, while all assumptions and proofs are contained in Appendix B.

The programs that are used in the analysis can be obtained from http://www.blackwellpublishers.co.uk/rss/

2. Background

To formulate the problem, consider a multivariate regression model YfX gX

(3)

where Y is a scalar dependent variable, X X₁, X₂, . . .,X_d is a vector of explanatory variables,is independent ofX,E 0 and var 1. LetXi,Yi,i1, 2, . . .,n, be an independent and identically distributed sample. Then the local linear estimator of fat a given pointx x₁,x₂, . . .,x_dis obtained by doing a ®rst-order Taylor expansion of the function fat pointxfor all the data pointsX_iand solving a least squares problem locally weighted by kernels: i.e. the estimator f^xis the ®rst elementcof the vector

fc,c_144dg that minimizes

Pⁿ

i1

Y_i c P^d

1cX_i x ₂

K_hX_i x

whereKis a symmetric, compactly supported, univariate probability kernel (so thatKis non- negative and

Ku 1) and

K_hX_i x Q^d

1

1 h K

X_i x h

:

Denoting by px the density of X, using a weight function wx, the weighted mean integrated squared error (MISE)

E

ff^x fxg²wxpxdx

is a function of the bandwidth vectorh h₁, . . ., h_d^T, and thehthat minimizes this error is called the optimal bandwidth vector. An asymptotic formula of the MISE in this setting is given by

AMISEh ⁴_K 4

P^d

,1Cfh²h² kKk^2d₂ Bg 1 nh₁. . .h_d in which²_K

u²Kudu,kKk²₂

K²udu and Bg

g²xwxdx, Cf

fxfxpxwxdx, ,1, . . .,d,

wherefxdenotes the second derivative offxwith respect to theth variablex. For the general formula of AMISE using a bandwidth matrix, see p. 140 of Wand and Jones (1995).

We denote byCfthe matrix whose ,th entry isCf, which is always non-negative de®nite. Also, we always haveBg>0 by assumption 2 in Appendix B.

We introduce a functionQ:R^dMd R !R Qv,M,a 1

4 P^d

,1Mvvp a

v₁. . .v_d1

4v^TMvp a

v₁. . .v_d 2:1

where Md is a set of non-negative de®nite dd matrices whose de®nition is given by expression (B.1) in Appendix B,R 0,1 andM is the,th entry of a matrixM.

Withh² h²₁, . . .,h²_d^T, we then have

(4)

AMISEh Q

h²,⁴_KCf, kKk^2d₂ Bg

n

: Simple algebra gives

Qv,M,a a^4=d4c^d=d4Qa ^2=d4c^2=d4v,c ¹M, 1

and therefore

vM,a a^2=d4c ^2=d4vc ¹M, 1 2:2

wherevM,adenotes the vector vthat minimizesQv,M, a. By theorem 6, vM, a is a well-de®ned and in®nitely smooth function ofManda. Using equation (2.2), we have the following theorem.

Theorem 1.Under assumptions (1)±(4), AMISEhis minimized by a unique vector which depends onBgandCfsmoothly via

h_opt

kKk^2d₂ Bg

n⁴_K

_1=d4

v¹⁼²fCf, 1g: 2:3

The optimal bandwidthh_opt given in formula (2.3) contains unknown quantitiesBgand Cfand hence cannot be directly used in the estimation offx. In Section 4, appropriate estimators offxand Cf are given based on a partial local cubic ®t with many cross- terms left out. These estimators have very simple bias formulae and their biases and variances are of the same order as keeping the cross-terms in.

3. Bandwidth selection

A quick substitute ofh_opt is the simple ROT bandwidth de®ned as h_ROT

kKk^2d₂ B_ROTg

n⁴K

_1=d4

v¹⁼²fC_ROTf, 1g 3:1

in whichC_ROTfandB_ROTgare based on piecewise quartic estimation offx, and whose formulae are given by expressions (5.1) and (5.2).

The bandwidth vectorh_ROTis of the same order ash_optbut does not estimateh_opteciently becausefmay not always be represented by a piecewise fourth-order polynomial. The idea of such an ROT bandwidth exists in the univariate case; see Fan and Gijbels (1995) and Ruppert et al. (1995). The latter additionally used a blocking idea that we have also adopted.

To improve on the ROT idea, we propose a PI bandwidth, so called as it approximatesh_opt by plugging in consistent estimates of the functionalsBgandCf. Speci®cally, we de®ne

h_PI

kKk^2d₂ Bg^ n⁴_K

_1=d4

v¹⁼²fC^ f, 1g 3:2

in whichC^ f fC^fg fC^f,hg where for each,1, . . .,d C^f 1

n Pⁿ

i1

f^X_i f^X_iwX_i

f^x f^xwxpxdxO_pn ¹⁼² 3:3

and the f^xis a partial local cubic estimator offxgiven in equation (4.2). The estimator Bg^ is de®ned in equation (5.3).

(5)

Using standard results such as contained in Wand and Jones (1995) or Fan and Gijbels (1996), we haveBg=Bg ^ 1Opn ^2=d4asn! 1. On the basis of this, the asymptotics ofC^ fas given in corollary 2 and the smooth dependence ofh_optonCfas in theorem 1, we have the following theorem.

Theorem 2. Under assumptions (1)±(4), for n! 1, the bandwidth de®ned in equation (3.2) is asymptotically optimal. In particular,

h_PIh_optfI_dO_pn ^2=d6g:

Sinceh_PIis a consistent substitute for the optimal bandwidthh_opt, it can be shown thath_PI performs asymptotically similarly toh_optin terms of MISE.h_PIwill on average work better thanh_ROTsince the latter is an inconsistent bandwidth estimator. Simulation results in Section 6 strongly support these expectations.

In Appendix A, we discuss the use of a scalar bandwidthhinstead ofh, i.e.h h, . . .,h.

In that case, the corresponding ROT bandwidth h_ROT and PI bandwidth h_PI are given by equations (A.2) and (A.3).

4. Partial local estimation

In this section, we formulate an estimator f^xof fx. In what follows, denote for any compact supported functionL

_rL ₁

1 u^rLudu;

hence in particular₂K ²_K. We write₄K ⁴_K wheredenotes the kurtosis of the kernelK, which is always greater than 1. For the derivative functions, we denote

fx @³

@x@x@x fx, etc.Denote

Z 1,fX_i x²g,fX_i xX_i xg₆,X_i x,fX_i xX_i x²g₆,

fX_i x²X_i xg₆,X_i x³ⁿ_i1, 4:1

and then de®ne

f^x 2e^TZ^TWZ ¹Z^TWY 4:2

whereeis a (5d 1)-vector of 0s whose (1)-element is 1,Y Y_i_n1and with a scalar bandwidthh h, . . .,h

Wdiag 1

n K_hX_i x

_n

i1

;

see Ruppert and Wand (1994) for a general de®nition of a local polynomial estimator. We then obtain the following theorem.

Theorem 3.Under assumptions (1)±(3), for 1, . . .,d, as h!0 and nh^d4! 1, we have

(6)

pnh^d4ff^x fx bxh²g !Nf0,²xg where

bx ₆K ⁶_K

12 1⁴_K fx P

6

²_K

2 fx, 4:3

²x 4f4K² 22K²²K kKk²₂⁴KgkKk^2d2 ¹g²x

⁸K 1²px : 4:4

The formation of the matrixZ is much simpler than the corresponding matrix of a full local cubic expansion. This makes the explicit mathematical derivation ofZ^TWZ ¹easier and computing expression (4.2) less costly.

On the basis of equation (4.2), the asymptotic properties ofC^f,hare presented in the next theorem. In the following, we denote K*u Kuu² ²_K, K*K* the convolution betweenKandK*,

K*K*t

Kt uK*udu, which is also

KtuK*udu by symmetry, andK²K*K, the self-convolution ofK.

Also, we denote bythe Kronecker index, which equals 1 or 0 depending on whether or not.

Theorem 4.Under assumptions (1)±(4), ash!0 andnh^d4! 1, we have C^f,h Cf fBf Bfgh²BgVK

nh^d4 O_ph⁴ o_p 1

np h^d8

in which

Bf

fxbxwxpxdx,

VK 4kKk^2d₂ ²f₄K²kKk²₂ 1 ²₂K² 2²_K₂K²kKk²₂⁴_KkKk⁴₂g

⁸K 1² ,

np

h^d8!^D Nf0,²g,Kg where

²g,K 32

g⁴xw²xdx ¹⁶_K 1⁴

FtFtdt and where we de®ne for any,1, . . .,d

Ft 1 K**KtK**Kt Q

6,K²t K*²tQ

6K²t:

Note here that

g⁴xw²xdx>0 by assumption (2), and that therefore C^f,hhas a standard deviation of ordernph^d8 ¹, which is smaller than one of the bias termsnh^d4 ¹.

(7)

The two bias terms point to a trade-o if both are positive or cancellation if the h²-term happens to be negative. Assuming that theh²-term is never 0, we obtain the following results for estimatingCfoptimally.

Corollary 1. Under the same assumptions as in theorem 4, the optimal bandwidth to estimateCfbyC^f,his

h_C_f_,opt

BgVK fBf Bfgn

_1=d6

ifBf Bf<0, d4

2

BgVK fBf Bfgn

_1=d6

ifBf Bf>0, 8>

>>

><

>>

:

4:5

which gives the standard error of ordernp

h^d8Cf,optOn d4=2d6and the bias of order h⁴_C_f_,optOn ^4=d6whenBf Bf<0 and of orderh²_C_f_,optOn ^2=d6when Bf Bf>0.

We can now complete the de®nition of C^f, and therefore of C^ f. Estimate all the derivativesfxetc. by blocked quartic ®ts and obtain the estimates ofBfas described in Section 5. Then plug them in formula (4.5) together withBg^ in place ofBg. Use the result- ing bandwidth in the estimator C^f. De®nition (3.3) and theorem 4 yield the following corollary.

Corollary 2.Under assumptions (1)±(4), forn! 1, theC^ fde®ned by equation (3.3) has the consistency property

C^ f CffI_dO_pn ^2=d6g:

5. Implementation

In this section we describe how to estimate Cf and Bg to obtain the ROT and PI bandwidths (3.1) and (A.2), and (3.2) and (A.3) respectively.

For the ROT estimation offX_iandfX_i, Fan and Gijbels (1995) suggested the use of a quartic Taylor expansion. Ruppertet al. (1995) proposed to separate the data into equal- sized blocks and to use a quartic Taylor expansion on each block in case the function is very wiggly. Since the number of parameters per block increases dramatically with the increase in dimension and since the wiggliness of the function depends on each variable dierently, our

¯exible blocking scheme sorts the data one variable at a time. LetNdenote the number of blocks in theth direction and collect them in N N₁, . . ., N_d. ThenN^d₁Nis the total number of blocks given the choices of N, 1, 2, . . .,d. To choose the optimal blocks we follow Ruppertet al. (1995) and use Mallows's C_p,

C_pN RSSNfn kdn=4kdg

min_NfRSSNg fn 2kdNg

where RSSNdenotes the residual sum of squares based on the quartic ®t with blockingN, kd 1P⁴

i1

di 1 i

is the maximum number of parameters in one block and adenotes the integer part of a.

(8)

Denoting the block vector chosen withC_pNbyN*, and by f^_ROT,N*, f^_ROT,N*, the corresponding function estimators, we then estimate

B_ROTg 1 n kdN*

Pⁿ

i1

fY_i f^_ROT,N*X_ig²wX_i

p_ROTX_i , 5:1

C_ROTf 1

n Pⁿ

i1

f^_ROT,N*,X_i f^_ROT,N*,X_iwX_i

5:2

wherep_ROTxis the uniform density on the range of the data. These estimates are necessarily inconsistent unless the truefis a piecewise quartic function.

This problem is avoided by PI estimation described in Section 4 and Appendix A. Now we estimateBgby

Bg ^ 1 n

Pⁿ

i1

wX_ifY_i f^_h_ROTX_ig²

~

p_h_SX_i 5:3

where h_ROT is de®ned by equations (3.1), (5.1) and (5.2), h_S is the d-dimensional normal reference bandwidth (Silverman (1986), pages 86±87)) andp~_h_Sxis a kernel density estimator.

To estimateCfwe use partial local cubic regression as described in Section 4. This is better done with a bandwidth that is slightly larger than the optimal pilot bandwidthh_C_,opt. The reason is that the bandwidthh_C_,optonly minimizes the absolute value of the bias, whereas it completely ignores the variance in Cf estimation. Asymptotically this is justi®ed since the standard deviation ofC^ f,his of higher order than the bias. To use a bandwidth that is not equal to h_C_,opt would produce in theCf estimate a larger bias than the minimum bias, but not signi®cantly so if the bandwidth is larger. To see this eect we take the bandwidthh_C_,optand plot in Fig. 1 the ratio of the bias against the minimum bias (achieved at1) as a function rof2 0:6, 2, ford4. Here, we assume thatBB>0, and the ratio then is

r fd4=2g^2=d6² fd4=2g d4=d6 ^d4

fd4=2g^2=d6 fd4=2g d4=d6 :

The broken line has the height of r2. We can see that r is about 3 or less for >1, whereas there is a dramatic increase for <1. Also it is easy to verify thatr244 for any d. Meanwhile, for2, the ratio of decrease in the standard deviation is ^d8=2, which is 1=32 ifd2 and more for largerd:This decrease in standard deviation is drastic whereas the increase in bias is at most 4 times. We therefore use 2h_C_,opt, which should have very good

®nite sample performance.

Fig. 1. Ratio of the increase in bias when estimatingC(f) withhC(f),opt, ford4

(9)

The estimation of the unknown functionalBis based on blocked quartic ®ts offifd43.

For largerd, the number of parameters can be so large that for medium sample sizes blocking or even global estimation becomes impossible. We therefore use a partial quartic ®t for the estimation of eachB, consisting of only those fourth-order derivativesf,1, . . .,d, in equation (4.3) plus all lower order derivatives that form a subset of those. As an example, such a partial quartic polynomial centred at zero reads

b₀P^d

1bXP^d

1bXX P^d

1,6bX²P^d

1bXX² P^d

1,6bX²XP^d

1bX²X²: For this partial expansion the number of parameterskd 6d 1 grows linearly ind. It also includes all terms to estimatef. Ford4, the number of parameters drops to about a third of the complete expansion. The partial expansion allows not only more blocks but also the blocks to focus more on the variable directions that have more curvature. This is a new feature that is not an issue in the univariate case.

For ecient partial quartic ®ts we propose to use at least 4kdobservations in each block.

This amounts to requiring n54kd 46d 1. For example, if d4, partial blocking needsn592 observations. Ifn<46d 1, we cannot recommend any of our methods with con®dence as there are just not enough data to estimatefaccurately.

To use standard software for the numerical optimization problem in equations (3.1), (3.2) or (2.3), it may be convenient to reparameterizevasvexpu, i.e.v₁expu₁, . . ., v_d expu_dwhereu u₁, . . ., u_d^T2R^dso there is no constraint onu. The gradient and Hessian matrices are then

@Qfexpu,M,ag

@u 1 2

expu₁P^d

1M₁ expu ...

expu_dP^d

1M_dexpu 0

BB BB BB

@

1 CC CC CC A

a

2pfexpu₁. . . expu_dg 1

...

1 0 B@

1 CA

@²Qfexpu,M,ag

@u@u^T 1

2 diagfexpu₁, . . ., expu_dgMdiagfexpu₁, . . ., expu_dg

1

2 diagfM₁₁ exp2u₁, . . ., M_ddexp2u_dg

a

4p

fexpu₁. . . expu_dg1_dd,

which make it easy to ®nduM,a lnfvM,ag, theu-value that minimizesQfexpu,M, ag. We then obtain the following substitute for equation (2.3):

h_opt

kKk^2d₂ Bg

n⁴_K

_1=d4

exp

ufCf, 1g 2

: Similar substitutes are used to compute equations (3.1) and (3.2).

6. Simulation results

In this section we investigate the ®nite sample performance of the ROT bandwidths h_ROT

(10)

given by equation (3.1) and h_ROT given by equation (A.2), the PI bandwidths h_PI given by equation (3.2) andh_PI given by equation (A.3) for the local linear estimation offx. These results are compared with those obtained by using the asymptotic optimal bandwidthsh_optas in equation (2.3) andhoptas in equation (A.1), which we can compute sincefx,gxandpx

are explicitly given. For each pseudodata set generated, we compute the ISE

Table 1. Model 1: mean of the ISE for various bandwidth selectors

Bandwidth Kind Means for the following sample sizes:

100 250 500

Diagonal Asymptotic optimal 0.0326 0.0165 0.00893

PI 0.0409 0.0183 0.00931

ROT 0.0656 0.0233 0.0112

Scalar Asymptotic optimal 0.0338 0.0174 0.00934

PI 0.0401 0.0180 0.00928

ROT 0.0734 0.0244 0.0113

100 250 500

PI 0.0523 0.0248 0.0139

ROT 0.0752 0.0301 0.0168

PI 0.0503 0.0240 0.0136

ROT 0.0824 0.0314 0.0175

100 250 500

PI 0.0711 0.0334 0.0185

ROT 0.114 0.0469 0.0242

PI 0.0747 0.0354 0.0202

ROT 0.137 0.0560 0.0272

100 250 500

PI 0.0873 0.0378 0.0219

ROT 0.230 0.0525 0.0268

PI 0.105 0.0536 0.0337

ROT 0.282 0.0823 0.0431

(11)

1 n

P^s

i1ff^_hU_i fU_ig²

for the diagonal bandwidths hh_opt, h_PI, h_ROT and scalar bandwidths hh_opt, h_PI, h_ROT, where theU_iare taken from an equally spaced grid of roughly 2500 points. In total we consider six models of varying complexity and in all cases 100 replications are conducted. For all the models the random design matrixXis an independent sample from the uniform distribution on0, 1^dwith sample sizen, and for the estimation ofCfwe screen o observationsX_ithat are outsidea, 1 a^d wherea 1 0:9^1=d=2. The sample sizen is 100, 250 or 500 except for model 6 wheren250 andn500. The models are

YfX , N0, 0:25, where the regression function is

fx x²₁x⁴₂ for model 1,

fx sinfx₁x₂g for model 2,

fx sinx₁ sin2x₂ for model 3,

fx fsinx₁ sin4x₂g=2 for model 4,

100 250 500

PI 0.142 0.0573 0.0297

ROT 0.306 0.100 0.0466

PI 0.145 0.0692 0.0410

ROT 0.338 0.116 0.0606

250 500

Diagonal Asymptotic optimal 0.459 0.291

PI 0.363 0.225

ROT 0.641 0.502

Scalar Asymptotic optimal 0.791 0.502

PI 0.447 0.308

ROT 0.793 0.747

(12)

fx fx₁ 0:5²x²₂g sin2x₃ for model 5 and

fx sinfx₁0:5x₂2x₃0:5x₄g for model 6.

Table 1 reports in its ®rst three rows the mean of the ISE based onh_opt,h_PI,h_ROTfor each sample size of model 1. The last three rows present the corresponding results for the scalar bandwidthh_opt, h_PI and h_ROT. Similarly, Tables 2±6 display the results for models 2±6. In addition, Fig. 2 presents plots of the densities of the ISE based onh_opt(full curve),h_PI(short broken curve) and h_ROT (long broken curve) for all models, where the sample size is 500.

Inspecting Tables 1±6 and Fig. 2, we ®nd that the PI bandwidth is always superior to the ROT bandwidth as it delivers function estimates that have smaller ISEs. Furthermore, the mean of the ISE declines with sample size as expected.

If we useCfto measure complexity, the two-dimensional models 1±4 are of increasing complexity withCf 48:8, 194:8, 608:8, 3129:3 respectively. Since the ROT bandwidth is based on piecewise quartic ®ts, we might expect the ROT function estimates to perform worse

Fig. 2. Density of the ISE for (a) model 1, (b) model 2, (c) model 3, (d) model 4, (e) model 5 and (f) model 6 (Ð,hopt; - - - - -,hPI; ± ± ±,hROT;n500)

(13)

Fig. 3. Functionf(.) of model 3: (a) true function; (b) estimate withhPI,n100; (c) estimate withhROT,n100;

(d) estimate withhopt,n500; (e) estimate withhPI,n500; (f) estimate withhROT,n500

(14)

than the PI estimates for complex models that are far from being piecewise quartic. This is con®rmed by Tables 1±4 and the ISE densities in Figs 2(a)±2(d). Fig. 3 shows the true function of model 3 and its estimates for one replication based onh_PIandh_ROT,n100, or h_opt,h_PIandh_ROT,n500.

For higher dimension, the advantage of PI over ROT estimates becomes more signi®cant.

For the three-dimensional model 5 and four-dimensional model 6, Tables 5 and 6 show a reduction in MISE up to 60%. The ISE densities are plotted in Figs 2(e) and 2(f).

For models 1 and 2, there is practically no dierence between the diagonal and scalar bandwidth, whereas for model 3 the diagonal has only a slight advantage. This is due to the form of f, which does not require dierent amounts of smoothing in each direction. For models 4±6, the improvement in diagonal bandwidth becomes signi®cant as the f-function requires very dierent amounts of smoothing for various directions. Overall, we clearly prefer the diagonal to the scalar bandwidth.

Although the function estimate is the ultimate goal, it is also informative to analyse how well the PI and ROT bandwidths approximate the asymptotically optimal bandwidthhopt. When comparing the densities of ratios, we ®nd thath_PI=h_optis always to the right ofh_ROT=h_opt. The

Fig. 4. Densities of bandwidth ratios for model 3 (- - - , PI; ± ± ±, ROT): (a) ®rst elements ofhPI=hoptand hROT=hopt,n100; (b) second elements ofhPI=hoptandhROT=hopt,n100; (c)hPI=hoptandhROT=hopt,n100;

(d) ®rst elements ofhPI=hoptandhROT=hopt,n500; (e) second elements ofhPI=hoptandhROT=hopt,n500; (f) hPI=hoptandhROT=hopt,n500

(15)

ratio for the PI bandwidth is always closer to 1 except for a few cases where the ROT ratio is also close to 1. The PI ratiohPI=hoptalso approaches 1 faster than the ROT ratiohROT=hoptas the sample sizen increases. All these conclusions can be drawn from inspecting the density plots. Fig. 4 presents the density plots for the complicated model 3 based on 100 (Figs 4(a)±

4(c)) and 500 (Figs 4(d)±4(f)) observations. The densities of the ®rst and second elements of h_PI=h_optandh_ROT=h_opt are depicted in Figs 4(a), 4(b), 4(d) and 4(e). Figs 4(c) and 4(f) show the densities ofh_PI=h_optandh_ROT=h_opt. Increasing complexity or dimension makes the bandwidth estimation more dicult.

7. Conclusion

We have shown the existence of both a diagonal and a scalar optimal bandwidth for multivariate regression and have proposed both an ROT and a PI bandwidth selector. The ROT method is simple to use but may perform poorly because of its inconsistency. The PI method is achieved by using partial local cubic regression to estimate the unknown functionals of second derivatives. The partial local cubic expansion does not need many terms and has simple bias and variance formulae. The pilot bandwidths for the functional estimation are determined by blocked partial quartic ®ts. In Monte Carlo simulations we investigated the performance of these automatic bandwidth selectors for models up to four dimensions and various sample sizes. The diagonal PI method is found to be the best in terms of the ISE, although the ROT method can be useful if the model is not too complicated. Furthermore, a bandwidth vector is always preferable to a scalar bandwidth. These conclusions are observed with a sample size as small as 100.

The results of this paper can be generalized to autoregressive time series under conditions that lead to geometric mixing; see, for instance, HaÈrdleet al. (1998) for such conditions. The scalar plug-in bandwidth was used to obtain a data-driven asymptotic ®nal prediction error for the non-linear time series lag selection developed by Tschernig and Yang (1999).

Acknowledgements

The authors thank Rong Chen and Michael Neumann for their many helpful discussions. The very insightful comments from the referees and the Joint Editor are also gratefully acknow- ledged. These comments stimulated a substantial extension of the method. This work was supported by Sonderforschungsbereich 373 `Quanti®kation und Simulation OÈkonomischer Prozesse' of the Deutsche Forschungsgemeinschaft, at Humboldt-UniversitaÈt zu Berlin.

Appendix A

We demonstrate that a scalar bandwidthhcan be used instead of a bandwidth vectorh h1, . . .,hd for estimatingfx. Estimation based on a single bandwidthhachieves the same convergence rate as multiple bandwidths but may be less ecient when in each variable direction a dierent amount of smoothing is appropriate. However, a single bandwidth is more easily computed.

When using a single bandwidthhto estimate the functionfx, the AMISE is the following function ofh:

AMISEh ⁴K

4 Cfh⁴ kKk^2d2 Bg 1 nh^d

whereCf ^d,1Cf. The scalar bandwidthhthat minimizes AMISEhis

(16)

hopt

dkKk^2d2 Bg

4n⁴KCf 1=d4

: A:1

DenoteC^ f,h ^d,1C^f,h; then we have the following theorem.

Theorem 5.Under assumptions (1)±(4) in Appendix B, ash!0 andnh^d4! 1, we have

C^ f,h Cf 2h² P^d

,1Bf Bg P^d

,1VK

nh^d4 Oph⁴ op

1

nph^d8

in which

np

h^d8!^D Nf0,²g,Kg where

²g,K 32

g⁴xw²xdx ¹⁶K 1⁴

P^d

,1Ft

₂ dt:

The optimal bandwidth to estimateCfbyC^f,his

hCf,opt

Bg P^d

,1VK

2 P^d

,1Bfn 8>

>>

<

>>

>:

9>

>>

=

>>

>;

1=d6

if P^d

,1Bf<0,

d4

Bg P^d

,1VK

4 P^d

,1Bfn 8>

>>

<

>>

>:

9>

>>

=

>>

>;

1=d6

if P^d

,1Bf>0:

8>

>>

<

>>

>:

By using the formulae in theorem 5, we can obtain both the ROT and the PI bandwidth

hROT dkKk^2d₂ B_ROTg

4n⁴K P^d

,1C_ROT,f 8>

>>

<

>>

>:

9>

>>

=

>>

>;

1=d4

, A:2

hPI

dkKk^2d2B^ g

4n⁴KC^ f _1=d4

A:3

in which BROTg and CROT,f are the same as used in equation (3.1). It is easy to verify that hPIhoptf1Opn ^2=d6g.

Appendix B

We denote bySthe support ofwx, and all integrals in variablexare overS. The following assumptions are needed to prove theorems 1, 3 and 4.

(a) The densitypx ofXexists, is continuously dierentiable up to order 2 and is bounded below away from 0 onS(assumption 1).

(b) The functionfxis continuously dierentiable onSup to order 4 whereasgxis continuous on Sand positive on an open subset ofS(assumption 2).

(c) The bandwidth hhn is a positive number depending on nsuch thath!0 andnh^d4! 1 (assumption 3).

(17)

Denote byMd the set of all non-negative de®nite matricesMthat are positive de®nite for non- negative vectors, i.e.

Md

M:Mnon-negative definite and inf

h2S^d¹h^TMh cM>0

B:1

where S^d¹S^d ¹\R^d denotes the portion of the d 1-dimensional unit sphere S^d ¹ with non- negative co-ordinates. For the diagonal bandwidth to exist, we assume that the matrixCf 2Md

(assumption 4).

Assumption 4 is not trivial. Take the functionfx1,x2 expx1 sinx2; together withpxequal to the uniform density on0, 1² andwx px, we have

Cf

0,1² exp2x1sin²x2dx 1 ² ² ⁴

1

4fexp2 1g 1 ²

² ⁴

and therefore², 1Cf², 1^T0 even though², 1^T2R².

For the proof of theorem 1, we also need lemmas 1 and 2 and theorem 6.

Lemma 1.

@Qv,M,a

@v 1 2

P^d

1 M1v

P^d ...

1Mdv

0 BB BB BB

@

1 CC CC CC A

a 2p

v1. . .vd v1¹

...

vd¹

0 BB

@ 1 CC

A, B:2

@²Qv,M,a

@v@v^T 1

2M a

4p

v1. . .vd

3v1² v1¹v2¹ . . . v1¹vd¹

v2¹v1¹ 3v2² . . . v2¹vd¹

... ... ... ...

vd¹v1¹ vd¹v2¹ . . . 3vd²

0 BB BB BB B@

1 CC CC CC CA

: B:3

Proof.Direct matrix calculation plus de®nition (2.1) yield the results.

Lemma 2.For any ®xedM2Mdanda>0,Qv,M,ais a strictly convex function ofv2R^dand therefore can be minimized by at most one vectorvinR^d.

Proof.Both terms in equation (B.3) are non-negative de®nite, whereas the second term is positive de®nite, so the Hessian ofQv,M,awith respect tovis positive de®nite.

Theorem 6. For any ®xedM2Md anda>0, Qv,M,ahas a unique minimum vectorv_opt vM,a, which is in®nitely continuously dierentiable in bothManda.

Proof.M2Mdimplies that Qv,M,a51

4 P^d

,1Mvv5 1

4cMP^d

1v²1

4cMkvk² for allv2R^dand Euclidean normkvk ^d1v²¹⁼², so

kvk!1lim fQv,M,ag51 4 lim

kvk!1fcMkvk²g 1 while it is also clear that we always have

kvk!0lim fQv,M,ag5 1 4 lim

kvk!0

pv1a. . .vd

1:

(18)

Thus the in®mum ofQv,M,ais reached at avwhose norm is bounded away from1and 0, and by the previous lemma thisvis unique. Equation (B.2) and the implicit function theorem show thatvM,a

is an in®nitely continuously dierentiable function ofManda.

To prove theorem 3, we need the dispersion matrix of the partial local cubic estimator.

Lemma 3.LetZ be as in equation (4.1) and

Hdiag1,h ²I2d 1,h ¹Id,h ³I2d 2,h ³:

Asn! 1,

H^TZ^TWZHdiag

1 ²K11d

²K1d1 ⁴Kf 1Id1ddg

,⁴KId 1,D

fpxI5d 1op1g B:4

and

H^TZ^TWZH ¹diag

1d 1 ¹ K² 1 ¹11d

K² 1 ¹1d1 K⁴ 1 ¹Id

,_K⁴I_d ₁,D ¹ I5d 1

pxo_p1

B:5

uniformly in a compact neighbourhood ofx. Here

D

²KId A B C A^T ⁶KId 1 0d 1d 1 0d 11

B^T 0d 1d 1 ⁶Kf 1Id 11d 1d 1g ⁶K1d 11

C^T 0_1d ₁ ⁶_K1_1d ₁ ₆K

0 BB BB

@

1 CC CC A

in which A⁴K

I 1 0 1d

01 1 01d

0_d ₁ I_d 0

B@

1

CA, B⁴K

0 1d 1

11d 1

0_d _d ₁

0 B@

1

CA, C⁴K

0 11

1 0_d ₁ 0 B@

1 CA:

Proof.Equation (B.4) follows by the usual array-type central limit theorem, Kbeing a symmetric compact product kernel; see, for example, Wand and Jones (1995). Equation (B.5) follows by direct veri®cation. The exact inverse ofDcan be obtained by using the tools in LuÈtkepohl (1996).

Proof of theorem 3.Given any1, . . .,dand denotingZbyZ, we ®rst note that e^THH^TZ^TWZH ¹H^TZ^TWZe0fx e^TZ^TWZ ¹Z^TWZe0fx 0,

e^THH^TZ^TWZH ¹H^TZ^TWZefx fx, whereas for any⁰1, . . ., 5d 1,⁰6,

e^THH^TZ^TWZH ¹H^TZ^TWZe⁰e^TZ^TWZ ¹Z^TWZe⁰0:

Combining these with equation (B.5) of lemma 3, we have

f^x fx 2e^THH^TZ^TWZH ¹H^TZ^TWY fx

T1T2T3

where

T1 2op1

⁴K 1pxnh² Pⁿ

i1KhXi x

Xi x² h² ²K

fXi fx Xi x^Trfx

1

2Xi x^Tr²fxXi x 1

3! fxXi x³ P

6

3

3! fxXi x²Xi x P

6

3

3! fxXi xXi x²

,