
HAL Id: hal-01688664

https://hal.archives-ouvertes.fr/hal-01688664

Preprint submitted on 19 Jan 2018



To cite this version:

Minh-Lien Jeanne Nguyen. Nonparametric method for sparse conditional density estimation in moderately large dimensions. 2018. hal-01688664


Nonparametric method for sparse conditional density estimation in moderately large dimensions

Minh-Lien Jeanne Nguyen
Laboratoire de Mathématiques d'Orsay, Univ. Paris-Sud, CNRS, Université Paris-Saclay, 91405 Orsay, France

January 19, 2018

Abstract: In this paper, we consider the problem of estimating a conditional density in moderately large dimensions. Much more informative than regression functions, conditional densities are of main interest in recent methods, particularly in the Bayesian framework (studying the posterior distribution, finding its modes, ...).

Considering a recently studied family of kernel estimators, we select a pointwise multivariate bandwidth by revisiting the greedy algorithm Rodeo (Regularisation Of Derivative Expectation Operator). The method addresses several issues: it is greedy and computationally efficient thanks to an iterative procedure; it avoids the curse of high dimensionality under suitably defined sparsity conditions through early variable selection during the procedure; and it converges at a quasi-optimal minimax rate.

Keywords: conditional density, high dimension, minimax rates, kernel density estimators, greedy algorithm, sparsity, nonparametric inference.

1 Introduction

1.1 Motivations

In this paper, we consider the problem of conditional density estimation. We observe an n-sample of a couple (X, Y), in which Y is the vector of interest while X gathers auxiliary variables. We denote d the joint dimension. In particular, we are interested in the inference of the d-dimensional conditional density f of Y conditionally to X.

There is a growing demand for methods of conditional density estimation in a wide spectrum of applications such as economics [Hall et al. 2004], cosmology [Izbicki and Lee 2016], medicine [Takeuchi et al. 2009], actuarial science [Efromovich 2010b] and meteorology [Jeon and Taylor 2012], among others. It can be explained by the double role of conditional density estimation: deriving the underlying distribution of a dataset and determining the impact of the vector X of auxiliary variables on the vector of interest Y. In this respect, conditional density estimation is richer than both unconditional density estimation and the regression problem. In particular, in the regression framework, only the conditional mean E[Y|X] is estimated instead of the full conditional density, which can be especially poorly informative in case of an asymmetric or multimodal conditional density.

Conversely, from conditional density estimators, one can, e.g., derive the conditional quantiles [Takeuchi et al. 2006] or give accurate predictive intervals [Fernández-Soto et al. 2002]. Furthermore, since the posterior distribution in the Bayesian framework is actually a conditional density, the present paper also offers an alternative method to the ABC methodology (for Approximate Bayesian Computation) [Beaumont et al. 2002; Marin et al. 2012; Biau et al. 2015] in the case of an intractable-yet-simulable model.

The challenging issue in conditional density estimation is to circumvent the "curse of dimensionality". The problem is twofold: theoretical and practical. In theory, it is reflected by the minimax approach, which states that in a d-dimensional space the best convergence rate for the pointwise risk over a p-regular class of functions is O(n^{-p/(2p+d)}): in particular, the larger d is, the slower the rate. In practice, the larger the dimension is, the larger the sample size needed to control the estimation error. In order to maintain reasonable running times in moderately large dimensions, methods have to be designed especially greedy.

Furthermore, an interesting question is how to retrieve the potentially relevant components in case of a sparsity structure on the conditional density f. For example, if we have at our disposal plenty of auxiliary variables without any indication of their dependency with our vector of interest Y, the ideal procedure would take as input the whole dataset and still achieve a running time and a minimax rate as fast as if only the relevant components were given and considered for the estimation. More precisely, two goals are simultaneously addressed: converging at the rate O(n^{-p/(2p+r)}), with r the relevant dimension, i.e. the number of components that influence the conditional density f, and detecting the irrelevant components at an early stage of the procedure in order to work afterwards only on the relevant data and thus speed up the running time.

1.2 Existing methodologies

Several nonparametric methods have been proposed to estimate conditional densities: kernel density estimators [Rosenblatt 1969; Hyndman et al. 1996; Bertin et al. 2016] and various methodologies for the selection of the associated bandwidth [Bashtannyk and Hyndman 2001; Fan and Yim 2004; Hall et al. 2004]; local polynomial estimators [Fan et al. 1996; Hyndman and Yao 2002]; projection series estimators [Efromovich 1999; 2007]; piecewise constant estimators [Györfi and Kohler 2007; Sart 2017]; copula-based estimators [Faugeras 2009]. But most of the aforementioned works are only defined for bivariate data, or at least require either X or Y to be univariate; moreover, they are computationally intractable as soon as d > 3.

It is in particular the case for the kernel density methodologies [Hall et al. 2004; Bertin et al. 2016]: they achieve the optimal minimax rate, and even the detection of the relevant components, thanks to an adequate choice of the bandwidth (by cross-validation and by the Goldenshluger-Lepski methodology, respectively), but the computational cost of these bandwidth selections is prohibitive even for moderate sizes of n and d. To the best of our knowledge, only two kernel density methods have been proposed to handle large datasets. [Holmes et al. 2010] propose a fast method of approximated cross-validation, based on a dual-tree speed-up, but they do not establish any rate of convergence and only show the consistency of their method. For scalar Y, [Fan et al. 2009] proposed to perform a prior dimension reduction step on X to bypass the curse of dimensionality, and then to estimate the bivariate approximated conditional density by kernel estimators. But the proved convergence rate n^{-1/3} is not the optimal minimax rate n^{-3/8} for the estimation of a bivariate function of assumed regularity 3. Moreover, the dimension reduction step restricts the dependency on X to a linear combination of its components, which may induce a significant loss of information.

Projection series methods for scalar Y have also been proposed. [Efromovich 2010a] extends his previous work [Efromovich 2007] to a multivariate X. Theoretically, the method achieves an oracle inequality, and thus the optimal minimax rate. Moreover, it performs an automatic dimension reduction on X when there exists a smaller intrinsic dimension. To our knowledge, it is the only method which addresses datasets of dimension larger than 3 with reasonable running times and does not pay for its numerical performance with non-optimal minimax rates. However, the computational cost is prohibitive when both n and d are large. More recently, Izbicki and Lee have proposed two methodologies using orthogonal series estimators [Izbicki and Lee 2016; 2017]. The first method is particularly fast and can handle a very large X (with more than 1000 covariates). Moreover, the convergence rate adapts to a possibly smaller unknown intrinsic dimension of the support of the conditional density. The second method originally proposes to convert successful high-dimensional regression methods into conditional density estimation, interpreting the coefficients of the orthogonal series estimator as regression functions, which allows it to adapt to all kinds of settings (mixed data, smaller intrinsic dimension, relevant variables) depending on the regression method. However, both methods converge more slowly than the optimal minimax rate. Moreover, their optimal tunings depend in fact on the unknown intrinsic dimension.

For multivariate X and Y, [Otneim and Tjøstheim 2017] propose a new semiparametric method, called the Locally Gaussian Density Estimator: they rewrite the conditional density as a product of a function depending on the marginal distribution functions (easily estimated since univariate, then plugged in) and a term which measures the dependency between the components, which is approximated by a centred Gaussian whose covariance is parametrically estimated. Numerically, the methodology seems robust to the addition of covariates of X independent of Y, but this robustness is not proved. Moreover, they only establish the asymptotic normality of their method.


1.3 Our strategy and contributions

The challenge in this paper is to handle large datasets; we thus assume at our disposal a sample of large size n and of moderately large dimension. Our work is then motivated by the following three objectives:

(i) achieving the optimal minimax rate (up to a logarithmic term);

(ii) being greedy, meaning that the procedure must have reasonable running times for large n and moderately large dimensions, in particular when d > 3;

(iii) adapting to a potential sparsity structure of f. More precisely, in the case where f locally depends on only r of its d components, r can be seen as the local relevant dimension. Then the desired convergence rate has to adapt to the unknown relevant dimension r: under this sparsity assumption, the benchmark for the estimation of a p-regular function is a convergence rate of order O(n^{-p/(2p+r)}), which is the optimal minimax rate if the relevant components were given by an oracle.

Our strategy is based on kernel density estimators. The considered family was recently introduced and studied in [Bertin et al. 2016]. This family is especially designed for conditional densities and is better suited to objective (iii) than the intensively studied estimator built as the ratio of a kernel estimator of the joint density over one of the marginal density of X. For example, a component that is relevant for the joint density and the marginal density of X may be irrelevant for the conditional density; this is the case if a component of X is independent of Y. Note though that many more cases of irrelevance exist, since we define relevance as a local property.

The main issue with kernel density estimators is the selection of the bandwidth h ∈ (R_+)^d, and in our case we also want to meet objective (ii), since the pre-existing methodologies of bandwidth selection do not satisfy this restriction and thus cannot handle large datasets. In this paper, the selection is performed by an algorithm we call CDRodeo, derived from the algorithm Rodeo [Lafferty and Wasserman 2008; Liu et al. 2007], which has been applied to regression and to unconditional density estimation, respectively. The greediness of the algorithm allows us to address datasets of large sizes while keeping a reasonable running time (see Section 3.5 for further details). We give a simulated example with a sample of size n = 2·10^5 and of dimension d = 5 in Section 4. Moreover, Rodeo-type algorithms ensure an early detection of the irrelevant components, and thus achieve objective (iii) while improving objective (ii).

From the theoretical point of view, if the regularity of f is known, our method achieves an optimal minimax rate (up to a logarithmic factor), which is adaptive to the unknown sparsity of f. The last property is mostly due to the Rodeo-type procedure. The improvement of our method in comparison to the paper [Liu et al. 2007], which estimates the unconditional density with Rodeo, is twofold. First, our result is extended to any regularity p ∈ N^*, whereas [Liu et al. 2007] fixed p = 2. Secondly, our notion of relevance is both less restrictive and more natural. [Liu et al. 2007] studied the L2-risk of their estimator, and therefore had to consider a notion of global relevance, whereas we consider a pointwise approach, which allows us to define a local property of relevance that can be applied to a broader class of functions. Moreover, their notion of relevance is not intrinsic to the unknown density, but in fact depends on a tuning of the method, a previously chosen baseline density which has no connection with the density, which limits the interpretation of the relevance.

1.4 Overview

Our paper is organized as follows. We introduce the CDRodeo method in Section 2. The theoretical results are in Section 3, in which we specify the assumptions and the tunings of the procedure, from which the convergence rate and the complexity cost of the method are derived. A numerical example is presented in Section 4. The proofs are gathered in the last section.

2 CDRodeo method

Let W_1, . . . , W_n be a sample of a couple (X, Y) of multivariate random vectors: for i = 1, . . . , n, W_i = (X_i, Y_i), with X_i valued in R^{d_1} and Y_i in R^{d_2}. We denote d := d_1 + d_2 the joint dimension.

We assume that the marginal distribution of X and the conditional distribution of Y given X are absolutely continuous with respect to the Lebesgue measure, and we define f : R^d → R such that, for any x ∈ R^{d_1}, f(x, ·) is the conditional density of Y conditionally to X = x. We denote f_X the marginal density of X.

Our method estimates f pointwise: let us fix w = (x, y) ∈ R^d the point of interest.

Kernel estimators. Our method is based on kernel density estimators. More specifically, we consider the family proposed in [Bertin et al. 2016], which is especially designed for conditional density estimation. Let K : R → R be a kernel function, i.e. ∫_R K(t) dt = 1; then for any bandwidth h ∈ (R_+)^d, the estimator of f(w) is defined by:

$$\hat f_h(w) := \frac{1}{n}\sum_{i=1}^{n} \frac{1}{\tilde f_X(X_i)} \prod_{j=1}^{d} h_j^{-1} K\!\left(\frac{w_j - W_{ij}}{h_j}\right), \qquad (1)$$

where f̃_X is an estimator of f_X, built from another sample X̃ of X. We denote by n_X the sample size of X̃. The choices of K and f̃_X are specified later (see Section 3.2).
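As a concrete illustration, here is a minimal sketch of estimator (1) in Python/NumPy. This is not the authors' implementation: the function names are ours, and the Gaussian kernel is only a convenient stand-in (the theory requires a compactly supported kernel of order p, but the numerical Section 4 also uses a Gaussian kernel in practice).

```python
import numpy as np

def kernel_gauss(t):
    # Gaussian kernel; order 2, C^1, used here for illustration only.
    return np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)

def f_hat(w, W, f_X_vals, h, K=kernel_gauss):
    """Estimator (1) at the point w.
    w: point of interest, shape (d,); W: sample (X_i, Y_i), shape (n, d);
    f_X_vals: values of the estimator of f_X at the X_i's, shape (n,);
    h: bandwidth, shape (d,)."""
    U = (w[None, :] - W) / h[None, :]            # scaled differences, shape (n, d)
    prod_K = np.prod(K(U), axis=1) / np.prod(h)  # product kernel K_h(w - W_i)
    return np.mean(prod_K / f_X_vals)
```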

Bandwidth selection. In kernel density estimation, selecting the bandwidth is a critical choice, which can be viewed as a bias-variance trade-off. In [Bertin et al. 2016], it is performed by the Goldenshluger-Lepski methodology (see [Goldenshluger and Lepski 2011]) and requires an optimization over an exhaustive grid of couples (h, h') of bandwidths, which leads to intractable running times when the dimension exceeds 3 (and the dataset is large).

That is why we focus on a method which excludes optimization over an exhaustive grid of bandwidths, and rather propose a greedy algorithm derived from the algorithm Rodeo. First introduced in the regression framework [Wasserman and Lafferty 2006; Lafferty and Wasserman 2008], a variation of Rodeo was proposed in [Liu et al. 2007] for density estimation. Our method, which we call CDRodeo (for Conditional Density Rodeo), addresses the more general problem of conditional density estimation.

Like Rodeo (which stands for Regularisation Of Derivative Expectation Operator), the CDRodeo algorithm generates an iterative path of decreasing bandwidths, based on tests on the partial derivatives of the estimator with respect to the components of the bandwidth. Note that the greediness of the procedure rests on the selection of this path of bandwidths, which enables us to address high-dimensional problems of functional inference.

Let us be more precise: we take a kernel K of class C^1 and consider the statistics Z_hj, for h ∈ (R_+)^d and j = 1 : d, defined by:

$$Z_{hj} := \frac{\partial}{\partial h_j}\hat f_h(w).$$

Z_hj is easily computable, since it can be expressed as:

$$Z_{hj} = \frac{-1}{n\,h_j^2}\sum_{i=1}^{n} \frac{1}{\tilde f_X(X_i)}\, J\!\left(\frac{w_j - W_{ij}}{h_j}\right) \prod_{k\ne j} h_k^{-1} K\!\left(\frac{w_k - W_{ik}}{h_k}\right), \qquad (2)$$

where J : R → R is the function defined by:

$$J(t) := K(t) + t\,K'(t). \qquad (3)$$
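Expression (2) follows from (1) by differentiating the single factor of the product kernel that involves h_j; as a one-line check:

$$\frac{\partial}{\partial h_j}\left[h_j^{-1}K\!\left(\frac{w_j-W_{ij}}{h_j}\right)\right] = -h_j^{-2}K\!\left(\frac{w_j-W_{ij}}{h_j}\right) - \frac{w_j-W_{ij}}{h_j^{3}}\,K'\!\left(\frac{w_j-W_{ij}}{h_j}\right) = -h_j^{-2}\,J\!\left(\frac{w_j-W_{ij}}{h_j}\right).$$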

The details of the CDRodeo procedure are described in Algorithm 1 and can be summed up in one sentence: for a well-chosen threshold λ_hj (specified in Section 3.3), the algorithm performs at each iteration the test |Z_hj| > λ_hj to determine whether component j of the current bandwidth must be shrunk or not. It can be interpreted by the following principle: the bandwidth of a kernel estimator quantifies within which distance of the point of interest w, and to which degree, an observation W_i helps in the estimation. Heuristically, the larger the variation of f, the smaller the bandwidth required for an accurate estimation. The statistics Z_hj = ∂f̂_h(w)/∂h_j are used as a proxy for ∂f(w)/∂w_j to quantify the variation of f in the direction w_j. Note in particular that, since these partial derivatives vanish for irrelevant components, this bandwidth selection leads to an implicit variable selection, and thus avoids the curse of dimensionality under sparsity assumptions.
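Under the same (hypothetical) conventions as the sketch above, the closed form (2) makes each Z_hj computable in one pass over the data; here the derivative K' is obtained by a finite difference so that the sketch stays kernel-agnostic:

```python
def J_func(t, K=kernel_gauss, eps=1e-6):
    # J(t) = K(t) + t K'(t), cf. (3); K' approximated by a central difference.
    dK = (K(t + eps) - K(t - eps)) / (2.0 * eps)
    return K(t) + t * dK

def Z_stat(w, W, f_X_vals, h, j, K=kernel_gauss):
    """Statistic (2): the partial derivative of f_hat with respect to h_j."""
    U = (w[None, :] - W) / h[None, :]
    K_others = np.delete(K(U) / h[None, :], j, axis=1)  # factors for k != j
    prod_K = np.prod(K_others, axis=1)
    return -np.mean(J_func(U[:, j]) * prod_K / f_X_vals) / h[j] ** 2
```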


Algorithm 1: CDRodeo algorithm

1. Input: the point of interest w, the data W, β ∈ (0,1) the bandwidth decreasing factor, h_0 > 0 the bandwidth initialization value, a parameter a > 1.

2. Initialization:
   (a) Initialize the bandwidth: for j = 1 : d, h_j ← h_0.
   (b) Activate all the variables: A ← {1, . . . , d}.

3. While (A ≠ ∅) and (∏_{k=1}^d h_k ≥ (log n)/n):
   for all j ∈ A:
   (a) Update Z_hj and λ_hj.
   (b) If |Z_hj| ≥ λ_hj: update h_j ← β h_j; else: remove j from A.

4. Output: ĥ (and f̂_ĥ(w)).
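The whole of Algorithm 1 then fits in a few lines, reusing f_hat and Z_stat above and the threshold (4) of Section 3.3. This is a sketch under our assumptions: ‖J‖_2 and ‖K‖_2, which enter C_λ = 4‖J‖_2‖K‖_2^{d−1}, are passed as arguments since they depend on the chosen kernel, and the default parameter values follow the numerical Section 4.

```python
def cdrodeo(w, W, f_X_vals, beta=0.95, h0=0.4, a=1.1, nJ2=0.46, nK2=0.53):
    """Algorithm 1 (CDRodeo). nJ2, nK2 stand for ||J||_2, ||K||_2;
    the defaults are rough numerical values for the Gaussian kernel."""
    n, d = W.shape
    h = np.full(d, float(h0))          # step 2(a): h_j <- h0
    active = set(range(d))             # step 2(b): A <- {1, ..., d}
    C_lam = 4.0 * nJ2 * nK2 ** (d - 1)
    while active and np.prod(h) >= np.log(n) / n:   # step 3
        for j in sorted(active):
            z = Z_stat(w, W, f_X_vals, h, j)
            lam = C_lam * np.sqrt(np.log(n) ** a / (n * h[j] ** 2 * np.prod(h)))
            if abs(z) >= lam:          # step 3(b): shrink or deactivate
                h[j] *= beta
            else:
                active.discard(j)
    return h, f_hat(w, W, f_X_vals, h)
```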

3 Theoretical results

This section gathers the theoretical results of our method.

3.1 Assumptions

We consider K a compactly supported kernel. For any bandwidth h ∈ (R_+)^d, we define the neighbourhood U_h(u) of u ∈ R^{d_0} (typically, u = x or w, and d_0 = d_1 or d) as follows:

$$U_h(u) := \left\{u'\in\mathbb{R}^{d_0} : \forall j = 1:d_0,\ u'_j = u_j - h_j z_j,\ \text{with } z\in(\mathrm{supp}(K))^{d_0}\right\}.$$

Then we denote by h^{(0)} := (1/log n, . . . , 1/log n) the CDRodeo initial bandwidth and, for short, U_n(u) := U_{h^{(0)}}(u). We also introduce the notation ‖·‖_{∞,U} for the supremum norm over a set U.

The following first assumption ensures a certain amount of observations in the neighbourhood of our point of interest w.

Assumption 1 (f_X bounded away from 0). We assume δ := inf_{u∈U_n(x)} f_X(u) > 0.

Note that if the neighbourhood U_n(x) does not contain any observation X_i, the estimation of the conditional distribution of Y given the event X = x is obviously intractable.

The second assumption specifies the notions of "sparse function" and "relevant component", under which the curse of high dimensionality can be avoided.

Assumption 2 (Sparsity condition). There exists a subset R ⊂ {1, . . . , d} such that, for any fixed {z_j}_{j∈R}, the function {z_k}_{k∈R^c} ↦ f(z_1, . . . , z_d) is constant on U_n(w).

In other words, if we denote r the cardinality of R, Assumption 2 means that f locally depends on only r of its d variables. We call relevant any component in R. The notion of relevant component depends on the point where f is estimated. For example, a component w_j which enters the conditional density through 1_{[0,1]}(w_j) is only relevant in the neighbourhoods of 0 and 1. Note that this local property covers a broader class of functions, which extends the application field of Theorem 2 and improves the convergence rate of the method.

Finally, the conditional density is required to be regular enough.

(7)

Assumption 3 (Regularity of f). There exists a known integer p such that f is of class C^p on U_n(w) and such that ∂_j^p f(w) ≠ 0 for all j ∈ R.

3.2 Conditions on the estimator of f_X

Given the definition of the estimator (1), we need an estimator f̃_X of f_X.

If f_X is known. We take f̃_X ≡ f_X. This case is not completely obvious. In particular, it includes the case of unconditional density estimation, if we set by convention d_1 = 0 and f_X ≡ 1.

If f_X is unknown. We need an estimator f̃_X which satisfies the following two conditions:

(i) a positive lower bound: δ̃_X := inf_{u∈U_n(x)} f̃_X(u) > 0;

(ii) a concentration inequality in local sup norm: there exists a constant M_X > 0 such that

$$P\left(\sup_{u\in U_n(x)}\left|\frac{f_X(u)-\tilde f_X(u)}{\tilde f_X(u)}\right| > M_X\,\frac{(\log n)^{d/2}}{n^{1/2}}\right) \ \le\ \exp\left(-(\log n)^{5/4}\right).$$

The following proposition proves that these conditions are feasible. Furthermore, the provided estimator of f_X (see the proof in Section 5.3.1) is easily implementable and does not need any optimisation.

Proposition 1. Given a sample X̃ with the same distribution as X and of size n_X = n^c with c > 1, if f_X is of class C^{p_0} with p_0 ≥ d_1/(2(c−1)), then there exists an estimator f̃_X which satisfies (i) and (ii).
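For the unknown-f_X case, the proof of Proposition 1 (Section 5.3.1) takes f̃_X as a product-kernel density estimator, cf. (10), clipped from below at (log n)^{-1/4} so that condition (i) holds. A sketch along those lines, with the kernel and bandwidth left as user choices (names are ours):

```python
def f_X_kernel(u, X_aux, h_X, K=kernel_gauss):
    # Kernel density estimator (10) of f_X at a point u,
    # from the auxiliary sample X_aux of shape (n_X, d1).
    U = (u[None, :] - X_aux) / h_X
    return np.mean(np.prod(K(U), axis=1)) / h_X ** X_aux.shape[1]

def f_X_tilde(points, X_aux, h_X, n):
    # Clipped estimator of the proof: max(f_X^K, (log n)^(-1/4)), row by row.
    floor = np.log(n) ** (-0.25)
    vals = np.array([f_X_kernel(u, X_aux, h_X) for u in points])
    return np.maximum(vals, floor)
```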

3.3 CDRodeo parameters choice

Kernel K. We choose the kernel function K : R → R of class C^1, with compact support and of order p, i.e.:

for ℓ = 1, . . . , p−1, ∫_R t^ℓ K(t) dt = 0, and ∫_R t^p K(t) dt ≠ 0.

Note that considering a compactly supported kernel is fundamental for the local approach. In particular, it relaxes the assumptions by restricting them to a neighbourhood of w.

Taking a kernel of order p is usual for the control of the bias of the estimator.

Parameter β. Let β ∈ (0,1) be the decreasing factor of the bandwidth. The larger β is, the more accurate the procedure, but the longer the computational time. From the theoretical point of view, its value is of little importance, as it only affects the constant terms. In practice, we set it close to 1.

Bandwidth initialization. We recall that we set h_0 := 1/log n (and the initial bandwidth as (1/log n, . . . , 1/log n)).

Threshold λ_hj. For any bandwidth h ∈ (R_+)^d and for j = 1 : d, we set the threshold as follows:

$$\lambda_{hj} := C_\lambda \sqrt{\frac{(\log n)^a}{n\, h_j^2 \prod_{k=1}^d h_k}}, \qquad (4)$$

with C_λ := 4‖J‖_2‖K‖_2^{d−1} (where J is defined in (3)) and a > 1. This expression is obtained by using concentration inequalities on Z_hj. For the proof, the parameter a has to be tuned such that

$$(\log n)^{a-1} \ >\ \frac{\|f\|_{\infty,U_n(w)}}{\delta}, \qquad (5)$$

which is satisfied for n large enough. The influence of this parameter is discussed in the next section, once the theoretical results are stated.

Hereafter, unless otherwise specified, the parameters are chosen as described in this section.


3.4 Main results

Let us denote ĥ the bandwidth selected by CDRodeo. In Theorem 2, we introduce a set H_hp of bandwidths which contains ĥ with high probability, which leads to an upper bound on the pointwise estimation error with high probability. In Corollary 3, we deduce the convergence rate of CDRodeo from Theorem 2.

More precisely, in Theorem 2 we determine lower and upper bounds (with high probability) for the stopping iteration of each bandwidth component. We set:

$$\tau_n := \frac{1}{(2p+r)\log\beta^{-1}}\,\log\!\left(\frac{n}{C_\tau\,(\log n)^{2p+d+a}}\right), \qquad (6)$$

and

$$T_n := \tau_n + \frac{\log C_T^{-1}}{(2p+1)\log\beta^{-1}}, \qquad (7)$$

where

$$C_\tau := \left(\frac{4\,(p-1)!\,C_\lambda}{\min_{j\in\mathcal{R}}\left|\partial_j^p f(w)\right|\,\left|\int_{\mathbb{R}} t^p K(t)\,dt\right|}\right)^2, \qquad C_T := \left(\frac{\min_{j\in\mathcal{R}}\left|\partial_j^p f(w)\right|}{24\,\max_{j\in\mathcal{R}}\left|\partial_j^p f(w)\right|}\right)^2.$$

Then we define the set of bandwidths H_hp by:

$$\mathcal{H}_{hp} := \left\{h\in\mathbb{R}_+^d : h_j = \frac{\beta^{\theta_j}}{\log n},\ \text{with } \theta_j\in\{\lfloor\tau_n\rfloor+1,\dots,\lfloor T_n\rfloor\}\ \text{if } j\in\mathcal{R},\ \text{else } \theta_j=0\right\}.$$

Theorem 2. Assume that f̃_X satisfies Conditions (i) and (ii) of Section 3.2 and that Assumptions 1 to 3 are satisfied. Then the bandwidth ĥ selected by CDRodeo belongs to H_hp with high probability. More precisely, for any q > 0 and for n large enough:

$$P\left(\hat h \in \mathcal{H}_{hp}\right) \ \ge\ 1 - n^{-q}. \qquad (8)$$

Moreover, with probability larger than 1 − 2n^{−q}, the CDRodeo estimator f̂_ĥ(w) verifies:

$$\left|\hat f_{\hat h}(w) - f(w)\right| \ \le\ C\,(\log n)^{\frac{p}{2p+r}(d-r+a)}\; n^{-\frac{p}{2p+r}}, \qquad (9)$$

with

$$C := 2^r\, C_\tau^{\frac{p}{2p+r}} \int_{t\in\mathbb{R}}\left|\frac{t^p}{p!}K(t)\right| dt \times \max_{k\in\mathcal{R}}\|\partial_k^p f\|_{\infty,U_n(w)} \ +\ 4\|K\|_2^d\, \|f\|_{\infty,U_n(w)}^{1/2}\,\delta^{-1/2}\, C_T^{-\frac{r}{2(2p+1)}}\, C_\tau^{-\frac{r}{2(2p+r)}}.$$

Corollary 3. Under the assumptions of Theorem 2, for any q ≥ 1:

$$E\left[\left|\hat f_{\hat h}(w)-f(w)\right|^q\right]^{1/q} \ \le\ C\,(\log n)^{\frac{p}{2p+r}(d-r+a)}\; n^{-\frac{p}{2p+r}} \ +\ o\!\left(n^{-1}\right).$$

Corollary 3 presents a generalization of the previous works on Rodeo [Lafferty and Wasserman 2008; Liu et al. 2007], whose results are restricted to the regularity p = 2 and to simpler problems, namely regression and density estimation.

We compare the convergence rate of CDRodeo with the optimal minimax rate. In particular, our benchmark is the pointwise minimax rate, which is of order O(n^{-p/(2p+d)}) for the problem of p-regular d-dimensional density estimation, obtained by [Donoho and Low 1992].

Without sparsity structure (r = d), CDRodeo achieves the optimal minimax rate, up to a logarithmic factor. The exponent of this factor depends on the parameter a. For the proofs, we need a > 1 in order to satisfy (5), but if an upper bound (or a pre-estimator) of ‖f‖_{∞,U_n(w)}/δ were known, we could obtain a similar result with a = 1 and a modified constant term. Note that the logarithmic factor is a small price to pay for a computationally tractable procedure for high-dimensional functional inference; see in particular Section 3.5 for the computational gain of our procedure.


Under sparsity assumptions, we avoid the curse of high dimensionality and our procedure achieves the desired rate n^{-p/(2p+r)} (up to a logarithmic term), which is optimal if the relevant components were known. Note that some additional logarithmic factors could be unavoidable due to the unknown sparsity structure, which needs to be estimated. Identifying the exact order of the logarithmic term in the optimal minimax rate for the sparse case remains an open and challenging question.

3.5 Complexity

We now discuss the complexity of CDRodeo, without taking into account the pre-computation cost of f̃_X at the points X_i, i = 1 : n (used for computing the Z_hj); a fast procedure for f̃_X is however required, to avoid losing the computational advantages of CDRodeo.

For CDRodeo, the main cost lies in the computation of the Z_hj's along the path of bandwidths. The condition ∏_{k=1}^d h_k ≥ (log n)/n restricts the procedure to at most log_{β^{-1}} n updates of the bandwidth across all components, leading to a worst-case complexity of order O(d·n·log n).

But as shown in Theorem 2, with high probability ĥ ∈ H_hp, in which only the relevant components remain active after the first iteration. In the first iteration, the computation of the Z_{h^{(0)}j}'s costs O(d·n) operations, while the product kernel enables us to compute the Z_hj's in the following iterations with only O(r·n) operations, which leads to the complexity O(d·n + r·n·log n).

In order to grasp the advantage of CDRodeo's greediness, we compare its complexity with an optimization over an exhaustive bandwidth grid with log_{β^{-1}} n values for each component of the bandwidth (as is often the case in other methods: cross-validation, Lepski's methods, ...): for each bandwidth of the (log_{β^{-1}} n)^d-sized grid, the computation of a statistic from the d·n-sized dataset needs at least O(d·n) operations, which leads to a complexity of order O(d·n·(log_{β^{-1}} n)^d). With the parameters used in the simulated example of Section 4 (n = 2·10^5, d = 5, r = 3, β = 0.95), the ratio of complexities is

$$\frac{d\,n\,(\log_{\beta^{-1}} n)^d}{r\,n\,\log_{\beta^{-1}} n} \approx 5\cdot 10^9,$$

and even without sparsity structure:

$$\frac{d\,n\,(\log_{\beta^{-1}} n)^d}{d\,n\,\log_{\beta^{-1}} n} \approx 3\cdot 10^9.$$

It means that the CDRodeo run is a billion times faster on this dataset.
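As a quick sanity check of these orders of magnitude (all logarithms here are in base β^{-1}, as in the grid complexity; the figures are approximate):

```python
import math

n, d, r, beta = 2e5, 5, 3, 0.95
L = math.log(n) / math.log(1 / beta)  # log_{beta^{-1}} n: ~238 bandwidth values per component
grid = d * n * L ** d                 # exhaustive grid: O(d n (log_{beta^{-1}} n)^d)
print(grid / (r * n * L))             # ~5.3e9: ratio with the sparse CDRodeo cost r n log n
print(grid / (d * n * L))             # ~3.2e9: ratio without sparsity structure
```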

4 Simulations

In this section, we test the practical performance of our method. In particular, we study CDRodeo on a 5-dimensional example, with the major purpose of assessing the numerical performance of our procedure. Let us describe the example. We set d_1 = 4 and d_2 = 1 and simulate an i.i.d. sample {(X_i, Y_i)}_{i=1}^n with the following distribution: for any i = 1, . . . , n,

- the first component X_i1 of X_i follows a uniform distribution on [−1,1];
- the other components X_ij, j = 2 : 4, are independent standard normal, and are independent of X_i1;
- Y_i is independent of X_i1, X_i3 and X_i4, and the conditional distribution of Y_i given X_i2 is exponential with survival parameter X_i2^2.

The estimated conditional density function is then defined by:

$$f(x, y) = \mathbf{1}_{[-1,1]}(x_1)\,\frac{1}{x_2^2}\,e^{-y/x_2^2}, \qquad y \ge 0.$$

This example enables us to test several criteria: sparsity detection, behaviour when functions are not continuous, bimodality estimation, and robustness when f_X takes small values.

In the following simulations, unless explicitly stated otherwise, CDRodeo is run with sample size n = 200,000, the product Gaussian kernel, initial bandwidth value h_0 = 0.4, bandwidth decreasing factor β = 0.95, parameter a = 1.1 and f̃_X ≡ f_X.
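The simulation design above can be reproduced in a few lines; a sketch (the seed and the function names are arbitrary, and the exponential "survival parameter" is read as the scale x_2^2):

```python
rng = np.random.default_rng(0)

def simulate(n):
    # Section 4 design: X1 ~ U[-1,1]; X2, X3, X4 i.i.d. N(0,1); Y | X2 ~ Exp(scale = X2^2).
    X1 = rng.uniform(-1.0, 1.0, size=n)
    X2, X3, X4 = rng.standard_normal((3, n))
    Y = rng.exponential(scale=X2 ** 2)   # so f(x, y) = 1_[-1,1](x1) x2^(-2) exp(-y / x2^2)
    return np.column_stack([X1, X2, X3, X4, Y])

W = simulate(200_000)   # n = 2e5, as in the experiments
```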

Figure 1 illustrates the CDRodeo bandwidth selection: the boxplots of each selected bandwidth component are built from 200 runs of CDRodeo at the point w = (0,1,0,0,1). This figure reflects the capacity of CDRodeo to capture the relevance degree of each component, and one could compare it with variable selection (as done in [Lafferty and Wasserman 2008]). The components x_3 and x_4 are irrelevant and, for this point of interest, the components x_2 and y are clearly relevant, while the component x_1 is barely relevant, as f is constant in the direction x_1 in a near neighbourhood of x_1 = 0. As expected, the irrelevant h_3 and h_4 are mostly deactivated at the first iteration, while the relevant h_2 and h_5 are systematically shrunk. The relevance degree of x_1 is also well detected, as the values of h_1 are smaller than h_0, but significantly larger than h_2 and h_5.

Figure 1: Boxplots of each component of 200 CDRodeo selected bandwidths at the point w = (0,1,0,0,1).

Figure 2 gives the CDRodeo estimation of f from one n-sample. The function f is well estimated. In particular, irrelevance, jumps and bimodality are features which are well detected by our method. As expected, the main estimation errors are made at points of discontinuity (in x_1 and y) or at the boundaries (in x_2, x_3 and x_4). Note that the values of f_X are particularly small at the boundaries of the plots as a function of x, leading to a lack of observations for the estimation. Note however that a null value of f_X does not deteriorate the estimation (cf. the top left plot), since the estimate of f vanishes automatically when there is no observation near the point of interest.

Figure 2: CDRodeo estimator (red or blue dashed lines) vs. the true density (black solid line) as a function of each component, the other components being fixed according to (x, y) = (0,1,0,0,1).

Running time. The simulations are implemented in R on a Macintosh laptop with a 3.1 GHz Intel Core i7 processor. For Figure 1, the 200 runs of CDRodeo take 2952.735 seconds (around 50 minutes), i.e. 14.8 seconds per run.

5 Proofs

We first give the outlines of the proofs in Section 5.1. To ease the reading, we have divided the proofs of the main results (Proposition 1, Theorem 2 and Corollary 3) into intermediate results, which are stated in Section 5.2 and proved in Section 5.4. The proofs of the main results themselves are in Section 5.3.

5.1 Outlines of the proofs

We first prove Proposition 1 by constructing an estimator of f_X with the desired properties. In this proof, we use some standard properties of kernel density estimators (control of the bias, concentration inequality), which are gathered in Lemma 4.

Theorem 2 states two results: the bandwidth selection (8) and the estimation error of the procedure (9). For the proof of the bandwidth selection (8), Proposition 8 makes explicit the highly probable behaviour of CDRodeo along a run, and thus the finally selected bandwidth. In particular, the proof rests on an analysis of Z_hj, which is done in two steps. We first consider Z̄_hj, a simpler version of Z_hj in which we substitute the estimator of f_X by f_X itself, and we detail its behaviour in Lemma 6. Then we control the difference Z_hj − Z̄_hj (see 1. of Lemma 7) to ensure that Z̄_hj behaves like Z_hj.

To control the estimation error of the procedure (9), we similarly analyse f̂_h(w) in two parts: in Lemma 5, we describe the behaviour of f̄_h(w), the simpler version of f̂_h(w) in which we substitute the estimator of f_X by f_X itself, and in 2. of Lemma 7, we bound the difference f̂_h(w) − f̄_h(w). Then the bandwidth selection (8) leads to the upper bound (9), with high probability, on the estimation error of f̂_ĥ(w).

Finally, we obtain the expected error of f̂_ĥ(w) stated in Corollary 3 by controlling the error on the residual event.

5.2 Intermediate results

For any bandwidth h_X ∈ R_+, we define the kernel density estimator f̃_X^K by: for any u ∈ R^{d_1},

$$\tilde f_X^K(u) := \frac{1}{n_X\, h_X^{d_1}} \sum_{i=1}^{n_X} \prod_{j=1}^{d_1} K_X\!\left(\frac{u_j-\widetilde X_{ij}}{h_X}\right), \qquad (10)$$

where K_X : R → R is a kernel which is compactly supported, of class C^1 and of order p_X ≥ d_1/(2(c−1)), where we recall that c > 1 is defined by n_X = n^c.

We also introduce the neighbourhood

$$U'_n(x) := \left\{u' = u - h_X z : u \in U_n(x),\ z \in (\mathrm{supp}(K_X))^{d_1}\right\}. \qquad (11)$$

Lemma 4 (f̃_X^K behaviour). We assume f_X is C^{p_0} on U'_n(x) with p_0 ≤ p_X. Then for any bandwidth h_X ∈ R_+:

1. If we denote $C_{biasX} := \frac{\|K_X\|_1^{d_1-1}\,\|\cdot^{p_0}K_X(\cdot)\|_1}{p_0!}\, d_1 \max_{k=1:d_1}\|\partial_k^{p_0} f_X\|_{\infty,U'_n(x)}$, then

$$\left\|E\left[\tilde f_X^K\right]-f_X\right\|_{\infty,U_n(x)} \ \le\ C_{biasX}\, h_X^{p_0}.$$

2. If the condition

$$\mathrm{Cond}_X(h_X):\quad h_X^{d_1} \ \ge\ \frac{4\|K_X\|_\infty^{2d_1}}{9\,\|K_X\|_2^{2d_1}\,\|f_X\|_{\infty,U'_n(x)}}\;\frac{(\log n)^{3/2}}{n_X}$$

is satisfied, then for

$$\lambda_X := \sqrt{\frac{4\|K_X\|_2^{2d_1}\,\|f_X\|_{\infty,U'_n(x)}}{h_X^{d_1}\, n_X}\,(\log n)^{3/2}}$$

and for any u ∈ U_n(x):

$$P\left(\left|\tilde f_X^K(u)-E\left[\tilde f_X^K(u)\right]\right| > \lambda_X\right) \ \le\ 2\exp\left(-(\log n)^{3/2}\right).$$

Lemma 5 (f̄_h(w) behaviour). For any bandwidth h ∈ (0, h_0]^d and any i = 1 : n, let us denote f̄_hi(w) := K_h(w − W_i)/f_X(X_i), where K_h(w − W_i) := ∏_{j=1}^d h_j^{-1} K((w_j − W_{ij})/h_j), and f̄_h(w) := (1/n) Σ_{i=1}^n f̄_hi(w). Then, if K is chosen as in Section 3.3, under Assumptions 1 to 3:

1. Let C_Ē := ‖f‖_{∞,U_n(w)} ‖K‖_1^d. Then

$$E\left[\bar f_{h1}(w)\right] \le E\left[\left|\bar f_{h1}(w)\right|\right] \le C_{\bar E}.$$

Besides, if we denote B_h := E[f̄_h(w)] − f(w) the bias of f̄_h(w), then:

$$|B_h| \le C_{bias} \sum_{k\in\mathcal{R}} h_k^p, \qquad \text{with } C_{bias} := \frac{2\left|\int_{\mathbb{R}} t^p K(t)\,dt\right|}{p!} \max_{k\in\mathcal{R}} \left|\partial_k^p f(w)\right|.$$

2. Let $\mathcal{B}_h := \{|\bar f_h(w) - E[\bar f_h(w)]| \le \sigma_h\}$, where

$$\sigma_h := C_\sigma \sqrt{\frac{(\log n)^a}{n \prod_{k=1}^d h_k}}, \qquad \text{with } C_\sigma := \frac{2\|K\|_2^d\, \|f\|_{\infty,U_n(w)}^{1/2}}{\delta^{1/2}}.$$

If the condition

$$\mathrm{Cond}(h):\quad \prod_{k=1}^{d} h_k \ \ge\ \frac{4\|K\|_2^{2d}}{C_\sigma^2}\,\frac{(\log n)^a}{n}$$

is satisfied, then:

$$P\left(\mathcal{B}_h^c\right) \le 2\,e^{-(\log n)^a}.$$

3. Let $\mathcal{B}_{|\bar f|,h} := \left\{\left|\frac{1}{n}\sum_{i=1}^n |\bar f_{hi}(w)| - E\left[|\bar f_{h1}(w)|\right]\right| \le C_{\bar E}\right\}$. Then

$$P\left(\mathcal{B}_{|\bar f|,h}^c\right) \le 2\,e^{-C_{\gamma|f|}\, n \prod_{k=1}^d h_k}, \qquad \text{with } C_{\gamma|f|} := \min\left(\frac{C_{\bar E}^2}{C_\sigma^2}\ ;\ \frac{3\,\delta\, C_{\bar E}}{4\|K\|_\infty^d}\right).$$

Lemma 6 (Z̄_hj behaviour). For any j ∈ {1, . . . , d} and any bandwidth h ∈ (0, h_0]^d, we define

$$\bar Z_{hij} := \frac{1}{f_X(X_i)}\,\frac{\partial}{\partial h_j} \prod_{k=1}^{d} h_k^{-1} K\!\left(\frac{w_k - W_{ik}}{h_k}\right), \qquad \bar Z_{hj} := \frac{1}{n} \sum_{i=1}^{n} \bar Z_{hij}.$$

If K is chosen as in Section 3.3, then under Assumptions 1 to 3:

1. For j ∉ R: E[Z̄_hj] = 0, whereas for j ∈ R and n large enough,

$$\tfrac{1}{2}\, C_{\bar EZ,j}\, h_j^{p-1} \ \le\ \left|E\left[\bar Z_{hj}\right]\right| \ \le\ \tfrac{3}{2}\, C_{\bar EZ,j}\, h_j^{p-1}, \qquad (12)$$

where $C_{\bar EZ,j} := \left|\frac{\int_{\mathbb{R}} t^p K(t)\,dt}{(p-1)!}\,\partial_j^p f(w)\right|$. Besides, let $C_{E|\bar Z|} := \|f\|_{\infty,U_n(w)}\,\|J\|_1\,\|K\|_1^{d-1}$. Then:

$$E\left[\,|\bar Z_{h1j}|\,\right] \ \le\ C_{E|\bar Z|}\, h_j^{-1}. \qquad (13)$$

2. Let $\mathcal{B}_{\bar Z,hj} := \{|\bar Z_{hj} - E[\bar Z_{hj}]| \le \tfrac{1}{2}\lambda_{hj}\}$. Under Assumptions 1 to 3, if the bandwidth satisfies

$$\mathrm{Cond}_{\bar Z}(h):\quad \prod_{k=1}^{d} h_k \ \ge\ \mathrm{cond}_{\bar Z}\,\frac{(\log n)^a}{n}, \qquad \text{with } \mathrm{cond}_{\bar Z} := \frac{4\,\|J\|_\infty^2\,\|K\|_\infty^{2(d-1)}}{9\,\|f\|_{\infty,U_n(w)}\,\|J\|_2^2\,\|K\|_2^{2(d-1)}},$$

then: $P(\mathcal{B}_{\bar Z,hj}^c) \le 2e^{-\gamma_{Z,n}}$, with $\gamma_{Z,n} := \frac{\delta}{\|f\|_{\infty,U_n(w)}}(\log n)^a$.

3. Let $\mathcal{B}_{|\bar Z|,hj} := \left\{\left|\frac{1}{n}\sum_{i=1}^{n} |\bar Z_{hij}| - E\left[|\bar Z_{h1j}|\right]\right| \le C_{E|\bar Z|}\, h_j^{-1}\right\}$. Then, under Assumptions 1 to 3:

$$P\left(\mathcal{B}_{|\bar Z|,hj}^c\right) \le 2\,e^{-C_{\gamma|\bar Z|}\, n\prod_{k=1}^d h_k}, \qquad \text{with } C_{\gamma|\bar Z|} := \min\left(\frac{\delta\, C_{E|\bar Z|}^2}{4\,\|f\|_{\infty,U_n(w)}\,\|J\|_2^2\,\|K\|_2^{2(d-1)}}\ ;\ \frac{3\,\delta\, C_{E|\bar Z|}}{4\,\|K\|_\infty^{d-1}\,\|J\|_\infty}\right).$$

Lemma 7. For any h ∈ (0, h_0]^d and any component j = 1 : d, we denote ∆_{Z,hj} := Z_hj − Z̄_hj and ∆_h := f̂_h(w) − f̄_h(w). Under Assumptions 1 to 3, if the conditions on f̃_X of Section 3.2 are satisfied, then:

1. for $C_{M\Delta Z} := \frac{2\,C_{E|\bar Z|}\,M_X}{C_\lambda}$: $\quad \mathbf{1}_{\mathcal{B}_{|\bar Z|,hj}\cap \tilde A_n}\,|\Delta_{Z,hj}| \ \le\ \frac{C_{M\Delta Z}}{(\log n)^{a/2}}\,\lambda_{hj}$;

2. for $C_{M\Delta} := \frac{2\,C_{\bar E}\,M_X}{C_\sigma}$: $\quad \mathbf{1}_{\tilde A_n\cap \mathcal{B}_{|\bar f|,h}}\,|\Delta_h| \ \le\ \frac{C_{M\Delta}}{(\log n)^{a/2}}\,\sigma_h$.

We introduce the notation h^{(t)}, t ∈ N, for the state of the bandwidth at iteration t when ĥ = h. In particular, for a fixed t ∈ {0, . . . , ⌊τ_n⌋}, h^{(t)} is identical for any h ∈ H_hp. Then we consider the event:

$$\mathcal{E}_Z := \tilde A_n \cap \bigcap_{j\notin\mathcal{R}}\left\{\mathcal{B}_{\bar Z,h^{(0)}j}\cap\mathcal{B}_{|\bar Z|,h^{(0)}j}\right\} \cap \bigcap_{j\in\mathcal{R}}\left[\,\bigcap_{h\in\mathcal{H}_{hp}}\left\{\mathcal{B}_{\bar Z,hj}\cap\mathcal{B}_{|\bar Z|,hj}\right\} \cap \bigcap_{t=0}^{\lfloor\tau_n\rfloor}\left\{\mathcal{B}_{\bar Z,h^{(t)}j}\cap\mathcal{B}_{|\bar Z|,h^{(t)}j}\right\}\right],$$

where we denote by

$$\tilde A_n := \left\{\ \sup_{u\in U_n(x)}\left|\frac{f_X(u)-\tilde f_X(u)}{\tilde f_X(u)}\right| \ \le\ M_X\,\frac{(\log n)^{d/2}}{n^{1/2}}\ \right\}$$

the event of the concentration condition (ii) of Section 3.2.

Proposition 8 (CDRodeo behaviour). Under Assumptions 1 to 3, on E_Z we have ĥ ∈ H_hp. In other words, when E_Z happens:

1. the non-relevant components are deactivated during iteration 0;

2. at the end of iteration ⌊τ_n⌋, the active components are exactly the relevant ones;

3. CDRodeo stops at the latest at iteration ⌊T_n⌋.

Moreover, for any q > 0: P(E_Z^c) = o(n^{−q}).

The following lemma gives a technical result to canonically obtain an upper bound on the bias of a kernel estimator. Let us denote by · the term-by-term multiplication of two vectors.

Lemma 9. Let u ∈ R^{d_0} and a bandwidth h ∈ (R_+)^{d_0}. For j = 1 : d_0, let K_j : R → R be a continuous function with compact support and with at least p−1 zero moments, i.e.: for l = 1 : (p−1),

$$\int_{\mathbb{R}} K_j(t)\,t^l\,dt = 0.$$

Let f_0 : R^{d_0} → R be a function of class C^p on U_h(u) := {u' ∈ R^{d_0} : ∀ j = 1 : d_0, u'_j = u_j − h_j z_j, with z_j ∈ supp(K_j)}. Then:

$$\int_{\mathbb{R}^{d_0}} \prod_{j=1}^{d_0} h_j^{-1} K_j\!\left(\frac{u_j-u'_j}{h_j}\right) f_0(u')\,du' \ -\ f_0(u)\int_{\mathbb{R}^{d_0}} \prod_{j=1}^{d_0} K_j(z_j)\,dz \ =\ \sum_{k=1}^{d_0}\left(I_k + II_k\right), \qquad (14)$$

where

$$I_k := \int_{z\in\mathbb{R}^{d_0}} \left(\prod_{j=1}^{d_0} K_j(z_j)\right)\rho_k\,dz,$$

with the notations

$$\rho_k := \rho_k(z,h,u) = (-h_k z_k)^p \int_{0\le t_p\le\cdots\le t_1\le 1} \left[\partial_k^p f_0\!\left(z^{k-1}-t_p h_k z_k e_k\right)-\partial_k^p f_0\!\left(z^{k-1}\right)\right] dt_{1:p},$$

and $z^{k-1} := u - \sum_{j=1}^{k-1} h_j z_j e_j$ (where $\{e_j\}_{j=1}^{d_0}$ is the canonical basis of $\mathbb{R}^{d_0}$), and

$$II_k := (-h_k)^p \int_{t\in\mathbb{R}} \frac{t^p}{p!}K_k(t)\,dt \int_{z_{-k}\in\mathbb{R}^{d_0-1}} \partial_k^p f_0\!\left(z^{k-1}\right)\left(\prod_{j\ne k} K_j(z_j)\right) dz_{-k}.$$

Finally, we recall (without proof) the classical Bernstein inequality and Taylor's theorem with integral remainder.

Lemma 10 (Bernstein's inequality). Let U_1, . . . , U_n be independent random variables, almost surely uniformly bounded by a positive constant c > 0 and such that, for i = 1, . . . , n, E[U_i^2] ≤ v. Then for any λ > 0:

$$P\left(\left|\frac{1}{n}\sum_{i=1}^{n} U_i - E[U_i]\right| \ge \lambda\right) \ \le\ 2\exp\left(-\min\left(\frac{n\lambda^2}{4v},\ \frac{3n\lambda}{4c}\right)\right).$$

Note that this version is a simple consequence of Birgé and Massart (p. 366 of [Birgé and Massart 1998]).

Lemma 11(Taylor’s theorem). Letg: [0,1]→Rbe a function of classCq. Then we have:

g(1)−g(0) =

q

X

l=1

g(l)(0) l! +

Z 1 t1=0

Z t1

t2=0

· · · Z tq−1

tq=0

(g(q)(tq)−g(q)(0))dtqdtq−1. . . dt1

5.3 Proofs of Proposition 1, Theorem 2 and Corollary 3

5.3.1 Proof of Proposition 1

We construct f̃_X in two steps: we first construct an estimator f̃_X^K which satisfies

$$P\left(\|f_X-\tilde f_X^K\|_{\infty,U_n(x)} > M_X\,\frac{(\log n)^{3/4}}{n^{1/2}}\right) \ \le\ \exp\left(-(\log n)^{5/4}\right), \qquad (15)$$

then we show that, if we set f̃_X ≡ f̃_X^K ∨ (log n)^{−1/4}, then f̃_X satisfies Conditions (i) and (ii) for n large enough.

We take f̃_X^K as the kernel density estimator defined in (10), with a kernel K_X : R → R that is compactly supported, of class C^1, of order p_X ≥ d_1/(2(c−1)), and with a bandwidth h_X ∈ R_+ specified later. Let us control the bias ‖E[f̃_X^K] − f_X‖_{∞,U_n(x)}. We define p_0^X := min(p_0, p_X + 1). In particular, f_X is of class C^{p_0^X} and K_X has p_0^X − 1 zero moments. Therefore we can apply Lemma 4:

$$\left\|E\left[\tilde f_X^K\right]-f_X\right\|_{\infty,U_n(x)} \ \le\ C'_{biasX}\,h_X^{p_0^X},$$
