• Aucun résultat trouvé

A Kernel-based Consensual Aggregation for Regression

N/A
N/A
Protected

Academic year: 2021

Partager "A Kernel-based Consensual Aggregation for Regression"

Copied!
42
0
0

Texte intégral

(1)

HAL Id: hal-02884333

https://hal.archives-ouvertes.fr/hal-02884333v5

Preprint submitted on 27 Apr 2021

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de

A Kernel-based Consensual Aggregation for Regression

Sothea Has

To cite this version:

(2)

A Kernel-based Consensual Aggregation

for Regression

Sothea Has

LPSM, Sorbonne Université Pierre et Marie Curie (Paris 6) 75005 Paris, France

sothea.has@lpsm.paris

Abstract

In this article, we introduce a kernel-based consensual aggregation method for regression problems. We aim to exibly combine individ-ual regression estimators r1, r2, ..., rM using a weighted average where

the weights are dened based on some kernel function to build a target prediction. This work extends the context of Biau et al. (2016) to a more general kernel-based framework. We show that this more gen-eral conguration also inherits the consistency of the basic consistent estimators. Moreover, an optimization method based on gradient de-scent algorithm is proposed to eciently and rapidly estimate the key parameter of the strategy. The numerical experiments carried out on several simulated and real datasets are also provided to illustrate the speed-up of gradient descent algorithm in estimating the key param-eter and the improvement of overall performance of the method with the introduction of smoother kernel functions.

Keywords: Consensual aggregation, kernel, regression.

2010 Mathematics Subject Classication: 62G08, 62J99, 62P30

1 Introduction

Aggregation methods, given the high diversity of available estimation strate-gies, are now of great interest in constructing predictive models. To this goal, several aggregation methods consisting of building a linear or convex combination of a bunch of initial estimators have been introduced, for in-stance inCatoni(2004),Juditsky and Nemirovski(2000),Nemirovski(2000),

Yang (2000, 2001, 2004), Györ et al. (2002), Wegkamp (2003), Audibert

(3)

Another approach of model selection, which aims at selecting the best es-timator among the candidate eses-timators, has also been proposed (see, for example, Massart (2007)).

Apart from the usual linear combination and model selection methods, a dierent technique has been introduced in classication problems by Mojir-sheibani(1999). In his paper, the combination is the majority vote among all the points for which their predicted classes, given by all the basic classiers, coincide with the predicted classes of the query point. Roughly speaking, in-stead of predicting a new point based on the structure of the original input, we look at the topology dened by the predictions of the candidate estima-tors. Each estimator was constructed dierently so may be able to capture dierent features of the input data and useful in dening closeness. Con-sequently, two points having similar predictions or classes seem reasonably having similar actual response values or belonging to the same actual class.

Later, Mojirsheibani (2000) and Mojirsheibani and Kong (2016) intro-duced exponential and general kernel-based versions of the primal idea to improve the smoothness in selecting and weighting individual data points in the combination. In this context, the kernel function transforms the level of disagreements between the predicted classes of a training point xi and the

query point x into a contributed weight given to the corresponding point in the vote. Besides, Biau et al. (2016) congured the original idea of Mojir-sheibani(1999) as a regression framework where a training point xi is close

to the query point x if each of their predictions given by all the basic regres-sion estimators is close. Each of the close neighbors of x will be given a uniformly 0-1 weight contributing to the combination. It was shown theoret-ically in these former papers that the combinations inherit the consistency property of consistent basic estimators.

Recently from a practical point of view, a kernel-based version of Biau et al. (2016) called KernelCobra has been implemented in pycobra python library (seeGuedj and Srinivasa Desikan (2018)). Moreover, it has also been applied in ltering to improve the image denoising (see Guedj and Rengot

(2020)). In a complementary manner to the earlier works, we present another kernel-based consensual regression aggregation method in this paper, as well as its theoretical and numerical performances. More precisely, we show that the consistency inheritance property shown inBiau et al.(2016) also holds for this kernel-based conguration for a broad class of regular kernels. Moreover, an evidence of numerical simulation carried out on a similar set of simulated models, and some real datasets shows that the present method outperforms

(4)

the classical one.

This paper is organized as follows. Section 2 introduces some notation, the denition of the proposed method, and presents the theoretical results, namely consistency and convergence rate of the variance term of the method for a subclass of regular kernel functions. A method based on gradient descent algorithm to estimate the bandwidth parameter is described in Section 3. Section 4 illustrates the performances of the proposed method through several numerical examples of simulated and real datasets. Lastly, Section 6 collects all the proofs of the theoretical results given in Section 2.

2 The kernel-based combining regression

2.1 Notation

We consider a training sample Dn= {(Xi, Yi)ni=1}where (Xi, Yi), i = 1, 2, ..., n,

are iid copies of the generic couple (X, Y ). We assume that (X, Y ) is an Rd × R-valued random variable with a suitable integrability which will be specied later.

We randomly split the training data Dn into two parts of size ` and

k such that ` + k = n, which are denoted by D` = {(X (`) i , Y (`) i )`i=1} and Dk = {(X (k) i , Y (k)

i )ki=1} respectively (a common choice is k = dn/2e = n − `).

The M basic regression estimators or machines rk,1, rk,2, ..., rk,M are

con-structed using only the data points in Dk. These basic machines can be

any regression estimators such as linear regression, kNN, kernel smoother, SVR, lasso, ridge, neural networks, naive Bayes, bagging, gradient boost-ing, random forests, etc. They could be parametric, nonparametric or semi-parametric with their possible tuning parameters. For the combination, we only need the predictions given by all these basic machines of the remaining part D` and the query point x.

In the sequel, for any x ∈ Rd, the following notation is used:

ˆ rk(x) = (rk,1(x), rk,2(x), ..., rk,M(x)): the vector of predictions of x.

ˆ kxk = kxk2 =

q Pd

i=1x2i: Euclidean norm on Rd.

ˆ kxk1 =Pdi=1|xi|: `1 norm on Rd.

ˆ g∗

(5)

ˆ g∗(r

k(x)) = E[Y |rk(x)]: the conditional expectation of the response

variable given all the predictions. This can be proven to be the optimal estimator in regression over the set of predictions rk(X).

The consensual regression aggregation is the weighted average dened by gn(rk(x)) = ` X i=1 Wn,i(x)Y (`) i . (1)

Recall that given all the basic machines rk,1, rk,2, ..., rk,M, the aggregation

method proposed by Biau et al. (2016) corresponds to the following naive weights: Wn,i(x) = QM m=11{|rk,m(Xi)−rk,m(x)|<h} P` j=1 QM m=11{|rk,m(Xj)−rk,m(x)|<h} , i = 1, 2, ..., `. (2) Moreover, the condition of closeness for all predictions, can be relaxed to some predictions, which corresponds to the following weights:

Wn,i(x) = 1{PM m=11{|rk,m(Xi)−rk,m(x)|<h}≥αM } P` j=11{PM m=11{|rk,m(Xj )−rk,m(x)|<h}≥αM } , i = 1, 2, ..., ` (3) where α ∈ {1/M, 2/M, ..., 1} is the proportion of consensual predictions re-quired and h > 0 is the bandwidth or window parameter to be determined. Constructing the proposed method is equivalent to searching for the best pos-sible value of these parameters over a given grid, minimizing some quadratic error which will be described in Section 3.

In the present paper, K : RM

→ R+ denotes a regular kernel which is a

decreasing function satisfying: ∃b, κ0, ρ > 0 such that

(

b1BM(0,ρ)(z) ≤ K(z) ≤ 1, ∀z ∈ R

M

R

RM supu∈BM(z,ρ)K(u)dz = κ0 < +∞

(4) where BM(c, r) = {z ∈ RM : kc − zk < r} denotes the open ball of center

c ∈ RM and radius r > 0 of RM. We propose in (1) a method associated to the weights dened at any query point x ∈ Rd by

Wn,i(x) = Kh(rk(X (`) i ) −rk(x)) P` j=1Kh(rk(X (`) j ) −rk(x)) , i = 1, 2, ..., ` (5)

(6)

where Kh(z) = K(z/h) for some bandwidth parameter h > 0 with the

con-vention of 0/0 = 0. Observe that the combination is based only on D` but

the whole construction of the method depends on the whole training data Dn

as the basic machines are all constructed using Dk. In our setting, we treat

the vector of predictions rk(x) as an M-dimensional feature, and the kernel

function is applied on the whole vector at once. Note that the implementa-tion of KernelCobra in Guedj and Srinivasa Desikan (2020) corresponds to the following weights:

Wn,i(x) = PM m=1Kh(rk,m(X (`) i ) − rk,m(x)) P` j=1 PM m=1Kh(rk,m(X (`) j ) − rk,m(x)) , i = 1, 2, ..., ` (6) where the univariate kernel function K is applied on each component of rk(x)

separately.

2.2 Theoretical performance

The performance of the combining estimation gnis measured using the quadratic

risk dened by

E h

|gn(rk(X)) − g∗(X)|2

i

where the expectation is taken with respect to both X and the training sample Dn. Firstly, we begin with a simple decomposition of the distortion

between the proposed method and the optimal regression estimator g∗(X)

by introducing the optimal regression estimator over the set of predictions g∗(rk(X)). The following proposition shows that the nonasymptotic-type

control of the distortion, presented in Proposition.2.1 of Biau et al. (2016), also holds for this case of regular kernels.

Proposition 1 Let rk = (rk,1, rk,2, ..., rk,M) be the collection of all basic

es-timators and gn(rk(x)) be the combined estimator dened in (1) with the

weights given in (5) computed at point x ∈ Rd. Then, for all distributions of

(X, Y ) with E[|Y |2] < +∞, E h |gn(rk(X)) − g∗(X)|2 i ≤ inf f ∈GE h |f (rk(X)) − g∗(X)|2 i + Eh|gn(rk(X)) − g∗(rk(X))|2 i

(7)

where G is the class of any function f : RM → R satisfying E[f(r k(X))|2] < +∞. In particular, E h |gn(rk(X)) − g∗(X)|2 i ≤ min 1≤m≤ME h |rk,m(X) − g∗(X)|2 i + Eh|gn(rk(X)) − g∗(rk(X))|2 i .

The two terms of the last bound can be viewed as a bias-variance decom-position where the rst term min1≤m≤ME[|rk,m(X) − g∗(X)|2] can be seen

as the bias and E[|gn(rk(X)) − g∗(rk(X))|2] is the variance-type term (Biau

et al. (2016)). Given all the machines, the rst term cannot be controlled as it depends on the performance of the best constructed machine, and it will be there as the asymptotic control of the performance of the proposed method. Our main task is to deal with the second term, which can be proven to be asymptotically negligible in the following key proposition.

Proposition 2 Assume that rk,m is bounded for all m = 1, 2, .., M. Let

h → 0 and ` → +∞ such that hM` → +∞. Then E

h

|gn(rk(X)) − g∗(rk(X))|2

i

→ 0 as ` → +∞ for all distribution of (X, Y ) with E[|Y |2] < +∞. Thus,

lim sup `→+∞ E h |gn(rk(X)) − g∗(X)|2 i ≤ inf f ∈GE h |f (rk(X)) − g∗(X)|2 i . And in particular, lim sup `→+∞ E h |gn(rk(X)) − g∗(X)|2 i ≤ min 1≤m≤ME h |rk,m(X) − g∗(X)|2 i .

Proposition 2 above is an analogous setup of Proposition 2.2 in Biau et al.

(2016). To prove this result, we follow the procedure of Stone's theorem (see, for example, Stone (1977) and Chapter 4 of Györ et al. (2002)) of weak universal consistency of non-parametric regression. However, showing this result for the class of regular kernels is not straightforward. Most of the previous studies provided such a result of L2-consistency only for the class

of compactly supported kernels (see, for example, Chapter 5 of Györ et al.

(2002)). In this study, we can derive the result for this broader class thanks to the boundedness of all basic machines. However, the price to pay for the

(8)

universality for this class of regular kernels is the lack of convergence rate. To this goal, a weak smoothness assumption of g∗ with respect to the basic

machines is required. For example, the convergence rate of the variance-type term inBiau et al.(2016) is of order O(`−2/(M +2))under the same smoothness

assumption, and this result also holds for all the compactly support kernels. Our goal is not to theoretically do better than the classical method but to investigate such a similar result in a broader class of kernel functions. For those kernels which the tails decrease fast enough, the convergence rate of the variance-type term can be attained as described in the following main theorem of this paper.

Theorem 1 Assume that the response variable Y and all the basic machines rk,m, m = 1, 2, ..., M, are bounded by some constant R. Suppose that there

exists a constant L ≥ 0 such that, for every k ≥ 1,

|g∗(rk(x)) − g∗(rk(y))| ≤ Lkrk(x) −rk(y)k, ∀x, y ∈ Rd.

We assume moreover that ∃RK, Ck> 0 : K(z)kzk2 ≤

CK

1 + kzkM, ∀z ∈ R

M such that kzk ≥ R

K. (7)

Then, with the choice of h ∝ `− M +2

M 2+2M +4, one has

E[|gn(rk(X)) − g∗(X)|2] ≤ min

1≤m≤ME[|rk,m(X) − g

(X)|2] + C`−M 2+2M +44 (8)

for some positive constant C = C(b, L, R, RK, CK) independent of `.

Moreover, if there exists a consistent estimator named rk,m0 among {rk,m}

M m=1 i.e., E[|rk,m0(X) − g ∗ (X)|2] → 0 as k → +∞,

then the combing estimator gn is also consistent for all distribution of (X, Y )

in some class M. Consequently, under the assumption of Theorem 1, one has

lim

k,`→+∞E[|gn(rk(X)) − g

(X)|2] = 0.

Remark 1 The assumption on the upper bound of the kernel K in the theo-rem above is very weak, chosen so that the result holds for a large subclass of regular kernels. However, the convergence rate is indeed slow for this subclass

(9)

of kernel functions. If we strengthen this condition, we can obtain a much nicer result. For instance, if we assume that the tails decrease at least of exponential speed i.e.,

∃RK, CK > 0 and α ∈ (0, 1) : K(z) ≤ CKe−kzk

α

, ∀z ∈ RM, kzk ≥ RK,

by following the same procedure as in the proof of the above theorem (Sec-tion 6), one can easily check that the convergence rate of the variance-type term is of order O(`−2α/(M +2α)

). This rate approaches the state of the art of the classical method by Biau et al. (2016) when α approaches 1.

3 Bandwidth parameter estimation thanks to

gradient descent

In earlier works byBiau et al.(2016) andGuedj and Srinivasa Desikan(2020), the training data Dn is practically broken down into three parts Dk where

all the candidate machines {rk,m}Mm=1 are built, and two other parts D`1 and

D`2. D`1 is used for the combination dened in equation (1), and D`2 is the

validation set used to learn the bandwidth parameter h of equation (2) and the proportion α of equation (3) by minimizing the average quadratic error evaluated on D`2 dened as follows,

ϕM(h) = 1 |D`2| X (Xj,Yj)∈D`2 [gn(rk(Xj)) − Yj]2 (9)

where |D`2|denotes the cardinality of D`2, gn(rk(Xj)) =

P

(Xi,Yi)∈D`1Wn,i(Xj)Yi

dened in equation (1), and the weight Wn,i(Xj)is given in equation (2) and

(6) forBiau et al.(2016) andGuedj and Srinivasa Desikan(2020) respectively. Note that the subscript M of ϕM(h)indicates the full consensus between the

M components of the predictions rk(Xi) and rk(Xj) for any Xi and Xj of

D`1 and D`2 respectively. In this case, constructing a combining estimation

gn is equivalent to searching for an optimal parameter h∗ over a given grid

H = {hmin, ..., hmax} i.e.,

h∗ = argmin

h∈H

ϕM(h).

The parameter α of equation (3) can be tuned easily by considering ϕαM(h)

(10)

among the M components of the predictions. In this case, the optimal pa-rameters α∗ and hare chosen to be the minimizer of ϕ

αM(h)i.e.,

(α∗, h∗) = argmin

(α,h)∈{1/2,1/3,...,1}×H

ϕαM(h).

Note that in both papers, the grid search algorithm is used in searching for the optimal bandwidth parameter.

In this paper, the training data is broken down into only two parts, Dk

and D`. Again, we construct the basic machines using Dk, and we propose the

following κ-fold cross-validation error which is a function of the bandwidth parameter h > 0 dened by ϕκ(h) = 1 κ κ X p=1 X (Xj,Yj)∈Fp [gn(rk(Xj)) − Yj]2 (10)

where in this case, gn(rk(Xj)) =

P

(Xi,Yi)∈D`\FpWn,i(Xj)Yi, is computed using

the remaining κ − 1 folds of D` leaving Fp ⊂ D` as the corresponding

vali-dation fold. We often observe the convex-like curves of the cross-valivali-dation quadratic error on many simulations; and from this observation, we propose to use a gradient descent algorithm to estimate the optimal bandwidth pa-rameter. The associated gradient descent algorithm used to estimate the optimal parameter h∗ is implemented as follows:

Algorithm 1 : Gradient descent for estimating h∗:

1. Initialization: h0, a learning rate λ > 0, threshold δ > 0 and the

maximum number of iteration N. 2. For k = 1, 2, ..., N, while d dhϕ κ(h k−1) > δ do: hk← hk−1− λ d dhϕ κ(h k−1)

3. return hkviolating the while condition or hN to be the estimation

(11)

From equation (10), for any (Xj, Yj) ∈ Fp, one has d dhϕ κ(h) = 1 κ κ X p=1 X (Xj,Yj)∈Fp 2 ∂ ∂hgn(rk(Xj))(gn(rk(Xj)) − Yj) where gn(rk(Xj)) = P (Xi,Yi)D`∈\FpYiKh(rk(Xj) −rk(Xi)) P (Xq,Yq)∈D`\FpKh(rk(Xj) −rk(Xq)) ⇒ ∂ ∂hgn(rk(Xj)) = X (Xi,Yi),(Xq,Yq)∈D`\Fp (Yi− Yq)× ∂ ∂hKh(rk(Xj) −rk(Xi))Kh(rk(Xj) −rk(Xq)) h P (Xi,Yi)D`∈\FpKh(rk(Xj) −rk(Xi)) i2 .

The dierentiability of gn depends entirely on the kernel function K. Therefore,

for suitable kernels, the implementation of the algorithm is straightforward. For example, in the case of Gaussian kernel Kh(x) = exp(−hkxk2/(2σ2)) for some

σ > 0, one has ∂ ∂hgn(rk(Xj)) = X (Xi,Yi),(Xq,Yq)∈D`\Fp (Yq− Yi)krk(Xj) −rk(Xi)k2× exp− h(krk(Xj) −rk(Xi)k2+ krk(Xj) −rk(Xq)k2)/(2σ2)  2σ2P (Xq,Yq) /∈Fpexp(−hkrk(Xj) −rk(Xq)k 2/(2σ2))2 .

In our numerical experiment, the numerical gradient of (10) can be computed eciently and rapidly thanks to grad function contained in pracma library of R program (see Borchers(2019)). We observe that the algorithm works much faster, and more importantly it does not require the information of the interval containing the optimal parameter as the grid search does. Most of the time, the parameter h vanishing the numerical gradient of the objective function can be attained, lead-ing to a good construction of the correspondlead-ing combinlead-ing estimation method, as reported in the next section.

4 Numerical examples

This section is devoted to numerical experiments to illustrate the performance of our proposed method. It is shown in Biau et al.(2016) that the classical method

(12)

mostly outperforms the basic machines of the combination. In this experiment, we compare the performances of the proposed methods with the classical one and all the basic machines. Several options of kernel functions are considered. Most kernels are compactly supported on [−1, 1], taking nonzero values only on [−1, 1], except for the case of compactly supported Gaussian which is supported on [−ρ1, ρ1], for

some ρ1 > 0. Moreover to implement the gradient descent algorithm in estimating

the bandwidth parameter, we also present the results of non-compactly supported cases such as classical Gaussian and 4-exponential kernels. All kernels considered in this paper are listed in Table 1, and some of them are displayed (univariate case) in Figure 1 below. Kernel Formula Naive1 K(x) =Qd i=11{|xi|≤1} Epanechnikov K(x) = (1 − kxk2)1{kxk≤1} Bi-weight K(x) = (1 − kxk2)21{kxk≤1} Tri-weight K(x) = (1 − kxk2)31{kxk≤1}

Compact-support Gaussian K(x) = exp{−kxk2/(2σ2)}1

{kxk≤ρ1}, σ, ρ1 > 0

Gaussian K(x) = exp{−kxk2/(2σ2)}, σ > 0 4-exponential K(x) = exp{−kxk4/(2σ4)}, σ > 0

Table 1: Kernel functions used.

−3 −2 −1 0 1 2 3 0 0.2 0.4 0.6 0.8 1 x K (x ) Naive Epanechnikov Bi-weight Tri-weight Gaussian 4-exponential

Figure 1: The shapes of some kernels.

(13)

4.1 Simulated datasets

In this subsection, we study the performances of our proposed method on the same set of simulated datasets of size n as provided inBiau et al.(2016). The input data is either independent and uniformly distributed over (−1, 1)d (uncorrelated case)

or distributed from a Gaussian distribution N (0, Σ) where the covariance matrix Σ is dened by Σij = 2−|i−j| for 1 ≤ i, j ≤ d (correlated case). We consider the

following models:

Model 1 : n = 800, d = 50, Y = X2

1 + exp(−X22).

Model 2 : n = 600, d = 100, Y = X1X2+ X32− X4X7+ X8X10− X62+ N (0, 0.5).

Model 3 : n = 600, d = 100, Y = − sin(2X1) + X22+ X3− exp(−X4) + N (0, 0.5).

Model 4 : n = 600, d = 100, Y = X1+ (2X2− 1)2+ sin(2πX3)/(2 − sin(2πX3)) +

sin(2πX4) + 2 cos(2πX4) + 3 sin2(2πX4) + 4 cos2(2πX4) + N (0, 0.5).

Model 5 : n = 700, d = 20, Y = 1{X1>0} + X 3 2 +1{X4+X6−X8−X9>1+X14} + exp(−X22) + N (0, 0.05). Model 6 : n = 500, d = 30, Y = P10 k=11{Xk<0}−1{N (0,1)>1.25}. Model 7 : n = 600, d = 300, Y = X2 1+ X22X3exp(−|X4|) + X6− X8+ N (0, 0.5). Model 8 : n = 600, d = 50, Y = 1{X1+X43+X9+sin(X12X18)+N (0,0.01)>0.38}.

Moreover, it is interesting to consider some high-dimensional cases as many real problems such as image and signal processing involve these kinds of datasets. Therefore, we also consider the following two high-dimensional models, where the last one is not from Biau et al.(2016) but a made-up one.

Model 9 : n = 500, d = 1000, Y = X1+ 3X32− 2 exp(−X5) + X6.

Model 10 : n = 500, d = 1500, Y = exp(X1) + exp(−X1) +Pdj=2[cos(X j

j)) −

2 sin(Xjj) − exp(−|Xj|)].

For each model, the proposed method is implemented over 100 replications. We randomly split 80% of each simulated dataset into two equal parts, D`and Dkwhere

` = d0.8 × n/2e − k, and the remaining 20% will be treated as the corresponding testing data. We measure the performance of any regression method f using mean square error (MSE) evaluated on the 20%-testing data dened by

(14)

MSE(f) = 1 ntest ntest X i=1 (yitest− f (xtesti ))2. (11) Table 2 and 3 below contain the average MSEs and the corresponding standard errors (into brackets) over 100 runs of uncorrelated and correlated cases respectively. In each table, the rst block contains ve columns corresponding to the following ve basic machines rk= (rk,m)5m=1:

ˆ Rid: Ridge regression (R package glmnet, seeFriedman et al.(2010)). ˆ Las: Lasso regression (R package glmnet).

ˆ kNN: k-nearest neighbors regression (R package FNN, seeLi (2019)).

ˆ Tr: Regression tree (R package tree, see Ripley(2019)).

ˆ RF: Random Forest regression (R package randomForest, see Liaw and Wiener (2002)).

We choose k = 5 for k-NN and ntree = 300 for random forest algorithm, and other methods are implemented using the default parameters. The best performance of each method in this block is given in boldface. The second block contains the last seven columns corresponding to the kernel functions used in the combining method where COBRA2, Epan, Bi-wgt, Tri-wgt, C-Gaus, Gauss and Exp4

respec-tively stand for classical COBRA, Epanechnikov, Bi-weight, Tri-weight, Compact-support Gaussian, Gaussian and 4-exponential kernels as listed in Table 1. In this block, the smallest MSE of each case is again written in boldface. For all the compactly supported kernels, we consider 500 values of h in a uniform grid {10−100, ..., hmax} where hmax = 10, which is chosen to be large enough, likely

to contain the optimal parameter to be searched. For the compactly supported Gaussian, we set ρ1 = 3and σ = 1 therefore its support is [−3, 3]. Lastly, for the

two non-compactly supported kernels, Gaussian and 4-exponential, the optimal pa-rameters are estimated using gradient descent algorithm described in the previous section. Note that the results in the rst block are not necessarily exactly the same as the ones reported in Biau et al.(2016) due to the choices of the parameters of the basic machines.

We can easily compare the performances of the combining estimation methods with all the basic machines and among themselves as the results reported in the second block are the straight combinations of those in the rst block. In each table, 2We use the relaxed version ofBiau et al.(2016) with the weights given in equation (3).

(15)

we are interested in comparing the smallest average MSE in the rst block to all the columns in the second block. First of all, we can see that all columns of the second block always outperform the best machine of the rst block, which illustrates the theoretical result of the combining estimation methods. Secondly, the kernel-based methods beat the rst column (classical COBRA) of the second block for almost all kernels. Lastly, the combing estimation method with Gaussian kernel is the absolute winner as the corresponding column is bold in both tables. Note that with the proposed gradient descent algorithm, we can obtain the value of bandwidth parameter with null gradient of cross-validation error dened in equation (10), which is often better and much faster than the one obtained by the grid search algorithm (2 or 3 times faster). Figure 2 below contains boxplots of runtimes of 100 runs of Model 1 and 9 of both correlated and uncorrelated cases computed on a machine with the following characteristics:

ˆ Processor: 2x AMD Opteron 6174, 12C, 2.2GHz, 12x512K L2/12M L3 Cache, 80W ACP, DDR3-1333MHz.

ˆ Memory: 64GB Memory for 2 CPUs, DDR3, 1333MHz.

Figure 2: Boxplots of runtimes of GD and grid search algorithm implemented on some models.

(16)

Table 2: Av erage MSEs in the uncorrelated case. Mo de l Las Rid k NN T r RF COBRA Epan Bi-wgt T ri-wgt C-Gaus Gauss Exp 4 1 0 .156 0 .134 0 .144 0 .027 0.033 0 .022 0 .020 0 .019 0 .019 0 .019 0.018 0 .019 (0.016) (0.013) (0.014) (0. 004) (0.004) (0 .004) (0 .003) (0 .003) (0 .003) (0 .003) (0 .002) (0.003) 2 1 .301 0 .784 0 .873 1 .124 0.707 0 .722 0 .718 0 .712 0 .715 0 .712 0.709 0 .710 (0.216) (0.110) (0.123) (0. 165) (0.097) (0 .065) (0 .079) (0 .080) (0 .079) (0 .079) (0 .078) (0.079) 3 0 .664 0 .669 1 .477 0 .797 0.629 0 .554 0 .482 0 .478 0 .476 0 .479 0.475 0 .483 (0.107) (0.255) (0.192) (0. 135) (0.091) (0.069) (0.062) (0.060) (0.060) (0.063) (0.060) (0.060) 4 7 .783 6 .550 10 .238 3 .796 3.774 3 .608 3 .231 3 .185 3 .153 3 .189 2.996 3 .186 (1.121) (1.115) (1.398) (0. 840) (0.523) (0.526) (0.383) (0.382) (0.384) (0.371) (0.384) (0.464) 5 0 .508 0 .518 0 .699 0 .575 0.436 0 .429 0 .389 0 .387 0 .386 0 .387 0.383 0 .387 (0.051) (0.073) (0.084) (0. 081) (0.051) (0 .035) (0 .031) (0 .030) (0 .030) (0 .030) (0 .030) (0.028) 6 2 .693 1 .958 2 .675 3 .065 1.826 1 .574 1 .274 1 .259 1 .254 1 .270 1 .273 1 .286 (0.537) (0.292) (0.349) (0. 475) (0.262) (0 .270) (0 .129) (0 .130) (0 .130) (0 .125) (0 .130) (0.130) 7 1 .971 0 .796 1 .074 0 .737 0.515 0 .506 0 .472 0 .468 0 .467 0 .469 0.451 0 .477 (0.410) (0.132) (0.152) (0. 109) (0.073) (0 .063) (0 .049) (0 .048) (0 .049) (0 .049) (0 .049) (0.067) 8 0 .134 0 .131 0 .200 0 .174 0.127 0 .104 0 .092 0.091 0.091 0.091 0.091 0 .094 (0.016) (0.020) (0.020) (0. 034) (0.013) (0 .013) (0 .013) (0 .013) (0 .013) (0 .013) (0 .011) (0.016) 9 1 .592 2 .948 3 .489 1 .830 1.488 1 .130 0 .929 0 .918 0 .914 0 .918 0.895 0 .993 (0.219) (0.436) (0.516) (0. 373) (0.267) (0 .151) (0 .128) (0 .127) (0 .130) (0 .124) (0 .126) (0.186) 10 2012 .660 1485 .065 1778 .955 3058 .381 1618.977 1511 .283 1462 .509 1458 .306 1459 .558 1452 .523 1400 .365 1414 .316 (284.391) (210.816) (261.396) (486.504) (231.555) (129 .796) (143 .976) (142 .988) (142 .602) (141 .168) ( 143 .330) (144.929) Table 3: Av erage MSEs in the correlated case. 
Mo d el Las Rid k NN T r RF COBRA Epan Bi-wgt T ri-wgt C-Gaus Gauss Exp 4 1 2 .294 1 .947 1 .941 0 .320 0.542 0 .307 0 .304 0 .301 0 .288 0 .297 0.269 0 .291 (0.544 (0.507) (0.487) (0.145) (0.231) (0 .129) (0 .105) (0 .111) (0 .103) (0 .104) (0 .092) (0.098) 2 14 .273 8 .442 8 .572 6 .796 5.135 5 .345 4 .582 4 .529 4 .491 4 .541 4.377 4 .910 (2.593) (1.912) (1.751) (1 .54 8) (1.372) (1 .194) (0 .941) (0 .934) (0 .922) (0 .896) (0 .905) (1.181) 3 7 .996 6 .266 8 .704 4 .110 3.722 3 .327 2 .598 2 .536 2 .444 2 .554 2.168 2 .357 (3.393) (3.296) (3.523) (2 .89 4) (2.956) (1 .006) (0 .912) (0 .944) (0 .840) (0 .907) (0 .680) (0.756) 4 61 .474 42 .351 46 .934 8 .855 13.381 9 .599 10 .511 9 .963 9 .682 10 .085 9.056 9 .713 (13.986) (11.622) (12.543) (3.480) (5.549) (4 .125) (2 .961) (3 .101) (2 .860) (2 .904) (2 .407) (2.695) 5 6 .805 7 .479 10 .342 4 .000 4.880 3 .225 2 .640 2 .401 2 .235 2 .412 1.792 2 .194 (3.685) (5.336) (5.425) (3 .14 4) (3.787) (2 .088) (1 .455) (1 .387) (1 .250) (1 .355) (0 .913) (1.242) 6 4 .221 2 .087 4 .461 3 .408 1.701 1 .493 1 .271 1 .238 1 .217 1 .248 1.097 1 .270 (0.848) (0.485) (0.599) (0 .63 6) (0.288) (0 .326) (0 .149) (0 .146) (0 .143) (0 .148) (0 .145) (0.386) 7 17 .875 4 .695 5 .591 4 .132 3.081 3 .304 2 .819 2 .779 2 .736 2 .788 2.640 2 .979 (5.632) (1.318) (1.418) (1 .36 0) (1.091) (0 .799 (0 .636) (0 .614) (0 .605) (0 .623) (0 .590) (0.764) 8 0 .139 0 .133 0 .201 0 .159 0.121 0 .102 0 .100 0 .100 0 .100 0 .100 0.092 0 .092 (0.016) (0.020) (0.019) (0 .03 5) (0.013) (0 .021) (0 .020) (0 .021) (0 .020) (0 .020) (0 .021) (0.018) 9 43 .445 37 .827 43 .991 15 .258 16.957 13 .505 11 .303 11 .007 11 .067 11 .206 10.303 12 .346 (12.210) (12.201) (12.920) (8.119) (8.774) (4 .822) (3 .891) (3 .815) (3 .949) (3 .960) (3 .634) (5.014) 10 7235 .062 5244 .843 7636 .811 13014 .596 7092.741 5147 .950 4717 .225 4669 .516 4663 .430 4697 .019 4660.043 5073 .591 (1100.579) (996 .18 1) (1159.445) (2020.133) (1030.249) (835 .384) (703 .049) (696 .027) (687 .474) (681 .370) (764 .363) (1022.894)

(17)

4.2 Real public datasets

In this part, we consider three public datasets which are available and easily ac-cessible on the internet. The rst dataset (Abalone, available at Dua and Gra

(2017a)) contains 4177 rows and 9 columns of measurements of abalones observed in Tasmania, Australia. We are interested in predicting the age of each abalone through the number of rings using its physical characteristics such as gender, size, weight, etc. The second dataset (House, available at Kaggle (2016)) comprises house sale prices for King County including Seattle. It contains homes sold be-tween May 2014 and May 2015. The dataset consists of 21613 rows of houses and 21 columns of characteristics of each house including ID, Year of sale, Size, Loca-tion, etc. In this case, we want to predict the price of each house using all of its quantitative characteristics.

Notice that Model 6 and 8 of the previous subsection are about predicting integer labels of the response variable. Analogously, the last dataset (Wine, see

Dua and Gra (2017b); Cortez et al. (2009)), which was also considered in Biau et al. (2016), containing 1599 rows of dierent types of wines and 12 columns corresponding to dierent substances of red wines including the amount of dierent types of acids, sugar, chlorides, PH, etc. The variable of interest is quality which scales from 3 to 8 where 8 represents the best quality. We aim at predicting the quality of each wine, which is treated as a continuous variable, using all of its substances.

The ve primary machines are Ridge, LASSO, kNN, Tree and Random Forest regression. In this case, the parameter ntree = 500 for random forest, and kNN is implemented using k = 20, 12 and 5 for Abalone, House and Wine dataset re-spectively. The ve machines are combined using the classical method by Biau et al. (2016) and the kernel-based method with Gaussian kernel as it is the most outstanding one among all the kernel functions. In this case, the search for param-eter h for the classical COBRA method is performed using a grid of size 300. In addition, due to the scaling issue, we measure the performance of any method f in this case using average root mean square error (RMSE) dened by,

RMSE(f) = pMSE(f) = v u u t 1 ntest ntest X i=1 (yitest− f (xtesti ))2. (12)

The average RMSEs obtained from 100 independent runs, evaluated on 20%-testing data of the three public datasets, are provided in Table 4 below (the rst three rows). We observe that random forest is the best estimator among all the basic machines in the rst block, and the proposed method either outperforms other columns (Wine and Abalone) or biases towards the best basic machine (House).

(18)

Moreover, the performances of kernel-based method always exceed the ones of the classical method by Biau et al.(2016).

4.3 Real private datasets

The results presented in this subsection are obtained from two private datasets. The rst dataset contains six columns corresponding to the six variables including Air temperature, Input Pressure, Output Pressure, Flow, Water Temperature and Power Consumption along with 2026 rows of hourly observations of these mea-surements of an air compressor machine provided byCadet et al.(2005). The goal is to predict the power consumption of this machine using the ve remaining ex-planatory variables. The second dataset is provided by the wind energy company Ma¨a Eolis. It contains 8721 observations of seven variables representing 10-minute measurements of Electrical power, Wind speed, Wind direction, Temperature, Vari-ance of wind speed and VariVari-ance of wind direction measured from a wind turbine of the company (see,Fischer et al. (2017)). In this case, we aim at predicting the electrical power produced by the turbine using the remaining six measurements as explanatory variables. We use the same set of parameters as in the previous subsection except for kNN where in this case k = 10 and k = 7 are used for air compressor and wind turbine dataset respectively. The results obtained from 100independent runs of the methods are presented in the last two rows (Air and Turbine) of Table 4 below. We observe on one hand that the proposed method (Gauss) outperforms both the best basic machines (RF) and the classical method by Biau et al. (2016) in the case of Turbine dataset. On the other hand, the performance of our method approaches the performance of the best basic machine (Las) and outperforms the classical COBRA in the case of Air dataset. Moreover, boxplots of runtimes (100 runs) measured on Wine and Turbine datasets (com-puted using the same machine as described in the subsection of simulated data) are also given in Figure 3 below.

Table 4: Average RMSEs of real datasets.

Model Las Rid kNN Tr RF COBRA Gauss

House 241083.959(8883.107) 241072.974(8906.332) (23548.367)245153.608 254099.652(9350.885) 205943.768(7496.766) (13299.934)223596.317 209955.276(7815.623) Wine 0.0.660 0.685 0.767 0.711 0.623 0.650 0.617 (0.029) (0.053) (0.031) (0.030) (0.028) (0.026) (0.020) Abalone (0.071)2.204 (0.075)2.215 (0.062)2.175 (0.072)2.397 (0.060)2.153 (0.081)2.171 (0.057)2.128 Air 163.099 164.230 241.657 351.317 174.836 172.858 163.253 (3.694) (3.746) (5.867) (31.876) (6.554) (7.644) (3.333) Turbine (4.986)70.051 (3.413)68.987 (1.671)44.516 (4.976)81.714 38.894(1.506) (1.561)38.927 37.135(1.555)

(19)

Figure 3: Boxplots of runtimes of GD and grid search algorithm implemented on Wine and Turbine datasets.

5 Conclusion

In this study, we investigate and extend the context of a naive kernel-based consen-sual regression aggregation method byBiau et al.(2016) to a more general regular kernel-based framework. Moreover, an optimization algorithm based on gradient descent is proposed to eciently and rapidly estimate the key parameter of the method. It is also shown through several numerical simulations that the perfor-mance of the method is improved signicantly with smoother kernel functions. In practice, the performance of the consensual aggregation depends both on the performance of the individual regression machines, and on the nal combination here involving kernel functions. Since the calibration of hyperparameters may be critical in both steps, it could be very interesting to investigate in future work how automated machine learning models can improve the performances of the global model.

6 Proofs

The following lemma, which is a variant of lemma 4.1 inGyör et al.(2002) related to the property of binomial random variables, is needed.

Lemma 1 Let B(n, p) be the binomial random variable with parameters n and p. Then 1. For any c > 0, E h 1 c + B(n, p) i ≤ 2 p(n + 1).

(20)

2. E h 1 B(n, p)1B(n,p)>0 i ≤ 2 p(n + 1). Proof of Lemma 1 1. For any c > 0, one has

E h 1 c + B(n, p) i = n X k=0 1 c + k × n! (n − k)!k!p k(1 − p)n−k = n X k=0 1 k + 1 × k + 1 k + c × n! (n − k)!k!p k(1 − p)n−k ≤ 2 p(n + 1) n X k=0 (n + 1)!pk+1(1 − p)n+1−(k+1) [n + 1 − (k + 1)]!(k + 1)! ≤ 2 p(n + 1) n+1 X k=0 (n + 1)!pk(1 − p)n+1−k [n + 1 − k]!k! = 2 p(n + 1)(p + 1 − p) n+1 = 2 p(n + 1) 2. E h 1 B(n, p)1B(n,p)>0 i ≤ Eh 2 1 + B(n, p) i = n X k=0 2 k + 1 × n! (n − k)!k!p k(1 − p)n−k = 2 p(n + 1) n X k=0 (n + 1)!pk+1(1 − p)n+1−(k+1) [n + 1 − (k + 1)]!(k + 1)! ≤ 2 p(n + 1) n+1 X k=0 (n + 1)!pk(1 − p)n+1−k [n + 1 − k]!k! = 2 p(n + 1)(p + 1 − p) n+1 = 2 p(n + 1) 

(21)

Proof of Proposition 1 For any square integrable function with respect to rk(X), one has E h |gn(rk(X)) − g∗(X)|2 i = Eh|gn(rk(X)) − g∗(rk(X)) + g∗(rk(X)) − g∗(X)|2 i = E h |gn(rk(X)) − g∗(rk(X))|2 i + 2Eh(gn(rk(X)) − g∗(rk(X)))(g∗(rk(X)) − g∗(X)) i + Eh|g∗(rk(X)) − g∗(X)|2 i .

We consider the second term of the right hand side of the last equality, E h (gn(rk(X)) − g∗(rk(X)))(g∗(rk(X)) − g∗(X)) i = Erk(X) h EX h (gn(rk(X)) − g∗(rk(X)))(g∗(rk(X)) − g∗(X)) rk(X) ii = Erk(X) h (gn(rk(X)) − g∗(rk(X)))(g∗(rk(X)) − E[g∗(X)|rk(X)]) i = 0 where g∗(r

k(X)) = E[g∗(X)|rk(X)] thanks to the denition of g∗(rk(X)) and the

tower property of conditional expectation. It remains to check that E h |g∗(rk(X)) − g∗(X)|2 i ≤ inf f ∈GE h |f (rk(X)) − g∗(X)|2i. For any function f s.t Eh|f (rk(X))|2i< +∞, one has

E h |f (rk(X)) − g∗(X)|2i= Eh|f (rk(X)) − g∗(rk(X)) + g∗(rk(X)) − g∗(X)|2 i = E h |f (rk(X)) − g∗(rk(X))|2 i + 2Eh(f (rk(X)) − g∗(rk(X)))(g∗(rk(X)) − g∗(X)) i + Eh|g∗(rk(X)) − g∗(X)|2 i . Similarly, E h (f (rk(X)) − g∗(rk(X)))(g∗(rk(X)) − g∗(X)) i = 0. Therefore, E h |f (rk(X)) − g∗(X)|2 i = E h |f (rk(X)) − g∗(rk(X))|2 i + Eh|g∗(rk(X)) − g∗(X)|2 i .

(22)

As the rst term of the right-hand side is nonnegative thus, E h |g∗(rk(X)) − g∗(X)|2 i ≤ inf f ∈GE h |f (rk(X)) − g∗(X)|2 i . Finally, we can conclude that

E h |gn(rk(X)) − g∗(X)|2 i ≤ Eh|gn(rk(X)) − g∗(rk(X))|2 i + inf f ∈GE h |f (rk(X)) − g∗(X)|2 i .

We obtain the particular case by restricting G to be the coordinates of rk, one has

E h |gn(rk(X)) − g∗(X)|2 i ≤ Eh|gn(rk(X)) − g∗(rk(X))|2 i + min 1≤m≤ME h |rk,m(X) − g∗(X)|2i.  Proof of Proposition 2 The procedure of proving this result is indeed the proce-dure of checking the conditions of Stone's theorem (see, for example, Stone (1977) and Chapter 4 of Györ et al. (2002)) which is also used in the classical method by

Biau et al.(2016). First of all, using the inequality: (a + b + c)2 ≤ 3(a2+ b2+ c2),

one has E h |gn(rk(X)) − g∗(rk(X))|2 i = E h ` X i=1 Wn,i(X)Yi− g∗(rk(X)) 2i = E h ` X i=1 Wn,i(X)[Yi− g∗(rk(Xi))] + ` X i=1 Wn,i(X)[g∗(rk(Xi)) − g∗(rk(X))] + ` X i=1 Wn,i(X)g∗(rk(X)) − g∗(rk(X)) 2i ≤ 3Eh ` X i=1 Wn,i(X)[g∗(rk(Xi)) − g∗(rk(X))] 2i + 3Eh ` X i=1 Wn,i(X)[Yi− g∗(rk(Xi))] 2i + 3E h g ∗ (rk(X)) ` X i=1 (Wn,i(X) − 1) 2i .

(23)

The three terms of the right-hand side are denoted by A.1, A.2 and A.3 respectively, thus one has

E h

|gn(rk(X)) − g∗(rk(X))|2

i

≤ 3(A.1 + A.2 + A.3).

To prove the result, it is enough to prove that the three terms A.1, A.2 and A.3 vanish under the assumptions of Proposition 2. We deal with the rst term A.1 in the following proposition.

Proposition A.1 Under the assumptions of Proposition 2,

lim `→+∞E h ` X i=1 Wn,i(X)[g∗(rk(Xi)) − g∗(rk(X))] 2i = 0. Proof of Proposition A.1 Using Cauchy-Schwarz's inequality, one has

A.1 = Eh ` X i=1 Wn,i(X)[g∗(rk(Xi)) − g∗(rk(X))] 2i = E h ` X i=1 q Wn,i(X) q Wn,i(X)[g∗(rk(Xi)) − g∗(rk(X))] 2i ≤ Eh ` X i=1 Wn,i(X) X` i=1 Wn,i(X)[g∗(rk(Xi)) − g∗(rk(X))]2 i = Eh ` X i=1 Wn,i(X)[g∗(rk(Xi)) − g∗(rk(X))]2 i def= A n.

Note that the regression function g∗ satises E[|g(r

k(X))|2] < +∞, thus it can be

approximated in L2 sense by a continuous function with compact support named ˜g

(see, for example, Theorem A.1 inDevroye et al. (1997)). This means that for any ε > 0, there exists a continuous function with compact support ˜g such that,

(24)

Thus, one has An= E hX` i=1 Wn,i(X)[g∗(rk(Xi)) − g∗(rk(X))]2 i ≤ 3Eh ` X i=1 Wn,i(X)[g∗(rk(Xi)) − ˜g(rk(Xi))]2 i + 3E hX` i=1 Wn,i(X)[˜g(rk(Xi)) − ˜g(rk(X))]2 i + 3Eh ` X i=1 Wn,i(X)[˜g(rk(X)) − g∗(rk(X))]2 i def= 3(A n1+ An2+ An3).

We deal with each term of the last upper bound as follows. ˆ Computation of An3: applying the denition of ˜g,

An3= E hX` i=1 Wn,i(X)[˜g(rk(X)) − g∗(rk(X))]2 i ≤ Eh|˜g(rk(X)) − g∗(rk(X))|2 i < ε.

ˆ Computation of An1: denoted by µ the distribution of X. Thus,

An1 = E hX` i=1 Wn,i(X)|g∗(rk(Xi)) − ˜g(rk(Xi))|2 i = `EhWn,1(X)|g∗(rk(X1)) − ˜g(rk(X1))|2 i = `EhP`Kh(rk(X) −rk(X1)) j=1Kh(rk(X) −rk(Xj)) |g∗(rk(X1)) − ˜g(rk(X1))|2 i = `EDk h E{Xj}`j=1 hZ Kh(rk(v) −rk(X1)) P` j=1Kh(rk(v) −rk(Xj)) × |g∗(rk(X1)) − ˜g(rk(X1))|2µ(dv) Dk ii = `EDk h E{Xj}`j=2 hZ Z |g∗(rk(u)) − ˜g(rk(u))|2× Kh(rk(v) −rk(u)) Kh(rk(v) −rk(u)) + P` j=2Kh(rk(v) −rk(Xj)) µ(du)µ(dv) Dk ii

(25)

= `EDk hZ |g∗(rk(u)) − ˜g(rk(u))|2× E{Xj}`j=2 hZ Kh(rk(v) −rk(u))µ(dv) Kh(rk(v) −rk(u)) + P` j=2Kh(rk(v) −rk(Xj)) Dk i µ(du)i = `EDk hZ

|g∗(rk(u)) − ˜g(rk(u))|2× I(u, `)µ(du)

i .

Fubini's theorem is employed to obtain the result of the last bound where the inner conditional expectation is denoted by I(u, `). We bound I(u, `) using the argument of covering RM with a countable family of balls Bdef= {B

M(xi, ρ/2) :

i = 1, 2, ....} and the facts that

1. rk(v) ∈ BM(rk(u)+hxi, hρ/2) ⇒ BM(rk(u)+hxi, hρ/2) ⊂ BM(rk(v), hρ).

2. b1{BM(0,ρ)}(z) < K(z) ≤ 1, ∀z ∈ R

M.

Now, let

 Ai,h(u)def= {v ∈ Rd: krk(v) −rk(u) − hxik < hρ/2}.

 B`

i,h(u)def=

P`

j=21{krk(Xj)−rk(u)−hxik<hρ/2}.

Thus, one has I(u, `)def= E{Xj}`j=2 hZ Kh(rk(v) −rk(u))µ(dv) Kh(rk(v) −rk(u)) +Pj=2` Kh(rk(v) −rk(Xj)) Dk i ≤ E{Xj}`j=2 h+∞X i=1 Z v:krk(v)−rk(u)−hxik<hρ/2 Kh(rk(v) −rk(u))µ(dv) Kh(rk(v) −rk(u)) +Pj=2` Kh(rk(v) −rk(Xj)) Dk i ≤ E{Xj}`j=2 h+∞X i=1 Z Ai,h(u) supz:kz−hxik<hρ/2Kh(z)µ(dv) supz:kz−hxik<hρ/2Kh(z) +P`j=2Kh(rk(v) −rk(Xj)) Dk i ≤ E{Xj}`j=2 h+∞X i=1 Z Ai,h(u) supz:kz−hxik<hρ/2Kh(z)µ(dv) supz:kz−hxik<hρ/2Kh(z) + b P` j=21{krk(v)−rk(Xj)k<hρ} Dk i

(26)

≤ 1 bE{Xj}`j=2 hX+∞ i=1 Z Ai,h(u) supz:kz−hxik<hρ/2Kh(z)µ(dv) supz:kz−hxik<hρ/2Kh(z) +P`j=21{krk(Xj)−rk(u)−hxik<hρ/2} Dk i ≤ 1 b +∞ X i=1 E{Xj}`j=2 hsupz:kz−hx

ik<hρ/2Kh(z)µ(Ai,h(u))

supz:kz−hxik<hρ/2Kh(z) + Bi,h` (u)

Dk i . Note that B`

i,h(u) is a binomial random variable B(` − 1, µ(Ai,h(u))) under

the law of {Xj}`j=2. Applying part 1 of Lemma 1, one has

I(u, `) ≤ 1 b

+∞

X

i=1

2 supz:kz−hxik<hρ/2Kh(z)µ(Ai,h(u))

`µ(Ai,h(u))

≤ 2 b` +∞ X i=1 sup w:kw−xik<ρ/2 K(w) = 2 b` +∞ X i=1 sup w∈BM(xi,ρ/2) K(w) ≤ 2 b` +∞ X i=1 sup w∈BM(xi,ρ/2) K(w) ≤ 2 b`λM(BM(0, ρ/2)) +∞ X i=1 Z BM(xi,ρ/2) sup w∈BM(xi,ρ/2) K(w)dy ≤ 2 b`λM(BM(0, ρ/2)) +∞ X i=1 Z BM(xi,ρ/2) sup w∈BM(y,ρ) K(w)dy ≤ 2κM b`λM(BM(0, ρ/2)) Z sup w∈BM(y,ρ) K(w)dy | {z } = κ0 by (4) ≤ 2κMκ0 b`λM(BM(0, ρ)) def= C(b, ρ, κ0, M ) ` < +∞

where λM denotes the Lebesque measure on of RM, κM denotes the

num-ber of balls covering a certain element of RM, and the constant part is

de-noted by C(b, ρ, κ0, M ) depending on the parameters indicated in the bracket.

The last inequality is attained from the fact that the overlapping integrals P+∞

i=1

R

(27)

the entire space R supz∈BM(y,ρ/2)K(z)dymultiplying by the number of

cover-ing balls kM. Therefore,

An1≤ `

C(b, ρ, κ0, M )

` EDk

hZ

|g∗(rk(u)) − ˜g(rk(u))|2µ(du)

i = C(b, ρ, κ0, M )E h |˜g(rk(X)) − g∗(rk(X))|2 i < C(b, ρ, κ0, M )ε.

ˆ Computation of An2: for any δ > 0 one has

An2= E hX` i=1 Wn,i(X)|˜g(rk(Xi)) − ˜g(rk(X))|2 i = E hX` i=1 Wn,i(X)|˜g(rk(Xi)) − ˜g(rk(X))|21{krk(Xi)−rk(X)k≥δ} i + Eh ` X i=1 Wn,i(X)|˜g(rk(Xi)) − ˜g(rk(X))|21{krk(Xi)−rk(X)k<δ} i ≤ 4 sup u∈Rd |˜g(rk(u))|2E hX` i=1 Wn,i(X)1{krk(Xi)−rk(X)k≥δ} i + sup u,v∈Rd:kr k(u)−rk(v)k<δ |˜g(rk(u)) − ˜g(rk(v))|2

Using the uniform continuity of ˜g, the second term of the upper bound of An2

tends to 0 when δ tends 0. Thus, we only need to prove that the rst term of this upper bound also tends to 0. We follow a similar procedure as in the previous part: E hX` i=1 Wn,i(X)1{krk(Xi)−rk(X)k≥δ} i = EDk hX` i=1 EX,{Xj}`j=1 h Wn,i(X)1{krk(X)−rk(Xi)k≥δ} Dk ii = EDk hX` i=1 E{Xj}`j=1 hZ Kh(rk(v) −rk(Xi))1{kr k(v)−rk(Xi)k≥δ} P` j=1Kh(rk(v) −rk(Xj)) µ(dv) Dk ii = `EDk h E{Xj}`j=2 hZ Z Kh(rk(v) −rk(u))1{kr k(v)−rk(u)k≥δ}µ(du)µ(dv) Kh(rk(v) −rk(u)) +Pj=2` Kh(rk(v) −rk(Xj)) Dk ii = `EDk hZ J (u, `)µ(du)i.

(28)

Fubini's theorem is applied to obtain the last equation where for any u ∈ Rd, J (u, `)def= E{Xj}`j=2 hZ Kh(rk(v) −rk(u))1{kr k(v)−rk(u)k≥δ}µ(dv) Kh(rk(v) −rk(u)) + P` j=2Kh(rk(v) −rk(Xj)) Dk i ≤ E{Xj}`j=2 h+∞X i=1 Z v:krk(v)−rk(u)−hxik<hρ/2 Kh(rk(v) −rk(u))1{krk(v)−rk(u)k≥δ} Kh(rk(v) −rk(u)) +Pj=2` Kh(rk(v) −rk(Xj)) µ(dv) Dk i ≤ E{Xj}`j=2 h+∞X i=1 Z Ai,h(u) supz:kz−hxik<hρ/2Kh(z)1{kzk≥δ} supz:kz−hxik<hρ/2Kh(z) + P` j=2Kh(rk(v) −rk(Xj)) µ(dv) Dk i ≤ +∞ X i=1 sup z:kz−hxik<hρ/2 Kh(z)1{kzk≥δ}× E{Xj}`j=2 hZ Ai,h(u) µ(dv) supz:kz−hxik<hρ/2Kh(z) + b P` j=21{krk(Xj)−rk(v)k<hρ} Dk i ≤ +∞ X i=1 sup z:kz−hxik<hρ/2 Kh(z)1{kzk≥δ}× E{Xj}`j=2 hZ Ai,h(u) E{Xj}`j=2 h µ(dv) supz:kz−hxik<hρ/2Kh(z) + b P` j=21{krk(Xj)−rk(u)−hxik<hρ/2} Dk i ≤ +∞ X i=1 sup z:kz−hxik<hρ/2

Kh(z)1{kzk≥δ}µ(Ai,h(u))×

1

bE{Xj}`j=2

h 1

supz:kz−hxik<hρ/2Kh(z) + Bi,h` (u)

Dk i ≤ 1 b +∞ X i=1

2 supz:kz−hxik<hρ/2Kh(z)µ(Ai,h(u))1{kzk≥δ}

`µ(Ai,h(u))

≤ 2 b` +∞ X i=1 sup w:kw−xik<ρ/2 K(w)1{kwk≥δ/h}.

Thus, one has

E hX` i=1 Wn,i(X)1{krk(Xi)−rk(X)k≥δ} i ≤ `2 b` +∞ X i=1 sup w∈BM(xi,ρ/2) K(w)1{kwk≥δ/h}

(29)

When both h → 0 and δ → 0 satisfying δ/h → +∞, the upper bound series converges to zero. Indeed, it is a non-negative convergent series thanks to the proof of I(u, l) in the previous part. Moreover, the general term of the series, sk = supw∈BM(xk,ρ/2)K(w)1{kwk≥δ/h}, satisfying limδ/h→+∞sk = 0

for all k ≥ 1. Therefore, this series converges to zero when h → 0, δ → 0 such that δ/h → +∞.

In conclusion, when ` → +∞ and ε, h, δ → 0 such that δ/h → +∞, all the three terms of the upper bound of An tend to 0, so does An.

 Proposition A.2 Under the assumptions of Proposition 2,

lim `→+∞E h ` X i=1 Wn,i(X)[Yi− gn(rk(Xi))] 2i = 0.

Proof of Proposition A.2 Using the independence between (Xi, Yi)and (Xj, Yj)

for all i 6= j, one has A.2 = E h ` X i=1 Wn,i(X)[Yi− gn(rk(Xi))] 2i = X 1≤i,j≤` E h

Wn,i(X)Wn,j(X)[Yi− gn(rk(Xi))][Yj− gn(rk(Xj))]

i = Eh ` X i=1 Wn,i2 (X)|Yi− gn(rk(Xi))|2 i = Eh ` X i=1 Wn,i2 (X)σ2(rk(Xi)) i where σ2(rk(x))def= E[(Yi− gn(rk(Xi)))2|rk(x)].

Thus, based on the assumption of X and Y we have σ2 ∈ L

1(µ). Therefore, σ2

can be approximated in L1 sense i.e., for any ε > 0, ∃˜σ2 a continuous function with

compact support such that

E[|σ2(rk(X)) − ˜σ2(rk(X))|] < ε.

Thus, one has A.2 ≤ Eh ` X i=1 Wn,i2 (X)˜σ2(rk(Xi)) i + Eh ` X i=1 Wn,i2 (X)|σ2(rk(Xi)) − ˜σ2(rk(Xi))| i ≤ sup u∈Rd |˜σ2(rk(u))|E hX` i=1 Wn,i2 (X) i + E hX` i=1 Wn,i2 (X)|σ2(rk(Xi)) − ˜σ2(rk(Xi))| i .

(30)

Using similar argument as in the case of An1 and the fact that Wn,i(x) ≤ 1, ∀i =

1, 2, ..., `, thus for any ε > 0, one has E hX` i=1 Wn,i2 (X)|σ2(rk(Xi)) − ˜σ2(rk(Xi))| i ≤ Eh ` X i=1 Wn,i(X)|σ2(rk(Xi)) − ˜σ2(rk(Xi))| i < C(b, ρ, κ0, M )ε.

Therefore, it remains to prove that E[P`

i=1Wn,i2 (X)] → 0 as ` → +∞. As

b1{BM(0,ρ)}(z) < K(z) ≤ 1, ∀z ∈ R

M with the convention of 0/0 = 0, for a xed

δ > 0, one has ` X i=1 Wn,i2 (X) = ` X i=1  Kh(rk(X) −rk(Xi)) P` j=1Kh(rk(X) −rk(Xj)) 2 ≤ P` i=1Kh(rk(X) −rk(Xi))  P` j=1Kh(rk(X) −rk(Xj)) 2 ≤ minnδ, 1{P` j=1Kh(rk(X)−rk(Xj))>0} P` j=1Kh(rk(X) −rk(Xj)) o ≤ minnδ, 1{P` j=11{krk(X)−rk(Xj)k<hρ}>0} bP` j=11{krk(X)−rk(Xj)k<hρ} ≤ δ + 1{P` j=11{krk(X)−rk(Xj)k<hρ}>0} bP` j=11{krk(X)−rk(Xj)k<hρ} . (13)

Therefore, it is enough to show that E h 1{P`j=11{krk(X)−rk(Xj)k<hρ}>0} P` j=11{krk(X)−rk(Xj)k<hρ} i `→+∞ −−−−→ 0. One has E h 1{P`j=11{krk(X)−rk(Xj)k<hρ}>0} P` j=11{krk(X)−rk(Xj)k<hρ} i ≤ Eh 1{ P` j=11{krk(X)−rk(Xj)k<hρ}>0} P` j=11{krk(X)−rk(Xj)k<hρ} 1{rk(X)∈B} i + µ({v ∈ Rd:rk(v) ∈ Bc}) = Eh1{rk(X)∈B}E h 1{P`j=11{krk(X)−rk(Xj)k<hρ}>0} P` j=11{krk(X)−rk(Xj)k<hρ} X ii + µ({v ∈ Rd:rk(v) ∈ Bc}) ≤ 2Eh 1{rk(X)∈B} (` + 1)µ({v ∈ Rd: kr k(v) −rk(X)k < hρ}) i + µ({v ∈ Rd:rk(v) ∈ Bc})

(31)

where B is a M-dimensional ball centered at the origin chosen so that the second term µ({v ∈ Rd:r

k(v) ∈ Bc})is small. The last inequality is attained by applying

part 2 of lemma 1. Moreover, as rk = (rk,m)Mm=1 is bounded then there exists a

nite number of balls in B = {BM(xj, hρ/2) : j = 1, 2, ...}such that B is contained

in the union of these balls i.e., ∃Ih,M nite, such that B ⊂ ∪j∈Ih,MBM(xj, hρ/2).

E h 1{r k(X)∈B} (` + 1)µ({v ∈ Rd: kr k(v) −rk(X)k < hρ}) i ≤ X j∈Ih,M Z u:krk(u)−xjk<hρ/2 µ(du) (` + 1)µ({v ∈ Rd: kr k(v) −rk(u)k < hρ}) + µ({v ∈ Rd:rk(v) ∈ Bc}) ≤ X j∈Ih,M Z u:krk(u)−xjk<hρ/2 µ(du) (` + 1)µ({v ∈ Rd: kr k(v) − xjk < hρ/2}) + µ({v ∈ Rd:rk(v) ∈ Bc}) = X j∈Ih,M µ({u ∈ Rd: krk(u) − xjk < hρ/2}) (` + 1)µ({v ∈ Rd: kr k(v) − xjk < hρ/2}) + µ({v ∈ R d:r k(v) ∈ Bc}) = |Ih,M| ` + 1 + µ({v ∈ R d:r k(v) ∈ Bc}) ≤ C0 hM(` + 1)+ µ({v ∈ R d:r k(v) ∈ Bc}) (14) `→+∞,h→0 −−−−−−−−→ hM`→+∞ µ({v ∈ R d:r k(v) ∈ Bc}).

It is easy to check the following fact, |Ih,M| ≤

C0

hM for some C0> 0. (15)

To prove this, we consider again the cover B = {BM(xj, hρ/2) : j = 1, 2, ...}of RM.

For any ρ > 0 xed and h > 0, note that the covering number |Ih,M|is proportional

to the ratio between the volume of B and the volume of the ball BM(0, hρ/2) i.e.,

|Ih,M| ∝ Vol(B) Vol(BM(0, hρ/2)) ∝ Vol(B) (hρ/2)M ≤ C0 hM

(32)

for some positive constant C0 proportional to the volume of B. Finally, we can

conclude the proof of the proposition as we can choose B such that µ({v ∈ Rd :

rk(v) ∈ Bc}) = 0thanks to the boundedness of the basic machines.

Remark 2 The assumption on the boundedness of the constructed machines is crucial. This assumption allows us to choose a ball B which can be covered using a nite number |Ih,M| of balls BM(xj, hρ/2), therefore makes it possible to prove the

result of this proposition for this class of regular kernels. Note that for the class of compactly supported kernels, it is easy to obtain such a result directly from the begging of the evaluation of each integral (see, for example, Chapter 5 of Györ et al. (2002)).

 Proposition A.3 Under the assumptions of Proposition 2,

lim `→+∞E h g ∗(r k(X)) X` i=1 Wn,i(X) − 1  2i = 0. Proof of Proposition A.3 Note that | P`

i=1Wn,i(X) − 1| ≤ 1 thus one has

g ∗(r k(X)) X` i=1 Wn,i(X) − 1  2 ≤ |g∗(rk(X))|2.

Consequently, by Lebesque's dominated convergence theorem, to prove this propo-sition, it is enough to show that P`

i=1Wn,i(X) → 1 almost surely. Note that

1 −P` i=1Wn,i(X) =1{P` i=1Kh(rk(X)−rk(Xi))=0} therefore, P hX` i=1 Wn,i(X) 6= 1 i = P hX` i=1 Kh(rk(X) −rk(Xi)) = 0 i ≤ P ` X j=1 1{krk(X)−rk(Xj)k<hρ} = 0  = Z P X` j=1 1{krk(x)−rk(Xj)k<hρ} = 0  µ(dx) = Z P  ∩`j=1{krk(x) −rk(Xj)k ≥ hρ}  µ(dx) = Z h 1 − P{krk(x) −rk(X1)k < hρ} i` µ(dx) = Z h 1 − µ  {v ∈ Rd: krk(x) −rk(v)k < hρ} i` µ(dx)

(33)

≤ Z e−`µ(Ah(x))µ(dx) = Z e−`µ(Ah(x))1 {rk(x)∈B}µ(dx) + µ({v ∈ R d:r k(v) ∈ Bc}) ≤ maxu{ue −u} ` Z 1 {rk(x)∈B} µ(Ah(x)) µ(dx) + µ({v ∈ R d:r k(v) ∈ Bc}) where Ah(x)def= {v ∈ Rd: krk(x) −rk(v)k < hρ}. (16) Therefore, P hX` i=1 Wn,i(X) 6= 1 i ≤ e −1 ` E h 1{r k(X)∈B} µ({v ∈ Rd: kr k(v) −rk(X)k < hρ}) i + µ({v ∈ Rd:rk(v) ∈ Bc}).

Following the same procedure as in the proof of A.2 we obtain the desire result.  Proof of Theorem 1 Choose a new observation x ∈ Rd, given the training data

Dk and the predictions {rk(Xp)}`p=1 on D`, taking expectation with respect to the

response variables {Y(`)

p }`p=1, it is easy to check that

E[|gn(rk(x)) − g∗(rk(x))|2|{rk(Xp)}`p=1, Dk] = E h gn(rk(x)) − E[gn(rk(x))|{rk(Xp)} ` p=1, Dk] + E[gn(rk(x))|{rk(Xp)}`p=1, Dk] − g∗(rk(x)) 2 {rk(Xp)} ` p=1, Dk i = E[|gn(rk(x)) − E[gn(rk(x))|{rk(Xp)}`p=1, Dk]|2|{rk(Xp)}`p=1, Dk] + |g∗(rk(x)) − E[gn(rk(x))|{rk(Xp)}`p=1, Dk]|2 def= E 1+ E2.

On one hand by using the independence between Yi and (Yj, Xj) for all i 6= j, we

(34)

E1def= E h gn(rk(x)) − E[gn(rk(x))|{rk(Xp)} ` p=1, Dk] 2 {rk(Xp)} ` p=1, Dk i = Eh ` X i=1

Wn,i(x)(Yi− E[Yi|rk(Xi)])

2 {rk(Xp)} ` p=1, Dk i = Eh ` X i=1

Wn,i2 (x)(Yi− E[Yi|rk(Xi)])2

{rk(Xp)} ` p=1, Dk i = ` X i=1

Wn,i2 (x)EYi[(Yi− E[Yi|rk(Xi)])

2|r k(Xi)] = V[Y1|rk(X1)] ` X i=1 Wn,i2 (x) (13) ≤ 4R 2 b  δ + 1{P` j=11{krk(x)−rk(Xj)k<hρ}>0} P` j=11{krk(x)−rk(Xj)k<hρ} 

where the notation V(Z) stands for the variance of a random variable Z. Therefore, using the result of inequality (14), one has

E(E1) ≤ 4R2 b  δ + C0 hM(` + 1)  (17) for some C0 > 0. On the other hand, set

 C` h(x)def= P` j=11{krk(Xj)−rk(x)k<hρ}.  D` h(x)def= P` j=1Kh(rk(Xj) −rk(x)).

The second term E2 is hard to control as it depends on the behavior of g∗(rk(.)).

That is why a weak smoothness assumption of the theorem is required to connect this behavior to the behavior of the input machines. Using this assumption, one has E2def= g ∗(r k(x)) − E[gn(rk(x))|{rk(Xp)}`p=1, Dk] 2 = ` X i=1 Wn,i(X)(g∗(rk(x)) − E[Yi|rk(Xi)]) 2 1{D` h(x)>0} + (g∗(rk(x)))21{D` h(x)=0}

(35)

(Jensen) ≤ ` X i=1 Wn,i(x)(g∗(rk(x)) − E[Yi|rk(Xi)])21{D` h(x)>0} + (g∗(rk(x)))21{D` h(x)=0} ≤ ` X i=1 Kh(rk(x) −rk(Xi))(g∗(rk(x)) − g∗(rk(Xi)))2 P` j=1Kh(rk(x) −rk(Xj)) 1{D` h(x)>0} + (g∗(rk(x)))21{D` h(x)=0} ≤ L2 ` X i=1 Kh(rk(x) −rk(Xi))krk(x) −rk(Xi)k2 P` j=1Kh(rk(x) −rk(Xj)) 1{D` h(x)>0} + (g∗(rk(x)))21{D` h(x)=0} ≤ L2h ` X i=1 Kh(rk(x) −rk(Xi))krk(x) −rk(Xi)k21{krk(x)−rk(Xi)k<RKhα} P` j=1Kh(rk(x) −rk(Xj)) + ` X i=1 Kh(rk(x) −rk(Xi))krk(x) −rk(Xi)k21{krk(x)−rk(Xi)k≥RKhα} P` j=1Kh(rk(x) −rk(Xj)) i × 1{D` h(x)>0}+ (g ∗(r k(x)))21{C` h(x)=0} def= E1 2 + E22+ E23.

for any α ∈ (0, 1) chosen arbitrarily at this point. Now, we bound the expectation of the three terms of the last inequality.

ˆ Firstly, E1

2 can be easily bounded from above by

E21def= L2 ` X i=1 Kh(rk(x) −rk(Xi))krk(x) −rk(Xi)k2 P` j=1Kh(rk(x) −rk(Xj)) 1{D` h(x)>0}× 1{krk(x)−rk(Xi)k<RKhα} ≤ L2h2αR2K ` X i=1 Kh(rk(x) −rk(Xi)) P` j=1Kh(rk(x) −rk(Xj)) 1{D` h(x)>0} = L2h2αR2K.

Therefore, its expectation is simply bounded by the same upper bound i.e., E(E21) ≤ L2h2αR2K (18)

ˆ Secondly, we bound the second term E2

(36)

K given equation (7), thus for any h > 0: E22def= h2L2 ` X i=1 Kh(rk(x) −rk(Xi))k(rk(x) −rk(Xi))/hk2 P` j=1Kh(rk(x) −rk(Xj)) 1{D` h(x)>0} × 1{krk(x)−rk(Xi)k≥RKhα} ≤ h2L2 ` X i=1 CK1{k(rk(x)−rk(Xi))/hk≥RK/h1−α}1{D`h(x)>0} (1 + k(rk(x) −rk(Xi))/hkM)P`j=1Kh(rk(x) −rk(Xj)) ≤ hM +2L2 ` X i=1 CK1{krk(x)−rk(Xi)k≥RKhα}1{Dh`(x)>0} (hM + (R Khα)M)P`j=1Kh(rk(x) −rk(Xj)) ≤ hM +2L2CK ` X i=1 1{krk(x)−rk(Xi)k≥RKhα} (hM + RM KhαM) P` j=1Kh(rk(x) −rk(Xj)) 1{D` h(x)>0} ≤ h M +2−αML2C K hM (1−α)+ RM K × P` i=11{krk(x)−rk(Xi)k≥RKhα} P` j=1Kh(rk(x) −rk(Xj)) 1{D` h(x)>0} ≤ h (1−α)M +2L2C K bRM K × P` i=11{krk(x)−rk(Xi)k≥RKhα} P` j=11{krk(x)−rk(Xj)k<hρ} 1{C` h(x)>0}. Therefore, E22 ≤ h (1−α)M +2L2C K` bRM K × 1{P` j=11{krk(x)−rk(Xj)k<hρ}>0} P` j=11{krk(x)−rk(Xj)k<hρ} . Again, applying the result of inequality (14), one has

E(E22) ≤ h(1−α)M +2L2CK` bRM K × C0 hM(` + 1) ≤ C1` (` + 1)h 2−αM (19)

for some C1 > 0 and α < 2/M.

ˆ Lastly with Ah(x) dened in (16), we bound the expectation of E23 by,

E(E23) ≤ E h (g∗(rk(x)))21{C` h(x)=0} i ≤ sup u∈Rd (g∗(rk(u)))2Eh1{C` h(x)=0} i = sup u∈Rd (g∗(rk(u)))2(1 − µ(Ah(x)))` ≤ sup u∈Rd (g∗(rk(u)))2e−`µ(Ah(x))

(37)

≤ sup

u∈Rd

(g∗(rk(u)))2

`µ(Ah(x))e−`µ(Ah(x))

`µ(Ah(x))

≤ sup

u∈Rd

(g∗(rk(u)))2

maxu∈Rdue−u

`µ(Ah(x)) ≤ sup u∈Rd (g∗(rk(u)))2 e−1 `µ(Ah(x)) ≤ C2 `µ(Ah(x))) (20) for some C2 > 0.

From (17), (18), (19) and (20), one has E[|gn(rk(X)) − g∗(rk(X))|2] ≤ Z Rd E[|gn(rk(x)) − g∗(rk(x))|2]µ(dx) ≤ Z Rd E(E1+ E21+ E22+ E23)µ(dx) ≤ Z Rd h4R2 b  δ + C0 hM(` + 1)  + L2h2αR2K + C1` (` + 1)h 2−αM+ C2 `µ(Ah(x))) i µ(dx). Therefore following the same procedure of proving inequality (14), one has

E[|gn(rk(X)) − g∗(rk(X))|2] ≤ 4R 2 b  δ + C0 hM(` + 1)  + L2h2αR2K+ C1` (` + 1)h 2−αM+ Z Rd C2µ(dx) `µ(Ah(x))) ≤ 4R 2 b  δ + C0 hM(` + 1)  + L2h2αR2K+ C1` (` + 1)h 2−αM + X j∈Jh,M Z krk(x)−xjk<hρ C2µ(dx) `µ({v ∈ Rd: kr k(v) −rk(x)k < hρ}) ≤ 4R 2 b  δ + C0 hM(` + 1)  + L2h2αR2K+ C1` (` + 1)h 2−αM + X j∈Jh,M Z krk(x)−xjk<hρ C2µ(dx) `µ({v ∈ Rd: kr k(v) − xjk < hρ})

(38)

≤ 4R 2 b  δ + C0 hM(` + 1)  + L2h2αR2K+ C1` (` + 1)h 2−αM +C2 ` X j∈Jh,M µ({v ∈ Rd: krk(v) − xjk < hρ}) µ({v ∈ Rd: kr k(v) − xjk < hρ}) ≤ 4R 2 b  δ + C0 hM(` + 1)  + L2h2αR2K+ C1` (` + 1)h 2−αM+C2|Jh,M| ` ≤ 4R 2 b  δ + C0 hM(` + 1)  + L2R2Kh2α+ C1` (` + 1)h 2−αM+ C20 hM`

where $|J_{h,M}|$ denotes the number of balls covering the ball $B$ (introduced in the proof of A.2) by the cover $\{B_M(x_j, h\rho) : j = 1, 2, ...\}$. Similarly, one has $|J_{h,M}| \le C_0/h^M$ for some constant $C_0 > 0$ proportional to the volume of $B$. Since $\delta > 0$ can be arbitrarily small, and with the choice of $\alpha = 2/(M+2)$, we can deduce that
\begin{equation*}
\mathbb{E}\bigl[|g_n(r_k(X)) - g^*(r_k(X))|^2\bigr] \le \frac{\tilde{C}_1}{h^M \ell} + \tilde{C}_2 h^{4/(M+2)} . \tag{21}
\end{equation*}

From this bound, for $h \propto \ell^{-(M+2)/(M^2+2M+4)}$, we obtain the desired result with an upper bound of order $O\bigl(\ell^{-\frac{4}{M^2+2M+4}}\bigr)$, i.e.,
\[
\mathbb{E}\bigl[|g_n(r_k(X)) - g^*(r_k(X))|^2\bigr] \le C \ell^{-\frac{4}{M^2+2M+4}}
\]
for some constant $C > 0$ independent of $\ell$.
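This rate can be checked directly: with $h \propto \ell^{-(M+2)/(M^2+2M+4)}$, the two terms of (21) are of the same order, since
\[
\frac{1}{h^M \ell} \propto \ell^{\frac{M(M+2)}{M^2+2M+4} - 1} = \ell^{-\frac{4}{M^2+2M+4}} \qquad \text{and} \qquad h^{4/(M+2)} \propto \ell^{-\frac{4}{M^2+2M+4}} .
\]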

□

Proof of Remark 1. To prove the result in this case, that is, under the following assumption:

\[
\exists \, R_K, C_K > 0 \text{ and } \alpha \in (0, 1) : \quad K(z) \le C_K e^{-\|z\|^{\alpha}}, \quad \forall z \in \mathbb{R}^M, \ \|z\| \ge R_K,
\]
we only need to check the new bound of $E_{22}$ defined in the previous case. One has

\begin{align*}
E_{22} &\stackrel{\text{def}}{=} L^2 \sum_{i=1}^{\ell} \frac{K_h(r_k(x) - r_k(X_i))\|r_k(x) - r_k(X_i)\|^2 \mathbb{1}_{\{D_h^{\ell}(x) > 0\}}}{\sum_{j=1}^{\ell} K_h(r_k(x) - r_k(X_j))} \times \mathbb{1}_{\{\|r_k(x) - r_k(X_i)\| \ge h^{\alpha} R_K\}} \\
&\le L^2 \sum_{i=1}^{\ell} \frac{h^2 K_h(r_k(x) - r_k(X_i))\|(r_k(x) - r_k(X_i))/h\|^2 \mathbb{1}_{\{D_h^{\ell}(x) > 0\}}}{\sum_{j=1}^{\ell} K_h(r_k(x) - r_k(X_j))} \mathbb{1}_{\{\|(r_k(x) - r_k(X_i))/h\| \ge R_K/h^{1-\alpha}\}} \\
&\le \frac{h^2 L^2}{b} \sum_{i=1}^{\ell} \frac{C_K e^{-\|(r_k(x) - r_k(X_i))/h\|^{\alpha}} \|(r_k(x) - r_k(X_i))/h\|^2}{\sum_{j=1}^{\ell} \mathbb{1}_{\{\|r_k(x) - r_k(X_j)\| < h\rho\}}} \times \mathbb{1}_{\{\|(r_k(x) - r_k(X_i))/h\| \ge R_K/h^{1-\alpha}\}} \mathbb{1}_{\{C_h^{\ell}(x) > 0\}} .
\end{align*}
Since, for any $\alpha \in (0, 1)$, the map $t \mapsto \lambda(t) = t^2 e^{-t^{\alpha}}$ is strictly decreasing for all $t \ge (2/\alpha)^{1/\alpha}$, for $h$ small enough such that $R_K/h^{1-\alpha} \ge (2/\alpha)^{1/\alpha}$, one has
\begin{align*}
E_{22} &\le \frac{h^2 L^2 C_K}{b} \sum_{i=1}^{\ell} \frac{(R_K/h^{1-\alpha})^2 e^{-(R_K/h^{1-\alpha})^{\alpha}} \mathbb{1}_{\{\|(r_k(x) - r_k(X_i))/h\| \ge R_K/h^{1-\alpha}\}}}{\sum_{j=1}^{\ell} \mathbb{1}_{\{\|r_k(x) - r_k(X_j)\| < h\rho\}}} \mathbb{1}_{\{C_h^{\ell}(x) > 0\}} \\
&\le \frac{h^{2\alpha} L^2 C_K R_K^2 e^{-R_K^{\alpha} h^{-\alpha(1-\alpha)}}}{b} \sum_{i=1}^{\ell} \frac{\mathbb{1}_{\{\sum_{j=1}^{\ell} \mathbb{1}_{\{\|r_k(x) - r_k(X_j)\| < h\rho\}} > 0\}}}{\sum_{j=1}^{\ell} \mathbb{1}_{\{\|r_k(x) - r_k(X_j)\| < h\rho\}}} \\
&\le \frac{\ell h^{2\alpha} L^2 C_K R_K^2 e^{-R_K^{\alpha} h^{-\alpha(1-\alpha)}}}{b} \times \frac{\mathbb{1}_{\{\sum_{j=1}^{\ell} \mathbb{1}_{\{\|r_k(x) - r_k(X_j)\| < h\rho\}} > 0\}}}{\sum_{j=1}^{\ell} \mathbb{1}_{\{\|r_k(x) - r_k(X_j)\| < h\rho\}}} .
\end{align*}
Applying the result of inequality (14), one has

\begin{equation*}
\mathbb{E}(E_{22}) \le \frac{\ell h^{2\alpha} L^2 C_K R_K^2 e^{-R_K^{\alpha} h^{-\alpha(1-\alpha)}}}{b} \times \frac{C_0}{h^M(\ell + 1)} \le C_1 h^{2\alpha - M} e^{-R_K^{\alpha} h^{-\alpha(1-\alpha)}} \tag{22}
\end{equation*}
for some $C_1 > 0$ and $\alpha \in (0, 1)$.
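The monotonicity of $\lambda$ invoked above can be verified by differentiating: for $t > 0$,
\[
\lambda'(t) = 2t e^{-t^{\alpha}} - \alpha t^{\alpha+1} e^{-t^{\alpha}} = t e^{-t^{\alpha}} \bigl(2 - \alpha t^{\alpha}\bigr) \le 0 \quad \text{whenever } t \ge (2/\alpha)^{1/\alpha} .
\]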

Therefore, from (17), (18), (20) and (22), one has
\begin{align*}
\mathbb{E}\bigl[|g_n(r_k(X)) - g^*(r_k(X))|^2\bigr] &\le \int_{\mathbb{R}^d} \mathbb{E}\bigl[|g_n(r_k(x)) - g^*(r_k(x))|^2\bigr] \mu(dx) \\
&\le \int_{\mathbb{R}^d} \mathbb{E}(E_1 + E_{21} + E_{22} + E_{23}) \, \mu(dx) \\
&\le \int_{\mathbb{R}^d} \Bigl[\frac{4R^2}{b}\Bigl(\delta + \frac{C_0}{h^M(\ell+1)}\Bigr) + L^2 h^{2\alpha} R_K^2 + C_1 h^{2\alpha - M} e^{-R_K^{\alpha} h^{-\alpha(1-\alpha)}} + \frac{C_2}{\ell \mu(A_h(x))}\Bigr] \mu(dx) .
\end{align*}
Following the same procedure as in the previous proof of Theorem 1, one has
\[
\mathbb{E}\bigl[|g_n(r_k(X)) - g^*(r_k(X))|^2\bigr] \le \frac{4R^2}{b}\Bigl(\delta + \frac{C_0}{h^M(\ell+1)}\Bigr) + L^2 h^{2\alpha} R_K^2 + C_1 h^{2\alpha - M} e^{-R_K^{\alpha} h^{-\alpha(1-\alpha)}} + \frac{C_2'}{h^M \ell} .
\]

Since $\delta > 0$ is chosen arbitrarily and the third term of the last inequality decreases exponentially fast as $h \to 0$ for any fixed $\alpha \in (0, 1)$, it is negligible compared to the other terms. Finally, with the choice of $h \propto \ell^{-1/(M+2\alpha)}$, we obtain the desired result:
\[
\mathbb{E}\bigl[|g_n(r_k(X)) - g^*(r_k(X))|^2\bigr] \le \frac{\tilde{C}_1}{h^M \ell} + \tilde{C}_2 h^{2\alpha} \le C \ell^{-2\alpha/(M+2\alpha)}
\]
for some $C > 0$ independent of $\ell$.
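As in the previous case, this rate follows from balancing the two remaining terms: with $h \propto \ell^{-1/(M+2\alpha)}$,
\[
\frac{1}{h^M \ell} \propto \ell^{\frac{M}{M+2\alpha} - 1} = \ell^{-\frac{2\alpha}{M+2\alpha}} \qquad \text{and} \qquad h^{2\alpha} \propto \ell^{-\frac{2\alpha}{M+2\alpha}} .
\]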

□


