Behavior of linear L2-boosting algorithms in the vanishing learning rate asymptotic


Clément Dombry∗ and Youssef Esstafa∗∗

December 20, 2020

Abstract

We investigate the asymptotic behaviour of gradient boosting algorithms when the learning rate converges to zero and the number of iterations is rescaled accordingly. We mostly consider L2-boosting for regression with linear base learner as studied in Bühlmann and Yu (2003) and analyze also a stochastic version of the model where subsampling is used at each step (Friedman, 2002). We prove a deterministic limit in the vanishing learning rate asymptotic and characterize the limit as the unique solution of a linear differential equation in an infinite dimensional function space. Besides, the training and test error of the limiting procedure are thoroughly analyzed. We finally illustrate and discuss our result on a simple numerical experiment where the linear L2-boosting operator is interpreted as a smoothed projection and time is related to its number of degrees of freedom.

Keywords: boosting, non parametric regression, statistical learning, stochastic algorithm, Markov chain, convergence of stochastic process.

Mathematics subject classification: 62G08, 60J20.

∗Université Bourgogne Franche-Comté, Laboratoire de Mathématiques de Besançon UMR 6623, CNRS, F-25000 Besançon, France. Email: clement.dombry@univ-fcomte.fr

∗∗ENSAI, Campus de Ker-Lann, 51 Rue Blaise Pascal, BP 37203 - 35172 Bruz Cedex, France. Email: youssef.esstafa@ensai.fr


Contents

1 Introduction
2 L2-boosting with linear base learner
  2.1 Framework
  2.2 The vanishing learning rate asymptotic
  2.3 Training and test error
3 Stochastic gradient boosting
  3.1 Framework
  3.2 Convergence of finite dimensional distributions
  3.3 Weak convergence in function space
4 Numerical illustration
5 Proofs
  5.1 Proofs for Section 2
  5.2 Proofs for Section 3

1 Introduction

In the past decades, boosting has become a major and powerful prediction method in machine learning. The success of the classification algorithm AdaBoost by Freund and Schapire (1999) demonstrated the possibility of combining many weak learners sequentially in order to produce better predictions, with widespread applications in gene expression (Dudoit et al., 2002) or music genre identification (Bergstra et al., 2006), to name only a few. Friedman et al. (2000) recognized a wider statistical framework that led to gradient boosting (Friedman, 2001), where a weak learner (e.g., regression trees) is used to optimize a loss function in a sequential procedure akin to gradient descent. Choosing the loss function according to the statistical problem at hand results in a versatile and efficient tool that can handle classification, regression, quantile regression or survival analysis, among others.

The popularity of gradient boosting is also due to its efficient implementation in the R package gbm by Ridgeway (2007).

Along with the methodological developments, strong theoretical results have justified the good performance of boosting. Consistency of boosting algorithms, i.e. their ability to achieve the optimal Bayes error rate for large samples, is considered in Breiman (2004), Zhang and Yu (2005) or Bartlett and Traskin (2007). The present paper is strongly influenced by Bühlmann and Yu (2003), which proposes an analysis of regression boosting algorithms built on linear base learners thanks to explicit formulas for the boosted predictor and its error rate.

In this paper, we focus on gradient boosting for regression with square loss and we briefly describe the corresponding algorithm. Consider a regression model

\[ Y = f(X) + \varepsilon \tag{1} \]

where the response $Y$ is real-valued, the predictor $X$ takes values in $[0,1]^p$, the regression function $f : [0,1]^p \to \mathbb{R}$ is measurable and the error $\varepsilon$ is centered, square integrable and independent of $X$. Based on a sample $(Y_i, X_i)_{1\le i\le n}$ of independent observations of the regression model (1), we aim at estimating the regression function $f$. Given a weak learner $L(x) = L(x; (Y_i, X_i)_{1\le i\le n})$, the boosting algorithm with learning rate $\lambda \in (0,1)$ produces a sequence of models $\hat F_m^\lambda(x)$, $m \ge 0$, by recursively fitting the weak learner to the current residuals and updating the model with a shrunken version of the fitted model.

More formally, we define $\hat F_0^\lambda(x) = \bar Y_n$ and
\[ \hat F_{m+1}^\lambda(x) = \hat F_m^\lambda(x) + \lambda\, L\big(x; (R_{m,i}^\lambda, X_i)_{1\le i\le n}\big), \quad m \ge 0, \tag{2} \]
where $\bar Y_n$ denotes the empirical mean of $(Y_i)_{1\le i\le n}$ and $(R_{m,i}^\lambda)_{1\le i\le n}$ the residuals
\[ R_{m,i}^\lambda = Y_i - \hat F_m^\lambda(X_i), \quad 1 \le i \le n. \]
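For concreteness, the following minimal Python sketch implements recursion (2) for a generic linear smoother. The Gaussian Nadaraya-Watson learner, the synthetic data and all parameter values (bandwidth, learning rate, number of iterations) are illustrative assumptions, not the setting studied below.

```python
import numpy as np

def nw_smoother(x_train, h):
    """Return a function L(residuals, x_query) implementing a Nadaraya-Watson
    base learner with Gaussian kernel and bandwidth h (illustrative choice)."""
    def L(residuals, x_query):
        # kernel weights between query points and training points
        K = np.exp(-0.5 * ((x_query[:, None] - x_train[None, :]) / h) ** 2)
        return (K @ residuals) / K.sum(axis=1)
    return L

def l2_boost(x_train, y_train, base_learner, lam=0.01, n_iter=1000):
    """L2-boosting recursion (2): start from the empirical mean and repeatedly
    add a shrunken fit of the base learner to the current residuals."""
    n = len(y_train)
    fit_train = np.full(n, y_train.mean())      # initialization at the empirical mean
    for _ in range(n_iter):
        residuals = y_train - fit_train         # current residuals R_{m,i}
        fit_train = fit_train + lam * base_learner(residuals, x_train)
    return fit_train

# toy data (hypothetical): y = sin(2*pi*x) + noise on [0, 1]
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 50))
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=50)
fitted = l2_boost(x, y, nw_smoother(x, h=0.1), lam=0.01, n_iter=2000)
```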

In practice, the shrinkage parameter $\lambda$ and the number of iterations $m$ are the main parameters and must be chosen suitably to achieve good performance. Common practice is to fix $\lambda$ to a small value, typically $\lambda = 0.01$ or $0.001$, and then to select $m$ by cross-validation. Citing Ridgeway (2007), with slight modifications to match our notations:

"The issues that most new users of gbm struggle with are the choice of tree numbersmand shrinkageλ. It is important to know that smaller values of λ (almost) always give improved predictive performance. That is, setting λ = 0.001 will almost certainly result in a model with better out-of-sample predictive performance than setting λ = 0.01. However, there are computational costs, both storage and CPU time, associated with setting shrinkage to be low. The model with λ = 0.001 will likely require ten times as many iterations as the model with λ = 0.01, increasing storage and computation time by a factor of 10."

This citation clearly emphasizes the role of small learning rates in boosting. The purpose of the present paper is to prove the existence of a vanishing learning rate limit ($\lambda \to 0$) for the boosting algorithm when the number of iterations is rescaled accordingly. To the best of our knowledge, this is the first result in this direction. More precisely, in the case when the base learner is linear, we prove the existence of the limit

\[ \hat F_t(x) = \lim_{\lambda \downarrow 0} \hat F_{[t/\lambda]}^\lambda(x), \quad t \ge 0. \tag{3} \]

We furthermore characterize the limit as the solution of a linear differential equation in an infinite dimensional space and also analyze the corresponding training and test errors. The case of stochastic gradient boosting (Friedman, 2002), where subsampling is introduced at each iteration, is also analyzed: we prove the existence of a deterministic vanishing learning rate limit that corresponds to a modified deterministic base learner defined in a natural way.

The analysis of this stochastic framework requires involved tools from Markov chain theory and the characterization of their convergence through generators (Ethier and Kurtz, 1986; Stroock and Varadhan, 2006). A limitation of our work is the strong assumption of linearity of the base learner: the ubiquitous regression tree does not satisfy this assumption and further work is needed to deal with this important case. Our results are of a probabilistic nature: we focus on the existence and properties of the limit (3) for fixed sample size $n \ge 1$, while statistical issues such as consistency as $n \to \infty$ are left aside for further research.

The paper is structured as follows. In Section 2, we prove the existence of the vanishing learning rate limit (3) for the boosting procedure with linear base learner (Proposition 2.5), we characterize the limit as the solution of a linear differential equation in a function space (Theorem 2.7) and we analyze the training and test errors (Propositions 2.12 and 2.13). The stochastic gradient boosting where subsampling is introduced at each step is considered in Section 3. We prove that the vanishing learning rate limit still exists and that the convergence holds in quadratic mean (Corollary 3.5) and also in the sense of functional weak convergence in Skorokhod space (Theorem 3.6). A simple numerical experiment is presented in Section 4 in order to illustrate our theoretical findings, leading us to the interpretation of linear L2-boosting as a smoothed projection where time is related to the degrees of freedom of the linear boosting operator. All the technical proofs are gathered in Section 5.


2 L2-boosting with linear base learner

2.1 Framework

We consider the framework of boosting for regression with $L_2$-loss and linear base learner provided by Bühlmann and Yu (2003). This framework allows for explicit computations relying on linear algebra. The regression design is assumed deterministic, or equivalently, we formulate our results conditionally on the predictor values $X_i = x_i$, $i = 1, \dots, n$. The space of measurable and bounded functions on $[0,1]^p$ is denoted by $L^\infty = L^\infty([0,1]^p, \mathbb{R})$. Our main hypothesis is the following linearity assumption on the base learner $L$.

Assumption 2.1. We assume that the base learner of the boosting algorithm (2) satisfies
\[ L\big(x; (x_i, Y_i)_{1\le i\le n}\big) = \sum_{j=1}^{n} Y_j\, g_j(x), \quad x \in [0,1]^p, \tag{4} \]
where $g_1, \dots, g_n \in L^\infty$ may depend on $(x_i)_{1\le i\le n}$.

It follows from Assumption 2.1 that $g_j$ is the output of the base learner for input $(Y_i)_{1\le i\le n} = (\delta_{ij})_{1\le i\le n}$, where the Kronecker symbol $\delta_{ij}$ is equal to 1 if $i = j$ and 0 otherwise.

Under Assumption 2.1, the boosting algorithm with input $(Y_i, x_i)_{1\le i\le n}$ and learning rate $\lambda \in (0,1)$ outputs a sequence of bounded functions $(\hat F_m^\lambda)_{m\ge 1}$. The sequence remains in the finite dimensional linear space spanned in $L^\infty$ by the functions $g_1, \dots, g_n$ and the constant functions (due to the initialization equal to the constant function $\bar Y_n$). A straightforward recursion based on Equation (2) yields

\[ \hat F_m^\lambda(x) = \bar Y_n + \sum_{i=1}^{n} w_{m,i}^\lambda\, g_i(x), \tag{5} \]
where the weights $w_m^\lambda = (w_{m,i}^\lambda)_{1\le i\le n}$ satisfy
\[ \begin{cases} w_{0,i}^\lambda \equiv 0, \\ w_{m+1,i}^\lambda = w_{m,i}^\lambda + \lambda (Y_i - \bar Y_n) - \lambda \sum_{j=1}^{n} w_{m,j}^\lambda\, g_j(x_i). \end{cases} \]
This linear recursion system can be rewritten in vector form as
\[ \begin{cases} w_0^\lambda \equiv 0, \\ w_{m+1}^\lambda = (I - \lambda S)\, w_m^\lambda + \lambda \tilde Y, \end{cases} \tag{6} \]


with $S = (g_j(x_i))_{1\le i,j\le n}$, $\tilde Y = (Y_i - \bar Y_n)_{1\le i\le n}$ the centered observations and $I$ the $n\times n$ identity matrix. This linear recursion is easily solved, yielding the following proposition.

Proposition 2.2. Under Assumption 2.1, the boosting algorithm output $\hat F_m^\lambda$ is given by Equation (5) with weights
\[ w_m^\lambda = \lambda \sum_{j=0}^{m-1} (I - \lambda S)^j\, \tilde Y, \quad m \ge 0. \tag{7} \]
If the matrix $S$ is invertible, then
\[ w_m^\lambda = S^{-1}\big[I - (I - \lambda S)^m\big]\, \tilde Y, \quad m \ge 0. \]

Note that this result is similar to Proposition 1 in Bühlmann and Yu (2003), but they consider only the values on the observed sample $(x_i)_{1\le i\le n}$ while we provide the extrapolation to $x \in [0,1]^p$ more explicitly. We also consider a different initialization, to the empirical mean instead of zero, which seems more relevant in practice.
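As a numerical sanity check of Proposition 2.2, the sketch below compares the weights produced by the vector recursion (6) with the closed form involving $S^{-1}[I - (I - \lambda S)^m]$; the matrix $S$ and the centered observations are arbitrary illustrative choices.

```python
import numpy as np

def boosting_weights_recursive(S, Y_tilde, lam, m):
    """Iterate the vector recursion (6): w_{m+1} = (I - lam*S) w_m + lam*Y_tilde."""
    w = np.zeros_like(Y_tilde)
    for _ in range(m):
        w = w - lam * (S @ w) + lam * Y_tilde
    return w

def boosting_weights_closed_form(S, Y_tilde, lam, m):
    """Closed form (7) when S is invertible: w_m = S^{-1}[I - (I - lam*S)^m] Y_tilde."""
    n = len(Y_tilde)
    A = np.eye(n) - lam * S
    return np.linalg.solve(S, (np.eye(n) - np.linalg.matrix_power(A, m)) @ Y_tilde)

# random invertible S and centered observations (purely illustrative)
rng = np.random.default_rng(1)
n = 20
S = np.eye(n) + 0.1 * rng.normal(size=(n, n))
Y_tilde = rng.normal(size=n)
w_rec = boosting_weights_recursive(S, Y_tilde, lam=0.01, m=500)
w_cf = boosting_weights_closed_form(S, Y_tilde, lam=0.01, m=500)
print(np.max(np.abs(w_rec - w_cf)))  # should be numerically zero
```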

Example 2.3. A simple example satisfying Assumption 2.1 is the Nadaraya-Watson estimator (see Nadaraya (1964) and Watson (1964))
\[ L(x) = \frac{\sum_{i=1}^{n} K_h(x - x_i)\, Y_i}{\sum_{i=1}^{n} K_h(x - x_i)}, \quad x \in [0,1]^p, \]
where $h > 0$ is the bandwidth, $K : \mathbb{R}^p \to (0, +\infty)$ is the kernel, i.e. a density function, and $K_h(z) = h^{-p} K(z/h)$ the rescaled kernel.
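For Example 2.3, the functions $g_j$ and the matrix $S = (g_j(x_i))$ are explicit; the following sketch computes them for a Gaussian kernel (an illustrative choice of $K$) and checks that $\sum_j g_j(x) \equiv 1$, a property that reappears in Assumption 2.10 below.

```python
import numpy as np

def nw_S_matrix(x_train, h):
    """Matrix S = (g_j(x_i)) for the Nadaraya-Watson base learner with a
    Gaussian kernel: g_j(x) = K_h(x - x_j) / sum_k K_h(x - x_k)."""
    d2 = (x_train[:, None] - x_train[None, :]) ** 2
    K = np.exp(-0.5 * d2 / h**2)            # normalizing constant cancels in the ratio
    return K / K.sum(axis=1, keepdims=True)

x = np.linspace(0.0, 1.0, 30)
S = nw_S_matrix(x, h=0.1)
print(np.allclose(S.sum(axis=1), 1.0))      # rows sum to one: sum_j g_j(x) = 1
```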

Example 2.4. A more involved example of base learner, discussed in Bühlmann and Yu (2003), Section 3.2, is the smoothing spline in dimension $p = 1$. For $r \ge 1$ and $\nu > 0$, the smoothing spline $L$ is the unique minimizer over $W_2^{(r)}$ of the penalized criterion
\[ \sum_{i=1}^{n} \big(Y_i - L(x_i)\big)^2 + \nu \int_0^1 \big(L^{(r)}(x)\big)^2\, dx, \]
where $W_2^{(r)}$ denotes the Sobolev space of functions that are continuously differentiable of order $r-1$ with square integrable weak derivative of order $r$. Assuming $0 < x_1 < \dots < x_n < 1$, the solution is known to be a piecewise polynomial function of degree $r+1$ with constant derivative of order $r+1$ on the $n+1$ intervals $(0, x_1), \dots, (x_n, 1)$. It is used in Bühlmann and Yu (2003) that the matrix $S$ is symmetric positive definite with eigenvalues $1 = \mu_1 = \dots = \mu_r > \dots > \mu_n > 0$, see Wahba (1990).


2.2 The vanishing learning rate asymptotic

We next consider the existence of a limit in the vanishing learning rate asymptotic $\lambda \to 0$. The explicit simple formulas from Proposition 2.2 allow for a simple analysis. We recall that the exponential of a square matrix $M$ is defined by
\[ \exp(M) = \sum_{k \ge 0} \frac{1}{k!}\, M^k. \]

Proposition 2.5. Under Assumption 2.1, as $\lambda \to 0$, we have
\[ \hat F_{[t/\lambda]}^\lambda(x) \longrightarrow \hat F_t(x), \quad t \ge 0,\ x \in [0,1]^p, \tag{8} \]
uniformly on compact sets $[0,T] \times [0,1]^p$, $T > 0$, where the limit satisfies
\[ \hat F_t(x) = \bar Y_n + \sum_{i=1}^{n} w_{t,i}\, g_i(x) \tag{9} \]
with weights $w_t = (w_{t,i})_{1\le i\le n}$ given by
\[ w_t = -\sum_{j \ge 1} \frac{(-t)^j}{j!}\, S^{j-1}\, \tilde Y, \quad t \ge 0. \tag{10} \]
If the matrix $S$ is invertible, then
\[ w_t = S^{-1}\big(I - e^{-tS}\big)\, \tilde Y, \quad t \ge 0. \tag{11} \]

The formulas are even more explicit in the case when $S$ is a symmetric matrix because it can then be diagonalized in an orthonormal basis of eigenvectors.

Corollary 2.6. Suppose Assumption 2.1 is satisfied and $S = (g_j(x_i))_{1\le i,j\le n}$ is a symmetric matrix. Denote by $(\mu_j)_{1\le j\le n}$ the eigenvalues of $S$ and by $(u_j)_{1\le j\le n}$ the corresponding eigenvectors. Then the vanishing learning rate asymptotic yields the weights
\[ w_t = \sum_{j=1}^{n} \frac{1 - e^{-\mu_j t}}{\mu_j}\, u_j u_j^T\, \tilde Y \]
and the limit
\[ \hat F_t(x) = \bar Y_n + \sum_{1\le i,j\le n} \frac{1 - e^{-\mu_j t}}{\mu_j}\, \big(v_i^T u_j u_j^T \tilde Y\big)\, g_i(x), \tag{12} \]
with $(v_i)_{1\le i\le n}$ the canonical basis of $\mathbb{R}^n$. When $\mu = 0$, we use the extension by continuity, that is the convention $(1 - e^{-\mu t})/\mu = t$.
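Proposition 2.5 and Corollary 2.6 can be illustrated numerically: with a symmetric positive definite matrix $S$ (here a Gaussian kernel Gram matrix, an illustrative choice), the finite learning rate weights (7) approach the limiting weights computed from the eigendecomposition as $\lambda$ decreases.

```python
import numpy as np

def weights_finite_lambda(S, Y_tilde, lam, m):
    """Closed form (7), computed iteratively: lam * sum_{j=0}^{m-1} (I - lam*S)^j Y_tilde."""
    w = np.zeros_like(Y_tilde)
    v = Y_tilde.copy()
    for _ in range(m):
        w += lam * v
        v = v - lam * (S @ v)
    return w

def weights_limit(S, Y_tilde, t):
    """Corollary 2.6: w_t = sum_j (1 - e^{-mu_j t}) / mu_j * u_j u_j^T Y_tilde,
    with the continuity convention (coefficient t) for numerically tiny mu_j."""
    mu, U = np.linalg.eigh(S)
    safe = np.where(mu > 1e-12, mu, 1.0)
    coef = np.where(mu > 1e-12, (1.0 - np.exp(-mu * t)) / safe, t)
    return U @ (coef * (U.T @ Y_tilde))

# illustrative symmetric positive definite S and centered observations
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 25))
S = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.2) ** 2)
Y_tilde = rng.normal(size=25)

t = 3.0
for lam in (0.1, 0.01, 0.001):
    gap = np.max(np.abs(weights_finite_lambda(S, Y_tilde, lam, int(t / lam))
                        - weights_limit(S, Y_tilde, t)))
    print(lam, gap)   # the gap shrinks as the learning rate vanishes
```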


Interestingly, the limit function $(\hat F_t)_{t\ge 0}$ appearing in the vanishing learning rate asymptotic can be characterized as the solution of a linear differential equation in infinite dimensional space. The intuition is quite clear from the following heuristic: the boosting dynamic
\[ \hat F_{m+1}^\lambda = \hat F_m^\lambda + \lambda \sum_{i=1}^{n} \big(Y_i - \hat F_m^\lambda(x_i)\big)\, g_i \]
implies, for $t = \lambda m$,
\[ \lambda^{-1}\Big( \hat F_{[(t+\lambda)/\lambda]}^\lambda - \hat F_{[t/\lambda]}^\lambda \Big) = \sum_{i=1}^{n} \big(Y_i - \hat F_{[t/\lambda]}^\lambda(x_i)\big)\, g_i. \]
Letting $\lambda \to 0$, the convergence $\hat F_{[t/\lambda]}^\lambda \to \hat F_t$ suggests
\[ \hat F_t' = \sum_{i=1}^{n} \big(Y_i - \hat F_t(x_i)\big)\, g_i. \]

We make this heuristic rigorous in the following proposition. For $t \ge 0$, we consider $\hat F_t$ as an element of the Banach space $L^\infty = L^\infty([0,1]^p, \mathbb{R})$ and prove that $(\hat F_t)_{t\ge 0}$ is the unique solution of a linear differential equation. More precisely, it is easily seen that the linear operator $L : L^\infty \to L^\infty$ defined by
\[ L(Z) = \sum_{i=1}^{n} Z(x_i)\, g_i, \quad Z \in L^\infty, \]
is bounded, and we consider the differential equation in the Banach space $L^\infty$
\[ Z'(t) = -L(Z(t)) + G, \quad t \ge 0, \tag{13} \]
with $G = \sum_{i=1}^{n} Y_i\, g_i$.

Theorem 2.7. i) For all $Z_0 \in L^\infty$, the differential equation (13) has a unique solution satisfying $Z(0) = Z_0$. Furthermore, if there exists $Y \in L^\infty$ such that $L(Y) = G$, this solution is explicitly given by
\[ Z(t) = e^{-tL} Z_0 + \big(\mathrm{Id} - e^{-tL}\big) Y, \quad t \ge 0. \tag{14} \]
ii) The function $(\hat F_t)_{t\ge 0}$ is the solution of (13) with initial condition $\bar Y_n$. Assuming there exists $Y \in L^\infty$ such that $L(Y) = G$, we thus have
\[ \hat F_t = e^{-tL}\, \bar Y_n + \big(\mathrm{Id} - e^{-tL}\big) Y, \quad t \ge 0. \]


Remark 2.8. The condition $L(Y) = G$ is satisfied as soon as $Y(x_i) = Y_i$, $1 \le i \le n$. In particular, it holds if the $x_i$'s are pairwise distinct. It is used mostly for convenience and elegance of notations. Indeed we have
\[ \big(\mathrm{Id} - e^{-tL}\big)(Y) = -\sum_{k \ge 1} \frac{(-t)^k}{k!}\, L^k(Y) = \sum_{k \ge 1} \frac{(-1)^{k-1} t^k}{k!}\, L^{k-1}(G) \]
and, if the existence of $Y$ is not granted, one can replace in formula (14) the term involving $Y$ by the series in the right hand side of the previous equation and check that this provides a solution of (13) in the general case.
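Restricted to the sample points, equation (13) reduces to the finite dimensional linear system $\frac{d}{dt}\hat F_t(x_i) = \sum_{j}(Y_j - \hat F_t(x_j))\,g_j(x_i)$. The sketch below checks this identity by finite differences, with an illustrative symmetric $S$ and synthetic data.

```python
import numpy as np
from scipy.linalg import expm

# illustrative symmetric S (Gaussian kernel Gram matrix) and synthetic observations
rng = np.random.default_rng(3)
n = 15
x = np.sort(rng.uniform(0, 1, n))
S = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.2) ** 2)
Y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n)
Y_tilde = Y - Y.mean()

def F_vec(t):
    """Values of the limit at the sample points: F_t(x_i) = Y_bar + [(I - e^{-tS}) Y_tilde]_i."""
    return Y.mean() + (np.eye(n) - expm(-t * S)) @ Y_tilde

t, dt = 1.5, 1e-5
lhs = (F_vec(t + dt) - F_vec(t - dt)) / (2 * dt)   # numerical derivative d/dt F_t(x_i)
rhs = S @ (Y - F_vec(t))                           # right hand side of (13) at the sample points
print(np.max(np.abs(lhs - rhs)))                   # close to zero
```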

Finally, we discuss the notion of stability of the boosting procedure. It requires that the output of the boosting algorithm does not explode for large time values.

Definition 2.9. The boosting algorithm is called stable if, for all possible input $(Y_i)_{1\le i\le n}$, the output $(\hat F_t)_{t\ge 0}$ remains uniformly bounded as $t \to \infty$.

It is here convenient to assume the following:

Assumption 2.10. In Equation (4), the functions $(g_i)_{1\le i\le n}$ are linearly independent and such that $\sum_{i=1}^{n} g_i(x) \equiv 1$.

The linear independence is sensible if the points $(x_i)_{1\le i\le n}$ are pairwise distinct. The constant sum implies that for constant input $Y_i = 1$, $1 \le i \le n$, the output $L(x) \equiv 1$ is also constant. Both are mild assumptions satisfied by most learners in practice.

The stability can be characterized in terms of the Jordan normal form of the matrix $S$, see for instance Horn and Johnson (2013). We recall that the Jordan normal form of $S$ is a block diagonal matrix where each block, called a Jordan block, is an upper triangular matrix of size $s$ with a complex eigenvalue $\mu$ on the main diagonal and ones on the superdiagonal. The matrix can be diagonalized if and only if all its Jordan blocks have size 1.

Proposition 2.11. Suppose Assumptions 2.1 and 2.10 are satisfied. Then the boosting algorithm is stable if and only if every block of the Jordan normal form of $S$ satisfies one of the following:

- the eigenvalue has a positive real part;

- the eigenvalue has a null real part and the block has size 1.

In particular, if S is symmetric, the boosting procedure is stable if and only if all the eigenvalues of S are non-negative.


2.3 Training and test error

We next consider the performance of the boosting regression algorithm in terms of the $L_2$-loss, also known as the mean squared error. We focus mostly on the vanishing learning rate asymptotic, although versions of the results below could be derived for a positive learning rate $\lambda$.

The training error is assessed on the training set used to fit the boosting predictor and compares the observations $Y_i$ to their predicted values $\hat F_t(x_i)$, i.e.
\[ \mathrm{err}_{\mathrm{train}}(t) = \frac{1}{n} \sum_{i=1}^{n} \big(Y_i - \hat F_t(x_i)\big)^2. \tag{15} \]
The generalization capacity of the algorithm is assessed on new observations that are not used during the fitting procedure. For test observations $(Y_i', X_i')_{1\le i\le n'}$, independent of the training sample, the test error is defined by
\[ \mathrm{err}_{\mathrm{test}}(t) = \frac{1}{n'} \sum_{i=1}^{n'} \big(Y_i' - \hat F_t(X_i')\big)^2. \tag{16} \]
We also consider a simpler version of the test error where extrapolation in the feature space is not evaluated and we take $n' = n$ and $X_i' = x_i$. Then the test error writes
\[ \mathrm{err}_{\mathrm{test}}(t) = \frac{1}{n} \sum_{i=1}^{n} \big(Y_i' - \hat F_t(x_i)\big)^2, \tag{17} \]
and allows for simpler formulas with nice interpretation.

We first consider the behavior of the training error as defined in Equation (15). Note that
\[ \mathrm{err}_{\mathrm{train}}(t) = \frac{1}{n}\, \|R_t\|^2, \]
where $R_t$ is the vector of residuals at time $t$ defined by
\[ R_t = \big(Y_i - \hat F_t(x_i)\big)_{1\le i\le n}, \quad t \ge 0, \]
and $\|\cdot\|$ denotes the Euclidean norm on $\mathbb{R}^n$. Furthermore, Proposition 2.5 implies $R_t = e^{-tS}\, \tilde Y$, $t \ge 0$, so that
\[ \mathrm{err}_{\mathrm{train}}(t) = \frac{1}{n}\, \|e^{-tS}\, \tilde Y\|^2, \quad t \ge 0. \]

The following proposition is related to Proposition 3 and Theorem 1 in Bühlmann and Yu (2003).


Proposition 2.12. Suppose Assumptions 2.1 and 2.10 are satisfied.

i) We have $\lim_{t\to\infty} \mathrm{err}_{\mathrm{train}}(t) = 0$ for all possible input $(Y_i)_{1\le i\le n}$ if and only if all the eigenvalues of $S$ have a positive real part.

ii) The training error satisfies
\[ \mathbb{E}[\mathrm{err}_{\mathrm{train}}(t)] = \mathrm{bias}^2(t) + \mathrm{var}_{\mathrm{train}}(t), \tag{18} \]
\[ \mathrm{bias}^2(t) = \frac{1}{n}\, \|e^{-tS} \tilde f\|^2, \qquad \mathrm{var}_{\mathrm{train}}(t) = \frac{\sigma^2}{n}\, \mathrm{Trace}\big(e^{-tS} J\, e^{-tS^T}\big), \]
with $J = I - \frac{1}{n}\mathbf{1}_n \mathbf{1}_n^T$, $\tilde f = f - \bar f\, \mathbf{1}_n$, $f = (f(x_i))_{1\le i\le n}$ and $\bar f = \frac{1}{n}\sum_{i=1}^{n} f(x_i)$.

iii) If $S$ is symmetric with positive eigenvalues $(\mu_i)_{1\le i\le n}$ and corresponding eigenvectors $(u_i)_{1\le i\le n}$,
\[ \mathbb{E}[\mathrm{err}_{\mathrm{train}}(t)] = \frac{1}{n} \sum_{i=1}^{n} (u_i^T \tilde f)^2 e^{-2t\mu_i} + \frac{\sigma^2}{n} \sum_{i=1}^{n} \|J u_i\|^2 e^{-2t\mu_i}. \]
The expected training error is strictly decreasing and converges to 0 exponentially fast as $t \to \infty$.

The convergence of the training error to zero implies that the boosting procedure is stable as considered in Proposition 2.11, but the converse is not true since some eigenvalues may have a real part equal to zero. When $S$ is symmetric positive definite, the expected training error converges exponentially fast to 0 (this was already proved in Bühlmann and Yu (2003), Theorem 1, for $\lambda > 0$), but this exponential rate of convergence has to be taken with care since $S$ may have very small eigenvalues, see the numerical illustration in Section 4.
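This slow decay is easy to observe from the formula $\mathrm{err}_{\mathrm{train}}(t) = \frac{1}{n}\|e^{-tS}\tilde Y\|^2$. In the sketch below, the Gaussian kernel Gram matrix used as $S$ and the synthetic data are illustrative; the smallest eigenvalues of $S$ are indeed tiny, so the tail of the decay is extremely slow.

```python
import numpy as np

# illustrative setup: Gaussian kernel Gram matrix as a symmetric S, synthetic data
rng = np.random.default_rng(4)
n = 30
x = np.sort(rng.uniform(0, 1, n))
S = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.15) ** 2)
Y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)
Y_tilde = Y - Y.mean()

mu, U = np.linalg.eigh(S)            # S symmetric: spectral decomposition
c = U.T @ Y_tilde
mu = np.clip(mu, 0.0, None)          # guard against tiny negative numerical eigenvalues

def err_train(t):
    """err_train(t) = (1/n) ||exp(-tS) Y_tilde||^2 = (1/n) sum_i (u_i^T Y_tilde)^2 exp(-2 t mu_i)."""
    return np.mean(c ** 2 * np.exp(-2 * t * mu))

print("smallest eigenvalues of S:", np.sort(mu)[:3])
for t in (0.0, 1.0, 10.0, 100.0, 1000.0):
    print(t, err_train(t))           # decreasing, but very slowly in the directions of small mu_i
```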

The fact that the residuals converge to zero suggests that the boosting procedure eventually overfits the training observations and loses generalization power. A simple analysis of this overfit is provided by the test error with fixed covariates $X_i' = x_i$, as defined by Equation (17). For the sake of simplicity, we emphasize the case when $S$ is symmetric.

Proposition 2.13. i) The test error with fixed covariates defined by Equation (17) satisfies
\[ \mathbb{E}[\mathrm{err}_{\mathrm{test}}(t)] = \mathrm{bias}^2(t) + \mathrm{var}_{\mathrm{test}}(t), \]
\[ \mathrm{bias}^2(t) = \frac{1}{n}\, \|e^{-tS} \tilde f\|^2, \qquad \mathrm{var}_{\mathrm{test}}(t) = \sigma^2 + \frac{\sigma^2}{n} + \frac{\sigma^2}{n}\, \mathrm{Trace}\big((I - e^{-tS})\, J\, (I - e^{-tS})^T\big). \]

ii) If $S$ is symmetric with positive eigenvalues $(\mu_i)_{1\le i\le n}$ and associated eigenvectors $(u_i)_{1\le i\le n}$,
\[ \mathrm{bias}^2(t) = \frac{1}{n} \sum_{i=1}^{n} (u_i^T \tilde f)^2 e^{-2t\mu_i}, \]
\[ \mathrm{var}_{\mathrm{test}}(t) = \sigma^2 + \frac{\sigma^2}{n} + \frac{\sigma^2}{n} \sum_{i=1}^{n} \|J u_i\|^2 \big(1 - e^{-t\mu_i}\big)^2, \]
so that the following properties hold:

- the squared bias is decreasing, convex and vanishes as $t \to \infty$;

- the variance is increasing, with limit $2\sigma^2$ as $t \to \infty$;

- the expected test error is decreasing in the neighborhood of zero, eventually increasing, with limit $2\sigma^2$ as $t \to \infty$.

We retrieve with explicit theoretical formulas the known behavior of boosting in practice: the choice of $t \ge 0$ is crucial in the bias/variance trade-off. Small values of $t$ lead to underfitting while overfitting appears for larger time values. In the early stage of the procedure, the bias decreases more rapidly than the variance increases, leading to a reduced test error. In practice, cross-validation and early stopping are used to estimate the test error and choose when to stop the boosting procedure, see Zhang and Yu (2005).
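The bias/variance trade-off of Proposition 2.13 is straightforward to evaluate numerically. In the sketch below the symmetric matrix $S$, the regression function $f$ and the noise level $\sigma^2$ are illustrative assumptions; the expected test error first decreases and then increases towards $2\sigma^2$.

```python
import numpy as np
from scipy.linalg import expm

# illustrative setup: symmetric S, regression function f and noise level sigma^2
rng = np.random.default_rng(5)
n = 40
x = np.sort(rng.uniform(0, 1, n))
S = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.1) ** 2)
f = np.sin(2 * np.pi * x)
sigma2 = 0.25
J = np.eye(n) - np.ones((n, n)) / n
f_tilde = J @ f

def expected_errors(t):
    """Squared bias, variance and expected test error (fixed covariates) at time t."""
    E = expm(-t * S)
    bias2 = np.sum((E @ f_tilde) ** 2) / n
    var = sigma2 + sigma2 / n + sigma2 / n * np.trace((np.eye(n) - E) @ J @ (np.eye(n) - E).T)
    return bias2, var, bias2 + var

for t in (0.1, 1.0, 5.0, 20.0, 100.0):
    b, v, e = expected_errors(t)
    print(f"t={t:6.1f}  bias2={b:.4f}  var={v:.4f}  test={e:.4f}")
# the expected test error first decreases and then increases towards 2 * sigma2
```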

Remark 2.14. When the boosting algorithm is initialized at $\hat F_0 = 0$ as in Bühlmann and Yu (2003), the expected training and test errors from Propositions 2.12 and 2.13 become
\[ \mathbb{E}[\mathrm{err}_{\mathrm{train}}(t)] = \frac{1}{n} \sum_{i=1}^{n} (u_i^T f)^2 e^{-2t\mu_i} + \frac{\sigma^2}{n} \sum_{i=1}^{n} \|u_i\|^2 e^{-2t\mu_i} \]
and
\[ \mathbb{E}[\mathrm{err}_{\mathrm{test}}(t)] = \frac{1}{n} \sum_{i=1}^{n} (u_i^T f)^2 e^{-2t\mu_i} + \sigma^2 + \frac{\sigma^2}{n} \sum_{i=1}^{n} \|u_i\|^2 e^{-2t\mu_i}. \]
These values are always larger than those obtained with the initialization $\hat F_0 = \bar Y_n$, whence we recommend initialization to the empirical mean.

When the test error includes extrapolation in the predictor space (i.e. the new test observations $(Y_i', X_i')_{1\le i\le n'}$ are i.i.d. and independent of the training observations, as in Equation (16)), the formula we obtain for its expectation is more difficult to analyze.


Proposition 2.15. Assume $S$ is symmetric with positive eigenvalues. The test error defined by Equation (16) has expectation
\[ \mathbb{E}[\mathrm{err}_{\mathrm{test}}(t)] = \frac{n+1}{n}\,\sigma^2 + \mathbb{E}\Big[\Big(f(X') - \bar f - \tilde f^{\,T} S^{-1}\big(I - e^{-tS}\big)\, g(X')\Big)^{2}\Big] + \sigma^2\, \mathbb{E}\Big[g(X')^T \big(I - e^{-tS}\big)\, S^{-1} J\, S^{-1} \big(I - e^{-tS}\big)\, g(X')\Big], \]
with $g(X') = (g_i(X'))_{1\le i\le n}$, where $X'$ denotes a generic test covariate.

3 Stochastic gradient boosting

Following Friedman (2002), it is common practice to use a stochastic version of the boosting algorithm where subsampling is introduced at each step of the procedure. The package gbm by Ridgeway (2007) uses a subsampling rate equal to 50% by default, meaning that each step involves only a subsample with half of the observations, randomly chosen. This subsampling is known to have a regularization effect and we consider in this section the existence of the vanishing learning rate limit for such stochastic boosting algorithms.

3.1 Framework

We consider the following stochastic boosting algorithm that encompasses stochastic gradient boosting, see Example 3.2 below. We assume the weak learner $L(x) = L(x; (x_i, y_i)_{1\le i\le n}, \xi)$ depends on the observations $(x_i, y_i)_{1\le i\le n}$ and on an external source of randomness $\xi$ with a finite set $\Xi$ of possible values. We define the stochastic boosting algorithm by the recursion
\[ \hat F_0^\lambda(x) = \bar Y_n, \]
\[ \hat F_{m+1}^\lambda(x) = \hat F_m^\lambda(x) + \lambda\, L\big(x; (R_{m,i}^\lambda, X_i)_{1\le i\le n}, \xi_{m+1}\big), \quad m \ge 0, \tag{19} \]
where $\xi_m$, $m \ge 1$, are i.i.d. $\Xi$-valued random variables independent of $(X_i, Y_i)_{1\le i\le n}$ and $R_{m,i}^\lambda = Y_i - \hat F_m^\lambda(X_i)$, $1 \le i \le n$, are the residuals.

Assumption 3.1. We assume that the base learner of the stochastic boosting algorithm (19) satisfies
\[ L\big(x; (x_i, Y_i)_{1\le i\le n}, \xi\big) = \sum_{j=1}^{n} Y_j\, g_j(x, \xi), \quad x \in [0,1]^p, \]
where $g_1, \dots, g_n \in L^\infty$ may depend on $(x_i)_{1\le i\le n}$ and $\xi \in \Xi$.

We assume that Ξ is finite mostly for simplicity and also because it is enough to cover two particularly important cases.


Example 3.2. Starting from a base learner $L$ satisfying Assumption 2.1 (with $n$ replaced by $[sn]$) and applying stochastic subsampling (Friedman, 2002), we obtain a stochastic setting that satisfies Assumption 3.1. Let the sample size $n \ge 1$ be fixed and consider subsampling with rate $s \in (0,1)$, e.g. $s = 50\%$. Define $\Xi$ as the set of all subsets $\xi$ of $\{1, \dots, n\}$ with fixed size $[sn]$. Note that $\Xi$ is finite with cardinality $\binom{n}{[sn]}$. The learner $L$ fitted on the subsample $\xi \in \Xi$ is written
\[ L\big(x; (x_i, Y_i)_{1\le i\le n}, \xi\big) = L\big(x; (x_i, Y_i)_{i \in \xi}\big). \]
We use here a mild abuse of notation: in the left hand side, $L$ denotes the randomized learner, the sample size is $n$ and subsampling is introduced by $\xi$; in the right hand side, $L$ denotes the deterministic base learner and the sample size is $[sn]$. Stochastic boosting corresponds to Algorithm (19) with the sequence $(\xi_m)_{m\ge 1}$ uniformly distributed on $\Xi$, which corresponds to uniform subsampling.
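A minimal sketch of Example 3.2, assuming a Gaussian Nadaraya-Watson base learner and a 50% subsampling rate (both illustrative choices): at each iteration the learner is fitted on the residuals of a randomly chosen half of the observations only.

```python
import numpy as np

def nw_fit(x_fit, y_fit, x_query, h=0.1):
    """Nadaraya-Watson base learner with Gaussian kernel (illustrative choice)."""
    K = np.exp(-0.5 * ((x_query[:, None] - x_fit[None, :]) / h) ** 2)
    return (K @ y_fit) / K.sum(axis=1)

def stochastic_l2_boost(x, y, lam=0.01, n_iter=2000, rate=0.5, seed=0):
    """Stochastic L2-boosting (19): at each step the base learner is fitted on a
    random subsample of size [rate * n] of the current residuals."""
    rng = np.random.default_rng(seed)
    n, k = len(y), int(rate * len(y))
    fit = np.full(n, y.mean())
    for _ in range(n_iter):
        xi = rng.choice(n, size=k, replace=False)     # uniform subsample xi in Xi
        residuals = y - fit
        fit = fit + lam * nw_fit(x[xi], residuals[xi], x)
    return fit

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 1, 60))
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=60)
fit = stochastic_l2_boost(x, y, lam=0.005, n_iter=4000, rate=0.5)
```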

Example 3.3. Another important example covered by the stochastic boosting algorithm (19) is the design of additive models. The idea is to provide an approximation of the regression function $f(x)$, $x \in [0,1]^p$, by an additive model of the form $f_1(x^{(1)}) + \dots + f_p(x^{(p)})$, where $x^{(j)}$ denotes the $j$th component of $x$ and $f_j$ the principal effect of $x^{(j)}$. Such an additive model does not include interactions between different components. Assume that a base learner $L$ with one-dimensional covariate space $[0,1]$ is given and that $L$ satisfies Assumption 2.1 with $p = 1$. For instance, $L$ can be a smoothing spline as in Example 2.4, see Bühlmann and Yu (2003), Section 4. We consider stochastic regression boosting where the base learner $L$ is sequentially applied with a randomly chosen predictor. Formally, set
\[ L\big(x; (x_i, Y_i)_{1\le i\le n}, \xi\big) = L\big(x; (x_i^{(\xi)}, Y_i)_{1\le i\le n}\big), \quad \xi = 1, \dots, p. \]
It is easily checked that the learner in the left hand side satisfies Assumption 3.1 and that algorithm (19) with $(\xi_m)_{m\ge 1}$ uniformly distributed on $\Xi = \{1, \dots, p\}$ outputs a sequence of additive models. This strategy is often used with a more involved procedure where, at each step, the $p$ different possible predictors are considered and the best one is kept, see Bühlmann and Yu (2003), Section 4. But this falls beyond Assumption 3.1 because choosing the optimal component is not a linear operation, and the randomized choice proposed here is a sensible alternative satisfying Assumption 3.1.

3.2 Convergence of finite dimensional distributions

For fixed input $(Y_i, x_i)_{1\le i\le n}$, the stochastic boosting algorithm (19) provides a sequence of stochastic processes $\hat F_m^\lambda$, $m \ge 1$, and we consider the vanishing learning rate limit (3) under Assumption 3.1. We first prove convergence of the finite dimensional distributions thanks to elementary moment computations formulated in the next proposition. Expectation and variance are considered with respect to $(\xi_m)_{m\ge 1}$ while the input $(x_i, Y_i)_{1\le i\le n}$ is considered fixed, and we write $\mathbb{E}_\xi$ and $\mathrm{Var}_\xi$ to emphasize this. We define
\[ \bar g_j(x) = \mathbb{E}_\xi[g_j(x, \xi)], \quad x \in [0,1]^p,\ j = 1, \dots, n, \]
and
\[ S = \big(\bar g_j(x_i)\big)_{1\le i,j\le n}. \tag{20} \]
Note that $\bar g_1, \dots, \bar g_n$ are well-defined and belong to $L^\infty$ because $\Xi$ is finite, so that there are no measurability or integrability issues.

Proposition 3.4. Consider the boosting algorithm (19) under Assumption 3.1 and let the input $(Y_i, x_i)_{1\le i\le n}$ be fixed.

i) For $x \in [0,1]^p$ and $m \ge 0$,
\[ \mathbb{E}_\xi[\hat F_m^\lambda(x)] = \bar Y_n + \sum_{i=1}^{n} w_{m,i}^\lambda\, \bar g_i(x), \tag{21} \]
where $w_m^\lambda = (w_{m,i}^\lambda)_{1\le i\le n}$ is defined by (7) with $S$ given by (20).

ii) There exists a positive constant $K$ such that, for all $x \in [0,1]^p$, $m \ge 0$ and $\lambda < 1$,
\[ \mathrm{Var}_\xi[\hat F_m^\lambda(x)] \le K (m+1)\, \lambda^2 (1 + K\lambda)^m\, n\, \|\tilde Y\|^2 \Big(1 + (\lambda m K)^2\, e^{2\lambda m \|S\|}\Big), \]
where $\|\cdot\|$ denotes here the maximum norm on $\mathbb{R}^n$. We use the same notation for the infinity norm of $n\times n$ matrices.

As will be clear from the proof, the constant $K$ can be taken as $2M_1 + M_1^2 + (n+1)M_2$, where
\[ M_1 = \max_{1\le j\le n+1} \sum_{i=1}^{n} |g_i(x_j)| \tag{22} \]
and
\[ M_2 = \max_{1\le j\le n+1} \sum_{i=1}^{n} \mathrm{Var}_\xi[g_i(x_j)]. \tag{23} \]
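Proposition 3.4 i) states that, on average over the subsampling, the stochastic algorithm follows the deterministic dynamics driven by the averaged matrix (20). The sketch below (Nadaraya-Watson learner, 50% subsampling, all parameters illustrative) estimates this matrix by Monte Carlo and compares the resulting theoretical mean at the sample points with an empirical average over independent runs.

```python
import numpy as np

def nw_g_matrix(x_fit, x_query, h=0.1):
    """(len(x_query) x len(x_fit)) matrix of Nadaraya-Watson weights (Gaussian kernel)."""
    K = np.exp(-0.5 * ((x_query[:, None] - x_fit[None, :]) / h) ** 2)
    return K / K.sum(axis=1, keepdims=True)

rng = np.random.default_rng(7)
n, rate, lam, m = 30, 0.5, 0.02, 200
x = np.sort(rng.uniform(0, 1, n))
Y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)
Y_tilde = Y - Y.mean()
k = int(rate * n)

def g_matrix(subset):
    """Entry (i, j) is g_j(x_i, xi) for the learner fitted on the subsample; zero if j is not in it."""
    G = np.zeros((n, n))
    G[:, subset] = nw_g_matrix(x[subset], x)
    return G

# Monte Carlo estimate of the averaged matrix (20) (exact enumeration over Xi is infeasible)
S_bar = np.mean([g_matrix(rng.choice(n, k, replace=False)) for _ in range(2000)], axis=0)

# deterministic mean dynamics: weights follow recursion (6) with S replaced by the average
w = np.zeros(n)
for _ in range(m):
    w = w - lam * (S_bar @ w) + lam * Y_tilde
theory = Y.mean() + S_bar @ w                     # mean predictor (21) at the sample points

def one_run(seed):
    """One trajectory of the stochastic boosting algorithm (19) at the sample points."""
    r = np.random.default_rng(seed)
    fit = np.full(n, Y.mean())
    for _ in range(m):
        fit = fit + lam * g_matrix(r.choice(n, k, replace=False)) @ (Y - fit)
    return fit

mc_mean = np.mean([one_run(s) for s in range(300)], axis=0)
print(np.max(np.abs(mc_mean - theory)))           # small, up to Monte Carlo error
```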
