HAL Id: hal-00722969
https://hal.archives-ouvertes.fr/hal-00722969v3
Submitted on 1 Feb 2013
PAC-Bayesian Estimation and Prediction in Sparse
Additive Models
Benjamin Guedj, Pierre Alquier
To cite this version:
Benjamin Guedj, Pierre Alquier. PAC-Bayesian Estimation and Prediction in Sparse Additive Models. Electronic Journal of Statistics, Shaker Heights, OH: Institute of Mathematical Statistics, 2013, 7, pp. 264–291. DOI: 10.1214/13-EJS771. hal-00722969v3.
Electronic Journal of Statistics, Vol. 7 (2013) 264–291. ISSN: 1935-7524. DOI: 10.1214/13-EJS771.
PAC-Bayesian estimation and prediction
in sparse additive models
Benjamin Guedj∗
Laboratoire de Statistique Théorique et Appliquée, Université Pierre et Marie Curie - UPMC
Tour 25 - 2ème étage, boîte n°158
4, place Jussieu, 75252 Paris Cedex 05, France
e-mail: benjamin.guedj@upmc.fr
and
Pierre Alquier†
School of Mathematical Sciences, University College Dublin
Room 528 - James Joyce Library
Belfield, Dublin 4, Ireland
e-mail: pierre.alquier@ucd.ie
Abstract: The present paper is about estimation and prediction in high-dimensional additive models under a sparsity assumption (p ≫ n paradigm). A PAC-Bayesian strategy is investigated, delivering oracle inequalities in probability. The implementation is performed through recent outcomes in high-dimensional MCMC algorithms, and the performance of our method is assessed on simulated data.

AMS 2000 subject classifications: Primary 62G08, 62J02, 65C40.
Keywords and phrases: Additive models, sparsity, regression estimation, PAC-Bayesian bounds, oracle inequality, MCMC, stochastic search.

Received August 2012.

Contents
1 Introduction . . . 265
2 PAC-Bayesian prediction . . . 267
3 MCMC implementation . . . 272
4 Numerical studies . . . 273
5 Proofs . . . 276
Acknowledgements . . . 288
References . . . 288

∗Corresponding author.
†Research partially supported by the French “Agence Nationale pour la Recherche” under grant ANR-09-BLAN-0128 “PARCIMONIE”.
1. Introduction
Substantial progress has been achieved in recent years in estimating high-dimensional regression models. A thorough introduction to this dynamic field of contemporary statistics is provided by the recent monographs Hastie, Tibshirani and Friedman (2009); Bühlmann and van de Geer (2011). In the popular framework of linear and generalized linear models, the Lasso estimator introduced by Tibshirani (1996) immediately proved successful. Its theoretical properties have been extensively studied and its popularity has never wavered since then; see for example Bunea, Tsybakov and Wegkamp (2006); van de Geer (2008); Bickel, Ritov and Tsybakov (2009); Meinshausen and Yu (2009). However, even though numerous phenomena are well captured within this linear context, restraining high-dimensional statistics to this setting is unsatisfactory. To relax the strong assumptions required in the linear framework, one idea is to investigate a more general class of models, such as nonparametric regression models of the form Y = f(X) + W, where Y denotes the response, X the predictor and W a zero-mean noise. A good compromise between complexity and effectiveness is the additive model, which has been extensively studied and formalized for thirty years now. Amongst many other references, the reader is invited to refer to Stone (1985); Hastie and Tibshirani (1986, 1990); Härdle (1990). The core of this model is that the regression function is written as a sum of univariate functions, f = \sum_{i=1}^p f_i, easing its interpretation: each covariate's effect is assessed by a unique function. This class of nonparametric models is a popular setting in statistics, despite the fact that classical estimation procedures are known to perform poorly as soon as the number of covariates p exceeds the number of observations n.
In the present paper, our goal is to investigate a PAC-Bayesian-based prediction strategy in the high-dimensional additive framework (p ≫ n paradigm). In that context, estimation is essentially possible at the price of a sparsity assumption, i.e., most of the f_i functions are zero. More precisely, our setting is non-asymptotic. As empirical evidence of sparse representations accumulates, high-dimensional statistics are more and more coupled with a sparsity assumption, namely that the intrinsic dimension p_0 of the data is much smaller than p and n; see e.g. Giraud, Huet and Verzelen (2012). Additive modelling under a sparsity constraint has essentially been studied under the scope of the Lasso in Meier, van de Geer and Bühlmann (2009), Suzuki and Sugiyama (2012) and Koltchinskii and Yuan (2010), or of a combination of a functional grouped Lasso and a backfitting algorithm in Ravikumar et al. (2009). Those papers inaugurated the study of this problem and contain essential theoretical results consisting of asymptotic (see Meier, van de Geer and Bühlmann (2009); Ravikumar et al. (2009)) and non-asymptotic (see Suzuki and Sugiyama (2012); Koltchinskii and Yuan (2010)) oracle inequalities. The present article should be seen as a constructive contribution towards a deeper understanding of prediction problems in the additive framework. It should also be stressed that our work is an attempt to relax as much as possible the assumptions made on the model, such as restrictive conditions on the regressors' matrix, which we consider an unrealistic burden when it comes to prediction problems.
Our modus operandi will be based on PAC-Bayesian results, which is original in that context to our knowledge. The PAC-Bayesian theory originates in the two seminal papers Shawe-Taylor and Williamson (1997); McAllester (1999) and has been extensively formalized in the context of classification (see Catoni (2004, 2007)) and regression (see Audibert (2004a,b); Alquier (2006, 2008); Audibert and Catoni (2010, 2011)). However, the methods presented in these references are not explicitly designed to cover the high-dimensional setting under the sparsity assumption. Thus, the PAC-Bayesian theory has lately been worked out from the sparsity perspective, by Dalalyan and Tsybakov (2008, 2012); Alquier and Lounici (2011); Rigollet and Tsybakov (2012). The main message of these studies is that aggregation with a properly chosen prior is able to deal effectively with the sparsity issue. Interesting additional references addressing the aggregation outcomes are Rigollet (2006); Audibert (2009). The former aggregation procedures rely on an exponential weights approach, achieving good statistical properties. Our method should be seen as an extension of these techniques, particularly focused on the specificities of additive modelling. Contrary to procedures such as the Lasso, the Dantzig selector and other penalized methods, which are provably consistent under restrictive assumptions on the Gram matrix associated to the predictors, PAC-Bayesian aggregation requires only minimal assumptions on the model. Our method is supported by oracle inequalities in probability that are valid in both asymptotic and non-asymptotic settings. We also show that our estimators achieve the optimal rate of convergence over traditional smoothing classes such as Sobolev ellipsoids. It should be stressed that our work is inspired by Alquier and Biau (2011), which addresses the celebrated single-index model with similar tools and philosophy. Let us also mention that although the use of PAC-Bayesian techniques is original in this context, parallel work has been conducted in the deterministic design case by Suzuki (2012). A major difficulty when considering high-dimensional problems is to achieve a favorable compromise between statistical and computational performance.
The recent and thorough monograph Bühlmann and van de Geer (2011) shall provide the reader with valuable insights into this issue. The explicit implementation of PAC-Bayesian techniques remains unsatisfactory, as existing routines are only put to the test with small values of p (typically p < 100), at odds with the high-dimensional framework. In the meantime, as the solution of a convex problem, the Lasso proves computable for large values of p in reasonable amounts of time. We therefore focused on improving the computational aspect of our PAC-Bayesian strategy. Markov chain Monte Carlo (MCMC) techniques have proved increasingly popular in the Bayesian community, for they probably are the best way of sampling from potentially complex probability distributions. The reader willing to find a thorough introduction to such techniques is invited to refer to the comprehensive monographs Marin and Robert (2007); Meyn and Tweedie (2009). While Alquier and Biau (2011); Alquier and Lounici (2011) explore versions of the reversible jump MCMC method (RJMCMC) introduced by Green (1995), Dalalyan and Tsybakov (2008, 2012) investigate a Langevin Monte Carlo-based method; however, only a deterministic design is considered there. We shall try to overcome those limitations by considering adaptations of a recent procedure whose comprehensive description is to be found in Petralias (2010); Petralias and Dellaportas (2012). This procedure, called the Subspace Carlin and Chib algorithm, originates in the seminal paper by Carlin and Chib (1995) and is close in philosophy to Hans, Dobra and West (2007), as it favors local moves for the Markov chain. We provide numerical evidence on simulated data that our method is computationally efficient.
The paper is organized as follows. Section 2 presents our PAC-Bayesian prediction strategy in additive models; in particular, it contains the main theoretical results of this paper, which consist of oracle inequalities. Section 3 is devoted to the implementation of our procedure, and the numerical experiments on simulated data are presented in Section 4. Finally, and for the sake of clarity, proofs have been postponed to Section 5.
2. PAC-Bayesian prediction
Let (Ω, A, P) be a probability space on which we denote by {(X_i, Y_i)}_{i=1}^n a sample of n independent and identically distributed (i.i.d.) random vectors in (−1, 1)^p × R, with X_i = (X_{i1}, ..., X_{ip}), satisfying

Y_i = ψ⋆(X_i) + ξ_i = \sum_{j=1}^p ψ⋆_j(X_{ij}) + ξ_i,   i ∈ {1, ..., n},

where ψ⋆_1, ..., ψ⋆_p are p continuous functions (−1, 1) → R and {ξ_i}_{i=1}^n is a set of i.i.d. (conditionally on {X_i}_{i=1}^n) real-valued random variables. Let P denote the distribution of the sample {(X_i, Y_i)}_{i=1}^n. Denote by E the expectation computed with respect to P and let ‖·‖_∞ be the supremum norm. We make the two following assumptions.
(A1) For any integer k, E[|ξ_1|^k] < ∞, E[ξ_1 | X_1] = 0, and there exist two positive constants L and σ² such that, for any integer k ≥ 2,

E[|ξ_1|^k | X_1] ≤ \frac{k!}{2} σ² L^{k−2}.

(A2) There exists a constant C > max(1, σ) such that ‖ψ⋆‖_∞ ≤ C.
Note that A1 implies that E ξ_1 = 0 and that the distribution of ξ_1 may depend on X_1. In particular, A1 holds if ξ_1 is a zero-mean Gaussian with variance γ²(X_1), where x ↦ γ²(x) is bounded.
Further, note that the boundedness assumption A2 plays a key role in our approach, as it allows us to use a version of Bernstein's inequality, which is one of the two main technical tools we use to state our results. This assumption is not only a technical prerequisite: it proves crucial in critical regimes. Indeed, if the intrinsic dimension p_0 of the regression function ψ⋆ is still large, the boundedness of the function class allows much faster estimation rates. This point is profusely discussed in Raskutti, Wainwright and Yu (2012).
We are mostly interested in sparse additive models, in which only a few of the {ψ⋆_j}_{j=1}^p are not identically zero. Let {ϕ_k}_{k=1}^∞ be a known countable set of continuous functions R → (−1, 1), called the dictionary. In the sequel, |H| stands for the cardinality of a set H. For any p-tuple m = (m_1, ..., m_p) ∈ N^p, denote by S(m) ⊂ {1, ..., p} the set of indices of nonzero elements of m, i.e.,

|S(m)| = \sum_{j=1}^p 1[m_j > 0],

and define

Θ_m = {θ ∈ R^{m_1} × · · · × R^{m_p}},

with the convention R^0 = ∅. The set Θ_m is embedded with its canonical Borel field B(Θ_m) = B(R^{m_1}) ⊗ · · · ⊗ B(R^{m_p}). Denote by

Θ = \bigcup_{m ∈ M} Θ_m

the union over the collection of models M = {m = (m_1, ..., m_p) ∈ N^p}, equipped with the σ-algebra T = σ(\bigvee_{m ∈ M} B(Θ_m)). Consider the span of the set {ϕ_k}_{k=1}^∞, i.e., the set of functions

F = {ψ_θ = \sum_{j ∈ S(m)} ψ_j = \sum_{j ∈ S(m)} \sum_{k=1}^{m_j} θ_{jk} ϕ_k : θ ∈ Θ_m, m ∈ M},

equipped with a countably generated σ-algebra denoted by F. The risk and empirical risk associated with any ψ_θ ∈ F are defined respectively as

R(ψ_θ) = E[Y_1 − ψ_θ(X_1)]² and R_n(ψ_θ) = r_n({X_i, Y_i}_{i=1}^n, ψ_θ),

where

r_n({x_i, y_i}_{i=1}^n, ψ_θ) = \frac{1}{n} \sum_{i=1}^n (y_i − ψ_θ(x_i))².
Consider the probability η_α on the set M defined by

η_α : m ↦ \frac{1 − \frac{α}{1−α}}{1 − \left(\frac{α}{1−α}\right)^{p+1}} \binom{p}{|S(m)|}^{-1} α^{\sum_{j=1}^p m_j},

for some α ∈ (0, 1/2). Let us stress the fact that the probability η_α acts as a penalization term over a model m, on the number of its active regressors through the combinatorial term \binom{p}{|S(m)|}^{-1} and on their expansion through α^{\sum_{j=1}^p m_j}.
Our procedure relies on the following construction of the probability π, referred to as the prior, in order to promote the sparsity properties of the target regression function ψ⋆. For any m ∈ M, ζ > 0 and x ∈ Θ_m, denote by B¹_m(x, ζ) the ℓ1-ball centered at x with radius ζ. For any m ∈ M, denote by π_m the uniform distribution on B¹_m(0, C). Define the probability π on (Θ, T) by

π(A) = \sum_{m ∈ M} η_α(m) π_m(A),   A ∈ T.

Note that the volume V_m(C) of B¹_m(0, C) is given by

V_m(C) = \frac{(2C)^{\sum_{j ∈ S(m)} m_j}}{Γ(\sum_{j ∈ S(m)} m_j + 1)} = \frac{(2C)^{\sum_{j ∈ S(m)} m_j}}{(\sum_{j ∈ S(m)} m_j)!}.
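The volume formula, and a way of drawing from the uniform distribution π_m on an ℓ1-ball, can be sketched in Python as follows. The Dirichlet-based sampler is a standard construction we assume here, not a recipe given in the paper: a point uniform on the ℓ1-sphere (Dirichlet(1, ..., 1) magnitudes with random signs) is pushed inwards with radius C·U^{1/M}.

```python
import math
import numpy as np

def l1_ball_volume(C, M):
    """Volume of the l1-ball of radius C in R^M: (2C)^M / M!."""
    return (2 * C) ** M / math.factorial(M)

def sample_uniform_l1_ball(C, M, rng):
    """One draw from the uniform distribution on B^1_M(0, C)."""
    magnitudes = rng.dirichlet(np.ones(M))     # uniform on the probability simplex
    signs = rng.choice([-1.0, 1.0], size=M)    # random orthant
    radius = C * rng.uniform() ** (1.0 / M)    # radial density prop. to r^(M-1)
    return radius * signs * magnitudes

rng = np.random.default_rng(0)
draws = np.array([sample_uniform_l1_ball(2.0, 5, rng) for _ in range(1000)])
```

Every draw satisfies ‖x‖_1 ≤ C by construction, which is what the indicator 1_{B¹_m(0,C)} in the Gibbs density below requires.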
Finally, set δ > 0 (which may be interpreted as an inverse temperature parameter); the posterior Gibbs transition density is

ρ_δ({(x_i, y_i)}_{i=1}^n, θ) = \sum_{m ∈ M} \frac{η_α(m)}{V_m(C)} 1_{B¹_m(0,C)}(θ) \frac{\exp[−δ r_n({x_i, y_i}_{i=1}^n, ψ_θ)]}{\int \exp[−δ r_n({x_i, y_i}_{i=1}^n, ψ_θ)] π(dθ)}. (2.1)

We then consider two competing estimators. The first one is the randomized Gibbs estimator ˆΨ, constructed with parameters ˆθ sampled from the posterior Gibbs density, i.e., for any A ∈ F,

P(ˆΨ ∈ A | {X_i, Y_i}_{i=1}^n) = \int_A ρ_δ({X_i, Y_i}_{i=1}^n, θ) π(dθ), (2.2)

while the second one is the aggregated Gibbs estimator ˆΨ_a, defined as the posterior mean

ˆΨ_a = \int ψ_θ ρ_δ({X_i, Y_i}_{i=1}^n, θ) π(dθ) = E[ˆΨ | {X_i, Y_i}_{i=1}^n]. (2.3)
These estimators have been introduced in Catoni (2004, 2007) and investigated in further work by Audibert (2004a); Alquier (2006, 2008); Dalalyan and Tsybakov (2008, 2012).
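The exponential-weights mechanism behind (2.1)–(2.3) can be illustrated on a deliberately tiny example: a single covariate, a finite grid of candidate slopes, and a uniform prior on the grid (instead of the paper's hierarchical prior π). All names below are ours; the calibration δ = n/(4σ²) anticipates the discussion in Section 4.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 100, 0.1
x = rng.uniform(-1, 1, n)
y = 0.5 * x + sigma * rng.normal(size=n)      # toy data, true slope 0.5

thetas = np.linspace(-1.0, 1.0, 201)          # discretized parameter set
risks = np.array([np.mean((y - t * x) ** 2) for t in thetas])  # empirical risk r_n

delta = n / (4 * sigma ** 2)                  # inverse temperature
logw = -delta * risks                         # log exp(-delta * r_n), uniform prior
weights = np.exp(logw - logw.max())           # subtract max for numerical stability
weights /= weights.sum()                      # discrete Gibbs posterior

theta_hat = float(np.sum(weights * thetas))   # aggregated (posterior-mean) estimator
```

Candidates with low empirical risk receive exponentially larger weight, so the posterior mean concentrates near the best-fitting slope; the randomized Gibbs estimator would instead draw a single theta from `weights`.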
For the sake of clarity, denote by D a generic numerical constant in the sequel. We are now in a position to state a PAC-Bayesian oracle inequality.

Theorem 2.1. Let ˆψ and ˆψ_a be realizations of the Gibbs estimators defined by (2.2)–(2.3), respectively. Let A1 and A2 hold. Set w = 8C max(L, C) and δ = nℓ/[w + 4(σ² + C²)], for ℓ ∈ (0, 1), and let ε ∈ (0, 1). Then with P-probability at least 1 − 2ε, both R(ˆψ) − R(ψ⋆) and R(ˆψ_a) − R(ψ⋆) are bounded by

D \inf_{m ∈ M} \inf_{θ ∈ B¹_m(0,C)} \left\{ R(ψ_θ) − R(ψ⋆) + |S(m)| \frac{\log(p/|S(m)|)}{n} + \frac{\log(n)}{n} \sum_{j ∈ S(m)} m_j + \frac{\log(1/ε)}{n} \right\}. (2.4)
Under mild assumptions, Theorem 2.1 provides inequalities which admit the following interpretation. If the collection M contains a “small” model, i.e., a model m such that \sum_{j ∈ S(m)} m_j and |S(m)| are small, for which ψ_θ (with θ ∈ Θ_m) is close to ψ⋆, then ˆψ and ˆψ_a are also close to ψ⋆, up to log(n)/n and log(p)/n terms. However, if no such model exists, at least one of the terms \sum_{j ∈ S(m)} m_j/n and |S(m)|/n starts to emerge, thereby deteriorating the global quality of the bound. A satisfying estimation of ψ⋆ is typically possible when ψ⋆ admits a sparse representation.
To go further, we derive from Theorem 2.1 an inequality on Sobolev ellipsoids, showing that our procedure achieves the optimal rate of convergence in this setting. For the sake of brevity, we consider Sobolev spaces; however, one can easily derive the following results in other functional spaces, such as Besov spaces. See Tsybakov (2009) and the references therein.
The notation {ϕ_k}_{k=1}^∞ now refers to the (non-normalized) trigonometric system, defined as

ϕ_1 : t ↦ 1,   ϕ_{2j} : t ↦ cos(πjt),   ϕ_{2j+1} : t ↦ sin(πjt),

with j ∈ N⋆ and t ∈ (−1, 1). Let us denote by S⋆ the set of indices of non-identically zero regressors, so that the regression function ψ⋆ is

ψ⋆ = \sum_{j ∈ S⋆} ψ⋆_j.
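The trigonometric dictionary and the corresponding truncated expansions are straightforward to evaluate; a minimal Python sketch (function names are ours, not the paper's):

```python
import numpy as np

def phi(k, t):
    """k-th element of the non-normalized trigonometric system on (-1, 1)."""
    t = np.asarray(t, dtype=float)
    if k == 1:
        return np.ones_like(t)
    j = k // 2
    # even indices are cosines, odd indices (>= 3) are sines
    return np.cos(np.pi * j * t) if k % 2 == 0 else np.sin(np.pi * j * t)

def psi(theta, t):
    """Truncated expansion sum_k theta_k * phi_k(t) of a single component."""
    return sum(th * phi(k, t) for k, th in enumerate(theta, start=1))
```

For instance, `psi([0.0, 1.0], t)` evaluates the single dictionary function cos(πt), and stacking the values phi_k(X_ij) over observations yields the design matrix D_m used in Section 3.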
Assume that for any j ∈ S⋆, ψ⋆_j belongs to the Sobolev ellipsoid W(r_j, d_j) defined as

W(r_j, d_j) = \left\{ f ∈ L²([−1, 1]) : f = \sum_{k=1}^∞ θ_k ϕ_k and \sum_{i=1}^∞ i^{2r_j} θ_i² ≤ d_j \right\},

with d_j chosen such that \sum_{j ∈ S⋆} \sqrt{d_j} ≤ C\sqrt{6}/π, and for unknown regularity parameters r_1, ..., r_{|S⋆|} ≥ 1. Let us stress the fact that this assumption casts our results onto the adaptive setting. It also implies that ψ⋆ belongs to the Sobolev ellipsoid W(r, d), with r = \min_{j ∈ S⋆} r_j and d = \sum_{j ∈ S⋆} d_j, i.e.,

ψ⋆ = \sum_{j ∈ S⋆} \sum_{k=1}^∞ θ⋆_{jk} ϕ_k. (2.5)
It is worth pointing out that in this setting, the Sobolev ellipsoid is better approximated by the ℓ1-ball B¹_m(0, C) as the dimension of m grows. Further, make the following assumption.
(A3) The distribution of the data P has a probability density with respect to the corresponding Lebesgue measure, bounded from above by a constant B > 0.
Theorem 2.2. Let ˆψ and ˆψ_a be realizations of the Gibbs estimators defined by (2.2)–(2.3), respectively. Let A1, A2 and A3 hold. Set w = 8C max(L, C) and δ = nℓ/[w + 4(σ² + C²)], for ℓ ∈ (0, 1), and let ε ∈ (0, 1). Then with P-probability at least 1 − 2ε, both R(ˆψ) − R(ψ⋆) and R(ˆψ_a) − R(ψ⋆) are bounded by

D \left\{ \sum_{j ∈ S⋆} d_j^{\frac{1}{2r_j+1}} \left( \frac{\log(n)}{2nr_j} \right)^{\frac{2r_j}{2r_j+1}} + |S⋆| \frac{\log(p/|S⋆|)}{n} + \frac{\log(1/ε)}{n} \right\},

where D is a constant depending only on w, σ, C, ℓ, α and B.
Theorem 2.2 illustrates that we obtain the minimax rate of convergence over Sobolev classes up to a log(n) term. Indeed, the minimax rate for estimating a single function with regularity r is n^{−2r/(2r+1)}; see for example Tsybakov (2009, Chapter 2). Theorem 2.1 and Theorem 2.2 thus validate our method.
A salient fact about Theorem 2.2 is its link with existing work: assume that all the ψ⋆_j belong to the same Sobolev ellipsoid W(r, d). The convergence rate is then log(n) n^{−2r/(2r+1)} + log(p)/n. This rate (up to a log(n) term) is the same as the one exhibited by Koltchinskii and Yuan (2010) in the context of multiple kernel learning (n^{−2r/(2r+1)} + log(p)/n). Suzuki and Sugiyama (2012) even obtain faster rates, which correspond to smaller functional spaces. However, the results presented by both Koltchinskii and Yuan (2010) and Suzuki and Sugiyama (2012) are obtained under stringent conditions on the design, which are not necessary to prove Theorem 2.2.
A natural extension is to consider sparsity on both the regressors and their expansions, instead of sparse regressors with nested expansions as before. That is, we no longer restrict the expansion of regressor j to the first m_j dictionary functions. To this aim, we slightly extend the previous notation. Let K ∈ N⋆ be the length of the dictionary. A model is now denoted by m = (m_1, ..., m_p), where for any j ∈ {1, ..., p}, m_j = (m_{j1}, ..., m_{jK}) is a K-sized vector whose entries are 1 whenever the corresponding dictionary function is present in the model and 0 otherwise. Introduce the notation

S(m) = {j ∈ {1, ..., p} : m_j ≠ 0},   S(m_j) = {k ∈ {1, ..., K} : m_{jk} ≠ 0}.

The prior distribution on the model space M is now

η_α : m ↦ \frac{1 − \frac{α(1−α^K)}{1−α}}{1 − \left(\frac{α(1−α^K)}{1−α}\right)^{p+1}} \binom{p}{|S(m)|}^{-1} \prod_{j ∈ S(m)} \binom{K}{|S(m_j)|}^{-1} α^{|S(m_j)|},

for any α ∈ (0, 1/2).
Theorem 2.3. Let ˆψ and ˆψ_a be realizations of the Gibbs estimators defined by (2.2)–(2.3), respectively. Let A1 and A2 hold. Set w = 8C max(L, C) and δ = nℓ/[w + 4(σ² + C²)], for ℓ ∈ (0, 1), and let ε ∈ (0, 1). Then with P-probability at least 1 − 2ε, both R(ˆψ) − R(ψ⋆) and R(ˆψ_a) − R(ψ⋆) are bounded by

D \inf_{m ∈ M} \inf_{θ ∈ B¹_m(0,C)} \left\{ R(ψ_θ) − R(ψ⋆) + |S(m)| \frac{\log(p/|S(m)|)}{n} + \frac{\log(nK)}{n} \sum_{j ∈ S(m)} |S(m_j)| + \frac{\log(1/ε)}{n} \right\},
where D depends upon w, σ, C, ℓ and α defined above.
3. MCMC implementation
In this section, we describe an implementation of the method outlined in the previous section. Our goal is to sample from the Gibbs posterior distribution ρ_δ. We use a version of the Subspace Carlin and Chib (SCC) algorithm developed by Petralias (2010); Petralias and Dellaportas (2012), which originates in the Shotgun Stochastic Search algorithm (see Hans, Dobra and West (2007)). The key idea of the algorithm lies in a stochastic search heuristic that restricts moves to neighborhoods of the visited models. Let T ∈ N⋆ and denote by {θ(t), m(t)}_{t=0}^T the Markov chain of interest, with θ(t) ∈ Θ_{m(t)}. Define
i : t ↦ {+, −, =}, the three possible moves performed by the algorithm: addition, deletion or adjustment of a regressor. Let {e_1, ..., e_p} be the canonical basis of R^p. For any model m(t) = (m_1(t), ..., m_p(t)) ∈ M, define its neighborhood {V_+[m(t)], V_−[m(t)], V_=[m(t)]}, where

V_+[m(t)] = {k ∈ M : k = m(t) + x e_j, x ∈ N⋆, j ∈ {1, ..., p} \ S[m(t)]},
V_−[m(t)] = {k ∈ M : k = m(t) − m_j(t) e_j, j ∈ S[m(t)]}, and
V_=[m(t)] = {k ∈ M : S(k) = S[m(t)]}.
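These neighborhoods are simple to enumerate for a given model. The Python sketch below (ours, not the paper's R code) truncates the infinite addition set V_+ at expansion size K, and lists only the single-coordinate members of V_= (the full V_= contains every model with the same support):

```python
def neighborhoods(m, K):
    """Truncated model neighborhoods for m in N^p.

    v_plus:  activate one inactive regressor with expansion size x in {1,...,K}.
    v_minus: deactivate one active regressor.
    v_equal: keep the support, change one active regressor's expansion size
             (a subset of the full V_= defined in the text).
    """
    p = len(m)
    support = [j for j in range(p) if m[j] > 0]
    v_plus = [tuple(m[i] + (x if i == j else 0) for i in range(p))
              for j in range(p) if j not in support for x in range(1, K + 1)]
    v_minus = [tuple(0 if i == j else m[i] for i in range(p)) for j in support]
    v_equal = [tuple(x if i == j else m[i] for i in range(p))
               for j in support for x in range(1, K + 1) if x != m[j]]
    return v_plus, v_minus, v_equal
```

For example, with m = (2, 0, 1) and K = 2, the deletion neighborhood is {(0, 0, 1), (2, 0, 0)} and the addition neighborhood activates the second regressor with size 1 or 2.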
A move i(t) is chosen with probability q[i(t)]. By convention, if |S[m(t)]| = p (respectively |S[m(t)]| = 1), the probability of performing an addition move (respectively a deletion move) is zero. Denote by ξ : {+, −} ↦ {−, +} the map exchanging the addition and deletion moves, and let D_m be the design matrix in model m ∈ M. Denote by LSE_m = (D'_m D_m)^{-1} D'_m Y (with Y = (Y_1, ..., Y_n)) the least squares estimate in model m ∈ M. For ease of notation, let I denote the identity matrix. Finally, denote by φ(·; µ, Γ) the density of a Gaussian distribution N(µ, Γ) with mean µ and covariance matrix Γ. A description of the full algorithm is presented in Algorithm 1.
The estimates ˆΨ and ˆΨ_a are obtained as

ˆΨ = \sum_{j=1}^p \sum_{k=1}^K θ_{jk}(T) ϕ_k,

and, for some burn-in b ∈ {1, ..., T − 1},

ˆΨ_a = \sum_{j=1}^p \sum_{k=1}^K \left( \frac{1}{T−b} \sum_{ℓ=b+1}^T θ_{jk}(ℓ) \right) ϕ_k.
Algorithm 1 A Subspace Carlin and Chib-based algorithm
1: Initialize (θ(0), m(0)).
2: for t = 1 to T do
3:   Choose a move i(t) with probability q[i(t)].
4:   For any k ∈ V_{i(t)}[m(t − 1)], generate θ_k from the proposal density φ(·; LSE_k, σ²I).
5:   Propose a model k ∈ V_{i(t)}[m(t − 1)] with probability

       γ(m(t − 1), k) = \frac{A_k}{\sum_{j ∈ V_{i(t)}[m(t−1)]} A_j},  where A_j = \frac{ρ_δ(θ_j)}{φ(θ_j; LSE_j, σ²I)}.

6:   if i(t) ∈ {+, −} then
7:     For any h ∈ V_{ξ(i(t))}[k], generate θ_h from the proposal density φ(·; LSE_h, σ²I). Note that m(t − 1) ∈ V_{ξ(i(t))}[k].
8:     Accept model k, i.e., set m(t) = k and θ(t) = θ_k, with probability

         α = \min\left(1, \frac{A_k q[i(t)] γ(k, m(t − 1))}{A_{m(t−1)} q[ξ(i(t))] γ(m(t − 1), k)}\right) = \min\left(1, \frac{q[i(t)] \sum_{h ∈ V_{i(t)}[m(t−1)]} A_h}{q[ξ(i(t))] \sum_{h ∈ V_{ξ(i(t))}[k]} A_h}\right).

       Otherwise, set m(t) = m(t − 1) and θ(t) = θ_{m(t−1)}.
9:   else
10:    Generate θ_{m(t−1)} from the proposal density φ(·; LSE_{m(t−1)}, σ²I).
11:    Accept model k, i.e., set m(t) = k and θ(t) = θ_k, with probability

         α = \min\left(1, \frac{A_k γ(k, m(t − 1))}{A_{m(t−1)} γ(m(t − 1), k)}\right).

       Otherwise, set m(t) = m(t − 1) and θ(t) = θ_{m(t−1)}.
12:  end if
13: end for
The transition kernel of the chain defined above is reversible with respect to ρ_δ ⊗ η_α; hence this procedure ensures that {θ(t)}_{t=1}^T is a Markov chain with stationary distribution ρ_δ.
4. Numerical studies
In this section we validate the effectiveness of our method on simulated data. All our numerical studies have been performed with the software R (see R Core Team (2012)). The method is available on the CRAN website (http://www.cran.r-project.org/web/packages/pacbpred/index.html) under the name pacbpred (see Guedj (2012)).
Some comments are in order here about how to calibrate the constants C, σ², δ and α. Clearly, too small a value for C will stall the algorithm, preventing the chain from escaping the initial model: most proposed models will be discarded, since the acceptance ratio will frequently take the value 0. Conversely, a large value for C deteriorates the quality of the bound in Theorem 2.1, Theorem 2.2, Theorem 2.3 and Theorem 5.1. However, this only influences the theoretical bound, as its contribution to the acceptance ratio is limited to log(2C). We thereby proceeded with typically large values of C (such as C = 10^6). As the parameter σ² is the variance of the proposal distribution φ, the practitioner should tune it in accordance with the noise level of the data. The parameter requiring the finest calibration is δ: the convergence of the algorithm is sensitive to its choice. Dalalyan and Tsybakov (2008, 2012) exhibit the theoretical value δ = n/(4σ²). This value leads to very good numerical performance, as also noticed by Dalalyan and Tsybakov (2008, 2012); Alquier and Biau (2011). The choice of α is guided by a reasoning similar to the one for C: its contribution to the acceptance ratio is limited to a log(1/α) term. The value α = 0.25 was used in the simulations for its apparent good properties. Although it would be computationally costly, a finer calibration through methods such as cross-validation is possible.

Table 1
Each number is the mean (standard deviation) of the RSS over 10 independent runs

               p = 50            p = 200           p = 400
MCMC           3000 it.          10000 it.         20000 it.
Model 1        0.0318 (0.0047)   0.0320 (0.0029)   0.0335 (0.0056)
Model 2        0.0411 (0.0061)   0.1746 (0.0639)   0.2201 (0.0992)
Model 3        0.0665 (0.0421)   0.1151 (0.0399)   0.1597 (0.0579)
Finally, and as a general rule, we strongly encourage practitioners to run several chains of unequal lengths and to adjust the number of iterations by checking whether the empirical risk has stabilized.
Model 1. n = 200 and S⋆ = {1, 2, 3, 4}. This model is similar to Meier, van de Geer and Bühlmann (2009, Section 3, Example 1) and is given by

Y_i = ψ⋆_1(X_{i1}) + ψ⋆_2(X_{i2}) + ψ⋆_3(X_{i3}) + ψ⋆_4(X_{i4}) + ξ_i,

with

ψ⋆_1 : x ↦ −sin(2x),   ψ⋆_2 : x ↦ x³,   ψ⋆_3 : x ↦ x,
ψ⋆_4 : x ↦ e^{−x} − e/2,   ξ_i ∼ N(0, 0.1),   i ∈ {1, ..., n}.

The covariates are sampled from independent uniform distributions over (−1, 1).
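A minimal Python sketch for generating data from Model 1 (the paper's simulations use R; here N(0, 0.1) is read as a variance of 0.1, and all names are ours):

```python
import numpy as np

def simulate_model1(n=200, p=200, seed=0):
    """Data from Model 1: sparse additive regression with S* = {1, 2, 3, 4}."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1, 1, size=(n, p))        # covariates uniform on (-1, 1)
    signal = (-np.sin(2 * X[:, 0]) + X[:, 1] ** 3 + X[:, 2]
              + np.exp(-X[:, 3]) - np.e / 2)   # psi*_1 + psi*_2 + psi*_3 + psi*_4
    # N(0, 0.1) read as variance 0.1, hence standard deviation sqrt(0.1)
    Y = signal + rng.normal(0.0, np.sqrt(0.1), size=n)
    return X, Y

X, Y = simulate_model1()
```

Only the first four columns of X carry signal; the remaining p − 4 covariates are pure noise, which is what makes the p ≫ n setting challenging.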
Model 2. n = 200 and S⋆ = {1, 2, 3, 4}. As above, but with correlated covariates, sampled from a multivariate Gaussian distribution with covariance matrix

Σ_{ij} = 2^{−|i−j|−2},   i, j ∈ {1, ..., p}.
Model 3. n = 200 and S⋆ = {1, 2, 3, 4}. This model is similar to Meier, van de Geer and Bühlmann (2009, Section 3, Example 3) and is given by

Y_i = ψ⋆_1(X_{i1}) + ψ⋆_2(X_{i2}) + ψ⋆_3(X_{i3}) + ψ⋆_4(X_{i4}) + ξ_i,

with

ψ⋆_1 : x ↦ x,   ψ⋆_2 : x ↦ 4(x² − x − 1),   ψ⋆_3 : x ↦ \frac{sin(2πx)}{2 − sin(2πx)},
ψ⋆_4 : x ↦ 0.1 sin(2πx) + 0.2 cos(2πx) + 0.3 sin²(2πx) + 0.4 cos³(2πx) + 0.5 sin³(2πx),
ξ_i ∼ N(0, 0.5),   i ∈ {1, ..., n}.

The covariates are sampled from independent uniform distributions over (−1, 1).
The results of the simulations are summarized in Table 1 and illustrated by Figure 1 and Figure 2. The reconstruction of the true regression function ψ⋆ is achieved even in very high-dimensional situations, bringing our method up to the level of the gold-standard Lasso.
Fig 1. Estimates (red dashed lines) for ψ⋆_1, ψ⋆_2, ψ⋆_3 and ψ⋆_4 (solid black lines), in panels (a) Model 1, p = 200; (b) Model 1, p = 400; (c) Model 2, p = 50; (d) Model 3, p = 50.

Fig 2. Plot of the responses Y_1, ..., Y_n against their estimates, in panels (a) Model 1, p = 200; (b) Model 1, p = 400; (c) Model 2, p = 50; (d) Model 3, p = 50. The more points on the first bisectrix (solid black line), the better the estimation.
5. Proofs
To start the chain of proofs leading toTheorem 2.1,Theorem 2.2andTheorem 2.3, we recall and prove some lemmas to establishTheorem 5.1which consists in a general PAC-Bayesian inequality in the spirit of Catoni (2004, Theorem 5.5.1) for classification or Catoni (2004, Lemma 5.8.2) for regression. Note also that Dalalyan and Tsybakov (2012, Theorem 1) provides a similar inequality in the deterministic design case. A salient fact on Theorem 5.1is that the validity of the oracle inequalities only involves the distribution of the noise variable ξ1, and
that distribution is independent of the sample size n.
The proofs of the following two classical results are omitted.Lemma 5.1is a version of Bernstein’s inequality which originates in Massart (2007, Proposition 2.19), whereasLemma 5.2 appears in Catoni (2004, Equation 5.2.1).
For x ∈ R, denote (x)_+ = max(x, 0). Let µ_1, µ_2 be two probabilities. The Kullback–Leibler divergence of µ_1 with respect to µ_2 is denoted KL(µ_1, µ_2) and is

KL(µ_1, µ_2) = \int \log\left(\frac{dµ_1}{dµ_2}\right) dµ_1 if µ_1 ≪ µ_2, and ∞ otherwise.

Finally, for any measurable space (A, A) and any probability π on (A, A), denote by M¹_{+,π}(A, A) the set of probabilities on (A, A) absolutely continuous with respect to π.
Lemma 5.1. Let (T_i)_{i=1}^n be independent real-valued variables. Assume that there exist two positive constants v and w such that, for any integer k ≥ 2,

\sum_{i=1}^n E[(T_i)_+^k] ≤ \frac{k!}{2} v w^{k−2}.

Then for any γ ∈ (0, 1/w),

E\left[ \exp\left( γ \sum_{i=1}^n (T_i − E T_i) \right) \right] ≤ \exp\left( \frac{vγ²}{2(1 − wγ)} \right).
Lemma 5.2. Let (A, A) be a measurable space. For any probability µ on (A, A) and any measurable function h : A → R such that \int (\exp ∘ h) dµ < ∞,

\log \int (\exp ∘ h) dµ = \sup_{m ∈ M¹_{+,µ}(A,A)} \left( \int h \, dm − KL(m, µ) \right),

with the convention ∞ − ∞ = −∞. Moreover, as soon as h is upper-bounded on the support of µ, the supremum with respect to m on the right-hand side is reached for the Gibbs distribution g given by

\frac{dg}{dµ}(a) = \frac{\exp(h(a))}{\int (\exp ∘ h) dµ},   a ∈ A.
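On a finite space, the variational formula of Lemma 5.2 can be checked numerically: the Gibbs distribution attains the supremum, and any other distribution gives a smaller objective value. A minimal Python sketch (our illustration, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

# Finite space: mu a strictly positive probability vector, h a bounded function.
mu = rng.dirichlet(np.ones(6))
h = rng.normal(size=6)

lhs = np.log(np.sum(mu * np.exp(h)))     # log of the integral of exp(h) d mu

# Gibbs distribution g, the claimed maximizer
g = mu * np.exp(h)
g /= g.sum()
kl = np.sum(g * np.log(g / mu))          # KL(g, mu)
rhs = np.sum(g * h) - kl                 # variational objective evaluated at g

# Any other distribution yields a smaller objective value
other = rng.dirichlet(np.ones(6))
other_val = np.sum(other * h) - np.sum(other * np.log(other / mu))
```

Here `lhs` equals `rhs` exactly (up to floating-point error), which is the equality case of the lemma, while `other_val` stays below `lhs`.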
Theorem 5.1 is valid in the general regression framework. In the proofs of Lemma 5.3, Lemma 5.4, Lemma 5.5 and Theorem 5.1, we consider a general regression function ψ⋆. Denote by (Θ, T) a space of functions equipped with a countably generated σ-algebra, and let π be a probability on (Θ, T), referred to as the prior. Lemma 5.3, Lemma 5.4, Lemma 5.5 and Theorem 5.1 follow from the work of Catoni (2004); Dalalyan and Tsybakov (2008, 2012); Alquier (2008); Alquier and Biau (2011). Let δ > 0 and consider the so-called posterior Gibbs transition density ρ_δ with respect to π, defined as

ρ_δ({x_i, y_i}_{i=1}^n, ψ) = \frac{\exp[−δ r_n({x_i, y_i}_{i=1}^n, ψ)]}{\int \exp[−δ r_n({x_i, y_i}_{i=1}^n, ψ)] π(dψ)}. (5.1)
In the following three lemmas, denote by ρ a so-called posterior probability absolutely continuous with respect to π. Let ψ be a realization of a random variable Ψ sampled from ρ.

Lemma 5.3. Let A1 and A2 hold. Set w = 8C max(L, C), δ ∈ (0, n/[w + 4(σ² + C²)]) and ε ∈ (0, 1). Then with P-probability at least 1 − ε,

R(ψ) − R(ψ⋆) ≤ \frac{1}{1 − \frac{4δ(σ² + C²)}{n − wδ}} \left( R_n(ψ) − R_n(ψ⋆) + \frac{\log \frac{dρ}{dπ}(ψ) + \log \frac{1}{ε}}{δ} \right).
Proof. Apply Lemma 5.1 to the variables T_i defined as follows: for any ψ ∈ Θ,

T_i = −(Y_i − ψ(X_i))² + (Y_i − ψ⋆(X_i))²,   i ∈ {1, ..., n}. (5.2)

First, let us note that

R(ψ) − R(ψ⋆) = E[(Y_1 − ψ(X_1))²] − E[(Y_1 − ψ⋆(X_1))²]
             = E[(2Y_1 − ψ(X_1) − ψ⋆(X_1))(ψ⋆(X_1) − ψ(X_1))]
             = E[(ψ⋆(X_1) − ψ(X_1)) E[(2ξ_1 + ψ⋆(X_1) − ψ(X_1)) | X_1]]
             = 2 E[(ψ⋆(X_1) − ψ(X_1)) E[ξ_1 | X_1]] + E[ψ⋆(X_1) − ψ(X_1)]².

As E[ξ_1 | X_1] = 0,

R(ψ) − R(ψ⋆) = E[ψ⋆(X_1) − ψ(X_1)]². (5.3)
By (5.2), the random variables (Ti)ni=1 are independent. Using Lemma 5.1, we
get n X i=1 ET2 i = n X i=1 E(2Yi− ψ(Xi)− ψ⋆(Xi))2(ψ(Xi)− ψ⋆(Xi))2 = n X i=1 E E(2Wi+ ψ⋆(Xi)− ψ(Xi))2(ψ(Xi)− ψ⋆(Xi))2| Xi .
Next, using that|a + b|k
≤ 2k−1(
|a| + |b|) for any a, b ∈ R and k ∈ N∗, we get
n X i=1 ETi2≤ 2 n X i=1 E(ψ(Xi)− ψ⋆(Xi))2E(4Wi2+ 4C2)| Xi ≤ 8 σ2+ C2 n X i=1 E(ψ(Xi)− ψ⋆(Xi))2 = 8n σ2+ C2 (R(ψ) − R(ψ⋆))def = v, (5.4)
where we have used (5.3) in the last equation. It follows that for any integer k≥ 3, n X i=1 E[(Ti)k+] = n X i=1 E E[(Ti)k+| Xi] ≤ n X i=1 E E|2Yi− ψ(Xi)− ψ⋆(Xi)|k|ψ(Xi)− ψ⋆(Xi)|k| Xi = n X i=1 E E|2Wi+ ψ⋆(Xi)− ψ(Xi)|k|ψ(Xi)− ψ⋆(Xi)|k| Xi ≤ 2k−1 n X i=1 E E 2k|ξi|k+|ψ⋆(Xi)− ψ(Xi)|k |ψ(Xi)− ψ⋆(Xi)|k| Xi .
Using that \(|\psi(X_i) - \psi^{\star}(X_i)|^k \leq (2C)^{k-2} |\psi(X_i) - \psi^{\star}(X_i)|^2\) and (5.3), we get
\[
\sum_{i=1}^{n} \mathbb{E}[(T_i)_+^k] \leq 2^{k-1} \sum_{i=1}^{n} \big(2^{k-1} k!\, \sigma^2 L^{k-2} + (2C)^k\big) (2C)^{k-2} \big[R(\psi) - R(\psi^{\star})\big] = \frac{k!}{2} v\, (2C)^{k-2} \left( \frac{2^{2k-4} \sigma^2 L^{k-2}}{\sigma^2 + C^2} + \frac{2}{k!}\, \frac{2^{2k-4} C^k}{\sigma^2 + C^2} \right).
\]
Recalling that C > max(1, σ) gives
\[
\frac{2^{2k-4} \sigma^2 L^{k-2} + \frac{2}{k!}\, 2^{2k-4} C^k}{\sigma^2 + C^2} \leq \frac{4^{k-2} \sigma^2 L^{k-2}}{2\sigma^2} + \frac{2}{k!}\, \frac{4^{k-2} C^k}{C^2} \leq \frac{1}{2}(4L)^{k-2} + \frac{1}{2}(4C)^{k-2} \leq \big[4\max(L, C)\big]^{k-2}.
\]
Hence
\[
\sum_{i=1}^{n} \mathbb{E}[(T_i)_+^k] \leq \frac{k!}{2}\, v\, w^{k-2}, \quad \text{with } w \overset{\mathrm{def}}{=} 8C\max(L, C). \quad (5.5)
\]
Applying Lemma 5.1 with γ = δ/n, we obtain, for any real δ ∈ (0, n/w),
\[
\mathbb{E}\exp\big[\delta\big(R_n(\psi^{\star}) - R_n(\psi) + R(\psi) - R(\psi^{\star})\big)\big] \leq \exp\left( \frac{v\delta^2}{2n^2\big(1 - \frac{w\delta}{n}\big)} \right),
\]
that is, for any real number ε ∈ (0, 1),
\[
\mathbb{E}\exp\left( \delta\big[R_n(\psi^{\star}) - R_n(\psi)\big] + \delta\big[R(\psi) - R(\psi^{\star})\big]\left(1 - \frac{4\delta(\sigma^2 + C^2)}{n - w\delta}\right) - \log\frac{1}{\varepsilon} \right) \leq \varepsilon. \quad (5.6)
\]
Next, we use a standard PAC-Bayesian approach (as developed in Audibert (2004a); Catoni (2004, 2007); Alquier (2008)). Integrating (5.6) with respect to the prior probability π on (Θ, T),
\[
\int \mathbb{E}\exp\left( \delta\big[R(\psi) - R(\psi^{\star})\big]\left(1 - \frac{4\delta(\sigma^2 + C^2)}{n - w\delta}\right) + \delta\big[R_n(\psi^{\star}) - R_n(\psi)\big] - \log\frac{1}{\varepsilon} \right) \pi(\mathrm{d}\psi) \leq \varepsilon.
\]
By the Fubini-Tonelli theorem,
\[
\mathbb{E}\int \exp\left( \delta\big[R(\psi) - R(\psi^{\star})\big]\left(1 - \frac{4\delta(\sigma^2 + C^2)}{n - w\delta}\right) + \delta\big[R_n(\psi^{\star}) - R_n(\psi)\big] - \log\frac{1}{\varepsilon} \right) \pi(\mathrm{d}\psi) \leq \varepsilon.
\]
Therefore, for any data-dependent posterior probability measure ρ absolutely continuous with respect to π, adopting the convention ∞ × 0 = 0,
\[
\mathbb{E}\int \exp\left( \delta\big[R(\psi) - R(\psi^{\star})\big]\left(1 - \frac{4\delta(\sigma^2 + C^2)}{n - w\delta}\right) + \delta\big[R_n(\psi^{\star}) - R_n(\psi)\big] - \log\frac{\mathrm{d}\rho}{\mathrm{d}\pi}(\psi) - \log\frac{1}{\varepsilon} \right) \rho(\mathrm{d}\psi) \leq \varepsilon. \quad (5.7)
\]
Recalling that E stands for the expectation computed with respect to P, the integration symbol may be omitted and we get
\[
\mathbb{E}\exp\left( \delta\big[R(\psi) - R(\psi^{\star})\big]\left(1 - \frac{4\delta(\sigma^2 + C^2)}{n - w\delta}\right) + \delta\big[R_n(\psi^{\star}) - R_n(\psi)\big] - \log\frac{\mathrm{d}\rho}{\mathrm{d}\pi}(\psi) - \log\frac{1}{\varepsilon} \right) \leq \varepsilon.
\]
Using the elementary inequality \(\exp(\delta x) \geq \mathbb{1}_{\mathbb{R}_+}(x)\), we get, with P-probability at most ε,
\[
\left(1 - \frac{4\delta(\sigma^2 + C^2)}{n - w\delta}\right)\big[R(\psi) - R(\psi^{\star})\big] \geq R_n(\psi) - R_n(\psi^{\star}) + \frac{\log\frac{\mathrm{d}\rho}{\mathrm{d}\pi}(\psi) + \log\frac{1}{\varepsilon}}{\delta}.
\]
Taking δ < n/[w + 4(σ² + C²)] implies \(1 - \frac{4\delta(\sigma^2 + C^2)}{n - w\delta} > 0\), and with P-probability at least 1 − ε,
\[
R(\psi) - R(\psi^{\star}) \leq \frac{1}{1 - \frac{4\delta(\sigma^2 + C^2)}{n - w\delta}} \left( R_n(\psi) - R_n(\psi^{\star}) + \frac{\log\frac{\mathrm{d}\rho}{\mathrm{d}\pi}(\psi) + \log\frac{1}{\varepsilon}}{\delta} \right). \qquad \square
\]
Lemma 5.4. Let A1 and A2 hold. Set w = 8C max(L, C), δ ∈ (0, n/[w + 4(σ² + C²)]) and ε ∈ (0, 1). Then with P-probability at least 1 − ε,
\[
\int R_n(\psi)\rho(\mathrm{d}\psi) - R_n(\psi^{\star}) \leq \left(1 + \frac{4\delta(\sigma^2 + C^2)}{n - w\delta}\right)\left( \int R(\psi)\rho(\mathrm{d}\psi) - R(\psi^{\star}) \right) + \frac{\mathrm{KL}(\rho, \pi) + \log\frac{1}{\varepsilon}}{\delta}. \quad (5.8)
\]
Proof. Set ψ ∈ Θ and \(Z_i = (Y_i - \psi(X_i))^2 - (Y_i - \psi^{\star}(X_i))^2 = -T_i\), i ∈ {1, …, n}. Proceeding as for (5.6), we get that for any δ ∈ (0, n/w) and ε ∈ (0, 1),
\[
\mathbb{E}\int \exp\left( -\delta\big[R(\psi) - R(\psi^{\star})\big]\left(1 + \frac{4\delta(\sigma^2 + C^2)}{n - w\delta}\right) + \delta\big[R_n(\psi) - R_n(\psi^{\star})\big] - \log\frac{\mathrm{d}\rho}{\mathrm{d}\pi}(\psi) - \log\frac{1}{\varepsilon} \right) \rho(\mathrm{d}\psi) \leq \varepsilon.
\]
Using Jensen's inequality, we get
\[
\mathbb{E}\exp\left( \int \left\{ -\delta\big[R(\psi) - R(\psi^{\star})\big]\left(1 + \frac{4\delta(\sigma^2 + C^2)}{n - w\delta}\right) + \delta\big[R_n(\psi) - R_n(\psi^{\star})\big] - \log\frac{\mathrm{d}\rho}{\mathrm{d}\pi}(\psi) - \log\frac{1}{\varepsilon} \right\} \rho(\mathrm{d}\psi) \right) \leq \varepsilon.
\]
Since \(\exp(\delta x) \geq \mathbb{1}_{\mathbb{R}_+}(x)\), we obtain, with P-probability at most ε,
\[
-\left(1 + \frac{4\delta(\sigma^2 + C^2)}{n - w\delta}\right)\left( \int R(\psi)\rho(\mathrm{d}\psi) - R(\psi^{\star}) \right) + \int R_n(\psi)\rho(\mathrm{d}\psi) - R_n(\psi^{\star}) - \frac{\mathrm{KL}(\rho, \pi) + \log\frac{1}{\varepsilon}}{\delta} \geq 0.
\]
Taking δ < n/[w + 4(σ² + C²)] yields (5.8). □
Lemma 5.5. Let A1 and A2 hold. Set w = 8C max(L, C), δ ∈ (0, n/[w + 4(σ² + C²)]) and ε ∈ (0, 1). Then with P-probability at least 1 − ε,
\[
\int R(\psi)\rho(\mathrm{d}\psi) - R(\psi^{\star}) \leq \frac{1}{1 - \frac{4\delta(\sigma^2 + C^2)}{n - w\delta}} \left( \int R_n(\psi)\rho(\mathrm{d}\psi) - R_n(\psi^{\star}) + \frac{\mathrm{KL}(\rho, \pi) + \log\frac{1}{\varepsilon}}{\delta} \right).
\]
Proof. Recall (5.7). By Jensen's inequality,
\[
\mathbb{E}\exp\left( \delta\left( \int R(\psi)\rho(\mathrm{d}\psi) - R(\psi^{\star}) \right)\left(1 - \frac{4\delta(\sigma^2 + C^2)}{n - w\delta}\right) + \delta\left( R_n(\psi^{\star}) - \int R_n(\psi)\rho(\mathrm{d}\psi) \right) - \mathrm{KL}(\rho, \pi) - \log\frac{1}{\varepsilon} \right) \leq \varepsilon.
\]
Using \(\exp(\delta x) \geq \mathbb{1}_{\mathbb{R}_+}(x)\) yields the expected result. □
Theorem 5.1. Let ψ̂ and ψ̂a be realizations of the Gibbs estimators defined by (2.2)–(2.3), respectively. Let A1 and A2 hold. Set w = 8C max(L, C) and δ = nℓ/[w + 4(σ² + C²)] for some ℓ ∈ (0, 1). Then with P-probability at least 1 − 2ε, each of R(ψ̂) − R(ψ⋆) and R(ψ̂a) − R(ψ⋆) is bounded by
\[
D \inf_{\rho \in \mathcal{M}^1_{+,\pi}(\Theta, \mathcal{T})} \left( \int R(\psi)\rho(\mathrm{d}\psi) - R(\psi^{\star}) + \frac{\mathrm{KL}(\rho, \pi) + \log\frac{1}{\varepsilon}}{n} \right), \quad (5.9)
\]
where D is a constant depending only upon w, σ, C and ℓ.
Proof. Recall that the randomized Gibbs estimator Ψ̂ is sampled from ρδ, and denote by ψ̂ a realization of Ψ̂. By Lemma 5.3, with P-probability at least 1 − ε,
\[
R(\hat\psi) - R(\psi^{\star}) \leq \frac{1}{1 - \frac{4\delta(\sigma^2 + C^2)}{n - w\delta}} \left( R_n(\hat\psi) - R_n(\psi^{\star}) + \frac{\log\frac{\mathrm{d}\rho_{\delta}}{\mathrm{d}\pi}(\hat\psi) + \log\frac{1}{\varepsilon}}{\delta} \right).
\]
Note that
\[
\log\frac{\mathrm{d}\rho_{\delta}}{\mathrm{d}\pi}(\hat\psi) = \log\frac{\exp[-\delta R_n(\hat\psi)]}{\int \exp[-\delta R_n(\psi)]\pi(\mathrm{d}\psi)} = -\delta R_n(\hat\psi) - \log\int \exp[-\delta R_n(\psi)]\pi(\mathrm{d}\psi).
\]
Thus, with P-probability at least 1 − ε,
\[
R(\hat\psi) - R(\psi^{\star}) \leq \frac{1}{1 - \frac{4\delta(\sigma^2 + C^2)}{n - w\delta}} \left( -R_n(\psi^{\star}) - \frac{1}{\delta}\log\int \exp[-\delta R_n(\psi)]\pi(\mathrm{d}\psi) + \frac{1}{\delta}\log\frac{1}{\varepsilon} \right).
\]
By Lemma 5.2, with P-probability at least 1 − ε,
\[
R(\hat\psi) - R(\psi^{\star}) \leq \frac{1}{1 - \frac{4\delta(\sigma^2 + C^2)}{n - w\delta}} \inf_{\rho \in \mathcal{M}^1_{+,\pi}(\Theta, \mathcal{T})} \left( \int R_n(\psi)\rho(\mathrm{d}\psi) - R_n(\psi^{\star}) + \frac{\mathrm{KL}(\rho, \pi) + \log\frac{1}{\varepsilon}}{\delta} \right).
\]
Finally, by Lemma 5.4, with P-probability at least 1 − 2ε,
\[
R(\hat\psi) - R(\psi^{\star}) \leq \frac{1 + \frac{4\delta(\sigma^2 + C^2)}{n - w\delta}}{1 - \frac{4\delta(\sigma^2 + C^2)}{n - w\delta}} \inf_{\rho \in \mathcal{M}^1_{+,\pi}(\Theta, \mathcal{T})} \left\{ \int R(\psi)\rho(\mathrm{d}\psi) - R(\psi^{\star}) + \frac{2}{1 + \frac{4\delta(\sigma^2 + C^2)}{n - w\delta}}\, \frac{\mathrm{KL}(\rho, \pi) + \log\frac{1}{\varepsilon}}{\delta} \right\}.
\]
Apply Lemma 5.5 with the Gibbs posterior probability ρδ defined by (5.1). With P-probability at least 1 − ε,
\[
\int R(\psi)\rho_{\delta}(\mathrm{d}\psi) - R(\psi^{\star}) \leq \frac{1}{1 - \frac{4\delta(\sigma^2 + C^2)}{n - w\delta}} \left( \int R_n(\psi)\rho_{\delta}(\mathrm{d}\psi) - R_n(\psi^{\star}) + \frac{\mathrm{KL}(\rho_{\delta}, \pi) + \log\frac{1}{\varepsilon}}{\delta} \right).
\]
Note that
\[
\mathrm{KL}(\rho_{\delta}, \pi) = \int \log\left( \frac{\exp[-\delta R_n(\psi)]}{\int \exp[-\delta R_n(\psi')]\pi(\mathrm{d}\psi')} \right) \rho_{\delta}(\mathrm{d}\psi) = -\delta \int R_n(\psi)\rho_{\delta}(\mathrm{d}\psi) - \log\int \exp[-\delta R_n(\psi)]\pi(\mathrm{d}\psi).
\]
By Lemma 5.2, with P-probability at least 1 − ε,
\[
\int R(\psi)\rho_{\delta}(\mathrm{d}\psi) - R(\psi^{\star}) \leq \frac{1}{1 - \frac{4\delta(\sigma^2 + C^2)}{n - w\delta}} \inf_{\rho \in \mathcal{M}^1_{+,\pi}(\Theta, \mathcal{T})} \left( \int R_n(\psi)\rho(\mathrm{d}\psi) - R_n(\psi^{\star}) + \frac{\mathrm{KL}(\rho, \pi) + \log\frac{1}{\varepsilon}}{\delta} \right).
\]
By Lemma 5.4, with P-probability at least 1 − 2ε,
\[
\int R(\psi)\rho_{\delta}(\mathrm{d}\psi) - R(\psi^{\star}) \leq \frac{1 + \frac{4\delta(\sigma^2 + C^2)}{n - w\delta}}{1 - \frac{4\delta(\sigma^2 + C^2)}{n - w\delta}} \inf_{\rho \in \mathcal{M}^1_{+,\pi}(\Theta, \mathcal{T})} \left\{ \int R(\psi)\rho(\mathrm{d}\psi) - R(\psi^{\star}) + \frac{2}{1 + \frac{4\delta(\sigma^2 + C^2)}{n - w\delta}}\, \frac{\mathrm{KL}(\rho, \pi) + \log\frac{1}{\varepsilon}}{\delta} \right\}.
\]
As R is a convex function, applying Jensen's inequality gives
\[
\int R(\psi)\rho_{\delta}(\mathrm{d}\psi) \geq R(\hat\psi_a).
\]
Finally, note that with δ = nℓ/[w + 4(σ² + C²)],
\[
\frac{1 + \frac{4\delta(\sigma^2 + C^2)}{n - w\delta}}{1 - \frac{4\delta(\sigma^2 + C^2)}{n - w\delta}} = 1 + \frac{8\ell(\sigma^2 + C^2)}{(1 - \ell)(w + 4\sigma^2 + 4C^2)},
\]
which yields (5.9). □
Proof of Theorem 2.1. Let ρ ∈ \(\mathcal{M}^1_{+,\pi}(\Theta, \mathcal{T})\). For any A ∈ T, note that \(\rho(A) = \sum_{m \in \mathcal{M}} \rho(A \cap \Theta_m)\), so that it suffices to consider measures ρm supported by a single model. By Theorem 5.1, with P-probability at least 1 − 2ε,
\[
R(\hat\psi) - R(\psi^{\star}) \leq D \inf_{m \in \mathcal{M}} \inf_{\rho_m \in \mathcal{M}^1_{+,\pi_m}(\Theta_m)} \left( \int R(\psi)\rho_m(\mathrm{d}\psi) - R(\psi^{\star}) + \frac{\mathrm{KL}(\rho_m, \pi) + \log\frac{1}{\varepsilon}}{n} \right). \quad (5.10)
\]
Note that for any such ρm and any m ∈ M,
\[
\mathrm{KL}(\rho_m, \pi) = \int \log\frac{\mathrm{d}\rho_m}{\mathrm{d}\pi_m}\, \mathrm{d}\rho_m + \int \log\frac{\mathrm{d}\pi_m}{\mathrm{d}\pi}\, \mathrm{d}\rho_m = \mathrm{KL}(\rho_m, \pi_m) + \log(1/\alpha)\sum_{j \in S(m)} m_j + \log\binom{p}{|S(m)|} + \log\frac{1 - \left(\frac{\alpha}{1-\alpha}\right)^{p+1}}{1 - \frac{\alpha}{1-\alpha}}.
\]
Next, using the elementary inequality \(\log\binom{n}{k} \leq k\log(ne/k)\) and that α/(1 − α) < 1,
\[
\mathrm{KL}(\rho_m, \pi) \leq \mathrm{KL}(\rho_m, \pi_m) + \log(1/\alpha)\sum_{j \in S(m)} m_j + |S(m)|\log\frac{pe}{|S(m)|} + \log\frac{1-\alpha}{1-2\alpha}.
\]
We restrict the set of all probabilities absolutely continuous with respect to πm
to uniform probabilities on the balls \(B^1_m(\theta, \zeta)\), with \(\theta \in B^1_m(0, C)\) and \(0 < \zeta \leq C - \|\theta\|_1\). Such a probability is denoted by \(\mu_{\theta,\zeta}\). With P-probability at least 1 − 2ε, this yields
\[
R(\hat\psi) - R(\psi^{\star}) \leq D \inf_{m \in \mathcal{M}} \inf_{\theta \in B^1_m(0,C)} \inf_{0 < \zeta \leq C - \|\theta\|_1} \left\{ \int R(\psi_{\bar\theta})\mu_{\theta,\zeta}(\mathrm{d}\bar\theta) - R(\psi^{\star}) + \frac{1}{n}\left( \mathrm{KL}(\mu_{\theta,\zeta}, \pi_m) + \log\frac{1}{\varepsilon} + |S(m)|\log\frac{p}{|S(m)|} + \sum_{j \in S(m)} m_j \right) \right\}.
\]
Next, note that
\[
\mathrm{KL}(\mu_{\theta,\zeta}, \pi_m) = \log\frac{V_m(C)}{V_m(\zeta)} = \left( \sum_{j \in S(m)} m_j \right) \log\frac{C}{\zeta}.
\]
Note also that
\[
\int R(\psi_{\bar\theta})\mu_{\theta,\zeta}(\mathrm{d}\bar\theta) = \int \mathbb{E}\big[Y_1 - \psi_{\bar\theta}(X_1)\big]^2 \mu_{\theta,\zeta}(\mathrm{d}\bar\theta) = \int \mathbb{E}\big[Y_1 - \psi_{\theta}(X_1) + \psi_{\theta}(X_1) - \psi_{\bar\theta}(X_1)\big]^2 \mu_{\theta,\zeta}(\mathrm{d}\bar\theta),
\]
and, expanding the square,
\[
\int \mathbb{E}\big[Y_1 - \psi_{\theta}(X_1) + \psi_{\theta}(X_1) - \psi_{\bar\theta}(X_1)\big]^2 \mu_{\theta,\zeta}(\mathrm{d}\bar\theta) = \int R(\psi_{\theta})\mu_{\theta,\zeta}(\mathrm{d}\bar\theta) + \int \mathbb{E}\big[\psi_{\theta}(X_1) - \psi_{\bar\theta}(X_1)\big]^2 \mu_{\theta,\zeta}(\mathrm{d}\bar\theta) + 2\int \mathbb{E}\big\{[Y_1 - \psi_{\theta}(X_1)][\psi_{\theta}(X_1) - \psi_{\bar\theta}(X_1)]\big\} \mu_{\theta,\zeta}(\mathrm{d}\bar\theta).
\]
Since \(\bar\theta \in B^1_m(\theta, \zeta)\),
\[
\int \mathbb{E}\big[\psi_{\theta}(X_1) - \psi_{\bar\theta}(X_1)\big]^2 \mu_{\theta,\zeta}(\mathrm{d}\bar\theta) = \int \mathbb{E}\left[ \sum_{j \in S(m)} \sum_{k=1}^{m_j} (\theta_{jk} - \bar\theta_{jk})\varphi_k(X_1^j) \right]^2 \mu_{\theta,\zeta}(\mathrm{d}\bar\theta) \leq \|\theta - \bar\theta\|_1^2 \max_k \|\varphi_k\|_{\infty}^2 \leq \zeta^2,
\]
and by the Fubini-Tonelli theorem,
\[
2\int \mathbb{E}\big\{[Y_1 - \psi_{\theta}(X_1)][\psi_{\theta}(X_1) - \psi_{\bar\theta}(X_1)]\big\} \mu_{\theta,\zeta}(\mathrm{d}\bar\theta) = 2\,\mathbb{E}\left[ [Y_1 - \psi_{\theta}(X_1)] \int [\psi_{\theta}(X_1) - \psi_{\bar\theta}(X_1)] \mu_{\theta,\zeta}(\mathrm{d}\bar\theta) \right] = 0,
\]
since \(\int \psi_{\bar\theta}(X_1)\mu_{\theta,\zeta}(\mathrm{d}\bar\theta) = \psi_{\theta}(X_1)\). Consequently, as \(\int R(\psi_{\theta})\mu_{\theta,\zeta}(\mathrm{d}\bar\theta) = R(\psi_{\theta})\), we get
\[
\int R(\psi_{\bar\theta})\mu_{\theta,\zeta}(\mathrm{d}\bar\theta) \leq R(\psi_{\theta}) + \zeta^2.
\]
So with P-probability at least 1 − 2ε,
\[
R(\hat\psi) - R(\psi^{\star}) \leq D \inf_{m \in \mathcal{M}} \inf_{\theta \in B^1_m(0,C)} \inf_{0 < \zeta \leq C - \|\theta\|_1} \left\{ R(\psi_{\theta}) + \zeta^2 - R(\psi^{\star}) + \frac{1}{n}\left( \log(C/\zeta)\sum_{j \in S(m)} m_j + \log\frac{1}{\varepsilon} + |S(m)|\log\frac{p}{|S(m)|} + \sum_{j \in S(m)} m_j \right) \right\}.
\]
The function \(t \mapsto t^2 + \log(C/t)\sum_{j \in S(m)} m_j/n\) is convex. Its minimum is unique and is reached at \(t = \big[\sum_{j \in S(m)} m_j/(2n)\big]^{1/2}\). Choosing ζ accordingly yields, with P-probability at least 1 − 2ε,
\[
R(\hat\psi) - R(\psi^{\star}) \leq D \inf_{m \in \mathcal{M}} \inf_{\theta \in B^1_m(0,C)} \left\{ R(\psi_{\theta}) - R(\psi^{\star}) + |S(m)|\frac{\log(p/|S(m)|)}{n} + \frac{\log n}{n}\sum_{j \in S(m)} m_j + \frac{\log(1/\varepsilon)}{n} \right\},
\]
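The claimed minimizer can be verified numerically; the snippet below (with arbitrary illustrative values for \(K = \sum_{j \in S(m)} m_j\), n and C, not taken from the paper) compares the closed form with a grid minimization.

```python
import numpy as np

# Check that t -> t**2 + (K/n)*log(C/t) is minimized at t = sqrt(K/(2n)).
# K, n and C below are arbitrary illustrative values.
K, n, C = 12.0, 1000.0, 5.0
f = lambda t: t ** 2 + (K / n) * np.log(C / t)
t_star = np.sqrt(K / (2 * n))

grid = np.linspace(1e-4, C, 200001)
t_grid = grid[np.argmin(f(grid))]
assert abs(t_grid - t_star) < 1e-3
```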
where D is a constant depending only on w, σ, C, ℓ and α. As the same inequality holds for ˆψa, this concludes the proof.
Proof of Theorem 2.2. Recall Theorem 2.1. A3 gives
\[
R(\psi_{\theta}) - R(\psi^{\star}) = \int \big(\psi_{\theta}(x) - \psi^{\star}(x)\big)^2\, \mathrm{d}P(x) \leq B \int \big(\psi_{\theta}(x) - \psi^{\star}(x)\big)^2\, \mathrm{d}x.
\]
For any m ∈ M, define
\[
\psi^{\star}_m = \sum_{j \in S^{\star}} \sum_{k=1}^{m_j} \theta^{\star}_{jk}\varphi_k.
\]
To proceed, we need to check that the projection of θ⋆ onto model m lies in \(B^1_m(0, C)\), i.e.,
\[
\sum_{j \in S^{\star}} \sum_{k=1}^{m_j} |\theta^{\star}_{jk}| \leq C.
\]
Using the Cauchy-Schwarz inequality, we get
\[
\sum_{j \in S^{\star}} \sum_{k=1}^{m_j} |\theta^{\star}_{jk}| = \sum_{j \in S^{\star}} \sum_{k=1}^{m_j} k^{r_j}|\theta^{\star}_{jk}|\, k^{-r_j} \leq \sum_{j \in S^{\star}} \sqrt{\sum_{k=1}^{m_j} k^{2r_j}(\theta^{\star}_{jk})^2}\, \sqrt{\sum_{k=1}^{m_j} k^{-2r_j}}.
\]
Since for any t ≥ 1, \(\sum_{k=1}^{m_j} k^{-2t} \leq \sum_{k=1}^{\infty} k^{-2} = \pi^2/6\), the previous inequality yields
\[
\sum_{j \in S^{\star}} \sum_{k=1}^{m_j} |\theta^{\star}_{jk}| \leq \frac{\pi}{\sqrt{6}} \sum_{j \in S^{\star}} \sqrt{d_j} \leq C.
\]
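The bound \(\sum_{k=1}^{m_j} k^{-2t} \leq \pi^2/6\) for t ≥ 1 is easily checked numerically; the truncation level below is an arbitrary illustration.

```python
import math

# Check that sum_{k=1}^{N} k**(-2t) <= pi**2/6 for t >= 1 (the bound used
# above); the truncation level N = 200000 is an arbitrary illustration.
for t in (1.0, 1.5, 3.0):
    s = sum(k ** (-2 * t) for k in range(1, 200001))
    assert s <= math.pi ** 2 / 6 + 1e-9
```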
Recalling (5.3) and A3, for any m ∈ M we may now write
\[
\inf_{\theta \in \Theta_m} R(\psi_{\theta}) - R(\psi^{\star}) \leq R(\psi^{\star}_m) - R(\psi^{\star}) \leq B \int \big(\psi^{\star}(x) - \psi^{\star}_m(x)\big)^2\, \mathrm{d}x = B \int \Bigg( \sum_{j \in S^{\star}} \sum_{k=m_j+1}^{\infty} \theta^{\star}_{jk}\varphi_k(x) \Bigg)^2 \mathrm{d}x.
\]
As \(\{\varphi_k\}_{k=1}^{\infty}\) forms an orthogonal basis,
\[
B \int \Bigg( \sum_{j \in S^{\star}} \sum_{k=m_j+1}^{\infty} \theta^{\star}_{jk}\varphi_k(x) \Bigg)^2 \mathrm{d}x = B \sum_{j \in S^{\star}} \sum_{k=m_j+1}^{\infty} (\theta^{\star}_{jk})^2 \leq B \sum_{j \in S^{\star}} d_j(1 + m_j)^{-2r_j},
\]
where the normalizing numerical factors are absorbed into the now generic constant B. As a consequence, with P-probability at least 1 − 2ε,
\[
R(\hat\psi) - R(\psi^{\star}) \leq D \inf_{m \in \mathcal{M}} \left[ B \sum_{j \in S^{\star}} \left\{ d_j(1 + m_j)^{-2r_j} + \frac{m_j}{n}\log n \right\} + |S^{\star}|\frac{\log(p/|S^{\star}|)}{n} + \frac{\log(1/\varepsilon)}{n} \right],
\]
where D is the same constant as in Theorem 2.1. For each j ∈ S⋆, the function \(t \mapsto d_j(1+t)^{-2r_j} + \frac{\log n}{n}t\) is convex and admits its minimum at
\[
t = \left( \frac{2r_j d_j n}{\log n} \right)^{\frac{1}{2r_j+1}} - 1.
\]
Accordingly, choosing \(m_j \sim \big(2r_j d_j n/\log n\big)^{\frac{1}{2r_j+1}} - 1\) yields that with P-probability at least 1 − 2ε,
\[
R(\hat\psi) - R(\psi^{\star}) \leq D \left[ \sum_{j \in S^{\star}} d_j^{\frac{1}{2r_j+1}} \left( \frac{\log n}{2nr_j} \right)^{\frac{2r_j}{2r_j+1}} + |S^{\star}| \frac{\log\frac{p}{|S^{\star}|}}{n} + \frac{\log(1/\varepsilon)}{n} \right],
\]
where D is a constant depending only on α, w, σ, C, ℓ and B, and that ends the proof. □
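The choice of mj rests on the minimizer computed above; the following snippet (with arbitrary illustrative values of dj, rj and n, not taken from the paper) checks the closed form against a grid search.

```python
import numpy as np

# Check the minimizer used for m_j: g(t) = d*(1+t)**(-2r) + (log(n)/n)*t
# attains its minimum at t = (2*r*d*n/log(n))**(1/(2*r+1)) - 1.
# d, r and n below are arbitrary illustrative values.
d, r, n = 2.0, 2.0, 10000.0
g = lambda t: d * (1 + t) ** (-2 * r) + (np.log(n) / n) * t
t_star = (2 * r * d * n / np.log(n)) ** (1 / (2 * r + 1)) - 1

grid = np.linspace(0.0, 50.0, 500001)
assert abs(grid[np.argmin(g(grid))] - t_star) < 1e-2
```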
Proof of Theorem 2.3. The proof is similar to the proof of Theorem 2.1. From (5.10), for any ρ ∈ \(\mathcal{M}^1_{+,\pi}(\Theta, \mathcal{T})\) and any m ∈ M,
\[
\mathrm{KL}(\rho_m, \pi) = \mathrm{KL}(\rho_m, \pi_m) + \log(1/\alpha)|S(m)| + \log\binom{p}{|S(m)|} + \log\frac{1 - \left( \frac{\alpha(1-\alpha^{K+1})}{1-\alpha} \right)^{p+1}}{1 - \frac{\alpha(1-\alpha^{K+1})}{1-\alpha}} + \sum_{j \in S(m)} \log\binom{K}{|S(m_j)|}.
\]
Using the elementary inequality \(\log\binom{n}{k} \leq k\log(ne/k)\) and that \(\frac{\alpha(1-\alpha^{K+1})}{1-\alpha} \in (0, 1)\) since α < 1/2,
\[
\mathrm{KL}(\rho_m, \pi) \leq \mathrm{KL}(\rho_m, \pi_m) + |S(m)|\left[ \log(1/\alpha) + \log\frac{pe}{|S(m)|} \right] + \sum_{j \in S(m)} |S(m_j)|\log\frac{Ke}{|S(m_j)|} + \log\frac{1-\alpha}{1-2\alpha}.
\]
Thus with P-probability at least 1 − 2ε,
\[
R(\hat\psi) - R(\psi^{\star}) \leq D \inf_{m \in \mathcal{M}} \inf_{\theta \in B^1_m(0,C)} \inf_{0 < \zeta \leq C - \|\theta\|_1} \left\{ R(\psi_{\theta}) + \zeta^2 - R(\psi^{\star}) + \frac{1}{n}\left[ \big(\log(C/\zeta) + \log K\big)\sum_{j \in S(m)} |S(m_j)| + \log\frac{1}{\varepsilon} + |S(m)|\log\frac{p}{|S(m)|} \right] \right\}.
\]
Hence with P-probability at least 1 − 2ε, each of R(ψ̂) − R(ψ⋆) and R(ψ̂a) − R(ψ⋆) is bounded by
\[
D \inf_{m \in \mathcal{M}} \inf_{\theta \in B^1_m(0,C)} \left\{ R(\psi_{\theta}) - R(\psi^{\star}) + |S(m)|\frac{\log(p/|S(m)|)}{n} + \frac{\log(nK)}{n}\sum_{j \in S(m)} |S(m_j)| + \frac{\log(1/\varepsilon)}{n} \right\},
\]
where D is a numerical constant depending upon w, σ, C, ℓ and α. □

Acknowledgements
The authors are grateful to Gérard Biau and Éric Moulines for their constant involvement, and to Christophe Giraud and Taiji Suzuki for valuable insights and comments. They also thank an anonymous referee and an associate editor for providing constructive and helpful remarks.
References
Alquier, P. (2006). Transductive and Inductive Adaptive Inference for Regression and Density Estimation. PhD thesis, Université Paris 6 - UPMC.
Alquier, P. (2008). PAC-Bayesian Bounds for Randomized Empirical Risk Minimizers. Mathematical Methods of Statistics 17 279–304. arXiv:0712.1698v3 MR2483458
Alquier, P. and Biau, G. (2011). Sparse Single-Index Model. To appear in Journal of Machine Learning Research. arXiv:1101.3229v2
Alquier, P. and Lounici, K. (2011). PAC-Bayesian Theorems for Sparse Regression Estimation with Exponential Weights. Electronic Journal of Statistics 5 127–145. MR2786484
Audibert, J.-Y. (2004a). Aggregated estimators and empirical complexity for least square regression. Annales de l'Institut Henri Poincaré: Probabilités et Statistiques 40 685–736. MR2096215
Audibert, J.-Y. (2004b). Théorie statistique de l'apprentissage: une approche PAC-bayésienne. PhD thesis, Université Paris 6 - UPMC.
Audibert, J.-Y. (2009). Fast learning rates in statistical inference through aggregation. The Annals of Statistics 37 1591–1646. MR2533466
Audibert, J.-Y. and Catoni, O. (2010). Robust linear regression through PAC-Bayesian truncation. Submitted. arXiv:1010.0072v2
Audibert, J.-Y. and Catoni, O. (2011). Robust linear least squares regression. The Annals of Statistics 39 2766–2794. MR2906886
Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics 37 1705–1732. arXiv:0801.1095v3 MR2533469
Bunea, F., Tsybakov, A. B. and Wegkamp, M. (2006). Aggregation and sparsity via ℓ1-penalized least squares. In Proceedings of the 19th Annual Conference on Computational Learning Theory 379–391. Springer-Verlag. MR2280619
Bühlmann, P. and van de Geer, S. A. (2011). Statistics for High-Dimensional Data. Springer. MR2807761
Carlin, B. P. and Chib, S. (1995). Bayesian Model Choice via Markov Chain Monte Carlo Methods. Journal of the Royal Statistical Society, Series B 57 473–484.
Catoni, O. (2004). Statistical Learning Theory and Stochastic Optimization. École d'Été de Probabilités de Saint-Flour XXXI – 2001. Springer. MR2163920
Catoni, O. (2007). PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Lecture Notes – Monograph Series 56. Institute of Mathematical Statistics. MR2483528
Dalalyan, A. S. and Tsybakov, A. B. (2008). Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning 72 39–61. arXiv:0803.2839v1
Dalalyan, A. S. and Tsybakov, A. B. (2012). Sparse Regression Learning by Aggregation and Langevin Monte-Carlo. Journal of Computer and System Sciences 78 1423–1443. arXiv:0903.1223v3 MR2926142
Giraud, C., Huet, S. and Verzelen, N. (2012). High-dimensional regression with unknown variance. To appear in Statistical Science. arXiv:1109.5587v2 MR2934907
Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82 711–732. MR1380810
Guedj, B. (2012). pacbpred: PAC-Bayesian Estimation and Prediction in Sparse Additive Models. R package version 0.92. http://cran.r-project.org/web/packages/pacbpred/index.html
Hans, C., Dobra, A. and West, M. (2007). Shotgun Stochastic Search for “Large p” Regression. Journal of the American Statistical Association 102 507–516. MR2370849
Hastie, T. and Tibshirani, R. (1986). Generalized Additive Models. Statistical Science 1 297–318. MR0858512
Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Monographs on Statistics and Applied Probability 43. Chapman & Hall/CRC.
Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning – Data Mining, Inference, and Prediction, Second ed. Springer. MR2722294
Härdle, W. K. (1990). Applied Nonparametric Regression. Cambridge University Press. MR1161622
Koltchinskii, V. and Yuan, M. (2010). Sparsity in multiple kernel learning. The Annals of Statistics 38 3660–3695. MR2766864
Marin, J.-M. and Robert, C. P. (2007). Bayesian Core: A Practical Approach to Computational Bayesian Statistics. Springer. MR2289769
Massart, P. (2007). Concentration Inequalities and Model Selection. École d'Été de Probabilités de Saint-Flour XXXIII – 2003. Springer. MR2319879
McAllester, D. A. (1999). Some PAC-Bayesian Theorems. Machine Learning 37 355–363. MR1811587
Meier, L., van de Geer, S. A. and Bühlmann, P. (2009). High-dimensional additive modeling. The Annals of Statistics 37 3779–3821. arXiv:0806.4115 MR2572443
Meinshausen, N. and Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics 37 246–270. arXiv:0806.0145v2 MR2488351
Meyn, S. and Tweedie, R. L. (2009). Markov Chains and Stochastic Stability, 2nd ed. Cambridge University Press. MR2509253
Petralias, A. (2010). Bayesian model determination and nonlinear threshold volatility models. PhD thesis, Athens University of Economics and Business.
Petralias, A. and Dellaportas, P. (2012). An MCMC model search algorithm for regression problems. Journal of Statistical Computation and Simulation 0 1–19.
R Core Team (2012). R: A Language and Environment for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. http://www.R-project.org/
Raskutti, G., Wainwright, M. J. and Yu, B. (2012). Minimax-optimal rates for sparse additive models over kernel classes via convex programming. Journal of Machine Learning Research 13 389–427. MR2913704
Ravikumar, P., Lafferty, J., Liu, H. and Wasserman, L. (2009). Sparse additive models. Journal of the Royal Statistical Society, Series B 71 1009–1030. arXiv:0711.4555v2 MR2750255
Rigollet, P. (2006). Inégalités d'oracle, agrégation et adaptation. PhD thesis, Université Paris 6 - UPMC.
Rigollet, P. and Tsybakov, A. B. (2012). Sparse estimation by exponential weighting. Statistical Science 27 558–575.
Shawe-Taylor, J. and Williamson, R. C. (1997). A PAC analysis of a Bayes estimator. In Proceedings of the 10th Annual Conference on Computational Learning Theory 2–9. ACM.
Stone, C. J. (1985). Additive regression and other nonparametric models. The Annals of Statistics 13 689–705. MR0790566
Suzuki, T. (2012). PAC-Bayesian Bound for Gaussian Process Regression and Multiple Kernel Additive Model. In Proceedings of the 25th Annual Conference on Learning Theory.
Suzuki, T. and Sugiyama, M. (2012). Fast learning rates of multiple kernel learning: trade-off between sparsity and smoothness. Submitted. arXiv:1203.0565v1
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B 58 267–288. MR1379242
Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer. MR2724359
van de Geer, S. A. (2008). High-dimensional generalized linear models and the Lasso. The Annals of Statistics 36 614–645. MR2396809