

HAL Id: hal-01934291, https://hal.inria.fr/hal-01934291, submitted on 25 Nov 2018.


The promises and pitfalls of Stochastic Gradient Langevin Dynamics

Nicolas Brosse, Alain Durmus, Éric Moulines

To cite this version: Nicolas Brosse, Alain Durmus, Éric Moulines. The promises and pitfalls of Stochastic Gradient Langevin Dynamics. NeurIPS 2018 (Advances in Neural Information Processing Systems 2018), Dec 2018, Montreal, Canada. hal-01934291.


The promises and pitfalls of Stochastic Gradient Langevin Dynamics

Nicolas Brosse, Éric Moulines

Centre de Mathématiques Appliquées, UMR 7641, Ecole Polytechnique, Palaiseau, France.

nicolas.brosse@polytechnique.edu, eric.moulines@polytechnique.edu

Alain Durmus

Ecole Normale Supérieure CMLA,

61 Av. du Président Wilson 94235 Cachan Cedex, France.

alain.durmus@cmla.ens-cachan.fr

Abstract

Stochastic Gradient Langevin Dynamics (SGLD) has emerged as a key MCMC algorithm for Bayesian learning from large scale datasets. While SGLD with decreasing step sizes converges weakly to the posterior distribution, the algorithm is often used with a constant step size in practice and has demonstrated successes in machine learning tasks. The current practice is to set the step size inversely proportional to $N$, where $N$ is the number of training samples. As $N$ becomes large, we show that the SGLD algorithm has an invariant probability measure which significantly departs from the target posterior and behaves like Stochastic Gradient Descent (SGD). This difference is inherently due to the high variance of the stochastic gradients. Several strategies have been suggested to reduce this effect; among them, SGLD Fixed Point (SGLDFP) uses carefully designed control variates to reduce the variance of the stochastic gradients. We show that SGLDFP gives approximate samples from the posterior distribution, with an accuracy comparable to the Langevin Monte Carlo (LMC) algorithm, for a computational cost sublinear in the number of data points. We provide a detailed analysis of the Wasserstein distances between LMC, SGLD, SGLDFP and SGD, and explicit expressions of the means and covariance matrices of their invariant distributions. Our findings are supported by numerical experiments.

1 Introduction

Most MCMC algorithms have not been designed to process huge sample sizes, a typical setting in machine learning. As a result, many classical MCMC methods fail in this context, because the mixing time becomes prohibitively long and the cost per iteration increases proportionally to the number of training samples $N$. In the standard Metropolis-Hastings algorithm, the computational cost comes from 1) the computation of the proposals and 2) the acceptance/rejection step. Several approaches to solve these issues have recently been proposed in machine learning and computational statistics.

Among them, the Stochastic Gradient Langevin Dynamics (SGLD) algorithm, introduced in [37], is a popular choice. This method is based on the Langevin Monte Carlo (LMC) algorithm proposed in [19, 20]. Standard versions of LMC require computing the gradient of the log-posterior at the current fit of the parameter, but avoid the accept/reject step. The LMC algorithm is a discretization of a continuous-time process, the overdamped Langevin diffusion, which leaves the target distribution $\pi$ invariant. To further reduce the computational cost, SGLD uses unbiased estimators of the


gradient of the log-posterior based on subsampling. This method has triggered a huge number of works, among others [1, 24, 2, 7, 9, 14, 27, 15, 4], and has been successfully applied to a range of state-of-the-art machine learning problems [30, 26].

The properties of SGLD with decreasing step sizes have been studied in [34]. The two key findings of this work are that 1) the SGLD algorithm converges weakly to the target distribution $\pi$, and 2) the optimal rate of convergence to equilibrium scales as $n^{-1/3}$, where $n$ is the number of iterations, see [34, Section 5]. However, in most applications, constant rather than decreasing step sizes are used, see [1, 9, 21, 25, 33, 36]. A natural question for the practical design of SGLD is the choice of the minibatch size. This size controls, on the one hand, the computational complexity of the algorithm per iteration and, on the other hand, the variance of the gradient estimator. Non-asymptotic bounds in Wasserstein distance between the marginal distribution of the SGLD iterates and the target distribution $\pi$ have been established in [11, 12]. These results highlight the cost of using stochastic gradients and show that, for a given precision $\varepsilon$ in Wasserstein distance, the computational cost of the plain SGLD algorithm does not improve over the LMC algorithm; Nagapetyan et al. [28] report similar results on the mean square error.

It has been suggested to use control variates to reduce the high variance of the stochastic gradients. For strongly log-concave models, Nagapetyan et al. [28] and Baker et al. [3] use the mode of the posterior distribution as a reference point and introduce the SGLDFP (Stochastic Gradient Langevin Dynamics Fixed Point) algorithm. They provide upper bounds on the mean square error and on the Wasserstein distance between the marginal distribution of the iterates of SGLDFP and the posterior distribution, and they show that the overall cost remains sublinear in the number of individual data points, up to a preprocessing step. Other control-variate methodologies are available for non-concave models in the form of SAGA-Langevin Dynamics and SVRG-Langevin Dynamics [15, 8], although a detailed analysis in Wasserstein distance of these algorithms is only available for strongly log-concave models [6].

In this paper, we provide further insights into the links between SGLD, SGLDFP, LMC and SGD (Stochastic Gradient Descent). In our analysis, the algorithms are used with a constant step size and the parameters are set to the standard values used in practice [1, 9, 21, 25, 33, 36]. The LMC, SGLD and SGLDFP algorithms define homogeneous Markov chains, each of which admits a unique stationary distribution used as a hopefully close proxy of $\pi$. The main contribution of this paper is to show that, while the invariant distributions of LMC and SGLDFP become closer to $\pi$ as the number of data points increases, the invariant measure of SGLD, in contrast, never comes close to the target distribution $\pi$ and is in fact very similar to the invariant measure of SGD.

In Section 3.1, we give upper bounds in Wasserstein distance of order 2 between the marginal distributions of the iterates of LMC and the Langevin diffusion, of SGLDFP and LMC, and of SGLD and SGD. We provide a lower bound on the Wasserstein distance between the marginal distributions of the iterates of SGLDFP and SGLD. In Section 3.2, we compare the means and covariance matrices of the invariant distributions of LMC, SGLDFP and SGLD with those of the target distribution $\pi$. Our claims are supported by numerical experiments in Section 4.

2 Preliminaries

Denote by $z = \{z_i\}_{i=1}^N$ the observations. We are interested in situations where the target distribution $\pi$ arises as the posterior in a Bayesian inference problem with prior density $\pi_0(\theta)$ and a large number $N \gg 1$ of i.i.d. observations $z_i$ with likelihoods $p(z_i \mid \theta)$. In this case, $\pi(\theta) \propto \pi_0(\theta)\prod_{i=1}^N p(z_i \mid \theta)$. We denote $U_i(\theta) = -\log(p(z_i \mid \theta))$ for $i \in \{1,\dots,N\}$, $U_0(\theta) = -\log(\pi_0(\theta))$, and $U = \sum_{i=0}^N U_i$. Under mild conditions, $\pi$ is the unique invariant probability measure of the Langevin Stochastic Differential Equation (SDE):

$$\mathrm{d}\theta_t = -\nabla U(\theta_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t , \qquad (1)$$

where $(B_t)_{t\ge0}$ is a $d$-dimensional Brownian motion. Based on this observation, Langevin Monte Carlo (LMC) is an MCMC algorithm that enables to sample (approximately) from $\pi$ using an Euler discretization of the Langevin SDE:

$$\theta_{k+1} = \theta_k - \gamma\,\nabla U(\theta_k) + \sqrt{2\gamma}\,Z_{k+1} , \qquad (2)$$


where $\gamma > 0$ is a constant step size and $(Z_k)_{k\ge1}$ is a sequence of i.i.d. standard $d$-dimensional Gaussian vectors. Discovered and popularised in the seminal works [19, 20, 32], LMC has recently received renewed attention [10, 17, 16, 12]. However, the cost of one iteration is $Nd$, which is prohibitively large for massive datasets. In order to scale up to the big data setting, Welling and Teh [37] suggested to replace $\nabla U$ with an unbiased estimate $\nabla U_0 + (N/p)\sum_{i\in S}\nabla U_i$, where $S$ is a minibatch of $\{1,\dots,N\}$, drawn with replacement, of size $p$. A single update of SGLD is then given for $k\in\mathbb{N}$ by

$$\theta_{k+1} = \theta_k - \gamma\Big\{\nabla U_0(\theta_k) + \frac{N}{p}\sum_{i\in S_{k+1}}\nabla U_i(\theta_k)\Big\} + \sqrt{2\gamma}\,Z_{k+1} . \qquad (3)$$

The idea of using only a fraction of the data points to compute an unbiased estimate of the gradient at each iteration comes from Stochastic Gradient Descent (SGD), a popular algorithm to minimize the potential $U$. SGD is very similar to SGLD: it is characterised by the same recursion as SGLD but without the Gaussian noise:

$$\theta_{k+1} = \theta_k - \gamma\Big\{\nabla U_0(\theta_k) + \frac{N}{p}\sum_{i\in S_{k+1}}\nabla U_i(\theta_k)\Big\} . \qquad (4)$$

Assuming for simplicity that $U$ has a minimizer $\theta^\star$, we can define a control-variates version of SGLD, SGLDFP, see [15, 8], given for $k\in\mathbb{N}$ by

$$\theta_{k+1} = \theta_k - \gamma\Big\{\nabla U_0(\theta_k) - \nabla U_0(\theta^\star) + \frac{N}{p}\sum_{i\in S_{k+1}}\big\{\nabla U_i(\theta_k) - \nabla U_i(\theta^\star)\big\}\Big\} + \sqrt{2\gamma}\,Z_{k+1} . \qquad (5)$$

It is worth mentioning that the objectives of the different algorithms presented so far are distinct. On the one hand, LMC, SGLD and SGLDFP are MCMC methods used to obtain approximate samples from the posterior distribution $\pi$. On the other hand, SGD is a stochastic optimization algorithm used to find an estimate of the mode $\theta^\star$ of the posterior distribution. In this paper, we focus on the fixed step-size SGLD algorithm and assess its ability to reliably sample from $\pi$. For that purpose, and to quantify precisely the relation between LMC, SGLD, SGLDFP and SGD (the four recursions are summarized side by side in the code sketch below), we make for simplicity the following assumptions on $U$.
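To fix ideas, here is a minimal NumPy sketch of one iteration of each recursion (2)-(5). It is an illustrative reading aid under the setting above, not the authors' implementation: the gradient callables `grad_U0` and `grad_Ui`, the mode estimate `theta_star` and the generator `rng` are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_grad(theta, grad_U0, grad_Ui, N, p):
    """Unbiased estimate of grad U(theta) from a size-p minibatch drawn with replacement."""
    S = rng.integers(0, N, size=p)
    return grad_U0(theta) + (N / p) * sum(grad_Ui(theta, i) for i in S)

def lmc_step(theta, grad_U, gamma):
    # Eq. (2): full gradient plus Gaussian noise.
    return theta - gamma * grad_U(theta) + np.sqrt(2 * gamma) * rng.standard_normal(theta.shape)

def sgld_step(theta, grad_U0, grad_Ui, N, p, gamma):
    # Eq. (3): stochastic gradient plus Gaussian noise.
    g = minibatch_grad(theta, grad_U0, grad_Ui, N, p)
    return theta - gamma * g + np.sqrt(2 * gamma) * rng.standard_normal(theta.shape)

def sgd_step(theta, grad_U0, grad_Ui, N, p, gamma):
    # Eq. (4): same recursion as SGLD, without the Gaussian noise.
    return theta - gamma * minibatch_grad(theta, grad_U0, grad_Ui, N, p)

def sgldfp_step(theta, theta_star, grad_U0, grad_Ui, N, p, gamma):
    # Eq. (5): control variates recenter every per-datum gradient at theta_star.
    S = rng.integers(0, N, size=p)
    g = grad_U0(theta) - grad_U0(theta_star)
    g += (N / p) * sum(grad_Ui(theta, i) - grad_Ui(theta_star, i) for i in S)
    return theta - gamma * g + np.sqrt(2 * gamma) * rng.standard_normal(theta.shape)
```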

H1. For all $i\in\{0,\dots,N\}$, $U_i$ is four times continuously differentiable and, for all $j\in\{2,3,4\}$, $\sup_{\theta\in\mathbb{R}^d}\big\|\mathrm{D}^jU_i(\theta)\big\| \le \tilde{L}$. In particular, for all $i\in\{0,\dots,N\}$, $U_i$ is $\tilde{L}$-gradient Lipschitz, i.e. for all $\theta_1,\theta_2\in\mathbb{R}^d$, $\|\nabla U_i(\theta_1)-\nabla U_i(\theta_2)\| \le \tilde{L}\|\theta_1-\theta_2\|$.

H2. $U$ is $m$-strongly convex, i.e. for all $\theta_1,\theta_2\in\mathbb{R}^d$, $\langle\nabla U(\theta_1)-\nabla U(\theta_2),\,\theta_1-\theta_2\rangle \ge m\|\theta_1-\theta_2\|^2$.

H3. For all $i\in\{0,\dots,N\}$, $U_i$ is convex.

Note that under H1, $U$ is four times continuously differentiable and, for $j\in\{2,3,4\}$, $\sup_{\theta\in\mathbb{R}^d}\big\|\mathrm{D}^jU(\theta)\big\| \le L$, with $L = (N+1)\tilde{L}$ and where $\big\|\mathrm{D}^jU(\theta)\big\| = \sup_{\|u_1\|\le1,\dots,\|u_j\|\le1}\mathrm{D}^jU(\theta)[u_1,\dots,u_j]$. In particular, $U$ is $L$-gradient Lipschitz. Furthermore, under H2, $U$ has a unique minimizer $\theta^\star$. In this paper, we focus on the asymptotic $N\to+\infty$. We assume that $\liminf_{N\to+\infty} N^{-1}m > 0$, which is a common assumption for the analysis of SGLD and SGLDFP [3, 6]. In practice [1, 9, 21, 25, 33, 36], $\gamma$ is of order $1/N$, and we adopt this convention in this article.

For a practical implementation of SGLDFP, an estimator $\hat\theta$ of $\theta^\star$ is necessary. The theoretical analysis and the bounds remain unchanged if, instead of considering SGLDFP centered w.r.t. $\theta^\star$, we study SGLDFP centered w.r.t. an estimator $\hat\theta$ satisfying $\mathbb{E}[\|\hat\theta-\theta^\star\|^2] = O(1/N)$. Such an estimator $\hat\theta$ can be computed using, for example, SGD with decreasing step sizes, see [29, eq. (2.8)] and [3, Section 3.4], for a computational cost linear in $N$; a sketch of this preprocessing step is given below.
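The following sketch is illustrative only: the step-size schedule $\gamma_k \propto k^{-3/4}$ with Polyak-Ruppert averaging is one standard choice in the spirit of [29], and all names and constants are placeholders, not the authors' code.

```python
import numpy as np

def sgd_mode_estimate(grad_U0, grad_Ui, N, d, p=32, c=1e-2, seed=0):
    """Estimate theta_star by one pass of SGD with decreasing steps and averaging."""
    rng = np.random.default_rng(seed)
    theta, avg = np.zeros(d), np.zeros(d)
    for k in range(max(1, N // p)):            # one pass over the data: cost linear in N
        S = rng.integers(0, N, size=p)
        g = grad_U0(theta) + (N / p) * sum(grad_Ui(theta, i) for i in S)
        theta = theta - c * (k + 1) ** (-0.75) * g
        avg += (theta - avg) / (k + 1)         # running Polyak-Ruppert average
    return avg
```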

3 Results

3.1 Analysis in Wasserstein distance

Before presenting the results, some notation and elements of Markov chain theory have to be introduced. Denote by $\mathcal{P}_2(\mathbb{R}^d)$ the set of probability measures with finite second moment and by $\mathcal{B}(\mathbb{R}^d)$ the Borel $\sigma$-algebra of $\mathbb{R}^d$. For $\lambda,\nu\in\mathcal{P}_2(\mathbb{R}^d)$, define the Wasserstein distance of order 2 by

$$W_2(\lambda,\nu) = \inf_{\xi\in\Pi(\lambda,\nu)}\left(\int_{\mathbb{R}^d\times\mathbb{R}^d}\|\theta-\vartheta\|^2\,\xi(\mathrm{d}\theta,\mathrm{d}\vartheta)\right)^{1/2},$$

where $\Pi(\lambda,\nu)$ is the set of probability measures $\xi$ on $\mathcal{B}(\mathbb{R}^d)\otimes\mathcal{B}(\mathbb{R}^d)$ satisfying, for all $A\in\mathcal{B}(\mathbb{R}^d)$, $\xi(A\times\mathbb{R}^d) = \lambda(A)$ and $\xi(\mathbb{R}^d\times A) = \nu(A)$.

A Markov kernel $R$ on $\mathbb{R}^d\times\mathcal{B}(\mathbb{R}^d)$ is a mapping $R:\mathbb{R}^d\times\mathcal{B}(\mathbb{R}^d)\to[0,1]$ satisfying the following conditions: (i) for every $\theta\in\mathbb{R}^d$, $R(\theta,\cdot):A\mapsto R(\theta,A)$ is a probability measure on $\mathcal{B}(\mathbb{R}^d)$; (ii) for every $A\in\mathcal{B}(\mathbb{R}^d)$, $R(\cdot,A):\theta\mapsto R(\theta,A)$ is a measurable function. For any probability measure $\lambda$ on $\mathcal{B}(\mathbb{R}^d)$, we define $\lambda R$ for all $A\in\mathcal{B}(\mathbb{R}^d)$ by $\lambda R(A) = \int_{\mathbb{R}^d}\lambda(\mathrm{d}\theta)R(\theta,A)$. For all $k\in\mathbb{N}$, we define the Markov kernel $R^k$ recursively by $R^1 = R$ and, for all $\theta\in\mathbb{R}^d$ and $A\in\mathcal{B}(\mathbb{R}^d)$, $R^{k+1}(\theta,A) = \int_{\mathbb{R}^d}R^k(\theta,\mathrm{d}\vartheta)R(\vartheta,A)$. A probability measure $\bar\pi$ is invariant for $R$ if $\bar\pi R = \bar\pi$.

The LMC, SGLD, SGD and SGLDFP algorithms defined respectively by (2), (3), (4) and (5) are homogeneous Markov chains with Markov kernels denoted $R_{\mathrm{LMC}}$, $R_{\mathrm{SGLD}}$, $R_{\mathrm{SGD}}$ and $R_{\mathrm{FP}}$. To avoid overloading the notation, the dependence on $\gamma$ and $N$ is implicit.

Lemma 1. Assume H1, H2 and H3. For any step size $\gamma\in(0,2/L)$, $R_{\mathrm{SGLD}}$ (respectively $R_{\mathrm{LMC}}$, $R_{\mathrm{SGD}}$, $R_{\mathrm{FP}}$) has a unique invariant measure $\pi_{\mathrm{SGLD}}\in\mathcal{P}_2(\mathbb{R}^d)$ (respectively $\pi_{\mathrm{LMC}}$, $\pi_{\mathrm{SGD}}$, $\pi_{\mathrm{FP}}$). In addition, for all $\gamma\in(0,1/L]$, $\theta\in\mathbb{R}^d$ and $k\in\mathbb{N}$,

$$W_2^2\big(R_{\mathrm{SGLD}}^k(\theta,\cdot),\,\pi_{\mathrm{SGLD}}\big) \le (1-m\gamma)^k \int_{\mathbb{R}^d}\|\theta-\vartheta\|^2\,\pi_{\mathrm{SGLD}}(\mathrm{d}\vartheta),$$

and the same inequality holds for LMC, SGD and SGLDFP.

Proof. The proof is postponed to Appendix A.1.
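As an immediate consequence (a worked example, not stated in the paper): since $(1-m\gamma)^k \le \mathrm{e}^{-km\gamma}$, the bound of Lemma 1 gives $W_2(R_{\mathrm{SGLD}}^k(\theta,\cdot),\pi_{\mathrm{SGLD}}) \le \varepsilon$ as soon as

$$k \ \ge\ \frac{1}{m\gamma}\,\log\!\left(\frac{\int_{\mathbb{R}^d}\|\theta-\vartheta\|^2\,\pi_{\mathrm{SGLD}}(\mathrm{d}\vartheta)}{\varepsilon^2}\right).$$

With $\gamma = \eta/N$ and $m = \Theta(N)$, the product $m\gamma$ is $\Theta(1)$, so this iteration count does not grow with $N$.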

Under H1, (1) has a unique strong solution $(\theta_t)_{t\ge0}$ for every initial condition $\theta_0\in\mathbb{R}^d$ [23, Chapter 5, Theorems 2.5 and 2.9]. Denote by $(P_t)_{t\ge0}$ the semigroup of the Langevin diffusion, defined for all $\theta_0\in\mathbb{R}^d$ and $A\in\mathcal{B}(\mathbb{R}^d)$ by $P_t(\theta_0,A) = \mathbb{P}(\theta_t\in A)$.

Theorem 2. Assume H1, H2 and H3. For all $\gamma\in(0,1/L]$, $\lambda,\mu\in\mathcal{P}_2(\mathbb{R}^d)$ and $n\in\mathbb{N}$, we have the following upper bounds in Wasserstein distance between

i) LMC and SGLDFP,

$$W_2^2(\lambda R_{\mathrm{LMC}}^n,\,\mu R_{\mathrm{FP}}^n) \le (1-m\gamma)^n W_2^2(\lambda,\mu) + \frac{2L^2\gamma d}{pm^2} + \frac{L^2\gamma^2}{p}\,n(1-m\gamma)^{n-1}\int_{\mathbb{R}^d}\|\vartheta-\theta^\star\|^2\,\mu(\mathrm{d}\vartheta),$$

ii) the Langevin diffusion and LMC,

$$W_2^2(\lambda R_{\mathrm{LMC}}^n,\,\mu P_{n\gamma}) \le 2\Big(1-\frac{mL\gamma}{m+L}\Big)^n W_2^2(\lambda,\mu) + d\gamma\,\frac{m+L}{2m}\Big(3+\frac{L}{m}\Big)\Big(\frac{13}{6}+\frac{L}{m}\Big) + n\,\mathrm{e}^{-(m/2)\gamma(n-1)}L^3\gamma^3\Big(1+\frac{m+L}{2m}\Big)\int_{\mathbb{R}^d}\|\vartheta-\theta^\star\|^2\,\mu(\mathrm{d}\vartheta),$$

iii) SGLD and SGD,

$$W_2^2(\lambda R_{\mathrm{SGLD}}^n,\,\mu R_{\mathrm{SGD}}^n) \le (1-m\gamma)^n W_2^2(\lambda,\mu) + 2d/m\,.$$

Proof. The proof is postponed to Appendix A.2.

Corollary 3. Assume H1, H2 and H3. Set $\gamma = \eta/N$ with $\eta\in(0,1/(2\tilde{L})]$ and assume that $\liminf_{N\to\infty} mN^{-1} > 0$. Then,

i) for all $n\in\mathbb{N}$, we get $W_2\big(R_{\mathrm{LMC}}^n(\theta^\star,\cdot),\,R_{\mathrm{FP}}^n(\theta^\star,\cdot)\big) = \sqrt{d\eta}\;O(N^{-1/2})$, and $W_2(\pi_{\mathrm{LMC}},\pi_{\mathrm{FP}}) = \sqrt{d\eta}\;O(N^{-1/2})$, $W_2(\pi_{\mathrm{LMC}},\pi) = \sqrt{d\eta}\;O(N^{-1/2})$;

ii) for all $n\in\mathbb{N}$, we get $W_2\big(R_{\mathrm{SGLD}}^n(\theta^\star,\cdot),\,R_{\mathrm{SGD}}^n(\theta^\star,\cdot)\big) = \sqrt{d}\;O(N^{-1/2})$ and $W_2(\pi_{\mathrm{SGLD}},\pi_{\mathrm{SGD}}) = \sqrt{d}\;O(N^{-1/2})$.

Theorem 2 implies that the number of iterations necessary to obtain a sample $\varepsilon$-close from $\pi$ in Wasserstein distance is the same for LMC and SGLDFP. However, for LMC the cost of one iteration is $Nd$, which is larger than $pd$, the cost of one iteration of SGLDFP. In other words, to obtain an approximate sample from the target distribution at an accuracy $O(1/\sqrt{N})$ in 2-Wasserstein distance, LMC requires $\Theta(N)$ operations, in contrast with SGLDFP, which needs only $\Theta(1)$ operations.

We show in the sequel that $W_2(\pi_{\mathrm{FP}},\pi_{\mathrm{SGLD}}) = \Omega(1)$ when $N\to+\infty$ in the case of a Bayesian linear regression, where for two sequences $(u_N)_{N\ge1}$, $(v_N)_{N\ge1}$, $u_N = \Omega(v_N)$ if $\liminf_{N\to+\infty} u_N/v_N > 0$. The dataset is $z = \{(y_i,x_i)\}_{i=1}^N$, where $y_i\in\mathbb{R}$ is the response variable and $x_i\in\mathbb{R}^d$ are the covariates. Set $y = (y_1,\dots,y_N)\in\mathbb{R}^N$ and let $X\in\mathbb{R}^{N\times d}$ be the matrix of covariates whose $i$-th row is $x_i$. Let $\sigma_y^2,\sigma_\theta^2 > 0$. For $i\in\{1,\dots,N\}$, the conditional distribution of $y_i$ given $x_i$ is Gaussian with mean $x_i^T\theta$ and variance $\sigma_y^2$. The prior $\pi_0(\theta)$ is a normal distribution with mean $0$ and covariance matrix $\sigma_\theta^2\,\mathrm{Id}$. The posterior distribution is then $\pi(\theta) \propto \exp\big(-(1/2)(\theta-\theta^\star)^T\Sigma(\theta-\theta^\star)\big)$, where

$$\Sigma = \mathrm{Id}/\sigma_\theta^2 + X^TX/\sigma_y^2 \qquad\text{and}\qquad \theta^\star = \Sigma^{-1}(X^Ty)/\sigma_y^2\,.$$

We assume that $X^TX \succeq m\,\mathrm{Id}$, with $\liminf_{N\to+\infty} m/N > 0$. Let $S$ be a minibatch of $\{1,\dots,N\}$, drawn with replacement, of size $p$. Define

$$\nabla U_0(\theta) + (N/p)\sum_{i\in S}\nabla U_i(\theta) = \Sigma(\theta-\theta^\star) + \rho(S)(\theta-\theta^\star) + \xi(S)\,,$$

where

$$\rho(S) = \frac{\mathrm{Id}}{\sigma_\theta^2} + \frac{N}{p\sigma_y^2}\sum_{i\in S}x_ix_i^T - \Sigma\,, \qquad \xi(S) = \frac{\theta^\star}{\sigma_\theta^2} + \frac{N}{p\sigma_y^2}\sum_{i\in S}\big(x_i^T\theta^\star - y_i\big)x_i\,. \qquad (6)$$

$\rho(S)(\theta-\theta^\star)$ is the multiplicative part of the noise in the stochastic gradient, and $\xi(S)$ the additive part, which does not depend on $\theta$. The additive part of the stochastic gradient for SGLDFP disappears, since

$$\nabla U_0(\theta) - \nabla U_0(\theta^\star) + (N/p)\sum_{i\in S}\big\{\nabla U_i(\theta) - \nabla U_i(\theta^\star)\big\} = \Sigma(\theta-\theta^\star) + \rho(S)(\theta-\theta^\star)\,.$$
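The decomposition (6) can be checked numerically. The following sketch uses hypothetical data and placeholder names (it is not the paper's code); it builds $\Sigma$, $\theta^\star$, $\rho(S)$ and $\xi(S)$ and verifies that they reproduce the SGLD stochastic gradient exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, p = 1000, 2, 32
theta_true = np.array([1.0, -0.5])
X = rng.standard_normal((N, d))
sigma_y2, sigma_theta2 = 1.0, 1.0
y = X @ theta_true + np.sqrt(sigma_y2) * rng.standard_normal(N)

Sigma = np.eye(d) / sigma_theta2 + X.T @ X / sigma_y2
theta_star = np.linalg.solve(Sigma, X.T @ y / sigma_y2)    # posterior mode

S = rng.integers(0, N, size=p)                             # minibatch with replacement
rho_S = (np.eye(d) / sigma_theta2
         + (N / p) * X[S].T @ X[S] / sigma_y2 - Sigma)     # multiplicative noise
xi_S = (theta_star / sigma_theta2
        + (N / p) * X[S].T @ (X[S] @ theta_star - y[S]) / sigma_y2)  # additive noise

theta = rng.standard_normal(d)                             # arbitrary test point
g_stoch = theta / sigma_theta2 + (N / p) * X[S].T @ (X[S] @ theta - y[S]) / sigma_y2
g_decomp = Sigma @ (theta - theta_star) + rho_S @ (theta - theta_star) + xi_S
assert np.allclose(g_stoch, g_decomp)                      # eq. (6) holds exactly
```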

In this setting, the following theorem shows that the Wasserstein distance between the marginal distributions of the iterates of SGLD and SGLDFP, and the distance between $\pi_{\mathrm{SGLD}}$ and $\pi$, are of order $\Omega(1)$ when $N\to+\infty$. This is in sharp contrast with the results of Corollary 3, where the Wasserstein distances tend to $0$ as $N\to+\infty$ at a rate $N^{-1/2}$. For simplicity, we state the result for $d=1$.

Theorem 4. Consider the case of the Bayesian linear regression in dimension 1.

i) For all $\gamma\in\big(0,\,\Sigma^{-1}\{1+N/(p\sum_{i=1}^N x_i^2)\}^{-1}\big]$ and $n\in\mathbb{N}$,

$$\Big(\frac{1-\mu}{1-\mu^n}\Big)^{1/2} W_2\big(R_{\mathrm{SGLD}}^n(\theta^\star,\cdot),\,R_{\mathrm{FP}}^n(\theta^\star,\cdot)\big) \ \ge\ \left\{2\gamma + \frac{\gamma^2 N}{p}\sum_{i=1}^N\Big(\frac{(x_i\theta^\star - y_i)x_i}{\sigma_y^2} + \frac{\theta^\star}{N\sigma_\theta^2}\Big)^2\right\}^{1/2} - \sqrt{2\gamma}\,,$$

where $\mu\in(0,\,1-\gamma\Sigma]$.

ii) Set $\gamma = \eta/N$ with $\eta\in\big(0,\,\liminf_{N\to+\infty} N\Sigma^{-1}\{1+N/(p\sum_{i=1}^N x_i^2)\}^{-1}\big]$ and assume that $\liminf_{N\to+\infty} N^{-1}\sum_{i=1}^N x_i^2 > 0$. We have $W_2(\pi_{\mathrm{SGLD}},\pi) = \Omega(1)$.

Proof. The proof is postponed to Appendix A.3.


The study in Wasserstein distance emphasizes the different behaviors of the LMC, SGLDFP, SGLD and SGD algorithms. When $N\to+\infty$ and $\liminf_{N\to+\infty} m/N > 0$, the marginal distributions of the $k$-th iterates of the LMC and SGLDFP algorithms are very close to the Langevin diffusion, and their invariant probability measures $\pi_{\mathrm{LMC}}$ and $\pi_{\mathrm{FP}}$ are similar to the posterior distribution of interest $\pi$. In contrast, the marginal distributions of the $k$-th iterates of SGLD and SGD are analogous, and their invariant probability measures $\pi_{\mathrm{SGLD}}$ and $\pi_{\mathrm{SGD}}$ are very different from $\pi$ when $N\to+\infty$.

Note that, to fix the asymptotic bias of SGLD, other strategies can be considered: choosing a step size $\gamma\propto N^{-\beta}$ with $\beta>1$, and/or increasing the batch size $p\propto N^{\alpha}$ with $\alpha\in[0,1]$. Using the Wasserstein (of order 2) bounds of SGLD w.r.t. the target distribution $\pi$, see e.g. [12, Theorem 3], $\alpha+\beta$ should be equal to $2$ to guarantee the $\varepsilon$-accuracy of SGLD in Wasserstein distance for a cost proportional to $N$ (up to logarithmic terms), independently of the choice of $\alpha$ and $\beta$. For instance, keeping the standard step size $\gamma\propto N^{-1}$ ($\beta=1$) then forces minibatches of size $p\propto N$ ($\alpha=1$), so that each iteration already costs the same order as a full-gradient LMC step.

3.2 Mean and covariance matrix of $\pi_{\mathrm{LMC}}$, $\pi_{\mathrm{FP}}$, $\pi_{\mathrm{SGLD}}$

We now establish an expansion of the mean and second moments of $\pi_{\mathrm{LMC}}$, $\pi_{\mathrm{FP}}$, $\pi_{\mathrm{SGLD}}$ and $\pi_{\mathrm{SGD}}$ as $N\to+\infty$, and compare them. We first give an expansion of the mean and second moments of $\pi$ as $N\to+\infty$.

Proposition 5. Assume H1 and H2 and that $\liminf_{N\to+\infty} N^{-1}m > 0$. Then,

$$\int_{\mathbb{R}^d}(\theta-\theta^\star)^{\otimes2}\,\pi(\mathrm{d}\theta) = \nabla^2U(\theta^\star)^{-1} + O_{N\to+\infty}(N^{-3/2})\,,$$

$$\int_{\mathbb{R}^d}\theta\,\pi(\mathrm{d}\theta) - \theta^\star = -(1/2)\,\nabla^2U(\theta^\star)^{-1}\mathrm{D}^3U(\theta^\star)\big[\nabla^2U(\theta^\star)^{-1}\big] + O_{N\to+\infty}(N^{-3/2})\,.$$

Proof. The proof is postponed to Appendix B.1.
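A quick sanity check (our example, not the paper's): when $\pi$ is Gaussian, as in the Bayesian linear regression of Section 3.1, $\mathrm{D}^3U(\theta^\star) = 0$ and $\nabla^2U(\theta^\star) = \Sigma$, so both expansions are exact with no remainder: the posterior mean is $\theta^\star$ and the covariance is $\Sigma^{-1}$.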

Contrary to the Bayesian linear regression, where the covariance matrices can be computed explicitly, see Appendix C, only approximate expressions are available in the general case. For that purpose, we consider two types of asymptotics. For LMC and SGLDFP, we assume that $\liminf_{N\to+\infty} m/N > 0$ and $\gamma = \eta/N$ for $\eta>0$, and we develop an asymptotic expansion as $N\to+\infty$. Combining Proposition 5 and Theorem 6, we show that the biases and covariance matrices of $\pi_{\mathrm{LMC}}$ and $\pi_{\mathrm{FP}}$ are of order $\Theta(1/N)$ with remainder terms of the form $O(N^{-3/2})$, where for two sequences $(u_N)_{N\ge1}$, $(v_N)_{N\ge1}$, $u = \Theta(v)$ if $0 < \liminf_{N\to+\infty} u_N/v_N \le \limsup_{N\to+\infty} u_N/v_N < +\infty$.

Regarding SGD and SGLD, we do not have such concentration properties when $N\to+\infty$, because of the high variance of the stochastic gradients. The biases and covariance matrices of SGLD and SGD are of order $\Theta(1)$ when $N\to+\infty$. To obtain approximate expressions of these quantities, we set $\gamma = \eta/N$, where $\eta>0$ is the step size for the gradient descent over the normalized potential $U/N$. Assuming that $m$ is proportional to $N$ and $N\ge1/\eta$, we show by combining Proposition 5 and Theorem 7 that the biases and covariance matrices of SGLD and SGD are of order $\Theta(\eta)$ with remainder terms of the form $O(\eta^{3/2})$ when $\eta\to0$.

Before giving the results associated with $\pi_{\mathrm{LMC}}$, $\pi_{\mathrm{FP}}$, $\pi_{\mathrm{SGLD}}$ and $\pi_{\mathrm{SGD}}$, we need to introduce some notation. For any matrices $A_1,A_2\in\mathbb{R}^{d\times d}$, we denote by $A_1\otimes A_2$ the Kronecker product, seen as the operator on $\mathbb{R}^{d\times d}$ defined by $A_1\otimes A_2 : Q\mapsto A_1QA_2$, and we set $A^{\otimes2} = A\otimes A$. Besides, for all $\theta_1,\theta_2\in\mathbb{R}^d$, we denote by $\theta_1\otimes\theta_2\in\mathbb{R}^{d\times d}$ the tensor product of $\theta_1$ and $\theta_2$. For any matrix $A\in\mathbb{R}^{d\times d}$, $\mathrm{Tr}(A)$ is the trace of $A$.

Define $\mathrm{K}:\mathbb{R}^{d\times d}\to\mathbb{R}^{d\times d}$ for all $A\in\mathbb{R}^{d\times d}$ by

$$\mathrm{K}(A) = \frac{N}{p}\sum_{i=1}^N\bigg(\nabla^2U_i(\theta^\star) - \frac1N\sum_{j=1}^N\nabla^2U_j(\theta^\star)\bigg)^{\otimes2} A\,, \qquad (7)$$

and $\mathrm{H},\mathrm{G}:\mathbb{R}^{d\times d}\to\mathbb{R}^{d\times d}$ by

$$\mathrm{H} = \nabla^2U(\theta^\star)\otimes\mathrm{Id} + \mathrm{Id}\otimes\nabla^2U(\theta^\star) - \gamma\,\nabla^2U(\theta^\star)\otimes\nabla^2U(\theta^\star)\,, \qquad (8)$$
$$\mathrm{G} = \nabla^2U(\theta^\star)\otimes\mathrm{Id} + \mathrm{Id}\otimes\nabla^2U(\theta^\star) - \gamma\,\big(\nabla^2U(\theta^\star)\otimes\nabla^2U(\theta^\star) + \mathrm{K}\big)\,. \qquad (9)$$


$\mathrm{K}$, and $\mathrm{H}$ and $\mathrm{G}$, can be interpreted as perturbations of $\nabla^2U(\theta^\star)^{\otimes2}$ and $\nabla^2U(\theta^\star)$, respectively, due to the noise of the stochastic gradients. It can be shown, see Appendix B.2, that for $\gamma$ small enough, $\mathrm{H}$ and $\mathrm{G}$ are invertible.

Theorem 6. Assume H1, H2 and H3. Set $\gamma = \eta/N$ and assume that $\liminf_{N\to+\infty} N^{-1}m > 0$. There exists an (explicit) $\eta_0$, independent of $N$, such that for all $\eta\in(0,\eta_0)$,

$$\int_{\mathbb{R}^d}(\theta-\theta^\star)^{\otimes2}\,\pi_{\mathrm{LMC}}(\mathrm{d}\theta) = \mathrm{H}^{-1}(2\,\mathrm{Id}) + O_{N\to+\infty}(N^{-3/2})\,, \qquad (10)$$
$$\int_{\mathbb{R}^d}(\theta-\theta^\star)^{\otimes2}\,\pi_{\mathrm{FP}}(\mathrm{d}\theta) = \mathrm{G}^{-1}(2\,\mathrm{Id}) + O_{N\to+\infty}(N^{-3/2})\,, \qquad (11)$$

and

$$\int_{\mathbb{R}^d}\theta\,\pi_{\mathrm{LMC}}(\mathrm{d}\theta) - \theta^\star = -\nabla^2U(\theta^\star)^{-1}\mathrm{D}^3U(\theta^\star)\big[\mathrm{H}^{-1}\mathrm{Id}\big] + O_{N\to+\infty}(N^{-3/2})\,,$$
$$\int_{\mathbb{R}^d}\theta\,\pi_{\mathrm{FP}}(\mathrm{d}\theta) - \theta^\star = -\nabla^2U(\theta^\star)^{-1}\mathrm{D}^3U(\theta^\star)\big[\mathrm{G}^{-1}\mathrm{Id}\big] + O_{N\to+\infty}(N^{-3/2})\,.$$

Proof. The proof is postponed to Appendix B.2.2.

Theorem 7. Assume H1, H2 and H3. Set $\gamma = \eta/N$ and assume that $\liminf_{N\to+\infty} N^{-1}m > 0$. There exists an (explicit) $\eta_0$, independent of $N$, such that for all $\eta\in(0,\eta_0)$ and $N\ge1/\eta$,

$$\int_{\mathbb{R}^d}(\theta-\theta^\star)^{\otimes2}\,\pi_{\mathrm{SGLD}}(\mathrm{d}\theta) = \mathrm{G}^{-1}\big\{2\,\mathrm{Id} + (\eta/p)\,\mathrm{M}\big\} + O_{\eta\to0}(\eta^{3/2})\,, \qquad (12)$$
$$\int_{\mathbb{R}^d}(\theta-\theta^\star)^{\otimes2}\,\pi_{\mathrm{SGD}}(\mathrm{d}\theta) = (\eta/p)\,\mathrm{G}^{-1}\mathrm{M} + O_{\eta\to0}(\eta^{3/2})\,, \qquad (13)$$

and

$$\int_{\mathbb{R}^d}\theta\,\pi_{\mathrm{SGLD}}(\mathrm{d}\theta) - \theta^\star = -(1/2)\,\nabla^2U(\theta^\star)^{-1}\mathrm{D}^3U(\theta^\star)\big[\mathrm{G}^{-1}\{2\,\mathrm{Id} + (\eta/p)\,\mathrm{M}\}\big] + O_{\eta\to0}(\eta^{3/2})\,,$$
$$\int_{\mathbb{R}^d}\theta\,\pi_{\mathrm{SGD}}(\mathrm{d}\theta) - \theta^\star = -(\eta/(2p))\,\nabla^2U(\theta^\star)^{-1}\mathrm{D}^3U(\theta^\star)\big[\mathrm{G}^{-1}\mathrm{M}\big] + O_{\eta\to0}(\eta^{3/2})\,,$$

where

$$\mathrm{M} = \sum_{i=1}^N\bigg(\nabla U_i(\theta^\star) - \frac1N\sum_{j=1}^N\nabla U_j(\theta^\star)\bigg)^{\otimes2}\,, \qquad (14)$$

and $\mathrm{G}$ is defined in (9).

Proof. The proof is postponed to Appendix B.2.2.

Note that this result implies that the bias of the mean and the covariance matrices of $\pi_{\mathrm{SGLD}}$ and $\pi_{\mathrm{SGD}}$ stay lower bounded by a positive constant for any fixed $\eta>0$ as $N\to+\infty$. In Appendix D, a figure illustrates the results of Theorem 6 and Theorem 7 in the asymptotic $N\to+\infty$.
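To make the fixed-point formulas of Theorems 6 and 7 concrete, here is a hypothetical numerical check for the Bayesian linear regression example of Section 3.1, where all Hessians are explicit. It evaluates $\mathrm{H}^{-1}(2\,\mathrm{Id})$ and $\mathrm{G}^{-1}\{2\,\mathrm{Id} + (\eta/p)\mathrm{M}\}$ by vectorizing the operators; with row-major vec and symmetric $B$, $\mathrm{vec}(BQB) = (B\otimes B)\,\mathrm{vec}(Q)$. All data and constants are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, p, eta = 10_000, 2, 32, 0.5
X = rng.standard_normal((N, d))
theta_true = np.array([1.0, -0.5])
y = X @ theta_true + rng.standard_normal(N)
sigma_y2 = sigma_theta2 = 1.0
gamma = eta / N

A = np.eye(d) / sigma_theta2 + X.T @ X / sigma_y2     # Hessian of U at theta_star (= Sigma)
theta_star = np.linalg.solve(A, X.T @ y / sigma_y2)

I = np.eye(d)
hess = X[:, :, None] * X[:, None, :] / sigma_y2        # Hessians of U_i, i >= 1, shape (N, d, d)
B = hess - hess.mean(axis=0)                           # centered Hessians from eq. (7)
K = (N / p) * np.einsum('nij,nkl->ikjl', B, B).reshape(d * d, d * d)  # sum_i B_i kron B_i

H = np.kron(A, I) + np.kron(I, A) - gamma * np.kron(A, A)   # eq. (8), vectorized
G = H - gamma * K                                            # eq. (9)

grads = (X @ theta_star - y)[:, None] * X / sigma_y2   # gradients of U_i at theta_star
gc = grads - grads.mean(axis=0)
M = gc.T @ gc                                          # eq. (14)

cov_lmc = np.linalg.solve(H, (2 * I).reshape(-1)).reshape(d, d)   # H^{-1}(2 Id)
cov_sgld = np.linalg.solve(G, (2 * I + (eta / p) * M).reshape(-1)).reshape(d, d)
print("target cov:", np.linalg.inv(A))
print("LMC cov   :", cov_lmc)     # close to the target covariance, Theta(1/N)
print("SGLD cov  :", cov_sgld)    # inflated by the (eta/p) M term, Theta(eta) bias
```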

4 Numerical experiments

Simulated data. For illustrative purposes, we consider a Bayesian logistic regression in dimension $d=2$. We simulate $N = 10^5$ covariates $\{x_i\}_{i=1}^N$ drawn from a standard 2-dimensional Gaussian distribution, and we denote by $X\in\mathbb{R}^{N\times d}$ the matrix of covariates whose $i$-th row is $x_i$. Our Bayesian regression model is specified by a Gaussian prior with zero mean and identity covariance matrix, and a likelihood given, for $y_i\in\{0,1\}$, by $p(y_i\mid x_i,\theta) = (1+\mathrm{e}^{-x_i^T\theta})^{-y_i}(1+\mathrm{e}^{x_i^T\theta})^{y_i-1}$. We simulate $N$ observations $\{y_i\}_{i=1}^N$ under this model. In this setting, H1 and H3 are satisfied, and H2 holds if the state space is compact.

To illustrate the results of Section 3.2, we consider 10 regularly spaced values of $N$ between $10^2$ and $10^5$ and truncate the dataset accordingly. We compute an estimator $\hat\theta$ of $\theta^\star$ using SGD [31] combined with the BFGS algorithm [22].


Figure 1: Distance to $\theta^\star$, $\|\bar\theta_n-\theta^\star\|$, for LMC, SGLDFP, SGLD and SGD, as a function of $N$, in logarithmic scale.

For the LMC, SGLDFP, SGLD and SGD algorithms, the step size $\gamma$ is set equal to $(1+\delta/4)^{-1}$, where $\delta$ is the largest eigenvalue of $X^TX$. We start the algorithms at $\theta_0 = \hat\theta$ and run $n = 1/\gamma$ iterations, where the first 10% of the samples are discarded as a burn-in period.

We estimate the means and covariance matrices of $\pi_{\mathrm{LMC}}$, $\pi_{\mathrm{FP}}$, $\pi_{\mathrm{SGLD}}$ and $\pi_{\mathrm{SGD}}$ by their empirical averages $\bar\theta_n = (1/n)\sum_{k=0}^{n-1}\theta_k$ and $\{1/(n-1)\}\sum_{k=0}^{n-1}(\theta_k-\bar\theta_n)^{\otimes2}$. We plot the mean and the trace of the covariance matrices for the different algorithms, averaged over 100 independent trajectories, in Figure 1 and Figure 2, in logarithmic scale.

The slope for LMC and SGLDFP is $-1$, which confirms the convergence of $\|\bar\theta_n-\theta^\star\|$ to $0$ at rate $N^{-1}$. On the other hand, we observe that $\|\bar\theta_n-\theta^\star\|$ converges to a constant for SGD and SGLD.

Covertype dataset. We then illustrate our results on the covertype dataset¹ with a Bayesian logistic regression model. The prior is a standard multivariate Gaussian distribution. Given the size of the dataset and the dimension of the problem, LMC requires high computational resources and is not included in the simulations. We truncate the training dataset at $N\in\{10^3,10^4,10^5\}$. For all algorithms, the step size $\gamma$ is set equal to $1/N$ and the trajectories are started at $\hat\theta$, an estimator of $\theta^\star$ computed using SGD combined with the BFGS algorithm.

We empirically check that the variance of the stochastic gradients scales as $N^2$ for SGD and SGLD, and as $N$ for SGLDFP; a sketch of such a check on simulated data is given below. We compute the empirical variance estimator of the gradients, average it over the dimensions, and display the result in a logarithmic plot in Figure 3. The slopes are $2$ for SGD and SGLD, and $1$ for SGLDFP.
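The following hypothetical spot-check reuses `X`, `y`, `grad_U`, `theta_bar` and `rng` from the simulated-data sketch above. At stationarity the SGLDFP chain sits at distance $O(1/\sqrt{N})$ from its recentering point, which is what the evaluation point below mimics; with that choice the SGLD/SGD gradient variance is on the $N^2$ scale while the SGLDFP one is on the $N$ scale.

```python
def fp_grad(th, idx):
    # SGLDFP stochastic gradient: same minibatch, recentered at theta_bar (our theta-hat).
    return grad_U(th, idx) - grad_U(theta_bar, idx)

theta = theta_bar + rng.standard_normal(d) / np.sqrt(N)   # ~ stationary distance from theta_bar
draws_sgld = np.stack([grad_U(theta, rng.integers(0, N, size=p)) for _ in range(500)])
draws_fp = np.stack([fp_grad(theta, rng.integers(0, N, size=p)) for _ in range(500)])
# Mean (over coordinates) empirical variances: O(N^2) for SGLD/SGD vs O(N) for SGLDFP.
print(draws_sgld.var(axis=0).mean(), draws_fp.var(axis=0).mean())
```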

On the test dataset, we also evaluate the negative log-likelihood of the three algorithms for different values of $N\in\{10^3,10^4,10^5\}$, as a function of the number of iterations. The plots are shown in Figure 4. We note that for large $N$, SGLD and SGD give very similar results, which are below the performance of SGLDFP.

¹ https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/covtype.libsvm.binary.scale.bz2


Figure 2: Trace of the covariance matrices for LMC, SGLDFP, SGLD and SGD, as a function of $N$, in logarithmic scale.

Figure 3: Variance of the stochastic gradients of SGLD, SGLDFP and SGD, as a function of $N$, in logarithmic scale.

Figure 4: Negative log-likelihood on the test dataset for SGLD, SGLDFP and SGD, as a function of the number of iterations, for $N\in\{10^3,10^4,10^5\}$.


A Proofs of Section 3.1

A.1 Proof of Lemma 1

The convergence in Wasserstein distance is classically obtained via a standard synchronous coupling [13, Proposition 2]. We prove the statement for SGLD; the adaptation to LMC, SGLDFP and SGD is immediate. Let $\gamma\in(0,2/L)$ and $\lambda_1,\lambda_2\in\mathcal{P}_2(\mathbb{R}^d)$. By [35, Theorem 4.1], there exists a couple of random variables $(\theta_0^{(1)},\theta_0^{(2)})$ such that $W_2^2(\lambda_1,\lambda_2) = \mathbb{E}\big[\|\theta_0^{(1)}-\theta_0^{(2)}\|^2\big]$. Let $(\theta_k^{(1)},\theta_k^{(2)})_{k\in\mathbb{N}}$ be the SGLD iterates starting from $\theta_0^{(1)}$ and $\theta_0^{(2)}$ respectively and driven by the same noise, i.e. for all $k\in\mathbb{N}$,

$$\theta_{k+1}^{(1)} = \theta_k^{(1)} - \gamma\Big\{\nabla U_0(\theta_k^{(1)}) + (N/p)\textstyle\sum_{i\in S_{k+1}}\nabla U_i(\theta_k^{(1)})\Big\} + \sqrt{2\gamma}\,Z_{k+1}\,,$$
$$\theta_{k+1}^{(2)} = \theta_k^{(2)} - \gamma\Big\{\nabla U_0(\theta_k^{(2)}) + (N/p)\textstyle\sum_{i\in S_{k+1}}\nabla U_i(\theta_k^{(2)})\Big\} + \sqrt{2\gamma}\,Z_{k+1}\,,$$

where $(Z_k)_{k\ge1}$ is an i.i.d. sequence of standard Gaussian variables and $(S_k)_{k\ge1}$ an i.i.d. sequence of subsamples of $\{1,\dots,N\}$ of size $p$. Denote by $(\mathcal{F}_k)_{k\in\mathbb{N}}$ the filtration associated with $(\theta_k^{(1)},\theta_k^{(2)})_{k\in\mathbb{N}}$. We have, for $k\in\mathbb{N}$,

$$\big\|\theta_{k+1}^{(1)}-\theta_{k+1}^{(2)}\big\|^2 = \big\|\theta_k^{(1)}-\theta_k^{(2)}\big\|^2 + \gamma^2\Big\|\nabla U_0(\theta_k^{(1)}) + \frac{N}{p}\sum_{i\in S_{k+1}}\nabla U_i(\theta_k^{(1)}) - \nabla U_0(\theta_k^{(2)}) - \frac{N}{p}\sum_{i\in S_{k+1}}\nabla U_i(\theta_k^{(2)})\Big\|^2$$
$$\qquad - 2\gamma\Big\langle\theta_k^{(1)}-\theta_k^{(2)},\ \nabla U_0(\theta_k^{(1)}) + \frac{N}{p}\sum_{i\in S_{k+1}}\nabla U_i(\theta_k^{(1)}) - \nabla U_0(\theta_k^{(2)}) - \frac{N}{p}\sum_{i\in S_{k+1}}\nabla U_i(\theta_k^{(2)})\Big\rangle\,.$$

By H1 and H3, $\theta\mapsto\nabla U_0(\theta) + (N/p)\sum_{i\in S}\nabla U_i(\theta)$ is $\mathbb{P}$-a.s. $L$-co-coercive [38]. Taking the conditional expectation w.r.t. $\mathcal{F}_k$, we obtain

$$\mathbb{E}\Big[\big\|\theta_{k+1}^{(1)}-\theta_{k+1}^{(2)}\big\|^2\,\Big|\,\mathcal{F}_k\Big] \le \big\|\theta_k^{(1)}-\theta_k^{(2)}\big\|^2 - 2\gamma\{1-(\gamma L)/2\}\Big\langle\theta_k^{(1)}-\theta_k^{(2)},\ \nabla U(\theta_k^{(1)})-\nabla U(\theta_k^{(2)})\Big\rangle\,,$$

and by H2,

$$\mathbb{E}\Big[\big\|\theta_{k+1}^{(1)}-\theta_{k+1}^{(2)}\big\|^2\,\Big|\,\mathcal{F}_k\Big] \le \{1-2m\gamma(1-(\gamma L)/2)\}\,\big\|\theta_k^{(1)}-\theta_k^{(2)}\big\|^2\,.$$

Since for all $k\ge0$, the law of $(\theta_k^{(1)},\theta_k^{(2)})$ belongs to $\Pi(\lambda_1R_{\mathrm{SGLD}}^k,\lambda_2R_{\mathrm{SGLD}}^k)$, we get by a straightforward induction

$$W_2^2(\lambda_1R_{\mathrm{SGLD}}^k,\lambda_2R_{\mathrm{SGLD}}^k) \le \mathbb{E}\Big[\big\|\theta_k^{(1)}-\theta_k^{(2)}\big\|^2\Big] \le \{1-2m\gamma(1-(\gamma L)/2)\}^k\,W_2^2(\lambda_1,\lambda_2)\,. \qquad (15)$$

By H1, $\lambda_1R_{\mathrm{SGLD}}\in\mathcal{P}_2(\mathbb{R}^d)$ and, taking $\lambda_2 = \lambda_1R_{\mathrm{SGLD}}$, we get $\sum_{k=0}^{+\infty}W_2^2(\lambda_1R_{\mathrm{SGLD}}^k,\lambda_1R_{\mathrm{SGLD}}^{k+1}) < +\infty$. By [35, Theorem 6.16], $\mathcal{P}_2(\mathbb{R}^d)$ endowed with $W_2$ is a Polish space; $(\lambda_1R_{\mathrm{SGLD}}^k)_{k\ge0}$ is a Cauchy sequence and converges to a limit $\pi_{\mathrm{SGLD}}^{\lambda_1}\in\mathcal{P}_2(\mathbb{R}^d)$. The limit $\pi_{\mathrm{SGLD}}^{\lambda_1}$ does not depend on $\lambda_1$ because, given $\lambda_2\in\mathcal{P}_2(\mathbb{R}^d)$, by the triangle inequality

$$W_2(\pi_{\mathrm{SGLD}}^{\lambda_1},\pi_{\mathrm{SGLD}}^{\lambda_2}) \le W_2(\pi_{\mathrm{SGLD}}^{\lambda_1},\lambda_1R_{\mathrm{SGLD}}^k) + W_2(\lambda_1R_{\mathrm{SGLD}}^k,\lambda_2R_{\mathrm{SGLD}}^k) + W_2(\pi_{\mathrm{SGLD}}^{\lambda_2},\lambda_2R_{\mathrm{SGLD}}^k)\,.$$

Taking the limit $k\to+\infty$, we get $W_2(\pi_{\mathrm{SGLD}}^{\lambda_1},\pi_{\mathrm{SGLD}}^{\lambda_2}) = 0$. The limit is thus the same for all initial distributions and is denoted $\pi_{\mathrm{SGLD}}$. $\pi_{\mathrm{SGLD}}$ is invariant for $R_{\mathrm{SGLD}}$ since we have, for all $k\in\mathbb{N}$,

$$W_2(\pi_{\mathrm{SGLD}},\pi_{\mathrm{SGLD}}R_{\mathrm{SGLD}}) \le W_2(\pi_{\mathrm{SGLD}},\pi_{\mathrm{SGLD}}R_{\mathrm{SGLD}}^k) + W_2(\pi_{\mathrm{SGLD}}R_{\mathrm{SGLD}},\pi_{\mathrm{SGLD}}R_{\mathrm{SGLD}}^k)\,.$$

Taking the limit $k\to+\infty$, we obtain $W_2(\pi_{\mathrm{SGLD}},\pi_{\mathrm{SGLD}}R_{\mathrm{SGLD}}) = 0$. Using (15), $\pi_{\mathrm{SGLD}}$ is the unique invariant probability measure for $R_{\mathrm{SGLD}}$.

A.2 Proof of Theorem 2

Proof of i). Let $\gamma\in(0,1/L]$ and $\lambda_1,\lambda_2\in\mathcal{P}_2(\mathbb{R}^d)$. By [35, Theorem 4.1], there exists a couple of random variables $(\theta_0,\vartheta_0)$ such that $W_2^2(\lambda_1,\lambda_2) = \mathbb{E}\big[\|\theta_0-\vartheta_0\|^2\big]$. Let $(\theta_k,\vartheta_k)_{k\in\mathbb{N}}$ be the LMC and SGLDFP iterates starting from $\theta_0$ and $\vartheta_0$ respectively and driven by the same noise, i.e. for all $k\in\mathbb{N}$,

$$\theta_{k+1} = \theta_k - \gamma\,\nabla U(\theta_k) + \sqrt{2\gamma}\,Z_{k+1}\,,$$
$$\vartheta_{k+1} = \vartheta_k - \gamma\Big\{\nabla U_0(\vartheta_k) - \nabla U_0(\theta^\star) + (N/p)\textstyle\sum_{i\in S_{k+1}}\{\nabla U_i(\vartheta_k)-\nabla U_i(\theta^\star)\}\Big\} + \sqrt{2\gamma}\,Z_{k+1}\,,$$

where $(Z_k)_{k\ge1}$ is an i.i.d. sequence of standard Gaussian variables and $(S_k)_{k\ge1}$ an i.i.d. sequence of subsamples of $\{1,\dots,N\}$, drawn with replacement, of size $p$. Denote by $(\mathcal{F}_k)_{k\in\mathbb{N}}$ the filtration associated with $(\theta_k,\vartheta_k)_{k\in\mathbb{N}}$. We have, for $k\in\mathbb{N}$,

$$\mathbb{E}\big[\|\theta_{k+1}-\vartheta_{k+1}\|^2\,\big|\,\mathcal{F}_k\big] = \|\theta_k-\vartheta_k\|^2 - 2\gamma\langle\theta_k-\vartheta_k,\,\nabla U(\theta_k)-\nabla U(\vartheta_k)\rangle + \gamma^2A\,, \qquad (16)$$

where

$$A = \mathbb{E}\bigg[\Big\|\nabla U(\theta_k) - \Big\{\nabla U_0(\vartheta_k) - \nabla U_0(\theta^\star) + (N/p)\sum_{i\in S_{k+1}}\{\nabla U_i(\vartheta_k)-\nabla U_i(\theta^\star)\}\Big\}\Big\|^2\,\bigg|\,\mathcal{F}_k\bigg] = A_1 + A_2\,,$$
$$A_1 = \|\nabla U(\theta_k)-\nabla U(\vartheta_k)\|^2\,,$$
$$A_2 = \mathbb{E}\bigg[\Big\|\nabla U(\vartheta_k) - \Big\{\nabla U_0(\vartheta_k) - \nabla U_0(\theta^\star) + (N/p)\sum_{i\in S_{k+1}}\{\nabla U_i(\vartheta_k)-\nabla U_i(\theta^\star)\}\Big\}\Big\|^2\,\bigg|\,\mathcal{F}_k\bigg]\,.$$

Denote by $W$ the random variable equal to $\nabla U_i(\vartheta_k) - \nabla U_i(\theta^\star) - (1/N)\sum_{j=1}^N\{\nabla U_j(\vartheta_k)-\nabla U_j(\theta^\star)\}$ for $i\in\{1,\dots,N\}$ with probability $1/N$. By H1 and using the fact that the subsamples $(S_k)_{k\ge1}$ are drawn with replacement, we obtain

$$A_2 = (N^2/p)\,\mathbb{E}\big[\|W\|^2\,\big|\,\mathcal{F}_k\big] \le (L^2/p)\,\|\vartheta_k-\theta^\star\|^2\,.$$

Combining this with (16), and using the $L$-co-coercivity of $\nabla U$ under H1 and H2, we get

$$\mathbb{E}\big[\|\theta_{k+1}-\vartheta_{k+1}\|^2\,\big|\,\mathcal{F}_k\big] \le (1-m\gamma)\,\|\theta_k-\vartheta_k\|^2 + \{(L^2\gamma^2)/p\}\,\|\vartheta_k-\theta^\star\|^2\,.$$

Iterating and using Lemma 8-i), we have for $n\in\mathbb{N}$

$$W_2^2(\lambda_1R_{\mathrm{LMC}}^n,\lambda_2R_{\mathrm{FP}}^n) \le \mathbb{E}\big[\|\theta_n-\vartheta_n\|^2\big] \le (1-m\gamma)^n\,W_2^2(\lambda_1,\lambda_2) + \frac{L^2\gamma^2}{p}\sum_{k=0}^{n-1}(1-m\gamma)^{n-1-k}\,\mathbb{E}\big[\|\vartheta_k-\theta^\star\|^2\big]$$
$$\le (1-m\gamma)^n\,W_2^2(\lambda_1,\lambda_2) + \frac{L^2\gamma^2}{p}\,n(1-m\gamma)^{n-1}\int_{\mathbb{R}^d}\|\vartheta-\theta^\star\|^2\,\lambda_2(\mathrm{d}\vartheta) + \frac{2L^2\gamma d}{pm^2}\,.$$

Proof of ii). Denote by $\kappa = (2mL)/(m+L)$. By H1, H2 and [16, Theorem 5], we have for all $n\in\mathbb{N}$,

$$W_2^2(\lambda_1P_{n\gamma},\lambda_2R_{\mathrm{LMC}}^n) \le 2(1-\kappa\gamma/2)^n\,W_2^2(\lambda_1,\lambda_2) + \frac{2L^2\gamma}{\kappa}(\kappa^{-1}+\gamma)\Big(2d + \frac{dL^2\gamma^2}{6}\Big) + L^4\gamma^3(\kappa^{-1}+\gamma)\sum_{k=1}^n\delta_k\{1-\kappa\gamma/2\}^{n-k}\,,$$

where for all $k\in\{1,\dots,n\}$,

$$\delta_k \le \mathrm{e}^{-2m(k-1)\gamma}\int_{\mathbb{R}^d}\|\vartheta-\theta^\star\|^2\,\lambda_1(\mathrm{d}\vartheta) + d/m\,.$$

We get the result by straightforward simplifications and using $\gamma\le1/L$.

Proof of iii). Let $\gamma\in(0,1/L]$ and $\lambda_1,\lambda_2\in\mathcal{P}_2(\mathbb{R}^d)$. By [35, Theorem 4.1], there exists a couple of random variables $(\theta_0,\vartheta_0)$ such that $W_2^2(\lambda_1,\lambda_2) = \mathbb{E}\big[\|\theta_0-\vartheta_0\|^2\big]$. Let $(\theta_k,\vartheta_k)_{k\in\mathbb{N}}$ be the SGLD and SGD iterates starting from $\theta_0$ and $\vartheta_0$ respectively and driven by the same noise, i.e. for all $k\in\mathbb{N}$,

$$\theta_{k+1} = \theta_k - \gamma\Big\{\nabla U_0(\theta_k) + (N/p)\textstyle\sum_{i\in S_{k+1}}\nabla U_i(\theta_k)\Big\} + \sqrt{2\gamma}\,Z_{k+1}\,,$$
$$\vartheta_{k+1} = \vartheta_k - \gamma\Big\{\nabla U_0(\vartheta_k) + (N/p)\textstyle\sum_{i\in S_{k+1}}\nabla U_i(\vartheta_k)\Big\}\,,$$

where $(Z_k)_{k\ge1}$ is an i.i.d. sequence of standard Gaussian variables and $(S_k)_{k\ge1}$ an i.i.d. sequence of subsamples of $\{1,\dots,N\}$, drawn with replacement, of size $p$. Denote by $(\mathcal{F}_k)_{k\in\mathbb{N}}$ the filtration associated with $(\theta_k,\vartheta_k)_{k\in\mathbb{N}}$. We have, for $k\in\mathbb{N}$,

$$\mathbb{E}\big[\|\theta_{k+1}-\vartheta_{k+1}\|^2\,\big|\,\mathcal{F}_k\big] = \|\theta_k-\vartheta_k\|^2 - 2\gamma\langle\theta_k-\vartheta_k,\,\nabla U(\theta_k)-\nabla U(\vartheta_k)\rangle + 2\gamma d$$
$$\qquad + \gamma^2\,\mathbb{E}\bigg[\Big\|\nabla U_0(\theta_k) + (N/p)\sum_{i\in S_{k+1}}\nabla U_i(\theta_k) - \nabla U_0(\vartheta_k) - (N/p)\sum_{i\in S_{k+1}}\nabla U_i(\vartheta_k)\Big\|^2\,\bigg|\,\mathcal{F}_k\bigg]\,.$$

By H1 and H3, $\theta\mapsto\nabla U_0(\theta) + (N/p)\sum_{i\in S}\nabla U_i(\theta)$ is $\mathbb{P}$-a.s. $L$-co-coercive, and we obtain

$$\mathbb{E}\big[\|\theta_{k+1}-\vartheta_{k+1}\|^2\,\big|\,\mathcal{F}_k\big] \le \{1-2m\gamma(1-\gamma L/2)\}\,\|\theta_k-\vartheta_k\|^2 + 2\gamma d\,,$$

which concludes the proof by a straightforward induction.

A.3 Proof of Theorem 4

Proof of i). Let $\gamma\in\big(0,\,\Sigma^{-1}\{1+N/(p\sum_{i=1}^N x_i^2)\}^{-1}\big]$, let $(\theta_k)_{k\in\mathbb{N}}$ be the iterates of SGLD (3) started at $\theta^\star$, and let $(\mathcal{F}_k)_{k\in\mathbb{N}}$ be the associated filtration. For all $k\in\mathbb{N}$, $\mathbb{E}[\theta_k] = \theta^\star$. The variance of $\theta_k$ satisfies the following recursion: for $k\in\mathbb{N}$,

$$\mathbb{E}\big[(\theta_{k+1}-\theta^\star)^2\,\big|\,\mathcal{F}_k\big] = \mathbb{E}\Big[\big\{\theta_k-\theta^\star - \gamma\big(\Sigma(\theta_k-\theta^\star) + \rho(S_{k+1})(\theta_k-\theta^\star) + \xi(S_{k+1})\big) + \sqrt{2\gamma}\,Z_{k+1}\big\}^2\,\Big|\,\mathcal{F}_k\Big] = \mu\,(\theta_k-\theta^\star)^2 + 2\gamma + \gamma^2A\,,$$

where

$$\mu = \mathbb{E}\Bigg[\bigg\{1-\gamma\Big(\frac{1}{\sigma_\theta^2} + \frac{N}{\sigma_y^2 p}\sum_{i\in S}x_i^2\Big)\bigg\}^2\Bigg]\,, \qquad A = \mathbb{E}\Bigg[\bigg\{\frac{\theta^\star}{\sigma_\theta^2} + \frac{N}{\sigma_y^2 p}\sum_{i\in S}(x_i\theta^\star - y_i)x_i\bigg\}^2\Bigg]\,.$$

We have, for $\mu$,

$$\mu = 1 - 2\gamma\Sigma + \gamma^2\,\mathbb{E}\Bigg[\bigg\{\frac{N}{\sigma_y^2 p}\sum_{i\in S}x_i^2 - \frac{1}{\sigma_y^2}\sum_{i=1}^N x_i^2\bigg\}^2\Bigg] + \gamma^2\Sigma^2 = 1 - 2\gamma\Sigma + \gamma^2\Bigg(\Sigma^2 + \frac{N}{\sigma_y^4 p}\sum_{i=1}^N\Big(x_i^2 - \frac1N\sum_{j=1}^N x_j^2\Big)^2\Bigg) \le 1-\gamma\Sigma\,,$$

and, for $A$,

$$A = \frac{N}{p}\sum_{i=1}^N\bigg(\frac{(x_i\theta^\star - y_i)x_i}{\sigma_y^2} + \frac{\theta^\star}{N\sigma_\theta^2}\bigg)^2\,.$$

By a straightforward induction, we obtain that the variance of the $n$-th iterate of SGLD started at $\theta^\star$ is, for $n\in\mathbb{N}$,

$$\int_{\mathbb{R}}(\theta-\theta^\star)^2\,R_{\mathrm{SGLD}}^n(\theta^\star,\mathrm{d}\theta) = \frac{1-\mu^n}{1-\mu}\,2\gamma + \frac{1-\mu^n}{1-\mu}\,\frac{N\gamma^2}{p}\sum_{i=1}^N\bigg(\frac{(x_i\theta^\star - y_i)x_i}{\sigma_y^2} + \frac{\theta^\star}{N\sigma_\theta^2}\bigg)^2\,.$$

For SGLDFP, the additive part of the noise in the stochastic gradient disappears, and we obtain similarly, for $n\in\mathbb{N}$,

$$\int_{\mathbb{R}}(\theta-\theta^\star)^2\,R_{\mathrm{FP}}^n(\theta^\star,\mathrm{d}\theta) = \frac{1-\mu^n}{1-\mu}\,2\gamma\,.$$

To conclude, we use that for two probability measures with given mean and covariance matrices, the Wasserstein distance between the two Gaussians with these respective parameters is a lower bound for the Wasserstein distance between the two measures [18, Theorem 2.1].

The proof of ii) is straightforward.
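For completeness (a standard fact, see e.g. [18], added here as a worked example): in dimension 1,

$$W_2^2\big(\mathcal{N}(m_1,\sigma_1^2),\,\mathcal{N}(m_2,\sigma_2^2)\big) = (m_1-m_2)^2 + (\sigma_1-\sigma_2)^2\,,$$

so, the two $n$-step laws having the same mean $\theta^\star$, $W_2\big(R_{\mathrm{SGLD}}^n(\theta^\star,\cdot),R_{\mathrm{FP}}^n(\theta^\star,\cdot)\big)$ is bounded below by the difference of the two standard deviations computed above, which yields i).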

B Proofs of Section 3.2

B.1 Proof of Proposition 5

Let $\theta$ be distributed according to $\pi$. By H2, for all $\vartheta\in\mathbb{R}^d$, $U(\vartheta) \ge U(\theta^\star) + (m/2)\|\vartheta-\theta^\star\|^2$, and $\mathbb{E}[\nabla U(\theta)] = 0$. By a Taylor expansion of $\nabla U$ around $\theta^\star$, we obtain

$$0 = \mathbb{E}[\nabla U(\theta)] = \nabla^2U(\theta^\star)\,(\mathbb{E}[\theta]-\theta^\star) + (1/2)\,\mathrm{D}^3U(\theta^\star)\big[\mathbb{E}\big[(\theta-\theta^\star)^{\otimes2}\big]\big] + \mathbb{E}[R_1(\theta)]\,,$$

where by H1, $R_1:\mathbb{R}^d\to\mathbb{R}^d$ satisfies

$$\sup_{\vartheta\in\mathbb{R}^d}\big\{\|R_1(\vartheta)\|/\|\vartheta-\theta^\star\|^3\big\} \le L/6\,. \qquad (17)$$

Rearranging the terms, we get

$$\mathbb{E}[\theta]-\theta^\star = -(1/2)\,\nabla^2U(\theta^\star)^{-1}\mathrm{D}^3U(\theta^\star)\big[\mathbb{E}\big[(\theta-\theta^\star)^{\otimes2}\big]\big] - \nabla^2U(\theta^\star)^{-1}\mathbb{E}[R_1(\theta)]\,.$$

To estimate the covariance matrix of $\pi$ around $\theta^\star$, we start again from the Taylor expansion of $\nabla U$ around $\theta^\star$ and obtain

$$\mathbb{E}\big[\nabla U(\theta)^{\otimes2}\big] = \mathbb{E}\Big[\big\{\nabla^2U(\theta^\star)(\theta-\theta^\star) + R_2(\theta)\big\}^{\otimes2}\Big] = \nabla^2U(\theta^\star)^{\otimes2}\,\mathbb{E}\big[(\theta-\theta^\star)^{\otimes2}\big] + \mathbb{E}[R_3(\theta)]\,, \qquad (18)$$

where by H1, $R_2:\mathbb{R}^d\to\mathbb{R}^d$ satisfies

$$\sup_{\vartheta\in\mathbb{R}^d}\big\{\|R_2(\vartheta)\|/\|\vartheta-\theta^\star\|^2\big\} \le L/2\,, \qquad (19)$$

and $R_3:\mathbb{R}^d\to\mathbb{R}^{d\times d}$ is defined for all $\vartheta\in\mathbb{R}^d$ by

$$R_3(\vartheta) = \nabla^2U(\theta^\star)(\vartheta-\theta^\star)\otimes R_2(\vartheta) + R_2(\vartheta)\otimes\nabla^2U(\theta^\star)(\vartheta-\theta^\star) + R_2(\vartheta)^{\otimes2}\,. \qquad (20)$$

$\mathbb{E}[\nabla U(\theta)^{\otimes2}]$ is the Fisher information matrix and, by a Taylor expansion of $\nabla^2U$ around $\theta^\star$ and an integration by parts,

$$\mathbb{E}\big[\nabla U(\theta)^{\otimes2}\big] = \mathbb{E}\big[\nabla^2U(\theta)\big] = \nabla^2U(\theta^\star) + \mathbb{E}[R_4(\theta)]\,,$$

where by H1, $R_4:\mathbb{R}^d\to\mathbb{R}^{d\times d}$ satisfies

$$\sup_{\vartheta\in\mathbb{R}^d}\big\{\|R_4(\vartheta)\|/\|\vartheta-\theta^\star\|\big\} \le L\,. \qquad (21)$$

Combining this result, (17), (18), (19), (20), (21) and $\mathbb{E}\big[\|\theta-\theta^\star\|^4\big] \le d(d+2)/m^2$ by [5, Lemma 9] concludes the proof.
