
HAL Id: hal-01428950

https://hal.archives-ouvertes.fr/hal-01428950

Submitted on 6 Jan 2017


Sampling from a log-concave distribution with Projected Langevin Monte Carlo

Sébastien Bubeck, Ronen Eldan, Joseph Lehec

To cite this version:

Sébastien Bubeck, Ronen Eldan, Joseph Lehec. Sampling from a log-concave distribution with Projected Langevin Monte Carlo. Discrete and Computational Geometry, Springer Verlag, 2018, 59 (4), pp. 757–783. doi:10.1007/s00454-018-9992-1. hal-01428950.


Sampling from a log-concave distribution with Projected Langevin Monte Carlo

Sébastien Bubeck, Ronen Eldan, Joseph Lehec

August 8, 2016

Abstract

We extend the Langevin Monte Carlo (LMC) algorithm to compactly supported measures via a projection step, akin to projected Stochastic Gradient Descent (SGD). We show that (projected) LMC makes it possible to sample in polynomial time from a log-concave distribution with smooth potential. This gives a new Markov chain to sample from a log-concave distribution.

Our main result shows in particular that when the target distribution is uniform, LMC mixes in $\tilde{O}(n^7)$ steps (where $n$ is the dimension). We also provide preliminary experimental evidence that LMC performs at least as well as hit-and-run, for which a better mixing time of $\tilde{O}(n^4)$ was proved by Lovász and Vempala.

1 Introduction

Let $K \subset \mathbb{R}^n$ be a convex body such that $0 \in K$, $K$ contains a Euclidean ball of radius $r$, and $K$ is contained in a Euclidean ball of radius $R$. Denote $\mathcal{P}_K$ for the Euclidean projection on $K$. Let $f : K \to \mathbb{R}$ be an $L$-Lipschitz and $\beta$-smooth convex function, that is, $f$ is differentiable and satisfies
$$\forall x, y \in K, \quad |\nabla f(x) - \nabla f(y)| \le \beta\, |x - y|, \quad \text{and} \quad |\nabla f(x)| \le L.$$
We are interested in the problem of sampling from the probability measure $\mu$ on $\mathbb{R}^n$ whose density with respect to the Lebesgue measure is given by:
$$\frac{d\mu}{dx} = \frac{1}{Z} \exp(-f(x))\, 1\{x \in K\}, \quad \text{where} \quad Z = \int_{y \in K} \exp(-f(y))\, dy.$$
In this paper we study the following Markov chain, which depends on a parameter $\eta > 0$, and where $\xi_1, \xi_2, \ldots$ is an i.i.d. sequence of standard Gaussian random variables in $\mathbb{R}^n$:
$$X_{k+1} = \mathcal{P}_K\Big( X_k - \frac{\eta}{2} \nabla f(X_k) + \sqrt{\eta}\, \xi_k \Big), \tag{1}$$
with $X_0 = 0$.
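To make the chain concrete, here is a minimal Python sketch of (1) for the special case where $K$ is a centered Euclidean ball and $f(x) = |x|^2/2$ (so $\nabla f(x) = x$); the body, potential, step size and iteration count are all our illustrative choices, not prescriptions from the paper.

```python
import numpy as np

def proj_ball(x, radius=1.0):
    """Euclidean projection P_K onto the centered ball of the given radius."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def projected_lmc(grad_f, n, n_steps, eta, radius=1.0, seed=0):
    """Run X_{k+1} = P_K(X_k - (eta/2) grad f(X_k) + sqrt(eta) xi_k), X_0 = 0."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    for _ in range(n_steps):
        xi = rng.standard_normal(n)  # standard Gaussian in R^n
        x = proj_ball(x - 0.5 * eta * grad_f(x) + np.sqrt(eta) * xi, radius)
    return x

# Example: f(x) = |x|^2 / 2 (so grad f(x) = x) restricted to the unit ball in R^5.
sample = projected_lmc(grad_f=lambda x: x, n=5, n_steps=2000, eta=0.01)
```

By construction every iterate lies in $K$, which is the point of the projection step.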

Microsoft Research; sebubeck@microsoft.com.

Weizmann Institute; roneneldan@gmail.com.

Université Paris-Dauphine; lehec@ceremade.dauphine.fr.

arXiv:1507.02564v1 [math.PR] 9 Jul 2015


Recall that the total variation distance between two measures $\mu, \nu$ is defined as $\mathrm{TV}(\mu, \nu) = \sup_A |\mu(A) - \nu(A)|$, where the supremum is over all measurable sets $A$. With a slight abuse of notation we sometimes write $\mathrm{TV}(X, \nu)$, where $X$ is a random variable distributed according to $\mu$. The notation $v_n = \tilde{O}(u_n)$ (respectively $\tilde{\Omega}$) means that there exist $c \in \mathbb{R}$, $C > 0$ such that $v_n \le C u_n \log^c(u_n)$ (respectively $\ge$). We also say $v_n = \tilde{\Theta}(u_n)$ if one has both $v_n = \tilde{O}(u_n)$ and $v_n = \tilde{\Omega}(u_n)$. Our main result is the following:

Theorem 1 Assume that $r = 1$ and let $\varepsilon > 0$. Then one has $\mathrm{TV}(X_N, \mu) \le \varepsilon$ provided that $\eta = \tilde{\Theta}(R^2/N)$ and that $N$ satisfies the following: if $\mu$ is uniform then
$$N = \tilde{\Omega}\Big( \frac{R^6 n^7}{\varepsilon^8} \Big),$$
and otherwise
$$N = \tilde{\Omega}\Big( \frac{R^6 \max(n, RL, R\beta)^{12}}{\varepsilon^{12}} \Big).$$

1.1 Context and related works

There is a long line of works in theoretical computer science proving results similar to Theorem 1, starting with the breakthrough result of Dyer et al. [1991], who showed that the lattice walk mixes in $\tilde{O}(n^{23})$ steps. The current record for the mixing time is obtained by Lovász and Vempala [2007], who show a bound of $\tilde{O}(n^4)$ for the hit-and-run walk. These chains (as well as other popular chains such as the ball walk or the Dikin walk, see e.g. Kannan and Narayanan [2012] and references therein) all require a zeroth-order oracle for the potential $f$, that is, given $x$ one can calculate the value $f(x)$. On the other hand our proposed chain (1) works with a first-order oracle, that is, given $x$ one can calculate the value of $\nabla f(x)$. The difference between zeroth-order and first-order oracles has been extensively studied in the optimization literature (e.g., Nemirovski and Yudin [1983]), but it has been largely ignored in the literature on polynomial-time sampling algorithms. We also note that hit-and-run and LMC are the only chains which are rapidly mixing from any starting point (see Lovász and Vempala [2006]), though they have this property for seemingly very different reasons. When initialized in a corner of the convex body, hit-and-run might take a long time to take a step, but once it moves it escapes very far (while a chain such as the ball walk would only do a small step). On the other hand LMC keeps moving at every step, even when initialized in a corner, thanks to the projection part of (1).

Our main motivation to study the chain (1) stems from its connection with the ubiquitous stochastic gradient descent (SGD) algorithm. In general this algorithm takes the form $x_{k+1} = \mathcal{P}_K(x_k - \eta \nabla f(x_k) + \varepsilon_k)$ where $\varepsilon_1, \varepsilon_2, \ldots$ is a centered i.i.d. sequence. Standard results in approximation theory, such as Robbins and Monro [1951], show that if the variance of the noise $\mathrm{Var}(\varepsilon_1)$ is of smaller order than the step-size $\eta$ then the iterates $(x_k)$ converge to the minimum of $f$ on $K$ (for a step-size decreasing sufficiently fast as a function of the number of iterations).

For the specific noise sequence that we study in (1), the variance is exactly equal to the step-size, which is why the chain deviates from its standard and well-understood behavior. We also note that other regimes where SGD does not converge to the minimum of $f$ have been studied in the optimization literature, such as the constant step-size case investigated in Pflug [1986], Bach and Moulines [2013].

The chain (1) is also closely related to a line of works in Bayesian statistics on Langevin Monte Carlo algorithms, starting essentially with Tweedie and Roberts [1996]. The focus there is on the unconstrained case, that is $K = \mathbb{R}^n$. In this simpler situation, a variant of Theorem 1 was proven in the recent paper Dalalyan [2014]. The latter result is the starting point of our work. A straightforward way to extend the analysis of Dalalyan to the constrained case is to run the unconstrained chain with an additional potential that diverges quickly as the distance from $x$ to $K$ increases. However it seems much more natural to study directly the chain (1). Unfortunately the techniques used in Dalalyan [2014] cannot deal with the singularities in the diffusion process which are introduced by the projection. As we explain in Section 1.2, our main contribution is to develop the appropriate machinery to study (1).

In the machine learning literature it was recently observed that Langevin Monte Carlo algorithms are particularly well-suited for large-scale applications because of the close connection to SGD. For instance Welling and Teh [2011] suggest to use mini-batches to compute approximate gradients instead of exact gradients in (1), and they call the resulting algorithm SGLD (Stochastic Gradient Langevin Dynamics). It is conceivable that the techniques developed in this paper could be used to analyze SGLD and its refinements introduced in Ahn et al. [2012]. We leave this as an open problem for future work. Another interesting direction for future work is to improve the polynomial dependency on the dimension and the inverse accuracy in Theorem 1 (our main goal here was to provide the simplest polynomial-time analysis).

1.2 Contribution and paper organization

As we pointed out above, Dalalyan [2014] proves the equivalent of Theorem 1 in the unconstrained case. His elegant approach is based on viewing LMC as a discretization of the diffusion process $dX_t = dW_t - \frac12 \nabla f(X_t)\, dt$, where $(W_t)$ is a Brownian motion. The analysis then proceeds in two steps, by deriving first the mixing time of the diffusion process, and then showing that the discretized process is 'close' to its continuous version. In Dalalyan [2014] the first step is particularly clean as he assumes $\alpha$-strong convexity for the potential, which in turn directly gives a mixing time of order $1/\alpha$. The second step is also rather simple once one realizes that LMC can be viewed as the diffusion process $d\overline{X}_t = dW_t - \frac12 \nabla f(\overline{X}_{\eta \lfloor t/\eta \rfloor})\, dt$. Using Pinsker's inequality and Girsanov's formula it is then a short calculation to show that the total variation distance between $X_t$ and $\overline{X}_t$ is small.

The constrained case presents several challenges, arising from the reflection of the diffusion process on the boundary of $K$, and from the lack of curvature in the potential (indeed the constant potential case is particularly important for us as it corresponds to $\mu$ being the uniform distribution on $K$). Rather than a simple Brownian motion with drift, LMC with projection can be viewed as the discretization of a reflected Brownian motion with drift, which is a process of the form $dX_t = dW_t - \frac12 \nabla f(X_t)\, dt - \nu_t\, L(dt)$, where $X_t \in K$ for all $t \ge 0$, $L$ is a measure supported on $\{t \ge 0 : X_t \in \partial K\}$, and $\nu_t$ is an outer unit normal vector of $K$ at $X_t$. The term $\nu_t L(dt)$ is referred to as the Tanaka drift. Following Dalalyan [2014] the analysis is again decomposed in two steps. We study the mixing time of the continuous process via a simple coupling argument, which crucially uses the convexity of $K$ and of the potential $f$. The main difficulty is in showing that the discretized process $(\overline{X}_t)$ is close to the continuous version $(X_t)$, as the Tanaka drift prevents us from a straightforward application of Girsanov's formula. Our approach around this issue is to first use a geometric argument to prove that the two processes are close in Wasserstein distance, and then to show that in fact for a reflected Brownian motion with drift one can deduce a total variation bound from a Wasserstein bound.

The paper is organized as follows. We start in Section 2 by proving Theorem 1 for the case of a uniform distribution. We first remind the reader of Tanaka's construction (Tanaka [1979]) of reflected Brownian motion in Subsection 2.1. We present our geometric argument to bound the Wasserstein distance between $(\overline{X}_t)$ and $(X_t)$ in Subsection 2.2, and we use our coupling argument to bound the mixing time of $(X_t)$ in Subsection 2.3. Then in Subsection 2.4 we use properties of reflected Brownian motion to show that one can obtain a total variation bound from the Wasserstein bound of Subsection 2.2. We conclude the proof of the first part of Theorem 1 in Subsection 2.5. In Section 3 we generalize these arguments to an arbitrary smooth potential. Finally we conclude the paper in Section 4 with some preliminary experimental comparison between LMC and hit-and-run.

2 The constant potential case

In this section we prove Theorem 1 for the case where $\mu$ is uniform, that is $\nabla f = 0$. First we introduce some useful notation. For a point $x \in \partial K$ we say that $\nu$ is an outer unit normal vector at $x$ if $|\nu| = 1$ and
$$\langle x - x', \nu \rangle \ge 0, \quad \forall x' \in K.$$
For $x \notin \partial K$ we say that $0$ is an outer unit normal at $x$. Let $\|\cdot\|_K$ be the gauge of $K$, defined by
$$\|x\|_K = \inf\{t \ge 0;\ x \in tK\}, \quad x \in \mathbb{R}^n,$$
and $h_K$ the support function of $K$, by
$$h_K(y) = \sup\{\langle x, y \rangle;\ x \in K\}, \quad y \in \mathbb{R}^n.$$
Note that $h_K$ is also the gauge function of the polar body of $K$. Finally we denote $m = \int |x|\, \mu(dx)$, and $M = \mathbb{E}[\|\theta\|_K]$, where $\theta$ is uniform on the sphere $S^{n-1}$.
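For intuition, here is a sketch of the gauge and support function for the concrete choice $K = [-1,1]^n$ (our example; for a general convex body these quantities are usually not available in closed form), together with the duality inequality $\langle x, y \rangle \le \|x\|_K\, h_K(y)$ that is used later in the proof of Proposition 1.

```python
import numpy as np

def gauge_cube(x):
    """Gauge of K = [-1,1]^n: ||x||_K = inf{t >= 0 : x in tK} = max_i |x_i|."""
    return float(np.max(np.abs(x)))

def support_cube(y):
    """Support function of K = [-1,1]^n: h_K(y) = sup_{x in K} <x,y> = sum_i |y_i|."""
    return float(np.sum(np.abs(y)))

x = np.array([0.5, -2.0, 1.0])
y = np.array([1.0, 3.0, -2.0])
# The duality inequality <x,y> <= ||x||_K h_K(y) holds for any x, y.
assert float(x @ y) <= gauge_cube(x) * support_cube(y)
```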

2.1 The Skorokhod problem

Let $T \in \mathbb{R}_+ \cup \{+\infty\}$ and let $w : [0, T) \to \mathbb{R}^n$ be a piecewise continuous path with $w(0) \in K$. We say that $x : [0, T) \to \mathbb{R}^n$ and $\varphi : [0, T) \to \mathbb{R}^n$ solve the Skorokhod problem for $w$ if one has $x(t) \in K$ for all $t \in [0, T)$,
$$x(t) = w(t) + \varphi(t), \quad \forall t \in [0, T),$$
and furthermore $\varphi$ is of the form
$$\varphi(t) = -\int_0^t \nu_s\, L(ds), \quad \forall t \in [0, T),$$
where $\nu_s$ is an outer unit normal at $x(s)$, and $L$ is a measure on $[0, T]$ supported on the set $\{t \in [0, T) : x(t) \in \partial K\}$.

The path $x$ is called the reflection of $w$ at the boundary of $K$, and the measure $L$ is called the local time of $x$ at the boundary of $K$. Skorokhod showed the existence of such a pair $(x, \varphi)$ in dimension 1 in Skorokhod [1961], and Tanaka extended this result to convex sets in higher dimensions in Tanaka [1979]. Furthermore Tanaka also showed that the solution is unique, and if $w$ is continuous then so are $x$ and $\varphi$. In particular the reflected Brownian motion in $K$, denoted $(X_t)$, is defined as the reflection of the standard Brownian motion $(W_t)$ at the boundary of $K$ (existence follows by continuity of $W_t$). Observe that by Itô's formula, for any smooth function $g$ on $\mathbb{R}^n$,

$$g(X_t) - g(X_0) = \int_0^t \langle \nabla g(X_s), dW_s \rangle + \frac12 \int_0^t \Delta g(X_s)\, ds - \int_0^t \langle \nabla g(X_s), \nu_s \rangle\, L(ds). \tag{2}$$

To get a sense of what a solution typically looks like, let us work out the case where $w$ is piecewise constant (this will also be useful to realize that LMC can be viewed as the solution to a Skorokhod problem). For a sequence $g_1, \ldots, g_N \in \mathbb{R}^n$, and for $\eta > 0$, we consider the path:

$$w(t) = \sum_{k=1}^N g_k\, 1\{t \ge k\eta\}, \quad t \in [0, (N+1)\eta).$$
Define $(x_k)_{k=0,\ldots,N}$ inductively by $x_0 = 0$ and
$$x_{k+1} = \mathcal{P}_K(x_k + g_{k+1}).$$
It is easy to verify that the solution to the Skorokhod problem for $w$ is given by $x(t) = x_{\lfloor t/\eta \rfloor}$ and $\varphi(t) = -\int_0^t \nu_s\, L(ds)$, where the measure $L$ is defined by (denoting $\delta_s$ for a Dirac mass at $s$)
$$L = \sum_{k=1}^N \big| x_{k-1} + g_k - \mathcal{P}_K(x_{k-1} + g_k) \big|\, \delta_{k\eta},$$
and for $s = k\eta$,
$$\nu_s = \frac{x_{k-1} + g_k - \mathcal{P}_K(x_{k-1} + g_k)}{\big| x_{k-1} + g_k - \mathcal{P}_K(x_{k-1} + g_k) \big|}.$$
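The piecewise-constant case can be checked numerically. In the sketch below (our illustration) $K = [-1,1] \subset \mathbb{R}$, so $\mathcal{P}_K$ is a clip; we build the iterates by projection, record the local-time masses and normals at each jump, and verify the Skorokhod decomposition $x = w + \varphi$ with $\varphi = -\sum \text{mass} \cdot \nu$.

```python
import numpy as np

K_RADIUS = 1.0  # K = [-1, 1], so the projection P_K is clipping

def proj(z):
    return float(np.clip(z, -K_RADIUS, K_RADIUS))

g = [0.7, 0.8, -0.3, -2.5, 0.4]  # arbitrary jump sequence g_1, ..., g_N
x = [0.0]                        # x_0 = 0
masses, normals = [], []
for gk in g:
    z = x[-1] + gk
    p = proj(z)
    masses.append(abs(z - p))    # local-time mass deposited at this jump
    normals.append(0.0 if z == p else (z - p) / abs(z - p))  # outer unit normal
    x.append(p)

# w(k eta) = g_1 + ... + g_k, and x = w + phi with phi = -cumsum(mass * nu)
w = np.concatenate([[0.0], np.cumsum(g)])
phi = np.concatenate([[0.0], -np.cumsum(np.array(masses) * np.array(normals))])
assert np.allclose(x, w + phi)   # the Skorokhod decomposition holds exactly
```

The local time is zero except at the two jumps where the unprojected point leaves $K$, matching the formula for $L$ above.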

2.2 Discretization of reflected Brownian motion

Given the discussion above, it is clear that when $f$ is a constant function, the chain (1) can be viewed as the reflection $(\overline{X}_t)$ of the discretized Brownian motion $\overline{W}_t := W_{\eta \lfloor t/\eta \rfloor}$ at the boundary of $K$ (more precisely the value of $\overline{X}_{k\eta}$ coincides with the value of $X_k$ as defined by (1)). It is rather clear that the discretized Brownian motion $(\overline{W}_t)$ is "close" to the path $(W_t)$, and we would like to carry this over to the reflected paths $(\overline{X}_t)$ and $(X_t)$. The following lemma, extracted from Tanaka [1979], allows us to do exactly that.

Lemma 1 Let $w$ and $\overline{w}$ be piecewise continuous paths and assume that $(x, \varphi)$ and $(\overline{x}, \overline{\varphi})$ solve the Skorokhod problems for $w$ and $\overline{w}$, respectively. Then for all times $t$ we have
$$|x(t) - \overline{x}(t)|^2 \le |w(t) - \overline{w}(t)|^2 + 2 \int_0^t \big\langle w(t) - \overline{w}(t) - w(s) + \overline{w}(s),\ \varphi(ds) - \overline{\varphi}(ds) \big\rangle.$$

(7)

In the next lemma we control the local time at the boundary of the reflected Brownian motion $(X_t)$.

Lemma 2 We have, for all $t > 0$,
$$\mathbb{E} \int_0^t h_K(\nu_s)\, L(ds) \le \frac{n t}{2}.$$

Proof By Itô's formula
$$d |X_t|^2 = 2 \langle X_t, dW_t \rangle + n\, dt - 2 \langle X_t, \nu_t \rangle\, L(dt).$$
Now observe that by definition of the reflection, if $t$ is in the support of $L$ then
$$\langle X_t, \nu_t \rangle \ge \langle x, \nu_t \rangle, \quad \forall x \in K.$$
In other words $\langle X_t, \nu_t \rangle \ge h_K(\nu_t)$. Therefore
$$2 \int_0^t h_K(\nu_s)\, L(ds) \le 2 \int_0^t \langle X_s, dW_s \rangle + n t + |X_0|^2 - |X_t|^2.$$
The first term of the right-hand side is a martingale, so using that $X_0 = 0$ and taking expectation we get the result.

Lemma 3 There exists a universal constant $C$ such that
$$\mathbb{E}\Big[ \sup_{[0,T]} \|W_t - \overline{W}_t\|_K \Big] \le C\, M\, n^{1/2} \eta^{1/2} \log(T/\eta)^{1/2}.$$

Proof Note that
$$\mathbb{E}\Big[ \sup_{[0,T]} \|W_t - \overline{W}_t\|_K \Big] = \mathbb{E}\Big[ \max_{0 \le i \le N-1} Y_i \Big],$$
where
$$Y_i = \sup_{t \in [i\eta, (i+1)\eta)} \|W_t - W_{i\eta}\|_K.$$
Observe that the variables $(Y_i)$ are identically distributed. Let $p \ge 1$ and write
$$\mathbb{E}\Big[ \max_{i \le N-1} Y_i \Big] \le \mathbb{E}\Big[ \Big( \sum_{i=0}^{N-1} |Y_i|^p \Big)^{1/p} \Big] \le N^{1/p}\, \|Y_0\|_p.$$
We claim that
$$\|Y_0\|_p \le C \sqrt{p\, n\, \eta}\; M \tag{3}$$
for some constant $C$, and for all $p \ge 2$. Taking this for granted and choosing $p = \log(N)$ in the previous inequality yields the result (recall that $N = T/\eta$). So it is enough to prove (3). Observe that since $(W_t)$ is a martingale, the process
$$M_t = \|W_t\|_K$$
is a sub-martingale. By Doob's maximal inequality
$$\|Y_0\|_p = \Big\| \sup_{[0,\eta]} M_t \Big\|_p \le 2\, \|M_\eta\|_p,$$
for every $p \ge 2$. Letting $\gamma_n$ be the standard Gaussian measure on $\mathbb{R}^n$ and using Khintchine's inequality we get
$$\|M_\eta\|_p = \sqrt{\eta}\, \Big( \int_{\mathbb{R}^n} \|x\|_K^p\, \gamma_n(dx) \Big)^{1/p} \le C \sqrt{p\, \eta} \int_{\mathbb{R}^n} \|x\|_K\, \gamma_n(dx).$$
Lastly, integrating in polar coordinates, it is easily seen that
$$\int_{\mathbb{R}^n} \|x\|_K\, \gamma_n(dx) \le C \sqrt{n}\, M.$$
Hence the result.

We are now in a position to bound the average distance between $X_T$ and its discretization $\overline{X}_T$.

Proposition 1 There exists a universal constant $C$ such that for any $T \ge 0$ we have
$$\mathbb{E}\big[ |X_T - \overline{X}_T| \big] \le C \log(T/\eta)^{1/4}\, n^{3/4}\, \eta^{1/4}\, T^{1/2}\, M^{1/2}.$$

Proof Applying Lemma 1 to the processes $(W_t)$ and $(\overline{W}_t)$ at time $T = N\eta$ yields (note that $W_T = \overline{W}_T$)
$$|X_T - \overline{X}_T|^2 \le 2 \int_0^T \langle W_t - \overline{W}_t, \nu_t \rangle\, L(dt) - 2 \int_0^T \langle W_t - \overline{W}_t, \overline{\nu}_t \rangle\, \overline{L}(dt).$$
We claim that the second integral is equal to $0$. Indeed, since the discretized process is constant on the intervals $[k\eta, (k+1)\eta)$, the local time $\overline{L}$ is a positive combination of Dirac point masses at $\eta, 2\eta, \ldots, N\eta$. On the other hand $W_{k\eta} = \overline{W}_{k\eta}$ for all integers $k$, hence the claim. Therefore
$$|X_T - \overline{X}_T|^2 \le 2 \int_0^T \langle W_t - \overline{W}_t, \nu_t \rangle\, L(dt).$$
Using the inequality $\langle x, y \rangle \le \|x\|_K\, h_K(y)$ we get
$$|X_T - \overline{X}_T|^2 \le 2 \sup_{[0,T]} \|W_t - \overline{W}_t\|_K \int_0^T h_K(\nu_t)\, L(dt).$$
Taking the square root, then expectation, and using Cauchy–Schwarz we get
$$\mathbb{E}\big[ |X_T - \overline{X}_T| \big] \le \Big( 2\, \mathbb{E}\Big[ \sup_{[0,T]} \|W_t - \overline{W}_t\|_K \Big]\; \mathbb{E}\Big[ \int_0^T h_K(\nu_t)\, L(dt) \Big] \Big)^{1/2}.$$
Applying Lemma 2 and Lemma 3, we get the result.
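As a sanity check (our experiment, not from the paper), one can drive a finely reflected path and its coarse discretization with the same Brownian increments in the unit disc and look at the terminal gap; the finely reflected path stands in for the continuous process $(X_t)$, and Proposition 1 suggests the gap should shrink roughly like $\eta^{1/4}$ as the step size decreases.

```python
import numpy as np

def proj_disc(x):
    """Projection onto the closed unit disc in R^2."""
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

def terminal_gap(eta, T=1.0, fine=50, trials=200, seed=0):
    """Average |X_T - Xbar_T|: X is reflected every fine sub-step, Xbar only every eta."""
    rng = np.random.default_rng(seed)
    dt = eta / fine
    gaps = []
    for _ in range(trials):
        x_fine = np.zeros(2)
        x_coarse = np.zeros(2)
        for _ in range(int(round(T / eta))):
            inc = np.sqrt(dt) * rng.standard_normal((fine, 2))
            for step in inc:
                x_fine = proj_disc(x_fine + step)        # frequent reflection ~ continuous path
            x_coarse = proj_disc(x_coarse + inc.sum(axis=0))  # project once per eta
        gaps.append(np.linalg.norm(x_fine - x_coarse))
    return float(np.mean(gaps))

gap_big, gap_small = terminal_gap(0.2), terminal_gap(0.02)
```

With these (arbitrary) settings one typically observes the gap shrink as $\eta$ decreases, consistent with the $\eta^{1/4}$ rate, though any single Monte Carlo run is noisy.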


2.3 A mixing time estimate for the reflected Brownian motion

The reflected Brownian motion is a Markov process. We let $(P_t)$ be the associated semi-group:
$$P_t f(x) = \mathbb{E}_x[f(X_t)],$$
for every test function $f$, where $\mathbb{E}_x$ means conditional expectation given $X_0 = x$. Itô's formula shows that the generator of the semigroup $(P_t)$ is $(1/2)\Delta$ with Neumann boundary condition. Then by Stokes' formula, it is easily seen that $\mu$ (the uniform measure on $K$, normalized to be a probability measure) is the stationary measure of this process, and is even reversible. In this section we estimate the total variation between the law of $(X_t)$ and $\mu$.

Given a probability measure $\nu$ supported on $K$, we let $\nu P_t$ be the law of $X_t$ when $X_0$ has law $\nu$. The following lemma is the key result to estimate the mixing time of the process $(X_t)$.

Lemma 4 Let $x, x' \in K$. Then
$$\mathrm{TV}(\delta_x P_t, \delta_{x'} P_t) \le \frac{|x - x'|}{\sqrt{2\pi t}}.$$

Proof Let $(W_t)$ be a Brownian motion starting from $0$ and let $(X_t)$ be a reflected Brownian motion starting from $x$:
$$X_0 = x, \qquad dX_t = dW_t - \nu_t\, L(dt), \tag{4}$$
where $(\nu_t)$ and $L$ satisfy the appropriate conditions. We construct a reflected Brownian motion $(X'_t)$ starting from $x'$ as follows. Let
$$\tau = \inf\{t \ge 0;\ X_t = X'_t\},$$
and for $t < \tau$ let $S_t$ be the orthogonal reflection with respect to the hyperplane $(X_t - X'_t)^\perp$. Then up to time $\tau$, the process $(X'_t)$ is defined by
$$X'_0 = x', \qquad dX'_t = dW'_t - \nu'_t\, L'(dt), \qquad dW'_t = S_t(dW_t), \tag{5}$$
where $L'$ is a measure supported on $\{t \le \tau;\ X'_t \in \partial K\}$ and $\nu'_t$ is an outer unit normal at $X'_t$ for all such $t$. After time $\tau$ we just set $X'_t = X_t$. Since $S_t$ is an orthogonal map, $(W'_t)$ is a Brownian motion and thus $(X'_t)$ is a reflected Brownian motion starting from $x'$. Therefore
$$\mathrm{TV}(\delta_x P_t, \delta_{x'} P_t) \le \mathbb{P}(X_t \ne X'_t) = \mathbb{P}(\tau > t).$$
Observe that on $[0, \tau)$,
$$dW_t - dW'_t = (I - S_t)(dW_t) = 2 \langle V_t, dW_t \rangle V_t, \quad \text{where} \quad V_t = \frac{X_t - X'_t}{|X_t - X'_t|}.$$


So
$$d(X_t - X'_t) = 2 \langle V_t, dW_t \rangle V_t - \nu_t\, L(dt) + \nu'_t\, L'(dt) = 2\, (dB_t)\, V_t - \nu_t\, L(dt) + \nu'_t\, L'(dt),$$
where
$$B_t = \int_0^t \langle V_s, dW_s \rangle, \quad \text{on } [0, \tau).$$
Observe that $(B_t)$ is a one-dimensional Brownian motion. Itô's formula then gives
$$dg(X_t - X'_t) = 2 \langle \nabla g(X_t - X'_t), V_t \rangle\, dB_t - \langle \nabla g(X_t - X'_t), \nu_t \rangle\, L(dt) + \langle \nabla g(X_t - X'_t), \nu'_t \rangle\, L'(dt) + 2\, \nabla^2 g(X_t - X'_t)(V_t, V_t)\, dt,$$
for every $g$ which is smooth in a neighborhood of $X_t - X'_t$. Now if $g(x) = |x|$ then $\nabla g(X_t - X'_t) = V_t$, so
$$\langle \nabla g(X_t - X'_t), V_t \rangle = 1, \qquad \langle \nabla g(X_t - X'_t), \nu_t \rangle \ge 0 \ \text{on the support of } L, \qquad \langle \nabla g(X_t - X'_t), \nu'_t \rangle \le 0 \ \text{on the support of } L'. \tag{6}$$
Moreover
$$\nabla^2 g(X_t - X'_t) = \frac{1}{|X_t - X'_t|}\, P_{(X_t - X'_t)^\perp},$$
where $P_{x^\perp}$ denotes the orthogonal projection onto the hyperplane $x^\perp$. In particular
$$\nabla^2 g(X_t - X'_t)(V_t, V_t) = 0.$$
We obtain
$$|X_t - X'_t| \le |x - x'| + 2 B_t, \quad \text{on } [0, \tau).$$
Therefore
$$\mathbb{P}(\tau > t) \le \mathbb{P}(\tau_0 > t),$$
where $\tau_0$ is the first time the Brownian motion $(B_t)$ hits the value $-|x - x'|/2$. Now by the reflection principle
$$\mathbb{P}(\tau_0 > t) = 2\, \mathbb{P}\big( 0 \le 2 B_t < |x - x'| \big) \le \frac{|x - x'|}{\sqrt{2\pi t}}.$$
Hence the result.
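The coupling in the proof is easy to simulate in dimension one (our illustration: $K = [-1, 1]$, where mirror coupling simply means the second path uses the negated increments until the paths meet). The Monte Carlo estimate of $\mathbb{P}(\tau > t)$ can then be compared with the bound $|x - x'|/\sqrt{2\pi t}$ of Lemma 4.

```python
import math
import numpy as np

def coupled_not_merged_prob(x0, y0, t=1.0, dt=1e-3, trials=400, seed=1):
    """Mirror-coupled reflected BMs on K = [-1, 1]: estimate P(not merged by time t)."""
    rng = np.random.default_rng(seed)
    not_met = 0
    for _ in range(trials):
        x, y = x0, y0
        merged = False
        for _ in range(int(round(t / dt))):
            dw = math.sqrt(dt) * rng.standard_normal()
            x = min(1.0, max(-1.0, x + dw))   # reflect (project) at the boundary
            y = min(1.0, max(-1.0, y - dw))   # mirrored increment until merging
            if (x - y) * (x0 - y0) <= 0:      # the paths crossed: they have merged
                merged = True
                break
        if not merged:
            not_met += 1
    return not_met / trials

p_est = coupled_not_merged_prob(0.3, -0.3)
bound = abs(0.3 - (-0.3)) / math.sqrt(2 * math.pi * 1.0)  # Lemma 4 bound at t = 1
```

One typically finds the estimate below (and fairly close to) the bound; the boundary reflections can only push the two paths together.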

The above result clearly implies that for a probability measure $\nu$ on $K$,
$$\mathrm{TV}(\delta_0 P_t, \nu P_t) \le \frac{\int_K |x|\, \nu(dx)}{\sqrt{2\pi t}}.$$
Since $\mu$ is stationary, we obtain
$$\mathrm{TV}(\delta_0 P_t, \mu) \le \frac{m}{\sqrt{2\pi t}} \tag{7}$$

for any $t > 0$. In other words, starting from $0$, the mixing time of $(X_t)$ is of order at most $m^2$. Notice also that Lemma 4 allows us to bound the mixing time from any starting point: for every $x \in K$, we have
$$\mathrm{TV}(\delta_x P_t, \mu) \le \frac{R}{\sqrt{2\pi t}},$$
where $R$ is the diameter of $K$. Letting $\tau_{\mathrm{mix}}$ be the mixing time of $(X_t)$, namely the smallest time $t$ for which
$$\sup_{x \in K} \{\mathrm{TV}(\delta_x P_t, \mu)\} \le \frac{1}{e},$$
we obtain from the previous display $\tau_{\mathrm{mix}} \le 2 R^2$. Since for any $x$ and $t$ we have $\mathrm{TV}(\delta_x P_t, \mu) \le e^{-\lfloor t/\tau_{\mathrm{mix}} \rfloor}$ (see e.g. [Levin et al., 2008, Lemma 4.12]) we obtain in particular
$$\mathrm{TV}(\delta_0 P_t, \mu) \le e^{-\lfloor t/2R^2 \rfloor}.$$
The advantage of this over (7) is the exponential decay in $t$. On the other hand, since obviously $m \le R$, inequality (7) can be more precise for a certain range of $t$. The next proposition sums up the results of this section.

Proposition 2 For any $t > 0$, we have
$$\mathrm{TV}(\delta_0 P_t, \mu) \le C \min\big( m\, t^{-1/2},\ e^{-t/2R^2} \big),$$
where $C$ is a universal constant.
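For completeness, here is how the crude bound $\tau_{\mathrm{mix}} \le 2R^2$ follows from Lemma 4 (our unpacking of the constant):

```latex
\sup_{x \in K} \mathrm{TV}(\delta_x P_t, \mu)
  \le \frac{R}{\sqrt{2\pi t}} \le \frac{1}{e}
  \quad\Longleftrightarrow\quad
  t \ge \frac{e^2 R^2}{2\pi} \approx 1.18\, R^2,
```

so any $t \ge 2R^2$ suffices, with room to spare.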

2.4 From Wasserstein distance to total variation

In the following lemma, which is a variation on the reflection principle, $(W_t)$ is a Brownian motion, the notation $\mathbb{P}_x$ means probability given $W_0 = x$, and $(Q_t)$ denotes the heat semigroup:
$$Q_t h(x) = \mathbb{E}_x[h(W_t)],$$
for every test function $h$.

Lemma 5 Let $x \in K$ and let $\sigma$ be the first time $(W_t)$ hits the boundary of $K$. Then for all $t > 0$
$$\mathbb{P}_x(\sigma < t) \le 2\, \mathbb{P}_x(W_t \notin K) = 2\, Q_t(1_{K^c})(x).$$

Proof Let $(\mathcal{F}_t)$ be the natural filtration of the Brownian motion. Fix $t > 0$. By the strong Markov property
$$\mathbb{P}_x(W_t \notin K \mid \mathcal{F}_\sigma) = u(\sigma, W_\sigma), \tag{8}$$
where
$$u(s, y) = 1\{s < t\}\, \mathbb{P}_y(W_{t-s} \notin K).$$
Let $y \in \partial K$. Since $K$ is convex it admits a supporting hyperplane $H$ at $y$. Let $H^+$ be the halfspace delimited by $H$ containing $K$. Then for any $u > 0$
$$\mathbb{P}_y(W_u \notin K) \ge \mathbb{P}_y(W_u \notin H^+) = \frac12.$$
Equality (8) thus yields
$$\mathbb{P}_x(W_t \notin K \mid \mathcal{F}_\sigma) \ge \frac12\, 1\{\sigma < t\},$$
almost surely. Taking expectation yields the result.
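Lemma 5 can be probed by simulation (our sketch, in dimension one with $K = [-1, 1]$ and a crude Euler discretization): estimate both $\mathbb{P}_x(\sigma < t)$ and $\mathbb{P}_x(W_t \notin K)$ from the same sample paths and compare. In this example the two sides of the inequality turn out to be fairly close, reflecting the halving step in the proof.

```python
import numpy as np

def exit_vs_outside(x0=0.2, t=0.5, dt=1e-3, trials=4000, seed=2):
    """Estimate P_x(sigma < t) and P_x(W_t outside K) for K = [-1, 1] in 1-D."""
    rng = np.random.default_rng(seed)
    n = int(round(t / dt))
    hit = outside = 0
    for _ in range(trials):
        w = x0 + np.cumsum(np.sqrt(dt) * rng.standard_normal(n))
        if np.max(np.abs(w)) >= 1.0:
            hit += 1        # the path touched the boundary before time t
        if abs(w[-1]) > 1.0:
            outside += 1    # W_t ended up outside K
    return hit / trials, outside / trials

p_hit, p_out = exit_vs_outside()
```

Note that ending outside $K$ forces hitting $\partial K$, so the estimate of $\mathbb{P}_x(\sigma < t)$ always dominates that of $\mathbb{P}_x(W_t \notin K)$; the lemma says it is at most twice as large.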

We also need the following elementary estimate for the heat semigroup.

Lemma 6 For any $s \ge 0$,
$$\int_K Q_s(1_{K^c})\, dx \le \sqrt{s}\, \mathcal{H}^{n-1}(\partial K),$$
where $\mathcal{H}^{n-1}(\partial K)$ is the Hausdorff measure of the boundary of $K$.

Proof Let $\varphi(s) = \int_K Q_s(1_{K^c})\, dx$. Then by definition of the heat semigroup and Stokes' formula
$$\varphi'(s) = \frac12 \int_K \Delta Q_s(1_{K^c})\, dx = \frac12 \int_{\partial K} \langle \nabla Q_s(1_{K^c})(x), \nu(x) \rangle\, \mathcal{H}^{n-1}(dx),$$
for every $s > 0$, where $\nu(x)$ is an outer unit normal vector at the point $x$. On the other hand an elementary computation shows that for every $s > 0$,
$$|\nabla Q_s(1_{K^c})| \le s^{-1/2}, \tag{9}$$
pointwise. We thus obtain
$$|\varphi'(s)| \le \frac{\mathcal{H}^{n-1}(\partial K)}{2 \sqrt{s}},$$
for every $s > 0$. Integrating this inequality between $0$ and $s$ yields the result.
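In dimension one the lemma can be verified directly (our check): for $K = [-1, 1]$ the heat semigroup applied to $1_{K^c}$ is explicit in terms of the complementary error function, $\mathcal{H}^0(\partial K) = 2$, and a simple Riemann sum confirms the bound for several values of $s$.

```python
import math

def q_outside(x, s):
    """Q_s(1_{K^c})(x) = P(x + W_s outside [-1, 1]) for a 1-D Brownian motion."""
    return 0.5 * math.erfc((1 - x) / math.sqrt(2 * s)) \
         + 0.5 * math.erfc((1 + x) / math.sqrt(2 * s))

def integral_over_K(s, grid=4000):
    """Riemann sum of Q_s(1_{K^c}) over K = [-1, 1]."""
    h = 2.0 / grid
    return h * sum(q_outside(-1.0 + (i + 0.5) * h, s) for i in range(grid))

# Lemma 6 bound: the integral is at most sqrt(s) * H^0(∂K), with two boundary points.
checks = {s: (integral_over_K(s), 2.0 * math.sqrt(s)) for s in (0.01, 0.1, 1.0)}
```

For small $s$ the left-hand side concentrates in boundary layers of width about $\sqrt{s}$, which is exactly the scaling the lemma captures.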

Proposition 3 Let $T, S$ be integer multiples of $\eta$. Then
$$\mathrm{TV}(\overline{X}_{T+S}, X_{T+S}) \le \frac{3\, \mathbb{E}|X_T - \overline{X}_T|}{\sqrt{S}} + \mathrm{TV}(X_T, \mu) + 4 \sqrt{S}\, \mathcal{H}^{n-1}(\partial K)\, |K|^{-1}.$$

Proof We use the coupling by reflection again. Fix $x$ and $x'$ in $K$. Let $(X_t)$ and $(X'_t)$ be two Brownian motions reflected at the boundary of $K$ starting from $x$ and $x'$ respectively, such that the underlying Brownian motions $(W_t)$ and $(W'_t)$ are coupled by reflection, just as in the proof of Lemma 4. Let $(\overline{X}'_t)$ be the discretization of $(X'_t)$, namely the solution of the Skorokhod problem for the process $W'_{\eta \lfloor t/\eta \rfloor}$. Let $S$ be an integer multiple of $\eta$. Obviously, if $(X_t)$ and $(X'_t)$ have merged before time $S$ and in the meantime neither $(X_t)$ nor $(X'_t)$ has hit the boundary of $K$, then
$$X_S = X'_S = \overline{X}'_S.$$
Therefore, letting $\tau$ be the first time $X_t = X'_t$, and $\sigma$ and $\sigma'$ be the first times $(X_t)$ and $(X'_t)$ hit the boundary of $K$, respectively, we have
$$\mathbb{P}(X_S \ne \overline{X}'_S) \le \mathbb{P}(\tau > S) + \mathbb{P}(\sigma < S) + \mathbb{P}(\sigma' < S), \tag{10}$$
