HAL Id: hal-02908790
https://hal.archives-ouvertes.fr/hal-02908790
Preprint submitted on 29 Jul 2020
Ergodicity of the underdamped mean-field Langevin dynamics
Anna Kazeykina, Zhenjie Ren, Xiaolu Tan, Junjian Yang
To cite this version:
Anna Kazeykina, Zhenjie Ren, Xiaolu Tan, Junjian Yang. Ergodicity of the underdamped mean-field Langevin dynamics. 2020. hal-02908790
Ergodicity of the underdamped mean-field Langevin dynamics
Anna Kazeykina*, Zhenjie Ren†, Xiaolu Tan‡, Junjian Yang§

July 23, 2020
Abstract
We study the long time behavior of an underdamped mean-field Langevin (MFL) equation, and provide a general convergence result as well as an exponential convergence rate result under different conditions. The results on the MFL equation can be applied to study the convergence of the Hamiltonian gradient descent algorithm for overparametrized optimization. We then provide a numerical example of the algorithm to train a generative adversarial network (GAN).
1 Introduction
In this paper we study the ergodicity of the following underdamped mean-field Langevin (MFL) equation:
$$dX_t = V_t\,dt, \qquad dV_t = -\big(D_m F(\mathcal{L}(X_t), X_t) + \gamma V_t\big)\,dt + \sigma\,dW_t, \tag{1.1}$$
where $\mathcal{L}(X_t)$ represents the law of $X_t$, $F : \mathcal{P}(\mathbb{R}^n) \to \mathbb{R}$ is a function on $\mathcal{P}(\mathbb{R}^n)$ (the space of all probability measures on $\mathbb{R}^n$), $D_m F$ is its intrinsic derivative (recalled in Section 2.1), and $W$ is an $n$-dimensional standard Brownian motion. Note that the marginal distribution $m_t = \mathcal{L}(X_t, V_t)$ satisfies the nonlinear kinetic Fokker-Planck equation
$$\partial_t m = -v \cdot \nabla_x m + \nabla_v \cdot \big(\big(D_m F(m^X, x) + \gamma v\big)\,m\big) + \frac{1}{2}\,\sigma^2 \Delta_v m, \tag{1.2}$$
where $m^X$ denotes the marginal distribution of $m$ on $X$.
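A particle discretization makes (1.1) concrete: replace the law $\mathcal{L}(X_t)$ by the empirical measure of $N$ interacting particles and apply an Euler-Maruyama step. The sketch below is our own toy illustration (the quadratic-plus-mean-interaction potential and all parameter values are assumptions, not taken from the paper); at equilibrium the velocity marginal should be close to $N(0, \sigma^2/(2\gamma))$.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha, gamma, sigma = 0.5, 1.0, 1.0   # toy coefficients (our choice)
N, dt, n_steps = 2000, 0.01, 2000

# Toy potential on measures: F(m) = E_m[|X|^2/2] + (alpha/2)|E_m[X]|^2,
# whose intrinsic derivative is D_m F(m, x) = x + alpha * E_m[X].
def DmF(xs, x):
    return x + alpha * xs.mean()

x = rng.normal(3.0, 1.0, size=N)      # positions, started off-center
v = np.zeros(N)                       # velocities

for _ in range(n_steps):
    # Euler-Maruyama step for (1.1); L(X_t) is replaced by the
    # empirical measure of the N particles.
    a_v = -(DmF(x, x) + gamma * v)
    x = x + v * dt
    v = v + a_v * dt + sigma * np.sqrt(dt) * rng.normal(size=N)

print(x.mean(), v.var())   # mean drifts to 0; Var(V) near sigma^2/(2*gamma)
```

The same particle scheme, with the appropriate $D_m F$, underlies the numerical experiment of Section 3.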
Ignoring the mean-field interaction, the standard underdamped Langevin dynamics was first introduced in statistical physics to describe the motion of a particle with position $X$ and velocity $V$ in a potential field $\nabla_x F$, subject to damping and random collisions, see e.g. [20, 31, 38]. It is well known that under mild conditions the Langevin dynamics has a unique invariant measure on $\mathbb{R}^n \times \mathbb{R}^n$ with the density
$$m_\infty(x, v) = C\,e^{-\frac{2}{\sigma^2}\left(F(x) + \frac{1}{2}|v|^2\right)}, \tag{1.3}$$
where $C$ is the normalization constant. This observation brings up the interest in developing Hamiltonian Monte Carlo methods, based on various discrete-time analogues of the underdamped Langevin dynamics, for sampling according to distributions of the form (1.3), see e.g. Lelièvre, Rousset and Stoltz [32], Neal [37]. Nowadays this interest resurges in the community of machine learning. Notably, the underdamped Langevin dynamics has been empirically observed to converge more quickly to the invariant measure compared to the overdamped
* Laboratoire de Mathématique d'Orsay, Université Paris-Sud, CNRS, Université Paris-Saclay. anna.kazeykina@math.u-psud.fr
† CEREMADE, Université Paris-Dauphine, PSL. ren@ceremade.dauphine.fr
‡ Department of Mathematics, The Chinese University of Hong Kong. xiaolu.tan@cuhk.edu.hk
§ FAM, Fakultät für Mathematik und Geoinformation, Vienna University of Technology, A-1040 Vienna, Austria. junjian.yang@tuwien.ac.at
Langevin dynamics (of which the related MCMC was studied in e.g. Dalalyan [12], Durmus and Moulines [16]), and this was theoretically justified by Cheng, Chatterji, Bartlett and Jordan in [9] for a particular choice of coefficients.
More recently, the mean-field Langevin dynamics has drawn increasing attention among the attempts to rigorously prove the trainability of neural networks, in particular two-layer networks (with one hidden layer). It has become popular (see e.g. Chizat and Bach [10], Mei, Montanari and Nguyen [35], Rotskoff and Vanden-Eijnden [41], Hu, Ren, Šiška and Szpruch [28]) to relax the optimization over the weights of the two-layer network, namely,
$$\inf_{c, A, b} \int \Big| y - \sum_i c_i\,\varphi(A_i z + b_i) \Big|^2 \mu(dy, dz),$$
with $\mu$ the joint distribution of the data $z$ and the label $y$, into an optimization over probability measures:
$$\inf_{m \in \mathcal{P}(\mathbb{R}^n)} \int \big| y - \mathbb{E}_m[c\,\varphi(Az + b)] \big|^2 \mu(dy, dz),$$
where $m$ is the law of the random variable $X := (c, A, b)$.
Denote $F(m) := \int \big| y - \mathbb{E}_m[c\,\varphi(Az + b)] \big|^2 \mu(dy, dz)$. In Mei, Montanari and Nguyen [35] and Hu, Ren, Šiška and Szpruch [28] the authors further add an entropic regularization to the minimization:
$$\inf_{m \in \mathcal{P}} F(m) + \frac{\sigma^2}{2} H(m), \tag{1.4}$$
where $H$ is the relative entropy with respect to the Lebesgue measure. It follows by a variational calculus, see e.g. [28], that the necessary first order condition of the minimization above reads
$$D_m F(m^*, x) + \frac{\sigma^2}{2}\,\nabla_x \ln m^*(x) = 0.$$
Moreover, since $F$ defined above is convex, this is also a sufficient condition for $m^*$ to be a minimizer. It has been proved that such an $m^*$ can be characterized as the invariant measure of the overdamped mean-field Langevin dynamics:
$$dX_t = -D_m F(\mathcal{L}(X_t), X_t)\,dt + \sigma\,dW_t.$$
It has also been shown that the marginal laws $m_t$ converge towards $m^*$ in the Wasserstein metric.
Notably, the (stochastic) gradient descent algorithm used in training neural networks can be viewed as a numerical discretization scheme for the overdamped MFL dynamics. A similar mean-field analysis has been carried out for deep networks, optimal control and games, see e.g. Hu, Kazeykina and Ren [27], Jabir, Šiška and Szpruch [29], Conforti, Kazeykina and Ren [11], Domingo-Enrich, Jelassi and Mensch [15], Šiška and Szpruch [42], Lu, Ma, Lu, Lu and Ying [33].
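To make the link between noisy gradient descent and the overdamped MFL dynamics tangible, here is a minimal sketch (our own toy setup, not the paper's experiment): neuron particles $(c_i, a_i, b_i)$ are updated by the per-particle gradient of the empirical risk plus Gaussian noise of size $\sigma\sqrt{\mathrm{lr}}$, which is exactly an Euler-Maruyama step of the overdamped dynamics with step size $\mathrm{lr}$. The activation, data and all parameter values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: scalar inputs z, labels y = sin(z).
n_data = 120
Z = rng.uniform(-2.0, 2.0, size=n_data)
Y = np.sin(Z)

# N neuron "particles" X_i = (c_i, a_i, b_i); the network output
# (1/N) sum_i c_i tanh(a_i z + b_i) approximates E_m[c phi(a z + b)].
N, lr, sigma, n_steps = 200, 0.3, 0.05, 2000
c = rng.normal(0, 1, N); a = rng.normal(0, 1, N); b = rng.normal(0, 1, N)

def loss():
    pred = np.tanh(np.outer(Z, a) + b) @ c / N
    return float(np.mean((pred - Y) ** 2))

l0 = loss()
for _ in range(n_steps):
    h = np.tanh(np.outer(Z, a) + b)        # (n_data, N)
    err = h @ c / N - Y                    # residuals
    d = (1 - h ** 2) * c                   # chain-rule factor, (n_data, N)
    # mean-field-scaled per-particle gradients (no 1/N factor)
    gc = err @ h / n_data
    ga = (err * Z) @ d / n_data
    gb = err @ d / n_data
    noise = sigma * np.sqrt(lr)            # Euler step of sigma dW
    c += -lr * gc + noise * rng.normal(size=N)
    a += -lr * ga + noise * rng.normal(size=N)
    b += -lr * gb + noise * rng.normal(size=N)

lf = loss()
print(l0, lf)   # the empirical risk decreases along the training
```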
This paper is devoted to the study of the analogous underdamped MFL dynamics. When considering the optimization (1.4), one may in addition introduce a velocity variable $V$ and regularize the problem as
$$\inf_{m \in \mathcal{P}} \mathcal{F}(m), \quad \text{with } \mathcal{F}(m) := F(m^X) + \frac{1}{2}\,\mathbb{E}_m\big[|V|^2\big] + \frac{\sigma^2}{2} H(m), \tag{1.5}$$
where $m$ becomes the joint distribution of $(X, V)$, and $m^X$ represents its marginal distribution in $X$. By the same variational calculus as above, the first order condition reads
$$D_m F(m^{*,X}, x) + \frac{\sigma^2}{2}\,\nabla_x \ln m^*(x, v) = 0 \quad \text{and} \quad v + \frac{\sigma^2}{2}\,\nabla_v \ln m^*(x, v) = 0. \tag{1.6}$$
We are going to identify the minimizer $m^*$ as the unique invariant measure of the underdamped MFL dynamics (1.1) in two cases: (i) $F$ is convex; (ii) $F$ is possibly non-convex but satisfies further technical conditions. Moreover, in case (i) we prove that the marginal laws $m_t$ of (1.1) converge to $m^*$ under very mild conditions. In case (ii) we show that the convergence is exponentially fast, and notably that the convergence rate is dimension-free with respect to a Wasserstein-type distance.
Related works The underdamped Langevin dynamics, even in the case without mean-field interaction, is degenerate, so the classical approaches cannot be applied straightforwardly to show (exponential) ergodicity. In [45, 46], Villani introduced the term "hypocoercivity" and proved the exponential convergence of $m_t$ in $H^1_{m_\infty}$. A more direct approach was later developed by Dolbeault, Mouhot and Schmeiser [13, 14], and it has triggered many results on kinetic equations. Note that both Villani's and DMS's results on the exponential convergence rate depend heavily on the dimension, and (therefore) do not apply to the case with mean-field interaction. It is noteworthy that in the recent paper by Cao, Lu and Wang [7], the authors developed a new estimate of the convergence rate based on the variational method proposed by Armstrong and Mourrat [2]. There are a few articles in the literature studying the ergodicity of underdamped Langevin dynamics using more probabilistic arguments, see e.g. Wu [47], Rey-Bellet and Thomas [40], Talay [44], Bakry, Cattiaux and Guillin [3]. These works are mostly based on Lyapunov conditions, and the rates they obtain also depend on the dimension. In the recent work by Guillin, Liu, Wu and Zhang [23], it has been shown for the first time that the underdamped Langevin equation with non-convex potential is exponentially ergodic in $H^1_{m_\infty}$. Their argument combines Villani's hypocoercivity with a uniform functional inequality and Lyapunov conditions. To complete this brief literature review, we draw special attention to the coupling argument applied in Bolley, Guillin and Malrieu [5] and Eberle, Guillin and Zimmer [18], which yields transparent convergence rates in the sense of Wasserstein-type distances.
Theoretical novelty Most of the articles concerning the ergodicity of underdamped Langevin dynamics obtain convergence rates that depend on the dimension, and in particular very few allow both a non-convex potential and mean-field interaction. One exception is the paper of Guillin, Liu, Wu and Zhang [23], but it focuses on a particular convolution-type interaction, and their assumption of a uniform functional inequality is quite demanding. As mentioned, in this paper we address the ergodicity of the underdamped MFL dynamics in two cases.
In the case where $F$ is convex (on the space of probability measures), we provide a general ergodicity result, which mainly relies on the observation in Theorem 2.6, namely that the value of the function $\mathcal{F}$ defined in (1.5) decreases along the dynamics of the MFL and that the derivative $\frac{d\mathcal{F}(m_t)}{dt}$ can be explicitly computed. This can be viewed as an analog of the gradient flow for the overdamped Langevin equation, initiated in the seminal paper by Jordan, Kinderlehrer and Otto [30], see also the monograph by Ambrosio, Gigli and Savaré [1]. Due to the degeneracy and the mean-field interaction of the underdamped MFL process, the proof of this claim is non-trivial. Based on this observation, and using an argument similar to that in Mei, Montanari and Nguyen [35] and Hu, Ren, Šiška and Szpruch [28], we prove (in Lemma 4.9) that all cluster points of $(m_t)$ must satisfy
$$v + \frac{\sigma^2}{2}\,\nabla_v \ln m^*(x, v) = 0.$$
Finally, by invoking LaSalle's invariance principle for the dynamical system, we show that $m^*$ must satisfy the first order condition (1.6). Since $\mathcal{F}$ is convex, (1.6) is sufficient to identify $m^*$ as the unique minimizer of $\mathcal{F}$. To our knowledge, this approach for proving the ergodicity of the underdamped MFL dynamics is original, and the result holds true under very mild conditions.
In the case where $F$ is possibly non-convex but satisfies further technical conditions, we adopt the reflection-synchronous coupling technique initiated in Eberle, Guillin and Zimmer [18, 19] to obtain an exponential contraction result. Note that [18] is not concerned with mean-field interaction and the rate found there is dimension-dependent. In our context, we design a new Lyapunov function in a quadratic form (see Section 4.4.2) to obtain the contraction when the coupled particles are far apart, and as a result obtain a dimension-free convergence rate. The construction of the quadratic form shares some flavor with the argument in Bolley, Guillin and Malrieu [5]. Notably, our construction helps to capture the optimal rate in the region of interest (see Remark 4.15), so it may be more intrinsic.
The rest of the paper is organized as follows. In Section 2 we announce the main results. Before entering the detailed proofs, we study a numerical example concerning the nowadays popular generative adversarial networks (GAN). The main theorems in Section 2 guide us to propose a theoretically convergent algorithm for the GAN, and the numerical test in Section 3 shows a satisfactory result. Finally, we report the proofs in Section 4.
2 Ergodicity of the mean-field Langevin dynamics
2.1 Preliminaries
Let $\mathcal{P}(\mathbb{R}^n)$ denote the space of probability measures on $\mathbb{R}^n$, and denote by $\mathcal{P}_p(\mathbb{R}^n)$ the subspace of those with finite $p$-th moment. Unless otherwise specified, in this paper the continuity on $\mathcal{P}(\mathbb{R}^n)$ is with respect to the weak topology, while the continuity on $\mathcal{P}_p(\mathbb{R}^n)$ is in the sense of the $\mathcal{W}_p$ ($p$-Wasserstein) distance.
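Since the convergence statements below are phrased in Wasserstein distances, it may help to recall that on $\mathbb{R}$ the $p$-Wasserstein distance between two empirical measures with equally many atoms is computed by matching sorted samples (the monotone coupling is optimal in one dimension). A small sketch with our own toy data:

```python
import numpy as np

def wasserstein_1d(xs, ys, p=1):
    # W_p between empirical measures with the same number of atoms:
    # sort both samples and match them in order (optimal in 1-D).
    xs, ys = np.sort(xs), np.sort(ys)
    return float(np.mean(np.abs(xs - ys) ** p) ** (1.0 / p))

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 1000)
print(wasserstein_1d(a, a + 3.0))   # a translation by 3 gives W_1 = 3
```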
A function $F : \mathcal{P}(\mathbb{R}^n) \to \mathbb{R}$ is said to belong to $\mathcal{C}^1$ if there exists a jointly continuous function $\frac{\delta F}{\delta m} : \mathcal{P}(\mathbb{R}^n) \times \mathbb{R}^n \to \mathbb{R}$ such that
$$F(m') - F(m) = \int_0^1 \int_{\mathbb{R}^n} \frac{\delta F}{\delta m}\big((1 - u)m + u\,m', x\big)\,(m' - m)(dx)\,du.$$
When $\frac{\delta F}{\delta m}$ is continuously differentiable in $x$, we denote by $D_m F(m, x) := \nabla_x \frac{\delta F}{\delta m}(m, x)$ the intrinsic derivative of $F$. We say that a function $F \in \mathcal{C}_b^\infty$ if all the derivatives $\partial_{x_{i_1}, \dots, x_{i_k}} D_m^k F(m, x_1, \dots, x_k)$ exist and are bounded.
Let $(X, V)$ denote the canonical variable on $\mathbb{R}^n \times \mathbb{R}^n$. For $m \in \mathcal{P}(\mathbb{R}^n \times \mathbb{R}^n)$, we denote by $m^X := \mathcal{L}_m(X) = m \circ X^{-1}$ the marginal law of the variable $X$ under $m$. Denote by $H(m)$ the relative entropy of the measure $m \in \mathcal{P}(\mathbb{R}^n \times \mathbb{R}^n)$ with respect to the Lebesgue measure, that is,
$$H(m) := \mathbb{E}_m\big[\ln \rho_m(X, V)\big] = \int_{\mathbb{R}^n \times \mathbb{R}^n} \ln\big(\rho_m(x, v)\big)\,\rho_m(x, v)\,dx\,dv,$$
if $m$ has a density function $\rho_m : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}_+$; and $H(m) := \infty$ if $m$ is not absolutely continuous.
2.2 Optimization with entropy regularizer
Throughout the paper, we fix a potential function $F : \mathcal{P}(\mathbb{R}^n) \to \mathbb{R}$, and study the following optimization problem:
$$\inf_{m \in \mathcal{P}(\mathbb{R}^n \times \mathbb{R}^n)} \mathcal{F}(m), \quad \text{with } \mathcal{F}(m) := F(m^X) + \frac{1}{2}\,\mathbb{E}_m|V|^2 + \frac{\sigma^2}{2\gamma} H(m). \tag{2.1}$$

Assumption 2.1. The potential function $F : \mathcal{P}(\mathbb{R}^n) \to \mathbb{R}$ is given by
$$F(m) = F^\circ(m) + \mathbb{E}_m[f(X)],$$
where $F^\circ \in \mathcal{C}_b^\infty$, $f : \mathbb{R}^n \to \mathbb{R}$ belongs to $\mathcal{C}^\infty$ with bounded derivatives of all orders greater than or equal to 2, and
$$|f(x)| \geq \lambda |x|^2, \quad \text{for some } \lambda > 0. \tag{2.2}$$
The following result is due to a variational calculus argument; see e.g. Hu, Ren, Šiška and Szpruch [28] for a proof.

Lemma 2.2. Let Assumption 2.1 hold true. If $m = \operatorname{argmin}_{\mu \in \mathcal{P}} \mathcal{F}(\mu)$, then $m$ admits a density and there exists a constant $C \in \mathbb{R}$ such that
$$\frac{\delta F}{\delta m}(m^X, x) + \frac{|v|^2}{2} + \frac{\sigma^2}{2\gamma} \ln m(x, v) = C, \quad \text{for all } (x, v) \in \mathbb{R}^{2n}, \tag{2.3}$$
or equivalently
$$m(x, v) = C \exp\Big( -\frac{2\gamma}{\sigma^2} \Big( \frac{\delta F}{\delta m}(m^X, x) + \frac{|v|^2}{2} \Big) \Big). \tag{2.4}$$
Moreover, if $F$ is convex then (2.3) is sufficient for $m = \operatorname{argmin}_{\mu \in \mathcal{P}} \mathcal{F}(\mu)$.
Remark 2.3. In the case $F^\circ \equiv 0$, note that $m \mapsto F(m) = \mathbb{E}_m[f(X)]$ is linear and hence convex. Moreover, one has $\frac{\delta F}{\delta m}(m^X, x) = f(x)$, and the result above reduces to the classical result in the variational calculus literature.
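In this linear case the minimizer (2.4) is a fully explicit Gibbs density, which can be checked numerically. The sketch below (our own illustration, taking $f(x) = \lambda x^2$ in one dimension) builds the density on a grid and verifies that the left-hand side of (2.3) is constant:

```python
import numpy as np

lam, gamma, sigma = 1.0, 1.0, 1.0        # toy coefficients (our choice)
xs = np.linspace(-6, 6, 401)
vs = np.linspace(-6, 6, 401)
X, V = np.meshgrid(xs, vs, indexing="ij")
dx, dv = xs[1] - xs[0], vs[1] - vs[0]

f = lam * X**2                            # delta F/delta m = f(x) when F° = 0
g = np.exp(-(2 * gamma / sigma**2) * (f + V**2 / 2))
m = g / (g.sum() * dx * dv)               # normalized density (2.4)

# First order condition (2.3): this field should be constant in (x, v).
lhs = f + V**2 / 2 + (sigma**2 / (2 * gamma)) * np.log(m)
print(lhs.std())                          # ~ 0 up to floating-point error
```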
Remark 2.4. To intuitively understand the first order condition (2.3), we may analyze the simple case without the terms $\frac{1}{2}\mathbb{E}_m|V|^2$ and $\frac{\sigma^2}{2\gamma} H(m)$ in $\mathcal{F}(m)$. Given a convex function $F : \mathcal{P}(\mathbb{R}^n) \to \mathbb{R}$, we have for $m^\varepsilon = (1 - \varepsilon)m + \varepsilon m'$ that
$$F(m') - F(m) \geq \frac{1}{\varepsilon}\big(F(m^\varepsilon) - F(m)\big) = \frac{1}{\varepsilon} \int_0^\varepsilon \int_{\mathbb{R}^n} \frac{\delta F}{\delta m}\big((1 - u)m + u\,m', x\big)\,(m' - m)(dx)\,du \;\longrightarrow\; \int_{\mathbb{R}^n} \frac{\delta F}{\delta m}(m, x)\,(m' - m)(dx), \quad \text{as } \varepsilon \to 0.$$
Therefore, $\frac{\delta F}{\delta m}(m, x) = C$ is sufficient for $m$ to be a minimizer of $F$.
2.3 Lyapunov function and ergodicity
From the first order optimality condition (2.3), one can check by direct computation that a solution to the optimization problem (2.1) is also an invariant measure of the mean-field Langevin dynamics (1.1), which we recall below:
$$dX_t = V_t\,dt, \qquad dV_t = -\big(D_m F(\mathcal{L}(X_t), X_t) + \gamma V_t\big)\,dt + \sigma\,dW_t. \tag{2.5}$$

Assumption 2.5. The initial distribution of the MFL equation has finite $p$-th moments for all $p \geq 0$, i.e. $\mathbb{E}\big[|X_0|^p + |V_0|^p\big] < \infty$.
Under Assumptions 2.1 and 2.5, it is well known that the MFL equation (2.5) admits a unique strong solution $(X_t, V_t)_{t \geq 0}$, see e.g. Sznitman [43]. In this paper we first prove that the function $\mathcal{F}$ defined in (2.1) acts as a Lyapunov function for the marginal flow of the MFL dynamics (2.5), in the following sense.
Theorem 2.6. Let Assumptions 2.1 and 2.5 hold true, and denote $m_t := \mathcal{L}(X_t, V_t)$ for all $t \geq 0$. Then, for all $t > s > 0$,
$$\mathcal{F}(m_t) - \mathcal{F}(m_s) = -\int_s^t \gamma\,\mathbb{E}\Big[ \Big| V_r + \frac{\sigma^2}{2\gamma}\,\nabla_v \ln m_r(X_r, V_r) \Big|^2 \Big]\,dr.$$
With the help of the Lyapunov function $\mathcal{F}$, we may prove the convergence of the marginal laws of (2.5) towards the minimizer $m := \operatorname{argmin}_{m \in \mathcal{P}} \mathcal{F}(m)$, provided that the function $F$ is convex.
Theorem 2.7. Let Assumptions 2.1 and 2.5 hold true. Suppose in addition that the function $F$ is convex. Then the MFL dynamics (2.5) has a unique invariant measure $m$, which is also the unique minimizer of (2.1), and moreover
$$\lim_{t \to \infty} \mathcal{W}_1(m_t, m) = 0.$$
Remark 2.8. The ergodicity of diffusions with mean-field interaction is a long-standing problem. One may taste the non-triviality through the following simple example. Consider the process
$$dX_t = \big(-X_t + \alpha\,\mathbb{E}[X_t]\big)\,dt + dW_t.$$
It is not hard to show that the process $X$ has a unique invariant measure $N(0, 1/2)$ when $\alpha < 1$, and has none when $\alpha > 1$. Therefore a structural condition is inevitable to ensure the existence of a unique invariant measure and the convergence of the marginal distributions towards it. Theorem 2.7 shows that $F$ being convex on the space of probability measures is sufficient for the underdamped MFL dynamics to be ergodic. It is a sound analogue of Theorem 2.11 in Hu, Ren, Šiška and Szpruch [28], where it has been proved that the convexity of the potential function ensures the ergodicity of the overdamped MFL dynamics.
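The dichotomy in this example is easy to reproduce with a particle approximation, replacing $\mathbb{E}[X_t]$ by the empirical mean (a sketch under our own discretization choices):

```python
import numpy as np

rng = np.random.default_rng(2)

def stationary_variance(alpha, N=5000, dt=0.01, n_steps=2000):
    # Euler scheme for dX_t = (-X_t + alpha E[X_t]) dt + dW_t, with the
    # expectation replaced by the empirical mean of N particles.
    x = rng.normal(0.0, 1.0, size=N)
    for _ in range(n_steps):
        x += (-x + alpha * x.mean()) * dt + np.sqrt(dt) * rng.normal(size=N)
    return float(x.var())

print(stationary_variance(alpha=0.5))   # close to 1/2, the N(0, 1/2) variance
```

Running the same scheme with $\alpha > 1$ shows the empirical mean escaping to infinity, consistent with the non-existence of an invariant measure.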
2.4 Exponential ergodicity given small mean-field dependence
We next study the case where F is possibly non-convex, and are going to obtain an exponential convergence rate if the invariant measure exists.
Assumption 2.9. Assume that the function $F^\circ \in \mathcal{C}^1$ and that $D_m F^\circ$ exists and is Lipschitz continuous. Further assume that for any $\varepsilon > 0$ there exists $K > 0$ such that for all $m, m' \in \mathcal{P}(\mathbb{R}^n)$,
$$\big| D_m F^\circ(m, x) - D_m F^\circ(m', x') \big| \leq \varepsilon |x - x'| \quad \text{whenever } |x - x'| \geq K,$$
and that the function $f(x) = \frac{\lambda}{2}|x|^2$ with some $\lambda > 0$.

Note that $D_m\big(\mathbb{E}_m[f(X)]\big)(m, x) = \nabla_x f(x) = \lambda x$.
Example 2.10. Any function $F^\circ$ whose intrinsic derivative $D_m F^\circ$ is uniformly bounded satisfies Assumption 2.9.
Given $(X, V), (X', V') \in \mathbb{R}^{2n}$, we denote
$$P := V - V' + \gamma(X - X'), \quad r := |X - X'|, \quad u := |P|, \quad z := (X - X') \cdot P,$$
and define the function
$$\psi(X - X', V - V') := \big(1 + \beta G(X - X', P)\big)\,h(\eta u + r), \tag{2.6}$$
where the positive constants $\beta, \eta$, the quadratic form $G$ and the function $h : \mathbb{R} \to \mathbb{R}$ will be determined later. Finally define the semi-metric
$$\mathcal{W}_\psi(m, m') = \inf\Big\{ \int \psi(x - x', v - v')\,d\pi(x, v, x', v') : \pi \text{ is a coupling of } m, m' \in \mathcal{P}(\mathbb{R}^{2n}) \Big\}.$$
Theorem 2.11. Let Assumption 2.9 hold true. Further assume that
$$\big| D_m F^\circ(m, x) - D_m F^\circ(m', x) \big| \leq \iota\,\mathcal{W}_1(m, m').$$
Then for $\iota > 0$ small enough, we have
$$\mathcal{W}_\psi(m_t, m'_t) \leq e^{-ct}\,\mathcal{W}_\psi(m_0, m'_0),$$
for a constant $c > 0$ defined below in (4.41). In particular, the rate $c$ does not depend on the dimension $n$.
Remark 2.12. The proof of Theorem 2.11 is mainly based on the reflection-synchronous coupling technique developed by Eberle, Guillin and Zimmer in [18]. Note that the contraction obtained in [18] holds true under assumptions more general than Assumption 2.9, but the convergence rate there is dimension-dependent. We manage to make this tradeoff by considering a new Lyapunov function (see Section 4.4.2) and a new semi-metric, which allow us to obtain the exponential ergodicity in the case of small mean-field dependence. Notice also that Guillin, Liu, Wu and Zhang proved the exponential ergodicity in [23] for the underdamped Langevin dynamics with a convolution-type interaction, by a completely different approach based on Villani's hypocoercivity and a uniform functional inequality.
Remark 2.13. Since the function $\psi$ is not concave, the semi-metric $\mathcal{W}_\psi$ is not necessarily a metric, and therefore the contraction proved above does not imply the existence of the invariant measure; it only describes the convergence rate whenever the invariant measure exists, in particular when $F$ is convex.
3 Application to GAN
Recently there has been strong interest in generating samples according to a distribution that is only empirically known, using the so-called generative adversarial networks (GAN). From a mathematical perspective, the GAN can be viewed as a (zero-sum) game between two players, the generator and the discriminator, and can be trained through an overdamped Langevin process, see e.g. Conforti, Kazeykina and Ren [11], Domingo-Enrich, Jelassi, Mensch, Rotskoff and Bruna [15]. On the other hand, it has been empirically observed, and theoretically proved (in the case of convex potentials) by Cheng, Chatterji, Bartlett and Jordan in [9], that the simulation of the underdamped Langevin process converges more quickly than that of the overdamped Langevin dynamics. Therefore, in this section we implement an algorithm to train the GAN through the underdamped mean-field Langevin dynamics.
Denote by $\mu$ the empirically known distribution. The generator aims at generating samples of a random variable $Y$ so that its distribution $\ell$ is eventually close to $\mu$. Meanwhile, to distinguish the distributions $\mu$ and $\ell$, the discriminator trains a parametrized function $y \mapsto \Phi(m^X, y)$ of the form
$$\Phi(m^X, y) = \mathbb{E}_{m^X}\big[C\varphi(Ay + b)\big], \tag{3.1}$$
where $\varphi$ is a fixed activation function and the random variable $X := (C, A, b)$ has law $m^X$. Such a parametrization is expressive enough to represent all continuous functions on a compact set, according to the Universal Representation Theorem, see e.g. Hornik [26]. Since we are going to make use of the underdamped MFL process to train the discriminator, we also introduce the random variable $V$, so that $(X, V)$ together follow a distribution $m$.
Define
$$\mathcal{F}(m, \ell) := \int \Phi(m^X, y)\,(\ell - \mu)(dy) - \frac{\eta}{2}\,\mathbb{E}_m[|V|^2] + \frac{\sigma_0^2}{2} H(\ell) - \frac{\sigma_1^2}{2\gamma} H(m).$$
Remark 3.1. To make the following discussion mathematically rigorous, one needs to:
• truncate the variable $C$ in (3.1) to make $\Phi$ bounded;
• add a small ridge regularization to the function $\mathcal{F}$, replacing it by
$$\mathcal{F}(m, \ell) + \frac{\lambda_0}{2} \int |y|^2\,\ell(dy) - \frac{\lambda_1}{2}\,\mathbb{E}_m[|X|^2].$$
For notational simplicity, we omit these technical details in this section.
Consider the zero-sum game between the two players:
$$\text{generator}: \inf_\ell \mathcal{F}(m, \ell), \qquad \text{discriminator}: \sup_m \mathcal{F}(m, \ell).$$
One may view
$$d(\ell, \mu) := \sup_m \Big\{ \int \Phi(m^X, y)\,(\ell - \mu)(dy) - \frac{\eta}{2}\,\mathbb{E}_m[|V|^2] - \frac{\sigma_1^2}{2\gamma} H(m) \Big\}$$
as a distance between the distributions $\ell$ and $\mu$. Then, by solving the zero-sum game above, one obtains as part of the equilibrium $\ell^* \in \operatorname{argmin}_\ell \big\{ d(\ell, \mu) + \frac{\sigma_0^2}{2} H(\ell) \big\}$, which is intuitively close to $\mu$ whenever $\sigma_0$ is small.
In order to compute the equilibrium of the game, we observe, as in Conforti, Kazeykina and Ren [11], that given the choice $m$ of the discriminator, the optimal response of the generator can be computed explicitly: it has the density
$$\ell^*[m](y) = C(m)\,e^{-\frac{2}{\sigma_0^2}\,\Phi(m^X, y)}, \tag{3.2}$$
where $C(m)$ is the normalization constant depending on $m$. Then computing the value of the zero-sum game reduces to an optimization over $m$:
$$\sup_m \inf_\ell \mathcal{F}(m, \ell) = \sup_m \mathcal{F}\big(m, \ell^*[m]\big).$$
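Since (3.2) is known only up to the constant $C(m)$, it is a natural target for Metropolis-Hastings, which never needs the normalization. A minimal sketch with a stand-in discriminator $\Phi$ (a quadratic of our own choosing; in the algorithm $\Phi$ is the trained network):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma0 = 1.0

def Phi(y):
    # stand-in for the trained discriminator Phi(m^X, .)
    return 0.5 * y ** 2

def log_l_star(y):
    # log-density of (3.2), up to the unknown constant log C(m)
    return -2.0 * Phi(y) / sigma0 ** 2

def sample_l_star(n_samples, step=2.4, n_burn=500):
    # Gaussian random-walk Metropolis-Hastings targeting l*[m]
    y, out = 0.0, []
    for i in range(n_burn + n_samples):
        prop = y + step * rng.normal()
        if np.log(rng.uniform()) < log_l_star(prop) - log_l_star(y):
            y = prop                       # accept the proposal
        if i >= n_burn:
            out.append(y)
    return np.array(out)

ys = sample_l_star(20000)
print(ys.mean(), ys.var())   # with this Phi, the target is N(0, sigma0^2/2)
```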
As the main result of this paper shows, the optimizer of the problem above can be characterized as the invariant measure of the underdamped MFL dynamics
$$dX_t = \eta V_t\,dt, \qquad dV_t = -\big(D_m F(\mathcal{L}(X_t), X_t) + \gamma V_t\big)\,dt + \sigma_1\,dW_t, \tag{3.3}$$
with the potential function
$$F(m) := -\int \Phi(m^X, y)\,\big(\ell^*[m] - \mu\big)(dy) - \frac{\sigma_0^2}{2} H\big(\ell^*[m]\big).$$
Together with (3.2), we may calculate and obtain
$$D_m F(m, x) = \int D_m \Phi(m^X, y, x)\,\big(\mu - \ell^*[m]\big)(dy).$$
Next we shall support the theoretical result with a simple numerical test. We set $\mu$ as the empirical law of 2000 samples of the distribution $\frac{1}{2}N(-1, 1) + \frac{1}{2}N(4, 1)$, and the coefficients of the game as
$$\varphi(z) = \max\{-10, \min\{10, z\}\}, \quad \sigma_0 = 0.1, \quad \sigma_1 = 1, \quad \gamma = 1, \quad \lambda_0 = 0.01, \quad \lambda_1 = 0.1.$$
In order to compute the optimal response of the generator $\ell^*[m]$, we use the Gaussian random walk Metropolis-Hastings algorithm, with the optimal scaling proposed in Gelman, Roberts and Gilks [22]. Further, as the numerical scheme for the underdamped Langevin process (3.3), we adopt the well-known splitting procedure, the Brünger-Brooks-Karplus (BBK) integrator [6]; see also Section 2.2.3.2 of Lelièvre, Rousset and Stoltz [32]. Also, in the numerical implementation, the marginal law $\mathcal{L}(X_t)$ in (3.3) is replaced by the empirical law of 2000 samples of $X_t$. Along the training (the underdamped MFL dynamics), we record the potential energy
$$\int \Phi(m^X, y)\,\big(\mu - \ell^*[m]\big)(dy),$$
as well as the kinetic energy $\frac{\eta}{2}\mathbb{E}_m[|V|^2]$, and we stop the iteration once the potential energy stays considerably small. The result of the numerical test is shown in Figure 1.
Observe that the total energy is almost monotonically decreasing, as foreseen by Theorem 2.6, and that the samples generated by the GAN are visibly close to the given $\mu$.
Figure 1: Training errors and GAN samplings
4 Proofs
4.1 Some fine properties of the marginal distributions of the SDE
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be an abstract probability space, equipped with an $n$-dimensional standard Brownian motion $W$. Let $T > 0$, let $b : [0, T] \times \mathbb{R}^{2n} \to \mathbb{R}^n$ be a continuous function such that, for some constant $C > 0$,
$$|b(t, x, v) - b(t, x', v')| \leq C\big(|x - x'| + |v - v'|\big), \quad \text{for all } (t, x, v, x', v') \in [0, T] \times \mathbb{R}^{2n} \times \mathbb{R}^{2n},$$
and let $\sigma > 0$ be a positive constant. We consider the stochastic differential equation (SDE)
$$dX_t = V_t\,dt, \qquad dV_t = b(t, X_t, V_t)\,dt + \sigma\,dW_t, \tag{4.1}$$
where the initial condition $(X_0, V_0)$ satisfies $\mathbb{E}\big[|X_0|^2 + |V_0|^2\big] < \infty$. The above SDE has a unique strong solution $(X, V)$, and the marginal distribution $m_t := \mathcal{L}(X_t, V_t)$ satisfies the corresponding Fokker-Planck equation
$$\partial_t m + v \cdot \nabla_x m + \nabla_v \cdot (b\,m) - \frac{1}{2}\,\sigma^2 \Delta_v m = 0. \tag{4.2}$$
In this subsection we are going to prove some properties of the density function $\rho_t(x, v)$ of $m_t$.

Existence of positive densities. Let us fix a time horizon $T > 0$, and let $C([0, T], \mathbb{R}^n)$ be the space of all $\mathbb{R}^n$-valued continuous paths on $[0, T]$. Denote by $\overline{\Omega} := C([0, T], \mathbb{R}^n) \times C([0, T], \mathbb{R}^n)$ the canonical space, with canonical process $(X, V) = (X_t, V_t)_{0 \leq t \leq T}$ and canonical filtration $\mathbb{F} = (\mathcal{F}_t)_{t \in [0, T]}$ defined by $\mathcal{F}_t := \sigma(X_s, V_s : s \leq t)$. Let $\overline{\mathbb{P}}$ be a (Borel) probability measure on $\overline{\Omega}$, under which
$$X_t := X_0 + \int_0^t V_s\,ds, \qquad \big(\sigma^{-1} V_t\big)_{t \geq 0}\ \text{is a Brownian motion}, \tag{4.3}$$
and $\overline{\mathbb{P}} \circ (X_0, V_0)^{-1} = \mathbb{P} \circ (X_0, V_0)^{-1}$. Then under the measure $\overline{\mathbb{P}}$, $(X_0, V_0)$ is independent of $(X_t - X_0 - V_0 t,\, V_t - V_0)$, and the latter follows a Gaussian distribution with mean $0$ and $2n \times 2n$ covariance matrix
$$\sigma^2 \begin{pmatrix} \frac{t^3}{3} I_n & \frac{t^2}{2} I_n \\ \frac{t^2}{2} I_n & t\,I_n \end{pmatrix}. \tag{4.4}$$
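The covariance matrix (4.4) is just the joint covariance of a Brownian motion and its time integral, which is easy to confirm by simulation (a Monte-Carlo sketch of ours for $\sigma = 1$, $t = 1$, $n = 1$):

```python
import numpy as np

rng = np.random.default_rng(5)

# V - V_0 is a standard Brownian motion; X - X_0 - V_0 t is its integral.
n_paths, n_steps, t = 5000, 400, 1.0
dt = t / n_steps
dW = np.sqrt(dt) * rng.normal(size=(n_paths, n_steps))
V = np.cumsum(dW, axis=1)            # V_s - V_0 on the time grid
X = np.cumsum(V, axis=1) * dt        # Riemann sum for int_0^s (V_r - V_0) dr
cov = np.cov(X[:, -1], V[:, -1])
print(cov)   # close to [[t^3/3, t^2/2], [t^2/2, t]]
```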
Let $\mathbb{Q} := \mathbb{P} \circ (X, V)^{-1}$ be the image measure of the solution $(X, V)$ of the SDE (4.1), so that
$$dX_t = V_t\,dt, \qquad dV_t = b(t, X_t, V_t)\,dt + \sigma\,dW_t, \quad \mathbb{Q}\text{-a.s.}, \tag{4.5}$$
with a $\mathbb{Q}$-Brownian motion $W$. We are going to prove that $\mathbb{Q}$ is equivalent to $\overline{\mathbb{P}}$, and that
$$\frac{d\mathbb{Q}}{d\overline{\mathbb{P}}}\Big|_{\mathcal{F}_T} = Z_T, \quad \text{with } Z_t := \exp\Big(\int_0^t \sigma^{-2}\,\bar{b}_s \cdot dV_s - \frac{1}{2}\int_0^t \big|\sigma^{-1}\bar{b}_s\big|^2\,ds\Big), \tag{4.6}$$
where $\bar{b}_s := b(s, X_s, V_s)$.
Lemma 4.1. The strictly positive random variable $Z_T$ is a density under $\overline{\mathbb{P}}$, i.e. $\mathbb{E}^{\overline{\mathbb{P}}}[Z_T] = 1$.
Proof. We follow the arguments in [28, Lemma A.1] by Hu, Ren, Šiška and Szpruch. For notational simplicity, we consider the case $\sigma = 1$. The general case follows by exactly the same arguments, or simply by considering the corresponding SDE for $(\sigma^{-1} X, \sigma^{-1} V)$.
Let us denote $\bar{b}_t := b(t, X_t, V_t)$, $\widehat{Y}_t := Z_t\big(|X_t|^2 + |V_t|^2\big)$ and $f_\varepsilon(x) := \frac{x}{1 + \varepsilon x}$. By Itô's formula, one has
$$d\widehat{Y}_t = Z_t\big(2X_t \cdot V_t + 2\bar{b}_t \cdot V_t + n\big)\,dt + Z_t\big(2V_t + \bar{b}_t(|X_t|^2 + |V_t|^2)\big) \cdot dV_t, \quad \overline{\mathbb{P}}\text{-a.s.},$$
and
$$d\,\mathbb{E}^{\overline{\mathbb{P}}}\big[f_\varepsilon(\widehat{Y}_t)\big] = \mathbb{E}\Big[ \frac{Z_t\big(2X_t \cdot V_t + 2\bar{b}_t \cdot V_t + n\big)}{(1 + \varepsilon \widehat{Y}_t)^2} - \frac{\varepsilon Z_t^2\,\big|2V_t + \bar{b}_t(|X_t|^2 + |V_t|^2)\big|^2}{(1 + \varepsilon \widehat{Y}_t)^3} \Big]\,dt \leq C\,\mathbb{E}\Big[ \frac{Z_t\big(|X_t|^2 + |V_t|^2\big) + Z_t}{1 + \varepsilon \widehat{Y}_t} \Big]\,dt,$$
where we used the fact that $b(t, x, v)$ is of linear growth in $(x, v)$, and $C > 0$ is a constant independent of $\varepsilon$.
Next, notice that $Z = (Z_t)_{0 \leq t \leq T}$ is a positive local martingale under $\overline{\mathbb{P}}$, and hence a $\overline{\mathbb{P}}$-supermartingale, so that $\mathbb{E}^{\overline{\mathbb{P}}}[Z_t] \leq 1$ for all $t \in [0, T]$. Then
$$d\,\mathbb{E}^{\overline{\mathbb{P}}}\big[f_\varepsilon(\widehat{Y}_t)\big] \leq C\Big( \mathbb{E}^{\overline{\mathbb{P}}}\big[f_\varepsilon(\widehat{Y}_t)\big] + 1 \Big)\,dt \;\Longrightarrow\; \sup_{t \in [0, T]} \mathbb{E}^{\overline{\mathbb{P}}}\big[f_\varepsilon(\widehat{Y}_t)\big] < \infty.$$
Letting $\varepsilon \searrow 0$, it follows by Fatou's lemma that
$$\sup_{t \in [0, T]} \mathbb{E}^{\overline{\mathbb{P}}}\big[\widehat{Y}_t\big] = \sup_{t \in [0, T]} \mathbb{E}^{\overline{\mathbb{P}}}\big[Z_t\big(|X_t|^2 + |V_t|^2\big)\big] < \infty. \tag{4.7}$$
By Itô's formula, one obtains that, for all $t \in [0, T]$,
$$d\,\frac{Z_t}{1 + \varepsilon Z_t} = \frac{Z_t\,\bar{b}_t}{(1 + \varepsilon Z_t)^2} \cdot dW_t - \frac{\varepsilon Z_t^2\,|\bar{b}_t|^2}{(1 + \varepsilon Z_t)^3}\,dt.$$
Taking expectations on both sides, we get
$$\mathbb{E}^{\overline{\mathbb{P}}}\Big[\frac{Z_t}{1 + \varepsilon Z_t}\Big] - \frac{1}{1 + \varepsilon} = -\mathbb{E}^{\overline{\mathbb{P}}}\Big[\int_0^t \frac{\varepsilon Z_s^2\,|\bar{b}_s|^2}{(1 + \varepsilon Z_s)^3}\,ds\Big].$$
Together with the estimate (4.7), it follows from the monotone and dominated convergence theorems that $\mathbb{E}^{\overline{\mathbb{P}}}[Z_t] = 1$ for all $t \in [0, T]$. □
Lemma 4.2 (Existence of positive density). Let $(X, V)$ be the solution of (4.1). Then for all $t \in (0, T]$, $(X_t, V_t)$ has a strictly positive density function, denoted by $\rho_t$.
Proof. Notice that under $\overline{\mathbb{P}}$, $(X, V)$ can be written as the sum of a square integrable random variable and an independent Gaussian random variable with covariance (4.4), so that $\overline{\mathbb{P}} \circ (X_t, V_t)^{-1}$ has a strictly positive and smooth density function. Besides, $\mathbb{Q}$ is equivalent to $\overline{\mathbb{P}}$, with strictly positive density $d\mathbb{Q}/d\overline{\mathbb{P}} = Z_T$; it follows that $\mathbb{P} \circ (X_t, V_t)^{-1} = \mathbb{Q} \circ (X_t, V_t)^{-1}$ also has a strictly positive density function. □
Estimates on the densities. We next provide an estimate on $\nabla_v \ln \rho_t(x, v)$, which is crucial for proving Theorem 2.6.
Lemma 4.3 (Moment estimate). Suppose that $\mathbb{E}\big[|X_0|^{2p} + |V_0|^{2p}\big] < \infty$ for some $p \geq 1$. Then
$$\mathbb{E}\Big[\sup_{0 \leq t \leq T}\big(|X_t|^{2p} + |V_t|^{2p}\big)\Big] < \infty. \tag{4.8}$$
Consequently, the relative entropy between $\mathbb{Q}$ and $\overline{\mathbb{P}}$ is finite, i.e.
$$H(\mathbb{Q}\,|\,\overline{\mathbb{P}}) := \mathbb{E}^{\mathbb{Q}}\Big[\log \frac{d\mathbb{Q}}{d\overline{\mathbb{P}}}\Big] = \mathbb{E}\Big[\frac{1}{2}\int_0^T \big|\sigma^{-1} b(t, X_t, V_t)\big|^2\,dt\Big] < \infty. \tag{4.9}$$

Proof. Let us first consider (4.8). As $b$ is of linear growth in $(x, v)$, it is standard to apply Itô's formula to $|X_t|^{2p} + |V_t|^{2p}$, and then use the BDG inequality and Gronwall's lemma to obtain (4.8). Next, since $\mathbb{E}\big[|X_0|^2 + |V_0|^2\big] < \infty$, it follows from (4.6) and (4.8) that
$$H(\mathbb{Q}\,|\,\overline{\mathbb{P}}) = \mathbb{E}^{\mathbb{Q}}\Big[\frac{1}{2}\int_0^T \big|\sigma^{-1} b(t, X_t, V_t)\big|^2\,dt\Big] = \mathbb{E}\Big[\frac{1}{2}\int_0^T \big|\sigma^{-1} b(t, X_t, V_t)\big|^2\,dt\Big] < \infty. \quad \square$$
Let us introduce the time-reversed process $(\widetilde{X}, \widetilde{V})$ and the time-reversed probability measures $\widetilde{\mathbb{P}}$ and $\widetilde{\mathbb{Q}}$ on the canonical space $\overline{\Omega}$ by
$$\widetilde{X}_t := X_{T-t}, \quad \widetilde{V}_t := V_{T-t}, \quad \text{and} \quad \widetilde{\mathbb{P}} := \overline{\mathbb{P}} \circ (\widetilde{X}, \widetilde{V})^{-1}, \quad \widetilde{\mathbb{Q}} := \mathbb{Q} \circ (\widetilde{X}, \widetilde{V})^{-1}.$$

Lemma 4.4. The density function $\rho_t(x, v)$ is absolutely continuous in $v$, and it holds that
$$\mathbb{E}\Big[\int_t^T \big|\nabla_v \ln \rho_s(X_s, V_s)\big|^2\,ds\Big] < \infty, \quad \text{for all } t > 0. \tag{4.10}$$

Proof. This proof is largely based on the time-reversal argument in Föllmer [21, Lemma 3.1 and Theorem 3.10], where the author obtained a similar estimate for a non-degenerate diffusion. For simplicity of notation, let us assume $\sigma = 1$.
Step 1. We first prove that $(X, V)$ is an Itô process under $\widetilde{\mathbb{Q}}$, and that there exists an $\mathbb{F}$-predictable process $\tilde{b} = (\tilde{b}_s)_{0 \leq s \leq T}$ such that
$$\mathbb{E}^{\widetilde{\mathbb{Q}}}\Big[\int_0^{T-t} \big|\tilde{b}_s\big|^2\,ds\Big] < \infty, \quad \text{and} \quad V_t = V_0 + \int_0^t \tilde{b}_s\,ds + \widetilde{W}_t, \quad \text{for all } t > 0, \tag{4.11}$$
with an $(\mathbb{F}, \widetilde{\mathbb{Q}})$-Brownian motion $\widetilde{W}$.
Let $\overline{\mathbb{P}}^{x_0, v_0}$ be the conditional probability of $\overline{\mathbb{P}}$ given $X_0 = x_0$, $V_0 = v_0$, i.e.
$$\overline{\mathbb{P}}^{x_0, v_0}[\cdot] = \overline{\mathbb{P}}\big[\,\cdot\,\big|\,X_0 = x_0, V_0 = v_0\big], \quad \text{and} \quad \widetilde{\mathbb{P}}^{x_0, v_0} := \overline{\mathbb{P}}^{x_0, v_0} \circ (\widetilde{X}, \widetilde{V})^{-1}.$$
Recall the dynamics of $(X, V)$ under $\overline{\mathbb{P}}$ in (4.3), and note that the marginal distribution of $(X_t, V_t)$ under $\overline{\mathbb{P}}^{x_0, v_0}$ is Gaussian; in particular, its density function $\rho_t^{x_0, v_0}(x, v)$ is smooth. It follows from Theorem 2.1 of Haussmann and Pardoux [24] (or Theorem 2.3 of Millet, Nualart and Sanz [36]) that $V$ is still a diffusion process w.r.t. $(\mathbb{F}, \widetilde{\mathbb{P}}^{x_0, v_0})$, and
$$V_t - V_0 - \int_0^t \nabla_v \ln \rho_{T-s}^{x_0, v_0}(X_s, V_s)\,ds \quad \text{is an } (\mathbb{F}, \widetilde{\mathbb{P}}^{x_0, v_0})\text{-Brownian motion},$$
where by direct computation we know
$$\nabla_v \ln \rho_{T-s}^{x_0, v_0}(X_s, V_s) = \frac{6\big(x_0 + (T - s)v_0 - X_s\big)}{(T - s)^2} + \frac{4\big(v_0 - V_s\big)}{T - s} =: \tilde{c}_s(x_0, v_0).$$
Therefore,
$$\widetilde{W}^1_t := V_t - V_0 - \int_0^t \tilde{c}_s(X_T, V_T)\,ds \quad \text{is an } (\mathbb{F}^*, \widetilde{\mathbb{P}})\text{-Brownian motion},$$
where the enlarged filtration $\mathbb{F}^* = (\mathcal{F}^*_t)_{0 \leq t \leq T}$ is defined by $\mathcal{F}^*_t := \sigma\big(X_T, V_T, X_s, V_s : s \in [0, t]\big)$. By the moment estimate (4.8), we have
$$\mathbb{E}^{\widetilde{\mathbb{Q}}}\Big[\int_0^{T-t} |\tilde{c}_s(X_T, V_T)|^2\,ds\Big] = \mathbb{E}^{\mathbb{Q}}\Big[\int_t^T |\tilde{c}_s(X_0, V_0)|^2\,ds\Big] < \infty, \quad \text{for } t > 0.$$
Next, note that the relative entropy satisfies $H(\widetilde{\mathbb{Q}}\,|\,\widetilde{\mathbb{P}}) = H(\mathbb{Q}\,|\,\overline{\mathbb{P}}) < \infty$. Therefore, there exists an $\mathbb{F}^*$-predictable process $\tilde{a}$ such that $\mathbb{E}^{\widetilde{\mathbb{Q}}}\big[\int_0^T |\tilde{a}_t|^2\,dt\big] < \infty$ and
$$\widetilde{W}^2_t := \widetilde{W}^1_t - \int_0^t \tilde{a}_s\,ds = V_t - V_0 - \int_0^t \big(\tilde{a}_s + \tilde{c}_s(X_T, V_T)\big)\,ds \quad \text{is an } (\mathbb{F}^*, \widetilde{\mathbb{Q}})\text{-Brownian motion}.$$
Finally, we prove Claim (4.11) by letting $\tilde{b}_t$ denote an optional version of the process $\mathbb{E}^{\widetilde{\mathbb{Q}}}\big[\tilde{a}_t + \tilde{c}_t(X_T, V_T)\,\big|\,\mathcal{F}_t\big]$.
Step 2. Let $R : \overline{\Omega} \to \overline{\Omega}$ be the time-reversal operator defined by $R(\bar{\omega}) := (\bar{\omega}_{T-t})_{0 \leq t \leq T}$. Then for every fixed $t < T$ and $\varphi \in C_c^\infty(\mathbb{R}^{2n})$, one has
$$\mathbb{E}^{\mathbb{Q}}\big[\big(\tilde{b}_{T-t} \circ R\big)\,\varphi(X_t, V_t)\big] = -\lim_{h \searrow 0} \frac{1}{h}\,\mathbb{E}^{\mathbb{Q}}\big[\big(V_t - V_{t-h}\big)\,\varphi(X_t, V_t)\big].$$
Recall the dynamics of $(X, V)$ under $\mathbb{Q}$ in (4.5), and thus
$$\varphi(X_t, V_t) = \varphi(X_{t-h}, V_{t-h}) + \int_{t-h}^t \nabla_x \varphi(X_s, V_s) \cdot V_s\,ds + \int_{t-h}^t \nabla_v \varphi(X_s, V_s) \cdot dV_s + \frac{1}{2} \int_{t-h}^t \Delta_v \varphi(X_s, V_s)\,ds, \quad \mathbb{Q}\text{-a.s.}$$
Denoting $\bar{b}_t := b(t, X_t, V_t)$, which clearly satisfies
$$\mathbb{E}^{\mathbb{Q}}\Big[\int_0^T \big|\bar{b}_t\big|^2\,dt\Big] < \infty, \tag{4.12}$$
we have
$$\mathbb{E}^{\mathbb{Q}}\big[\big(\tilde{b}_{T-t} \circ R\big)\,\varphi(X_t, V_t)\big] = -\mathbb{E}^{\mathbb{Q}}\big[\bar{b}_t\,\varphi(X_t, V_t)\big] - \mathbb{E}^{\mathbb{Q}}\big[\nabla_v \varphi(X_t, V_t)\big].$$
Therefore, denoting by $\nabla_v\rho_t(x, v)$ the weak derivative of $\rho$ in the sense of distributions, one has
\[
\int_{\mathbb R^{2n}} \nabla_v\rho_t(x, v)\,\varphi(x, v)\,dx\,dv
= -\mathbb E^{Q}\big[\nabla_v\varphi(X_t, V_t)\big]
= \mathbb E^{Q}\big[\big(\tilde b_{T-t}\circ R + \bar b_t\big)\,\varphi(X_t, V_t)\big].
\]
As $\varphi \in C_c^\infty(\mathbb R^{2n})$ is arbitrary, this implies that, for a.e. $(x, v)$,
\[
\nabla_v\rho_t(x, v) = \rho_t(x, v)\,\mathbb E^{Q}\big[\tilde b_{T-t}\circ R + \bar b_t \,\big|\, X_t = x,\ V_t = v\big].
\]
Finally, it follows from the moment estimates in (4.11) and (4.12) that
\[
\mathbb E^{Q}\Big[\int_{t_0}^{t_1} \big|\nabla_v \ln \rho_t(X_t, V_t)\big|^2\,dt\Big]
= \mathbb E^{Q}\Big[\int_{t_0}^{t_1} \Big|\frac{\nabla_v\rho_t}{\rho_t}(X_t, V_t)\Big|^2\,dt\Big] < \infty.
\]
We hence conclude the proof by the fact that $P \circ (X, V)^{-1} = Q \circ (X, V)^{-1}$.
From (4.11), we already know that $V$ is a diffusion process w.r.t. $(\mathbb F, \widetilde Q)$. With the integrability result (4.10), we can say more about its dynamics.
Lemma 4.5. The reverse process $(\widetilde X, \widetilde V)$ is a diffusion process under $Q$; equivalently, the canonical process $(X, V)$ is a diffusion process under the reverse probability $\widetilde Q$. Moreover, $\widetilde Q$ is a weak solution to the SDE:
\[
dX_t = -V_t\,dt, \qquad
dV_t = \big(-b(t, X_t, V_t) + \sigma^2 \nabla_v \ln \rho_{T-t}(X_t, V_t)\big)\,dt + \sigma\,d\widetilde W_t,
\quad \widetilde Q\text{-a.s.}, \tag{4.13}
\]
where $\widetilde W$ is a $(\mathbb F, \widetilde Q)$-Brownian motion.
Proof It follows from the Cauchy-Schwarz inequality and (4.10) that
\[
\int_t^T \int_{\mathbb R^{2n}} |\nabla_v\rho_s(x, v)|\,dx\,dv\,ds
\le \Big( (T-t) \int_t^T \int_{\mathbb R^{2n}} \frac{|\nabla_v\rho_s(x, v)|^2}{\rho_s(x, v)^2}\,\rho_s(x, v)\,dx\,dv\,ds \Big)^{\frac12} < \infty,
\quad\text{for all } T > t > 0.
\]
Together with the Lipschitz assumption on the coefficient $b(t, x, v)$, the desired result is a direct consequence of Theorem 2.1 of Haussmann and Pardoux [24], or Theorem 2.3 of Millet, Nualart and Sanz [36].
Finally, we provide a sufficient condition on b to ensure that the density function ρ of (X, V ) is a smooth function.
Lemma 4.6 (Regularity of the density). Assume in addition that $b \in C^\infty((0, T) \times \mathbb R^{2n})$, with all derivatives of order $k$ bounded for every $k \ge 1$. Then the function $(t, x, v) \mapsto \rho_t(x, v)$ belongs to $C^\infty((0, T) \times \mathbb R^{2n})$.
Proof Under the additional regularity conditions on $b$, it is easy to check that the coefficients of SDE (4.1) satisfy Hörmander's condition, and hence the density function $\rho \in C^\infty((0, T) \times \mathbb R^{2n})$ (see e.g. Bally [4, Theorem 5.1, Remark 5.2]).
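To make the appeal to Hörmander's condition concrete, one can write the generator of the kinetic SDE in Hörmander form; the following is a sketch under the sign convention that (4.1) reads $dX_t = V_t\,dt$, $dV_t = -b(t, X_t, V_t)\,dt + \sigma\,dW_t$. With the vector fields

```latex
A_0 \;=\; v\cdot\nabla_x \;-\; b(t,x,v)\cdot\nabla_v,
\qquad
A_i \;=\; \sigma\,\partial_{v_i}, \quad i=1,\dots,n,
```

one computes the Lie brackets $[A_i, A_0] = \sigma\,\partial_{x_i} - \sigma\,(\partial_{v_i} b)\cdot\nabla_v$, so that $\{A_1,\dots,A_n,[A_1,A_0],\dots,[A_n,A_0]\}$ spans $\mathbb R^{2n}$ at every point: the weak Hörmander condition holds, and hypoellipticity yields the smoothness of $\rho$.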
Application to the MFL equation (2.5) We will apply the above technical results to the MFL equation (2.5). Let $(X, V)$ be the unique solution of (2.5), and $m^X_t := \mathcal L(X_t)$; then $(X, V)$ is also the unique solution of SDE (4.1) with drift coefficient function
\[
b(t, x, v) := D_m F^\circ(m^X_t, x) + \nabla_x f(x) + \gamma v. \tag{4.14}
\]
Proposition 4.7. (i) Let Assumption 2.1 hold true; then $b(t, x, v)$ is a continuous function, uniformly Lipschitz in $(x, v)$.
(ii) Suppose in addition that Assumption 2.5 holds true; then $b \in C^\infty((0, \infty) \times \mathbb R^n \times \mathbb R^n)$ and, for each $k \ge 1$, its derivatives of order $k$ are bounded.
Proof (i) For a diffusion process $(X, V)$, it is clear that $t \mapsto m^X_t$ is continuous, so that $(t, x, v) \mapsto b(t, x, v) = D_m F^\circ(m^X_t, x) + \nabla_x f(x) + \gamma v$ is continuous. Moreover, it is clear that $b$ is globally Lipschitz in $(x, v)$ under Assumption 2.1.
(ii) Let us denote
\[
b^\circ(t, x) := D_m F^\circ(m^X_t, x).
\]
We claim that, for the coefficient function $b$ defined in (4.14) and all $k \ge 0$, one has
\[
\partial_t^k b^\circ(t, x) = \mathbb E\Big[ \sum_{i=0}^{k} \sum_{j=0}^{k-i} \varphi^n_{i,j}\big(m^X_t, X_t, V_t, x\big)\, X_t^i\, V_t^j \Big], \tag{4.15}
\]
where the $\varphi^n_{i,j}$ are smooth functions with bounded derivatives of any order.
Further, it follows by Lemma 4.3 that, under the additional conditions in Assumption 2.5, one has $\mathbb E\big[\sup_{0\le t\le T}(|X_t|^p + |V_t|^p)\big] < \infty$ for all $T > 0$ and $p \ge 1$. Therefore, one has $b^\circ \in C^\infty((0, \infty) \times \mathbb R^n)$, and hence $b \in C^\infty((0, \infty) \times \mathbb R^n \times \mathbb R^n)$.
It is enough to prove (4.15). Recall (see e.g. Carmona and Delarue [8]) that for a smooth function $\varphi : \mathcal P(\mathbb R^n) \times \mathbb R^n \times \mathbb R^n \to \mathbb R$, one has the Itô formula
\[
\begin{aligned}
d\varphi(m^X_t, X_t, V_t)
&= \mathbb E\big[D_m\varphi(m^X_t)(X_t)\cdot V_t\big]\,dt
+ \nabla_x\varphi(m^X_t, X_t, V_t)\cdot V_t\,dt \\
&\quad - \nabla_v\varphi(m^X_t, X_t, V_t)\cdot\big(D_m F(m^X_t, X_t) + \gamma V_t\big)\,dt
+ \frac12\,\sigma^2\,\Delta_v\varphi(m^X_t, X_t, V_t)\,dt \\
&\quad + \nabla_v\varphi(m^X_t, X_t, V_t)\cdot\sigma\,dW_t.
\end{aligned}
\]
Then we can easily conclude the proof of (4.15) by an induction argument.
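As an illustration of the induction, the base case $k = 1$ of (4.15) follows by applying the above Itô formula to the map $m \mapsto D_m F^\circ(m, x)$ for fixed $x$, which depends on its arguments only through the measure, so that only the first term survives:

```latex
\partial_t b^{\circ}(t,x)
\;=\; \frac{d}{dt}\, D_m F^{\circ}(m^X_t, x)
\;=\; \mathbb{E}\Big[ D_m\big(D_m F^{\circ}(\cdot,x)\big)(m^X_t)(X_t)\cdot V_t \Big],
```

which is of the form (4.15) with $(i, j) = (0, 1)$; here the boundedness of the second-order measure derivative is taken from Assumption 2.5.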
4.2 Proof of Theorem 2.6
Let us fix $T > 0$ and consider the reverse probability $\widetilde Q$ given before Lemma 4.4, with the coefficient function $b$ in (4.14). Recall also the dynamics of $(X, V)$ under $\widetilde Q$ in (4.13). Applying the Itô formula to $\ln \rho_{T-t}(X_t, V_t)$, and then using the Fokker-Planck equation (1.2), it follows that
\[
\begin{aligned}
d\ln\rho_{T-t}(X_t, V_t)
&= \Big( -\frac{\partial_t \rho_{T-t}}{\rho_{T-t}}(X_t, V_t) - \nabla_x \ln\rho_{T-t}(X_t, V_t)\cdot V_t + \frac12\,\sigma^2 \Delta_v \ln\rho_{T-t}(X_t, V_t) \\
&\qquad + \nabla_v \ln\rho_{T-t}(X_t, V_t)\cdot\big(-b(t, X_t, V_t) + \sigma^2 \nabla_v \ln\rho_{T-t}(X_t, V_t)\big) \Big)\,dt
+ \nabla_v \ln\rho_{T-t}(X_t, V_t)\cdot\sigma\,d\widetilde W_t \\
&= \Big( -n\gamma + \frac12\Big|\sigma\frac{\nabla_v \rho_{T-t}}{\rho_{T-t}}(X_t, V_t)\Big|^2 \Big)\,dt
+ \nabla_v \ln\rho_{T-t}(X_t, V_t)\cdot\sigma\,d\widetilde W_t, \quad \widetilde Q\text{-a.s.}
\end{aligned}
\]
By (4.10), it follows that, for $t > 0$,
\[
dH(m_t) = d\,\mathbb E^{\widetilde Q}\big[\ln \rho_t(X_{T-t}, V_{T-t})\big]
= \Big( n\gamma - \frac12\,\mathbb E\Big[\big|\sigma\nabla_v \ln \rho_t(X_t, V_t)\big|^2\Big] \Big)\,dt, \tag{4.16}
\]
where the sign change relative to the preceding display comes from the reversal of time $s = T - t$. On the other hand, recall that
\[
F(m) = F^\circ(m) + \mathbb E^m[f(X)],
\qquad\text{and}\qquad
D_m F\big(\mathcal L(X_t)\big) = D_m F^\circ\big(\mathcal L(X_t)\big) + \nabla f. \tag{4.17}
\]
By a direct computation, one has
\[
dF^\circ\big(\mathcal L(X_t)\big) = \mathbb E\big[D_m F^\circ\big(\mathcal L(X_t), X_t\big)\cdot V_t\big]\,dt. \tag{4.18}
\]
By the Itô formula and (4.17), one has
\[
\begin{aligned}
d\Big( f(X_t) + \frac12|V_t|^2 \Big)
&= \Big( \nabla f(X_t)\cdot V_t - V_t\cdot\big(D_m F(\mathcal L(X_t), X_t) + \gamma V_t\big) + \frac12\,\sigma^2 n \Big)\,dt + V_t\cdot\sigma\,dW_t \\
&= \Big( -D_m F^\circ(\mathcal L(X_t), X_t)\cdot V_t - \gamma|V_t|^2 + \frac12\,\sigma^2 n \Big)\,dt + V_t\cdot\sigma\,dW_t. \tag{4.19}
\end{aligned}
\]
Combining (4.16), (4.18) and (4.19), we obtain
\[
\begin{aligned}
dF(m_t) &= d\Big( F\big(\mathcal L(X_t)\big) + \frac12\,\mathbb E\big[|V_t|^2\big] + \frac{\sigma^2}{2\gamma}\,H(m_t) \Big) \\
&= \mathbb E\Big[ -\gamma|V_t|^2 + \sigma^2 n - \frac{\sigma^4}{4\gamma}\,\big|\nabla_v \ln \rho_t(X_t, V_t)\big|^2 \Big]\,dt. \tag{4.20}
\end{aligned}
\]
Further, by Lemmas 4.3 and 4.4, it is clear that $\mathbb E\big[\nabla_v \ln \rho_t(X_t, V_t)\cdot V_t\big] < \infty$, and by integration by parts we have
\[
\mathbb E\big[\nabla_v \ln \rho_t(X_t, V_t)\cdot V_t\big]
= \frac12 \int_{\mathbb R^n}\int_{\mathbb R^n} \nabla_v\rho_t(x, v)\cdot\nabla_v|v|^2\,dx\,dv
= -\frac12 \int_{\mathbb R^n}\int_{\mathbb R^n} \rho_t(x, v)\,\Delta_v|v|^2\,dx\,dv = -n.
\]
Together with (4.20), it follows that
\[
dF(m_t) = -\gamma\,\mathbb E\Big[ \Big| V_t + \frac{\sigma^2}{2\gamma}\,\nabla_v \ln \rho_t(X_t, V_t) \Big|^2 \Big]\,dt.
\]
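As a consistency check, expanding the square in this last identity and using $\mathbb E[\nabla_v \ln \rho_t(X_t, V_t)\cdot V_t] = -n$ recovers (4.20) term by term:

```latex
-\gamma\,\mathbb{E}\Big[\Big|V_t+\frac{\sigma^{2}}{2\gamma}\nabla_v\ln\rho_t(X_t,V_t)\Big|^{2}\Big]
\;=\; -\gamma\,\mathbb{E}\big[|V_t|^{2}\big]
\;-\;\sigma^{2}\,\mathbb{E}\big[V_t\cdot\nabla_v\ln\rho_t(X_t,V_t)\big]
\;-\;\frac{\sigma^{4}}{4\gamma}\,\mathbb{E}\big[|\nabla_v\ln\rho_t(X_t,V_t)|^{2}\big]
\;=\; \mathbb{E}\Big[-\gamma|V_t|^{2}+\sigma^{2}n-\frac{\sigma^{4}}{4\gamma}\big|\nabla_v\ln\rho_t(X_t,V_t)\big|^{2}\Big].
```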
4.3 Proof of Theorem 2.7
Let $(m_t)_{t\in\mathbb R_+}$ be the flow of marginal laws of the solution to (2.5), given an initial law $m_0 \in \mathcal P_2(\mathbb R^{2n})$. Define a dynamical system $S(t)[m_0] := m_t$. We shall consider the so-called $w$-limit set:
\[
w(m_0) := \Big\{ \mu \in \mathcal P_2(\mathbb R^{2n}) :\ \text{there exist } t_k \to \infty \text{ such that } \mathcal W_1\big(S(t_k)[m_0], \mu\big) \to 0 \Big\}.
\]
We recall LaSalle’s invariance principle.
Proposition 4.8 (Invariance principle). Let Assumptions 2.1 and 2.5 hold true. Then the set $w(m_0)$ is nonempty, $\mathcal W_1$-compact and invariant, that is:
1. for any $\mu \in w(m_0)$, we have $S(t)[\mu] \in w(m_0)$ for all $t \ge 0$;
2. for any $\mu \in w(m_0)$ and all $t \ge 0$, there exists $\mu' \in w(m_0)$ such that $S(t)[\mu'] = \mu$.
Proof Under the standing assumptions, it follows from Lemma 4.3 that $t \mapsto S(t)$ is continuous with respect to the $\mathcal W_1$-topology. On the other hand, due to Theorem 2.6 and the fact that the relative entropy satisfies $H \ge 0$, we know that $\{F(m_t) + \frac12\mathbb E[|V_t|^2]\}_{t\ge 0}$ is bounded. Together with (2.2), we obtain
\[
\sup_{t\ge 0}\ \mathbb E\big[|X_t|^2 + |V_t|^2\big] < \infty.
\]
Therefore $\big(S(t)[m_0]\big)_{t\ge 0} = (m_t)_{t\ge 0}$ lives in a $\mathcal W_1$-compact subset of $\mathcal P_2(\mathbb R^{2n})$. The desired result follows from the invariance principle, see e.g. Henry [25, Theorem 4.3.3].
Lemma 4.9. Let Assumptions 2.1 and 2.5 hold true. Then every $m^* \in w(m_0)$ has a density, and we have
\[
v + \frac{\sigma^2}{2\gamma}\,\nabla_v \ln m^*(x, v) = 0, \quad \mathrm{Leb}_{2n}\text{-a.s.} \tag{4.21}
\]
Proof Let $m^* \in w(m_0)$ and denote by $(m_{t_k})_{k\in\mathbb N}$ a subsequence converging to $m^*$ in $\mathcal W_1$.
Step 1. We first prove that there exists a sequence $\delta_i \to 0$ such that
\[
\liminf_{k\to\infty}\ \mathbb E\Big[ \Big| V_{t_k+\delta_i} + \frac{\sigma^2}{2\gamma}\,\nabla_v \ln m_{t_k+\delta_i}\big(X_{t_k+\delta_i}, V_{t_k+\delta_i}\big) \Big|^2 \Big] = 0, \quad\text{for all } i \in \mathbb N. \tag{4.22}
\]
Suppose the contrary. Then we would have, for some $\delta > 0$,
\[
0 < \int_0^\delta \liminf_{k\to\infty}\ \mathbb E\Big[ \Big| V_{t_k+s} + \frac{\sigma^2}{2\gamma}\,\nabla_v \ln m_{t_k+s}\big(X_{t_k+s}, V_{t_k+s}\big) \Big|^2 \Big]\,ds
\le \liminf_{k\to\infty} \int_0^\delta \mathbb E\Big[ \Big| V_{t_k+s} + \frac{\sigma^2}{2\gamma}\,\nabla_v \ln m_{t_k+s}\big(X_{t_k+s}, V_{t_k+s}\big) \Big|^2 \Big]\,ds,
\]
where the last inequality is due to Fatou's lemma. This contradicts Theorem 2.6 and the fact that $F$ is bounded from below.
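The quantitative input behind this contradiction can be spelled out. Since the free energy of Theorem 2.6 is bounded from below and nonincreasing, integrating its dissipation identity gives

```latex
\gamma \int_0^{\infty} \mathbb{E}\Big[\Big|V_t+\frac{\sigma^{2}}{2\gamma}\nabla_v\ln m_t(X_t,V_t)\Big|^{2}\Big]\,dt \;<\; \infty,
```

so the tail integrals $\int_{t_k}^{t_k+\delta}$ of the (nonnegative) integrand tend to $0$ along any sequence $t_k \to \infty$, contradicting the strictly positive lower bound displayed above.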
Step 2. Denote $t^i_k := t_k + \delta_i$ and $m^*_t := S(t)[m^*]$. Note that
\[
\lim_{k\to\infty} \mathcal W_1\big(m_{t_k}, m^*\big) = 0
\ \Longrightarrow\
\lim_{k\to\infty} \mathcal W_1\big(m_{t^i_k}, m^*_{\delta_i}\big)
= \lim_{k\to\infty} \mathcal W_1\big(S(\delta_i)[m_{t_k}], S(\delta_i)[m^*]\big) = 0.
\]
Now fix $i \in \mathbb N$. Due to Theorem 2.6 and the fact that $\{F(m_t) + \frac12\mathbb E[|V_t|^2]\}_{t\ge 0}$ is bounded from below, the set $\{H(m_{t^i_k})\}_{k\in\mathbb N}$ is uniformly bounded. Therefore the densities $(m_{t^i_k})_{k\in\mathbb N}$ are uniformly integrable with respect to the Lebesgue measure, and thus $m^*$ has a density. Note that
\[
\mathbb E\Big[ \Big| V_{t^i_k} + \frac{\sigma^2}{2\gamma}\,\nabla_v \ln m_{t^i_k}\big(X_{t^i_k}, V_{t^i_k}\big) \Big|^2 \Big]
= \frac{\sigma^4}{4\gamma^2} \int_{\mathbb R^{2n}}
\frac{\Big|\nabla_v\Big( m_{t^i_k}(x, v)\,e^{\frac{\gamma}{\sigma^2}|v|^2} \Big)\Big|^2}{m_{t^i_k}(x, v)\,e^{\frac{\gamma}{\sigma^2}|v|^2}}\;
e^{-\frac{\gamma}{\sigma^2}|v|^2}\,dx\,dv.
\]
Denote by $\mu^*_v := \mathcal N\big(0, \frac{\sigma^2}{2\gamma} I_n\big)$ and define the function $h^i_k(x, v) := m_{t^i_k}(x, v)\,e^{\frac{\gamma}{\sigma^2}|v|^2}$. By the logarithmic Sobolev inequality for the Gaussian distribution we obtain
\[
\int \Big( \int h^i_k \ln h^i_k\,d\mu^*_v - \Big(\int h^i_k\,d\mu^*_v\Big) \ln\Big(\int h^i_k\,d\mu^*_v\Big) \Big)\,dx
\le C \int\!\!\int \frac{|\nabla_v h^i_k|^2}{h^i_k}\,d\mu^*_v\,dx.
\]
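For reference, the logarithmic Sobolev inequality invoked here is the standard one for the Gaussian measure $\mu^*_v = \mathcal N(0, \frac{\sigma^2}{2\gamma} I_n)$, applied for each fixed $x$ to the nonnegative function $h = h^i_k(x, \cdot)$; the constant follows from the usual form $\operatorname{Ent}_\mu(f^2) \le 2s^2\,\mathbb E_\mu[|\nabla f|^2]$ for $\mathcal N(0, s^2 I_n)$ via the substitution $h = f^2$, $s^2 = \sigma^2/2\gamma$:

```latex
\int h \ln h \, d\mu^{*}_v \;-\; \Big(\int h \, d\mu^{*}_v\Big) \ln\Big(\int h \, d\mu^{*}_v\Big)
\;\le\; \frac{\sigma^{2}}{4\gamma} \int \frac{|\nabla_v h|^{2}}{h}\, d\mu^{*}_v,
```

so the constant $C$ above may be taken as $\sigma^2/(4\gamma)$, independently of $i$ and $k$.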
Together with (4.22) we obtain
\[
0 = \lim_{k\to\infty}\ \mathbb E\Big[ \Big| V_{t^i_k} + \frac{\sigma^2}{2\gamma}\,\nabla_v \ln m_{t^i_k}\big(X_{t^i_k}, V_{t^i_k}\big) \Big|^2 \Big]
\ge C \limsup_{k\to\infty}
\]