Some theoretical insights into WGANs
Gérard Biau GdR ISIS, June 2021
Team
Benoît Cadre University Rennes 2
Maxime Sangnier Sorbonne University
Ugo Tanielian Criteo AI Lab
Source: https://towardsdatascience.com/how-i-got-a-computer-to-make-fake-people-using-ai-gans-a8e2f542e992
Source: https://www.whichfaceisreal.com
Source: https://en.wikipedia.org/wiki/Edmond_de_Belamy
Source: https://vue.ai
Original paper
Generative Adversarial Nets
Ian J. Goodfellow∗, Jean Pouget-Abadie†, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair‡, Aaron Courville, Yoshua Bengio§
Département d'informatique et de recherche opérationnelle, Université de Montréal, Montréal, QC H3C 3J7
Abstract
We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
1 Introduction
The promise of deep learning is to discover rich, hierarchical models [2] that represent probability distributions over the kinds of data encountered in artificial intelligence applications, such as natural images, audio waveforms containing speech, and symbols in natural language corpora. So far, the most striking successes in deep learning have involved discriminative models, usually those that map a high-dimensional, rich sensory input to a class label [14, 20]. These striking successes have primarily been based on the backpropagation and dropout algorithms, using piecewise linear units [17, 8, 9] which have a particularly well-behaved gradient. Deep generative models have had less of an impact, due to the difficulty of approximating many intractable probabilistic computations that arise in maximum likelihood estimation and related strategies, and due to difficulty of leveraging the benefits of piecewise linear units in the generative context. We propose a new generative model estimation procedure that sidesteps these difficulties.¹
In the proposed adversarial nets framework, the generative model is pitted against an adversary: a discriminative model that learns to determine whether a sample is from the model distribution or the data distribution. The generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency. Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles.
∗ Ian Goodfellow is now a research scientist at Google, but did this work earlier as a UdeM student.
† Jean Pouget-Abadie did this work while visiting Université de Montréal from Ecole Polytechnique.
‡ Sherjil Ozair is visiting Université de Montréal from Indian Institute of Technology Delhi.
§ Yoshua Bengio is a CIFAR Senior Fellow.
¹ All code and hyperparameters available at http://www.github.com/goodfeli/adversarial
Source: https://www.oreilly.com
Outline
1. Mathematical context
2. Wasserstein GANs
3. Optimization properties
4. GroupSort neural networks
5. Asymptotic properties
Mathematical context
Objective
• Target: probability measure $\mu^\star$ on $E \subseteq \mathbb{R}^D$.
• Goal: generate according to $\mu^\star$.
• Data: $X_1, \dots, X_n$ i.i.d. as $\mu^\star$.
Source: Shao et al. (2018). The Riemannian geometry of deep generative models, CVPR 2018.
The generator
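• A parametric family of functions from $\mathbb{R}^d$ to $E$.
• Notation: $\mathscr{G} = \{G_\theta : \theta \in \Theta\}$, $\Theta \subset \mathbb{R}^P$.
• Principle: $\Omega \xrightarrow{Z} \mathbb{R}^d \xrightarrow{G_\theta} E$.
• Definition: $G_\theta(Z) \overset{\mathscr{L}}{\sim} \mu_\theta$.
• Associated family of distributions: $\mathscr{P} = \{\mu_\theta : \theta \in \Theta\}$.
• Each $\mu_\theta$ is a candidate to represent $\mu^\star$.
• $Z$ is typically uniform or Gaussian, with $d$ small.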
A specific framework
• In GAN algorithms, each $G_\theta$ is a neural network.
• Bad ideas:
  . Exhaustive description of $\mu^\star$ by a classical parametric model.
  . Estimation by maximum likelihood.
  . A strategy based on nonparametric density estimation.
• It is not assumed that $\mu^\star$ belongs to $\mathscr{P}$.
• Forget classical rules.
Generator’s architecture
ReLU neural networks
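$$G_\theta(z) = \underbrace{U_p}_{D \times u_{p-1}} \sigma\Big( \underbrace{U_{p-1}}_{u_{p-1} \times u_{p-2}} \cdots \sigma\big( \underbrace{U_2}_{u_2 \times u_1} \sigma( \underbrace{U_1}_{u_1 \times d} z + \underbrace{b_1}_{u_1 \times 1} ) + \underbrace{b_2}_{u_2 \times 1} \big) \cdots + \underbrace{b_{p-1}}_{u_{p-1} \times 1} \Big) + \underbrace{b_p}_{D \times 1}$$
ReLU activation: $\sigma(x) = \max(x, 0)$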
Generator with p = 4
[Figure: fully connected network with input $z = (z^{(1)}, z^{(2)}, z^{(3)})$, so $d = 3$; three hidden layers of widths $u_1 = 5$, $u_2 = 4$, $u_3 = 6$; output dimension $D = 8$.]
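As a rough illustration of the ReLU generator above, the following pure-Python sketch builds $G_\theta$ with the layer sizes of the $p = 4$ example ($d = 3$, $u_1 = 5$, $u_2 = 4$, $u_3 = 6$, $D = 8$) and draws a few samples from $\mu_\theta$ by pushing Gaussian noise through it. The Gaussian weight initialisation and the helper names (`init_params`, `affine`, `generator`) are assumptions made for the example, not part of the original slides.

```python
import random

def relu(v):
    """Componentwise ReLU: sigma(x) = max(x, 0)."""
    return [max(x, 0.0) for x in v]

def affine(U, b, v):
    """Return U v + b for a matrix U (list of rows) and vectors b, v."""
    return [sum(u_ij * v_j for u_ij, v_j in zip(row, v)) + b_i
            for row, b_i in zip(U, b)]

def init_params(sizes, rng):
    """Hypothetical Gaussian weights for layer sizes (d, u_1, ..., u_{p-1}, D)."""
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        U = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_out)]
        params.append((U, b))
    return params

def generator(params, z):
    """G_theta(z): ReLU after every layer except the last, which is affine."""
    h = z
    for U, b in params[:-1]:
        h = relu(affine(U, b, h))
    U_p, b_p = params[-1]
    return affine(U_p, b_p, h)

rng = random.Random(0)
sizes = [3, 5, 4, 6, 8]          # d, u_1, u_2, u_3, D (so p = 4)
theta = init_params(sizes, rng)

# Sampling from mu_theta: draw Z ~ N(0, I_d), then push it through G_theta.
samples = [generator(theta, [rng.gauss(0.0, 1.0) for _ in range(sizes[0])])
           for _ in range(5)]
```

Each element of `samples` is a vector in $\mathbb{R}^8$; with untrained weights it has no reason to resemble draws from $\mu^\star$.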
The discriminator
• Discriminator: a parametric family of functions from $E$ to $[0,1]$.
• Notation: $\mathscr{D} = \{D_\alpha : \alpha \in \Lambda\}$, $\Lambda \subseteq \mathbb{R}^Q$.
• In GAN algorithms, each $D_\alpha$ is a neural network.
• The higher $D(x)$, the higher the probability that $x$ is drawn from $\mu^\star$.
• The generator and the discriminator have opposite objectives.
Source: https://www.wikihow.com
Discriminator’s architecture
GroupSort neural networks
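$$D_\alpha(x) = \underbrace{V_q}_{1 \times v_{q-1}} \tilde\sigma\Big( \underbrace{V_{q-1}}_{v_{q-1} \times v_{q-2}} \cdots \tilde\sigma\big( \underbrace{V_2}_{v_2 \times v_1} \tilde\sigma( \underbrace{V_1}_{v_1 \times D} x + \underbrace{c_1}_{v_1 \times 1} ) + \underbrace{c_2}_{v_2 \times 1} \big) + \cdots + \underbrace{c_{q-1}}_{v_{q-1} \times 1} \Big) + \underbrace{c_q}_{1 \times 1}$$
GroupSort activation:
$$\tilde\sigma(x_1, x_2, \dots, x_{2n-1}, x_{2n}) = \big(\max(x_1, x_2), \min(x_1, x_2), \dots, \max(x_{2n-1}, x_{2n}), \min(x_{2n-1}, x_{2n})\big)$$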
Discriminator with q = 4
[Figure: fully connected network with input $x = (x^{(1)}, \dots, x^{(6)})$, so $D = 6$; three hidden layers of widths $v_1 = 4$, $v_2 = 2$, $v_3 = 6$.]
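The GroupSort discriminator above can be sketched in a few lines of pure Python. The example uses the layer sizes of the $q = 4$ figure ($D = 6$, $v_1 = 4$, $v_2 = 2$, $v_3 = 6$, scalar output). The uniform random weights and the names `group_sort`, `affine`, and `discriminator` are assumptions for illustration, and no Lipschitz constraint on the weights is enforced here.

```python
import random

def group_sort(v):
    """GroupSort with groups of size 2: each pair (a, b) becomes (max, min)."""
    if len(v) % 2 != 0:
        raise ValueError("GroupSort with grouping size 2 needs an even length")
    out = []
    for i in range(0, len(v), 2):
        a, b = v[i], v[i + 1]
        out.extend([max(a, b), min(a, b)])
    return out

def affine(V, c, x):
    """Return V x + c for a matrix V (list of rows) and vectors c, x."""
    return [sum(v_ij * x_j for v_ij, x_j in zip(row, x)) + c_i
            for row, c_i in zip(V, c)]

def discriminator(params, x):
    """D_alpha(x): GroupSort after every layer except the last, which is affine."""
    h = x
    for V, c in params[:-1]:
        h = group_sort(affine(V, c, h))
    V_q, c_q = params[-1]
    return affine(V_q, c_q, h)[0]   # the last layer has a single output

rng = random.Random(1)
sizes = [6, 4, 2, 6, 1]             # D, v_1, v_2, v_3, output (so q = 4)
alpha = []
for n_in, n_out in zip(sizes[:-1], sizes[1:]):
    V = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    c = [rng.uniform(-1.0, 1.0) for _ in range(n_out)]
    alpha.append((V, c))

x = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5]
score = discriminator(alpha, x)
```

Since GroupSort only permutes coordinates within each pair, it preserves the multiset of its inputs: `group_sort([1, 3, 2, 0])` returns `[3, 1, 2, 0]`.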
Adversarial principle
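• Objective: solve
$$\inf_{\theta \in \Theta} \sup_{\alpha \in \Lambda} \Big[ \mathbb{E}\log(D_\alpha(X)) + \mathbb{E}\log(1 - D_\alpha(G_\theta(Z))) \Big].$$
• Empirical version:
$$\inf_{\theta \in \Theta} \sup_{\alpha \in \Lambda} \Big[ \frac{1}{n} \sum_{i=1}^{n} \log(D_\alpha(X_i)) + \mathbb{E}\log(1 - D_\alpha(G_\theta(Z))) \Big].$$
• Generative principle: $\hat\theta_n \to G_{\hat\theta_n} \to G_{\hat\theta_n}(Z_1), G_{\hat\theta_n}(Z_2), \dots \to$ new images
• The min-max optimum is found by stochastic gradient descent.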
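To make the empirical criterion concrete, here is a minimal sketch that evaluates it for a fixed pair $(D_\alpha, G_\theta)$ on the real line. The logistic discriminator, the location generator $G_\theta(z) = z + \theta$, the Gaussian toy data, and the Monte Carlo approximation of the expectation over $Z$ are all assumptions chosen for the example; the slides do not prescribe them.

```python
import math
import random

def gan_objective(disc, gen, xs, zs):
    """Empirical GAN criterion for a fixed pair (D_alpha, G_theta).

    First term: average of log D(X_i) over the data.
    Second term: Monte Carlo estimate of E log(1 - D(G(Z))) from the draws zs.
    """
    real_term = sum(math.log(disc(x)) for x in xs) / len(xs)
    fake_term = sum(math.log(1.0 - disc(gen(z))) for z in zs) / len(zs)
    return real_term + fake_term

def make_disc(a, b):
    """Hypothetical logistic discriminator D_alpha(x) = 1 / (1 + exp(-(a x + b)))."""
    return lambda x: 1.0 / (1.0 + math.exp(-(a * x + b)))

def make_gen(theta):
    """Hypothetical location generator G_theta(z) = z + theta."""
    return lambda z: z + theta

rng = random.Random(0)
xs = [rng.gauss(2.0, 1.0) for _ in range(500)]   # stand-in for X_1, ..., X_n
zs = [rng.gauss(0.0, 1.0) for _ in range(500)]   # latent draws Z_1, ..., Z_m

# A discriminator that is constant at 1/2 gives -ln 4, whatever the generator.
blind = gan_objective(lambda x: 0.5, make_gen(0.0), xs, zs)

# The same informative discriminator, against a generator far from the data
# and against one that matches the data location.
far = gan_objective(make_disc(2.0, -2.0), make_gen(0.0), xs, zs)
near = gan_objective(make_disc(2.0, -2.0), make_gen(2.0), xs, zs)
```

On this toy problem the criterion is larger for the badly placed generator (`far`) than for the well placed one (`near`), which is the quantity the outer infimum over $\theta$ tries to drive down.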
From GANs to WGANs
• Biau, Cadre, Sangnier, and Tanielian (AoS, 2020):
  . Basic properties of GANs.
  . Impact of the discriminator on the quality of the approximation.
  . Statistical consistency, rates of convergence, central limit theorem.
  . Play with simple examples.
• However...
  . The training process of GANs is unstable.
  . Mode collapse phenomenon.
  . Arjovsky, Chintala, and Bottou (2017) → Wasserstein GANs.
  . WGANs have become a standard in machine learning.
• Our goal: make some theoretical advances in WGANs.
Wasserstein GANs
At the origins of WGANs
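• Reminder: for $\mu$ and $\nu$ probability measures on $E$,
$$D_{\mathrm{JS}}(\mu, \nu) = \frac{1}{2} D_{\mathrm{KL}}\Big(\mu \,\Big\|\, \frac{\mu + \nu}{2}\Big) + \frac{1}{2} D_{\mathrm{KL}}\Big(\nu \,\Big\|\, \frac{\mu + \nu}{2}\Big).$$
• Idealization: $\mathscr{D} = \mathscr{D}_\infty$, the set of all functions from $E$ to $[0,1]$.
• In this case, for each $\theta \in \Theta$:
$$\sup_{D \in \mathscr{D}_\infty} \Big[ \mathbb{E}\log(D(X)) + \mathbb{E}\log(1 - D(G_\theta(Z))) \Big] = 2 D_{\mathrm{JS}}(\mu^\star, \mu_\theta) - \ln 4.$$
• Consequence:
$$\inf_{\theta \in \Theta} \sup_{D \in \mathscr{D}_\infty} \Big[ \mathbb{E}\log(D(X)) + \mathbb{E}\log(1 - D(G_\theta(Z))) \Big] = 2 \inf_{\theta \in \Theta} D_{\mathrm{JS}}(\mu^\star, \mu_\theta) - \ln 4.$$
• In practice, one always has $\mathscr{D} = \{D_\alpha : \alpha \in \Lambda\}$:
$$\inf_{\theta \in \Theta} \sup_{\alpha \in \Lambda} \Big[ \mathbb{E}\log(D_\alpha(X)) + \mathbb{E}\log(1 - D_\alpha(G_\theta(Z))) \Big] = \ ??$$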
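For readers who want the step behind the identity $\sup_{D \in \mathscr{D}_\infty} [\,\cdot\,] = 2 D_{\mathrm{JS}}(\mu^\star, \mu_\theta) - \ln 4$, the standard argument can be sketched as follows. It assumes that $\mu^\star$ and $\mu_\theta$ have densities $p^\star$ and $p_\theta$ with respect to a common dominating measure $\lambda$ (notation introduced here, not in the slides), and uses the convention $0 \log 0 = 0$.

```latex
% Requires amsmath (align*, \intertext) and mathrsfs (\mathscr).
% Assumption: p^\star and p_\theta are densities of \mu^\star and \mu_\theta
% with respect to a common dominating measure \lambda.
\begin{align*}
L(\theta, D)
  &= \mathbb{E}\log D(X) + \mathbb{E}\log\bigl(1 - D(G_\theta(Z))\bigr)
   = \int \bigl[\, p^\star \log D + p_\theta \log(1 - D) \,\bigr] \, d\lambda .
\intertext{For $a, b \ge 0$ with $a + b > 0$, the map $t \mapsto a \log t + b \log(1 - t)$
is maximal on $[0, 1]$ at $t = a / (a + b)$, so the supremum is attained at}
D^\star_\theta &= \frac{p^\star}{p^\star + p_\theta} .
\intertext{Substituting, and writing $m = (p^\star + p_\theta)/2$ for the density of
$(\mu^\star + \mu_\theta)/2$:}
\sup_{D \in \mathscr{D}_\infty} L(\theta, D)
  &= \int p^\star \log \frac{p^\star}{2m} \, d\lambda
   + \int p_\theta \log \frac{p_\theta}{2m} \, d\lambda \\
  &= D_{\mathrm{KL}}\Bigl(\mu^\star \,\Big\|\, \tfrac{\mu^\star + \mu_\theta}{2}\Bigr)
   + D_{\mathrm{KL}}\Bigl(\mu_\theta \,\Big\|\, \tfrac{\mu^\star + \mu_\theta}{2}\Bigr)
   - 2 \ln 2 \\
  &= 2 D_{\mathrm{JS}}(\mu^\star, \mu_\theta) - \ln 4 .
\end{align*}
```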
General principle of WGANs
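• Reminder: for $\mu$ and $\nu$ probability measures in $P_1(E)$,
$$W_1(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int_{E \times E} \|x - y\| \, \pi(dx, dy).$$
• Dual form:
$$W_1(\mu, \nu) = \sup_{f \in \mathrm{Lip}_1} |\mathbb{E}_\mu f - \mathbb{E}_\nu f|.$$
• T-WGANs:
$$\inf_{\theta \in \Theta} \sup_{f \in \mathrm{Lip}_1} |\mathbb{E}_{\mu^\star} f - \mathbb{E}_{\mu_\theta} f| = \inf_{\theta \in \Theta} W_1(\mu^\star, \mu_\theta).$$
• WGANs: in practice, one always has $\mathscr{D} = \{D_\alpha : \alpha \in \Lambda\}$:
$$\inf_{\theta \in \Theta} \sup_{\alpha \in \Lambda} |\mathbb{E}_{\mu^\star} D_\alpha - \mathbb{E}_{\mu_\theta} D_\alpha| = \ ??$$
• Empirical version:
$$\inf_{\theta \in \Theta} \sup_{\alpha \in \Lambda} \Big[ \frac{1}{n} \sum_{i=1}^{n} D_\alpha(X_i) - \mathbb{E} D_\alpha(G_\theta(Z)) \Big].$$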
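The primal and dual forms of $W_1$ can be checked by hand in a very simple setting. The sketch below assumes two empirical measures on the real line with the same number of atoms, for which an optimal coupling pairs the order statistics. The function names and the toy samples are illustrative choices, not taken from the slides.

```python
def wasserstein1_1d(xs, ys):
    """W_1 between two empirical measures on R with equal sample sizes.

    On the real line an optimal coupling pairs the order statistics, so
    W_1 = (1/n) * sum_i |x_(i) - y_(i)|.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

def dual_gap(f, xs, ys):
    """|E_mu f - E_nu f| for empirical measures mu, nu and a test function f."""
    return abs(sum(map(f, xs)) / len(xs) - sum(map(f, ys)) / len(ys))

xs = [0.0, 1.0, 2.0]
ys = [1.0, 2.0, 3.0]          # xs shifted by 1

w1 = wasserstein1_1d(xs, ys)

# f(x) = x is 1-Lipschitz and attains the supremum for a pure shift.
gap_identity = dual_gap(lambda x: x, xs, ys)

# Another 1-Lipschitz function only gives a lower bound on W_1.
gap_abs = dual_gap(lambda x: abs(x - 1.5), xs, ys)
```

Here `w1` and `gap_identity` both equal 1.0, while `gap_abs` is 0.0: any single 1-Lipschitz test function bounds $W_1$ from below, and only the supremum over all of $\mathrm{Lip}_1$ recovers it.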
Notation
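• For $\mathscr{D} \subseteq \mathrm{Lip}_1$, the Integral Probability Metric $d_{\mathscr{D}}$ is
$$d_{\mathscr{D}}(\mu, \nu) = \sup_{f \in \mathscr{D}} |\mathbb{E}_\mu f - \mathbb{E}_\nu f|.$$
• Unified notation:
  T-WGANs: $\inf_{\theta \in \Theta} d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta)$ and $\Theta^\star = \arg\min_{\theta \in \Theta} d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta)$
  WGANs: $\inf_{\theta \in \Theta} d_{\mathscr{D}}(\mu^\star, \mu_\theta)$ and $\bar\Theta = \arg\min_{\theta \in \Theta} d_{\mathscr{D}}(\mu^\star, \mu_\theta)$
  Empirical WGANs: $\inf_{\theta \in \Theta} d_{\mathscr{D}}(\mu_n, \mu_\theta)$ and $\hat\Theta_n = \arg\min_{\theta \in \Theta} d_{\mathscr{D}}(\mu_n, \mu_\theta)$, where $\mu_n$ is the empirical measure of $X_1, \dots, X_n$.
• Properties of $d_{\mathrm{Lip}_1}$ are well known. The properties of $d_{\mathscr{D}}$ are much less understood.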
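One way in which $d_{\mathscr{D}}$ can behave differently from $d_{\mathrm{Lip}_1}$ is that a small class $\mathscr{D}$ may fail to tell two distinct measures apart. The toy sketch below illustrates this with two hypothetical finite classes of 1-Lipschitz functions on the real line (`D_small` inside `D_large`) and two empirical measures sharing the same mean. Everything in it is an assumption made for illustration.

```python
def ipm(test_functions, xs, ys):
    """d_D(mu, nu): max over a finite class D of |E_mu f - E_nu f|."""
    def mean(f, sample):
        return sum(f(s) for s in sample) / len(sample)
    return max(abs(mean(f, xs) - mean(f, ys)) for f in test_functions)

# Two hypothetical classes of 1-Lipschitz functions, D_small inside D_large.
D_small = [lambda x: x]
D_large = D_small + [lambda x, c=c: abs(x - c) for c in (-1.0, 0.0, 1.0)]

xs = [-1.0, 1.0]      # mean 0, spread out
ys = [0.0, 0.0]       # mean 0, concentrated at 0

d_small = ipm(D_small, xs, ys)
d_large = ipm(D_large, xs, ys)

# W_1 for equal-size samples on R: pair the sorted values.
w1 = sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)
```

On these samples `d_small` is 0.0 although the two measures differ, whereas `d_large` and `w1` are both 1.0, in line with the ordering $d_{\mathscr{D}_{\text{small}}} \le d_{\mathscr{D}_{\text{large}}} \le d_{\mathrm{Lip}_1}$ that holds for nested classes.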