Original paper

(1)

Some theoretical insights into WGANs

Gérard Biau GdR ISIS, June 2021

(2)

Team

Benoît Cadre University Rennes 2

Maxime Sangnier Sorbonne University

Ugo Tanielian Criteo AI Lab

(3)

Source: https://towardsdatascience.com/how-i-got-a-computer-to-make-fake-people-using-ai-gans-a8e2f542e992

(4)

Source: https://www.whichfaceisreal.com

(5)

Source: https://en.wikipedia.org/wiki/Edmond_de_Belamy


(6)

Source: https://vue.ai


(7)

Original paper

Generative Adversarial Nets

Ian J. Goodfellow∗, Jean Pouget-Abadie†, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair‡, Aaron Courville, Yoshua Bengio§

Département d'informatique et de recherche opérationnelle, Université de Montréal, Montréal, QC H3C 3J7

Abstract

We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.

1 Introduction

The promise of deep learning is to discover rich, hierarchical models [2] that represent probability distributions over the kinds of data encountered in artificial intelligence applications, such as natural images, audio waveforms containing speech, and symbols in natural language corpora. So far, the most striking successes in deep learning have involved discriminative models, usually those that map a high-dimensional, rich sensory input to a class label [14, 20]. These striking successes have primarily been based on the backpropagation and dropout algorithms, using piecewise linear units [17, 8, 9] which have a particularly well-behaved gradient. Deep generative models have had less of an impact, due to the difficulty of approximating many intractable probabilistic computations that arise in maximum likelihood estimation and related strategies, and due to difficulty of leveraging the benefits of piecewise linear units in the generative context. We propose a new generative model estimation procedure that sidesteps these difficulties.¹

In the proposed adversarial nets framework, the generative model is pitted against an adversary: a discriminative model that learns to determine whether a sample is from the model distribution or the data distribution. The generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency. Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles.

∗ Ian Goodfellow is now a research scientist at Google, but did this work earlier as a UdeM student.

† Jean Pouget-Abadie did this work while visiting Université de Montréal from Ecole Polytechnique.

‡ Sherjil Ozair is visiting Université de Montréal from Indian Institute of Technology Delhi.

§ Yoshua Bengio is a CIFAR Senior Fellow.

¹ All code and hyperparameters available at http://www.github.com/goodfeli/adversarial

(8)

Source: https://www.oreilly.com


(9)

Outline

1. Mathematical context

2. Wasserstein GANs

3. Optimization properties

4. GroupSort neural networks

5. Asymptotic properties


(10)

Mathematical context

(11)

Objective

• Target: probability measure $\mu^\star$ on $E \subseteq \mathbb{R}^D$.

• Goal: generate according to $\mu^\star$.

• Data: $X_1, \dots, X_n$ i.i.d. as $\mu^\star$.

Source: Shao et al. (2018). The Riemannian geometry of deep generative models, CVPR 2018.


(15)

The generator

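• A parametric family of functions from $\mathbb{R}^d$ to $E$.

• Notation: $\mathcal{G} = \{G_\theta : \theta \in \Theta\}$, $\Theta \subset \mathbb{R}^P$.

• Principle: $\Omega \xrightarrow{\;Z\;} \mathbb{R}^d \xrightarrow{\;G_\theta\;} E$.

• Definition: $G_\theta(Z) \overset{\mathcal{L}}{\sim} \mu_\theta$.

• Associated family of distributions: $\mathcal{P} = \{\mu_\theta : \theta \in \Theta\}$.

• Each $\mu_\theta$ is a candidate to represent $\mu^\star$.

• $Z$ is typically uniform or Gaussian, with $d$ small.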


(22)

A specific framework

• In GAN algorithms, each $G_\theta$ is a neural network.

• Bad ideas:

  . Exhaustive description of $\mu^\star$ by a classical parametric model.

  . Estimation by maximum likelihood.

  . A strategy based on nonparametric density estimation.

• It is not assumed that $\mu^\star$ belongs to $\mathcal{P}$.

• Forget classical rules.


(29)

Generator’s architecture


(30)

Generator’s architecture

ReLU neural networks

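$$G_\theta(z) = \underbrace{U_p}_{D \times u_{p-1}} \sigma\Big( \underbrace{U_{p-1}}_{u_{p-1} \times u_{p-2}} \cdots \sigma\big( \underbrace{U_2}_{u_2 \times u_1} \sigma( \underbrace{U_1}_{u_1 \times d} z + \underbrace{b_1}_{u_1 \times 1} ) + \underbrace{b_2}_{u_2 \times 1} \big) \cdots + \underbrace{b_{p-1}}_{u_{p-1} \times 1} \Big) + \underbrace{b_p}_{D \times 1}$$

ReLU activation: $\sigma(x) = \max(x, 0)$, applied componentwise.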
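The feed-forward formula for $G_\theta$ can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper names `init_generator` and `generator`, the Gaussian weight initialization, and the Gaussian latent draw are hypothetical choices, not part of the slides. The layer widths are those of the "Generator with p = 4" slide (d = 3, u1 = 5, u2 = 4, u3 = 6, D = 8).

```python
import numpy as np

def relu(x):
    """ReLU activation sigma(x) = max(x, 0), applied componentwise."""
    return np.maximum(x, 0.0)

def init_generator(d, hidden_widths, D, rng):
    """Draw weights U_1..U_p and biases b_1..b_p (illustrative random init)."""
    sizes = [d, *hidden_widths, D]
    return [
        (rng.normal(size=(n_out, n_in)) / np.sqrt(n_in), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

def generator(z, params):
    """G_theta(z) = U_p relu(... relu(U_1 z + b_1) ...) + b_p."""
    h = z
    for U, b in params[:-1]:
        h = relu(U @ h + b)
    U_p, b_p = params[-1]
    return U_p @ h + b_p  # the output layer is affine, with no activation

rng = np.random.default_rng(0)
params = init_generator(d=3, hidden_widths=[5, 4, 6], D=8, rng=rng)
z = rng.normal(size=3)         # one latent draw Z (Gaussian here, by assumption)
x_fake = generator(z, params)  # one draw from mu_theta, a vector in R^8
```

Here `params` plays the role of $\theta$: it collects the p = 4 weight matrices and bias vectors.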


(31)

Generator with p = 4

[Figure: generator network with p = 4. Inputs z⁽¹⁾, z⁽²⁾, z⁽³⁾ (d = 3); hidden layer 1 (u1 = 5), hidden layer 2 (u2 = 4), hidden layer 3 (u3 = 6); output dimension D = 8.]

(32)

The discriminator

• Discriminator: a parametric family of functions from $E$ to $[0,1]$.

• Notation: $\mathcal{D} = \{D_\alpha : \alpha \in \Lambda\}$, $\Lambda \subseteq \mathbb{R}^Q$.

• In GAN algorithms, each $D_\alpha$ is a neural network.

• The higher $D(x)$, the higher the probability that $x$ is drawn from $\mu^\star$.

• The generator and the discriminator have opposite objectives.

Source: https://www.wikihow.com


(37)

Discriminator’s architecture


(38)

Discriminator’s architecture

GroupSort neural networks

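$$D_\alpha(x) = \underbrace{V_q}_{1 \times v_{q-1}} \tilde\sigma\Big( \underbrace{V_{q-1}}_{v_{q-1} \times v_{q-2}} \cdots \tilde\sigma\big( \underbrace{V_2}_{v_2 \times v_1} \tilde\sigma( \underbrace{V_1}_{v_1 \times D} x + \underbrace{c_1}_{v_1 \times 1} ) + \underbrace{c_2}_{v_2 \times 1} \big) + \cdots + \underbrace{c_{q-1}}_{v_{q-1} \times 1} \Big) + \underbrace{c_q}_{1 \times 1}$$

GroupSort activation:

$$\tilde\sigma(x_1, x_2, \dots, x_{2n-1}, x_{2n}) = \big(\max(x_1, x_2), \min(x_1, x_2), \dots, \max(x_{2n-1}, x_{2n}), \min(x_{2n-1}, x_{2n})\big)$$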
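The GroupSort activation with groups of size 2, as written above, sorts each consecutive pair of coordinates in decreasing order. A minimal NumPy sketch follows; the function name `groupsort2` is a hypothetical choice made for this illustration.

```python
import numpy as np

def groupsort2(x):
    """GroupSort with groups of size 2: (max, min) of each consecutive pair."""
    x = np.asarray(x, dtype=float)
    if x.shape[-1] % 2 != 0:
        raise ValueError("GroupSort with groups of size 2 needs an even width")
    pairs = x.reshape(*x.shape[:-1], -1, 2)  # (..., n, 2): one row per pair
    hi = pairs.max(axis=-1)
    lo = pairs.min(axis=-1)
    # Interleave as (max_1, min_1, max_2, min_2, ...) and restore the shape.
    return np.stack([hi, lo], axis=-1).reshape(x.shape)

x = np.array([1.0, 3.0, -2.0, -5.0, 4.0, 4.0])
out = groupsort2(x)  # array([ 3.,  1., -2., -5.,  4.,  4.])
```

The activation only permutes the entries of its input within each pair, so the multiset of values is unchanged.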


(39)

Discriminator with q = 4

[Figure: discriminator network with q = 4. Inputs x⁽¹⁾, ..., x⁽⁶⁾ (D = 6); hidden layer 1 (v1 = 4), hidden layer 2 (v2 = 2), hidden layer 3 (v3 = 6); scalar output.]

(40)

Adversarial principle

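• Objective: solve

$$\inf_{\theta \in \Theta} \sup_{\alpha \in \Lambda} \Big[ \mathbb{E} \log(D_\alpha(X)) + \mathbb{E} \log(1 - D_\alpha(G_\theta(Z))) \Big].$$

• Empirical version:

$$\inf_{\theta \in \Theta} \sup_{\alpha \in \Lambda} \Big[ \frac{1}{n} \sum_{i=1}^{n} \log(D_\alpha(X_i)) + \mathbb{E} \log(1 - D_\alpha(G_\theta(Z))) \Big].$$

• Generative principle: $\hat\theta_n \to G_{\hat\theta_n} \to G_{\hat\theta_n}(Z_1), G_{\hat\theta_n}(Z_2), \dots \to$ new images.

• The min-max optimum is found by stochastic gradient descent.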
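To make the empirical criterion concrete, the sketch below evaluates it for fixed $\theta$ and $\alpha$ in a toy one-dimensional setting. Everything specific here is an assumption made for illustration: the data are Gaussian, the generator is the shift $G_\theta(z) = \theta + z$, the discriminator is logistic, and the expectation over $Z$ is replaced by a Monte Carlo average. The inf-sup optimization itself is not performed.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def discriminator(x, alpha):
    """Toy logistic discriminator D_alpha(x) with values in (0, 1)."""
    a0, a1 = alpha
    return sigmoid(a0 + a1 * x)

def generator(z, theta):
    """Toy generator G_theta(z) = theta + z (a location shift)."""
    return theta + z

def empirical_gan_criterion(theta, alpha, x_data, z_draws):
    """(1/n) sum_i log D(X_i) + Monte Carlo estimate of E log(1 - D(G(Z)))."""
    real_term = np.mean(np.log(discriminator(x_data, alpha)))
    fake = generator(z_draws, theta)
    fake_term = np.mean(np.log(1.0 - discriminator(fake, alpha)))
    return real_term + fake_term

rng = np.random.default_rng(0)
x_data = rng.normal(loc=2.0, scale=1.0, size=500)  # stands for X_1, ..., X_n
z_draws = rng.normal(size=5000)                    # latent draws Z

# A discriminator that always answers 1/2 gives exactly -ln 4.
value_blind = empirical_gan_criterion(0.0, (0.0, 0.0), x_data, z_draws)
# A discriminator that separates the two clouds scores higher.
value_informed = empirical_gan_criterion(0.0, (-2.0, 2.0), x_data, z_draws)
```

In training, the discriminator parameters would be pushed to increase this value and the generator parameter to decrease it.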


(48)

From GANs to WGANs

• Biau, Cadre, Sangnier, and Tanielian (AoS, 2020):

  . Basic properties of GANs.

  . Impact of the discriminator on the quality of the approximation.

  . Statistical consistency, rates of convergence, central limit theorem.

  . Play with simple examples.

• However...

  . The training process of GANs is unstable.

  . Mode collapse phenomenon.

  . Arjovsky, Chintala, and Bottou (2017) → Wasserstein GANs.

  . WGANs have become a standard in machine learning.

• Our goal: make some theoretical advances in WGANs.


(60)

Wasserstein GANs

(61)

At the origins of WGANs

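• Reminder: for $\mu$ and $\nu$ probability measures on $E$,

$$D_{\mathrm{JS}}(\mu, \nu) = \frac{1}{2} D_{\mathrm{KL}}\Big(\mu \,\Big\|\, \frac{\mu + \nu}{2}\Big) + \frac{1}{2} D_{\mathrm{KL}}\Big(\nu \,\Big\|\, \frac{\mu + \nu}{2}\Big).$$

• Idealization: $\mathcal{D} = \mathcal{D}_\infty$, the set of all functions from $E$ to $[0,1]$.

• In this case, for each $\theta \in \Theta$:

$$\sup_{D \in \mathcal{D}_\infty} \Big[ \mathbb{E} \log(D(X)) + \mathbb{E} \log(1 - D(G_\theta(Z))) \Big] = 2 D_{\mathrm{JS}}(\mu^\star, \mu_\theta) - \ln 4.$$

• Consequence:

$$\inf_{\theta \in \Theta} \sup_{D \in \mathcal{D}_\infty} \Big[ \mathbb{E} \log(D(X)) + \mathbb{E} \log(1 - D(G_\theta(Z))) \Big] = 2 \inf_{\theta \in \Theta} D_{\mathrm{JS}}(\mu^\star, \mu_\theta) - \ln 4.$$

• In practice, one always has $\mathcal{D} = \{D_\alpha : \alpha \in \Lambda\}$:

$$\inf_{\theta \in \Theta} \sup_{\alpha \in \Lambda} \Big[ \mathbb{E} \log(D_\alpha(X)) + \mathbb{E} \log(1 - D_\alpha(G_\theta(Z))) \Big] = \; ??$$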
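The identity linking the idealized GAN criterion to the Jensen-Shannon divergence can be checked numerically on a finite space. The sketch below assumes two discrete distributions with full support, `p` standing for $\mu^\star$ and `q` for $\mu_\theta$ (both chosen arbitrarily for illustration), and uses the pointwise maximizer $D = p/(p+q)$ from Goodfellow et al. (2014).

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence between discrete distributions."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    """Jensen-Shannon divergence, as defined in the reminder."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def gan_value(p, q, d):
    """E_p log D + E_q log(1 - D) for a discriminator given as a vector d."""
    return np.sum(p * np.log(d)) + np.sum(q * np.log(1.0 - d))

p = np.array([0.5, 0.3, 0.2])  # stands for mu_star (illustrative values)
q = np.array([0.2, 0.3, 0.5])  # stands for mu_theta (illustrative values)

d_opt = p / (p + q)            # maximizer over all functions into [0, 1]
lhs = gan_value(p, q, d_opt)
rhs = 2.0 * js(p, q) - np.log(4.0)

# Any other discriminator gives a smaller value, e.g. the constant 1/2.
value_half = gan_value(p, q, np.full(3, 0.5))
```

With these inputs `lhs` and `rhs` agree up to floating-point error, and `value_half` equals $-\ln 4$, which is below `lhs` because the two distributions differ.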


(71)

General principle of WGANs

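• Reminder: for $\mu$ and $\nu$ probability measures in $P_1(E)$,

$$W_1(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int_{E \times E} \|x - y\| \, \pi(dx, dy).$$

• Dual form:

$$W_1(\mu, \nu) = \sup_{f \in \mathrm{Lip}_1} |\mathbb{E}_\mu f - \mathbb{E}_\nu f|.$$

• T-WGANs:

$$\inf_{\theta \in \Theta} \sup_{f \in \mathrm{Lip}_1} |\mathbb{E}_{\mu^\star} f - \mathbb{E}_{\mu_\theta} f| = \inf_{\theta \in \Theta} W_1(\mu^\star, \mu_\theta).$$

• WGANs: in practice, one always has $\mathcal{D} = \{D_\alpha : \alpha \in \Lambda\}$:

$$\inf_{\theta \in \Theta} \sup_{\alpha \in \Lambda} |\mathbb{E}_{\mu^\star} D_\alpha - \mathbb{E}_{\mu_\theta} D_\alpha| = \; ??$$

Empirical version:

$$\inf_{\theta \in \Theta} \sup_{\alpha \in \Lambda} \Big[ \frac{1}{n} \sum_{i=1}^{n} D_\alpha(X_i) - \mathbb{E} D_\alpha(G_\theta(Z)) \Big].$$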
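On the real line, the primal definition of $W_1$ has a closed form for two empirical measures with the same number of atoms: the optimal coupling matches sorted samples. The sketch below uses this fact and compares the result with the dual lower bound obtained from the single 1-Lipschitz function $f(x) = x$. The helper name `w1_empirical_1d` and the sample choices are assumptions for illustration only.

```python
import numpy as np

def w1_empirical_1d(x, y):
    """W1 between two equal-size empirical measures on the real line."""
    x, y = np.sort(x), np.sort(y)
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return np.mean(np.abs(x - y))  # cost of the sorted (monotone) coupling

rng = np.random.default_rng(0)
x = rng.normal(size=1000)

# Case 1: pure translation. W1 equals the shift, and f(x) = x attains it.
y_shift = x + 1.5
w1_shift = w1_empirical_1d(x, y_shift)
bound_shift = abs(np.mean(x) - np.mean(y_shift))

# Case 2: pure rescaling. f(x) = x only gives a loose lower bound.
y_scale = 2.0 * x
w1_scale = w1_empirical_1d(x, y_scale)
bound_scale = abs(np.mean(x) - np.mean(y_scale))
```

In the second case the bound from one test function is far from $W_1$, which is why the supremum in the dual form ranges over all of $\mathrm{Lip}_1$.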


(80)

Notation

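• For $\mathcal{D} \subseteq \mathrm{Lip}_1$, the Integral Probability Metric $d_{\mathcal{D}}$ is

$$d_{\mathcal{D}}(\mu, \nu) = \sup_{f \in \mathcal{D}} |\mathbb{E}_\mu f - \mathbb{E}_\nu f|.$$

• Unified notation:

  T-WGANs: $\inf_{\theta \in \Theta} d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta)$ and $\Theta^\star = \arg\min_{\theta \in \Theta} d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta)$

  WGANs: $\inf_{\theta \in \Theta} d_{\mathcal{D}}(\mu^\star, \mu_\theta)$ and $\bar\Theta = \arg\min_{\theta \in \Theta} d_{\mathcal{D}}(\mu^\star, \mu_\theta)$

  Empirical WGANs: $\inf_{\theta \in \Theta} d_{\mathcal{D}}(\mu_n, \mu_\theta)$ and $\hat\Theta_n = \arg\min_{\theta \in \Theta} d_{\mathcal{D}}(\mu_n, \mu_\theta)$.

• Properties of $d_{\mathrm{Lip}_1}$ are well known. This is different for $d_{\mathcal{D}}$.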
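Since $\mathcal{D} \subseteq \mathrm{Lip}_1$, the definition gives $d_{\mathcal{D}} \le d_{\mathrm{Lip}_1}$ directly. The sketch below illustrates this on two one-dimensional samples, with a deliberately small class chosen only for this example: $\mathcal{D} = \{t \mapsto |t - a|\}$ for a grid of centers $a$ (a hypothetical class, not the GroupSort networks of the talk). The value of $d_{\mathrm{Lip}_1}$ is computed with the sorted-sample formula for equal-size empirical measures.

```python
import numpy as np

def w1_empirical_1d(x, y):
    """d_Lip1 (= W1) between two equal-size empirical measures on the line."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

def ipm_finite_class(x, y, centers):
    """d_D for the finite class D = {t -> |t - a| : a in centers}."""
    gaps = [
        abs(np.mean(np.abs(x - a)) - np.mean(np.abs(y - a)))
        for a in centers
    ]
    return max(gaps)

rng = np.random.default_rng(0)
x = rng.normal(size=2000)                      # sample standing for mu
y = rng.normal(loc=0.5, scale=1.5, size=2000)  # sample standing for nu

centers = np.linspace(-3.0, 3.0, 13)  # illustrative grid of centers
d_restricted = ipm_finite_class(x, y, centers)
d_lip1 = w1_empirical_1d(x, y)
```

Each function in the class is 1-Lipschitz, so `d_restricted` cannot exceed `d_lip1`; how large the gap is depends on how rich the class is.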
