
Some theoretical insights into WGANs

Gérard Biau GdR ISIS, June 2021


Team

Benoît Cadre University Rennes 2

Maxime Sangnier Sorbonne University

Ugo Tanielian Criteo AI Lab

Source: https://towardsdatascience.com/how-i-got-a-computer-to-make-fake-people-using-ai-gans-a8e2f542e992

Source: https://www.whichfaceisreal.com

Source: https://en.wikipedia.org/wiki/Edmond_de_Belamy

Source: https://vue.ai

Original paper

Generative Adversarial Nets

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio§

Département d'informatique et de recherche opérationnelle

Université de Montréal, Montréal, QC H3C 3J7

Abstract

We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.

1 Introduction

The promise of deep learning is to discover rich, hierarchical models [2] that represent probability distributions over the kinds of data encountered in artificial intelligence applications, such as natural images, audio waveforms containing speech, and symbols in natural language corpora. So far, the most striking successes in deep learning have involved discriminative models, usually those that map a high-dimensional, rich sensory input to a class label [14, 20]. These striking successes have primarily been based on the backpropagation and dropout algorithms, using piecewise linear units [17, 8, 9] which have a particularly well-behaved gradient. Deep generative models have had less of an impact, due to the difficulty of approximating many intractable probabilistic computations that arise in maximum likelihood estimation and related strategies, and due to difficulty of leveraging the benefits of piecewise linear units in the generative context. We propose a new generative model estimation procedure that sidesteps these difficulties.¹

In the proposed adversarial nets framework, the generative model is pitted against an adversary: a discriminative model that learns to determine whether a sample is from the model distribution or the data distribution. The generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency. Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles.

Ian Goodfellow is now a research scientist at Google, but did this work earlier as a UdeM student.

Jean Pouget-Abadie did this work while visiting Université de Montréal from Ecole Polytechnique.

Sherjil Ozair is visiting Université de Montréal from Indian Institute of Technology Delhi.

§ Yoshua Bengio is a CIFAR Senior Fellow.

¹ All code and hyperparameters available at http://www.github.com/goodfeli/adversarial

Source: https://www.oreilly.com


Outline

1. Mathematical context

2. Wasserstein GANs

3. Optimization properties

4. GroupSort neural networks

5. Asymptotic properties


Mathematical context


Objective

• Target: probability measure $\mu^\star$ on $E \subseteq \mathbb{R}^D$.

• Goal: generate according to $\mu^\star$.

• Data: $X_1, \ldots, X_n$ i.i.d. as $\mu^\star$.

Source: Shao et al. (2018). The Riemannian geometry of deep generative models, CVPR 2018.


The generator

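• A parametric family of functions from $\mathbb{R}^d$ to $E$.

• Notation: $\mathscr{G} = \{G_\theta : \theta \in \Theta\}$, $\Theta \subset \mathbb{R}^P$.

• Principle: $\Omega \xrightarrow{Z} \mathbb{R}^d \xrightarrow{G_\theta} E$.

• Definition: $G_\theta(Z) \overset{\mathscr{L}}{\sim} \mu_\theta$.

• Associated family of distributions: $\mathscr{P} = \{\mu_\theta : \theta \in \Theta\}$.

• Each $\mu_\theta$ is a candidate to represent $\mu^\star$.

• $Z$ is typically uniform or Gaussian, with $d$ small.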


A specific framework

• In GAN algorithms, each $G_\theta$ is a neural network.

• Bad ideas:

. Exhaustive description of $\mu^\star$ by a classical parametric model.

. Estimation by maximum likelihood.

. A strategy based on nonparametric density estimation.

• It is not assumed that $\mu^\star$ belongs to $\mathscr{P}$.

• Forget classical rules.


Generator’s architecture


ReLU neural networks

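$$G_\theta(z) = \underbrace{U_p}_{D \times u_{p-1}} \sigma\Big( \underbrace{U_{p-1}}_{u_{p-1} \times u_{p-2}} \cdots \sigma\big( \underbrace{U_2}_{u_2 \times u_1} \sigma( \underbrace{U_1}_{u_1 \times d} z + \underbrace{b_1}_{u_1 \times 1} ) + \underbrace{b_2}_{u_2 \times 1} \big) \cdots + \underbrace{b_{p-1}}_{u_{p-1} \times 1} \Big) + \underbrace{b_p}_{D \times 1}$$

ReLU activation: $\sigma(x) = \max(x, 0)$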
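As a rough illustration of the formula above, the following sketch evaluates a ReLU generator with plain Python lists. The layer widths (d = 3, hidden widths 5, 4, 6, D = 8) follow the p = 4 example of the slides; the Gaussian weight initialisation, the seed, and the helper names (`relu`, `affine`, `init_params`, `generator`) are assumptions made for illustration, not part of the talk.

```python
import random

def relu(v):
    """Componentwise ReLU: sigma(x) = max(x, 0)."""
    return [max(x, 0.0) for x in v]

def affine(U, b, v):
    """Return U v + b for a matrix U (list of rows) and vectors b, v."""
    return [sum(u_ij * v_j for u_ij, v_j in zip(row, v)) + b_i
            for row, b_i in zip(U, b)]

def init_params(widths, rng):
    """Draw (U_k, b_k) for widths [d, u_1, ..., u_{p-1}, D] (hypothetical init)."""
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        U = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_out)]
        params.append((U, b))
    return params

def generator(params, z):
    """G_theta(z): ReLU after every layer except the last one."""
    h = z
    for U, b in params[:-1]:
        h = relu(affine(U, b, h))
    U_p, b_p = params[-1]
    return affine(U_p, b_p, h)

rng = random.Random(0)
theta = init_params([3, 5, 4, 6, 8], rng)    # d=3, u1=5, u2=4, u3=6, D=8
z = [rng.gauss(0.0, 1.0) for _ in range(3)]  # latent draw Z
x = generator(theta, z)                      # one sample from mu_theta
print(len(x))
```

Drawing fresh values of `z` and pushing them through `generator` yields independent samples from the distribution written $\mu_\theta$ in the slides.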


Generator with p = 4

[Figure: fully connected network with inputs z(1), z(2), z(3) (d = 3), three hidden layers of widths u1 = 5, u2 = 4, u3 = 6, and an output layer of dimension D = 8.]

The discriminator

• Discriminator: a parametric family of functions from $E$ to $[0,1]$.

• Notation: $\mathscr{D} = \{D_\alpha : \alpha \in \Lambda\}$, $\Lambda \subseteq \mathbb{R}^Q$.

• In GAN algorithms, each $D_\alpha$ is a neural network.

• The higher $D(x)$, the higher the probability that $x$ is drawn from $\mu^\star$.

• The generator and the discriminator have opposite objectives.

Source: https://www.wikihow.com


Discriminator’s architecture


GroupSort neural networks

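$$D_\alpha(x) = \underbrace{V_q}_{1 \times v_{q-1}} \tilde\sigma\Big( \underbrace{V_{q-1}}_{v_{q-1} \times v_{q-2}} \cdots \tilde\sigma\big( \underbrace{V_2}_{v_2 \times v_1} \tilde\sigma( \underbrace{V_1}_{v_1 \times D} x + \underbrace{c_1}_{v_1 \times 1} ) + \underbrace{c_2}_{v_2 \times 1} \big) + \cdots + \underbrace{c_{q-1}}_{v_{q-1} \times 1} \Big) + \underbrace{c_q}_{1 \times 1}$$

GroupSort activation:

$$\tilde\sigma(x_1, x_2, \ldots, x_{2n-1}, x_{2n}) = \big(\max(x_1, x_2), \min(x_1, x_2), \ldots, \max(x_{2n-1}, x_{2n}), \min(x_{2n-1}, x_{2n})\big)$$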
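The GroupSort activation with groups of size 2 can be sketched in a few lines of Python. The function name `group_sort` and the test vector are illustrative choices, not taken from the slides.

```python
def group_sort(v):
    """GroupSort with groups of size 2: each pair (x_{2k-1}, x_{2k}) becomes (max, min)."""
    if len(v) % 2 != 0:
        raise ValueError("group_sort expects an even number of coordinates")
    out = []
    for a, b in zip(v[0::2], v[1::2]):
        out.extend([max(a, b), min(a, b)])
    return out

x = [1.0, 3.0, -2.0, -5.0, 0.5, 0.5]
y = group_sort(x)
print(y)  # [3.0, 1.0, -2.0, -5.0, 0.5, 0.5]
```

The output is always a rearrangement of the input within each pair, so the multiset of coordinates is unchanged.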


Discriminator with q = 4

[Figure: fully connected network with inputs x(1), ..., x(6) (D = 6), three hidden layers of widths v1 = 4, v2 = 2, v3 = 6, and a scalar output.]

Adversarial principle

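• Objective: solve

$$\inf_{\theta \in \Theta} \sup_{\alpha \in \Lambda} \Big[ \mathbb{E} \log(D_\alpha(X)) + \mathbb{E} \log(1 - D_\alpha(G_\theta(Z))) \Big].$$

• Empirical version:

$$\inf_{\theta \in \Theta} \sup_{\alpha \in \Lambda} \Big[ \frac{1}{n} \sum_{i=1}^{n} \log(D_\alpha(X_i)) + \mathbb{E} \log(1 - D_\alpha(G_\theta(Z))) \Big].$$

• Generative principle: $\hat\theta_n \to G_{\hat\theta_n} \to G_{\hat\theta_n}(Z_1), G_{\hat\theta_n}(Z_2), \ldots \to$ new images.

• The min-max optimum is found by stochastic gradient descent.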
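To make the empirical criterion concrete, here is a minimal sketch on a one-dimensional toy problem. Everything specific in it is an assumption for illustration: the logistic discriminator, the location generator, the Gaussian data, and the replacement of the expectation over Z by a Monte Carlo average. It only evaluates the criterion at fixed parameters; it does not perform the min-max optimisation.

```python
import math
import random

def discriminator(alpha, x):
    """Hypothetical logistic discriminator D_alpha(x) = sigmoid(a0 + a1 * x)."""
    a0, a1 = alpha
    return 1.0 / (1.0 + math.exp(-(a0 + a1 * x)))

def generator(theta, z):
    """Hypothetical location generator G_theta(z) = theta + z."""
    return theta + z

def empirical_objective(alpha, theta, xs, zs):
    """(1/n) sum_i log D(X_i) plus a Monte Carlo average of log(1 - D(G(Z_j)))."""
    real = sum(math.log(discriminator(alpha, x)) for x in xs) / len(xs)
    fake = sum(math.log(1.0 - discriminator(alpha, generator(theta, z)))
               for z in zs) / len(zs)
    return real + fake

rng = random.Random(0)
xs = [rng.gauss(2.0, 1.0) for _ in range(500)]   # stand-in for X_1, ..., X_n
zs = [rng.gauss(0.0, 1.0) for _ in range(500)]   # latent draws Z_1, ..., Z_m

blind = empirical_objective((0.0, 0.0), 0.0, xs, zs)      # D = 1/2 everywhere
informed = empirical_objective((-1.0, 1.0), 0.0, xs, zs)  # D increasing in x
print(blind, informed)
```

A discriminator that outputs 1/2 everywhere gives exactly $-\ln 4$, while one that scores larger values of x as more likely to be real does better when the data sit to the right of the generated samples, as in this toy setting.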


From GANs to WGANs

• Biau, Cadre, Sangnier, and Tanielian (AoS, 2020):

. Basic properties of GANs.

. Impact of the discriminator on the quality of the approximation.

. Statistical consistency, rates of convergence, central limit theorem.

. Play with simple examples.

• However...

. The training process of GANs is unstable.

. Mode collapse phenomenon.

. Arjovsky, Chintala, and Bottou (2017) → Wasserstein GANs.

. WGANs have become a standard in machine learning.

• Our goal: make some theoretical advances in WGANs.


Wasserstein GANs


At the origins of WGANs

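• Reminder: for $\mu$ and $\nu$ probability measures on $E$,

$$D_{\mathrm{JS}}(\mu, \nu) = \frac{1}{2} D_{\mathrm{KL}}\Big(\mu \,\Big\|\, \frac{\mu + \nu}{2}\Big) + \frac{1}{2} D_{\mathrm{KL}}\Big(\nu \,\Big\|\, \frac{\mu + \nu}{2}\Big).$$

• Idealization: $\mathscr{D} = \mathscr{D}_\infty$, the set of all functions from $E$ to $[0,1]$.

• In this case, for each $\theta \in \Theta$:

$$\sup_{D \in \mathscr{D}_\infty} \Big[ \mathbb{E} \log(D(X)) + \mathbb{E} \log(1 - D(G_\theta(Z))) \Big] = 2 D_{\mathrm{JS}}(\mu^\star, \mu_\theta) - \ln 4.$$

• Consequence:

$$\inf_{\theta \in \Theta} \sup_{D \in \mathscr{D}_\infty} \Big[ \mathbb{E} \log(D(X)) + \mathbb{E} \log(1 - D(G_\theta(Z))) \Big] = 2 \inf_{\theta \in \Theta} D_{\mathrm{JS}}(\mu^\star, \mu_\theta) - \ln 4.$$

• In practice, one always has $\mathscr{D} = \{D_\alpha : \alpha \in \Lambda\}$:

$$\inf_{\theta \in \Theta} \sup_{\alpha \in \Lambda} \Big[ \mathbb{E} \log(D_\alpha(X)) + \mathbb{E} \log(1 - D_\alpha(G_\theta(Z))) \Big] = \ ??$$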
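The link between the idealized GAN criterion and the Jensen-Shannon divergence can be checked numerically on a finite space. The sketch below is a toy verification under stated assumptions: two hand-picked discrete distributions stand in for $\mu^\star$ and $\mu_\theta$, and the supremum over all discriminators is evaluated at the pointwise maximiser p / (p + q). The function names are hypothetical.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence between discrete distributions (natural log)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: average KL to the mixture (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def gan_value(p, q, d):
    """E_p log D(X) + E_q log(1 - D(X)) for a discriminator given as a list of values."""
    return sum(pi * math.log(di) + qi * math.log(1 - di)
               for pi, qi, di in zip(p, q, d))

p = [0.5, 0.3, 0.2]   # toy stand-in for mu_star
q = [0.2, 0.2, 0.6]   # toy stand-in for mu_theta
d_opt = [pi / (pi + qi) for pi, qi in zip(p, q)]   # pointwise maximiser

lhs = gan_value(p, q, d_opt)
rhs = 2 * js(p, q) - math.log(4)
print(lhs, rhs)
```

The two printed numbers agree up to floating-point error, and any other discriminator, for instance the constant 1/2, gives a value no larger than `lhs`.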


General principle of WGANs

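• Reminder: for $\mu$ and $\nu$ probability measures in $P_1(E)$,

$$W_1(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int_{E \times E} \|x - y\| \, \pi(dx, dy).$$

• Dual form:

$$W_1(\mu, \nu) = \sup_{f \in \mathrm{Lip}_1} |\mathbb{E}_\mu f - \mathbb{E}_\nu f|.$$

• T-WGANs:

$$\inf_{\theta \in \Theta} \sup_{f \in \mathrm{Lip}_1} |\mathbb{E}_{\mu^\star} f - \mathbb{E}_{\mu_\theta} f| = \inf_{\theta \in \Theta} W_1(\mu^\star, \mu_\theta).$$

• WGANs: in practice, one always has $\mathscr{D} = \{D_\alpha : \alpha \in \Lambda\}$:

$$\inf_{\theta \in \Theta} \sup_{\alpha \in \Lambda} |\mathbb{E}_{\mu^\star} D_\alpha - \mathbb{E}_{\mu_\theta} D_\alpha| = \ ??$$

• Empirical version:

$$\inf_{\theta \in \Theta} \sup_{\alpha \in \Lambda} \Big[ \frac{1}{n} \sum_{i=1}^{n} D_\alpha(X_i) - \mathbb{E} D_\alpha(G_\theta(Z)) \Big].$$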
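On the real line, the Wasserstein distance between two empirical measures with the same number of atoms reduces to the average gap between sorted samples, which gives a quick way to see the primal and dual forms agree on a toy case. The sketch below relies on that one-dimensional shortcut; the function name and the sample values are illustrative assumptions.

```python
def wasserstein1_1d(xs, ys):
    """W1 between two empirical measures on the real line with equally many atoms."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

xs = [0.0, 1.0, 2.0]
ys = [3.0, 1.0, 2.0]

w1 = wasserstein1_1d(xs, ys)
# Dual side: the 1-Lipschitz function f(x) = x gives |E_mu f - E_nu f|.
mean_gap = abs(sum(xs) / len(xs) - sum(ys) / len(ys))
print(w1, mean_gap)  # 1.0 1.0
```

Here the second sample is a shift of the first, so the single test function f(x) = x already attains the supremum in the dual form. In general a single function only gives a lower bound on $W_1$.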


Notation

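• For $\mathscr{D} \subseteq \mathrm{Lip}_1$, the Integral Probability Metric $d_{\mathscr{D}}$ is

$$d_{\mathscr{D}}(\mu, \nu) = \sup_{f \in \mathscr{D}} |\mathbb{E}_\mu f - \mathbb{E}_\nu f|.$$

• Unified notation:

T-WGANs: $\inf_{\theta \in \Theta} d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta)$ and $\Theta^\star = \arg\min_{\theta \in \Theta} d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta)$

WGANs: $\inf_{\theta \in \Theta} d_{\mathscr{D}}(\mu^\star, \mu_\theta)$ and $\bar\Theta = \arg\min_{\theta \in \Theta} d_{\mathscr{D}}(\mu^\star, \mu_\theta)$

Empirical WGANs: $\inf_{\theta \in \Theta} d_{\mathscr{D}}(\mu_n, \mu_\theta)$ and $\hat\Theta_n = \arg\min_{\theta \in \Theta} d_{\mathscr{D}}(\mu_n, \mu_\theta)$.

• Properties of $d_{\mathrm{Lip}_1}$ are well known. This is different for $d_{\mathscr{D}}$.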
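A small experiment hints at why restricting the class of test functions matters. The sketch below computes an integral probability metric over a finite, hand-picked family of 1-Lipschitz functions on the real line; the family, the two discrete measures, and the name `ipm` are assumptions for illustration and are not the neural network classes studied in the talk.

```python
def ipm(functions, mu, nu):
    """d_D(mu, nu) = max over f in D of |E_mu f - E_nu f| for discrete measures.

    mu and nu are lists of (atom, weight) pairs; functions is a finite class D.
    """
    def expect(f, measure):
        return sum(w * f(x) for x, w in measure)
    return max(abs(expect(f, mu) - expect(f, nu)) for f in functions)

# Hypothetical finite class of 1-Lipschitz functions on the real line.
small_class = [lambda x: x, lambda x: abs(x), lambda x: max(x, 0.0)]

mu = [(-1.0, 0.5), (1.0, 0.5)]   # symmetric two-point measure
nu = [(0.0, 1.0)]                # Dirac mass at 0

rich = ipm(small_class, mu, nu)       # attained by f(x) = |x|
poor = ipm([lambda x: x], mu, nu)     # only compares the means
print(rich, poor)  # 1.0 0.0
```

With only f(x) = x in the class, the metric is zero although the two measures differ, so a restricted class can fail to separate distinct measures, whereas the full Lipschitz class gives the genuine distance $W_1$.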
