
Artificial Neural Networks

From Perceptron to Deep Learning

Younès BENNANI, Full Professor

Master of Science in Informatics
Exploration Informatique des Données et Décisionnel (EID2), Science des Données (WISD & MASD), Mathématiques des Données (MD)

© 2001-2021 Y. Bennani: This document is the property of Younès Bennani, Professor at USPN. It may not be distributed or reproduced without his written authorization (younes.bennani@sorbonne-paris-nord.fr).


Unsupervised learning of representations: Autoencoder, Autoassociator

[Figure: an autoencoder. The encoder $f_{\theta^{(1)}}$ maps the input $X$ to the code $Y = f_{\theta^{(1)}}(X)$; the decoder $g_{\theta'^{(1)}}$ maps $Y$ back to the reconstruction $\hat{X}$.]

An autoencoder (also called autoassociator or Diabolo network) is an artificial neural network used for learning efficient codings.

The purpose of an autoencoder is to learn a compressed, distributed representation (encoding) of a set of data, usually for dimensionality reduction.

The autoencoder is based on the sparse-coding concept.

The network is trained to reproduce its input at the output (i.e., to learn the identity function).
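As a concrete illustration, here is a minimal PyTorch sketch of such an autoencoder (layer sizes and the sigmoid activations are illustrative assumptions, not taken from the slides):

import torch.nn as nn

input_dim, code_dim = 784, 32  # e.g. flattened 28x28 images; the code size is an assumption

encoder = nn.Sequential(nn.Linear(input_dim, code_dim), nn.Sigmoid())  # f_theta: X -> Y
decoder = nn.Sequential(nn.Linear(code_dim, input_dim), nn.Sigmoid())  # g_theta': Y -> X_hat
autoencoder = nn.Sequential(encoder, decoder)

# Training minimizes the reconstruction error ||X - X_hat||^2,
# i.e. the network learns (a restricted form of) the identity function.
criterion = nn.MSELoss()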



Deep learning through Stacked Autoencoder (SAE)

Autoencoders can be stacked to form a deep network: the latent representation (output code) of each autoencoder is fed as the input of the next one.

The unsupervised learning of such an architecture is done layer by layer. Each layer is trained as an autoencoder by minimizing the error in the reconstruction of its input (which is the output code of the previous layer).

A stacked autoencoder model is obtained by stacking several autoencoders: the hidden layer of the SAE at layer i becomes the input of the SAE at layer i + 1.

Once all layers are trained, the network goes through a second learning step called "fine tuning" (a sketch of the layer-wise phase follows below).
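A minimal sketch of this greedy layer-wise procedure (PyTorch; the helper train_autoencoder, the tensor X_train, and the layer sizes are hypothetical names for illustration):

import torch.nn as nn

sizes = [784, 256, 64]           # input dim and two code dims (assumptions)
encoders, data = [], X_train     # X_train: tensor of unlabelled inputs (assumed given)

for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    enc = nn.Sequential(nn.Linear(d_in, d_out), nn.Sigmoid())
    dec = nn.Sequential(nn.Linear(d_out, d_in), nn.Sigmoid())
    train_autoencoder(enc, dec, data)   # minimize reconstruction error on `data`
    encoders.append(enc)
    data = enc(data).detach()           # the output code becomes the next layer's input

stacked = nn.Sequential(*encoders)      # deep encoder, ready for supervised fine-tuning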


Deep learning through Stacked Autoencoder (SAE)

[Figure: after training the first autoencoder $(f_{\theta^{(1)}}, g_{\theta'^{(1)}}): X \mapsto \hat{X}$, the decoder is dropped and the encoder output $Y = f_{\theta^{(1)}}(X)$ is kept as the new representation of the input $X$.]

Deep learning through Stacked Autoencoder (SAE)

Autoencoder: the network is trained to reproduce the input at the output (learning of the identity function).

A sparse autoencoder is then learned from the unlabelled data set $Y_1, Y_2, \ldots$ produced by the first encoder.

[Figure: second autoencoder $(f_{\theta^{(2)}}, g_{\theta'^{(2)}}): Y \mapsto \hat{Y}$, with code $Z = f_{\theta^{(2)}}(Y)$.]


From there, the procedure can be repeated.

[Figure: the decoder $g_{\theta'^{(2)}}$ is dropped and the encoder output $Z = f_{\theta^{(2)}}(Y)$ is kept as the new representation of the input $Y$.]


Deep learning through Stacked Autoencoder (SAE)

[Figure: the full stack. The encoder $f_{\theta^{(1)}}, f_{\theta^{(2)}}$ maps the input data $X$ to a small-size representation; the decoder $g_{\theta'^{(2)}}, g_{\theta'^{(1)}}$ reconstructs the input data as $\hat{X}$.]

Deep learning through Stacked Autoencoder (SAE)

Reconstruction comparison: [Figure: original image vs. autoencoder vs. PCA reconstructions.]


Comparison of separability of the codes (new 2D representation) generated by an autoencoder (right) and PCA (left) from the MNIST database (a large database of handwritten digits).

Comparison of separability of the codes (new 2D representation) generated by an autoencoder (right) and PCA (left) from the News Stories database.


Generative Adversarial Networks (GAN)

• System of two neural networks competing against each other in a zero-sum game framework.

• They were first introduced by Ian Goodfellow et al. in 2014.

• They can learn to draw samples from a model distribution that is similar to the data we give them.

Generative Adversarial Networks (GAN)

• Generative: learn a generative model

• Adversarial: trained in an adversarial setting

• Networks: use deep neural networks

The idea behind GANs is to train two networks jointly:

• a discriminator D to classify samples as "real" or "fake"

• a generator G to map a fixed distribution to samples that fool D

Why Generative Models?

• Discriminative models:

• Given an image X, predict a label Y

• Estimates P(Y|X)

• Discriminative models have several key limitations

• Can’t model P(X), i.e. the probability of seeing a certain image

• Thus, can’t sample from P(X), i.e. can’t generate new images

• Generative models (in general) cope with all of the above

• They try to learn the joint probability of the input data and labels simultaneously, i.e. P(X,Y).

• They have the potential to understand and explain the underlying structure of the input data even when there are no labels.


Magic of GANs...

Which one is Computer generated?


Magic of GANs...

Generated bedrooms. Source: “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks” https://arxiv.org/abs/1511.06434v2

Magic of GANs...

A Style-Based Generator Architecture for Generative Adversarial Networks

Tero Karras, Samuli Laine, Timo Aila https://arxiv.org/abs/1812.04948

Magic of GANs...


GAN’s Architecture


Training Discriminator

The discriminator D is a classifier and D(x) is interpreted as the probability for x to be a real sample.

Training Discriminator

The task of D is to distinguish real points $x_1, \ldots, x_N$ from generated points $G(z_1), \ldots, G(z_N)$.

The last layer of D is a sigmoid layer; the learning of D is then done with the binary cross-entropy loss:

$\mathcal{L}(D) = -\sum_{i=1}^{N}\Bigl[\log D(x_i) + \log\bigl(1 - D(G(z_i))\bigr)\Bigr]$

For a fixed generator G, the optimal discriminator is:

$D^* = \operatorname{arg\,min}_D \mathcal{L}(D)$

Training Generator

The generator G takes as input a Gaussian random variable z and produces a fake sample G(z).

The learning of G is more subtle. The performance of G is evaluated through the discriminator D: the generator seeks to maximize the loss of the discriminator.


Training Generator

The task of G is to fool the discriminator D.

One possible cost function for the generator G is the opposite of the discriminator's:

$G^* = \operatorname{arg\,max}_G \mathcal{L}(D) = \operatorname{arg\,max}_G \left(-\sum_{i=1}^{N}\log\bigl(1 - D(G(z_i))\bigr)\right)$

In practice, the loss for G is often replaced by:

$G^* = \operatorname{arg\,max}_G \sum_{i=1}^{N}\log D\bigl(G(z_i)\bigr)$

i.e., for a fixed discriminator D, the optimal generator maximizes $\log D$ on its own samples.


Loss function for Generator

When the generator is weak compared to the discriminator, i.e. when $D(G(z)) \ll 1$, the modified loss boosts the learning of the generator thanks to the high slope of $\log$ around zero.

[Figure: the curves $-\log(1-u)$ and $\log u$; the latter is steep near $u = 0$.]

Creating the generator and discriminator

import torch
import torch.nn as nn

z_dim = 32
hidden_dim = 128

net_G = nn.Sequential(nn.Linear(z_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 2))
net_D = nn.Sequential(nn.Linear(2, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1), nn.Sigmoid())

Training the generator and discriminator

import numpy as np   # needed for np.random.shuffle below

batch_size, lr = 50, 1e-3
nb_epochs = 500

optimizer_G = torch.optim.Adam(net_G.parameters(), lr=lr)
optimizer_D = torch.optim.Adam(net_D.parameters(), lr=lr)

for e in range(nb_epochs):
    np.random.shuffle(X)   # X: array of real 2-D points (the "double moons" data below)
    real_samples = torch.from_numpy(X).type(torch.FloatTensor)
    for real_batch in real_samples.split(batch_size):
        # improving D
        z = torch.empty(batch_size, z_dim).normal_()
        fake_batch = net_G(z).detach()   # detach: no gradient through G on D's step
        D_scores_on_real = net_D(real_batch)
        D_scores_on_fake = net_D(fake_batch)
        # D minimizes the (negated) binary cross-entropy loss from the slide above
        loss = -torch.mean(torch.log(1 - D_scores_on_fake) + torch.log(D_scores_on_real))
        optimizer_D.zero_grad()
        loss.backward()
        optimizer_D.step()

        # improving G
        z = torch.empty(batch_size, z_dim).normal_()
        fake_batch = net_G(z)
        D_scores_on_fake = net_D(fake_batch)
        # non-saturating generator loss: maximize log D(G(z))
        loss = -torch.mean(torch.log(D_scores_on_fake))
        optimizer_G.zero_grad()
        loss.backward()
        optimizer_G.step()
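The array X above holds the real samples; the slides show it as the "double moons" point cloud of the next example. A plausible way to generate it (assuming scikit-learn's make_moons, which the slides do not mention):

import numpy as np
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=2000, noise=0.05)  # 2-D "double moons" point cloud
X = X.astype(np.float32)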


GAN Example: a GAN fitting double moons.


GAN Example

https://www.myheritage.fr/deep-nostalgia

GAN Example: play with GANs in your browser!

https://poloclub.github.io/ganlab/


Deep Convolutional GANs (DCGANs)

Key ideas:

• Replace FC hidden layers with Convolutions

• Generator: Fractional-Strided convolutions

• Use Batch Normalization after each layer

• Inside the Generator:

• Use ReLU for hidden layers

• Use Tanh for the output layer (see the sketch after the reference below)

Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative adversarial networks." arXiv:1511.06434 (2015).
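A generator sketch along these lines (layer widths and the 32x32 output size are illustrative assumptions, not taken from the paper):

import torch.nn as nn

# DCGAN-style generator: fractionally strided (transposed) convolutions upsample
# a latent vector to an image; BatchNorm + ReLU inside, Tanh at the output.
net_G = nn.Sequential(
    nn.ConvTranspose2d(100, 128, kernel_size=4, stride=1, padding=0),  # 1x1 -> 4x4
    nn.BatchNorm2d(128), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),   # 4x4 -> 8x8
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),    # 8x8 -> 16x16
    nn.BatchNorm2d(32), nn.ReLU(),
    nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2, padding=1),     # 16x16 -> 32x32
    nn.Tanh(),
)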


Problems with GANs

Probability Distribution is Implicit

• Not straightforward to compute P(X).

• Thus GANs are only good for Sampling/Generation.

Training is Hard

• Non-Convergence

• Mode-Collapse

• Discriminator is trying to maximize its reward.

• Generator is trying to minimize Discriminator’s reward.

• SGD was not designed to find the equilibrium of a game.

• We might not converge to the equilibrium at all.

$\min_G \max_D \mathcal{L}(D, G)$

Deep learning through Restricted Boltzmann Machine (RBM)

• RBM = Restricted Boltzmann Machine: a stochastic version of a Hopfield network (i.e., a recurrent neural network), often used as an associative memory or for dimensionality reduction.

• An RBM is an unsupervised, energy-based generative model that consists of a layer of binary visible units v and a layer of binary hidden units h.

• It can also be seen as a particular case of a Deep Belief Network (DBN).

• Why "restricted"? Because we restrict connectivity: there are no intra-layer connections.

• Given an observed state, the energy of the joint configuration of the visible and hidden units (v, h) is given by:

$E(v,h) = -b^T v - c^T h - v^T W h = -\sum_k b_k v_k - \sum_j c_j h_j - \sum_{j,k} v_k W_{kj} h_j$

where $v$ is the visible state, $h$ the hidden state, $v_k$ the state of the k-th visible unit, $h_j$ the state of the j-th hidden unit, $W_{kj}$ the weight of the j-k connection, and $b, c$ the biases.

• The RBM defines the joint probability p(v,h) and the marginal p(v):

$p(v,h) = \frac{1}{Z}\exp\{-E(v,h)\}, \qquad p(v) = \frac{1}{Z}\sum_h \exp\{-E(v,h)\}$

Deep learning through Restricted Boltzmann Machine (RBM)


$p(h \mid v) = \frac{p(h,v)}{p(v)} = \frac{\frac{1}{Z}\exp\{-E(h,v)\}}{\sum_{\tilde h} p(v,\tilde h)}$

$= \frac{\frac{1}{Z}\exp\{b^T v + c^T h + v^T W h\}}{\sum_{\tilde h}\frac{1}{Z}\exp\{b^T v + c^T \tilde h + v^T W \tilde h\}}$

$= \frac{\exp\{b^T v\}\,\exp\{c^T h + v^T W h\}}{\exp\{b^T v\}\sum_{\tilde h}\exp\{c^T \tilde h + v^T W \tilde h\}}$

$= \frac{\exp\{c^T h + v^T W h\}}{\sum_{\tilde h}\exp\{c^T \tilde h + v^T W \tilde h\}} = \frac{1}{Z'}\exp\{c^T h + v^T W h\}$

Deep learning through Restricted Boltzmann Machine (RBM)

Deep learning through Restricted Boltzmann Machine (RBM)

Given a random input configuration v, the state of the hidden unit j is set to 1 with probability (the logistic-sigmoid form follows from the factorization above):

$p(h_j = 1 \mid v) = \mathrm{sigm}\Bigl(c_j + \sum_k v_k W_{kj}\Bigr)$

Given v, all the $h_j$ are conditionally independent: $p(h \mid v) = \prod_j p(h_j \mid v)$.

Given h, all the $v_i$ are conditionally independent: $p(v \mid h) = \prod_i p(v_i \mid h)$.

The hidden states can be used as "features".

Deep learning through Restricted Boltzmann Machine (RBM)

Deep Belief Network (DBN) (Hinton, 2002; Smolensky, 1986)

[Figure: a DBN stacks layers of hidden units above a layer of visible units; the data pattern (a binary vector) feeds the visible units, and the hidden units form internal, or hidden, representations.]


How to learn a set of features that are good for reconstructing images of the digit 2:

[Figure: 50 binary feature neurons connected to a 16 x 16 pixel image. On the data (reality), increment the weights between an active pixel and an active feature; on the reconstruction (better than reality), decrement the weights between an active pixel and an active feature.]
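The increment/decrement procedure above is one step of contrastive divergence (CD-1). A minimal numpy sketch, not taken from the slides (function and variable names are illustrative):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.1, rng=None):
    """One CD-1 update for a binary RBM.
    v0: (n_visible,) binary data vector; W: (n_visible, n_hidden); b, c: biases."""
    rng = np.random.default_rng() if rng is None else rng
    # Up: sample hidden units given the data, p(h_j = 1 | v0)
    p_h0 = sigmoid(c + v0 @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Down: reconstruct visible units, p(v_k = 1 | h0)
    p_v1 = sigmoid(b + W @ h0)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    # Up again: hidden probabilities for the reconstruction
    p_h1 = sigmoid(c + v1 @ W)
    # Increment weights between active pixels and active features on the data,
    # decrement them on the reconstruction (as in the digit-2 illustration)
    W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    b += lr * (v0 - v1)
    c += lr * (p_h0 - p_h1)
    return W, b, c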


The final 50 x 256 weights: each neuron grabs a different feature.

How well can we reconstruct the digit images from the binary feature activations?

[Figure: data vs. reconstruction from activated binary features, both for new test images from the digit class the model was trained on, and for images from an unfamiliar digit class (the network tries to see every image as a 2).]

• Variants of the GBP algorithm

• Momentum

• Vogl's Method (Bold Driver)

• Quickprop

• Gradient Correlation

• Rprop: Resilient propagation

• Second-order rules

• Newton's method

• Greatest slope method

• Conjugate gradients

• Quasi-Newton

• Approximation of the Hessian

• Pseudo-Newton

• Levenberg-Marquardt


Variants of the GBP algorithm: Momentum*

Rumelhart D.E., Hinton G.E., Williams R.J. (1986)

Momentum is a heuristic method that provides good results.

Idea: prevent oscillations by attaching an inertia to each weight.

*Rumelhart D.E., Hinton G.E., Williams R.J. (1986), "Learning internal representations by error propagation", Parallel Distributed Processing, Vol. 1, Chapter 8, MIT Press.

$\Delta w(t) = w(t+1) - w(t)$

$\Delta w(t) = -\varepsilon \left.\frac{\partial \tilde R_{mlp(n)}(w)}{\partial w}\right|_{w(t)} + \alpha\,\Delta w(t-1), \qquad 0 \le \alpha < 0.9$

[Figure: at $w(t)$, the new step $\Delta w(t)$ combines the gradient term $-\varepsilon\,\partial C_{e_k}(w)/\partial w|_{w(t)}$ with the inertia term $+\alpha\,\Delta w(t-1)$ carried over from $w(t-1)$.]

Momentum = gradient descent + smoothing.

[Figure: trajectories for small $\alpha$ and for large $\alpha$.]
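A minimal numpy sketch of this update rule (function and variable names are illustrative):

import numpy as np

def momentum_step(w, grad, dw_prev, eps=0.01, alpha=0.9):
    """One gradient step with momentum: dw(t) = -eps * grad + alpha * dw(t-1)."""
    dw = -eps * grad + alpha * dw_prev
    return w + dw, dw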


Variants of the GBP algorithm: Momentum

[Figure: loss curves with and without momentum, for learning rates 0.01 and 0.1.]


Variants of the GBP algorithm: Momentum

The weight adjustment is a weighted average of all previous adaptations:

$\Delta w(t) = -\varepsilon \left.\frac{\partial \tilde R_{mlp(n)}(w)}{\partial w}\right|_{w(t)} + \alpha\,\Delta w(t-1)$

$= -\varepsilon \left.\frac{\partial \tilde R_{mlp(n)}(w)}{\partial w}\right|_{w(t)} + \alpha\left(-\varepsilon \left.\frac{\partial \tilde R_{mlp(n)}(w)}{\partial w}\right|_{w(t-1)} + \alpha\,\Delta w(t-2)\right)$

$= -\varepsilon \left.\frac{\partial \tilde R_{mlp(n)}(w)}{\partial w}\right|_{w(t)} + \alpha\left(-\varepsilon \left.\frac{\partial \tilde R_{mlp(n)}(w)}{\partial w}\right|_{w(t-1)} + \alpha\left(-\varepsilon \left.\frac{\partial \tilde R_{mlp(n)}(w)}{\partial w}\right|_{w(t-2)} + \cdots\right)\right)$

$= -\varepsilon \sum_{k=0}^{\infty} \alpha^k \left.\frac{\partial \tilde R_{mlp(n)}(w)}{\partial w}\right|_{w(t-k)}$


Variants of the GBP algorithm: the accelerating effect of momentum on learning

We assume that the gradient is constant, $\left.\frac{\partial \tilde R_{mlp(n)}(w)}{\partial w}\right|_{w(t-k)} = \nabla_w$ for all $k$:

$\Delta w(t) = -\varepsilon \sum_{k=0}^{\infty}\alpha^k \left.\frac{\partial \tilde R_{mlp(n)}(w)}{\partial w}\right|_{w(t-k)} = -\varepsilon\,\nabla_w \sum_{k=0}^{\infty}\alpha^k = -\left(\frac{\varepsilon}{1-\alpha}\right)\nabla_w$

Without momentum: $\Delta w(t) = -\varepsilon\,\nabla_w$

With momentum: $\Delta w(t) = -\left(\frac{\varepsilon}{1-\alpha}\right)\nabla_w = -\hat\varepsilon\,\nabla_w$

Momentum amplifies the effective gradient step: $\hat\varepsilon \to \infty$ as $\alpha \to 1$.


Variants of the GBP algorithm: Vogl's Method (Bold Driver)*

*Vogl T.P. et al. (1988), "Accelerating the convergence of the back-propagation method", Biological Cybernetics 59:257-263.

The gradient step is adapted during learning:

$\varepsilon(t) = \begin{cases} \alpha\,\varepsilon(t-1) & \text{if } \tilde R_{mlp_k}(w(t)) < \tilde R_{mlp_k}(w(t-1)) \\ \beta\,\varepsilon(t-1) & \text{if } \tilde R_{mlp_k}(w(t)) > 1.05\,\tilde R_{mlp_k}(w(t-1)) \\ \varepsilon(t-1) & \text{otherwise} \end{cases}$

with $\alpha > 1$ and $\beta < 1$, e.g. $\alpha = 1.05$ and $\beta = 0.7$.

$w_{ij}(t+1) = w_{ij}(t) - \varepsilon(t)\left.\frac{\partial \tilde R_{mlp(n)}(w)}{\partial w_{ij}}\right|_{w(t)}$


Variants of the GBP algorithm: Silva & Almeida*

*Silva F.M. & Almeida L.B. (1990), "Acceleration techniques for the back-propagation algorithm", Proc. EURASIP Workshop, Vol. 412, pp. 110-119, Springer-Verlag.

The gradient step of each connection is adapted during learning:

$\varepsilon_{ij}(t) = \begin{cases} \alpha\,\varepsilon_{ij}(t-1) & \text{if } \left.\frac{\partial \tilde R_{mlp(n)}(w)}{\partial w_{ij}}\right|_{t} \cdot \left.\frac{\partial \tilde R_{mlp(n)}(w)}{\partial w_{ij}}\right|_{t-1} > 0 \\[4pt] \beta\,\varepsilon_{ij}(t-1) & \text{if } \left.\frac{\partial \tilde R_{mlp(n)}(w)}{\partial w_{ij}}\right|_{t} \cdot \left.\frac{\partial \tilde R_{mlp(n)}(w)}{\partial w_{ij}}\right|_{t-1} < 0 \\[4pt] \varepsilon_{ij}(t-1) & \text{otherwise} \end{cases}$

with $\alpha > 1$ and $0 < \beta < 1$, e.g. $1.1 < \alpha < 1.3$ and $\beta = 0.7$.

$w_{ij}(t+1) = w_{ij}(t) - \varepsilon_{ij}(t)\left.\frac{\partial \tilde R_{mlp(n)}(w)}{\partial w_{ij}}\right|_{w(t)}$

Variants of the GBP algorithm: Quickprop*

*Fahlman S.E. (1988), "Faster-learning variations of back-propagation: an empirical study", Proc. Connectionist Models Summer School, pp. 38-51, Morgan Kaufmann.

The weight update rule during learning is:

$\Delta w(t) = \beta\,\Delta w(t-1)$

where $\beta$ is an approximation of the second derivative:

$\beta = \dfrac{\left.\dfrac{\partial \tilde R_{mlp(n)}(w)}{\partial w}\right|_{w(t)}}{\left.\dfrac{\partial \tilde R_{mlp(n)}(w)}{\partial w}\right|_{w(t-1)} - \left.\dfrac{\partial \tilde R_{mlp(n)}(w)}{\partial w}\right|_{w(t)}}$

$\beta > 0$: same direction; $\beta < 0$: we change direction.

Variants of the GBP algorithm: Gradient Correlation*

*Chan L.W. & Fallside F. (1987), "An adaptive training algorithm for back-propagation networks", Computer Speech and Language, Vol. 2, pp. 205-218.

Measures the cosine of the angle between two successive values of the gradient:

$\cos(\theta) = \frac{\nabla_w(t-1)^T\,\nabla_w(t)}{\|\nabla_w(t-1)\|\,\|\nabla_w(t)\|}, \qquad \nabla_w(t) = \left.\frac{\partial \tilde R_{mlp(n)}(w)}{\partial w}\right|_{w(t)}$

$\varepsilon(t) = \begin{cases} \varepsilon(t-1)\,\beta^{+} & \text{if } \cos(\theta) > 0 \\ \varepsilon(t-1)\,\beta^{-} & \text{otherwise} \end{cases}$

e.g. $\beta^{+} = 1.005$ and $\beta^{-} = 0.8$.


Variants of the GBP algorithm: Rprop (Resilient propagation)*

*Riedmiller M., Braun H. (1993), "A direct adaptive method for faster back-propagation learning: the Rprop algorithm", Proc. IEEE ICNN, Vol. 1, pp. 586-591.

The adaptations of the gradient step and of the weights depend only on the sign of the gradient.

$\Delta_{ij}(t) = \begin{cases} \varepsilon^{+}\,\Delta_{ij}(t-1) & \text{if } \left.\frac{\partial \tilde R_{mlp(n)}(w)}{\partial w_{ij}}\right|_{t} \cdot \left.\frac{\partial \tilde R_{mlp(n)}(w)}{\partial w_{ij}}\right|_{t-1} > 0 \\[4pt] \varepsilon^{-}\,\Delta_{ij}(t-1) & \text{if } \left.\frac{\partial \tilde R_{mlp(n)}(w)}{\partial w_{ij}}\right|_{t} \cdot \left.\frac{\partial \tilde R_{mlp(n)}(w)}{\partial w_{ij}}\right|_{t-1} < 0 \\[4pt] \Delta_{ij}(t-1) & \text{otherwise} \end{cases}$

with $0 < \varepsilon^{-} < 1 < \varepsilon^{+}$, e.g. $\varepsilon^{-} = 0.5$ and $\varepsilon^{+} = 1.2$.

$\Delta w_{ij}(t) = \begin{cases} -\Delta_{ij}(t) & \text{if } \left.\frac{\partial \tilde R_{mlp(n)}(w)}{\partial w_{ij}}\right|_{t} > 0 \\[4pt] +\Delta_{ij}(t) & \text{if } \left.\frac{\partial \tilde R_{mlp(n)}(w)}{\partial w_{ij}}\right|_{t} < 0 \\[4pt] 0 & \text{otherwise} \end{cases}$

Initialization: $\Delta_{ij} = \Delta_0 = 0.001$.
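A compact numpy sketch of this sign-based rule (a simplified variant without the weight-backtracking refinement; the bounds delta_min and delta_max are common safeguards, not from the slides):

import numpy as np

def rprop_step(w, grad, grad_prev, delta, eps_plus=1.2, eps_minus=0.5,
               delta_min=1e-6, delta_max=50.0):
    """One Rprop update; w, grad, grad_prev, delta are same-shape arrays."""
    sign_change = grad * grad_prev
    # Grow the step where the gradient kept its sign, shrink it where it flipped
    delta = np.where(sign_change > 0, np.minimum(delta * eps_plus, delta_max), delta)
    delta = np.where(sign_change < 0, np.maximum(delta * eps_minus, delta_min), delta)
    # Move each weight against the sign of its gradient, by its own step size
    w = w - np.sign(grad) * delta
    return w, delta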


Second-order rules: Newton's method

The second-order Taylor expansion of the cost function in the neighborhood of $\hat w$ is:

$\tilde R_{mlp}(w) = \tilde R_{mlp}(\hat w) + (w-\hat w)^T\,\nabla_w \tilde R_{mlp}(\hat w) + \frac{1}{2}(w-\hat w)^T H(\hat w)\,(w-\hat w)$

with the Hessian $H_{ij,kl} \equiv \dfrac{\partial^2 \tilde R_{mlp}(w)}{\partial w_{ij}\,\partial w_{kl}}$.

Differentiating with respect to $w$, one gets:

$\nabla_w \tilde R_{mlp}(w) = \nabla_w \tilde R_{mlp}(\hat w) + H(\hat w)\,(w-\hat w)$

At a minimum $w = \hat w$ we have $\nabla_w \tilde R_{mlp}(w) = 0$, hence $\nabla_w \tilde R_{mlp}(\hat w) + H(\hat w)\,(w-\hat w) = 0$. From where:

$w(t+1) = w(t) - H^{-1}(w(t))\,\nabla_w \tilde R_{mlp}(w(t))$

Newton's direction: $w^* - w = -H^{-1}(w)\,\nabla_w \tilde R_{mlp}(w)$ (compare with the gradient direction $-\nabla_w \tilde R_{mlp}(w)$).

Second-order rules

$w(t+1) = w(t) - H^{-1}(w(t))\,\nabla_w \tilde R_{mlp}(w(t))$ is an impractical method, very expensive: computing and inverting the Hessian costs up to $O(|w|^3)$ per step.

Cheaper families:

• Direct search methods (line search), $O(|w|)$:

• the greatest slope algorithm

• the conjugate gradients method

• Hessian approximation methods:

• the quasi-Newton method, $O(|w|^2)$

• the pseudo-Newton algorithm, $O(|w|)$

• the Levenberg-Marquardt method, $O(|w|)$

Direct search: the method of greatest slope

We do not compute a second derivative; at each step, we determine the best point in a given direction:

$w(t+1) = w(t) + \alpha\,d_t$

We look for the $\alpha$ that minimizes $\tilde R_{mlp}(w(t) + \alpha\,d_t)$.

The method of greatest slope looks for the best point in the direction of the gradient:

$d_t = -\nabla_w \tilde R_{mlp}(w)$

There are several methods* for determining the parameter $\alpha$. Cost: $O(|w|)$.

* Watrous R.L. (1987), "Learning algorithms for connectionist networks: applied gradient methods of nonlinear optimization", ICNN, San Diego, M. Caudill, C. Butler eds., Vol. 2, pp. 619-627.


Direct search: conjugate gradients*

This diagram illustrates the concept of conjugate directions. Suppose a line search has been performed along the direction $d_t$, starting from the point $w(t)$, to give an error minimum along the search path at the point $w(t+1)$: $g(w(t+1))^T d_t = 0$, where $g_t = \nabla_w \tilde R_{mlp}(w(t))$.

The direction $d_{t+1}$ is said to be conjugate to the direction $d_t$ if the component of the gradient parallel to the direction $d_t$, which has just been made zero, remains zero as we move along the direction $d_{t+1}$:

$g\bigl(w(t+1) + \alpha\,d_{t+1}\bigr)^T d_t = 0$

Conjugacy condition: $d_i^T\,H\,d_j = 0$ for $i \ne j$.

New direction: $d_{t+1} = -g_{t+1} + \beta_t\,d_t$

[Figure: the new direction is a compromise between the gradient direction and the old direction.] Cost: $O(|w|)$.

* Makram-Ebeid S., Sirat J.A., Viala J.R. (1989), "A rationalized back-propagation learning algorithm", Proc. IEEE IJCNN, Vol. 2, pp. 373-380.


Direct search: conjugate gradients, the algorithm

1. Choose the initial weight vector $w(0)$.
2. Calculate the gradient $g_0 = \nabla_w \tilde R_{mlp}(w(0))$ and initialize the descent direction $d_0 = -g_0$.
3. At iteration $t$, calculate $\alpha_t = \min_{\alpha} \tilde R_{mlp}\bigl(w(t) + \alpha\,d_t\bigr)$ and set $w(t+1) = w(t) + \alpha_{min}\,d_t$.
4. Test whether the stopping criterion is satisfied.
5. Calculate the new gradient $g_{t+1}$.
6. Calculate $\beta_t$.
7. Create a new direction: $d_{t+1} = -g_{t+1} + \beta_t\,d_t$.
8. Set $t = t+1$ and go to 3.

Direct search: conjugate gradients, choosing $\alpha_t$ and $\beta_t$

Various proposals have been made for the re-estimation of the factors $\alpha_t$ and $\beta_t$.

The choice of $\alpha_t$ is usually made by a gradient descent (line search) in the direction $d_t$.

For $\beta_t$ there are several rules:

$\beta_t = \dfrac{g_{t+1}^T\,(g_{t+1} - g_t)}{d_t^T\,(g_{t+1} - g_t)}$ (Hestenes-Stiefel)

$\beta_t = \dfrac{g_{t+1}^T\,(g_{t+1} - g_t)}{g_t^T\,g_t}$ (Polak-Ribière)

$\beta_t = \dfrac{g_{t+1}^T\,g_{t+1}}{g_t^T\,g_t}$ (Fletcher-Reeves)
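A minimal Python sketch of steps 1-8 with the Polak-Ribière rule (the backtracking line search is a crude stand-in for a real line-search method; all names are illustrative):

import numpy as np

def nonlinear_cg(f, grad, w0, n_iters=200, tol=1e-8):
    """Nonlinear conjugate gradient: f is the cost, grad its gradient."""
    w = np.asarray(w0, dtype=float)
    g = grad(w)
    d = -g                                      # step 2: d_0 = -g_0
    for _ in range(n_iters):
        if np.linalg.norm(g) < tol:             # step 4: stopping criterion
            break
        alpha = 1.0                             # step 3: crude backtracking for alpha_t
        while f(w + alpha * d) > f(w) and alpha > 1e-12:
            alpha *= 0.5
        w = w + alpha * d
        g_new = grad(w)                         # step 5: new gradient
        beta = g_new @ (g_new - g) / (g @ g)    # step 6: Polak-Ribiere rule
        d = -g_new + max(beta, 0.0) * d         # step 7: new direction (restart if beta < 0)
        g = g_new
    return w

# Example: minimize the quadratic f(w) = 0.5 w^T A w - b^T w
A = np.array([[3.0, 1.0], [1.0, 2.0]]); b = np.array([1.0, 1.0])
w_star = nonlinear_cg(lambda w: 0.5 * w @ A @ w - b @ w, lambda w: A @ w - b, np.zeros(2))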

Approximation of the Hessian: quasi-Newton

Quasi-Newton methods calculate an approximation $H' \approx H^{-1}$ of the inverse Hessian (without computing it explicitly):

$w(t+1) = w(t) - \varepsilon\,H'(t)\,\nabla_w \tilde R_{mlp}(w(t))$

$H'(t+1) = H'(t) + F\bigl(H'(t),\, w(t+1),\, w(t),\, \nabla_w \tilde R_{mlp}(w(t)),\, \nabla_w \tilde R_{mlp}(w(t+1))\bigr)$

Broyden-Fletcher-Goldfarb-Shanno (BFGS):

$H'(t+1) = H'(t) + \frac{p\,p^T}{p^T v} - \frac{H'(t)\,v\,v^T\,H'(t)}{v^T\,H'(t)\,v} + \bigl(v^T\,H'(t)\,v\bigr)\,u\,u^T$

$p = w(t+1) - w(t), \qquad v = \nabla_w \tilde R_{mlp}(w(t+1)) - \nabla_w \tilde R_{mlp}(w(t)), \qquad u = \frac{p}{p^T v} - \frac{H'(t)\,v}{v^T\,H'(t)\,v}$

Cost: $O(|w|^2)$.

* Watrous R.L. (1987), "Learning algorithms for connectionist networks: applied gradient methods of nonlinear optimization", ICNN, San Diego, M. Caudill, C. Butler eds., Vol. 2, pp. 619-627.


Approximation of the Hessian: pseudo-Newton*

The pseudo-Newton algorithm calculates a diagonal approximation of the Hessian:

$\Delta w_{ij} = -\frac{\nabla_{w_{ij}} \tilde R_{mlp}(w)}{\sigma_{ij} + \mu} = -\frac{\delta_i^k\,x_j}{\sigma_{ij} + \mu}, \qquad \sigma_{ij} = x_j^2\,\gamma_i, \qquad \gamma_i = \frac{\partial \delta_i^k}{\partial a_i}$

(The constant $\mu$ safeguards the step against very small curvatures.) Cost: $O(|w|)$.

* Becker S. & Le Cun Y. (1988), "Improving the convergence of back-propagation learning with second order methods", Connectionist Models Summer School, Morgan Kaufmann, pp. 29-37.


Approximation of the Hessian: Levenberg-Marquardt*

The Levenberg-Marquardt method calculates a positive-definite approximation of the Hessian:

$\Delta w_{ij} = -\frac{\nabla_{w_{ij}} \tilde R_{mlp}(w)}{\sigma_{ij}} = -\frac{\delta_i^k\,x_j}{\sigma_{ij}}, \qquad \sigma_{ii} = (1+\lambda)\,H_{ii}, \qquad \sigma_{ij} = H_{ij}$

Cost: $O(|w|)$.

* Press W.H. & Flannery B.P. (1988), "Numerical Recipes", Cambridge University Press.

Formalism for learning:

• Behavioral analysis

• Dilemma bias-variance

• VC-dim: a measure of complexity

• ERM: Empirical Risk Minimization

• Is the ERM principle consistent?

• SRM: Structural Risk Minimization

• Heuristics for the adjustment of the generalization capacity

• Formal regularization

• Learning with noisy inputs

• Early stopping

• Structural regularization

Learning and Generalization: Formalism

Data: $D_N = \bigl\{(x^{(1)}, d^{(1)}), (x^{(2)}, d^{(2)}), \ldots, (x^{(N)}, d^{(N)})\bigr\}$, drawn from $p(x,d) = p(x)\,p(d/x)$.

Problem: find $\psi(x, w^*)$ in a family $F = \{\psi(x,w) \,/\, w \in \Omega\}$.

Procedure: minimize the theoretical risk (generalization error):

$R(w) = \int L\bigl[d, \psi(x,w)\bigr]\,dp(x,d)$

Choice: $\psi(x, w^*) = \operatorname{Arg\,min}_{\psi(x,w)} R(w)$

Mean Squared Error (MSE): $R(w) = \int \bigl[d - \psi(x,w)\bigr]^2\,dp(x,d)$


Learning and Generalization: Behavior Analysis

$R(w) = \int\bigl[d - \psi(x,w)\bigr]^2\,dp(x,d) = \mathrm{E}\bigl[(D - \psi(X,w))^2\bigr] = \underbrace{\mathrm{E}\bigl[(D - \mathrm{E}[D/X])^2\bigr]}_{\text{constant}} + \underbrace{\mathrm{E}\bigl[(\mathrm{E}[D/X] - \psi(X,w))^2\bigr]}_{\text{minimizing } R(w) \text{ is equivalent to minimizing this term}}$

The optimal solution $\psi(x, w^*)$ is the best approximation, in the least-squares sense, of the conditional expectation $\mathrm{E}[D/X]$.

To estimate $\psi(x,w)$, we have $N$ samples $D_N$, supposed realizations of the random variables $(X, D)$.

[Figure: at a point $x_0$, the model output $\psi(x_0, w)$ against the conditional distribution $p(d/x_0)$ in the $(x, d)$ plane.]


Learning and Generalization: Behavior Analysis

To show the dependence of the function $\psi(x,w)$ on the sample $D_N$, we will note this function $\psi_{D_N}(x,w)$:

$R(w) = \int\bigl[d - \psi_{D_N}(x,w)\bigr]^2\,dp(x,d) = \mathrm{E}_{X,D_N}\Bigl[\bigl(D - \psi_{D_N}(X,w)\bigr)^2\Bigr]$

$= \underbrace{\mathrm{E}_{X,D}\Bigl[\bigl(D - \mathrm{E}[D/X]\bigr)^2\Bigr]}_{\text{Bayesian error: a constant that depends on the problem}} + \underbrace{\mathrm{E}_{X}\Bigl[\bigl(\mathrm{E}[D/X] - \mathrm{E}_{D_N}[\psi_{D_N}(X,w)]\bigr)^2\Bigr]}_{(\text{bias})^2\text{: the average deviation between the estimator and the optimal solution}} + \underbrace{\mathrm{E}_{X,D_N}\Bigl[\bigl(\mathrm{E}_{D_N}[\psi_{D_N}(X,w)] - \psi_{D_N}(X,w)\bigr)^2\Bigr]}_{\text{variance: how the gap varies with } D_N}$

Learning and Generalization: Dilemma Bias-Variance*

* Geman S., Bienenstock E., Doursat R. (1992), "Neural Networks and the Bias/Variance Dilemma", Neural Computation, Vol. 4, N. 1, pp. 1-58.

[Figure: the global error is the sum of the bias (decreasing) and the variance (increasing) as functions of the complexity of the model, with an optimum at l*.]

A good estimator = good precision + good stability. We must find a compromise.

Learning and Generalization: Example


The Vapnik–Chervonenkis dimension (VC-dim) is a measure of the capacity (complexity, expressive power, richness, or flexibility) of a set of functions that can be learned by a statistical binary classification algorithm.

It is defined as the cardinality of the largest set of points that the algorithm can shatter.

The VC-dim is known exactly for simple models (e.g., linear systems); for complex models, on the other hand, one must resort to approximations and bounds.

Learning and Generalization: VC-dim, a measure of complexity*

* Vapnik V.N. (1995), "The Nature of Statistical Learning Theory", Springer-Verlag, New York, Inc.


Learning and Generalization: VC-dim, a measure of complexity*

The VC-dim of an MLP with M threshold units and W weights satisfies*:

$VC\text{-}\dim \le 2\,W \log_2(eM)$

To have $E_{test} < \varepsilon$ when $E_{training} = \frac{\varepsilon}{2}$, it is necessary to have N examples with:

$N = O\!\left(\frac{W}{\varepsilon}\log_2\frac{M}{\varepsilon}\right)$

The VC-dim of an MLP with E input units and C hidden units satisfies:

$VC\text{-}\dim \ge 2\left\lfloor\frac{C}{2}\right\rfloor E$

* Baum E.B. & Haussler D. (1989), "What size net gives valid generalization?", Neural Computation, Vol. 1, pp. 151-160.

Learning and Generalization: VC-dim, an example

Learning and Generalization: ERM, Empirical Risk Minimization

The theoretical risk is not computable (p(x,d) is unknown): we cannot minimize $R(w)$ directly, so we use an induction principle.

The most common is Empirical Risk Minimization (ERM):

Choice: $\psi(x, w^+) = \operatorname{Arg\,min}_{\psi(x,w)} \tilde R(w)$

Empirical risk (learning error): $\tilde R(w) = \frac{1}{N}\sum_{n=1}^{N} L\bigl[d^{(n)}, \psi(x^{(n)}, w)\bigr]$


Learning and Generalization: Is the ERM principle consistent?

Question: is the function $\psi(x, w^+)$ that minimizes $\tilde R(w)$ on $D_N$ close to the optimum $\psi(x, w^*)$ for $R(w)$?

Vapnik & Chervonenkis* show that the ERM principle is consistent if $\tilde R(w)$ converges uniformly towards $R(w)$ on $F$:

$\lim_{N\to\infty} P\left\{\sup_{w\in\Omega}\bigl|R(w) - \tilde R(w)\bigr| > \varepsilon\right\} = 0,\ \forall \varepsilon > 0 \quad\Longrightarrow\quad \psi(x,w^+) \xrightarrow{\,N\to\infty\,} \psi(x,w^*)\ ?$

They offer a confidence interval**: with probability $(1-\delta)$, for N observations, the generalization error is such that:

$\forall w \in \Omega, \quad \tilde R(w) - \varepsilon\bigl(N, VC\text{-}\dim(w), \delta\bigr) \le R(w) \le \tilde R(w) + \varepsilon\bigl(N, VC\text{-}\dim(w), \delta\bigr)$

where $\varepsilon(N, VC\text{-}\dim(w), \delta)$ is the width of the confidence interval.

* Vapnik V.N. & Chervonenkis A.Y. (1971), "On the uniform convergence of relative frequencies of events to their probabilities", Theory of Probability and its Applications, Vol. 16, N. 2, pp. 264-280.

** Vapnik V.N. & Chervonenkis A.Y. (1989), "The necessary and sufficient conditions for consistency of the method of Empirical Risk Minimization", Pattern Recognition and Image Analysis, Vol. 1, N. 3, pp. 264-305.


Learning and Generalization: Is the ERM principle consistent?

For $\psi(x,w) \in \{0,1\}$:

$\varepsilon\bigl(N, VC\text{-}\dim(w), \delta\bigr) = \sqrt{\frac{VC\text{-}\dim(w)}{N}\left[1 + \log\frac{2N}{VC\text{-}\dim(w)}\right] - \frac{1}{N}\log\frac{\delta}{4}}$

Example: [Figure: as the model capacity VC-dim(w) grows, the empirical risk $\tilde R(w)$ decreases while the confidence term $\varepsilon(N, VC\text{-}\dim(w), \delta)$ grows; their sum $\tilde R(w) + \varepsilon(N, VC\text{-}\dim(w), \delta)$, valid with a confidence level of $(1-\delta)$, shows over-fitting beyond its minimum.]

Learning and Generalization: SRM, Structural Risk Minimization*

* Vapnik V.N. (1992), "Principles of Risk Minimization for Learning Theory", in Advances in Neural Information Processing Systems, Morgan Kaufmann Pub., Vol. 4, pp. 831-840.

ERM is effective when the number of observations is large or when the problem and the system are simple.

Vapnik proposed other induction principles, in particular the principle of Structural Risk Minimization (SRM)*.

Principle: minimize the guaranteed risk:

$R_{garanti}(w) = \tilde R(w) + \varepsilon\bigl(N, VC\text{-}\dim(w), \delta\bigr)$

We define a sequence of model classes $S_i = \{\psi(x,w) \,/\, w \in \Omega_i\}$, as an ordered set:

$S_1 \subset S_2 \subset \cdots \subset S_n$

ordered by capacity:

$VC\text{-}\dim(S_1) < VC\text{-}\dim(S_2) < \cdots < VC\text{-}\dim(S_n)$

[Figure: over the model capacity VC-dim(w), the guaranteed risk $R_{garanti}(w) = \tilde R(w) + \varepsilon(N, VC\text{-}\dim(w), \delta)$ reaches its minimum at some class $S^*$ between $S_1$ and $S_i$.]


The principle of Structural Risk Minimization is to choose $S^*$ minimizing $R_{garanti}(w)$ according to a two-step optimization:

Step 1: in each $S_i$, compute the minimum of the empirical risk:

$\tilde R(w_i^*) = \min_{w\in\Omega_i} \tilde R(w)$

Step 2: choose $S^*$ minimizing the guaranteed risk:

$S^* = \operatorname{arg\,min}_{S_i}\Bigl[\tilde R(w_i^*) + \varepsilon\bigl(N, VC\text{-}\dim(w), \delta\bigr)\Bigr]$


Learning and Generalization: Heuristics for the adjustment of the generalization capacity

• Goal:

• Operational systems

• Heuristic methods

• Key problem:

• Adjust the system capacity both to the problem and to the data

• Get the best generalization

• Method:

• Restrict the search space of the solution

• Regularization

Learning and Generalization: Why regularization?

Regularization is often used as a solution to the overfitting problem in machine learning.

Common causes of overfitting are:

1. The model is complex enough that it starts modeling the noise in the training data.

2. The training data is relatively small and is an insufficient representation of the underlying distribution it is sampled from, so the model fails to learn a generalizable mapping.

Regularization helps us overcome the issue of overfitting.

www.cheatsheets.aqeel-anwar.com

Learning and Generalization: What is regularization?

• "Regularization consists of different techniques and methods used to address the issue of over-fitting by reducing the generalization error without affecting the training error much."

• Choosing overly complex models for the training data points can often lead to overfitting.

• On the other hand, a simpler model leads to underfitting the data.

• Hence choosing just the right amount of complexity in the model is critical.

• Since the complexity of the model cannot be directly inferred from the available training data, it is often impossible to stumble upon the right model complexity for training.

• This is where regularization comes into play, making the complex model less prone to overfitting.


Learning and Generalization: Types of Regularization

We can classify the regularization techniques into three categories:

• Modify the loss function

• L2 Regularization (strong)

• L1 Regularization (strong)

• Entropy Regularization (strong)

• Modify the sampling method

• Data Augmentation (weak)

• K-fold Cross-Validation (medium)

• Modify the training algorithm

• Dropout (strong)

• Injecting noise (weak)

Strong, medium, and weak indicate how effective the approach is in addressing the issue of overfitting.


Learning and Generalization: Types of Regularization (modify the loss function)

These methods take into account the norm of the learned parameters or the output distribution.

L2 Regularization (strong):

Consider a linear regression problem with mean-squared loss. In L2 regularization, we modify the loss to include the weighted L2 norm of the weights (beta) being optimized. This prevents the weights from getting too large and hence keeps them from overfitting.

The constant lambda (≥ 0) is used to control the compromise between overfitting and underfitting. When lambda is high (low), the model tends to underfit (overfit).
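The loss formula on the slide is an image; in standard notation (with weights $\beta$, an assumption of this transcription) the modified loss reads:

$\mathcal{L}_{L2}(\beta) = \sum_{i=1}^{N}\bigl(y_i - \beta^T x_i\bigr)^2 + \lambda \sum_{j=1}^{n} \beta_j^2 = \mathrm{MSE} + \lambda\,\|\beta\|_2^2$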

L2 Regularization (strong), geometric view:

Let us consider the 2D case (n = 2), where we can visualize the regression in the Cartesian plane.

Solving the weights for the L2 regularization loss shown above visually means finding the point with the minimum loss on the MSE contour (blue) that lies within the L2 ball (green circle).

Increasing the value of lambda corresponds to shrinking the green ball.

L1 Regularization (strong):

Instead of using the L2 norm of the weights in the loss function, in L1 regularization the L1 norm (absolute values) of the weights is used. The modified loss becomes the one sketched below.

Just like the L2 regularizer, the L1 regularizer finds the point with the minimum loss on the MSE contour plot that lies within the unit-norm ball. The unit-norm ball for an L1 norm is a diamond with corners.
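Again the formula on the slide is an image; in the same (assumed) notation:

$\mathcal{L}_{L1}(\beta) = \sum_{i=1}^{N}\bigl(y_i - \beta^T x_i\bigr)^2 + \lambda \sum_{j=1}^{n} |\beta_j| = \mathrm{MSE} + \lambda\,\|\beta\|_1$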


L1 Regularization (strong), geometric view:

Solving the weights for the L1 regularization loss shown above visually means finding the point with the minimum loss on the MSE contour (blue) that lies within the L1 ball (green diamond).

The additional advantage of using an L1 regularizer over an L2 regularizer is that the L1 norm tends to induce sparsity in the weights. This means that, with such a regularizer, the weight vector beta might have elements that are exactly zero. The weights with the L2 regularizer can become really small, but they never actually go to zero.


Entropy Regularization (strong):

• Entropy quantifies the uncertainty in a probability distribution: the greater the uncertainty in the distribution, the greater the entropy.

• Entropy regularization is used when the output of the model is a probability distribution, for example in classification.

• Instead of directly using the norm of the weights in the loss term, the entropy regularizer includes the entropy of the output distribution, scaled by lambda.

Entropy Regularization (strong), continued:

Consider a classification problem whose output is a probability distribution over the classes. In the case of entropy regularization, the loss function is modified as sketched below.
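The formula on the slide is an image; a standard form consistent with the description (cross-entropy on labels $y$ and output distribution $\hat p$, both notational assumptions of this transcription):

$\mathcal{L} = \mathrm{CE}(y, \hat p) - \lambda\,H(\hat p), \qquad H(\hat p) = -\sum_{c} \hat p_c \log \hat p_c$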

• We want the output probabilities to keep a certain degree of uncertainty, which means we want to increase the entropy.

• Since training decreases the loss, the entropy term in the loss function must be multiplied by -1.

• The scaling constant lambda controls the regularization: the greater the value of lambda, the more uniform the output distribution is.


Learning and Generalization: Types of Regularization (modify the sampling method)

These methods are useful to overcome overfitting that arises due to the limited size of the available dataset. They try to manipulate the available input to create a fair representation of the actual input distribution.

Data Augmentation (weak):

Data augmentation increases the size of the available data set by augmenting it with more inputs created by random cropping, dilating, rotating, adding a small amount of noise, etc.


K-fold Cross-Validation (medium):

In K-fold cross-validation, the available training dataset is divided into K non-overlapping subsets and K models are trained. For each model, one of the K subsets is used for validation while the remaining (K-1) subsets are used for training.

Learning and Generalization: Types of Regularization (modify the training algorithm)

Regularization can also be implemented by modifying the training algorithm in various ways. The two most commonly used methods are discussed below.

Dropout (strong):

• Dropout is used when the training model is a neural network.

• In the dropout method, connections between the nodes of consecutive layers are randomly dropped based on a dropout ratio (the percentage of the total connections dropped), and the remaining network is trained in the current iteration. In the next iteration, another random set of connections is dropped.

By randomly dropping connections, the network is able to learn a better-generalized mapping from input to output, hence reducing the overfitting.

The dropout ratio needs to be carefully selected and has a significant impact on the learned model. A good value of the dropout ratio is between 0.25 and 0.4.
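As an illustration, a minimal PyTorch sketch using nn.Dropout, which drops unit activations at random (the most common variant of the idea described above); the layer sizes and the 0.3 ratio are assumptions within the suggested range:

import torch.nn as nn

# Dropout layers zero activations at random during training and are
# disabled automatically when the model is switched to eval() mode.
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(), nn.Dropout(p=0.3),
    nn.Linear(256, 128), nn.ReLU(), nn.Dropout(p=0.3),
    nn.Linear(128, 10),
)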

Learning and Generalization: Types of Regularization (modify the training algorithm)

Injecting noise (weak):

• Similar to dropout, this method is usually used when the model being learned is a neural network.

• During training, a small amount of random noise is added to the updated weights, which helps the model learn a more robust set of features.

• A robust set of features helps ensure that the model doesn't overfit the training data.

• This method, however, doesn't work very well as a regularizer.
