
Artificial Neural Networks

From Perceptron to Deep Learning

Younès BENNANI Full Professor

Master of Science in Informatics
Exploration Informatique des Données et Décisionnel (EID2)
Science des Données (WISD & MASD)
Mathématiques des Données (MD)

© 2001-2021 @Y. Bennani: This document is the property of Younès Bennani, Professor at USPN. It may not be distributed or reproduced without his written authorization (younes.bennani@sorbonne-paris-nord.fr).


General representation of a continuous function

Kolmogorov (1957)

Theorem [Kolmogorov (1957)]:

Any continuous function f defined on [0,1]^n can be written in the form:

$$f(x) = \sum_{i=1}^{2n+1} F_i\!\left( \sum_{j=1}^{n} G_{ij}(x_j) \right)$$

where the F_i and the G_{ij} are continuous functions of a single variable.

A neural network is able to approximate the functions F_i and G_{ij} using units of the form f(w_0 + w^T x): the inputs x_1, ..., x_n feed hidden units realizing G_{1,1}, ..., G_{2n+1,n}, whose outputs are combined by units realizing F_1, ..., F_{2n+1} to produce f(x).

Example: f(x) = x_1 x_2 can be written as

$$f(x) = \frac{1}{4}\left( (x_1 + x_2)^2 - (x_1 - x_2)^2 \right)$$


Multi-Layers Perceptron (MLP)

Architecture: similar to that of the Perceptron or Madaline, plus intermediate processing layers (hidden layers).
- External layers: Input (e units), Output (s units)
- Internal layers: Hidden (c units)
Notation: <e | c | s>, example: <6 | 4 | 2>

Goal: learn from the training set

$$D_N = \left\{ (x^{(1)}, d^{(1)}), (x^{(2)}, d^{(2)}), \ldots, (x^{(N)}, d^{(N)}) \right\}, \qquad x^{(n)} \in \Re^D, \quad d^{(n)} \in \Re^K$$

[Figure: the input x feeds the hidden layer through weights w^(1), which feeds the output y through weights w^(2); d is the target/desired output and y the computed/network output.]


Notations:

[Example network: input units 0-5, hidden units 6-9, output units 10-11.]

E: set of the input units
S: set of the output units
Amont(k): set of the units whose outputs serve as inputs to unit k
Aval(k): set of the units that use as input the output of unit k

By definition, we have: ∀i ∈ E, Amont(i) = ∅ and ∀i ∈ S, Aval(i) = ∅.

Examples: Amont(7) = {0, 1, 2, 3, 4, 5}, Aval(7) = {10, 11}; Amont(2) = ∅, Aval(2) = {6, 7, 8, 9}; Amont(11) = {6, 7, 8, 9}, Aval(11) = ∅.


Multi-Layers Perceptron (MLP)

[Figure: a two-layer MLP. Input layer (D): x_1, ..., x_D with bias x_0 = 1; hidden layer (M): z_1, ..., z_M with bias z_0 = 1, connected to the inputs by weights w^(1) (w_10^(1), w_11^(1), ..., w_MD^(1)); output layer (K): y_1, ..., y_K, connected to the hidden units by weights w^(2) (w_10^(2), w_K1^(2), ..., w_KM^(2)).]


Learning from examples

The data follow a fixed but unknown probability distribution p(x, d) = p(x) p(d/x). The network computes y(x, w), an element of the hypothesis family H = { y(x, w) / w ∈ Ω }.

The problem of learning is often presented as the minimization of a risk/cost/error function. The theoretical risk (generalization error) is

$$R(w) = \int L\left[ y(x, w), d \right] dp(x, d)$$

where L is the loss function. Learning consists in finding the parameters:

$$w^{*} = \operatorname{Argmin}_w R(w)$$

The theoretical risk is not calculable: p(x, d) is unknown. But a sample D_N of i.i.d. examples taken from p(x, d) is known. In practice, we cannot minimize R(w); we then use an induction principle. Most common: Empirical Risk Minimization (ERM).


Empirical Risk Minimization

Given the training set

$$D_N = \left\{ (x^{(1)}, d^{(1)}), (x^{(2)}, d^{(2)}), \ldots, (x^{(N)}, d^{(N)}) \right\}$$

(inputs x: image, signal, measures, ...; target outputs d: class, prediction, score, rank, cluster, ...), the empirical risk (learning error) is

$$R(w) = \frac{1}{N} \sum_{n=1}^{N} L\left[ y(x^{(n)}, w), d^{(n)} \right] = \frac{1}{N} \sum_{n=1}^{N} R^{(n)}(w)$$

where R^{(n)}(w) is the local empirical risk of example n. Learning consists in finding the parameters:

$$w^{+} = \operatorname{Argmin}_w R(w)$$

Gradient descent: iterative update of the weights using

$$w_{ji}(t+1) \leftarrow w_{ji}(t) - \varepsilon(t) \frac{\partial R(w)}{\partial w_{ji}}, \qquad \Delta w_{ji}(t) = -\varepsilon(t) \frac{\partial R(w)}{\partial w_{ji}}$$

where ∂R(w)/∂w_{ji} is the gradient and ε(t) is the learning rate parameter, a positive number.
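A minimal sketch of this update rule in Python/NumPy (the names gradient_descent_step, grad_R and epsilon are illustrative, not from the slides):

```python
import numpy as np

def gradient_descent_step(w, grad_R, epsilon):
    """One iterative update of the weights: w(t+1) = w(t) - eps(t) * dR/dw."""
    return w - epsilon * grad_R(w)

# Toy usage: minimize R(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
for t in range(100):
    w = gradient_descent_step(w, lambda v: 2.0 * v, epsilon=0.1)
print(w)  # close to [0, 0]
```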


Gradient Descent algorithm

For an MLP with 1 hidden layer and 2 weight layers w^(1) and w^(2), the cost function is the Mean Squared Error:

$$R(w) = \frac{1}{N} \sum_{n=1}^{N} R^{(n)}(w)$$

$$R^{(n)}(w) = \frac{1}{2} \sum_{k=1}^{K} \left( f\!\left( w_{k0}^{(2)} + \sum_{j=1}^{M} w_{kj}^{(2)}\, f\!\left( w_{j0}^{(1)} + \sum_{i=1}^{D} w_{ji}^{(1)} x_i^{(n)} \right) \right) - d_k^{(n)} \right)^{2} = \frac{1}{2} \sum_{k=1}^{K} \left( y_k^{(n)} - d_k^{(n)} \right)^{2}$$

where y_k^{(n)} = y_k(x^{(n)}, w) is the network output and d_k^{(n)} the target output.

Gradient descent update of the weights, using the local risk of one example at a time:

$$w_{ji}(t+1) \leftarrow w_{ji}(t) - \varepsilon(t) \frac{\partial R^{(n)}(w)}{\partial w_{ji}}$$
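As a small sketch (Python/NumPy assumed; function names are illustrative), the local and empirical risks above can be computed directly from the network outputs and the targets:

```python
import numpy as np

def local_risk(y_n, d_n):
    """R^(n)(w) = 1/2 * sum_k (y_k^(n) - d_k^(n))^2 for one example."""
    return 0.5 * np.sum((np.asarray(y_n) - np.asarray(d_n)) ** 2)

def empirical_risk(Y, D):
    """R(w) = 1/N * sum_n R^(n)(w), given all outputs Y and targets D."""
    return float(np.mean([local_risk(y, d) for y, d in zip(Y, D)]))
```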


Multi-Layers Perceptron (MLP)

The activation of the j-th hidden unit is:

$$a_j = w_{j0}^{(1)} + \sum_{i \in Amont(j)} w_{ji}^{(1)} x_i$$

The output of this hidden unit is obtained by a nonlinear transformation of the activation:

$$z_j = f(a_j)$$

In the same way, the activation and the output of the k-th output unit can be obtained as follows:

$$a_k = w_{k0}^{(2)} + \sum_{j \in Amont(k)} w_{kj}^{(2)} z_j, \qquad y_k = f(a_k)$$

If we combine the calculation of the outputs of the hidden units and that of the output units, we obtain for the k-th output of the network the following expression:

$$y_k = f\!\left( w_{k0}^{(2)} + \sum_{j=1}^{M} w_{kj}^{(2)}\, f\!\left( w_{j0}^{(1)} + \sum_{i=1}^{D} w_{ji}^{(1)} x_i \right) \right)$$

[Figure: inputs x_0, x_1, ..., x_D propagate through the network to produce the network output y, compared with the target output d.]
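A minimal sketch of this forward pass in Python/NumPy (assuming a one-hidden-layer network; the weight matrices W1 and W2 store the biases w_j0 and w_k0 in their first column):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W1, W2, f=sigmoid):
    """Forward pass of a <D | M | K> MLP.
    W1 has shape (M, D+1), W2 has shape (K, M+1); column 0 holds the biases."""
    a_hidden = W1 @ np.append(1.0, x)      # a_j = w_j0 + sum_i w_ji x_i
    z = f(a_hidden)                        # z_j = f(a_j)
    a_out = W2 @ np.append(1.0, z)         # a_k = w_k0 + sum_j w_kj z_j
    return f(a_out)                        # y_k = f(a_k)
```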


Multi-Layers Perceptron: Transition functions

Common transition functions f(x) and their derivatives f'(x) (both are plotted in the figure):

$$f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

$$f(x) = \frac{1}{1 + e^{-x}}, \qquad f'(x) = f(x)\left( 1 - f(x) \right)$$

$$f(x) = \frac{e^{x} - 1}{e^{x} + 1}$$

(the derivative formula f'(x) = f(x)(1 − f(x)) is that of the logistic function).
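A short sketch of these transition functions in Python/NumPy (function names are illustrative):

```python
import numpy as np

def tanh(x):
    return np.tanh(x)                      # (e^x - e^-x) / (e^x + e^-x)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))        # 1 / (1 + e^-x)

def logistic_prime(x):
    f = logistic(x)
    return f * (1.0 - f)                   # f'(x) = f(x) (1 - f(x))

def scaled_exp_ratio(x):
    return (np.exp(x) - 1.0) / (np.exp(x) + 1.0)   # (e^x - 1)/(e^x + 1) = tanh(x/2)
```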


Rectified Linear Units

- More efficient gradient propagation: the derivative is 0 or constant, and can simply be folded into the learning rate.
- More efficient computation: only comparison, addition and multiplication.
- Leaky ReLU: f(x) = x if x > 0, else ax, where 0 ≤ a ≤ 1, so that the derivative is not 0 and some learning can still happen in that case.
- Lots of other variations.
- Sparse activation: for example, in a randomly initialized network, only about 50% of hidden units are activated (have a non-zero output).
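A minimal sketch of ReLU and Leaky ReLU in Python/NumPy:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.01):
    """f(x) = x if x > 0 else a*x, with 0 <= a <= 1."""
    return np.where(x > 0, x, a * x)

def leaky_relu_prime(x, a=0.01):
    # The derivative is 1 for x > 0 and a otherwise, so learning never stops completely.
    return np.where(x > 0, 1.0, a)
```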


Illustration of some possible decision boundaries

[Figure: for each network structure, the achievable decision regions are shown: hyperplanes, convex regions obtained by AND/OR combinations of half-planes, and arbitrary decision regions.]


Multi-Layers Perceptron: Error functions

Mean squared error:

$$R_{mse}(w) = \frac{1}{2} \sum_{n=1}^{N} \left( d^{(n)} - y^{(n)} \right)^{2}$$

Multiple-logistic (cross-entropy) error:

$$R_{multiple\text{-}logistic}(w) = \sum_{n=1}^{N} \sum_{i=1}^{D} \left[ d_i^{(n)} \log \frac{d_i^{(n)}}{y_i^{(n)}} + \left( 1 - d_i^{(n)} \right) \log \frac{1 - d_i^{(n)}}{1 - y_i^{(n)}} \right]$$

Log-likelihood error:

$$R_{log\text{-}likelihood}(w) = \sum_{n=1}^{N} d^{(n)} \log \frac{d^{(n)}}{p^{(n)}} \qquad \text{with} \qquad p^{(n)} = \frac{e^{y^{(n)}}}{\sum_j e^{y^{(j)}}}$$

Weighted mean squared error (mse-pondéré):

$$R_{weighted\text{-}mse}(w) = \sum_{n=1}^{N} \left( d^{(n)} - y^{(n)} \right)^{T} \Sigma^{-1} \left( d^{(n)} - y^{(n)} \right) \qquad \text{with} \qquad \Sigma = \frac{1}{N} \sum_{n=1}^{N} \left( d^{(n)} - y^{(n)} \right) \left( d^{(n)} - y^{(n)} \right)^{T}$$
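A small sketch of the log-likelihood error with a softmax output, in Python/NumPy (a hypothetical helper, not the course's code):

```python
import numpy as np

def softmax(y):
    # p_k^(n) = exp(y_k) / sum_j exp(y_j), shifted for numerical stability
    e = np.exp(y - np.max(y))
    return e / e.sum()

def log_likelihood_risk(D_targets, Y_outputs, eps=1e-12):
    """R_log-likelihood(w) = sum_n d^(n) . log(d^(n) / p^(n)) with p^(n) = softmax(y^(n)).
    Components where d = 0 contribute nothing (convention 0 * log 0 = 0)."""
    total = 0.0
    for d, y in zip(D_targets, Y_outputs):
        p = softmax(np.asarray(y))
        d = np.asarray(d)
        mask = d > 0
        total += np.sum(d[mask] * np.log(d[mask] / (p[mask] + eps)))
    return total

# Toy usage with one-hot targets for K = 3 classes.
d = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
y = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.2]])
print(log_likelihood_risk(d, y))
```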


Multi-Layers Perceptron: Rules for updating the weights

Let z = (x, d) be a realization of the random variable (X, D) with distribution p(x, d) = p(x) p(d/x). The theoretical risk, its empirical estimate and the learning problem are

$$R(w) = \int_Z L(z, w)\, dp(z), \qquad \tilde{R}(w) = \frac{1}{N} \sum_{n=1}^{N} L\left( z^{(n)}, w \right), \qquad w^{+} = \operatorname{argmin}_w \tilde{R}(w)$$

For the MLP with the squared-error loss, the local risk of example n is

$$\tilde{R}^{(n)}_{mlp}(w) = \frac{1}{2} \sum_{k=1}^{K} \left( y_k^{(n)} - d_k^{(n)} \right)^{2} = \frac{1}{2} \sum_{k=1}^{K} \left( f\!\left( w_{k0}^{(2)} + \sum_{j=1}^{M} w_{kj}^{(2)}\, f\!\left( w_{j0}^{(1)} + \sum_{i=1}^{D} w_{ji}^{(1)} x_i^{(n)} \right) \right) - d_k^{(n)} \right)^{2}$$


The weights are updated by

$$\Delta w_{ji} = w_{ji}(t+1) - w_{ji}(t) = -\varepsilon(t)\, \nabla_{w_{ji}} \tilde{R}^{(n)}_{mlp}(w), \qquad \nabla_{w_{ji}} \tilde{R}^{(n)}_{mlp}(w) = \delta_j^{(n)} s_i$$

where s_i is the output of unit i and

$$\delta_j^{(n)} = f'(a_j) \left( y_j^{(n)} - d_j^{(n)} \right) \quad \text{if } j \in \text{Output}$$

$$\delta_j^{(n)} = f'(a_j) \sum_{h \in Aval(j)} w_{hj}\, \delta_h^{(n)} \quad \text{if } j \notin \text{Output}$$


Gradient Backpropagation (GBP)

Starting from the local risk

$$\tilde{R}^{(n)}_{mlp}(w) = \frac{1}{2} \sum_{k=1}^{K} \left( y_k^{(n)} - d_k^{(n)} \right)^{2}$$

the chain rule gives

$$\frac{\partial \tilde{R}^{(n)}_{mlp}(w)}{\partial w_{ji}} = \frac{\partial \tilde{R}^{(n)}_{mlp}(w)}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}} = \delta_j^{(n)} s_i$$

If j ∉ Output:

$$\delta_j^{(n)} = \frac{\partial \tilde{R}^{(n)}_{mlp}(w)}{\partial a_j} = \frac{\partial \tilde{R}^{(n)}_{mlp}(w)}{\partial s_j} \frac{\partial s_j}{\partial a_j} = f'(a_j) \sum_{h \in Aval(j)} \frac{\partial \tilde{R}^{(n)}_{mlp}(w)}{\partial a_h} \frac{\partial a_h}{\partial s_j} = f'(a_j) \sum_{h \in Aval(j)} w_{hj}\, \delta_h^{(n)}$$

If j ∈ Output:

$$\delta_j^{(n)} = \frac{\partial \tilde{R}^{(n)}_{mlp}(w)}{\partial a_j} = \frac{\partial \tilde{R}^{(n)}_{mlp}(w)}{\partial y_j} \frac{\partial y_j}{\partial a_j} = \left( y_j^{(n)} - d_j^{(n)} \right) f'(a_j)$$


Gradient Backpropagation (GBP)

Propagation (forward pass): each unit i computes

$$a_i = \sum_{j \in Amont(i)} w_{ij} x_j, \qquad y_i = f(a_i)$$

Back-Propagation (backward pass): each unit i receives the deltas of the units it feeds,

$$\delta_i^{(n)} = f'(a_i) \sum_{h \in Aval(i)} w_{hi}\, \delta_h^{(n)}$$

[Figure: inputs x_0, x_1, ..., x_n are propagated through weights w_{i0}, ..., w_{in} to compute a_i and y_i = f(a_i); the deltas δ_k, ..., δ_m are back-propagated through weights w_{ki}, ..., w_{mi} and multiplied by f'(a_i) to obtain δ_i.]


MLP Learning algorithm

1. Initialize the weights W_0 randomly.
2. Randomly choose a training pair (x^(n), d^(n)).
3. Calculate the state of the network by propagation: a_j = w_{j0} + Σ_i w_{ji} x_i, then x_j = f(a_j).
4. Calculate a gradient approximation.
5. Adapt the weights:

$$w_{ji}(t+1) = w_{ji}(t) - \varepsilon(t)\, \delta_j^{(n)} s_i$$

$$\delta_j^{(n)} = f'(a_j)\left( y_j^{(n)} - d_j^{(n)} \right) \ \text{if } j \in \text{Output}, \qquad \delta_j^{(n)} = f'(a_j) \sum_{h \in Aval(j)} w_{hj}\, \delta_h^{(n)} \ \text{if } j \notin \text{Output}$$

where ε(t) is the gradient step (learning rate).

6. Repeat steps 2 to 5 until an acceptable error value is obtained.

Computational complexity of backpropagation: O(W), where W is the number of free parameters (weights and biases). Complexity of a 2-layer MLP: W = (D+1)H + (H+1)O, with D = #Input, H = #Hidden, O = #Output.
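A compact sketch of this learning algorithm for a one-hidden-layer MLP with sigmoid units, in Python/NumPy (assumptions: squared-error loss, bias terms stored in the first column of each weight matrix; names are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_mlp(X, D, M, epsilon=0.1, n_steps=10000, seed=0):
    """Stochastic backpropagation for a <D_in | M | K> MLP, following steps 1-6 above."""
    rng = np.random.default_rng(seed)
    N, D_in = X.shape
    K = D.shape[1]
    # 1- initialize the weights randomly
    W1 = rng.uniform(-0.5, 0.5, size=(M, D_in + 1))
    W2 = rng.uniform(-0.5, 0.5, size=(K, M + 1))
    for t in range(n_steps):
        # 2- randomly choose a training pair
        n = rng.integers(N)
        x = np.append(1.0, X[n])                 # bias input x_0 = 1
        # 3- calculate the state of the network by propagation
        z = np.append(1.0, sigmoid(W1 @ x))      # bias z_0 = 1
        y = sigmoid(W2 @ z)
        # 4- gradient approximation: deltas, with f'(a) = f(a)(1 - f(a))
        delta_out = (y - D[n]) * y * (1.0 - y)
        delta_hid = (W2[:, 1:].T @ delta_out) * z[1:] * (1.0 - z[1:])
        # 5- adapt the weights
        W2 -= epsilon * np.outer(delta_out, z)
        W1 -= epsilon * np.outer(delta_hid, x)
    # 6- (here the stopping criterion is simply a fixed number of steps)
    return W1, W2
```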


MLP Example

$$D = \left\{ \left( \begin{pmatrix} 1 \\ 1 \end{pmatrix}, -1 \right);\ \left( \begin{pmatrix} 1 \\ -1 \end{pmatrix}, 1 \right);\ \left( \begin{pmatrix} -1 \\ 1 \end{pmatrix}, 1 \right);\ \left( \begin{pmatrix} -1 \\ -1 \end{pmatrix}, -1 \right) \right\}$$

The target function is ψ(x, w) = x_1 ⊕ x_2 (the XOR of the two inputs, consistent with the targets above).

[Figure: a small MLP over (x_1, x_2) realizing ψ; its connections carry the weights and biases 0.5, 1.0, −1.0, 1.5, 1.0, 1.0, 1.0, 1.0, 0.5, with bias inputs equal to 1.]

The two hidden units implement the decision boundaries

$$1.0\, x_1 + 1.0\, x_2 + 1.5 = 0 \quad \text{i.e.} \quad x_1 + x_2 + 1.5 = 0$$

$$1.0\, x_1 + 1.0\, x_2 + 0.5 = 0 \quad \text{i.e.} \quad x_1 + x_2 + 0.5 = 0$$


Data organization
Coding of the outputs

Coding « 1-among-C » (one code component per category, e.g. Blue, Red, White):
  Coding (0/1):   1 0 0,  0 1 0,  0 0 1
  Coding (-1/1):  1 -1 -1,  -1 1 -1,  -1 -1 1

Coding « 1-among-(C-1) »:
  Coding (0/1):   1 0,  0 1,  0 0
  Coding (-1/1):  1 -1,  -1 1,  -1 -1

Thermometer coding:
  Coding (0/1):   1 1 1,  0 1 1,  0 0 1

[Figure: classes ranking; a neural network ranks regions labelled 1, 2, 3 for an input x.]
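A minimal sketch of the 1-among-C coding in Python/NumPy (function name illustrative):

```python
import numpy as np

def one_among_c(labels, categories):
    """1-among-C (one-hot) coding; use 2*code - 1 to obtain the (-1/1) variant."""
    index = {c: i for i, c in enumerate(categories)}
    code = np.zeros((len(labels), len(categories)))
    for n, label in enumerate(labels):
        code[n, index[label]] = 1.0
    return code

categories = ["Blue", "Red", "White"]
y01 = one_among_c(["Red", "White", "Blue"], categories)
print(y01)           # (0/1) coding
print(2 * y01 - 1)   # (-1/1) coding
```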


Data organization

Coding of the inputs

Why Data Normalization is necessary for Machine Learning models?

Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values.


Urvashi Jaitley, Oct 8, 2018 (https://medium.com/@urvashilluniya)


Data organization
Coding of the inputs

As with most statistical techniques, it is always worthwhile to pre-process the data so that it is centered and reduced (zero mean, unit variance).


Each input variable is standardized:

$$x^{(n)} \leftarrow \frac{x^{(n)} - \bar{x}}{\sigma}, \qquad \bar{x} = \frac{1}{N} \sum_{n=1}^{N} x^{(n)}, \qquad \sigma^{2} = \frac{1}{N} \sum_{n=1}^{N} \left( x^{(n)} - \bar{x} \right)^{2}$$

With centered, reduced inputs and zero-mean weights, the activation a_i = Σ_{j∈Amont(i)} w_{ij} x_j satisfies

$$E[a_i] = E\!\left[ \sum_{j \in Amont(i)} w_{ij} x_j \right] = \sum_{j \in Amont(i)} E[w_{ij}]\, E[x_j] = 0$$

$$Var[a_i] = Var\!\left[ \sum_{j \in Amont(i)} w_{ij} x_j \right] = \sum_{j \in Amont(i)} E[w_{ij}^{2}]\, E[x_j^{2}] = |Amont(i)|\, Var[w_{ij}]\, Var[x_j] = |Amont(i)|\, \sigma^{2}[w_{ij}]$$

(using E[w_{ij}] = 0 and Var[x_j] = 1).
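A small sketch of this standardization in Python/NumPy:

```python
import numpy as np

def standardize(X):
    """Center and reduce each input column: x <- (x - mean) / std."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0.0] = 1.0          # avoid division by zero for constant columns
    return (X - mean) / std, mean, std

X = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 220.0]])
Xs, mu, sigma = standardize(X)
print(Xs.mean(axis=0), Xs.std(axis=0))   # approximately 0 and 1
```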


Weights initialization

Requiring unit variance for the activations gives

$$Var[a_i] = |Amont(i)|\, \sigma^{2}[w_{ij}] = 1 \quad \Longrightarrow \quad \sigma^{2}[w_{ij}] = \frac{1}{|Amont(i)|}, \qquad \sigma[w_{ij}] = |Amont(i)|^{-1/2}$$

We can therefore initialize the weights according to a uniform law over an interval:

$$w_{ij} \in \left[ -\frac{k}{\sqrt{|Amont(i)|}},\ \frac{k}{\sqrt{|Amont(i)|}} \right], \qquad 0.5 < k < 2$$

with, for example, the transition function f(x) = 1.71 tanh((2/3) x).
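A minimal sketch of this uniform initialization in Python/NumPy (function name illustrative):

```python
import numpy as np

def init_uniform(fan_in, fan_out, k=1.0, seed=0):
    """w_ij uniform in [-k/sqrt(|Amont(i)|), +k/sqrt(|Amont(i)|)], with |Amont(i)| = fan_in."""
    rng = np.random.default_rng(seed)
    bound = k / np.sqrt(fan_in)
    return rng.uniform(-bound, bound, size=(fan_out, fan_in))

W1 = init_uniform(fan_in=6, fan_out=4)   # hidden layer of a <6 | 4 | 2> network
W2 = init_uniform(fan_in=4, fan_out=2)   # output layer
```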


Exercise to do

$$D = \left\{ \left( \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ -1 \end{pmatrix} \right);\ \left( \begin{pmatrix} -1 \\ 1 \end{pmatrix}, \begin{pmatrix} -1 \\ 1 \end{pmatrix} \right);\ \left( \begin{pmatrix} 1 \\ -1 \end{pmatrix}, \begin{pmatrix} -1 \\ 1 \end{pmatrix} \right);\ \left( \begin{pmatrix} -1 \\ -1 \end{pmatrix}, \begin{pmatrix} 1 \\ -1 \end{pmatrix} \right) \right\}$$

[Figure: the network to train, with bias inputs equal to 1 and initial weights 0.5, 0.3, 0.4, 0.1, 0.3, 0.6, −0.8, 0.9, 0.2, −0.4, −0.2, −0.7 on its connections.]

Transition function:

$$f(x) = \frac{1}{1 + e^{-x}}, \qquad f'(x) = f(x)\left[ 1 - f(x) \right]$$


A neural network with 3 layers of sigmoidal hidden units can approximate a smooth multivariate mapping to arbitrary accuracy.

[Figure, panels (a)-(d) over the inputs x_1 and x_2:] In (a) we see the output of a single sigmoidal unit as a function of two input variables. Adding the outputs from 2 such units can produce a ridge-like function (b). Adding 2 ridges can give a function with a maximum (c). Transforming this function with another sigmoid gives a localized response (d).

By taking linear combinations of these localized functions, we can approximate any smooth functional mapping. The MLP is a universal approximator.


Universal Approximator

Theorem [Hornik 1989]:
Three-layered artificial neural networks with linear outputs and sufficiently many increasing saturating hidden units can arbitrarily approach any measurable bounded function from one finite-dimensional space to another. An artificial neural network is a universal approximator.

Notes:
- In theory, there is no need for any other network structure. However, in applications, it may be more convenient to use multiple layers, non-linear outputs, and so on.
- It is a purely existential theorem. It does not say how to determine an appropriate number of neurons in the hidden layer and weight values to approximate a given function with a given precision!


Links with factor analysis
MLP & PCA [*][**]

[Figure: an auto-associative MLP (auto-encoder) with inputs x_1, ..., x_n, a hidden coding layer (encoder), and outputs x̂_1, ..., x̂_n trained with targets d_1 = x_1, ..., d_n = x_n (decoder); the learned code corresponds to the principal axes u_1, u_2 of the data in the (x_1, x_2) plane.]

An auto-associative (auto-encoder) MLP with linear outputs can realize a Karhunen-Loève transformation. An auto-associative MLP is equivalent to a Principal Component Analysis (PCA).

* Bourlard H. & Kamp Y. (1988): «Auto-association by multilayer perceptrons and singular value decomposition», Biological Cybernetics, Vol. 59, pp. 291-294.
** Baldi P. & Hornik K. (1989): «Neural networks and principal component analysis: Learning from examples without local minima», Neural Networks, Vol. 2, N°1, pp. 53-58.


Links with factor analysis
MLP & LDA [*][**]

[Figure: scatter plots of the digit classes 0-9 as seen at the input layer, hidden layer #1, hidden layer #2 and output layer of an MLP.]

* Bennani Y. (1992): «Approches Connexionnistes pour la Modélisation et l'Identification», Thèse de Doctorat, LRI-Université Paris 11, Orsay.
** Gallinari P., Thiria S., Badran F., Fogelman-Soulie F. (1991): «On the relations between discriminant analysis and multilayer perceptrons», Neural Networks, Vol. 4, N°3, pp. 349-360.


Structured Networks
Complete Connections

[Figure: a fully connected MLP with inputs x_1, ..., x_n, a hidden layer, and network/computed output y compared with the target/desired output d.]


Structured Networks
Complete Connections with Context [Elman *]

[Figure: inputs x_1(t), ..., x_n(t) together with context units c_1(t−1), ..., c_m(t−1) (a copy of the previous hidden state) feed the hidden layer; the network output y is compared with the target output d.]

* Elman J.L. (1990): «Finding structure in time», Cognitive Science, Vol. 14, pp. 179-212.


Structured Networks
Complete Connections with Context [Jordan *]

[Figure: inputs x_1(t), ..., x_n(t) together with context units y_1(t−1), ..., y_p(t−1) (a copy of the previous network output) feed the hidden layer; the network output y is compared with the target output d.]

* Jordan M.I. (1992): «Constrained supervised learning», Journal of Mathematical Psychology, Vol. 36, pp. 396-425.


Structured Networks
Synthesis of speech [Sejnowski & Rosenberg *]

NETtalk: text to speech. [Figure: a window of characters from the input text ("This is the input") is presented to the network, which outputs the corresponding phoneme, e.g. /z/.]

* Sejnowski T.J. & Rosenberg C.R. (1987): «Parallel Networks that learn to pronounce English text», Complex Systems, Vol. 1, pp. 145-168.


Structured Networks
Local Connections

[Figure: feature extractors; each hidden unit is connected only to a local receptive field of the input and extracts local features, with weight groups such as [w_1^(1) w_2^(1) w_3^(1) w_4^(1)], [w_1^(2) w_2^(2) w_3^(2) w_4^(2)] and [w_1^(3) w_2^(3) w_3^(3) w_4^(3)] between the input, hidden and output layers.]

The use of local connections greatly reduces the number of weights in a network.


Structured Networks
Constrained or shared-weight connections

[Figure: feature extractors; the same convolution filter [w_1^(1) w_2^(1) w_3^(1) w_4^(1)] is applied to successive receptive fields of the input to extract local features, and further filters (e.g. [w_1^(3) ... w_4^(3)], [w_1^(5) ... w_4^(5)]) combine them into the output.]

An interesting property of the weight-sharing mechanism is the very small number of free parameters.


Structured Networks
Constrained Connections or Shared Weights: TDNN (Time Delay Neural Network)

[Figure: a time-frequency input (time on one axis, frequency on the other) is scanned along the time axis by shared convolution filters, e.g. [w_1^(1) ... w_8^(1)], acting as feature extractors; further shared filters [w_1^(2) ... w_8^(2)], ..., [w_1^(5) w_2^(5) w_3^(5)] combine the resulting feature maps into the output.]

In the figure, n = 2, d = 1, N = 6 and M = ((N − n)/d) + 1 = 5.

* Waibel A., Hanazawa T., Hinton G., Shikano K., Lang K. (1987): «Phoneme recognition using Time-Delay Neural Networks», Tech. Rep. ATR, TR-1-006.


Structured Networks

Convolutional Neural Network

Super-Resolution Convolutional Neural Network


Super-Resolution Convolutional Neural Network

•Input Image: LR single-channel image up-sampled to desired higher resolution

•Conv. Layer 1: Patch extraction

• 64 filters of size 1 x 9 x 9

• Activation function: ReLU

• Output: 64 feature maps

• Parameters to optimize: 1 x 9 x 9 x 64 = 5184 weights and 64 biases

•Conv. Layer 2: Non-linear mapping

• 32 filters of size 64 x 1 x 1

• Activation function: ReLU

• Output: 32 feature maps

• Parameters to optimize: 64 x 1 x 1 x 32 = 2048 weights and 32 biases

•Conv. Layer 3: Reconstruction

• 1 filter of size 32 x 5 x 5

• Activation function: Identity

• Output: HR image

• Parameters to optimize: 32 x 5 x 5 x 1 = 800 weights and 1 bias
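As a quick check of these parameter counts (a hypothetical helper in Python, not part of the original material):

```python
def conv_params(in_channels, kernel_h, kernel_w, n_filters):
    """Number of weights and biases of a convolutional layer."""
    weights = in_channels * kernel_h * kernel_w * n_filters
    return weights, n_filters   # (weights, biases)

print(conv_params(1, 9, 9, 64))    # (5184, 64)  Conv. Layer 1
print(conv_params(64, 1, 1, 32))   # (2048, 32)  Conv. Layer 2
print(conv_params(32, 5, 5, 1))    # (800, 1)    Conv. Layer 3
```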

Structured Networks

Convolutional Neural Network


[Figure: input low-resolution image, superpixel layers, output high-resolution image.]

•Conv. Layer 1: Feature extraction

56 filters of size 1 x 5 x 5

Activation function: PReLU

Output: 56 feature maps

Parameters: 1 x 5 x 5 x 56 = 1400 weights and 56 biases

•Conv. Layer 2: Shrinking

12 filters of size 56 x 1 x 1

Activation function: PReLU

Output: 12 feature maps

Parameters: 56 x 1 x 1 x 12 = 672 weights and 12 biases

•Conv. Layers 3–6: Mapping

4 x 12 filters of size 12 x 3 x 3

Activation function: PReLU

Output: HR feature maps

Parameters: 4 x 12 x 3 x 3 x 12 = 5184 weights and 48 biases

•Conv. Layer 7: Expanding

56 filters of size 12 x 1 x 1

Activation function: PReLU

Output: 56 feature maps

Parameters: 12 x 1 x 1 x 56 = 672 weights and 56 biases

•DeConv Layer 8: Transposed Convolution (Deconvolution)

One filter of size 56 x 9 x 9

Activation function: PReLU

Output: HR image

Parameters: 56 x 9 x 9 x 1 = 4536 weights and 1 bias

Total number of weights: 12464 (plus a very small number of parameters in PReLU layers)

1. Feature extraction: Extracts a set of feature maps directly from the LR image.

2. Shrinking: Reduces dimension of feature vectors (thus decreasing the number of parameters) by using a smaller number of filters (compared to the number of filters used for feature extraction).

3. Non-linear mapping: Maps the feature maps representing LR to HR patches. This step is performed using several mapping layers.

4. Expanding: Increases dimension of feature vectors. This operation performs the inverse operation as the shrinking layers, in order to more accurately produce the HR image.

5. Deconvolution: Produces the HR image from HR features.

Super-Resolution Convolutional Neural Network

Structured Networks

Convolutional Neural Network


Discrete convolution is a linear transformation that preserves the notion of ordering.

It is sparse (only a few input units contribute to a given output unit) and reuses parameters (the same weights are applied to multiple locations in the input).

Computing the output values of a discrete convolution.


The light blue grid is called the input feature map. A kernel (shaded area, with the values shown in the figure) slides across the input feature map. At each location, the product between each element of the kernel and the input element it overlaps is computed and the results are summed up to obtain the output in the current location. The final outputs of this procedure are called output feature maps.

Stride: distance between two consecutive positions of the kernel.
Zero padding: number of zeros concatenated at the beginning and at the end of an axis.

[Figure: a (3 × 3) kernel applied to a (5 × 5) input padded with a border of zeros, using unit strides.]
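A minimal sketch of this computation in Python/NumPy (implemented, as is common in deep learning, as a cross-correlation; names are illustrative):

```python
import numpy as np

def conv2d(x, kernel, stride=1, padding=0):
    """Slide the kernel over the zero-padded input; at each location multiply
    element-wise with the overlapped patch and sum the results."""
    if padding > 0:
        x = np.pad(x, padding)
    i, k = x.shape[0], kernel.shape[0]          # square input and kernel assumed
    o = (i - k) // stride + 1
    out = np.zeros((o, o))
    for r in range(o):
        for c in range(o):
            patch = x[r*stride:r*stride+k, c*stride:c*stride+k]
            out[r, c] = np.sum(patch * kernel)
    return out

x = np.arange(25, dtype=float).reshape(5, 5)
print(conv2d(x, np.ones((3, 3))).shape)         # (3, 3): o = (5 - 3) + 1
```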


What is pooling in a deep architecture?

Pooling is a concept in deep learning visual object recognition that goes hand-in-hand with convolution. The idea is that a convolution (or a local neural network feature detector) maps a region of an image to a feature map. By organizing alternating feature detection and pooling layers in a hierarchy, flexibly deformed structures can be recognized as "variations on the same structural theme." At the very highest level, the features might be object categories that are fully independent of the position of the object within the frame.


Pooling is one of the most important concepts of convolutional neural networks: it divides the input map into a set of rectangles and outputs the maximum of each, for nonlinear down-sampling. The most common pooling filter is of size 2x2, which discards three fourths of the activations. The role of the pooling layer is to reduce the resolution of the feature map while retaining the features of the map required for classification, through translational and rotational invariance.


What is pooling in a deep architecture?

There are three variants of the pooling operation, depending on the regularization technique they are rooted in:
• Stochastic pooling,
• Overlapping pooling,
• Fractional pooling.

Stochastic pooling: a randomly picked activation within each pooling region is used instead of a deterministic pooling operation, which regularizes the network. Stochastic pooling still reduces the feature size, but gives up selecting features judiciously for the sake of regularization.

Overlapping pooling: the pooling operation extends the local connectivity beyond the size of the previous convolutional filter, which breaks the orthogonal division of responsibility between the pooling layer and the convolutional layer. So no information is gained if pooling windows overlap.

Fractional pooling: the reduction ratio of the feature-map size due to pooling can be controlled by fractional pooling, which helps to increase the depth of the network. Unlike stochastic pooling, the randomness is related to the choice of pooling regions, not to the way pooling is performed inside each of the pooling regions.


Pooling

Pooling works by sliding a window across the input and feeding the content of the window to a pooling function.

Pooling works very much like a discrete convolution, but replaces the linear combination described by the kernel with some other function.

Pooling operations reduce the size of feature maps by using some function to summarize subregions, such as taking the average or the maximum value.

Pooling layers provide invariance to small translations of the input.

Artificial Neural Networks: From Perceptron to Deep Learning 60 © 2021 ⏐Younès Bennani - USPN

[Figures: computing the output values of a (3 × 3) max pooling operation and of a (3 × 3) average pooling operation on a (5 × 5) input using (1 × 1) strides.]
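A minimal sketch of max and average pooling in Python/NumPy (names are illustrative):

```python
import numpy as np

def pool2d(x, size=3, stride=1, mode="max"):
    """Slide a (size x size) window over the input and summarize each
    sub-region by its maximum or its average."""
    i = x.shape[0]
    o = (i - size) // stride + 1
    out = np.zeros((o, o))
    for r in range(o):
        for c in range(o):
            window = x[r*stride:r*stride+size, c*stride:c*stride+size]
            out[r, c] = window.max() if mode == "max" else window.mean()
    return out

x = np.arange(25, dtype=float).reshape(5, 5)
print(pool2d(x, mode="max"))    # (3 x 3) max pooling on a (5 x 5) input
print(pool2d(x, mode="avg"))    # (3 x 3) average pooling
```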


We consider the simplified setting of:
• square inputs (i_1 = i_2 = i),
• square kernel size (k_1 = k_2 = k),
• same strides along both axes (s_1 = s_2 = s),
• same zero padding along both axes (p_1 = p_2 = p).

No zero padding, unit strides

Relationship 1. For any i and k, and for s = 1 and p = 0: o = (i − k) + 1.

Example: convolving a (3 × 3) kernel over a (4 × 4) input using unit strides (i.e., i = 4, k = 3, s = 1 and p = 0) → o = (4 − 3) + 1 = 2 → output of size (2 × 2).

N.B.: Blue maps are inputs, and cyan maps are outputs.


Zero padding, unit strides

Padding with p zeros changes the effective input size from i to i + 2p. Relationship 1 can then be used to infer the following relationship:

Relationship 2. For any i, k and p, and for s = 1: o = (i − k) + 2p + 1.

Example: convolving a (4 × 4) kernel over a (5 × 5) input padded with a 2 border of zeros using unit strides (i.e., i = 5, k = 4, s = 1 and p = 2) → o = (5 − 4) + 2 × 2 + 1 = 6 → output of size (6 × 6).


Half (same) padding

Having the output size be the same as the input size (i.e., o = i) can be a desirable property:

Relationship 3. For any i and for k odd (k = 2n + 1, n ∈ N), s = 1 and p = ⌊k/2⌋ = n:
o = i + 2⌊k/2⌋ − (k − 1) = i + 2n − 2n = i.

Example (half padding, unit strides): convolving a (3 × 3) kernel over a (5 × 5) input using half padding and unit strides (i.e., i = 5, k = 3, s = 1 and p = 1).


Full padding

While convolving a kernel generally decreases the output size with respect to the input size, sometimes the opposite is required. This can be achieved with proper zero padding:

Relationship 4. For any i and k, and for p = k − 1 and s = 1:
o = i + 2(k − 1) − (k − 1) = i + (k − 1).

Example (full padding, unit strides): convolving a (3 × 3) kernel over a (5 × 5) input using full padding and unit strides (i.e., i = 5, k = 3, s = 1 and p = 2).


No zero padding, non-unit strides

Relationship 5. For any i, k and s, and for p = 0: o = ⌊(i − k)/s⌋ + 1.

Example: convolving a (3 × 3) kernel over a (6 × 6) input padded with a 1 border of zeros using (2 × 2) strides (i.e., i = 6, k = 3, s = 2 and p = 1). In this case, the bottom row and right column of the zero-padded input are not covered by the kernel.


Zero padding, non-unit strides

The most general case (convolving over a zero padded input using non-unit strides) can be derived by applying Relationship 5 on an effective input of size i+ 2p, in analogy to what was done for Relationship 2:

Relationship 6.

For any i, k, p and s: o = ⌊(i + 2p − k)/s⌋ + 1.

As before, the floor function means that in some cases a convolution will produce the same output size for multiple input sizes.

More specifically, if i+ 2p−k is a multiple of s, then any input size j=i+a, a∈{0, . . . , s−1} will produce the same output size. Note that this ambiguity applies only for s >1.
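A tiny helper to check these relationships (Python; illustrative, not from the course):

```python
from math import floor

def conv_output_size(i, k, s=1, p=0):
    """Relationship 6: o = floor((i + 2p - k) / s) + 1; Relationships 1-5 are special cases."""
    return floor((i + 2 * p - k) / s) + 1

print(conv_output_size(4, 3))             # 2  (Relationship 1: no padding, unit strides)
print(conv_output_size(5, 4, p=2))        # 6  (Relationship 2: zero padding, unit strides)
print(conv_output_size(5, 3, p=1))        # 5  (Relationship 3: half padding)
print(conv_output_size(6, 3, s=2, p=1))   # 3  (zero padding, non-unit strides)
```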


Transposed convolution arithmetic

Transposed convolutions use a transformation going in the opposite direction of a normal convolution, i.e., from something that has the shape of the output of some convolution to something that has the shape of its input, while maintaining a connectivity pattern that is compatible with said convolution. For instance:
- the decoding layer of a convolutional autoencoder,
- projecting feature maps to a higher-dimensional space.

If the input and output were to be unrolled into vectors from left to right, top to bottom, the convolution could be represented as a sparse matrix C where the non-zero elements are the elements w_{i,j} of the kernel (with i and j being the row and column of the kernel, respectively).


This linear operation takes the input matrix flattened as a 16-dimensional vector and produces a 4-dimensional vector that is later reshaped as the 2 × 2 output matrix.

Using this representation, the backward pass is easily obtained by transposing C; in other words, the error is backpropagated by multiplying the loss with C^T. This operation takes a 4-dimensional vector as input and produces a 16-dimensional vector as output, and its connectivity pattern is compatible with C by construction.

Notably, the kernel w defines both the matrices C and C^T used for the forward and backward passes.

Transposed convolutions (also called fractionally strided convolutions or deconvolutions) work by swapping the forward and backward passes of a convolution.
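A small sketch of this matrix view in Python/NumPy (assuming no padding and unit strides; the helper name is illustrative):

```python
import numpy as np

def conv_matrix(kernel, input_size):
    """Sparse matrix C such that (C @ x.flatten()).reshape(o, o) is the
    convolution of x with the kernel (no padding, unit strides)."""
    k = kernel.shape[0]
    o = input_size - k + 1
    C = np.zeros((o * o, input_size * input_size))
    for r in range(o):
        for c in range(o):
            for u in range(k):
                for v in range(k):
                    C[r * o + c, (r + u) * input_size + (c + v)] = kernel[u, v]
    return C

kernel = np.arange(1.0, 10.0).reshape(3, 3)
C = conv_matrix(kernel, 4)      # 3x3 kernel over a 4x4 input -> 2x2 output
x = np.arange(16.0).reshape(4, 4)
y = C @ x.flatten()             # forward pass: 16-dim vector -> 4-dim vector
back = C.T @ y                  # transposed pass: 4-dim -> 16-dim, same connectivity
print(y.reshape(2, 2).shape, back.reshape(4, 4).shape)
```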


No zero padding, unit strides, transposed

The simplest way to think about a transposed convolution on a given input is to imagine such an input as being the result of a direct convolution applied on some initial feature map. The transposed convolution can then be considered as the operation that allows us to recover the shape of this initial feature map.

The transpose of convolving a (3 × 3) kernel over a (4 × 4) input using unit strides (i.e., i = 4, k = 3, s = 1 and p = 0) is equivalent to convolving a (3 × 3) kernel over a (2 × 2) input padded with a (2 × 2) border of zeros using unit strides (i.e., i′ = 2, k′ = k, s′ = 1 and p′ = 2).

Relationship 8. A convolution described by s = 1, p = 0 and k has an associated transposed convolution described by k′ = k, s′ = s and p′ = k − 1, and its output size is o′ = i′ + (k − 1). This corresponds to a fully padded convolution with unit strides.


Transposed convolution animations

N.B.: Blue maps are inputs, and cyan maps are outputs.

[Animation panels: no padding, no strides, transposed; arbitrary padding, no strides, transposed; half padding, no strides, transposed; full padding, no strides, transposed; no padding, strides, transposed; padding, strides, transposed; padding, strides, transposed (odd).]

Vincent Dumoulin, Francesco Visin - A guide to convolution arithmetic for deep learning https://github.com/vdumoulin/conv_arithmetic


Structured networks

LeNet for Digit recognition [Yann LeCun *]

* Le Cun Y., Boser B., Denker J.S., Henderson D., Howard R.E., Hubbard W., Jackel L.D. (1989) : «Back-propagation applied to handwritten zip code recognition»

Neural Computation, Vol. 1, pp. 541-551.


References