Artificial Neural Networks: From Perceptron to Deep Learning
© 2021 Younès Bennani - USPN

Artificial Neural Networks
From Perceptron to Deep Learning

Younès BENNANI, Full Professor
Master of Science in Informatics
Exploration Informatique des Données et Décisionnel (EID2), Science des Données (WISD & MASD), Mathématiques des Données (MD)

© 2001-2021 Y. Bennani: This document is the property of Younès Bennani, Professor at USPN. It may not be distributed or reproduced without his written authorization (younes.bennani@sorbonne-paris-nord.fr).
General representation of a continuous function
Kolmogorov (1957)

Theorem [Kolmogorov (1957)]:
Any continuous function f(x) defined on [0,1]^n can be written in the form:

f(x) = ∑_{i=1}^{2n+1} F_i ( ∑_{j=1}^{n} G_{ij}(x_j) )

where the F_i and G_ij are continuous functions of one variable.

Example:
f(x) = x_1 · x_2 = (1/4) [ (x_1 + x_2)² − (x_1 − x_2)² ]

A neural network is able to approximate the functions F_i and G_ij using functions of the form f(w_0 + wᵀx).

[Figure: network realizing the Kolmogorov decomposition, with inputs x_0, x_1, …, x_n, hidden units G_{1,1}, …, G_{2n+1,n} and summation units F_1, …, F_{2n+1}.]
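The product identity used in the example above can be checked numerically; this is a small sketch (the function name is illustrative, not from the course):

```python
# Sketch: verify the identity x1*x2 = ((x1+x2)^2 - (x1-x2)^2)/4,
# which rewrites a product as a combination of sums and squares.
def product_via_squares(x1, x2):
    return ((x1 + x2) ** 2 - (x1 - x2) ** 2) / 4

for a, b in [(2.0, 3.0), (-1.5, 4.0), (0.0, 7.0)]:
    assert abs(product_via_squares(a, b) - a * b) < 1e-12
```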
Multi-Layers Perceptron (MLP)

Architecture:
similar to that of the Perceptron or Madaline + intermediate processing layers (hidden layers)
- External layers: Input (e units), Output (s units)
- Internal layers: Hidden (c units)
Notation: <e ∣ c ∣ s>, example: <6 ∣ 4 ∣ 2>

Goal: learn from a sample { x^(n) → d^(n) }_{n=1}^{N}, with x^(n) ∈ ℜ^D and d^(n) ∈ ℜ^K:

D_N = { (x^(1), d^(1)), (x^(2), d^(2)), …, (x^(N), d^(N)) }

[Figure: input x, hidden layer with weights w^(1), output y with weights w^(2); d is the target/desired output, y the computed/network output.]
Multi-Layers Perceptron (MLP)

Notations:
E : set of the input units
S : set of the output units
Amont(k) : set of the units whose outputs serve as inputs to unit k
Aval(k) : set of the units that use as input the output of unit k

By definition, we have:
∀ i ∈ E, Amont(i) = ∅
∀ i ∈ S, Aval(i) = ∅

Example (network with units 0 to 11):
Amont(7) = {0, 1, 2, 3, 4, 5}    Aval(7) = {10, 11}
Amont(2) = ∅                     Aval(2) = {6, 7, 8, 9}
Amont(11) = {6, 7, 8, 9}         Aval(11) = ∅
Multi-Layers Perceptron (MLP)

[Figure: MLP with input layer of D units x_1, …, x_D plus bias x_0 = 1, hidden layer of M units z_1, …, z_M plus bias z_0 = 1, and output layer of K units y_1, …, y_K; first-layer weights w^(1) (e.g. w_10^(1), w_MD^(1)) and second-layer weights w^(2) (e.g. w_10^(2), w_K1^(2), w_KM^(2)).]
Learning from examples

Fixed but unknown probability distribution: p(x, d) = p(x) p(d/x)

Theoretical risk (generalization error), with L the loss function:

R(w) = ∫ L[ y(x, w), d ] dp(x, d)

where the model family is Η = { y(x, w) / w ∈ Ω }.

The problem of learning is often presented as the minimization of a risk/cost/error function.
Learning consists in finding the parameters:

w* = Argmin_w R(w)

The theoretical risk is not calculable, since p(x, d) is unknown.
But a sample of N i.i.d. examples drawn from p(x, d) is known.
In practice, we cannot minimize R(w) directly; we then use an induction principle.
Most common = Empirical Risk Minimization (ERM).
Empirical Risk Minimization

D_N = { (x^(1), d^(1)), (x^(2), d^(2)), …, (x^(N), d^(N)) }
(Input: image, signal, measures, …; Target output: class, prediction, score, rank, cluster, …)

Empirical risk (learning error):

R(w) = (1/N) ∑_{n=1}^{N} L[ y(x^(n), w), d^(n) ] = (1/N) ∑_{n=1}^{N} R^(n)(w)

where R^(n)(w) is the local empirical risk, and the model family is Η = { y(x, w) / w ∈ Ω }.

Learning consists in finding the parameters:

w⁺ = Argmin_w R(w)

Gradient descent: iterative update of the weights using

w_ji(t+1) ← w_ji(t) − ε(t) ∂R(w)/∂w_ji
Δw_ji(t) = −ε(t) ∂R(w)/∂w_ji

where ∂R(w)/∂w_ji is the gradient and ε(t), the learning-rate parameter, is a positive number.
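The gradient-descent update above can be sketched on a toy one-parameter risk; this is a minimal illustration (the quadratic risk and function names are assumptions, not from the course):

```python
# Minimal sketch of the update w(t+1) <- w(t) - eps * dR/dw,
# applied to the toy risk R(w) = (w - 3)^2 with gradient 2(w - 3).
def gradient_descent(grad, w0, eps=0.1, steps=100):
    w = w0
    for _ in range(steps):
        w = w - eps * grad(w)  # one gradient-descent step
    return w

w_star = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
assert abs(w_star - 3.0) < 1e-6  # converges to the minimizer w* = 3
```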
Gradient Descent algorithm

Cost function: Mean Squared Error

R(w) = (1/N) ∑_{n=1}^{N} R^(n)(w)

R^(n)(w) = (1/2) ∑_{k=1}^{K} ( y_k^(n) − d_k^(n) )²,   with y_k^(n) = y_k(x^(n), w)

where y_k^(n) is the network output and d_k^(n) the target output.

For an MLP with 1 hidden layer and 2 weight layers w^(1) and w^(2):

R^(n)(w) = (1/2) ∑_{k=1}^{K} [ f( w_k0^(2) + ∑_{j=1}^{M} w_kj^(2) f( w_j0^(1) + ∑_{i=1}^{D} w_ji^(1) x_i^(n) ) ) − d_k^(n) ]²

Gradient descent update of the weights:

w_ji(t+1) ← w_ji(t) − ε(t) ∂R^(n)(w)/∂w_ji
Multi-Layers Perceptron (MLP)

The activation of the j-th hidden unit is:

a_j = w_j0^(1) + ∑_{i∈Amont(j)} w_ji^(1) x_i

The output of this hidden unit is obtained by a nonlinear transformation of the activation:

z_j = f(a_j)

In the same way, the activation and the output of the k-th output unit can be obtained as follows:

a_k = w_k0^(2) + ∑_{j∈Amont(k)} w_kj^(2) z_j,   y_k = f(a_k)

If we combine the calculation of the outputs of the hidden units and that of the output units, we obtain for the k-th output of the network the following expression:

y_k = f( w_k0^(2) + ∑_{j=1}^{M} w_kj^(2) f( w_j0^(1) + ∑_{i=1}^{D} w_ji^(1) x_i ) )

[Figure: MLP with inputs x_0, x_1, …, x_D, hidden layer, network output y and target output d.]
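The forward-pass equations above can be sketched in a few lines of NumPy; this is an illustrative implementation for a <D | M | K> MLP with logistic activations (weight names and shapes are assumptions):

```python
import numpy as np

# Sketch of the forward pass: z_j = f(a_j) in the hidden layer,
# y_k = f(a_k) in the output layer, with logistic transfer function f.
def forward(x, W1, b1, W2, b2):
    f = lambda a: 1.0 / (1.0 + np.exp(-a))  # logistic transfer function
    z = f(W1 @ x + b1)                      # hidden outputs
    y = f(W2 @ z + b2)                      # network outputs
    return y

rng = np.random.default_rng(0)
D, M, K = 3, 4, 2
y = forward(rng.normal(size=D), rng.normal(size=(M, D)), np.zeros(M),
            rng.normal(size=(K, M)), np.zeros(K))
assert y.shape == (K,) and np.all((0 < y) & (y < 1))
```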
Multi-Layers Perceptron
Transition functions

f(x) = tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)
f(x) = 1 / (1 + e⁻ˣ)
f(x) = (eˣ − 1) / (eˣ + 1)

For the logistic function, the derivative satisfies:

f′(x) = f(x) (1 − f(x))

[Figure: plots of f(x) and f′(x).]
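The transfer functions and the derivative identity above can be checked numerically; a small sketch (function names are illustrative):

```python
import math

# Sketch of the transfer functions above and of the logistic
# derivative identity f'(x) = f(x) (1 - f(x)).
def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def logistic_prime(x):
    fx = logistic(x)
    return fx * (1.0 - fx)

def tanh_from_exp(x):
    # tanh written from its exponential definition
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

assert abs(tanh_from_exp(0.7) - math.tanh(0.7)) < 1e-12
# check the derivative identity against a central difference
h = 1e-6
num = (logistic(0.3 + h) - logistic(0.3 - h)) / (2 * h)
assert abs(num - logistic_prime(0.3)) < 1e-8
```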
Rectified Linear Units

- More efficient gradient propagation: the derivative is 0 or constant, which can simply be folded into the learning rate.
- More efficient computation: only comparison, addition and multiplication.
  - Leaky ReLU: f(x) = x if x > 0, else ax, where 0 ≤ a ≤ 1, so that the derivative is not 0 and some learning can still occur in that case.
  - Lots of other variations.
- Sparse activation: for example, in a randomly initialized network, only about 50% of hidden units are activated (have a non-zero output).
Rectified Linear Units

Illustration of some possible decision boundaries
[Figure: decision regions obtainable with increasing depth: hyperplanes (single layer), convex regions (combining hyperplanes with an "and"), and arbitrary decision regions (combining convex regions with an "or").]
Multi-Layers Perceptron
Error functions

Mean squared error:
R_mse(w) = (1/2) ∑_{n=1}^{N} ( d^(n) − y^(n) )²

Multiple logistic (cross-entropy):
R_multiple-logistic(w) = ∑_{n=1}^{N} ∑_{i=1}^{D} [ d_i^(n) log( d_i^(n) / y_i^(n) ) + (1 − d_i^(n)) log( (1 − d_i^(n)) / (1 − y_i^(n)) ) ]

Log-likelihood (softmax outputs):
R_log-likelihood(w) = ∑_{n=1}^{N} d^(n) log( d^(n) / p^(n) )   with   p^(n) = e^{y^(n)} / ∑_j e^{y^(j)}

Weighted mean squared error:
R_mse-weighted(w) = ∑_{n=1}^{N} ( d^(n) − y^(n) )ᵀ Σ⁻¹ ( d^(n) − y^(n) )
with Σ⁻¹ built from (1/N) ∑_{n=1}^{N} ( d^(n) − y^(n) )( d^(n) − y^(n) )ᵀ
Multi-Layers Perceptron
Rules for updating the weights

Let z = (x, d) be a realization of the random variables (X, D) with p(x, d) = p(x) p(d / x).

Theoretical risk:  R(w) = ∫_Z L(z, w) dp(z)
Empirical risk:    R̃(w) = (1/N) ∑_{n=1}^{N} L(z^(n), w)
Learning:          w⁺ = argmin_w R̃(w)

For the MLP with squared error:

R̃_mlp^(n)(w) = (1/2) ∑_{k=1}^{K} ( y_k^(n) − d_k^(n) )²
             = (1/2) ∑_{k=1}^{K} [ f( w_k0^(2) + ∑_{j=1}^{M} w_kj^(2) f( w_j0^(1) + ∑_{i=1}^{D} w_ji^(1) x_i ) ) − d_k^(n) ]²
Multi-Layers Perceptron
Rules for updating the weights

Δw_ji = w_ji(t+1) − w_ji(t) = −ε(t) ∇_{w_ji} R̃_mlp^(n)(w)

∇_{w_ji} R̃_mlp^(n)(w) = δ_j^(n) s_i

where s_i is the output of unit i feeding weight w_ji, and:

δ_j^(n) = f′(a_j) ( y_j^(n) − d_j^(n) )               if j ∈ Output
δ_j^(n) = f′(a_j) ∑_{h∈Aval(j)} w_hj δ_h^(n)          if j ∉ Output
Gradient Backpropagation (GBP)

R̃_mlp^(n)(w) = (1/2) ∑_{k=1}^{K} ( y_k^(n) − d_k^(n) )²

∂R̃_mlp^(n)(w)/∂w_ji = [ ∂R̃_mlp^(n)(w)/∂a_j ] · [ ∂a_j/∂w_ji ] = δ_j^(n) s_i

If j ∈ Output:
δ_j^(n) = ∂R̃_mlp^(n)(w)/∂a_j = [ ∂R̃_mlp^(n)(w)/∂y_j ] · [ ∂y_j/∂a_j ] = ( y_j^(n) − d_j^(n) ) f′(a_j)

If j ∉ Output:
δ_j^(n) = ∂R̃_mlp^(n)(w)/∂a_j = [ ∂R̃_mlp^(n)(w)/∂s_j ] · [ ∂s_j/∂a_j ]
        = f′(a_j) ∑_{h∈Aval(j)} [ ∂R̃_mlp^(n)(w)/∂a_h ] · [ ∂a_h/∂s_j ]
        = f′(a_j) ∑_{h∈Aval(j)} w_hj δ_h^(n)
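The delta rules above can be sketched for a small two-layer sigmoid network and checked against a finite-difference gradient; this is an illustrative implementation (bias terms omitted, names and shapes are assumptions):

```python
import numpy as np

# Sketch of the backpropagation deltas above for a <D | M | K> MLP
# with sigmoid units and squared error R = 1/2 * sum_k (y_k - d_k)^2.
f  = lambda a: 1.0 / (1.0 + np.exp(-a))
fp = lambda a: f(a) * (1.0 - f(a))  # f'(a) = f(a)(1 - f(a))

def grads(x, d, W1, W2):
    a1 = W1 @ x; z = f(a1)              # hidden activations / outputs
    a2 = W2 @ z; y = f(a2)              # output activations / outputs
    delta2 = (y - d) * fp(a2)           # delta for output units
    delta1 = fp(a1) * (W2.T @ delta2)   # delta backpropagated to hidden units
    return np.outer(delta2, z), np.outer(delta1, x)  # dR/dW2, dR/dW1

rng = np.random.default_rng(1)
x, d = rng.normal(size=3), np.array([0.0, 1.0])
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
g2, g1 = grads(x, d, W1, W2)

def risk(W1, W2):
    y = f(W2 @ f(W1 @ x))
    return 0.5 * np.sum((y - d) ** 2)

# finite-difference check of one first-layer gradient component
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
assert abs((risk(W1p, W2) - risk(W1m, W2)) / (2 * eps) - g1[0, 0]) < 1e-6
```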
Gradient Backpropagation (GBP)

Propagation:
a_i = ∑_{j∈Amont(i)} w_ij x_j,   y_i = f(a_i)

Back-Propagation:
δ_i^(n) = f′(a_i) ∑_{h∈Aval(i)} w_hi δ_h^(n)

[Figure: forward propagation of the inputs x_0, x_1, …, x_n through unit i, and backward propagation of the deltas δ_h through the weights w_hi.]
Multi-Layers Perceptron
MLP Learning algorithm

1- Initialize the weights W_0 randomly.
2- Randomly choose a data couple (x^(n), d^(n)).
3- Calculate the state of the network by propagation: a_j = w_j0 + ∑_i w_ji x_i,  x_j = f(a_j).
4- Calculate a gradient approximation.
5- Adapt the weights:
   w_ji(t+1) = w_ji(t) − ε(t) δ_j^(n) s_i
   δ_j^(n) = f′(a_j) ( y_j^(n) − d_j^(n) )          if j ∈ Output
   δ_j^(n) = f′(a_j) ∑_{h∈Aval(j)} w_hj δ_h^(n)     if j ∉ Output
   where ε(t) is the step of the gradient.
6- Repeat from 2 to 5 until an acceptable error value is obtained.

Computational complexity of backpropagation: O(W), where W = number of free parameters (weights and biases).
Complexity of a 2-layer MLP: W = (D+1)H + (H+1)O, with D: #Input, H: #Hidden, O: #Output.
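The parameter count W = (D+1)H + (H+1)O quoted above can be sketched as a one-line function (the function name is illustrative):

```python
# Sketch of the free-parameter count of a 2-layer MLP:
# each hidden unit has D weights + 1 bias, each output H weights + 1 bias.
def mlp_param_count(D, H, O):
    return (D + 1) * H + (H + 1) * O

# Example architecture <6 | 4 | 2> from the notation slide:
assert mlp_param_count(6, 4, 2) == 38  # 7*4 + 5*2
```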
MLP Example

Learning the XOR function ψ(x, w) = x_1 ⊕ x_2 from the dataset:

D = { ([1, 1]ᵀ, −1) ; ([1, −1]ᵀ, 1) ; ([−1, 1]ᵀ, 1) ; ([−1, −1]ᵀ, −1) }

[Figure: a two-input, two-hidden-unit, one-output network solving XOR; its weights (values 1.0, −1.0, 0.5, 1.5 in the figure) define the two separating lines below.]

1.0 x_1 + 1.0 x_2 + 1.5 = 0   i.e.   x_1 + x_2 + 1.5 = 0
1.0 x_1 + 1.0 x_2 + 0.5 = 0   i.e.   x_1 + x_2 + 0.5 = 0

MLP Example
Data organization
Coding of the outputs

Coding « 1-among-C » (one output unit per category, C = 3 here):

Category | Coding (0/1) | Coding (-1/1)
Blue     | 1 0 0        | 1 -1 -1
Red      | 0 0 1        | -1 -1 1
White    | 0 1 0        | -1 1 -1

Coding « 1-among-(C-1) » (C−1 = 2 output units):

Coding (0/1):  1 0 | 0 0 | 0 1
Coding (-1/1): 1 -1 | -1 -1 | -1 1

Thermometer coding (for ranked classes):

Coding (0/1): 1 1 1 | 0 0 1 | 0 1 1

[Figure: ranked regions 1/2/3 in the input space and a neural net predicting the classes ranking.]
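The "1-among-C" coding above can be sketched as a small helper covering both the 0/1 and −1/1 variants (the function name is illustrative):

```python
# Sketch of the "1-among-C" output coding above: the unit of the
# target category is set to 1, all others to `low` (0 or -1).
def one_among_c(index, C, low=0):
    return [1 if i == index else low for i in range(C)]

assert one_among_c(0, 3) == [1, 0, 0]           # coding (0/1)
assert one_among_c(2, 3, low=-1) == [-1, -1, 1]  # coding (-1/1)
```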
Data organization
Coding of the inputs

Why is data normalization necessary for machine learning models?
Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values.
Urvashi Jaitley, Oct 8, 2018 (https://medium.com/@urvashilluniya)
As with most statistical techniques, it is always in our interest to pre-process the data so that it is centered and scaled.
Standardization:

x^(n) ← ( x^(n) − x̄ ) / σ
with  x̄ = (1/N) ∑_{n=1}^{N} x^(n)   and   σ² = (1/N) ∑_{n=1}^{N} ( x^(n) − x̄ )²

Effect on the activations, for zero-mean weights and standardized inputs:

a_i = ∑_{j∈Amont(i)} w_ij x_j

E[a_i] = E[ ∑_{j∈Amont(i)} w_ij x_j ] = ∑_{j∈Amont(i)} E[w_ij] E[x_j] = 0

Var[a_i] = Var[ ∑_{j∈Amont(i)} w_ij x_j ] = ∑_{j∈Amont(i)} E[w_ij²] E[x_j²]
         = |Amont(i)| Var[w_ij] Var[x_j] = |Amont(i)| σ²[w_ij]

Data organization
Coding of the inputs
Weights initialization

For the variance of the activations to stay of order 1, we want:

Var[a_i] = |Amont(i)| σ²[w_ij] = 1
σ²[w_ij] = 1 / |Amont(i)|
σ[w_ij] ∝ |Amont(i)|^(−1/2)

We can therefore initialize the weights according to a uniform law over an interval:

w_ij ∈ [ −k / √|Amont(i)| , k / √|Amont(i)| ]   with 0.5 < k < 2

together with the transfer function f(x) = 1.71 tanh( (2/3) x ).
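The uniform fan-in initialization above can be sketched as follows; the value of k and the layer shape are illustrative choices within the stated range:

```python
import numpy as np

# Sketch of the uniform initialization above: weights drawn from
# [-k/sqrt(fan_in), k/sqrt(fan_in)], with fan_in = |Amont(i)|.
def init_weights(fan_in, fan_out, k=1.0, seed=0):
    rng = np.random.default_rng(seed)
    bound = k / np.sqrt(fan_in)
    return rng.uniform(-bound, bound, size=(fan_out, fan_in))

W = init_weights(fan_in=100, fan_out=50)
assert W.shape == (50, 100)
assert np.abs(W).max() <= 1.0 / np.sqrt(100)
```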
Exercise to do

Dataset (inputs with their target outputs):

D = { ([1, 1]ᵀ, [1, −1]ᵀ) ; ([−1, 1]ᵀ, [−1, 1]ᵀ) ; ([1, −1]ᵀ, [−1, 1]ᵀ) ; ([−1, −1]ᵀ, [1, −1]ᵀ) }

Transfer function:

f(x) = 1 / (1 + e⁻ˣ),   f′(x) = f(x) [1 − f(x)]

[Figure: MLP with initial weight values 1, 1, 0.5, 0.3, 0.4, 0.1, 0.3, 0.6, −0.8, 0.9, 0.2, −0.4, −0.2, −0.7.]
MLP is a universal approximator.

A neural network with 3 layers of sigmoidal hidden units can approximate a smooth multivariate mapping to arbitrary accuracy.

In (a) we see the output of a single sigmoidal unit as a function of two input variables x_1, x_2.
Adding the outputs from 2 such units can produce a ridge-like function (b).
Adding 2 ridges can give a function with a maximum (c).
Transforming this function with another sigmoid gives a localized response (d).

By taking linear combinations of these localized functions, we can approximate any smooth functional mapping.
Universal Approximator

Theorem [Hornik 1989]:
Three-layered artificial neural networks with linear outputs and sufficiently many increasing saturating hidden units can arbitrarily approach any measurable bounded function from one finite-dimensional space to another.

An artificial neural network is a universal approximator.

Notes:
- In theory, there is no need for any other network structure. However, in applications, it may be more convenient to use multiple layers, non-linear outputs, and so on.
- It is a purely existential theorem. It does not say how to determine an appropriate number of neurons in the hidden layer and weight values to approximate a given function with a given precision!
Links with factor analysis
MLP & PCA * **

An auto-associative (auto-encoder) MLP with linear outputs can realize a Karhunen-Loève transformation: the network is trained to reproduce its input (targets d_1 = x_1, …, d_n = x_n), the hidden layer providing the coding.

An auto-associative MLP is equivalent to a Principal Component Analysis (PCA).

[Figure: encoder/decoder MLP with inputs x_1, …, x_n, hidden coding layer, and outputs x̂_1, …, x̂_n with targets d_i = x_i; the coding axes u_1, u_2 correspond to the principal components of the data.]

* Bourlard H. & Kamp Y. (1988): «Auto-association by multilayer perceptrons and singular value decomposition», Biological Cybernetics, Vol. 59, pp. 291-294.
** Baldi P. & Hornik K. (1989): «Neural networks and principal component analysis: Learning from examples without local minima», Neural Networks, Vol. 2, N 1, pp. 53-58.
Links with factor analysis
MLP & LDA * **

[Figure: evolution of the 10 digit classes (0-9) from the input layer through hidden layers #1 and #2 to the output layer, showing progressive class separation.]

* Bennani Y. (1992): «Approches Connexionnistes pour la Modélisation et l'Identification», Thèse de Doctorat, LRI-Université Paris 11, Orsay.
** Gallinari P., Thiria S., Badran F., Fogelman-Soulie F. (1991): «On the relations between discriminant analysis and multilayer perceptrons», Neural Networks, Vol. 4, N 3, pp. 349-360.
Structured Networks
Complete Connections

[Figure: fully connected MLP with inputs x_1, …, x_n, hidden layer, network/computed output y and target/desired output d.]
Structured Networks
Complete Connections with Context [Elman *]

[Figure: inputs x_1(t), …, x_n(t) plus context units c_1(t−1), …, c_m(t−1) copied from the hidden layer, feeding the hidden layer and the output y with target d.]

* Elman J.L. (1990): «Finding structure in time», Cognitive Science, Vol. 14, pp. 179-212.
Structured Networks
Complete Connections with Context [Jordan *]

[Figure: inputs x_1(t), …, x_n(t) plus context units y_1(t−1), …, y_p(t−1) copied from the outputs, feeding the hidden layer and the output y with target d.]

* Jordan M.I. (1992): «Constrained supervised learning», Journal of Mathematical Psychology, Vol. 36, pp. 396-425.
Structured Networks
Synthesis of speech [Sejnowski & Rosenberg *]

NETtalk: text to speech
[Figure: NETtalk slides a window of letters over the input text and outputs the phoneme (here /z/) for the letter at the centre of the window.]

* Sejnowski T.J. & Rosenberg C.R. (1987): «Parallel Networks that learn to pronounce English text», Complex Systems, Vol. 1, pp. 145-168.
Structured Networks
Local Connections

Feature extractors: each hidden unit is connected only to a local receptive field of the input and extracts local features.

The use of local connections greatly reduces the number of weights in a network.

[Figure: input, hidden and output layers with local receptive fields; each hidden unit has its own weight vector (w_1^(i), w_2^(i), w_3^(i), w_4^(i)).]
Structured networks
Constrained or shared-weight connections

Feature extractors: the hidden units share the same weight vector (w_1, w_2, w_3, w_4), which acts as a convolution filter slid across the receptive fields of the input.

An interesting property of the weight-sharing mechanism is the very small number of free parameters.

[Figure: local features, receptive field, convolution filter and output.]
Structured Networks
Constrained Connections or Shared Weights: TDNN (Time Delay Neural Network)

Feature extractors: the same convolution filter (a weight matrix such as [w1 … w8]) is applied at successive time positions of the time-frequency input.

With a window of size n, a delay d and N input frames, the number of filter positions is M = ((N − n) / d) + 1; here n = 2, d = 1, N = 6.

[Figure: time-frequency input, shared convolution filters applied at each delay, and output.]

* Waibel A., Hanazawa T., Hinton G., Shikano K., Lang K. (1987): «Phoneme recognition using Time-Delay Neural Networks», Tech. Rep. ATR, TR-1-006.
Structured Networks
Convolutional Neural Network
Super-Resolution Convolutional Neural Network
Super-Resolution Convolutional Neural Network
•Input Image: LR single-channel image up-sampled to desired higher resolution
•Conv. Layer 1: Patch extraction
• 64 filters of size 1 x 9 x 9
• Activation function: ReLU
• Output: 64 feature maps
• Parameters to optimize: 1 x 9 x 9 x 64 = 5184 weights and 64 biases
•Conv. Layer 2: Non-linear mapping
• 32 filters of size 64 x 1 x 1
• Activation function: ReLU
• Output: 32 feature maps
• Parameters to optimize: 64 x 1 x 1 x 32 = 2048 weights and 32 biases
•Conv. Layer 3: Reconstruction
• 1 filter of size 32 x 5 x 5
• Activation function: Identity
• Output: HR image
• Parameters to optimize: 32 x 5 x 5 x 1 = 800 weights and 1 bias
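The parameter counts listed above follow the same rule (channels × kernel height × kernel width × number of filters, plus one bias per filter); a small sketch checks them:

```python
# Sketch checking the SRCNN parameter counts above
# (filters of shape channels_in x height x width).
def conv_params(c_in, k_h, k_w, n_filters):
    weights = c_in * k_h * k_w * n_filters
    biases = n_filters
    return weights, biases

assert conv_params(1, 9, 9, 64) == (5184, 64)   # Layer 1: patch extraction
assert conv_params(64, 1, 1, 32) == (2048, 32)  # Layer 2: non-linear mapping
assert conv_params(32, 5, 5, 1) == (800, 1)     # Layer 3: reconstruction
```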
[Figure: FSRCNN pipeline from the input low-resolution image through superpixel layers to the output high-resolution image.]
•Conv. Layer 1: Feature extraction
• 56 filters of size 1 x 5 x 5
• Activation function: PReLU
• Output: 56 feature maps
• Parameters: 1 x 5 x 5 x 56 = 1400 weights and 56 biases
•Conv. Layer 2: Shrinking
• 12 filters of size 56 x 1 x 1
• Activation function: PReLU
• Output: 12 feature maps
• Parameters: 56 x 1 x 1 x 12 = 672 weights and 12 biases
•Conv. Layers 3–6: Mapping
• 4 x 12 filters of size 12 x 3 x 3
• Activation function: PReLU
• Output: HR feature maps
• Parameters: 4 x 12 x 3 x 3 x 12 = 5184 weights and 48 biases
•Conv. Layer 7: Expanding
• 56 filters of size 12 x 1 x 1
• Activation function: PReLU
• Output: 56 feature maps
• Parameters: 12 x 1 x 1 x 56 = 672 weights and 56 biases
…
•DeConv Layer 8: Transposed Convolution (Deconvolution)
• One filter of size 56 x 9 x 9
• Activation function: PReLU
• Output: HR image
• Parameters: 56 x 9 x 9 x 1 = 4536 weights and 1 bias
Total number of weights: 12464 (plus a very small number of parameters in PReLU layers)
1. Feature extraction: Extracts a set of feature maps directly from the LR image.
2. Shrinking: Reduces dimension of feature vectors (thus decreasing the number of parameters) by using a smaller number of filters (compared to the number of filters used for feature extraction).
3. Non-linear mapping: Maps the feature maps representing LR to HR patches. This step is performed using several mapping layers.
4. Expanding: Increases dimension of feature vectors. This operation performs the inverse operation as the shrinking layers, in order to more accurately produce the HR image.
5. Deconvolution: Produces the HR image from HR features.
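The weight total quoted above (12464) can be checked from the per-layer filter shapes:

```python
# Sketch checking the FSRCNN weight total quoted above.
layer_weights = [
    1 * 5 * 5 * 56,       # feature extraction: 56 filters of 1x5x5
    56 * 1 * 1 * 12,      # shrinking: 12 filters of 56x1x1
    4 * (12 * 3 * 3) * 12,  # mapping: 4 layers of 12 filters of 12x3x3
    12 * 1 * 1 * 56,      # expanding: 56 filters of 12x1x1
    56 * 9 * 9 * 1,       # deconvolution: one filter of 56x9x9
]
assert layer_weights == [1400, 672, 5184, 672, 4536]
assert sum(layer_weights) == 12464
```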
Computing the output values of a discrete convolution

Discrete convolution is a linear transformation that preserves the notion of ordering.
It is sparse (only a few input units contribute to a given output unit) and reuses parameters (the same weights are applied to multiple locations in the input).

The light blue grid is called the input feature map. A kernel (shaded area) slides across the input feature map. At each location, the product between each element of the kernel and the input element it overlaps is computed, and the results are summed up to obtain the output in the current location.
The final outputs of this procedure are called output feature maps.

stride: distance between two consecutive positions of the kernel
zero padding: number of zeros concatenated at the beginning and at the end of an axis

[Figure: a (3×3) kernel applied to a (5×5) input padded with a border of zeros, using unit strides.]
What is pooling in a deep architecture?

Pooling is a concept in deep learning visual object recognition that goes hand-in-hand with convolution.
The idea is that a convolution (or a local neural network feature detector) maps a region of an image to a feature map.
By organizing alternating feature detection and pooling layers in a hierarchy, flexibly deformed structures can be recognized as "variations on the same structural theme."
At the very highest level, the features might be object categories that are fully independent of the position of the object within the frame.

Pooling is one of the most important concepts of convolutional neural networks: it divides the input map into a set of rectangles and outputs the maximum of each, performing a nonlinear down-sampling.
The most common pooling layer filter is of size 2x2 (with stride 2), which discards three-fourths of the activations.
The role of the pooling layer is to reduce the resolution of the feature map while retaining the features required for classification, providing translational and rotational invariance.
What is pooling in a deep architecture?

There are three variants of the pooling operation, depending on their roots in regularization techniques:
• Stochastic pooling,
• Overlapping pooling,
• Fractional pooling.

Stochastic pooling:
A randomly picked activation within each pooling region is used, rather than a deterministic pooling operation, to regularize the network. Stochastic pooling performs the reduction of feature size but gives up the judicious selection of features for the sake of regularization.

Overlapping pooling:
Overlapping pooling shares the responsibility of local connection beyond the size of the previous convolutional filter, which breaks the orthogonal division of responsibility between the pooling layer and the convolutional layer. So, no information is gained if pooling windows overlap.

Fractional pooling:
The reduction ratio of the filter size due to pooling can be controlled by fractional pooling, which helps to increase the depth of the network. Unlike stochastic pooling, the randomness is related to the choice of pooling regions, not to the way pooling is performed inside each pooling region.
Pooling

Pooling works by sliding a window across the input and feeding the content of the window to a pooling function.
Pooling works very much like a discrete convolution, but replaces the linear combination described by the kernel with some other function.
Pooling operations reduce the size of feature maps by using some function to summarize subregions, such as taking the average or the maximum value.
Pooling layers provide invariance to small translations of the input.
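The window-sliding description above can be sketched generically, with the summarizing function passed as a parameter (the function name and defaults are illustrative):

```python
import numpy as np

# Sketch of pooling as described above: slide a (k x k) window
# across the input with a given stride and summarize each window
# with `func` (max pooling by default, average with np.mean).
def pool2d(x, k=2, stride=2, func=np.max):
    h = (x.shape[0] - k) // stride + 1
    w = (x.shape[1] - k) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = func(x[i*stride:i*stride+k, j*stride:j*stride+k])
    return out

x = np.array([[1.,  2.,  3.,  4.],
              [5.,  6.,  7.,  8.],
              [9.,  10., 11., 12.],
              [13., 14., 15., 16.]])
assert np.array_equal(pool2d(x), np.array([[6., 8.], [14., 16.]]))
assert np.array_equal(pool2d(x, func=np.mean), np.array([[3.5, 5.5], [11.5, 13.5]]))
```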
Pooling

[Figure: computing the output values of a (3×3) max pooling operation on a (5×5) input using (1×1) strides, and of a (3×3) average pooling operation on the same input with the same strides.]
In what follows we assume:
• square inputs (i_1 = i_2 = i),
• square kernel size (k_1 = k_2 = k),
• same strides along both axes (s_1 = s_2 = s),
• same zero padding along both axes (p_1 = p_2 = p).

No zero padding, unit strides

Relationship 1. For any i and k, and for s = 1 and p = 0,
o = (i − k) + 1.

Convolving a (3×3) kernel over a (4×4) input using unit strides (i.e., i = 4, k = 3, s = 1 and p = 0) → o = (4 − 3) + 1 = 2 → a (2×2) output.

N.B.: Blue maps are inputs, and cyan maps are outputs.
Zero padding, unit strides

Padding with p zeros changes the effective input size from i to i + 2p.
Relationship 1 can then be used to infer the following relationship:

Relationship 2. For any i, k and p, and for s = 1,
o = (i − k) + 2p + 1.

Convolving a (4×4) kernel over a (5×5) input padded with a 2×2 border of zeros using unit strides (i.e., i = 5, k = 4, s = 1 and p = 2) → o = (5 − 4) + 2×2 + 1 = 6 → a (6×6) output.
Half (same) padding

Having the output size be the same as the input size (i.e., o = i) can be a desirable property:

Relationship 3. For any i and for k odd (k = 2n + 1, n ∈ ℕ), s = 1 and p = ⌊k/2⌋ = n,
o = i + 2⌊k/2⌋ − (k − 1) = i + 2n − 2n = i.

(Half padding, unit strides.) Convolving a (3×3) kernel over a (5×5) input using half padding and unit strides (i.e., i = 5, k = 3, s = 1 and p = 1).
Full padding

While convolving a kernel generally decreases the output size with respect to the input size, sometimes the opposite is required. This can be achieved with proper zero padding:

Relationship 4. For any i and k, and for p = k − 1 and s = 1,
o = i + 2(k − 1) − (k − 1) = i + (k − 1).

(Full padding, unit strides.) Convolving a (3×3) kernel over a (5×5) input using full padding and unit strides (i.e., i = 5, k = 3, s = 1 and p = 2).
No zero padding, non-unit strides

Relationship 5. For any i, k and s, and for p = 0,
o = ⌊(i − k)/s⌋ + 1.

Convolving a (3×3) kernel over a (6×6) input padded with a 1×1 border of zeros using (2×2) strides (i.e., i = 6, k = 3, s = 2 and p = 1).
In this case, the bottom row and right column of the zero-padded input are not covered by the kernel.
Zero padding, non-unit strides

The most general case (convolving over a zero-padded input using non-unit strides) can be derived by applying Relationship 5 on an effective input of size i + 2p, in analogy to what was done for Relationship 2:

Relationship 6. For any i, k, p and s,
o = ⌊(i + 2p − k)/s⌋ + 1.

As before, the floor function means that in some cases a convolution will produce the same output size for multiple input sizes.
More specifically, if i + 2p − k is a multiple of s, then any input size j = i + a, a ∈ {0, …, s − 1} will produce the same output size. Note that this ambiguity applies only for s > 1.
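Relationship 6 subsumes the earlier relationships, which a short sketch can confirm (the function name is illustrative):

```python
import math

# Sketch of Relationship 6: output size of a convolution with
# input size i, kernel k, padding p and stride s.
def conv_output_size(i, k, p=0, s=1):
    return math.floor((i + 2 * p - k) / s) + 1

assert conv_output_size(4, 3) == 2             # Relationship 1
assert conv_output_size(5, 4, p=2) == 6        # Relationship 2
assert conv_output_size(5, 3, p=1) == 5        # Relationship 3 (half padding)
assert conv_output_size(5, 3, p=2) == 7        # Relationship 4: i + (k - 1)
assert conv_output_size(6, 3, p=1, s=2) == 3   # Relationship 6
```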
Transposed convolution arithmetic

Transposed convolutions use a transformation going in the opposite direction of a normal convolution, i.e., from something that has the shape of the output of some convolution to something that has the shape of its input, while maintaining a connectivity pattern that is compatible with said convolution.

For instance:
- the decoding layer of a convolutional autoencoder,
- projecting feature maps to a higher-dimensional space.

If the input and output were to be unrolled into vectors from left to right, top to bottom, the convolution could be represented as a sparse matrix C where the non-zero elements are the elements w_{i,j} of the kernel (with i and j being the row and column of the kernel respectively).
This linear operation takes the input matrix flattened as a 16-dimensional vector and produces a 4-dimensional vector that is later reshaped as the (2×2) output matrix.

Using this representation, the backward pass is easily obtained by transposing C; in other words, the error is backpropagated by multiplying the loss with Cᵀ. This operation takes a 4-dimensional vector as input and produces a 16-dimensional vector as output, and its connectivity pattern is compatible with C by construction.

Notably, the kernel w defines both the matrices C and Cᵀ used for the forward and backward passes.

Transposed convolutions — also called fractionally strided convolutions or deconvolutions — work by swapping the forward and backward passes of a convolution.
No zero padding, unit strides, transposed

The simplest way to think about a transposed convolution on a given input is to imagine such an input as being the result of a direct convolution applied on some initial feature map.
The transposed convolution can then be considered as the operation that allows us to recover the shape of this initial feature map.

The transpose of convolving a (3×3) kernel over a (4×4) input using unit strides (i.e., i = 4, k = 3, s = 1 and p = 0) is equivalent to convolving a (3×3) kernel over a (2×2) input padded with a (2×2) border of zeros using unit strides (i.e., i′ = 2, k′ = k, s′ = 1 and p′ = 2).

Relationship 8. A convolution described by s = 1, p = 0 and k has an associated transposed convolution described by k′ = k, s′ = s and p′ = k − 1, and its output size is
o′ = i′ + (k − 1).
This corresponds to a fully padded convolution with unit strides.
Transposed convolution animations

No padding, no strides, transposed; arbitrary padding, no strides, transposed; half padding, no strides, transposed; full padding, no strides, transposed; no padding, strides, transposed; padding, strides, transposed; padding, strides, transposed (odd).

N.B.: Blue maps are inputs, and cyan maps are outputs.

Vincent Dumoulin, Francesco Visin - A guide to convolution arithmetic for deep learning. https://github.com/vdumoulin/conv_arithmetic
Structured networks
LeNet for Digit recognition [Yann LeCun *]

* Le Cun Y., Boser B., Denker J.S., Henderson D., Howard R.E., Hubbard W., Jackel L.D. (1989): «Back-propagation applied to handwritten zip code recognition», Neural Computation, Vol. 1, pp. 541-551.