Artificial Neural Networks: From Perceptron to Deep Learning
© 2021 Younès Bennani - USPN

Artificial Neural Networks
From Perceptron to Deep Learning

Younès BENNANI, Full Professor
Master of Science in Informatics
Exploration Informatique des Données et Décisionnel (EID2), Science des Données (WISD & MASD), Mathématiques des Données (MD)

© 2001-2021 Y. Bennani: This document is the property of Younès Bennani, Professor at USPN. It may not be distributed or reproduced without his written authorization (younes.bennani@sorbonne-paris-nord.fr).
General representation of a continuous function
Kolmogorov (1957)

Theorem [Kolmogorov (1957)]:
Any continuous function f(x) defined on [0,1]^n can be written in the form:

f(x) = ∑_{i=1}^{2n+1} F_i ( ∑_{j=1}^{n} G_{ij}(x_j) )

where the F_i and G_ij are continuous functions of one variable.

Example:
f(x) = x_1 · x_2 = (1/4) [ (x_1 + x_2)² − (x_1 − x_2)² ]

A neural network is able to approximate the functions F_i and G_ij using functions of the form f(w_0 + wᵀx).

[Figure: network realizing the Kolmogorov decomposition, with inputs x_0, x_1, …, x_n, hidden units G_{1,1}, …, G_{2n+1,n} and summation units F_1, …, F_{2n+1}.]
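The product identity used in the example above can be checked numerically; this is a small sketch (the function name is illustrative, not from the course):

```python
# Sketch: verify the identity x1*x2 = ((x1+x2)^2 - (x1-x2)^2)/4,
# which rewrites a product as a combination of sums and squares.
def product_via_squares(x1, x2):
    return ((x1 + x2) ** 2 - (x1 - x2) ** 2) / 4

for a, b in [(2.0, 3.0), (-1.5, 4.0), (0.0, 7.0)]:
    assert abs(product_via_squares(a, b) - a * b) < 1e-12
```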
Multi-Layers Perceptron (MLP)

Architecture:
similar to that of the Perceptron or Madaline + intermediate processing layers (hidden layers)
- External layers: Input (e units), Output (s units)
- Internal layers: Hidden (c units)
Notation: <e ∣ c ∣ s>, example: <6 ∣ 4 ∣ 2>

Goal: learn from a sample { x^(n) → d^(n) }_{n=1}^{N}, with x^(n) ∈ ℜ^D and d^(n) ∈ ℜ^K:

D_N = { (x^(1), d^(1)), (x^(2), d^(2)), …, (x^(N), d^(N)) }

[Figure: input x, hidden layer with weights w^(1), output y with weights w^(2); d is the target/desired output, y the computed/network output.]
Multi-Layers Perceptron (MLP)

Notations:
E : set of the input units
S : set of the output units
Amont(k) : set of the units whose outputs serve as inputs to unit k
Aval(k) : set of the units that use as input the output of unit k

By definition, we have:
∀ i ∈ E, Amont(i) = ∅
∀ i ∈ S, Aval(i) = ∅

Example (network with units 0 to 11):
Amont(7) = {0, 1, 2, 3, 4, 5}    Aval(7) = {10, 11}
Amont(2) = ∅                     Aval(2) = {6, 7, 8, 9}
Amont(11) = {6, 7, 8, 9}         Aval(11) = ∅
Multi-Layers Perceptron (MLP)

[Figure: MLP with input layer of D units x_1, …, x_D plus bias x_0 = 1, hidden layer of M units z_1, …, z_M plus bias z_0 = 1, and output layer of K units y_1, …, y_K; first-layer weights w^(1) (e.g. w_10^(1), w_MD^(1)) and second-layer weights w^(2) (e.g. w_10^(2), w_K1^(2), w_KM^(2)).]
Learning from examples

Fixed but unknown probability distribution: p(x, d) = p(x) p(d/x)

Theoretical risk (generalization error), with L the loss function:

R(w) = ∫ L[ y(x, w), d ] dp(x, d)

where the model family is Η = { y(x, w) / w ∈ Ω }.

The problem of learning is often presented as the minimization of a risk/cost/error function.
Learning consists in finding the parameters:

w* = Argmin_w R(w)

The theoretical risk is not calculable, since p(x, d) is unknown.
But a sample of N i.i.d. examples drawn from p(x, d) is known.
In practice, we cannot minimize R(w) directly; we then use an induction principle.
Most common = Empirical Risk Minimization (ERM).
Empirical Risk Minimization

D_N = { (x^(1), d^(1)), (x^(2), d^(2)), …, (x^(N), d^(N)) }
(Input: image, signal, measures, …; Target output: class, prediction, score, rank, cluster, …)

Empirical risk (learning error):

R(w) = (1/N) ∑_{n=1}^{N} L[ y(x^(n), w), d^(n) ] = (1/N) ∑_{n=1}^{N} R^(n)(w)

where R^(n)(w) is the local empirical risk, and the model family is Η = { y(x, w) / w ∈ Ω }.

Learning consists in finding the parameters:

w⁺ = Argmin_w R(w)

Gradient descent: iterative update of the weights using

w_ji(t+1) ← w_ji(t) − ε(t) ∂R(w)/∂w_ji
Δw_ji(t) = −ε(t) ∂R(w)/∂w_ji

where ∂R(w)/∂w_ji is the gradient and ε(t), the learning-rate parameter, is a positive number.
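The gradient-descent update above can be sketched on a toy one-parameter risk; this is a minimal illustration (the quadratic risk and function names are assumptions, not from the course):

```python
# Minimal sketch of the update w(t+1) <- w(t) - eps * dR/dw,
# applied to the toy risk R(w) = (w - 3)^2 with gradient 2(w - 3).
def gradient_descent(grad, w0, eps=0.1, steps=100):
    w = w0
    for _ in range(steps):
        w = w - eps * grad(w)  # one gradient-descent step
    return w

w_star = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
assert abs(w_star - 3.0) < 1e-6  # converges to the minimizer w* = 3
```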
Gradient Descent algorithm

Cost function: Mean Squared Error

R(w) = (1/N) ∑_{n=1}^{N} R^(n)(w)

R^(n)(w) = (1/2) ∑_{k=1}^{K} ( y_k^(n) − d_k^(n) )²,   with y_k^(n) = y_k(x^(n), w)

where y_k^(n) is the network output and d_k^(n) the target output.

For an MLP with 1 hidden layer and 2 weight layers w^(1) and w^(2):

R^(n)(w) = (1/2) ∑_{k=1}^{K} [ f( w_k0^(2) + ∑_{j=1}^{M} w_kj^(2) f( w_j0^(1) + ∑_{i=1}^{D} w_ji^(1) x_i^(n) ) ) − d_k^(n) ]²

Gradient descent update of the weights:

w_ji(t+1) ← w_ji(t) − ε(t) ∂R^(n)(w)/∂w_ji
Multi-Layers Perceptron (MLP)

The activation of the j-th hidden unit is:

a_j = w_j0^(1) + ∑_{i∈Amont(j)} w_ji^(1) x_i

The output of this hidden unit is obtained by a nonlinear transformation of the activation:

z_j = f(a_j)

In the same way, the activation and the output of the k-th output unit can be obtained as follows:

a_k = w_k0^(2) + ∑_{j∈Amont(k)} w_kj^(2) z_j,   y_k = f(a_k)

If we combine the calculation of the outputs of the hidden units and that of the output units, we obtain for the k-th output of the network the following expression:

y_k = f( w_k0^(2) + ∑_{j=1}^{M} w_kj^(2) f( w_j0^(1) + ∑_{i=1}^{D} w_ji^(1) x_i ) )

[Figure: MLP with inputs x_0, x_1, …, x_D, hidden layer, network output y and target output d.]
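The forward-pass equations above can be sketched in a few lines of NumPy; this is an illustrative implementation for a <D | M | K> MLP with logistic activations (weight names and shapes are assumptions):

```python
import numpy as np

# Sketch of the forward pass: z_j = f(a_j) in the hidden layer,
# y_k = f(a_k) in the output layer, with logistic transfer function f.
def forward(x, W1, b1, W2, b2):
    f = lambda a: 1.0 / (1.0 + np.exp(-a))  # logistic transfer function
    z = f(W1 @ x + b1)                      # hidden outputs
    y = f(W2 @ z + b2)                      # network outputs
    return y

rng = np.random.default_rng(0)
D, M, K = 3, 4, 2
y = forward(rng.normal(size=D), rng.normal(size=(M, D)), np.zeros(M),
            rng.normal(size=(K, M)), np.zeros(K))
assert y.shape == (K,) and np.all((0 < y) & (y < 1))
```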
Multi-Layers Perceptron
Transition functions

f(x) = tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)
f(x) = 1 / (1 + e⁻ˣ)
f(x) = (eˣ − 1) / (eˣ + 1)

For the logistic function, the derivative satisfies:

f′(x) = f(x) (1 − f(x))

[Figure: plots of f(x) and f′(x).]
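The transfer functions and the derivative identity above can be checked numerically; a small sketch (function names are illustrative):

```python
import math

# Sketch of the transfer functions above and of the logistic
# derivative identity f'(x) = f(x) (1 - f(x)).
def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def logistic_prime(x):
    fx = logistic(x)
    return fx * (1.0 - fx)

def tanh_from_exp(x):
    # tanh written from its exponential definition
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

assert abs(tanh_from_exp(0.7) - math.tanh(0.7)) < 1e-12
# check the derivative identity against a central difference
h = 1e-6
num = (logistic(0.3 + h) - logistic(0.3 - h)) / (2 * h)
assert abs(num - logistic_prime(0.3)) < 1e-8
```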
Rectified Linear Units

- More efficient gradient propagation: the derivative is 0 or constant, which can simply be folded into the learning rate.
- More efficient computation: only comparison, addition and multiplication.
  - Leaky ReLU: f(x) = x if x > 0, else ax, where 0 ≤ a ≤ 1, so that the derivative is not 0 and some learning can still occur in that case.
  - Lots of other variations.
- Sparse activation: for example, in a randomly initialized network, only about 50% of hidden units are activated (have a non-zero output).
Rectified Linear Units

Illustration of some possible decision boundaries
[Figure: decision regions obtainable with increasing depth: hyperplanes (single layer), convex regions (combining hyperplanes with an "and"), and arbitrary decision regions (combining convex regions with an "or").]
Multi-Layers Perceptron
Error functions

Mean squared error:
R_mse(w) = (1/2) ∑_{n=1}^{N} ( d^(n) − y^(n) )²

Multiple logistic (cross-entropy):
R_multiple-logistic(w) = ∑_{n=1}^{N} ∑_{i=1}^{D} [ d_i^(n) log( d_i^(n) / y_i^(n) ) + (1 − d_i^(n)) log( (1 − d_i^(n)) / (1 − y_i^(n)) ) ]

Log-likelihood (softmax outputs):
R_log-likelihood(w) = ∑_{n=1}^{N} d^(n) log( d^(n) / p^(n) )   with   p^(n) = e^{y^(n)} / ∑_j e^{y^(j)}

Weighted mean squared error:
R_mse-weighted(w) = ∑_{n=1}^{N} ( d^(n) − y^(n) )ᵀ Σ⁻¹ ( d^(n) − y^(n) )
with Σ⁻¹ built from (1/N) ∑_{n=1}^{N} ( d^(n) − y^(n) )( d^(n) − y^(n) )ᵀ
Multi-Layers Perceptron
Rules for updating the weights

Let z = (x, d) be a realization of the random variables (X, D) with p(x, d) = p(x) p(d / x).

Theoretical risk:  R(w) = ∫_Z L(z, w) dp(z)
Empirical risk:    R̃(w) = (1/N) ∑_{n=1}^{N} L(z^(n), w)
Learning:          w⁺ = argmin_w R̃(w)

For the MLP with squared error:

R̃_mlp^(n)(w) = (1/2) ∑_{k=1}^{K} ( y_k^(n) − d_k^(n) )²
             = (1/2) ∑_{k=1}^{K} [ f( w_k0^(2) + ∑_{j=1}^{M} w_kj^(2) f( w_j0^(1) + ∑_{i=1}^{D} w_ji^(1) x_i ) ) − d_k^(n) ]²
Multi-Layers Perceptron
Rules for updating the weights

Δw_ji = w_ji(t+1) − w_ji(t) = −ε(t) ∇_{w_ji} R̃_mlp^(n)(w)

∇_{w_ji} R̃_mlp^(n)(w) = δ_j^(n) s_i

where s_i is the output of unit i feeding weight w_ji, and:

δ_j^(n) = f′(a_j) ( y_j^(n) − d_j^(n) )               if j ∈ Output
δ_j^(n) = f′(a_j) ∑_{h∈Aval(j)} w_hj δ_h^(n)          if j ∉ Output
Gradient Backpropagation (GBP)

R̃_mlp^(n)(w) = (1/2) ∑_{k=1}^{K} ( y_k^(n) − d_k^(n) )²

∂R̃_mlp^(n)(w)/∂w_ji = [ ∂R̃_mlp^(n)(w)/∂a_j ] · [ ∂a_j/∂w_ji ] = δ_j^(n) s_i

If j ∈ Output:
δ_j^(n) = ∂R̃_mlp^(n)(w)/∂a_j = [ ∂R̃_mlp^(n)(w)/∂y_j ] · [ ∂y_j/∂a_j ] = ( y_j^(n) − d_j^(n) ) f′(a_j)

If j ∉ Output:
δ_j^(n) = ∂R̃_mlp^(n)(w)/∂a_j = [ ∂R̃_mlp^(n)(w)/∂s_j ] · [ ∂s_j/∂a_j ]
        = f′(a_j) ∑_{h∈Aval(j)} [ ∂R̃_mlp^(n)(w)/∂a_h ] · [ ∂a_h/∂s_j ]
        = f′(a_j) ∑_{h∈Aval(j)} w_hj δ_h^(n)
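The delta rules above can be sketched for a small two-layer sigmoid network and checked against a finite-difference gradient; this is an illustrative implementation (bias terms omitted, names and shapes are assumptions):

```python
import numpy as np

# Sketch of the backpropagation deltas above for a <D | M | K> MLP
# with sigmoid units and squared error R = 1/2 * sum_k (y_k - d_k)^2.
f  = lambda a: 1.0 / (1.0 + np.exp(-a))
fp = lambda a: f(a) * (1.0 - f(a))  # f'(a) = f(a)(1 - f(a))

def grads(x, d, W1, W2):
    a1 = W1 @ x; z = f(a1)              # hidden activations / outputs
    a2 = W2 @ z; y = f(a2)              # output activations / outputs
    delta2 = (y - d) * fp(a2)           # delta for output units
    delta1 = fp(a1) * (W2.T @ delta2)   # delta backpropagated to hidden units
    return np.outer(delta2, z), np.outer(delta1, x)  # dR/dW2, dR/dW1

rng = np.random.default_rng(1)
x, d = rng.normal(size=3), np.array([0.0, 1.0])
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
g2, g1 = grads(x, d, W1, W2)

def risk(W1, W2):
    y = f(W2 @ f(W1 @ x))
    return 0.5 * np.sum((y - d) ** 2)

# finite-difference check of one first-layer gradient component
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
assert abs((risk(W1p, W2) - risk(W1m, W2)) / (2 * eps) - g1[0, 0]) < 1e-6
```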
Gradient Backpropagation (GBP)

Propagation:
a_i = ∑_{j∈Amont(i)} w_ij x_j,   y_i = f(a_i)

Back-Propagation:
δ_i^(n) = f′(a_i) ∑_{h∈Aval(i)} w_hi δ_h^(n)

[Figure: forward propagation of the inputs x_0, x_1, …, x_n through unit i, and backward propagation of the deltas δ_h through the weights w_hi.]
Multi-Layers Perceptron
MLP Learning algorithm

1- Initialize the weights W_0 randomly.
2- Randomly choose a data couple (x^(n), d^(n)).
3- Calculate the state of the network by propagation: a_j = w_j0 + ∑_i w_ji x_i,  x_j = f(a_j).
4- Calculate a gradient approximation.
5- Adapt the weights:
   w_ji(t+1) = w_ji(t) − ε(t) δ_j^(n) s_i
   δ_j^(n) = f′(a_j) ( y_j^(n) − d_j^(n) )          if j ∈ Output
   δ_j^(n) = f′(a_j) ∑_{h∈Aval(j)} w_hj δ_h^(n)     if j ∉ Output
   where ε(t) is the step of the gradient.
6- Repeat from 2 to 5 until an acceptable error value is obtained.

Computational complexity of backpropagation: O(W), where W = number of free parameters (weights and biases).
Complexity of a 2-layer MLP: W = (D+1)H + (H+1)O, with D: #Input, H: #Hidden, O: #Output.
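The parameter count W = (D+1)H + (H+1)O quoted above can be sketched as a one-line function (the function name is illustrative):

```python
# Sketch of the free-parameter count of a 2-layer MLP:
# each hidden unit has D weights + 1 bias, each output H weights + 1 bias.
def mlp_param_count(D, H, O):
    return (D + 1) * H + (H + 1) * O

# Example architecture <6 | 4 | 2> from the notation slide:
assert mlp_param_count(6, 4, 2) == 38  # 7*4 + 5*2
```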
MLP Example

Learning the XOR function ψ(x, w) = x_1 ⊕ x_2 from the dataset:

D = { ([1, 1]ᵀ, −1) ; ([1, −1]ᵀ, 1) ; ([−1, 1]ᵀ, 1) ; ([−1, −1]ᵀ, −1) }

[Figure: a two-input, two-hidden-unit, one-output network solving XOR; its weights (values 1.0, −1.0, 0.5, 1.5 in the figure) define the two separating lines below.]

1.0 x_1 + 1.0 x_2 + 1.5 = 0   i.e.   x_1 + x_2 + 1.5 = 0
1.0 x_1 + 1.0 x_2 + 0.5 = 0   i.e.   x_1 + x_2 + 0.5 = 0

MLP Example
Data organization
Coding of the outputs

Coding « 1-among-C » (one output unit per category, C = 3 here):

Category | Coding (0/1) | Coding (-1/1)
Blue     | 1 0 0        | 1 -1 -1
Red      | 0 0 1        | -1 -1 1
White    | 0 1 0        | -1 1 -1

Coding « 1-among-(C-1) » (C−1 = 2 output units):

Coding (0/1):  1 0 | 0 0 | 0 1
Coding (-1/1): 1 -1 | -1 -1 | -1 1

Thermometer coding (for ranked classes):

Coding (0/1): 1 1 1 | 0 0 1 | 0 1 1

[Figure: ranked regions 1/2/3 in the input space and a neural net predicting the classes ranking.]
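The "1-among-C" coding above can be sketched as a small helper covering both the 0/1 and −1/1 variants (the function name is illustrative):

```python
# Sketch of the "1-among-C" output coding above: the unit of the
# target category is set to 1, all others to `low` (0 or -1).
def one_among_c(index, C, low=0):
    return [1 if i == index else low for i in range(C)]

assert one_among_c(0, 3) == [1, 0, 0]           # coding (0/1)
assert one_among_c(2, 3, low=-1) == [-1, -1, 1]  # coding (-1/1)
```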
Data organization
Coding of the inputs

Why is data normalization necessary for machine learning models?
Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values.
Urvashi Jaitley, Oct 8, 2018 (https://medium.com/@urvashilluniya)
As with most statistical techniques, it is always in our interest to pre-process the data so that it is centered and scaled.
Standardization:

x^(n) ← ( x^(n) − x̄ ) / σ
with  x̄ = (1/N) ∑_{n=1}^{N} x^(n)   and   σ² = (1/N) ∑_{n=1}^{N} ( x^(n) − x̄ )²

Effect on the activations, for zero-mean weights and standardized inputs:

a_i = ∑_{j∈Amont(i)} w_ij x_j

E[a_i] = E[ ∑_{j∈Amont(i)} w_ij x_j ] = ∑_{j∈Amont(i)} E[w_ij] E[x_j] = 0

Var[a_i] = Var[ ∑_{j∈Amont(i)} w_ij x_j ] = ∑_{j∈Amont(i)} E[w_ij²] E[x_j²]
         = |Amont(i)| Var[w_ij] Var[x_j] = |Amont(i)| σ²[w_ij]

Data organization
Coding of the inputs
Weights initialization

For the variance of the activations to stay of order 1, we want:

Var[a_i] = |Amont(i)| σ²[w_ij] = 1
σ²[w_ij] = 1 / |Amont(i)|
σ[w_ij] ∝ |Amont(i)|^(−1/2)

We can therefore initialize the weights according to a uniform law over an interval:

w_ij ∈ [ −k / √|Amont(i)| , k / √|Amont(i)| ]   with 0.5 < k < 2

together with the transfer function f(x) = 1.71 tanh( (2/3) x ).
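The uniform fan-in initialization above can be sketched as follows; the value of k and the layer shape are illustrative choices within the stated range:

```python
import numpy as np

# Sketch of the uniform initialization above: weights drawn from
# [-k/sqrt(fan_in), k/sqrt(fan_in)], with fan_in = |Amont(i)|.
def init_weights(fan_in, fan_out, k=1.0, seed=0):
    rng = np.random.default_rng(seed)
    bound = k / np.sqrt(fan_in)
    return rng.uniform(-bound, bound, size=(fan_out, fan_in))

W = init_weights(fan_in=100, fan_out=50)
assert W.shape == (50, 100)
assert np.abs(W).max() <= 1.0 / np.sqrt(100)
```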
Exercise to do

Dataset (inputs with their target outputs):

D = { ([1, 1]ᵀ, [1, −1]ᵀ) ; ([−1, 1]ᵀ, [−1, 1]ᵀ) ; ([1, −1]ᵀ, [−1, 1]ᵀ) ; ([−1, −1]ᵀ, [1, −1]ᵀ) }

Transfer function:

f(x) = 1 / (1 + e⁻ˣ),   f′(x) = f(x) [1 − f(x)]

[Figure: MLP with initial weight values 1, 1, 0.5, 0.3, 0.4, 0.1, 0.3, 0.6, −0.8, 0.9, 0.2, −0.4, −0.2, −0.7.]
MLP is a universal approximator.

A neural network with 3 layers of sigmoidal hidden units can approximate a smooth multivariate mapping to arbitrary accuracy.

In (a) we see the output of a single sigmoidal unit as a function of two input variables x_1, x_2.
Adding the outputs from 2 such units can produce a ridge-like function (b).
Adding 2 ridges can give a function with a maximum (c).
Transforming this function with another sigmoid gives a localized response (d).

By taking linear combinations of these localized functions, we can approximate any smooth functional mapping.
Universal Approximator

Theorem [Hornik 1989]:
Three-layered artificial neural networks with linear outputs and sufficiently many increasing saturating hidden units can arbitrarily approach any measurable bounded function from one finite-dimensional space to another.

An artificial neural network is a universal approximator.

Notes:
- In theory, there is no need for any other network structure. However, in applications, it may be more convenient to use multiple layers, non-linear outputs, and so on.
- It is a purely existential theorem. It does not say how to determine an appropriate number of neurons in the hidden layer and weight values to approximate a given function with a given precision!
Links with factor analysis
MLP & PCA * **

An auto-associative (auto-encoder) MLP with linear outputs can realize a Karhunen-Loève transformation: the network is trained to reproduce its input (targets d_1 = x_1, …, d_n = x_n), the hidden layer providing the coding.

An auto-associative MLP is equivalent to a Principal Component Analysis (PCA).

[Figure: encoder/decoder MLP with inputs x_1, …, x_n, hidden coding layer, and outputs x̂_1, …, x̂_n with targets d_i = x_i; the coding axes u_1, u_2 correspond to the principal components of the data.]

* Bourlard H. & Kamp Y. (1988): «Auto-association by multilayer perceptrons and singular value decomposition», Biological Cybernetics, Vol. 59, pp. 291-294.
** Baldi P. & Hornik K. (1989): «Neural networks and principal component analysis: Learning from examples without local minima», Neural Networks, Vol. 2, N 1, pp. 53-58.
Links with factor analysis
MLP & LDA * **

[Figure: evolution of the 10 digit classes (0-9) from the input layer through hidden layers #1 and #2 to the output layer, showing progressive class separation.]

* Bennani Y. (1992): «Approches Connexionnistes pour la Modélisation et l'Identification», Thèse de Doctorat, LRI-Université Paris 11, Orsay.
** Gallinari P., Thiria S., Badran F., Fogelman-Soulie F. (1991): «On the relations between discriminant analysis and multilayer perceptrons», Neural Networks, Vol. 4, N 3, pp. 349-360.
Structured Networks
Complete Connections

[Figure: fully connected MLP with inputs x_1, …, x_n, hidden layer, network/computed output y and target/desired output d.]
Structured Networks
Complete Connections with Context [Elman *]

[Figure: inputs x_1(t), …, x_n(t) plus context units c_1(t−1), …, c_m(t−1) copied from the hidden layer, feeding the hidden layer and the output y with target d.]

* Elman J.L. (1990): «Finding structure in time», Cognitive Science, Vol. 14, pp. 179-212.
Structured Networks
Complete Connections with Context [Jordan *]

[Figure: inputs x_1(t), …, x_n(t) plus context units y_1(t−1), …, y_p(t−1) copied from the outputs, feeding the hidden layer and the output y with target d.]

* Jordan M.I. (1992): «Constrained supervised learning», Journal of Mathematical Psychology, Vol. 36, pp. 396-425.
Structured Networks
Synthesis of speech [Sejnowski & Rosenberg *]

NETtalk: text to speech
[Figure: NETtalk slides a window of letters over the input text and outputs the phoneme (here /z/) for the letter at the centre of the window.]

* Sejnowski T.J. & Rosenberg C.R. (1987): «Parallel Networks that learn to pronounce English text», Complex Systems, Vol. 1, pp. 145-168.
Structured Networks
Local Connections

Feature extractors: each hidden unit is connected only to a local receptive field of the input and extracts local features.

The use of local connections greatly reduces the number of weights in a network.

[Figure: input, hidden and output layers with local receptive fields; each hidden unit has its own weight vector (w_1^(i), w_2^(i), w_3^(i), w_4^(i)).]
Structured networks
Constrained or shared-weight connections

Feature extractors: the hidden units share the same weight vector (w_1, w_2, w_3, w_4), which acts as a convolution filter slid across the receptive fields of the input.

An interesting property of the weight-sharing mechanism is the very small number of free parameters.

[Figure: local features, receptive field, convolution filter and output.]
Structured Networks
Constrained Connections or Shared Weights: TDNN (Time Delay Neural Network)

Feature extractors: the same convolution filter (a weight matrix such as [w1 … w8]) is applied at successive time positions of the time-frequency input.

With a window of size n, a delay d and N input frames, the number of filter positions is M = ((N − n) / d) + 1; here n = 2, d = 1, N = 6.

[Figure: time-frequency input, shared convolution filters applied at each delay, and output.]

* Waibel A., Hanazawa T., Hinton G., Shikano K., Lang K. (1987): «Phoneme recognition using Time-Delay Neural Networks», Tech. Rep. ATR, TR-1-006.
Structured Networks
Convolutional Neural Network
Super-Resolution Convolutional Neural Network
Super-Resolution Convolutional Neural Network
•Input Image: LR single-channel image up-sampled to desired higher resolution
•Conv. Layer 1: Patch extraction
• 64 filters of size 1 x 9 x 9
• Activation function: ReLU
• Output: 64 feature maps
• Parameters to optimize: 1 x 9 x 9 x 64 = 5184 weights and 64 biases
•Conv. Layer 2: Non-linear mapping
• 32 filters of size 64 x 1 x 1
• Activation function: ReLU
• Output: 32 feature maps
• Parameters to optimize: 64 x 1 x 1 x 32 = 2048 weights and 32 biases
•Conv. Layer 3: Reconstruction
• 1 filter of size 32 x 5 x 5
• Activation function: Identity
• Output: HR image
• Parameters to optimize: 32 x 5 x 5 x 1 = 800 weights and 1 bias
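The parameter counts listed above follow the same rule (channels × kernel height × kernel width × number of filters, plus one bias per filter); a small sketch checks them:

```python
# Sketch checking the SRCNN parameter counts above
# (filters of shape channels_in x height x width).
def conv_params(c_in, k_h, k_w, n_filters):
    weights = c_in * k_h * k_w * n_filters
    biases = n_filters
    return weights, biases

assert conv_params(1, 9, 9, 64) == (5184, 64)   # Layer 1: patch extraction
assert conv_params(64, 1, 1, 32) == (2048, 32)  # Layer 2: non-linear mapping
assert conv_params(32, 5, 5, 1) == (800, 1)     # Layer 3: reconstruction
```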
[Figure: FSRCNN pipeline from the input low-resolution image through superpixel layers to the output high-resolution image.]
•Conv. Layer 1: Feature extraction
• 56 filters of size 1 x 5 x 5
• Activation function: PReLU
• Output: 56 feature maps
• Parameters: 1 x 5 x 5 x 56 = 1400 weights and 56 biases
•Conv. Layer 2: Shrinking
• 12 filters of size 56 x 1 x 1
• Activation function: PReLU
• Output: 12 feature maps
• Parameters: 56 x 1 x 1 x 12 = 672 weights and 12 biases
•Conv. Layers 3–6: Mapping
• 4 x 12 filters of size 12 x 3 x 3
• Activation function: PReLU
• Output: HR feature maps
• Parameters: 4 x 12 x 3 x 3 x 12 = 5184 weights and 48 biases
•Conv. Layer 7: Expanding
• 56 filters of size 12 x 1 x 1
• Activation function: PReLU
• Output: 56 feature maps
• Parameters: 12 x 1 x 1 x 56 = 672 weights and 56 biases
…
•DeConv Layer 8: Transposed Convolution (Deconvolution)
• One filter of size 56 x 9 x 9
• Activation function: PReLU
• Output: HR image
• Parameters: 56 x 9 x 9 x 1 = 4536 weights and 1 bias
Total number of weights: 12464 (plus a very small number of parameters in PReLU layers)
1. Feature extraction: Extracts a set of feature maps directly from the LR image.
2. Shrinking: Reduces dimension of feature vectors (thus decreasing the number of parameters) by using a smaller number of filters (compared to the number of filters used for feature extraction).
3. Non-linear mapping: Maps the feature maps representing LR to HR patches. This step is performed using several mapping layers.
4. Expanding: Increases dimension of feature vectors. This operation performs the inverse operation as the shrinking layers, in order to more accurately produce the HR image.
5. Deconvolution: Produces the HR image from HR features.
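The weight total quoted above (12464) can be checked from the per-layer filter shapes:

```python
# Sketch checking the FSRCNN weight total quoted above.
layer_weights = [
    1 * 5 * 5 * 56,       # feature extraction: 56 filters of 1x5x5
    56 * 1 * 1 * 12,      # shrinking: 12 filters of 56x1x1
    4 * (12 * 3 * 3) * 12,  # mapping: 4 layers of 12 filters of 12x3x3
    12 * 1 * 1 * 56,      # expanding: 56 filters of 12x1x1
    56 * 9 * 9 * 1,       # deconvolution: one filter of 56x9x9
]
assert layer_weights == [1400, 672, 5184, 672, 4536]
assert sum(layer_weights) == 12464
```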
Computing the output values of a discrete convolution

Discrete convolution is a linear transformation that preserves the notion of ordering.
It is sparse (only a few input units contribute to a given output unit) and reuses parameters (the same weights are applied to multiple locations in the input).

The light blue grid is called the input feature map. A kernel (shaded area) slides across the input feature map. At each location, the product between each element of the kernel and the input element it overlaps is computed, and the results are summed up to obtain the output in the current location.
The final outputs of this procedure are called output feature maps.

stride: distance between two consecutive positions of the kernel
zero padding: number of zeros concatenated at the beginning and at the end of an axis

[Figure: a (3×3) kernel applied to a (5×5) input padded with a border of zeros, using unit strides.]
What is pooling in a deep architecture?

Pooling is a concept in deep learning visual object recognition that goes hand-in-hand with convolution.
The idea is that a convolution (or a local neural network feature detector) maps a region of an image to a feature map.
By organizing alternating feature detection and pooling layers in a hierarchy, flexibly deformed structures can be recognized as "variations on the same structural theme."
At the very highest level, the features might be object categories that are fully independent of the position of the object within the frame.

Pooling is one of the most important concepts of convolutional neural networks: it divides the input map into a set of rectangles and outputs the maximum of each, performing a nonlinear down-sampling.
The most common pooling layer filter is of size 2x2 (with stride 2), which discards three-fourths of the activations.
The role of the pooling layer is to reduce the resolution of the feature map while retaining the features required for classification, providing translational and rotational invariance.
What is pooling in a deep architecture?

There are three variants of the pooling operation, depending on their roots in regularization techniques:
• Stochastic pooling,
• Overlapping pooling,
• Fractional pooling.

Stochastic pooling:
A randomly picked activation within each pooling region is used, rather than a deterministic pooling operation, to regularize the network. Stochastic pooling performs the reduction of feature size but gives up the judicious selection of features for the sake of regularization.

Overlapping pooling:
Overlapping pooling shares the responsibility of local connection beyond the size of the previous convolutional filter, which breaks the orthogonal division of responsibility between the pooling layer and the convolutional layer. So, no information is gained if pooling windows overlap.

Fractional pooling:
The reduction ratio of the filter size due to pooling can be controlled by fractional pooling, which helps to increase the depth of the network. Unlike stochastic pooling, the randomness is related to the choice of pooling regions, not to the way pooling is performed inside each pooling region.
Pooling

Pooling works by sliding a window across the input and feeding the content of the window to a pooling function.
Pooling works very much like a discrete convolution, but replaces the linear combination described by the kernel with some other function.
Pooling operations reduce the size of feature maps by using some function to summarize subregions, such as taking the average or the maximum value.
Pooling layers provide invariance to small translations of the input.
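The window-sliding description above can be sketched generically, with the summarizing function passed as a parameter (the function name and defaults are illustrative):

```python
import numpy as np

# Sketch of pooling as described above: slide a (k x k) window
# across the input with a given stride and summarize each window
# with `func` (max pooling by default, average with np.mean).
def pool2d(x, k=2, stride=2, func=np.max):
    h = (x.shape[0] - k) // stride + 1
    w = (x.shape[1] - k) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = func(x[i*stride:i*stride+k, j*stride:j*stride+k])
    return out

x = np.array([[1.,  2.,  3.,  4.],
              [5.,  6.,  7.,  8.],
              [9.,  10., 11., 12.],
              [13., 14., 15., 16.]])
assert np.array_equal(pool2d(x), np.array([[6., 8.], [14., 16.]]))
assert np.array_equal(pool2d(x, func=np.mean), np.array([[3.5, 5.5], [11.5, 13.5]]))
```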
Pooling

[Figure: computing the output values of a (3×3) max pooling operation on a (5×5) input using (1×1) strides, and of a (3×3) average pooling operation on the same input with the same strides.]
In what follows we assume:
• square inputs (i_1 = i_2 = i),
• square kernel size (k_1 = k_2 = k),
• same strides along both axes (s_1 = s_2 = s),
• same zero padding along both axes (p_1 = p_2 = p).

No zero padding, unit strides

Relationship 1. For any i and k, and for s = 1 and p = 0,
o = (i − k) + 1.

Convolving a (3×3) kernel over a (4×4) input using unit strides (i.e., i = 4, k = 3, s = 1 and p = 0) → o = (4 − 3) + 1 = 2 → a (2×2) output.

N.B.: Blue maps are inputs, and cyan maps are outputs.
Zero padding, unit strides

Padding with p zeros changes the effective input size from i to i + 2p.
Relationship 1 can then be used to infer the following relationship:

Relationship 2. For any i, k and p, and for s = 1,
o = (i − k) + 2p + 1.

Convolving a (4×4) kernel over a (5×5) input padded with a 2×2 border of zeros using unit strides (i.e., i = 5, k = 4, s = 1 and p = 2) → o = (5 − 4) + 2×2 + 1 = 6 → a (6×6) output.
Half (same) padding

Having the output size be the same as the input size (i.e., o = i) can be a desirable property:

Relationship 3. For any i and for k odd (k = 2n + 1, n ∈ ℕ), s = 1 and p = ⌊k/2⌋ = n,
o = i + 2⌊k/2⌋ − (k − 1) = i + 2n − 2n = i.

(Half padding, unit strides.) Convolving a (3×3) kernel over a (5×5) input using half padding and unit strides (i.e., i = 5, k = 3, s = 1 and p = 1).
Full padding

While convolving a kernel generally decreases the output size with respect to the input size, sometimes the opposite is required. This can be achieved with proper zero padding:

Relationship 4. For any i and k, and for p = k − 1 and s = 1,
o = i + 2(k − 1) − (k − 1) = i + (k − 1).

(Full padding, unit strides.) Convolving a (3×3) kernel over a (5×5) input using full padding and unit strides (i.e., i = 5, k = 3, s = 1 and p = 2).
No zero padding, non-unit strides

Relationship 5. For any i, k and s, and for p = 0,
o = ⌊(i − k)/s⌋ + 1.

Convolving a (3×3) kernel over a (6×6) input padded with a 1×1 border of zeros using (2×2) strides (i.e., i = 6, k = 3, s = 2 and p = 1).
In this case, the bottom row and right column of the zero-padded input are not covered by the kernel.
Zero padding, non-unit strides

The most general case (convolving over a zero-padded input using non-unit strides) can be derived by applying Relationship 5 on an effective input of size i + 2p, in analogy to what was done for Relationship 2:

Relationship 6. For any i, k, p and s,
o = ⌊(i + 2p − k)/s⌋ + 1.

As before, the floor function means that in some cases a convolution will produce the same output size for multiple input sizes.
More specifically, if i + 2p − k is a multiple of s, then any input size j = i + a, a ∈ {0, …, s − 1} will produce the same output size. Note that this ambiguity applies only for s > 1.
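Relationship 6 subsumes the earlier relationships, which a short sketch can confirm (the function name is illustrative):

```python
import math

# Sketch of Relationship 6: output size of a convolution with
# input size i, kernel k, padding p and stride s.
def conv_output_size(i, k, p=0, s=1):
    return math.floor((i + 2 * p - k) / s) + 1

assert conv_output_size(4, 3) == 2             # Relationship 1
assert conv_output_size(5, 4, p=2) == 6        # Relationship 2
assert conv_output_size(5, 3, p=1) == 5        # Relationship 3 (half padding)
assert conv_output_size(5, 3, p=2) == 7        # Relationship 4: i + (k - 1)
assert conv_output_size(6, 3, p=1, s=2) == 3   # Relationship 6
```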
Transposed convolution arithmetic

Transposed convolutions use a transformation going in the opposite direction of a normal convolution, i.e., from something that has the shape of the output of some convolution to something that has the shape of its input, while maintaining a connectivity pattern that is compatible with said convolution.

For instance:
- the decoding layer of a convolutional autoencoder,
- projecting feature maps to a higher-dimensional space.

If the input and output were to be unrolled into vectors from left to right, top to bottom, the convolution could be represented as a sparse matrix C where the non-zero elements are the elements w_{i,j} of the kernel (with i and j being the row and column of the kernel respectively).
This linear operation takes the input matrix flattened as a 16-dimensional vector and produces a 4-dimensional vector that is later reshaped as the (2×2) output matrix.

Using this representation, the backward pass is easily obtained by transposing C; in other words, the error is backpropagated by multiplying the loss with Cᵀ. This operation takes a 4-dimensional vector as input and produces a 16-dimensional vector as output, and its connectivity pattern is compatible with C by construction.

Notably, the kernel w defines both the matrices C and Cᵀ used for the forward and backward passes.

Transposed convolutions — also called fractionally strided convolutions or deconvolutions — work by swapping the forward and backward passes of a convolution.
No zero padding, unit strides, transposed

The simplest way to think about a transposed convolution on a given input is to imagine such an input as being the result of a direct convolution applied on some initial feature map.
The transposed convolution can then be considered as the operation that allows us to recover the shape of this initial feature map.

The transpose of convolving a (3×3) kernel over a (4×4) input using unit strides (i.e., i = 4, k = 3, s = 1 and p = 0) is equivalent to convolving a (3×3) kernel over a (2×2) input padded with a (2×2) border of zeros using unit strides (i.e., i′ = 2, k′ = k, s′ = 1 and p′ = 2).

Relationship 8. A convolution described by s = 1, p = 0 and k has an associated transposed convolution described by k′ = k, s′ = s and p′ = k − 1, and its output size is
o′ = i′ + (k − 1).
This corresponds to a fully padded convolution with unit strides.
Transposed convolution animations

No padding, no strides, transposed; arbitrary padding, no strides, transposed; half padding, no strides, transposed; full padding, no strides, transposed; no padding, strides, transposed; padding, strides, transposed; padding, strides, transposed (odd).

N.B.: Blue maps are inputs, and cyan maps are outputs.

Vincent Dumoulin, Francesco Visin - A guide to convolution arithmetic for deep learning. https://github.com/vdumoulin/conv_arithmetic
Structured networks
LeNet for Digit recognition [Yann LeCun *]

* Le Cun Y., Boser B., Denker J.S., Henderson D., Howard R.E., Hubbard W., Jackel L.D. (1989): «Back-propagation applied to handwritten zip code recognition», Neural Computation, Vol. 1, pp. 541-551.