Artificial Neural Networks

From Perceptron to Deep Learning

Younès BENNANI Full Professor

Master of Science in Informatics

Exploration Informatique des Données et Décisionnel (EID²), Science des Données (WISD & MASD), Mathématiques des Données (MD)

BOOKS

Artificial Intelligence: A Modern Approach, by Stuart Russell and Peter Norvig. Prentice Hall Series in Artificial Intelligence, hardcover (import), 2015.

Deep Learning, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Adaptive Computation and Machine Learning series, hardcover, 2017.

The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Second edition, Springer Series in Statistics, hardcover, 2017.

Machine Learning, by Tom M. Mitchell. Indian edition, 2017.

Pattern Recognition and Machine Learning, by Christopher M. Bishop. Information Science and Statistics, hardcover, 2010.

The Nature of Statistical Learning Theory, by Vladimir Vapnik. Second edition, Springer, 2011.

Apprentissage connexionniste, by Younès Bennani. Hermès / Lavoisier, 2006, Traité IC2, Série Informatique et systèmes d'information.

Apprentissage artificiel : Concepts et algorithmes, by Antoine Cornuéjols and Laurent Miclet. Eyrolles, 2010. EAN13: 9782212124712.

Apprentissage machine, de la théorie à la pratique, by Massih-Reza Amini. Eyrolles, 2015. EAN13: 9782212138009.

Data mining et statistique décisionnelle, by Stéphane Tufféry. Éditions TECHNIP, 2012.

Intelligence artificielle, by Stuart Russell and Peter Norvig. Pearson, 10/12/2010 (3rd edition). EAN13: 9782744074554.

L'apprentissage profond, by Yoshua Bengio, Aaron Courville, and Ian Goodfellow, preface by Francis Bach. Florent Massot Eds, 2018. EAN 979-1097160432, ISBN 1097160432.

Practical work (lab work)

An open source machine learning framework that accelerates the path from research prototyping to production deployment.

https://pytorch.org


Course materials

Interactive Teaching Space

lipn.univ-paris13.fr/~bennani/enseignements

Deep Learning (DL)

Password: epi-m2-info-rna

Contents

• Theoretical formalisms, models, and learning algorithms
  - Motivations
  - What is machine learning?
  - Basic elements (formal neuron, architecture, parameters, ...)
  - Adaline and Perceptron
  - Multi-Layer Perceptron (MLP)
  - Structured and convolutional networks
  - Auto-Encoder / Auto-Associator (AE) networks
  - Stacked Auto-Encoder (SAE)
  - Radial Basis Function (RBF) networks
  - Learning Vector Quantization (LVQ)
  - Self-Organizing Maps (SOM)
  - Deep Self-Organizing Maps (DeepSOM)
  - Restricted Boltzmann Machine (RBM)
  - Deep Belief Network (DBN)
  - Generative Adversarial Network (GAN)
• Applications

Artificial Intelligence (AI) can isolate your face from a crowd

AI for the detection and segmentation of objects of interest

Image segmentation

Google Self-Driving Car Project

MIT: the "autonomous" car

Medical diagnosis: spectacular results!

«Dermatologist-level Classification of Skin Cancer with Deep Neural Networks» Andre Esteva, Brett Kuprel, Rob Novoa, Justin Ko, Susan Swetter, Helen Blau, Sebastian Thrun - Nature, 2017

Expert-level classification of skin cancer (Nature, 2017).

AI enters the world of art ...

A work painted by an AI was sold at auction in New York for more than $430,000.

115 years later, AI made it possible to complete a symphony by the Czech composer Antonín Dvořák.

AI "plays" better than the great champions

• Chess: the famous 1997 defeat of World Chess Champion Garry Kasparov by Deep Blue, an AI designed by IBM.
• In 2011, Watson beat the human contestants on the Jeopardy! TV game show.
• In 2017, AlphaGo Zero surpassed all the champions:
  - in 3 hours, AlphaGo Zero learned the fundamentals of the game of Go;
  - after 21 days, it equalled AlphaGo Master, which had beaten world champion Ke Jie.

Position of the face in a picture

[Figure: a network with 30 x 32 inputs classifies the position of the face into four categories (Left, Straight, Right, Up). Panels: hidden layer weights after 1 epoch, hidden layer weights after 25 epochs, output layer weights (including w0 = θ) after 1 epoch.]

Expression of the face in a picture

[Figure, after Beat, 2002: 210 images (246x256, TIFF) of 10 Japanese women, with 6 expressions + 1 neutral pose.]

Facial Recognition

"DeepFace", Facebook's new facial recognition system

Unmanned aeroplanes soon?

Google Planes!

Once again, artificial intelligence has defeated a big name in its field: retired Colonel Gene Lee, a pilot trainer for the US Air Force. The AI won every round of the simulation.

"I was surprised by its responsiveness. Alpha seemed to be aware of all my intentions and reacted instantly to my flight changes and missile deployments."
Gene Lee, July 2016

Eliminating language barriers worldwide by using AI

DeepL is a German company that has set itself the goal of eliminating language barriers worldwide by using artificial intelligence.

(www.DeepL.com)

Assistance for the blind

DeepDream: Google's psychedelic dream machine

AI & Machine Learning

Machine Learning paradigms taxonomy

What is Machine Learning?

Diameter (cm)   Height (m)
12              2.7
19              4.7
15              3.9
11              3.1
23              5.9
27              6.7
32              8.2
10              2.6
28              6.3
35              9.2
20              6.4
9               2.1
17              3.6

[Figure: scatter plot of Height (m) against Diameter (cm) for the 13 measurements in the table above.]

Forecast: given a new diameter (19 cm on the slide), what height should be predicted?

[Figure: the same scatter plot with a straight line of slope $a$ and bias $b$.]

$$Y = a \cdot X + b$$

$$\text{Height} = a \cdot \text{Diameter} + b$$

[Figures: the same plot repeated for successive values of $a$ and $b$.]
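The linear model Height = a · Diameter + b can be fitted to the diameter and height measurements with ordinary least squares. The following is a minimal sketch, not the course's own code: the function names `fit_line` and `predict_height` are hypothetical, and the closed-form least-squares estimate is one possible way to choose $a$ and $b$ (the slides do not say which fitting method is used).

```python
# Diameter (cm) and height (m) pairs from the measurement table.
diameters = [12, 19, 15, 11, 23, 27, 32, 10, 28, 35, 20, 9, 17]
heights = [2.7, 4.7, 3.9, 3.1, 5.9, 6.7, 8.2, 2.6, 6.3, 9.2, 6.4, 2.1, 3.6]


def fit_line(xs, ys):
    """Closed-form least-squares estimates of the slope a and the bias b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


a, b = fit_line(diameters, heights)


def predict_height(diameter):
    """Forecast a height (m) from a diameter (cm) with the fitted line."""
    return a * diameter + b


print(f"a = {a:.3f}, b = {b:.3f}, forecast for 19 cm: {predict_height(19):.2f} m")
```

The "learning phase" here is the call to `fit_line`; once $a$ and $b$ are known, forecasting is a single multiplication and addition.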

[Figure: a "Machine Learning" box maps an input X to an output Y. During the learning phase, the model is adjusted using the table of (X, Y) pairs, i.e. the diameter and height measurements.]

Reality is more complex!

Recommendations at Amazon

A personalized space is highlighted on the homepage, and surely not by chance.

To make recommendations, Amazon uses all the information it has:
- the products I bought;
- the products that I said I already have;
- the products I consulted;
- links between products, extrapolated from shopping lists created by all customers;
- ...

[Figure: Machine Learning box, Input → Output: will you like, click, comment, interact, ...?]

Detection and investigation of credit card fraud on the Internet

[Figure: Machine Learning box, Input → Output: fraud?]

Image recognition

[Figure: Machine Learning box, image → Flower? Flamingo? Lamp? Robot? ...?]

For large real-world problems, a regression line is not enough! Richer, more complex models are needed: artificial neural networks.

Large size vs. complexity

• Image 400x400: 160,000 numbers as input. Cat or other?

• Image 1000x1000: $10^6$ numbers as input. Motorbike? Car? Bus? Train? ...?

Extract features

[Figure: raw image → feature-extraction algorithm → extracted (relevant) characteristics → Motorbike? Car? Bus? Train? ...?]

Discovering the hidden structure of high-dimensional data

Example: images of a person's face. 1000x1000 pixels = 1,000,000 dimensions.
• But the face has 3 Cartesian coordinates and 3 Euler angles.
• The human face has about 50 muscles.
• The hidden dimension is < 56.

The relevant representation of a face image must contain fewer than $10^6$ values! Relevant extracted characteristics include: face/other, pose, lighting, expression.

[Figure: raw image (millions of pixels) → feature-extraction algorithm → extracted characteristics (a few dozen) → recognition using an Artificial Neural Network.]

Image Abstraction

[Figure: raw image (millions of pixels) → deep network → extraction of hierarchical characteristics (a new representation space, learning the characteristics) → classification.]

How to learn features with such networks?

[Photos: Yann LE CUN, Léon BOTTOU]

Artificial Neural Networks: From Perceptron to Deep Learning, © 2021, Younès Bennani, USPN

Le Cun Y., Boser B., Denker J.S., Henderson D., Howard R.E., Hubbard W., Jackel L.D. (1989): «Back-propagation applied to handwritten zip code recognition», Neural Computation, Vol. 1, pp. 541-551.

Bennani Y. & Gallinari P. (1991): «On The Use Of TDNN-Extracted Features Information In Talker Identification», IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP '91, May 14-17, pp. 385-388, Toronto, Canada.

[Figure: speech input layer (time x frequency) → convolutional hidden layers → extracted features → one HMM per speaker (HMM Speaker 1, HMM Speaker 2, ..., HMM Speaker N). Photos: Younès BENNANI, Patrick GALLINARI.]

Why this strong comeback?

• Big names in digital technology recruit big names in Machine Learning: Yoshua Bengio, Geoff Hinton, Yann LeCun, Andrew Ng.
• Big Data: data in all directions, with an exponential explosion in the quantity of data.
• Graphics Processing Units. GPU computing uses the graphics processing unit (GPU) in parallel with the CPU to accelerate computations (for example, the NVIDIA DGX-1 Deep Learning System).

How does the GPU work?

The graphics card becomes a "coprocessor" that supports the main processor of a system.

CPU: central processing unit. GPU: graphics processing unit.

Learning in Deep Networks

• Constructing a feature space
  - Note that this is what we do with kernels in SVMs, or hidden layers in MLPs, etc., but now we build the representation space using deep architectures.
  - Unsupervised learning between layers can decompose the problem into distributed sub-problems (with higher levels of abstraction), to be further decomposed at successive layers.

The challenge of learning in deep networks

• Difficulties in supervised learning of deep networks
  - The first layers of an MLP are not learned well.
      - Gradient diffusion (the vanishing gradient): the error shrinks as it propagates back to the earlier layers, so the gradient propagates "badly" from the output to the input.
      - This leads to very slow learning.
      - The deeper the network, the greater its degree of non-linearity, which increases the chances of meeting these obstacles to optimisation.
      - The lower layers are left with transformations of the input that are not very useful.
      - We need a way to help the first layers do an efficient job.
  - Often not enough labelled data is available.
      - Can we use unsupervised or semi-supervised approaches to take advantage of unlabelled data?
  - Deep networks tend to have more local-minima problems than simple networks during supervised learning.
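The gradient diffusion effect can be illustrated numerically. The sketch below is a toy illustration under strong assumptions, not a real network: it considers a chain of layers with a single sigmoid unit each, all weights set to 1 and all pre-activations set to 0 (hypothetical values chosen for simplicity), and computes the factor by which the chain rule multiplies an error signal on its way back to the first layer.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def sigmoid_derivative(a):
    s = sigmoid(a)
    return s * (1.0 - s)


def backpropagated_factor(pre_activations, weights):
    """Product of w * f'(a) along a chain of single-unit layers.

    This is the factor that scales the output error when it is
    propagated back to the first layer of the chain.
    """
    factor = 1.0
    for a, w in zip(pre_activations, weights):
        factor *= w * sigmoid_derivative(a)
    return factor


# f'(0) = 0.25 is the largest value the sigmoid derivative can take,
# so with unit weights each layer shrinks the signal by at least 4.
shallow = backpropagated_factor([0.0] * 2, [1.0] * 2)
deep = backpropagated_factor([0.0] * 20, [1.0] * 20)
print(shallow, deep)
```

With 2 layers the factor is 0.25² = 0.0625; with 20 layers it is 0.25²⁰, below 10⁻¹¹, which is why the first layers barely move during supervised training in this setting.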

The original idea (≈1940)

• Living organisms, even primitive ones (e.g. insects), carry out complex information-processing tasks:
  - orientation
  - communication
  - social behaviour
  - ...
• The processing power of their nervous system comes from:
  - massive interconnection ($10^{14}$ connections in humans);
  - a large number of simple processing units ($10^{11}$ in humans): neurons.
• The initial motivation was neuromimicry:
  - however, the 1940s vision was rather simplistic;
  - biological reality has since proved to be more complex.
• On the other hand, this idea has proved very fruitful in mathematics and engineering.

The artificial neuron

1943, McCulloch & Pitts

A model, not a copy, of the biological neuron: an elementary processor characterized by:
• input signals: $x = (x_0, x_1, \ldots, x_d)$
• connection weights: $w = (w_0, w_1, \ldots, w_d)$
• activation function: $F(x, w)$
• internal activation state: $a = F(x, w)$
• transition function: $f(a)$
• output state: $y = f(a)$

[Figure: inputs $x_0, x_1, \ldots, x_d$, weighted by $w_0, w_1, \ldots, w_d$, feed the activation function $F(x, w)$, whose result $a$ passes through $f(a)$ to give the output $y$.]

Definition: A formal (artificial) neuron is a processing unit that receives input data in the form of a vector and produces a real-valued output. This output is a function of the inputs and of the connection weights.

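The artificial neuron

[Figure: inputs $x_0, x_1, \ldots, x_d$ with weights $w_{j0}, w_{j1}, \ldots, w_{jn}$ feed a summation unit $\sum_i w_i x_i$ producing $a$, followed by the transition function $f$.]

$$F(x, w) = w \cdot x = \sum_i w_i x_i$$

$$y = f\left(\sum_i w_i x_i\right)$$

with, for example, $f(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$.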
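The weighted-sum neuron $y = f(\sum_i w_i x_i)$ can be written in a few lines. This is a minimal sketch; the function name `neuron` and the numerical values of `x` and `w` are hypothetical, and `math.tanh` is used as the transition function $f$.

```python
import math


def neuron(x, w, f=math.tanh):
    """Formal neuron: activation a = sum_i w_i * x_i, output y = f(a)."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    return f(a)


x = [1.0, 0.5, -2.0]  # x[0] = 1 plays the role of the bias input
w = [0.1, 0.4, 0.3]   # illustrative weights
y = neuron(x, w)      # a = 0.1 + 0.2 - 0.6 = -0.3, so y = tanh(-0.3)
print(y)
```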

Activation functions

$$f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

$$f(x) = \frac{1}{1 + e^{-x}}, \qquad f'(x) = f(x)\,\bigl(1 - f(x)\bigr)$$

$$f(x) = \frac{e^{x} - 1}{e^{x} + 1}$$
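The derivative identity $f'(x) = f(x)(1 - f(x))$ of the logistic function $f(x) = 1/(1 + e^{-x})$ can be checked numerically. The sketch below uses hypothetical function names and a central finite difference as the reference; it also checks that the third function, $(e^{x} - 1)/(e^{x} + 1)$, equals $\tanh(x/2)$.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x):
    """Analytic derivative f'(x) = f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def bipolar_sigmoid(x):
    """(e^x - 1) / (e^x + 1), a sigmoid with values in (-1, 1)."""
    return (math.exp(x) - 1.0) / (math.exp(x) + 1.0)


def numerical_derivative(f, x, h=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


x0 = 0.7
print(sigmoid_prime(x0), numerical_derivative(sigmoid, x0))
print(bipolar_sigmoid(x0), math.tanh(x0 / 2.0))
```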


Rectified Linear Units (ReLU)

• More efficient gradient propagation: the derivative is 0 or a constant, which can simply be folded into the learning rate.
• More efficient computation: only comparison, addition, and multiplication.
  - Leaky ReLU: $f(x) = x$ if $x > 0$, else $f(x) = ax$ with $0 < a < 1$, so that the derivative is not 0 for negative inputs and some learning can still happen in that case.
  - Lots of other variations exist.
• Sparse activation: for example, in a randomly initialized network, only about 50% of the hidden units are activated (have a non-zero output).
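The ReLU and Leaky ReLU functions and their derivatives can be sketched as follows. The function names and the default slope `a=0.01` are assumptions made for illustration; the slides only require a small positive slope for negative inputs.

```python
def relu(x):
    """ReLU: x for positive inputs, 0 otherwise (one comparison)."""
    return x if x > 0 else 0.0


def relu_derivative(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, a=0.01):
    """Leaky ReLU: x for positive inputs, a * x otherwise."""
    return x if x > 0 else a * x


def leaky_relu_derivative(x, a=0.01):
    """Derivative of Leaky ReLU: never 0, so negative inputs still learn."""
    return 1.0 if x > 0 else a


for value in (-2.0, 3.0):
    print(value, relu(value), leaky_relu(value), leaky_relu_derivative(value))
```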

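The distance neuron

[Figure: inputs $x_0, x_1, \ldots, x_n$ with weights $w_{j0}, w_{j1}, \ldots$ feed a unit computing $a_j = \sum_{i=0}^{n} (w_{ji} - x_i)^2$.]

$$\lVert w - x \rVert^2 = \lVert w \rVert^2 - 2\,\langle w, x \rangle + \lVert x \rVert^2$$

$$F(x, w) = \lVert w - x \rVert^2 = \sum_{i \in \mathrm{Upstream}(j)} (w_{ji} - x_i)^2$$

$$s_j = f\left(\sum_{i \in \mathrm{Upstream}(j)} (w_{ji} - x_i)^2\right)$$

where $\mathrm{Upstream}(j)$ (Amont(j) in the original notation) is the set of units feeding unit $j$.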
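The distance neuron replaces the weighted sum by a squared Euclidean distance between the weight vector and the input. The sketch below, with hypothetical function names and values, computes this activation directly and through the expansion $\lVert w \rVert^2 - 2\langle w, x \rangle + \lVert x \rVert^2$ to confirm that both forms agree.

```python
def distance_activation(x, w):
    """a = sum_i (w_i - x_i)^2, the squared distance between w and x."""
    return sum((wi - xi) ** 2 for wi, xi in zip(w, x))


def expanded_form(x, w):
    """The same quantity written as ||w||^2 - 2 <w, x> + ||x||^2."""
    norm_w = sum(wi * wi for wi in w)
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm_x = sum(xi * xi for xi in x)
    return norm_w - 2.0 * dot + norm_x


x = [1.0, 2.0, -1.0]
w = [0.5, 1.0, 1.0]
a = distance_activation(x, w)  # 0.25 + 1.0 + 4.0
print(a, expanded_form(x, w))
```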

Artificial Neural Network

Definition: A neural network is a weighted directed graph made up of a set of units (or automata) that perform elementary computations. The units are structured in successive layers and exchange information through the connections that link them.

[Figure: a layered network with inputs $x_0, x_1, \ldots, x_D$ and outputs $y_1(x, w), \ldots, y_K(x, w)$.]

- A network is characterized by:
    - its architecture;
    - the functions of its elements.
- Massively parallel architecture.
- System based on the cooperation of several simple units (formal neurons).
- Adaptation of the network parameters (synaptic efficiencies).

How do Artificial Neural Networks work?

Adaline: Adaptive Linear Element

Stanford, 1960, Bernard Widrow*

[Figure: inputs $x_0 = 1, x_1, \ldots, x_D$ with weights $w_0, w_1, \ldots, w_D$ feed a summation unit $a = \sum_{i=0}^{D} w_i x_i$.]

It is an adaptive linear element:

$$y = \sum_{i=0}^{D} w_i x_i$$

The unit $x_0$, whose activation is set to 1, is called the bias unit. It introduces the bias term into the network:

$$y = \psi(x, w) = w_0 + \sum_{i=1}^{D} w_i x_i = w_0 + w^T x$$

Threshold function:

$$f(x) = \begin{cases} 1 & \text{if } x > 0 \\ -1 & \text{if } x < 0 \end{cases}$$

[Figure: in the $(x_1, x_2)$ plane, the line $w_0 + w^T x = 0$ is orthogonal to $w$, at distance $l = \dfrac{w^T x}{\lVert w \rVert} = -\dfrac{w_0}{\lVert w \rVert}$ from the origin.]

* Widrow B., Hoff M.E. (1960): «Adaptive switching circuits», IRE WESCON Conv. Record, part 4, pp. 96-104.

[Figure: two scatter plots of classes x and o. Without bias, the separating line passes through the origin; with bias, it can be shifted to separate the classes.]

It can be used for discrimination (classification) between 2 classes:

$$\psi(x, w) = w_0 + w^T x \;\begin{cases} > 0 & \text{if } x \in C_1 \\ < 0 & \text{if } x \in C_2 \end{cases}$$

[Figure: classes $C_1$ and $C_2$ on either side of the boundary $\psi(x, w) = 0$, with $\psi(x, w) > 0$ on one side and $\psi(x, w) < 0$ on the other.]

If we call $z^{(n)} = (x^{(n)}, d^{(n)})$ the example taken into account at iteration $n$, where $d^{(n)}$ is the desired output and $y$ the computed output, we define the squared instantaneous error by:

$$\tilde{R}^{(n)}_{Adaline}(w) = \left(d^{(n)} - w\,x^{(n)}\right)^2$$

The mean squared error (MSE) is the mean of the squared instantaneous errors over all examples:

$$\tilde{R}_{Adaline}(w) = \frac{1}{N} \sum_{n=1}^{N} \tilde{R}^{(n)}_{Adaline}(w)$$

There are several learning algorithms.

Gradient descent techniques (steepest descent):

Suppose that at time $t$ the weights of the Adaline are $w(t)$ and that the example $(x^{(n)}, d^{(n)})$ is presented. The weights are then changed by:

$$w(t+1) = w(t) - \varepsilon(t)\,\nabla_w \tilde{R}^{(n)}_{Adaline}(w)$$

where $\varepsilon(t)$ is the gradient step and the instantaneous gradient is

$$\nabla_w \tilde{R}^{(n)}_{Adaline}(w) = \frac{\partial \tilde{R}^{(n)}_{Adaline}(w)}{\partial w} = -2\left(d^{(n)} - w\,x^{(n)}\right) x^{(n)}$$

This rule is called the stochastic gradient rule, the Widrow-Hoff rule, the delta rule, or the µ-LMS (Least Mean Square) rule.

Learning algorithm: Adaline

1- Initialize the weights $W_0$ randomly.
2- Randomly choose a pair of data $(x^{(n)}, d^{(n)})$.
3- Determine the value of the error: $e^{(n)}(t) = d^{(n)} - w\,x^{(n)}$.
4- Calculate a gradient approximation: $\nabla_w \tilde{R}^{(n)}_{Adaline}(w) = -2\,e^{(n)}(t)\,x^{(n)}$.
5- Adapt the weights: $w(t+1) = w(t) - \varepsilon(t)\,\nabla_w \tilde{R}^{(n)}_{Adaline}(w)$, where $\varepsilon(t)$ is the gradient step.
6- Repeat steps 2 to 5 until an acceptable error value is obtained.

Adaline: Example

Problem: find an Adaline able to learn the truth table of a Boolean function of 2 variables.

Data: truth table of the function $\psi(x, w) = x_1 \lor x_2$ (OR), with the bias input $x_0 = 1$:

x0   x1   x2   d
 1    1    1    1
 1    1   -1    1
 1   -1    1    1
 1   -1   -1   -1

$$D_4 = \left\{ \left(\begin{pmatrix} 1 \\ 1 \end{pmatrix}, 1\right);\ \left(\begin{pmatrix} 1 \\ -1 \end{pmatrix}, 1\right);\ \left(\begin{pmatrix} -1 \\ 1 \end{pmatrix}, 1\right);\ \left(\begin{pmatrix} -1 \\ -1 \end{pmatrix}, -1\right) \right\}$$

In general, the training set is written $D_N = \{(x^{(1)}, d^{(1)}), (x^{(2)}, d^{(2)}), \ldots, (x^{(N)}, d^{(N)})\}$.

[Figure: an Adaline with inputs $x_0 = 1, x_1, x_2$ and weights $w_0, w_1, w_2$ computes $a = \sum_{i=0}^{n} w_i x_i$ and the output $y$. In the $(x_1, x_2)$ plane, the decision boundary is the line $w_1 x_1 + w_2 x_2 + w_0 = 0$.]
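The Widrow-Hoff rule can be applied to the OR truth table in +1/-1 coding. The following is a minimal sketch, not the course's reference implementation: the function names, the constant step `step=0.005`, the number of iterations, and the seeded random generator are all assumptions chosen so that the run is reproducible.

```python
import random

# OR truth table in +1/-1 coding; x[0] = 1 is the bias unit.
DATA = [
    ([1.0, 1.0, 1.0], 1.0),
    ([1.0, 1.0, -1.0], 1.0),
    ([1.0, -1.0, 1.0], 1.0),
    ([1.0, -1.0, -1.0], -1.0),
]


def output(w, x):
    """Linear output y = sum_i w_i * x_i."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_adaline(data, step=0.005, iterations=5000, seed=0):
    rng = random.Random(seed)
    w = [rng.uniform(-0.5, 0.5) for _ in range(3)]  # random initial weights
    for _ in range(iterations):
        x, d = rng.choice(data)                     # random example
        e = d - output(w, x)                        # instantaneous error
        # The gradient of e^2 is -2 * e * x, so w <- w + 2 * step * e * x.
        w = [wi + 2.0 * step * e * xi for wi, xi in zip(w, x)]
    return w


def classify(w, x):
    """Threshold the linear output to get a class label."""
    return 1.0 if output(w, x) > 0 else -1.0


w = train_adaline(DATA)
print(w, [classify(w, x) for x, _ in DATA])
```

Because OR cannot be fitted exactly by a linear output, the weights settle near the least-squares solution (each weight close to 0.5) rather than reaching zero error; thresholding the output still classifies the four examples correctly.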

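Adaline: limitations

$$\psi(x, w) = x_1 \oplus x_2 = \mathrm{XOR}(x_1, x_2)$$

       x0   x1   x2   d
x^1     1    1    1   -1
x^2     1    1   -1    1
x^3     1   -1    1    1
x^4     1   -1   -1   -1

[Figure: the four examples in the $(x_1, x_2)$ plane; no single straight line separates the two classes.]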

Linear separability

«Two classes of objects, described in a space of dimension n, are said to be "linearly separable" if they lie on opposite sides of a hyperplane in the representation space.»

[Figure: in the $(x_1, x_2)$ plane, two classes A and B that are linearly separable, and two that are not.]

Adaline with polynomial preprocessing

To handle $\psi(x, w) = x_1 \oplus x_2$, the inputs are first expanded into the polynomial terms $x_1$, $x_1^2$, $x_1 x_2$, $x_2$, $x_2^2$ (built from the identity Id, squares, and a product), and an Adaline with weights $w_0, \ldots, w_5$ is applied to them. The decision boundary becomes:

$$w_1 x_1 + w_2 x_1^2 + w_3 x_1 x_2 + w_4 x_2 + w_5 x_2^2 + w_0 = 0$$

[Figure: in the $(x_1, x_2)$ plane, this boundary is a separation ellipse.]
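With the polynomial terms as inputs, the same Widrow-Hoff rule can learn XOR. The sketch below is an illustration under stated assumptions: the feature order `[1, x1, x1^2, x1*x2, x2, x2^2]` follows the boundary equation, while the function names, zero initial weights, constant step, and seed are hypothetical choices. In +1/-1 coding, XOR equals $-x_1 x_2$, so an exact linear solution exists in the expanded space.

```python
import random


def polynomial_features(x1, x2):
    """Expand (x1, x2) into [1, x1, x1^2, x1*x2, x2, x2^2]."""
    return [1.0, x1, x1 * x1, x1 * x2, x2, x2 * x2]


# XOR truth table in +1/-1 coding: +1 when exactly one input is +1.
XOR_DATA = [
    (polynomial_features(1.0, 1.0), -1.0),
    (polynomial_features(1.0, -1.0), 1.0),
    (polynomial_features(-1.0, 1.0), 1.0),
    (polynomial_features(-1.0, -1.0), -1.0),
]


def output(w, phi):
    return sum(wi * pi for wi, pi in zip(w, phi))


def train_adaline(data, step=0.05, iterations=2000, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    for _ in range(iterations):
        phi, d = rng.choice(data)
        e = d - output(w, phi)
        # Widrow-Hoff update on the expanded inputs.
        w = [wi + 2.0 * step * e * pi for wi, pi in zip(w, phi)]
    return w


w = train_adaline(XOR_DATA)
print([round(output(w, phi), 3) for phi, _ in XOR_DATA])
```

The Adaline itself is unchanged; only its inputs differ. The non-linearity needed for XOR is supplied entirely by the preprocessing, here mainly the product term $x_1 x_2$.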

Madaline: Multi-Adaptive Linear Element

Madaline = a set of parallel Adalines.

$$\psi(x, w) = y = x_1 \oplus x_2$$

[Figure: Adaline 1 (weights $w_{01}, w_{11}, w_{21}$) and Adaline 2 (weights $w_{02}, w_{12}, w_{22}$) both receive $x_0, x_1, x_2$ and output $z_1$ and $z_2$. Adaline 3 (weights $w_{31}, w_{32}$) combines them as $z_1 \land z_2$. The accompanying truth table lists, for the four examples $x^1, \ldots, x^4$, the inputs $x_0, x_1, x_2$ and the desired outputs $d$ of each Adaline. In the $(x_1, x_2)$ plane, Adaline 1 and Adaline 2 each draw one line, and the class $x_1 \oplus x_2$ is the intersection of their two half-planes.]
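One way to realise XOR as $z_1 \land z_2$ with three Adalines is to let Adaline 1 compute OR and Adaline 2 compute NAND. This particular choice, the hand-set weights below, and the function names are assumptions made for illustration; the slides do not give the weight values.

```python
def sign(a):
    return 1 if a > 0 else -1


def adaline(w, x):
    """Thresholded Adaline: sign of the weighted sum, x[0] = 1 is the bias."""
    return sign(sum(wi * xi for wi, xi in zip(w, x)))


# Hand-set weights [w0, w1, w2] in +1/-1 coding (illustrative values).
W_OR = [1.0, 1.0, 1.0]      # Adaline 1: z1 = x1 OR x2
W_NAND = [1.0, -1.0, -1.0]  # Adaline 2: z2 = NOT (x1 AND x2)
W_AND = [-1.0, 1.0, 1.0]    # Adaline 3: y = z1 AND z2


def madaline_xor(x1, x2):
    x = [1.0, x1, x2]
    z1 = adaline(W_OR, x)
    z2 = adaline(W_NAND, x)
    return adaline(W_AND, [1.0, z1, z2])


for x1, x2 in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
    print(x1, x2, madaline_xor(x1, x2))
```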

Perceptron

Rosenblatt F., 1957, 1962*

The perceptron does not designate a single model but groups together an important family of algorithms.

The perceptron is an adaptive machine used to solve classification (discrimination) problems.

[Figure: the retina receives the information from the outside; the association cells each apply a transition function $\varphi_i$ defined on the retina; the decision cell combines them with weights $w_0, w_1, \ldots, w_D$ to produce $\psi(x, w)$.]

$$\varphi_i(x) : R \to \mathbb{R} \quad (R \text{ denotes the retina})$$

$$\psi(x, w) = f\left(w_0 + \sum_{i=1}^{D} w_i\,\varphi_i(x)\right) = f\left(\sum_{i=0}^{D} w_i\,\varphi_i(x)\right) = f(w^T \varphi)$$

with $w^T = (w_0, w_1, \ldots, w_D)$, $\varphi = (1, \varphi_1(x), \varphi_2(x), \ldots, \varphi_D(x))$, and

$$f(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ -1 & \text{if } x < 0 \end{cases}$$

* Rosenblatt F. (1957): «The perceptron: a perceiving and recognizing automaton», Report 85-460-1, Cornell Aeronautical Lab., Ithaca, N.Y.
* Rosenblatt F. (1962): «Principles of Neurodynamics: perceptrons and theory of brain mechanisms», Spartan Books, Washington.

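Perceptron: 2-class case

A Perceptron can be seen as a 2-class classifier:

$$C_1 = \{x \in R : \psi(x, w) = 1\}, \qquad C_2 = \{x \in R : \psi(x, w) = -1\}$$

$$\psi(x, w) = f\left(w_0 + \sum_{i=1}^{D} w_i\,\varphi_i(x)\right) = \begin{cases} 1 & \text{if } x \in C_1 \\ -1 & \text{if } x \in C_2 \end{cases}$$

[Figure: classes $C_1$ and $C_2$ (points x and o), with $\psi(x, w) = 1$ on one side of the boundary and $\psi(x, w) = -1$ on the other.]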

Perceptron: 2-class case

There are several learning algorithms. With the desired output

$$d^{(n)} = \begin{cases} 1 & \text{if } x^{(n)} \in C_1 \\ -1 & \text{if } x^{(n)} \in C_2 \end{cases}$$

the goal is to have $w^T \varphi^{(n)} > 0$ if $x^{(n)} \in C_1$ and $w^T \varphi^{(n)} < 0$ if $x^{(n)} \in C_2$, that is:

$$\forall x^{(n)}, \quad w^T \varphi^{(n)} d^{(n)} > 0$$

If we call $(x^{(n)}, d^{(n)})$ the example taken into account at iteration $n$, we define the instantaneous error by:

$$\tilde{R}^{(n)}(w) = -\,w^T \varphi^{(n)} d^{(n)}$$

The global error sums it over the misclassified examples:

$$\tilde{R}(w) = -\sum_{n\,:\,x^{(n)} \text{ misclassified}} w^T \varphi^{(n)} d^{(n)}$$

Gradient descent techniques (steepest descent): suppose that at time $t$ the weights of the Perceptron are $w(t)$ and that the example $(x^{(n)}, d^{(n)})$ is presented. The weights are then changed by:

$$w(t+1) = w(t) - \varepsilon(t)\,\nabla_w \tilde{R}^{(n)}(w), \qquad \nabla_w \tilde{R}^{(n)}(w) = \frac{\partial \tilde{R}^{(n)}(w)}{\partial w} = -\,\varphi^{(n)} d^{(n)}$$

where $\varepsilon(t)$ is the gradient step and $\nabla_w \tilde{R}^{(n)}(w)$ the instantaneous gradient.

Perceptron learning algorithm: 2-class case

1- Initialize $w(0)$ randomly.
2- Randomly choose a pair of data $(x^{(n)}, d^{(n)})$.
3- Compute the output of the perceptron, $\psi(x^{(n)}, w) = f\left(w_0 + \sum_{i=1}^{D} w_i\,\varphi_i(x^{(n)})\right) = f(w^T \varphi^{(n)})$, and compare it to $d^{(n)}$.
4- Adapt the weights:
   - if $x^{(n)}$ is well classified, $f(w^T \varphi^{(n)}) = d^{(n)}$: $w(t+1) = w(t)$;
   - if $x^{(n)}$ is misclassified, $f(w^T \varphi^{(n)}) \ne d^{(n)}$: $w(t+1) = w(t) + \varepsilon(t)\,\varphi^{(n)} d^{(n)}$,
   where $\varepsilon(t)$ is the gradient step.
5- Repeat steps 2 to 4 until an acceptable error value is obtained.

Perceptron example

$$\psi(x, w) = x_1 \lor x_2$$

       x0   x1   x2   d
x^1     1    1    1    1
x^2     1    1   -1    1
x^3     1   -1    1    1
x^4     1   -1   -1   -1

[Figure: a perceptron with $\varphi_1(x^{(n)}) = x_1^{(n)}$ and $\varphi_2(x^{(n)}) = x_2^{(n)}$, weights $w_0, w_1, w_2$, computing $\sum_{i=0}^{D} w_i\,\varphi_i(x)$; the examples are plotted in the $(x_1, x_2)$ plane.]

Applet: http://lcn.epfl.ch/tutorial/french/perceptron/html/index.html
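The 2-class Perceptron learning rule can be run on the OR truth table with $\varphi_1(x) = x_1$ and $\varphi_2(x) = x_2$. This is a minimal sketch under assumptions: the function names, the constant step `step=1.0`, the iteration cap, and the seeded random generator are illustrative choices, not values given in the slides.

```python
import random

# OR truth table in +1/-1 coding; phi[0] = 1 is the bias cell.
DATA = [
    ([1.0, 1.0, 1.0], 1),
    ([1.0, 1.0, -1.0], 1),
    ([1.0, -1.0, 1.0], 1),
    ([1.0, -1.0, -1.0], -1),
]


def f(a):
    """Threshold function: 1 if a >= 0, -1 otherwise."""
    return 1 if a >= 0 else -1


def perceptron_output(w, phi):
    return f(sum(wi * pi for wi, pi in zip(w, phi)))


def train_perceptron(data, step=1.0, max_iterations=1000, seed=0):
    rng = random.Random(seed)
    w = [rng.uniform(-0.5, 0.5) for _ in range(3)]
    for _ in range(max_iterations):
        # Stop as soon as every example is well classified.
        if all(perceptron_output(w, phi) == d for phi, d in data):
            break
        phi, d = rng.choice(data)
        if perceptron_output(w, phi) != d:
            # Misclassified example: w <- w + step * phi * d.
            w = [wi + step * d * pi for wi, pi in zip(w, phi)]
    return w


w = train_perceptron(DATA)
print(w, [perceptron_output(w, phi) for phi, _ in DATA])
```

Unlike the Adaline, the weights change only when an example is misclassified, so training stops moving once the four points are separated.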

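Perceptron: p-class case ($C_1, C_2, \ldots, C_p$)

[Figure: the cells $1, \varphi_1, \varphi_2, \ldots, \varphi_D$ feed $p$ units, unit $i$ computing $\sum_{j=0}^{D} w_{ij}\,\varphi_j(x)$ with weights $w_{i0}, w_{i1}, \ldots, w_{iD}$; a Max unit selects the largest to give $\psi(x, w)$.]

$$\psi(x, w) = C_i \quad \text{if } \forall j \ne i,\ w_i^T \varphi > w_j^T \varphi$$

that is,

$$\psi(x, w) = C_i \quad \text{if } \forall j \ne i,\ \sum_{k=0}^{D} w_{ik}\,\varphi_k > \sum_{k=0}^{D} w_{jk}\,\varphi_k$$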

Perceptron learning algorithm: p-class case

1- Initialize $w(0)$ randomly.
2- Randomly choose a learning example $x^{(n)}$, with $x^{(n)} \in C_i$.
3- Compute the output of the perceptron, $\psi(x^{(n)}, w) = C_j \iff j = \operatorname{Argmax}_k\, w_k^T \varphi^{(n)}$, and compare it to $C_i$.
4- Adapt the weights:
   - if $x^{(n)}$ is well classified, $C_j = C_i$: $w(t+1) = w(t)$;
   - if $x^{(n)}$ is misclassified, $C_j \ne C_i$:
     $w_i(t+1) = w_i(t) + \varepsilon(t)\,\varphi^{(n)}$,
     $w_j(t+1) = w_j(t) - \varepsilon(t)\,\varphi^{(n)}$,
     $w_k(t+1) = w_k(t) \quad \forall k \ne i, j$,
   where $\varepsilon(t)$ is the gradient step.
5- Repeat steps 2 to 4 until an acceptable error value is obtained.
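The p-class rule can be sketched with one weight vector per class and an argmax decision. The code below makes several assumptions that are not in the slides: the small 3-class data set is invented for illustration, the weights start at zero, the examples are swept in order rather than drawn at random, and the misclassification update is the standard multi-class perceptron rule (reinforce the true class, penalise the predicted one).

```python
# Invented, linearly separable 3-class data: ([1, x1, x2], class index).
DATA = [
    ([1.0, 2.0, 0.0], 0),
    ([1.0, 3.0, 0.5], 0),
    ([1.0, -2.0, 1.0], 1),
    ([1.0, -3.0, 0.0], 1),
    ([1.0, 0.0, -3.0], 2),
    ([1.0, 0.5, -4.0], 2),
]


def scores(W, phi):
    """One linear score w_k^T phi per class."""
    return [sum(wk * pk for wk, pk in zip(w, phi)) for w in W]


def predict(W, phi):
    """Index of the class with the largest score (Argmax)."""
    s = scores(W, phi)
    return max(range(len(W)), key=lambda k: s[k])


def train(data, n_classes, step=1.0, max_epochs=100):
    dim = len(data[0][0])
    W = [[0.0] * dim for _ in range(n_classes)]
    for _ in range(max_epochs):
        errors = 0
        for phi, i in data:
            j = predict(W, phi)
            if j != i:
                errors += 1
                # Reinforce the true class i, penalise the predicted class j;
                # all other weight vectors are left unchanged.
                W[i] = [w + step * p for w, p in zip(W[i], phi)]
                W[j] = [w - step * p for w, p in zip(W[j], phi)]
        if errors == 0:
            break
    return W


W = train(DATA, n_classes=3)
print([predict(W, phi) for phi, _ in DATA])
```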

Perceptron vs Adaline: 2-class case

• Solution found by the Adaline: the most robust separation between the classes.
• Solution found by the Perceptron: a separation that minimizes the number of errors.

[Figure: classes A and B with the two separating lines. Block diagrams of the Perceptron and the Adaline: in each, the error is the difference between the supervision (desired output) and the computed output.]

Perceptron convergence theorem

«If a set of examples is linearly separable, then the Perceptron learning algorithm converges to a correct solution in a finite number of iterations.»

Arbib M.A. (1987): «Brains, Machines, and Mathematics», Berlin, Springer-Verlag.
Rosenblatt F. (1962): «Principles of Neurodynamics», N.Y., Spartan.
Block H.D. (1962): «The Perceptron: A Model for Brain Functioning», Reviews of Modern Physics 34, 123-135.
Minsky M.L. & Papert S.A. (1969): «Perceptrons», Cambridge, MIT Press.
Diederich S. & Opper M. (1987): «Learning of Correlated Patterns in Spin-Glass Networks by Local Learning Rules», Physical Review Letters 58, 949-952.

Multi-layer Architecture

The credit assignment problem

We are given a layered network, with inputs $x = (x_0, x_1, \ldots, x_D)$, a first weight layer W1, a second weight layer W2, and desired outputs $d$, together with a set of input-output example pairs.

• We can apply the Perceptron learning algorithm to determine W2.
• We do not know the desired outputs of the hidden units, so we cannot apply the Perceptron learning algorithm to determine W1.

The gradient back-propagation algorithm brings a simple solution to this problem.

The post connectionist break

Bryson A., Denham W., Dreyfus S. (1963): «Optimal Programming Problems with Inequality Constraints. I: Necessary Conditions for Extremal Solutions», AIAA Journal, Vol. 1, pp. 25-44.

LeCun Y. (1986): «Learning Processes in Asymmetric Threshold Network», Disordered Systems and Biological Organizations, Les Houches, France, Springer, pp. 223-240.

Rumelhart D., Hinton G.E., Williams R. (1986): «Learning Internal Representations by Error Propagation», in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. I, Bradford Books, MIT Press, Cambridge, MA, pp. 318-362.
