Université Paris 13 / Younès Bennani

ILOG 3

Traitement Informatique des Données (Computer Data Processing)

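AdaLinE: Recap

Stanford, 1960, Bernard Widrow

[Figure: Adaline unit. The inputs $x_0 = 1, x_1, \dots, x_n$ are weighted by $w_0, w_1, \dots, w_n$ and summed, $z = \sum_{i=0}^{n} w_i x_i$; the output is obtained through the threshold function $f$.]

$$f(x) = \begin{cases} 1 & \text{if } x > 0 \\ -1 & \text{if } x < 0 \end{cases}$$

Cost function

$$C^k_{Adaline}(w) = (y^k - w x^k)^2$$

$$\nabla_w \left(C^k_{Adaline}(w)\right) = \frac{\partial C^k_{Adaline}(w)}{\partial w} = -2\,[y^k - w x^k]\,x^k$$

Learning

$$w(t+1) = w(t) - \varepsilon(t)\,\nabla_w \left(C^k_{Adaline}(w)\right)$$

Written out component by component:

$$\begin{pmatrix} w_0(t+1) \\ w_1(t+1) \\ \vdots \\ w_n(t+1) \end{pmatrix} = \begin{pmatrix} w_0(t) \\ w_1(t) \\ \vdots \\ w_n(t) \end{pmatrix} + 2\,\varepsilon(t) \left( y^k - \begin{pmatrix} w_0(t) & w_1(t) & \cdots & w_n(t) \end{pmatrix} \begin{pmatrix} x_0^k \\ x_1^k \\ \vdots \\ x_n^k \end{pmatrix} \right) \begin{pmatrix} x_0^k \\ x_1^k \\ \vdots \\ x_n^k \end{pmatrix}$$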
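The Adaline update above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `adaline_step`, the toy data (logical OR in a -1/+1 coding) and the step size are assumptions, not part of the slides.

```python
import numpy as np

def adaline_step(w, x, y, eps):
    """One Adaline (LMS) update: w <- w + 2 * eps * (y - w.x) * x."""
    error = y - w @ x
    return w + 2.0 * eps * error * x

# Hypothetical toy data: the first column is the bias input x_0 = 1,
# the targets are the logical OR of x_1 and x_2 in a -1/+1 coding.
X = np.array([[1.0,  1.0,  1.0],
              [1.0,  1.0, -1.0],
              [1.0, -1.0,  1.0],
              [1.0, -1.0, -1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0])

w = np.zeros(3)
for epoch in range(50):
    for xk, yk in zip(X, y):
        w = adaline_step(w, xk, yk, eps=0.05)

# Threshold the linear output with f to obtain class labels.
predictions = np.where(X @ w > 0, 1.0, -1.0)
```

The update is applied pattern by pattern, which matches the per-pattern cost $C^k_{Adaline}$ used above.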

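Perceptron: Recap

Rosenblatt F., 1957, 1962

[Figure: perceptron. The retina $X$ feeds the association cells $\varphi_1, \varphi_2, \dots, \varphi_n$; their outputs, together with a constant input 1, are weighted by $w_0, w_1, \dots, w_n$ and summed as $\sum_{i=0}^{n} w_i \varphi_i(x)$ to give $\psi(x, w) = f(w^T \varphi)$.]

- the retina, which receives information from the outside
- the association cells: each cell has a transition function defined on the retina, $\varphi_i(x) : R \to R$
- the decision cell: $\psi(x, w) = f\left( w_0 + \sum_{i=1}^{n} w_i \varphi_i(x) \right)$

Separation condition:

$$\begin{cases} w^T \varphi^k > 0 & \text{for } x^k \in C_1 \\ w^T \varphi^k < 0 & \text{for } x^k \in C_2 \end{cases} \;\Rightarrow\; \forall x^k,\ w^T (\varphi^k y^k) > 0$$

Cost function:

$$C^k_{Perceptron}(w) = -\,w^T (\varphi^k y^k)$$

Learning:

If $f(w^T \varphi^k) = y^k$ then $w(t+1) = w(t)$

If $f(w^T \varphi^k) \neq y^k$ then $w(t+1) = w(t) + \varphi^k y^k$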
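As a rough illustration of the perceptron learning rule, the sketch below trains on a linearly separable toy problem (logical AND in a -1/+1 coding). The function name `train_perceptron`, the step size `eps` and the data are assumptions made for this example; a pattern lying exactly on the boundary is also treated as misclassified.

```python
import numpy as np

def train_perceptron(Phi, y, eps=1.0, max_epochs=100):
    """Perceptron rule: update w only on misclassified patterns."""
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for phi_k, y_k in zip(Phi, y):
            # w^T (phi^k y^k) <= 0 means the pattern is not correctly classified.
            if y_k * (w @ phi_k) <= 0:
                w = w + eps * y_k * phi_k
                mistakes += 1
        if mistakes == 0:   # every pattern satisfies w^T (phi^k y^k) > 0
            break
    return w

# Hypothetical toy data: phi_0 = 1 is the constant input, targets are AND in -1/+1 coding.
Phi = np.array([[1.0,  1.0,  1.0],
                [1.0,  1.0, -1.0],
                [1.0, -1.0,  1.0],
                [1.0, -1.0, -1.0]])
y = np.array([1.0, -1.0, -1.0, -1.0])

w = train_perceptron(Phi, y)
```

Because the classes are linearly separable here, the loop stops as soon as one full pass produces no update.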

General representation of a continuous function

Kolmogorov (1957)

Theorem [Kolmogorov (1957)]:

Any continuous function $f(x)$ defined on $[0,1]^n$ can be written in the form:

$$f(x) = \sum_{i=1}^{2n+1} F_i \left( \sum_{j=1}^{n} G_{ij}(x_j) \right)$$

where the $F_i$ and the $G_{ij}$ are continuous functions of one variable.

[Figure: network with inputs $x_0, x_1, \dots, x_n$, a first layer of functions $G_{1,1}, \dots, G_{1,n}, \dots, G_{2n+1,1}, \dots, G_{2n+1,n}$ and a second layer of functions $F_1, \dots, F_{2n+1}$ whose outputs are summed.]

Example:

$$f(x) = x_1 \cdot x_2 = \frac{1}{4} \left( (x_1 + x_2)^2 - (x_1 - x_2)^2 \right)$$

A connectionist network can approximate the functions $F_i$ and $G_{ij}$ by functions of the form $f(w_0 + w^T x)$.
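The product example can be checked numerically. The sketch below is only an illustration: the function name `product_as_sum_of_one_variable_functions` is hypothetical, and it uses the two-term decomposition of the example rather than the full $2n+1$ terms of the theorem.

```python
import numpy as np

def product_as_sum_of_one_variable_functions(x1, x2):
    """Write x1 * x2 using only sums and one-variable functions."""
    # Inner sums of one-variable functions of x1 and x2.
    s_plus = x1 + x2
    s_minus = x1 - x2
    # Outer one-variable functions: F_1(u) = u^2 / 4 and F_2(u) = -u^2 / 4.
    return 0.25 * s_plus ** 2 - 0.25 * s_minus ** 2

rng = np.random.default_rng(0)
x1 = rng.uniform(0.0, 1.0, size=1000)
x2 = rng.uniform(0.0, 1.0, size=1000)
values = product_as_sum_of_one_variable_functions(x1, x2)
```

Comparing `values` with `x1 * x2` confirms the identity on random points of $[0,1]^2$.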

Multi-Layer Perceptron (MLP; in French, Perceptron Multi-Couche, PMC)

Architecture:

similar to that of the Perceptron or Madaline, plus intermediate processing layers (hidden layers)

- Outer layers: Input (e units), Output (s units)
- Inner layers: Hidden (c units)

Notation: < e | c | s >, for example < 6 | 4 | 2 >

Goal:

$$\{ x^k \to y^k \}_{k=1,\dots,N}$$

$$D_N = \{ (x^1, y^1), (x^2, y^2), \dots, (x^N, y^N) \}$$

[Figure: network with inputs $x_0, x_1, \dots, x_n$; the input, hidden and output layers are connected by the weights $w^{(1)}$ and $w^{(2)}$; the computed output $\hat{y}$ is compared with the desired output $y$.]

Multi-Layer Perceptron (MLP)

Notations:

[Figure: network whose units are numbered 0 to 5 (input layer), 6 to 9 (hidden layer), 10 and 11 (output layer).]

- $E$: the set of input units
- $S$: the set of output units
- $Amont(k)$ (upstream): the set of units whose outputs serve as inputs to unit $k$
- $Aval(k)$ (downstream): the set of units that use the output of unit $k$ as input

By definition:

$\forall i \in E,\ Amont(i) = \emptyset$

$\forall i \in S,\ Aval(i) = \emptyset$

Examples:

Amont(7) = {0, 1, 2, 3, 4, 5}, Aval(7) = {10, 11}

Amont(2) = ∅, Aval(2) = {6, 7, 8, 9}

Amont(11) = {6, 7, 8, 9}, Aval(11) = ∅
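The Amont and Aval sets can be derived mechanically from the layer structure. The sketch below assumes a fully connected layered network with the unit numbering of the examples (units 0 to 5, 6 to 9, 10 and 11); the function name `build_amont_aval` is hypothetical.

```python
# Units grouped by layer, numbered as in the examples above.
layers = [list(range(0, 6)), list(range(6, 10)), [10, 11]]

def build_amont_aval(layers):
    """Return the upstream (Amont) and downstream (Aval) sets of every unit."""
    amont = {u: set() for layer in layers for u in layer}
    aval = {u: set() for layer in layers for u in layer}
    # Each unit is connected to every unit of the next layer.
    for lower, upper in zip(layers, layers[1:]):
        for u in upper:
            amont[u] = set(lower)
        for u in lower:
            aval[u] = set(upper)
    return amont, aval

amont, aval = build_amont_aval(layers)
```

Input units keep an empty `amont` set and output units an empty `aval` set, which matches the definition.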

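Multi-Layer Perceptron (MLP)

[Figure: MLP with an input layer ($x_1, \dots, x_n$ plus the bias input $x_0 = 1$), a hidden layer ($z_1, \dots, z_m$ plus the bias unit $z_0 = 1$) and an output layer ($y_1, \dots, y_p$). The first-layer weights shown are $w^{(1)}_{10}$, $w^{(1)}_{11}$, ..., $w^{(1)}_{mn}$; the second-layer weights shown are $w^{(2)}_{10}$, $w^{(2)}_{p1}$, ..., $w^{(2)}_{pm}$.]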

Multi-Layer Perceptron (MLP)

The activation of the $j$-th hidden cell is:

$$a_j = w^{(1)}_{0j} + \sum_{i \in Amont(j)} w^{(1)}_{ij}\, x_i$$

The output of this $j$-th hidden cell is obtained by a nonlinear transformation of the activation:

$$z_j = f(a_j)$$

In the same way, the activation and the output of the $k$-th output unit are obtained as follows:

$$a_k = w^{(2)}_{0k} + \sum_{j \in Amont(k)} w^{(2)}_{jk}\, z_j$$

$$\hat{y}_k = f(a_k)$$

Combining the computation of the hidden-cell outputs with that of the output cells gives the following expression for the $k$-th output of the network:

$$\hat{y}_k = f\left( w^{(2)}_{0k} + \sum_{j \in Amont(k)} w^{(2)}_{jk}\, f\left( w^{(1)}_{0j} + \sum_{i \in Amont(j)} w^{(1)}_{ij}\, x_i \right) \right)$$
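The combined expression for the network output can be sketched as a short forward pass. The storage layout is an assumption made for this example: `W1` has shape (n+1, m) and `W2` has shape (m+1, p), with row 0 holding the bias weights $w_{0j}$ and $w_{0k}$; the choice of `np.tanh` for $f$ is also an assumption.

```python
import numpy as np

def forward(x, W1, W2, f=np.tanh):
    """Forward pass of a one-hidden-layer MLP.

    x  : (n,) input vector, without the bias component
    W1 : (n + 1, m) first-layer weights, W1[0, j] is the bias w_0j
    W2 : (m + 1, p) second-layer weights, W2[0, k] is the bias w_0k
    """
    a_hidden = W1[0] + x @ W1[1:]   # a_j = w_0j + sum_i w_ij x_i
    z = f(a_hidden)                 # z_j = f(a_j)
    a_out = W2[0] + z @ W2[1:]      # a_k = w_0k + sum_j w_jk z_j
    return f(a_out)                 # y_hat_k = f(a_k)

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1 = rng.normal(size=(4, 2))
W2 = rng.normal(size=(3, 2))
y_hat = forward(x, W1, W2)
```

Storing the bias in row 0 keeps the index convention $w_{ij}$ (from unit $i$ to unit $j$) used in the formulas of this slide.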

Multi-Layer Perceptron (MLP)

Types of function $f$

$$f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

$$f(x) = \frac{1}{1 + e^{-x}}$$

$$f(x) = \frac{x}{1 + |x|}$$

$$f(x) = \frac{e^{x} - 1}{e^{x} + 1}$$

[Figure: plots of the functions $f(x)$.]
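The activation functions listed above are easy to compare numerically. In this sketch the function names are hypothetical, and the third function is taken to be $x / (1 + |x|)$, which is an assumption about the original slide.

```python
import numpy as np

def tanh_activation(x):
    return np.tanh(x)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def rational_sigmoid(x):
    # Assumed form x / (1 + |x|); bounded in (-1, 1).
    return x / (1.0 + np.abs(x))

def bipolar_sigmoid(x):
    return (np.exp(x) - 1.0) / (np.exp(x) + 1.0)

x = np.linspace(-5.0, 5.0, 11)
values = {
    "tanh": tanh_activation(x),
    "logistic": logistic(x),
    "rational": rational_sigmoid(x),
    "bipolar": bipolar_sigmoid(x),
}
```

Two identities are worth noting: $(e^x - 1)/(e^x + 1) = \tanh(x/2) = 2/(1 + e^{-x}) - 1$, so the last function is a rescaled logistic function with range $(-1, 1)$.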

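Decision surfaces and the MLP

[Figure: from input to output, a first layer of units defines hyperplanes, an AND combination of hyperplanes defines convex regions, and an OR combination of convex regions defines arbitrary regions.]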

Multi-Layer Perceptron (MLP)

Learning criteria

$$C_{mse}(w) = \frac{1}{2} \sum_{k=1}^{N} (y^k - \hat{y}^k)^2$$

$$C_{multiple\_logistic}(w) = \sum_{k=1}^{N} \sum_{p=1}^{n} \left[ y_p^k \log \frac{y_p^k}{\hat{y}_p^k} + (1 - y_p^k) \log \frac{1 - y_p^k}{1 - \hat{y}_p^k} \right]$$

$$C_{log\_likelihood}(w) = \sum_{k=1}^{N} y^k \log \frac{y^k}{p^k} \quad \text{with} \quad p^k = \frac{e^{\hat{y}^k}}{\sum_j e^{\hat{y}^j}}$$

$$C_{weighted\_mse}(w) = \sum_{k=1}^{N} (y^k - \hat{y}^k)^T\, \Sigma^{-1}\, (y^k - \hat{y}^k) \quad \text{with} \quad \Sigma = \frac{1}{N} \sum_{k=1}^{N} (y^k - \hat{y}^k)(y^k - \hat{y}^k)^T$$
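The learning criteria above can be sketched as NumPy functions. All names here are hypothetical. Two choices are assumptions made for the example: a small constant `tiny` guards the logarithm when a target is exactly 0, and the weighted criterion receives the matrix `sigma` as an argument (for instance an estimate computed beforehand) instead of recomputing it from the same errors.

```python
import numpy as np

def mse(y, y_hat):
    return 0.5 * np.sum((y - y_hat) ** 2)

def _t_log_ratio(t, q, tiny=1e-12):
    # t * log(t / q), with the convention 0 * log(0 / q) = 0.
    return t * np.log(np.clip(t, tiny, None) / q)

def multiple_logistic(y, y_hat):
    return np.sum(_t_log_ratio(y, y_hat) + _t_log_ratio(1.0 - y, 1.0 - y_hat))

def log_likelihood(y, y_hat):
    # Softmax over the outputs of each pattern (rows of y_hat).
    shifted = y_hat - y_hat.max(axis=1, keepdims=True)
    p = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    return np.sum(_t_log_ratio(y, p))

def weighted_mse(y, y_hat, sigma):
    errors = y - y_hat                       # shape (N, p)
    sigma_inv = np.linalg.inv(sigma)
    return np.sum((errors @ sigma_inv) * errors)

# Hypothetical targets and outputs: N = 2 patterns, 2 outputs each.
y = np.array([[1.0, 0.0], [0.0, 1.0]])
y_hat = np.array([[0.5, 0.5], [0.5, 0.5]])
```

With `sigma` equal to the identity matrix, `weighted_mse` reduces to twice `mse`, which gives a quick sanity check.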

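Multi-Layer Perceptron (MLP)

Adaptation rules

Error over the whole training set:

$$C_e(w) = \frac{1}{2N} \sum_{k=1}^{N} \sum_{i=1}^{n} (y_i^k - \hat{y}_i^k)^2$$

Error on pattern $k$:

$$C_e^k(w) = \frac{1}{2} \sum_{i=1}^{n} (y_i^k - \hat{y}_i^k)^2$$

Update using all $N$ patterns:

$$w_{ij}(t+1) = w_{ij}(t) - \frac{1}{N}\, \varepsilon(t) \sum_{k=1}^{N} \frac{\partial C_e^k(w)}{\partial w_{ij}(t)}$$

Update after each pattern:

$$w_{ij}(t+1) = w_{ij}(t) - \varepsilon(t)\, \frac{\partial C_e^k(w)}{\partial w_{ij}(t)}$$

$$w_{ij}(t+1) = w_{ij}(t) - \varepsilon(t)\, \delta_i^k\, x_j$$

with

$$\delta_i^k = \begin{cases} f'(a_i)\, (\hat{y}_i^k - y_i^k) & \text{if } i \in S \\ f'(a_i) \sum_{h \in Aval(i)} w_{hi}\, \delta_h^k & \text{if } i \notin S \end{cases}$$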

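Gradient back-propagation

$$C_e^k(w) = \frac{1}{2} \sum_{i=1}^{n} (y_i^k - \hat{y}_i^k)^2$$

With $a_i = w_{i0} + \sum_{j \in Amont(i)} w_{ij}\, x_j$ and $x_i = f(a_i)$, the chain rule gives:

$$\frac{\partial C_e^k(w)}{\partial w_{ij}} = \underbrace{\frac{\partial C_e^k(w)}{\partial x_i}}_{A_i}\ \underbrace{\frac{\partial x_i}{\partial a_i}}_{B_i}\ \underbrace{\frac{\partial a_i}{\partial w_{ij}}}_{C_{ij}}$$

$$B_i = f'(a_i), \qquad C_{ij} = x_j$$

$$\delta_i^k = \frac{\partial C_e^k(w)}{\partial a_i} = A_i B_i = A_i\, f'(a_i)$$

If $i \in S$:

$$A_i = -(y_i^k - \hat{y}_i^k)$$

If $i \notin S$:

$$A_i = \sum_{h \in Aval(i)} A_h B_h \frac{\partial a_h}{\partial x_i} = \sum_{h \in Aval(i)} A_h B_h D_{hi} = \sum_{h \in Aval(i)} A_h\, f'(a_h)\, w_{hi}, \qquad D_{hi} = w_{hi}$$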
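One way to gain confidence in the derivation is to compare the back-propagated gradient with a finite-difference estimate. The sketch below makes several assumptions: weights are stored as `W[i, j]` (to unit `i` from unit `j`), $f$ is `tanh`, the input vector already contains the bias component $x_0 = 1$, and the output layer has no bias. All function names are hypothetical.

```python
import numpy as np

def f(a):
    return np.tanh(a)

def f_prime(a):
    return 1.0 - np.tanh(a) ** 2

def forward(W1, W2, x):
    a1 = W1 @ x          # hidden activations a_i = sum_j w_ij x_j
    z = f(a1)
    a2 = W2 @ z          # output activations
    return a1, z, a2, f(a2)

def pattern_error(W1, W2, x, y):
    return 0.5 * np.sum((y - forward(W1, W2, x)[3]) ** 2)

def backprop_gradients(W1, W2, x, y):
    a1, z, a2, y_hat = forward(W1, W2, x)
    delta_out = f_prime(a2) * (y_hat - y)          # case i in S
    delta_hid = f_prime(a1) * (W2.T @ delta_out)   # case i not in S
    # dC/dw_ij = delta_i * x_j for each layer.
    return np.outer(delta_hid, x), np.outer(delta_out, z)

def numerical_gradient(W1, W2, x, y, which, h=1e-6):
    W = (W1, W2)[which]              # perturbed in place, then restored
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        old = W[idx]
        W[idx] = old + h
        e_plus = pattern_error(W1, W2, x, y)
        W[idx] = old - h
        e_minus = pattern_error(W1, W2, x, y)
        W[idx] = old
        grad[idx] = (e_plus - e_minus) / (2.0 * h)
    return grad

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 3))
W2 = rng.normal(size=(2, 3))
x = np.array([1.0, 1.0, -1.0])
y = np.array([1.0, -1.0])
g1, g2 = backprop_gradients(W1, W2, x, y)
```

If the two estimates agree to within numerical precision, the $\delta$ recursion has been implemented consistently with the cost.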

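Gradient back-propagation

[Figure: unit $i$ receives the inputs $x_0, x_1, \dots, x_n$ through the weights $w_{i0}, w_{i1}, \dots, w_{in}$, computes $a_i$ and outputs $y_i$. Its output feeds the downstream units through the weights $w_{i+1,i}, \dots, w_{ki}, \dots, w_{mi}$, and the error signals $\delta_{i+1}, \dots, \delta_k, \dots, \delta_m$ flow back through the same weights to give $\delta_i$.]

Propagation:

$$a_i = \sum_{j \in Amont(i)} w_{ij}\, x_j, \qquad y_i = f(a_i)$$

Back-propagation:

$$\delta_i^k = f'(a_i) \sum_{h \in Aval(i)} w_{hi}\, \delta_h^k$$

Multi-Layer Perceptron

Learning algorithm

1- Draw the initial weights $w_0$ at random

2- Present a pattern $(x^k, y^k)$

3- Compute the state of the network by propagation:

$$a_i = w_{i0} + \sum_{j \in Amont(i)} w_{ij}\, x_j, \qquad x_i = f(a_i)$$

4- Compute the error signals:

$$\delta_i^k = \begin{cases} f'(a_i)\, (\hat{y}_i^k - y_i^k) & \text{if } i \in S \\ f'(a_i) \sum_{h \in Aval(i)} w_{hi}\, \delta_h^k & \text{if } i \notin S \end{cases}$$

5- Adapt the weights:

$$w_{ij}(t+1) = w_{ij}(t) - \varepsilon(t)\, \delta_i^k\, x_j$$

where $\varepsilon(t)$ is the gradient step.

6- Repeat steps 2 to 5 until the error reaches an acceptable value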
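The six steps of the learning algorithm can be sketched as a small training loop. This is a minimal sketch under stated assumptions: one hidden layer, `tanh` for $f$ (so $f'(a) = 1 - f(a)^2$), a constant step `eps`, weights stored as `W[i, j]` (to unit `i` from unit `j`), a fixed number of epochs instead of an error threshold, and an XOR-like toy data set in a -1/+1 coding. The function name `train_mlp` and all parameter values are hypothetical.

```python
import numpy as np

def train_mlp(X, Y, n_hidden=4, eps=0.1, n_epochs=2000, seed=0):
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], Y.shape[1]
    # Step 1: draw the initial weights at random.
    W1 = rng.uniform(-1.0, 1.0, size=(n_hidden, n_in)) / np.sqrt(n_in)
    W2 = rng.uniform(-1.0, 1.0, size=(n_out, n_hidden + 1)) / np.sqrt(n_hidden + 1)
    errors = []
    for _ in range(n_epochs):                       # Step 6: repeat
        total = 0.0
        for x, y in zip(X, Y):                      # Step 2: present a pattern
            # Step 3: propagation (z[0] = 1 is the hidden bias unit).
            z = np.concatenate(([1.0], np.tanh(W1 @ x)))
            y_hat = np.tanh(W2 @ z)
            # Step 4: error signals, output units first, then hidden units.
            delta_out = (1.0 - y_hat ** 2) * (y_hat - y)
            delta_hid = (1.0 - z[1:] ** 2) * (W2[:, 1:].T @ delta_out)
            # Step 5: adapt the weights, w_ij <- w_ij - eps * delta_i * x_j.
            W2 -= eps * np.outer(delta_out, z)
            W1 -= eps * np.outer(delta_hid, x)
            total += 0.5 * np.sum((y - y_hat) ** 2)
        errors.append(total / len(X))
    return W1, W2, errors

# Toy data: first input component is the bias x_0 = 1, target is XOR in -1/+1 coding.
X = np.array([[1.0, 1.0, 1.0], [1.0, 1.0, -1.0], [1.0, -1.0, 1.0], [1.0, -1.0, -1.0]])
Y = np.array([[-1.0], [1.0], [1.0], [-1.0]])
W1, W2, errors = train_mlp(X, Y)
```

The hidden error signals are computed before the second-layer weights are modified, so both layers use the weights that produced the current output.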

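Example of an MLP

$$D = \left\{ \left( \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}, -1 \right);\ \left( \begin{pmatrix} 1 \\ 1 \\ -1 \end{pmatrix}, 1 \right);\ \left( \begin{pmatrix} 1 \\ -1 \\ 1 \end{pmatrix}, 1 \right);\ \left( \begin{pmatrix} 1 \\ -1 \\ -1 \end{pmatrix}, -1 \right) \right\}$$

$$\psi(x, w) = x_1 \oplus x_2$$

[Figure: two-layer network with inputs $x_1$, $x_2$ and bias inputs equal to 1. Weight labels shown on the connections: 0.5, 1.0, -1.0, 1.5, 1.0, 1.0, 1.0, 1.0, 0.5. Three plots of the $(x_1, x_2)$ plane show the separating lines.]

Separating lines:

$1.0\,x_1 + 1.0\,x_2 + 1.5 = 0$, that is, $x_1 + x_2 + 1.5 = 0$

$1.0\,x_1 + 1.0\,x_2 + 0.5 = 0$, that is, $x_1 + x_2 + 0.5 = 0$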
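To make the example concrete, the sketch below hand-builds a small threshold network that reproduces the targets of $D$ (exclusive OR in a -1/+1 coding). The weights used here are assumptions chosen for this illustration and are not claimed to be those of the slide's figure: one hidden unit behaves like an OR, the other like an AND, and the output unit fires when the first is on and the second is off.

```python
import numpy as np

def sign(a):
    return np.where(a > 0, 1.0, -1.0)

# Assumed weights. Columns are (bias, x1, x2) for the hidden layer
# and (bias, h1, h2) for the output unit.
W_hidden = np.array([[ 1.5, 1.0, 1.0],    # h1 = +1 unless x1 = x2 = -1 (OR)
                     [-1.5, 1.0, 1.0]])   # h2 = +1 only if x1 = x2 = +1 (AND)
w_out = np.array([-0.5, 1.0, -1.0])       # +1 when h1 = +1 and h2 = -1

def xor_net(x):
    """x = (1, x1, x2), with the bias input as first component."""
    h = sign(W_hidden @ x)
    return float(sign(w_out @ np.concatenate(([1.0], h))))

D = [(np.array([1.0,  1.0,  1.0]), -1.0),
     (np.array([1.0,  1.0, -1.0]),  1.0),
     (np.array([1.0, -1.0,  1.0]),  1.0),
     (np.array([1.0, -1.0, -1.0]), -1.0)]
outputs = [xor_net(x) for x, _ in D]
```

Each hidden unit defines one separating line in the $(x_1, x_2)$ plane; the output unit combines the two half-planes into the region where exactly one input is +1.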

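Structured networks

Full connections

[Figure: fully connected network with inputs $x_1, \dots, x_n$; the computed output $\hat{y}$ is compared with the desired output $y$.]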

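Structured networks

**Full connections with context [Elman*]**

[Figure: fully connected network whose input layer receives the current inputs $x_1(t), \dots, x_n(t)$ together with the context units $c_1(t-1), \dots, c_m(t-1)$; the computed output $\hat{y}$ is compared with the desired output $y$.]

* Elman J.L. (1990): "Finding structure in time", Cognitive Science, Vol. 14, pp. 179-211.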

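Structured networks

**Full connections with context [Jordan*]**

[Figure: fully connected network whose input layer receives the current inputs $x_1(t), \dots, x_n(t)$ together with a context made of the previous outputs $y_1(t-1), \dots, y_p(t-1)$; the computed output $\hat{y}$ is compared with the desired output $y$.]

* Jordan M.I. (1992): "Constrained supervised learning", Journal of Mathematical Psychology, Vol. 36, pp. 396-425.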

Structured networks

Local connections

Using local connections greatly reduces the number of weights in a network.

[Figure: feature extractors computing local features, each over its own receptive field. The three receptive fields shown have distinct 2 x 2 weight sets $(w_1^{(r)}, w_2^{(r)}, w_3^{(r)}, w_4^{(r)})$, $r = 1, 2, 3$.]

Structured networks

Constrained connections, or shared weights

An interesting property of the weight-sharing mechanism is the very small number of free parameters.

[Figure: feature extractors computing local features over receptive fields. The same 2 x 2 weight set $(w_1^{(1)}, w_2^{(1)}, w_3^{(1)}, w_4^{(1)})$ is reused at several positions and acts as a convolution filter; the weight sets $w^{(3)}$ and $w^{(5)}$ are also shown.]
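The saving in free parameters can be made explicit with a small count. The sketch below assumes a one-dimensional input, one hidden unit per receptive-field position, and ignores biases; the function name `count_weights` and the sizes used are hypothetical.

```python
def count_weights(n_inputs, field_size, stride=1):
    """Number of first-layer weights for three connection schemes."""
    n_fields = (n_inputs - field_size) // stride + 1   # one hidden unit per position
    return {
        "full": n_fields * n_inputs,      # every hidden unit sees every input
        "local": n_fields * field_size,   # every hidden unit sees only its field
        "shared": field_size,             # all fields reuse one weight set
    }

counts = count_weights(n_inputs=16, field_size=4, stride=1)
```

With 16 inputs and a field of 4, the 13 hidden units need 208 weights when fully connected, 52 with local connections, and only 4 free parameters with shared weights.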

Structured networks

LeNet for digit recognition **[Yann LeCun*]**

* Le Cun Y., Boser B., Denker J.S., Henderson D., Howard R.E., Hubbard W., Jackel L.D. (1989): "Back-propagation applied to handwritten zip code recognition", Neural Computation, Vol. 1, pp. 541-551.

Structured networks

TDNN (Time Delay Neural Network)

[Figure: input with a frequency axis and a time axis; feature extractors slide along the time axis up to the output. Each feature extractor $r = 1, \dots, 4$ is a convolution filter with eight shared weights arranged as four rows by two time steps, $\begin{pmatrix} w_1^{(r)} & w_5^{(r)} \\ w_2^{(r)} & w_6^{(r)} \\ w_3^{(r)} & w_7^{(r)} \\ w_4^{(r)} & w_8^{(r)} \end{pmatrix}$; the last layer has the weights $(w_1^{(5)}, w_2^{(5)}, w_3^{(5)})$.]

With a window of $n = 2$ time steps, a shift of $d = 1$ and $N = 6$ input frames, the number of window positions is:

$$M = \frac{N - n}{d} + 1 = 5$$
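The window count and the weight sharing along the time axis can be sketched as follows. The reading of $n$ as the window width, $d$ as the shift between windows and $N$ as the number of frames is an interpretation of the formula, and the function names and array sizes are hypothetical.

```python
import numpy as np

def n_positions(N, n, d):
    """Number of window positions: M = (N - n) / d + 1."""
    return (N - n) // d + 1

def time_delay_layer(frames, kernel, d=1, f=np.tanh):
    """Apply one shared filter along the time axis.

    frames : (n_freq, N) input, one column per time frame
    kernel : (n_freq, n) weights, identical at every window position
    """
    N = frames.shape[1]
    n = kernel.shape[1]
    M = n_positions(N, n, d)
    return np.array([f(np.sum(kernel * frames[:, t * d:t * d + n]))
                     for t in range(M)])

frames = np.ones((4, 6))      # 4 frequency rows, N = 6 time frames
kernel = np.ones((4, 2))      # 8 shared weights, window of n = 2 frames
outputs = time_delay_layer(frames, kernel, d=1)
```

The same eight weights are applied at each of the five window positions, which is what makes the layer a convolution along time.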

Structured networks

**Speech recognition [Alex Waibel*]**

* Waibel A., Hanazawa T., Hinton G., Shikano K., Lang K. (1987): "Phoneme recognition using Time-Delay Neural Networks", Tech. Rep. ATR, TR-1-006.

Data organization

Output coding

"1-of-C" coding

Category    Coding (0/1)    Coding (-1/1)
Blue        1 0 0            1 -1 -1
White       0 1 0           -1  1 -1
Red         0 0 1           -1 -1  1

"1-of-(C-1)" coding

Category    Coding (0/1)    Coding (-1/1)
Blue        1 0              1 -1
White       0 1             -1  1
Red         0 0             -1 -1

Thermometer coding (order between the classes)

Coding (0/1)
1 1 1
0 1 1
0 0 1

[Figure: points labelled 1, 2 and 3 and unknown points x marked "?"; label "Réseau 2" (Network 2).]
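The three output codings can be generated programmatically. In this sketch the function names are hypothetical, classes are identified by an index starting at 0, and the orientation of the thermometer code (index 0 gives all ones) is an assumption.

```python
import numpy as np

def one_of_c(index, C, low=0.0):
    """1-of-C coding: one unit per class; `low` is 0 or -1."""
    code = np.full(C, low, dtype=float)
    code[index] = 1.0
    return code

def one_of_c_minus_1(index, C, low=0.0):
    """1-of-(C-1) coding: the last class is coded by all units at `low`."""
    code = np.full(C - 1, low, dtype=float)
    if index < C - 1:
        code[index] = 1.0
    return code

def thermometer(rank, C):
    """Thermometer coding for ordered classes."""
    code = np.zeros(C)
    code[rank:] = 1.0
    return code

categories = ["Blue", "White", "Red"]
codes = {name: one_of_c(i, len(categories)) for i, name in enumerate(categories)}
```

The 1-of-(C-1) variant saves one output unit, while the thermometer code keeps neighbouring classes close to each other in output space.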

Data organization

Inputs

As with most statistical techniques, it always pays to pre-process the data so that they are centered and scaled to unit variance.

$$x^k \leftarrow \frac{x^k - \bar{x}}{\sigma}, \qquad \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x^i, \qquad \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x^i - \bar{x})^2$$

With $a_i = \sum_{j \in Amont(i)} w_{ij}\, x_j$, weights drawn independently of the inputs and $E[w_{ij}] = 0$:

$$E[a_i] = E\left[ \sum_{j \in Amont(i)} w_{ij}\, x_j \right] = \sum_{j \in Amont(i)} E[w_{ij}]\, E[x_j] = 0$$

$$Var[a_i] = Var\left[ \sum_{j \in Amont(i)} w_{ij}\, x_j \right] = \sum_{j \in Amont(i)} E[w_{ij}^2]\, E[x_j^2] = \sum_{j \in Amont(i)} Var[w_{ij}]\, Var[x_j] = |Amont(i)|\ \sigma^2[w_{ij}]$$

Weight initialization

$$Var[a_i] = |Amont(i)|\ \sigma^2[w_{ij}] = 1 \;\Rightarrow\; \sigma^2[w_{ij}] = \frac{1}{|Amont(i)|} \;\Rightarrow\; \sigma[w_{ij}] = |Amont(i)|^{-1/2}$$

The weights can therefore be initialized from a uniform distribution on an interval:

$$w_{ij} \in \left[ -\frac{k}{\sqrt{|Amont(i)|}},\ \frac{k}{\sqrt{|Amont(i)|}} \right], \qquad 0.5 < k < 2$$

$$f(x) = 1.71 \tanh\left( \frac{2}{3} x \right)$$
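The pre-processing and initialization recipe can be sketched as follows. The function names, the default `k = 1.0` and the array sizes are assumptions made for the example; the weights are stored with one row per unit and one column per upstream unit.

```python
import numpy as np

def standardize(X):
    """Center each input variable and scale it to unit variance (1/N estimate)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)          # np.std uses the 1/N normalization by default
    return (X - mean) / std

def init_weights(n_units, fan_in, k=1.0, rng=None):
    """Uniform weights on [-k / sqrt(fan_in), k / sqrt(fan_in)], fan_in = |Amont(i)|."""
    rng = np.random.default_rng() if rng is None else rng
    bound = k / np.sqrt(fan_in)
    return rng.uniform(-bound, bound, size=(n_units, fan_in))

def scaled_tanh(x):
    return 1.71 * np.tanh(2.0 * x / 3.0)

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 16))
Z = standardize(X)
W = init_weights(n_units=5, fan_in=16, k=1.0, rng=rng)
a = Z @ W.T                      # activations of the 5 units for every pattern
```

A uniform distribution on $[-c, c]$ has variance $c^2 / 3$, so $k = \sqrt{3} \approx 1.73$ gives exactly $\sigma^2[w_{ij}] = 1 / |Amont(i)|$, which lies inside the recommended range for $k$.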

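Exercise

$$D = \left\{ \left( \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ -1 \end{pmatrix} \right);\ \left( \begin{pmatrix} 1 \\ -1 \\ -1 \end{pmatrix}, \begin{pmatrix} 1 \\ -1 \end{pmatrix} \right);\ \left( \begin{pmatrix} 1 \\ 1 \\ -1 \end{pmatrix}, \begin{pmatrix} -1 \\ 1 \end{pmatrix} \right);\ \left( \begin{pmatrix} 1 \\ -1 \\ -1 \end{pmatrix}, \begin{pmatrix} -1 \\ 1 \end{pmatrix} \right) \right\}$$

[Figure: network with bias inputs equal to 1 and the weight labels 0.5, 0.3, 0.4, 0.1, 0.3, 0.6, -0.8, 0.9, 0.2, -0.4, -0.2, -0.7.]

$$f(x) = \frac{e^{x} - 1}{e^{x} + 1}$$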