Université Paris 13 / Younès Bennani

ILOG 3: Traitement Informatique des Données (Data Processing)

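Review

[Figure: formal neuron $i$. Inputs $x_0, x_1, \dots, x_n$ with weights $w_{i0}, w_{i1}, \dots, w_{in}$; activation $a_i = F(x_j, w_{ij})$; output $y_i = f(a_i)$.]

Learning, or induction

[Figure: a model with inputs $x_0, x_1, \dots, x_n$ maps $x$ to $y$ and computes $z = \psi(x, w)$. Two elements of the slide carry the annotation "Choice".]

Training set: $D = \{(x^1, y^1), (x^2, y^2), \dots, (x^N, y^N)\}$, drawn from $p(x, y) = p(x)\,p(y \mid x)$.

Family of functions: $F = \{\psi(x, w) \,/\, w \in \Omega\}$.

Theoretical cost: $C(w) = \int L[y, \psi(x, w)]\, dp(x, y)$, with $\psi(x, w^*) = \arg\min_{\psi(x, w)} C(w)$.

Empirical cost: $C_{emp}(w) = \frac{1}{N} \sum_{k=1}^{N} L[y^k, \psi(x^k, w)]$, with $\psi(x, w^+) = \arg\min_{\psi(x, w)} C_{emp}(w)$.

Gradient descent on the empirical cost, summed over all examples:

$w_{ji}(t+1) = w_{ji}(t) - \frac{1}{N}\,\varepsilon(t) \sum_{k=1}^{N} \frac{\partial C_{emp}^k(w)}{\partial w_{ji}(t)}$

or, one example at a time:

$w_{ji}(t+1) = w_{ji}(t) - \varepsilon(t)\, \frac{\partial C_{emp}^k(w)}{\partial w_{ji}(t)}$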
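The empirical cost and the two gradient updates can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides leave the loss $L$ and the model $\psi$ generic, so the sketch assumes a linear model and a squared loss, and the function names are hypothetical.

```python
def psi(x, w):
    # Assumed model: a linear unit, psi(x, w) = sum_i w_i * x_i.
    return sum(wi * xi for wi, xi in zip(w, x))

def loss(y, z):
    # Assumed loss L[y, psi(x, w)]: squared error.
    return (y - z) ** 2

def empirical_cost(data, w):
    # C_emp(w) = (1/N) * sum_k L[y^k, psi(x^k, w)]
    return sum(loss(y, psi(x, w)) for x, y in data) / len(data)

def grad_k(x, y, w):
    # Gradient of the per-example cost (y - w.x)^2 with respect to w.
    err = y - psi(x, w)
    return [-2.0 * err * xi for xi in x]

def batch_step(data, w, eps):
    # w(t+1) = w(t) - (1/N) * eps * sum_k grad C^k(w)
    total = [0.0] * len(w)
    for x, y in data:
        total = [t + g for t, g in zip(total, grad_k(x, y, w))]
    return [wi - eps * t / len(data) for wi, t in zip(w, total)]

def stochastic_step(x, y, w, eps):
    # w(t+1) = w(t) - eps * grad C^k(w), using a single example k
    return [wi - eps * g for wi, g in zip(w, grad_k(x, y, w))]
```

The batch step averages the per-example gradients before moving, while the stochastic step moves after every example; both use the same per-example gradient.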

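Adaline

[Figure: Adaline unit. Inputs $x_0 = 1, x_1, \dots, x_n$ with weights $w_0, w_1, \dots, w_n$; the unit computes the weighted sum $a = \sum_{i=0}^{n} w_i x_i$.]

Output: $z = \sum_{i=0}^{n} w_i x_i$

Decision function: $f(x) = 1$ if $x > 0$, $f(x) = -1$ if $x < 0$.

Cost function

$C_{Adaline}^k(w) = (y^k - w x^k)^2$

$\nabla_w \big( C_{Adaline}^k(w) \big) = \frac{\partial C_{Adaline}^k(w)}{\partial w} = -2\,[\,y^k - w x^k\,]\, x^k$

Learning

$w(t+1) = w(t) - \varepsilon(t)\, \nabla_w \big( C_{Adaline}^k(w) \big)$

Substituting the gradient, the two minus signs cancel, and the update reads in component form:

$\begin{pmatrix} w_0(t+1) \\ w_1(t+1) \\ \vdots \\ w_n(t+1) \end{pmatrix} = \begin{pmatrix} w_0(t) \\ w_1(t) \\ \vdots \\ w_n(t) \end{pmatrix} + 2\,\varepsilon(t) \left( y^k - \big( w_0(t)\ \ w_1(t)\ \cdots\ w_n(t) \big) \begin{pmatrix} x_0^k \\ x_1^k \\ \vdots \\ x_n^k \end{pmatrix} \right) \begin{pmatrix} x_0^k \\ x_1^k \\ \vdots \\ x_n^k \end{pmatrix}$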
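A single Adaline update can be written as a small function. This is a minimal sketch of gradient descent on the cost $(y^k - w x^k)^2$; the function name and the plain-list representation of vectors are assumptions made for illustration.

```python
def adaline_step(w, x, y, eps):
    """One Adaline update on the example (x, y) with step eps.

    The gradient of (y - w.x)^2 is -2 * (y - w.x) * x, so descending
    it gives w + 2 * eps * (y - w.x) * x.
    """
    z = sum(wi * xi for wi, xi in zip(w, x))  # linear output w.x
    err = y - z                               # error before thresholding
    return [wi + 2.0 * eps * err * xi for wi, xi in zip(w, x)]
```

Note that the error is measured on the linear output $z$, before the decision function $f$ is applied.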

Exercise

[Figure: Adaline with inputs $x_0, x_1, x_2$, initial weights $w_0 = 0.3$, $w_1 = 0.8$, $w_2 = 0.4$, activation $a = \sum_{i=0}^{n} w_i x_i$ and output $y = \psi(x, w)$.]

Training set, each sample written as $\big( (x_0, x_1, x_2)^T,\ y \big)$:

$D = \Big\{ \big( (1, 1, 1)^T, -1 \big);\ \big( (1, 1, -1)^T, 1 \big);\ \big( (1, -1, 1)^T, 1 \big);\ \big( (1, -1, -1)^T, 1 \big) \Big\}$

Decision function: $f(x) = 1$ if $x > 0$, $f(x) = -1$ if $x < 0$.

Plot the set of samples in an orthogonal coordinate system.

Use the Adaline algorithm to adapt the weights of the model ($\varepsilon = 0.1$).

Give the equation of the hyperplane separating the two classes.

Plot the hyperplane in the same coordinate system.

Graphical representation

[Figure: the four samples of $D$ plotted in the $(x_1, x_2)$ plane, next to the Adaline of the exercise with its initial weights 0.3, 0.8 and 0.4.]

Weight adaptation

The Adaline update rule for the presented sample $(x^k, y^k)$ is:

$\begin{pmatrix} w_0(t+1) \\ w_1(t+1) \\ \vdots \\ w_n(t+1) \end{pmatrix} = \begin{pmatrix} w_0(t) \\ w_1(t) \\ \vdots \\ w_n(t) \end{pmatrix} + 2\,\varepsilon(t) \left( y^k - \big( w_0(t)\ \ w_1(t)\ \cdots\ w_n(t) \big) \begin{pmatrix} x_0^k \\ x_1^k \\ \vdots \\ x_n^k \end{pmatrix} \right) \begin{pmatrix} x_0^k \\ x_1^k \\ \vdots \\ x_n^k \end{pmatrix}$

First step, with the sample $x^1 = (1, 1, 1)^T$, $y^1 = -1$, the initial weights $(0.3, 0.8, 0.4)$ and $\varepsilon = 0.1$:

$\begin{pmatrix} w_0(t+1) \\ w_1(t+1) \\ w_2(t+1) \end{pmatrix} = \begin{pmatrix} 0.3 \\ 0.8 \\ 0.4 \end{pmatrix} + 2 \times 0.1 \times \left( -1 - \big( 0.3\ \ 0.8\ \ 0.4 \big) \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} \right) \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}$

The linear output is $0.3 + 0.8 + 0.4 = 1.5$, so the error is $-1 - 1.5 = -2.5$ and the correction is $2 \times 0.1 \times (-2.5) = -0.5$ on each component:

$\begin{pmatrix} w_0(t+1) \\ w_1(t+1) \\ w_2(t+1) \end{pmatrix} = \begin{pmatrix} 0.3 \\ 0.8 \\ 0.4 \end{pmatrix} + \begin{pmatrix} -0.5 \\ -0.5 \\ -0.5 \end{pmatrix} = \begin{pmatrix} -0.2 \\ 0.3 \\ -0.1 \end{pmatrix}$

After this step the linear output on $x^1$ is $-0.2 + 0.3 - 0.1 = 0$, closer to the target $-1$ than the initial value $1.5$.

Evolution of learning

[Figure: the samples in the $(x_1, x_2)$ plane with the separating line drawn at $t = 0$, $t = 5$, $t = 10$, $t = 15$ and $t = 20$. At $t = 0$ the line is $0.8\,x_1 + 0.4\,x_2 + 0.3 = 0$; the final line is $w_1^* x_1 + w_2^* x_2 + w_0^* = 0$.]
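The evolution of the separating line can be reproduced with a short script. It is a sketch under several assumptions not stated on the slides: the four samples are presented cyclically in the order of $D$, one sample per time step $t$, and $f$ returns $-1$ at exactly 0 (the slides leave that case undefined). The update is gradient descent on $(y^k - w x^k)^2$.

```python
def f(a):
    # Decision function; the value at a == 0 is an arbitrary choice here.
    return 1 if a > 0 else -1

# Exercise samples as ((x0, x1, x2), y), with x0 = 1 for the bias.
D = [((1, 1, 1), -1), ((1, 1, -1), 1), ((1, -1, 1), 1), ((1, -1, -1), 1)]

def adaline_train(w, data, eps, steps):
    """Run `steps` Adaline updates, cycling through `data`.

    Returns the list of weight vectors w(0), w(1), ..., w(steps).
    """
    history = [list(w)]
    for t in range(steps):
        x, y = data[t % len(data)]
        z = sum(wi * xi for wi, xi in zip(w, x))
        w = [wi + 2.0 * eps * (y - z) * xi for wi, xi in zip(w, x)]
        history.append(list(w))
    return history

history = adaline_train([0.3, 0.8, 0.4], D, eps=0.1, steps=20)
for t in (0, 5, 10, 15, 20):
    w0, w1, w2 = history[t]
    print(f"t={t:2d}: {w1:+.3f}*x1 {w2:+.3f}*x2 {w0:+.3f} = 0")
```

With a constant step the weights do not settle on a single point but keep moving in a small cycle around the least-squares solution; a decreasing $\varepsilon(t)$ would be one way to damp this.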

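Adaline: limits

$\psi(x, w) = x_1 \oplus x_2 = XOR(x_1, x_2)$

        x0    x1    x2     y
x^1      1     1     1    -1
x^2      1     1    -1     1
x^3      1    -1     1     1
x^4      1    -1    -1    -1

[Figure: the four XOR points in the $(x_1, x_2)$ plane, with a question mark: no single line separates the two classes.]

"Linear Separability"

[Figure: two panels showing classes A and B in the $(x_1, x_2)$ plane, labelled "Linearly separable" and "Not linearly separable".]

"Linearly separable classes can be separated by a hyperplane. In 2 dimensions, the hyperplane is a line."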

Madaline: Multi-Adaptive Linear Element

$\psi(x, w) = z = x_1 \oplus x_2$

Madaline = a set of parallel Adalines.

[Figure: Adaline 1 (weights $w_{01}, w_{11}, w_{21}$) and Adaline 2 (weights $w_{02}, w_{12}, w_{22}$) both receive $x_0, x_1, x_2$ and produce $z_1$ and $z_2$; Adaline 3 (weights $w_{31}, w_{32}$) combines $z_1$ and $z_2$ into the output $z$.]

[Table: for the four inputs $x^1, \dots, x^4$, the columns $x_0$, $x_1$, $x_2$ and one output column $y$ for each of Adaline 1, Adaline 2 and Adaline 3.]

[Figure: the XOR points in the $(x_1, x_2)$ plane with the two separating lines of Adaline 1 and Adaline 2.]
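A Madaline for XOR can be sketched with hand-chosen weights. The weights below are hypothetical (the slides do not give numeric values): Adaline 1 acts like an OR, Adaline 2 like a NAND, and Adaline 3 like an AND of their outputs. The sketch also gives Adaline 3 a bias weight, which is an assumption beyond the two weights $w_{31}, w_{32}$ named on the slide.

```python
def f(a):
    # Decision function; the value at a == 0 is an arbitrary choice here.
    return 1 if a > 0 else -1

def adaline(w, inputs):
    # Thresholded Adaline output for inputs that include the constant 1.
    return f(sum(wi * xi for wi, xi in zip(w, inputs)))

# Hypothetical weights (bias, weight 1, weight 2), chosen by hand.
W1 = (1.0, 1.0, 1.0)     # Adaline 1: -1 only when x1 = x2 = -1 (OR-like)
W2 = (1.0, -1.0, -1.0)   # Adaline 2: -1 only when x1 = x2 = +1 (NAND-like)
W3 = (-1.0, 1.0, 1.0)    # Adaline 3: +1 only when z1 = z2 = +1 (AND-like)

def madaline_xor(x1, x2):
    x = (1, x1, x2)
    z1 = adaline(W1, x)
    z2 = adaline(W2, x)
    return adaline(W3, (1, z1, z2))

for x1, x2 in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
    print(x1, x2, madaline_xor(x1, x2))
```

Each first-layer Adaline draws one line in the $(x_1, x_2)$ plane; the points lying between the two lines are the ones the third Adaline labels $+1$.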

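Adaline with polynomial preprocessing

$\psi(x, w) = x_1 \oplus x_2$

[Figure: the inputs $x_0, x_1, x_2$ pass through identity (Id), square ($x^2$) and product units, whose outputs feed a single Adaline with weights $w_0, w_1, \dots, w_5$.]

Separating boundary:

$w_1 x_1 + w_2 x_1^2 + w_3 x_1 x_2 + w_4 x_2 + w_5 x_2^2 + w_0 = 0$

[Figure: separating ellipse in the $(x_1, x_2)$ plane.]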
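A single Adaline fed with polynomial features can solve XOR. The sketch below uses hypothetical, hand-chosen weights for which the boundary $x_1^2 + x_1 x_2 + x_2^2 = 1.5$ is an ellipse; neither the weights nor the function names come from the slides.

```python
def f(a):
    # Decision function; the value at a == 0 is an arbitrary choice here.
    return 1 if a > 0 else -1

def poly_features(x1, x2):
    # Features in the order of the weights w0..w5:
    # 1, x1, x1^2, x1*x2, x2, x2^2
    return (1.0, x1, x1 * x1, x1 * x2, x2, x2 * x2)

# Hypothetical weights: 1.5 - (x1^2 + x1*x2 + x2^2) > 0 inside the ellipse.
W = (1.5, 0.0, -1.0, -1.0, 0.0, -1.0)

def poly_adaline(x1, x2):
    return f(sum(w * p for w, p in zip(W, poly_features(x1, x2))))

for x1, x2 in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
    print(x1, x2, poly_adaline(x1, x2))
```

The points $(1, -1)$ and $(-1, 1)$ fall inside the ellipse and get $+1$; the points $(1, 1)$ and $(-1, -1)$ fall outside and get $-1$. The model stays linear in its weights, so the Adaline learning rule applies unchanged to the feature vector.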

Perceptron

1957, Frank Rosenblatt (Rosenblatt F., 1957, 1962*)

* Rosenblatt F. (1957): « The perceptron: a perceiving and recognizing automaton », Reports 85-460-1, Cornell Aeronautical Lab., Ithaca, N.Y.

* Rosenblatt F. (1962): « Principles of Neurodynamics: perceptrons and theory of brain mechanisms », Spartan Books, Washington.

The term "perceptron" does not designate a single model; it covers a large family of algorithms.

The perceptron is an adaptive machine used to solve classification (discrimination) problems.

[Figure: an input $X$ mapped to the output $\psi(x, w)$.]

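Perceptron (Rosenblatt F., 1957, 1962)

[Figure: the input $X$ feeds the association functions $\varphi_1, \varphi_2, \dots, \varphi_n$, plus a constant input 1; their outputs are weighted by $w_0, w_1, \dots, w_n$ and summed as $\sum_{i=0}^{n} w_i \varphi_i(x)$ to give $\psi(x, w) = f(w^T \varphi)$.]

The perceptron has three parts:

the retina, which receives information from the outside;

the association cells: each cell has a transition function defined on the retina, $\varphi_i(x) : \mathbb{R}^d \to \mathbb{R}$;

the decision cell: $\psi(x, w) = f\Big( w_0 + \sum_{i=1}^{n} w_i \varphi_i(x) \Big)$.

Decision function: $f(x) = 1$ if $x \ge 0$, $f(x) = -1$ if $x < 0$.

$w^T = (w_0, w_1, \dots, w_n)$

$\varphi = \big( 1, \varphi_1(x), \varphi_2(x), \dots, \varphi_n(x) \big)$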
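The perceptron output $\psi(x, w) = f(w^T \varphi)$ can be sketched as follows. The association functions and the weights in the usage lines are hypothetical examples chosen for illustration.

```python
def f(a):
    # Decision cell: +1 if a >= 0, -1 otherwise.
    return 1 if a >= 0 else -1

def perceptron_output(x, w, phis):
    """psi(x, w) = f(w0 + sum_i w_i * phi_i(x)).

    `phis` is the list of association functions phi_1..phi_n; the
    constant 1 is prepended so that w[0] plays the role of the bias w0.
    """
    phi = [1.0] + [p(x) for p in phis]
    return f(sum(wi * pi for wi, pi in zip(w, phi)))

# Hypothetical association cells: each one reads one retina coordinate.
phis = [lambda x: x[0], lambda x: x[1]]
w = [-0.5, 1.0, 1.0]

print(perceptron_output((1, 1), w, phis))
print(perceptron_output((-1, -1), w, phis))
```

Only the weights $w$ are adapted during learning; the association functions $\varphi_i$ are fixed in advance.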

Perceptron: two-class case

[Figure: two point clouds (crosses and circles) for the classes $C_1$ and $C_2$, with a sample $x^i$; the two regions are labelled $\psi(x, w) = 1$ and $\psi(x, w) = -1$.]

$\psi(x, w) = f\Big( w_0 + \sum_{i=1}^{n} w_i \varphi_i(x) \Big) = 1$ if $x \in C_1$, $-1$ if $x \in C_2$

A perceptron can be seen as a two-class classifier:

$C_1 = \{ x \in \mathbb{R}^d : \psi(x, w) = 1 \}$

$C_2 = \{ x \in \mathbb{R}^d : \psi(x, w) = -1 \}$

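Perceptron: two-class case

A correct weight vector satisfies:

$w^T \varphi^k > 0$ for $x^k \in C_1$

$w^T \varphi^k < 0$ for $x^k \in C_2$

Transformation: with the coding

$y^k = 1$ if $x^k \in C_1$, $y^k = -1$ if $x^k \in C_2$

the two conditions combine into a single one:

$\forall x^k,\quad w^T (\varphi^k y^k) > 0$

For a weight vector $w$, define:

$D(w) = \frac{1}{\|w\|} \min_k \big( w^T \varphi^k y^k \big)$

[Figure: a weight vector $w$ and three candidate separating lines numbered 1, 2 and 3.]

$D_{max} = \max_w D(w)$ reflects the difficulty of the problem:

If $D_{max}$ is large, the problem is easy.

If $D_{max} < 0$, there is no solution.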
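The quantity $D(w)$ is simple to compute for a given weight vector. The sketch below uses a small hypothetical data set (an OR-type table with a bias input) purely as an illustration; the function name is an assumption.

```python
import math

def margin(w, samples):
    """D(w) = (1 / ||w||) * min_k (w . phi^k) * y^k.

    `samples` is a list of (phi, y) pairs, with y in {+1, -1}.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * sum(wi * pi for wi, pi in zip(w, phi))
               for phi, y in samples) / norm

# Hypothetical samples (phi, y): bias input 1 followed by two coordinates.
SAMPLES = [((1, 1, 1), 1), ((1, 1, -1), 1), ((1, -1, 1), 1), ((1, -1, -1), -1)]

print(margin((1.0, 1.0, 1.0), SAMPLES))   # positive: all samples well classified
print(margin((1.0, 0.0, 0.0), SAMPLES))   # negative: at least one sample misclassified
```

A positive $D(w)$ means every pattern satisfies $w^T(\varphi^k y^k) > 0$; the larger the value, the further the closest pattern lies from the separating hyperplane.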

Perceptron: two-class case

Let $(x^k, y^k)$ be the pattern taken into account at iteration $k$, with

$\varphi^k = \big( 1, \varphi_1(x^k), \varphi_2(x^k), \dots, \varphi_n(x^k) \big)$

$y^k = 1$ if $x^k \in C_1$, $y^k = -1$ if $x^k \in C_2$

A correct weight vector satisfies $w^T \varphi^k > 0$ for $x^k \in C_1$ and $w^T \varphi^k < 0$ for $x^k \in C_2$, that is:

$\forall x^k,\quad w^T (\varphi^k y^k) > 0$

The instantaneous error associated with the pattern $(x^k, y^k)$ is defined by:

$C_{Perceptron}^k(w) = -\,w^T (\varphi^k y^k)$

The global error accumulates the instantaneous errors of the misclassified patterns:

$C_{Perceptron}(w) = \frac{1}{N} \sum_{k\,:\,x^k \text{ misclassified}} C_{Perceptron}^k(w) = -\frac{1}{N} \sum_{k\,:\,x^k \text{ misclassified}} w^T (\varphi^k y^k)$

There are several learning algorithms.

Perceptron: two-class case

Stopping criterion = a satisfactory classification rate.

Gradient descent techniques (steepest descent): suppose that at time $t$ the weights of the Perceptron are $w(t)$ and that the pattern $(x^k, y^k)$ is presented. The weights are then modified by:

$w(t+1) = w(t) - \varepsilon(t)\, \nabla_w \big( C_{Perceptron}^k(w) \big)$

where $\varepsilon(t)$ is the gradient step and the instantaneous gradient is:

$\nabla_w \big( C_{Perceptron}^k(w) \big) = \frac{\partial C_{Perceptron}^k(w)}{\partial w} = -\,\varphi^k y^k$

with $\varphi^k = \big( 1, \varphi_1(x^k), \varphi_2(x^k), \dots, \varphi_n(x^k) \big)$.

Perceptron learning algorithm: two-class case

1- At $t = 0$, draw at random $w(0) = \big( w_0(0), w_1(0), \dots, w_n(0) \big)^T$.

2- Present a pattern $(x^k, y^k)$, with $\varphi^k = \big( 1, \varphi_1(x^k), \varphi_2(x^k), \dots, \varphi_n(x^k) \big)$.

3- Compute the output of the perceptron, $f\Big( w_0 + \sum_{i=1}^{n} w_i \varphi_i(x^k) \Big) = f(w^T \varphi^k)$, and compare it to $y^k$.

4- If $x^k$ is correctly classified, $f(w^T \varphi^k) = y^k$:

$w(t+1) = w(t)$

If $x^k$ is misclassified, $f(w^T \varphi^k) \ne y^k$:

$w(t+1) = w(t) + \varepsilon(t)\, \varphi^k y^k$

where $\varepsilon(t)$ is the gradient step.

5- Repeat steps 2 to 4 until the error reaches an acceptable value.

In component form, the update for a misclassified pattern is:

$\begin{pmatrix} w_0(t+1) \\ w_1(t+1) \\ \vdots \\ w_n(t+1) \end{pmatrix} = \begin{pmatrix} w_0(t) \\ w_1(t) \\ \vdots \\ w_n(t) \end{pmatrix} + \varepsilon(t)\, y^k \begin{pmatrix} 1 \\ \varphi_1(x^k) \\ \vdots \\ \varphi_n(x^k) \end{pmatrix}$

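Example of a Perceptron

$D = \{ (x^1, y^1), (x^2, y^2), \dots, (x^N, y^N) \}$, with target $\psi(x, w) = x_1 \vee x_2$:

        x0    x1    x2     y
x^1      1     1     1     1
x^2      1     1    -1     1
x^3      1    -1     1     1
x^4      1    -1    -1    -1

[Figure: perceptron with a constant input 1 and the association functions $\varphi_1(x^k) = x_1^k$ and $\varphi_2(x^k) = x_2^k$, weights $w_0, w_1, w_2$, computing $\sum_{i=0}^{n} w_i \varphi_i(x)$; the samples are plotted in the $(x_1, x_2)$ plane.]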
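The two-class learning algorithm can be run on a small table of this kind. The sketch below assumes an OR-type target with the bias input $x_0 = 1$, identity association functions, zero initial weights instead of a random draw (for reproducibility) and a constant step $\varepsilon = 1$; all of these are illustrative choices, and the function names are hypothetical.

```python
def f(a):
    # Decision cell: +1 if a >= 0, -1 otherwise.
    return 1 if a >= 0 else -1

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def train_perceptron(samples, w, eps, max_epochs=100):
    """Perceptron rule: on a misclassified (phi, y), w <- w + eps * y * phi.

    Returns the final weights and the number of updates performed.
    """
    updates = 0
    for _ in range(max_epochs):
        errors = 0
        for phi, y in samples:
            if f(dot(w, phi)) != y:
                w = [wi + eps * y * pi for wi, pi in zip(w, phi)]
                errors += 1
                updates += 1
        if errors == 0:   # stopping criterion: every sample well classified
            break
    return w, updates

# Samples as (phi, y) with phi = (1, x1, x2); OR-type target (assumed).
OR_DATA = [((1, 1, 1), 1), ((1, 1, -1), 1), ((1, -1, 1), 1), ((1, -1, -1), -1)]

w, updates = train_perceptron(OR_DATA, [0.0, 0.0, 0.0], eps=1.0)
print(w, updates)
```

Unlike the Adaline, the weights change only when the thresholded output is wrong, so training stops moving as soon as all samples are on the correct side.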

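Perceptron: case of $p$ classes $C_1, C_2, \dots, C_p$

[Figure: the input $X$ feeds the association functions $\varphi_1, \varphi_2, \dots, \varphi_n$, plus a constant input 1. Each class $i = 1, 2, \dots, p$ has its own weights $w_{i0}, w_{i1}, \dots, w_{in}$ and computes $\sum_{j=0}^{n} w_{ij} \varphi_j(x)$; a Max unit selects the largest sum and gives $\psi(x, w)$.]

$\psi(x, w) = C_i$ if $\forall j \ne i,\ w_i^T \varphi > w_j^T \varphi$, where $\varphi = \big( 1, \varphi_1(x), \varphi_2(x), \dots, \varphi_n(x) \big)$

or

$\psi(x, w) = C_i$ if $\forall j \ne i,\ \sum_{k=0}^{n} w_{ik} \varphi_k > \sum_{k=0}^{n} w_{jk} \varphi_k$, with $\varphi_0 = 1$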

Perceptron learning algorithm: case of $p$ classes

1- At $t = 0$, draw at random the weight matrix $w(0)$.

2- Present a pattern $x^k \in C_i$.

3- Compute the output of the perceptron, $\psi(x^k, w) = C_j$ where $j = \arg\max_l \big( w_l^T \varphi^k \big)$, and compare it to $C_i$.

4- If $x^k$ is correctly classified, $C_j = C_i$:

$w(t+1) = w(t)$

If $x^k$ is misclassified, $C_j \ne C_i$:

$w_i(t+1) = w_i(t) + \varepsilon(t)\, \varphi^k$

$w_j(t+1) = w_j(t) - \varepsilon(t)\, \varphi^k$

$w_l(t+1) = w_l(t) \quad \forall\, l \ne i$ and $l \ne j$

where $\varepsilon(t)$ is the gradient step.

5- Repeat steps 2 to 4 until the error reaches an acceptable value.
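The $p$-class rule can be sketched as follows. The three-sample data set, the zero initialization (instead of a random draw, for reproducibility), the constant step and the tie-breaking rule of the arg max (first index wins) are all assumptions made for this illustration.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def predict(W, phi):
    # j = arg max_l (w_l . phi); ties go to the smallest index.
    scores = [dot(w_l, phi) for w_l in W]
    return scores.index(max(scores))

def train_multiclass(samples, p, eps=1.0, max_epochs=100):
    """p-class perceptron: on an error, reinforce the true class i
    and penalize the wrongly predicted class j; other rows are untouched."""
    n = len(samples[0][0])
    W = [[0.0] * n for _ in range(p)]
    for _ in range(max_epochs):
        errors = 0
        for phi, i in samples:
            j = predict(W, phi)
            if j != i:
                W[i] = [w + eps * v for w, v in zip(W[i], phi)]
                W[j] = [w - eps * v for w, v in zip(W[j], phi)]
                errors += 1
        if errors == 0:
            break
    return W

# Hypothetical samples (phi, class index), phi = (1, x1, x2).
SAMPLES = [((1, 2, 0), 0), ((1, -1, 2), 1), ((1, -1, -2), 2)]

W = train_multiclass(SAMPLES, p=3)
print(W)
print([predict(W, phi) for phi, _ in SAMPLES])
```

Each row of `W` plays the role of one weight vector $w_l$, so the weight matrix has $p$ rows and $n + 1$ columns.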

Perceptron and Adaline: two-class case

[Figure: Class A and Class B in the plane, with the separating line found by each model.]

Solution found by the Adaline: the best, robust separation between the classes.

Solution found by the Perceptron: a separation that minimizes the number of errors.

[Figure: block diagrams of the Perceptron and of the Adaline. In each diagram the teacher's target is compared with the computed output to form the error.]

Perceptron convergence theorem

"If a set of patterns is linearly separable, then the Perceptron learning algorithm converges to a correct solution in a finite number of iterations."

Arbib M.A. (1987): « Brains, Machines, and Mathematics », Berlin, Springer-Verlag.

Rosenblatt F. (1962): « Principles of Neurodynamics », N.Y., Spartan.

Block H.D. (1962): « The Perceptron: A Model for Brain Functioning », Reviews of Modern Physics 34, 123-135.

Minsky M.L. & Papert S.A. (1969): « Perceptrons », Cambridge, MIT Press.

Diederich S. & Opper M. (1987): « Learning of Correlated Patterns in Spin-Glass Networks by Local Learning Rules », Physical Review Letters 58, 949-952.

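Perceptron convergence theorem: proof

Hypothesis: $w^*$ is a good solution.

The adaptation rule is:

$w(t+1) = w(t) + \varepsilon(t)\, \varphi^k y^k$

Let $m^k$ be the number of adaptations made after the presentations of $x^k$. After these iterations (starting from $w(0) = 0$, with a constant step $\varepsilon$), $w$ becomes:

$w = \varepsilon \sum_k m^k \varphi^k y^k$

and

$w \cdot w^* = \varepsilon \sum_k m^k\, \varphi^k y^k \cdot w^*$

Let $m = \sum_k m^k$ be the total number of adaptations. Then:

$w \cdot w^* \ge \varepsilon\, m\, \min_k \big( \varphi^k y^k \cdot w^* \big) = \varepsilon\, m\, D(w^*)\, \|w^*\|$

where $D(w) = \frac{1}{\|w\|} \min_k \big( w^T \varphi^k y^k \big)$.

Let

$\rho = \frac{(w \cdot w^*)^2}{\|w\|^2\, \|w^*\|^2} \le 1$

At each adaptation:

$\|w(t+1)\|^2 = \|w(t)\|^2 + \varepsilon^2 \|\varphi^k\|^2 (y^k)^2 + 2\,\varepsilon\, w(t) \cdot \varphi^k y^k \le \|w(t)\|^2 + \varepsilon^2 \|\varphi^k\|^2 (y^k)^2$

because $w(t) \cdot \varphi^k y^k \le 0$ for a misclassified pattern. With $\|\varphi^k\|^2 = n$ and $(y^k)^2 = 1$, the increase of $\|w\|^2$ at each adaptation satisfies:

$\Delta \|w\|^2 \le \varepsilon^2 \|\varphi^k\|^2 (y^k)^2 \le \varepsilon^2 n$

After $m$ iterations we therefore have:

$\|w\|^2 \le m\, \varepsilon^2 n$

Now $w \cdot w^* \ge \varepsilon\, m\, D(w^*)\, \|w^*\|$, so:

$1 \ge \rho \ge \frac{\big( \varepsilon\, m\, D(w^*)\, \|w^*\| \big)^2}{m\, \varepsilon^2 n\, \|w^*\|^2} = m\, \frac{D(w^*)^2}{n}$

Consequently, choosing $w^*$ such that $D(w^*) = D_{max} = \max_w D(w)$:

$m \le \frac{n}{D_{max}^2}$

QED.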
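The bound on the number of adaptations can be checked numerically on a small separable problem. The sketch below is an illustration under assumptions: the data set is a hypothetical OR-type table, training starts from $w(0) = 0$ with a constant step, the role of $n$ is played by $\max_k \|\varphi^k\|^2$, and $w^* = (1, 1, 1)$ is simply one known solution, not necessarily the one that maximizes $D(w)$ (the inequality $m \le n / D(w^*)^2$ holds for any solution).

```python
import math

def f(a):
    return 1 if a >= 0 else -1

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def count_updates(samples, eps=1.0, max_epochs=100):
    # Run the two-class perceptron rule from w = 0 and count adaptations m.
    w = [0.0] * len(samples[0][0])
    m = 0
    for _ in range(max_epochs):
        errors = 0
        for phi, y in samples:
            if f(dot(w, phi)) != y:
                w = [wi + eps * y * pi for wi, pi in zip(w, phi)]
                errors += 1
                m += 1
        if errors == 0:
            break
    return m

def margin(w, samples):
    # D(w) = (1 / ||w||) * min_k (w . phi^k) * y^k
    return min(y * dot(w, phi) for phi, y in samples) / math.sqrt(dot(w, w))

OR_DATA = [((1, 1, 1), 1), ((1, 1, -1), 1), ((1, -1, 1), 1), ((1, -1, -1), -1)]

m = count_updates(OR_DATA)
w_star = (1.0, 1.0, 1.0)                           # one known solution
n = max(dot(phi, phi) for phi, _ in OR_DATA)       # ||phi^k||^2
bound = n / margin(w_star, OR_DATA) ** 2
print(m, bound)
```

The observed count stays below the bound; a solution with a larger $D(w^*)$ would give a tighter bound.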

Capacity of a linear separator*

$C(N, d)$ = the maximum number of linear separations into 2 classes that are possible in a set of $N$ points in dimension $d$.

For a set of $N$ points there are $2^N$ possible dichotomies.

- For $N \le d + 1$, all $2^N$ dichotomies are linearly separable.
- For $N > d + 1$, only $C(N, d)$ of them are linearly separable.

• If $N \le d + 1$, then $C(N, d) = 2^N$.

• If $N > d + 1$, $C(N, d)$ can be computed by recurrence:

$C(N, 1) = 2N$

$C(1, d) = 2$

$C(N, d) = \sum_{k=0}^{N-1} \binom{N-1}{k}\, C(1, d-k)$

where $\binom{p}{q} = \frac{p!}{(p-q)!\; q!}$ and $C(1, j)$ is taken as $0$ for $j < 0$, so that $C(N, d) = 2 \sum_{k=0}^{d} \binom{N-1}{k}$.

* Cover T. (1965): « Geometrical and statistical properties of linear inequalities with applications in pattern recognition », IEEE Transactions on Electronic Computers, 14:326-334.

Capacity of a linear separator*

$F(N, d)$ = the probability that a random dichotomy is linearly separable in a set of $N$ points in dimension $d$.

$F(N, d) = 1$ if $N \le d + 1$

$F(N, d) = \frac{1}{2^{N-1}} \sum_{k=0}^{d} \binom{N-1}{k}$ if $N > d + 1$

[Figure: $F(N, d)$ on the vertical axis (0.0 to 1.0) against $N/(d+1)$ on the horizontal axis (0 to 4), for $d = 1$, $d = 20$ and $d = \infty$.]

$F(N, d) = 0.5$ for $N = 2(d + 1)$.

* Cover T. (1965): « Geometrical and statistical properties of linear inequalities with applications in pattern recognition », IEEE Transactions on Electronic Computers, 14:326-334.
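Both capacity quantities are easy to evaluate. The sketch below uses the closed form $2 \sum_{k=0}^{d} \binom{N-1}{k}$ for the number of separable dichotomies, which also returns $2^N$ when $N \le d + 1$ because the binomial coefficients vanish for $k > N - 1$. The function names are hypothetical.

```python
from math import comb

def count_separable(N, d):
    # Number of linearly separable dichotomies of N points in dimension d.
    # math.comb(n, k) is 0 when k > n, so N <= d + 1 gives 2**N.
    return 2 * sum(comb(N - 1, k) for k in range(d + 1))

def prob_separable(N, d):
    # F(N, d): fraction of the 2**N dichotomies that are linearly separable.
    return count_separable(N, d) / 2 ** N

print(count_separable(4, 2))         # 4 points in the plane
print(prob_separable(2 * (20 + 1), 20))
```

For 4 points in the plane, 14 of the 16 dichotomies are separable; the two missing ones are the XOR labelling and its complement.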

Limits of the Perceptron

1969, Marvin Minsky and Seymour Papert**: a violent halt to research in the field.

Of the $2^{2^d}$ Boolean functions of $d$ inputs, fewer than $\frac{2^{d^2}}{d!}$ can be realized by a perceptron (are linearly separable)*.

This is an important limitation, which at the time justified abandoning research on perceptrons.

* Lewis P.M. & Coates C.L. (1967): « Threshold Logic », N.Y., John Wiley.

** Minsky M.L. & Papert S.A. (1969): « Perceptrons », Cambridge, MA: MIT Press.

Multi-layer architecture

The "credit assignment problem"

We are given a layered network and a set of examples made of input-output pairs.

[Figure: a two-layer network mapping the inputs $x_0, x_1, \dots, x_n$ to the output $y$, with weights $W_1$ for the hidden layer and $W_2$ for the output layer.]

The Perceptron learning algorithm can be applied to determine $W_2$.

The desired outputs of the hidden units are not known, so the Perceptron learning algorithm cannot be applied to determine $W_1$.

The gradient back-propagation algorithm provides a disconcertingly simple solution to this problem.

After the connectionist pause

1963, Bryson A., Denham W., Dreyfus S.: control problems, "gradient back-propagation".

The 1980s: the exact scope of Minsky & Papert's book is correctly understood.

The "rediscovery" of "gradient back-propagation":

1986, LeCun Y.

1986, Rumelhart D., Hinton G.E., Williams R.

Bryson A., Denham W., Dreyfus S. (1963): « Optimal Programming Problem With Inequality Constraints. I: Necessary Conditions for Extremal Solutions », AIAA Journal, Vol. 1, pp. 25-44.

LeCun Y. (1986): « Learning Processes in Asymmetric Threshold Network », Disordered Systems and Biological Organizations, Les Houches, France, Springer, pp. 223-240.

Rumelhart D., Hinton G.E., Williams R. (1986): « Learning Internal Representations by Error Propagation », in Parallel Distributed Processing: exploring the microstructure of cognition, Vol. I, Bradford Books, Cambridge, MA, MIT Press, pp. 318-362.