Neural Networks

IFT 780

Recurrent Networks

By Pierre-Marc Jodoin

Basic neural network (regression)

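[Figure: network with inputs $x_1, x_2, x_3$, a hidden layer (weights $W_0$, activation $f_a$) and an output $y$ (weights $W_1$).]

$$y(\vec{x}) = W_1 f_a(W_0 \vec{x})$$

Equivalently:

$$\vec{h} = f_a(W_0 \vec{x}), \qquad y = W_1 \vec{h}$$

$f_a$: activation function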

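Basic neural network (classification)

[Figure: the same network, with a softmax (SMAX) applied to the output.]

$$\vec{y}(\vec{x}) = \mathrm{SMAX}\big(W_1 f_a(W_0 \vec{x})\big)$$

Equivalently:

$$\vec{h} = f_a(W_0 \vec{x}), \qquad \vec{y} = \mathrm{SMAX}(W_1 \vec{h})$$

The softmax exponentiates each score and then normalizes: $\mathrm{SMAX}(\vec{o})_k = e^{o_k} / \sum_j e^{o_j}$.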
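The two basic networks above differ only in their last step. A minimal sketch, assuming $f_a = \tanh$ and hypothetical weight values (neither is fixed by the slides):

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(o):
    """Exponentiate each score, then normalize so the outputs sum to 1."""
    m = max(o)                       # subtracting the max avoids overflow
    e = [math.exp(v - m) for v in o]
    total = sum(e)
    return [v / total for v in e]

def forward(x, W0, W1, classify=False):
    h = [math.tanh(a) for a in matvec(W0, x)]   # h = f_a(W0 x), f_a = tanh (assumed)
    o = matvec(W1, h)                           # W1 h
    return softmax(o) if classify else o

# Hypothetical weights: 3 inputs, 2 hidden units, 2 outputs.
W0 = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.4]]
W1 = [[1.0, -1.0], [0.5, 0.5]]
x = [1.0, 2.0, 3.0]

y_reg = forward(x, W0, W1)                  # regression: raw outputs
y_cls = forward(x, W0, W1, classify=True)   # classification: probabilities
print(y_reg, y_cls)
```

The regression output is unbounded, while the classification output is a probability vector.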

Simplified illustration

[Figure: basic NN with 1 input and 1 output, $\vec{x} \rightarrow \vec{h} \rightarrow y$.]

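Basic neural network (2 classes)

(a bias could also be added)

[Figure: inputs $x_1, x_2, x_3$, a hidden layer of 5 neurons (weights $W_0$, activation $f_a$) and a single output $y$ (weights $\vec{w}_1$).]

$$y(\vec{x}) = \vec{w}_1^T f_a(W_0 \vec{x})$$

$$\vec{h} = f_a(W_0 \vec{x}), \qquad y = \vec{w}_1^T \vec{h}$$

with $\vec{x} \in R^3$, $W_0 \in R^{5 \times 3}$, $\vec{h} \in R^5$, $\vec{w}_1 \in R^5$.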

Recurrent network: the output of the neurons is fed back into their input

[Figure: the same network, where the hidden outputs $h_1, \dots, h_5$ are copied back next to the inputs $x_1, x_2, x_3$.]

Here, instead of having 3 inputs, each neuron has 3+5=8 inputs.

$$y(\vec{x}) = \vec{w}_1^T f_a\big(W_0 [\vec{x}, \vec{h}]\big)$$

$$\vec{h} = f_a\big(W_0 [\vec{x}, \vec{h}]\big), \qquad y = \vec{w}_1^T \vec{h}$$

with $\vec{x} \in R^3$, $[\vec{x}, \vec{h}] \in R^8$, $W_0 \in R^{5 \times 8}$.

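To simplify the notation to come

$$y(\vec{x}) = \vec{w}_{hy}^T f_a\big(W_{[x,h]h} [\vec{x}, \vec{h}]\big)$$

$$\vec{h} = f_a\big(W_{[x,h]h} [\vec{x}, \vec{h}]\big), \qquad y = \vec{w}_{hy}^T \vec{h}$$

with $\vec{x} \in R^3$, $[\vec{x}, \vec{h}] \in R^8$, $W_{[x,h]h} \in R^{5 \times 8}$, $\vec{h} \in R^5$, $\vec{w}_{hy} \in R^5$. The subscripts name the layers that a weight connects.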

Equivalently

$$y(\vec{x}) = \vec{w}_{hy}^T f_a(W_{xh} \vec{x} + W_{hh} \vec{h})$$

$$\vec{h} = f_a(W_{xh} \vec{x} + W_{hh} \vec{h}), \qquad y = \vec{w}_{hy}^T \vec{h}$$

with $\vec{x} \in R^3$, $W_{xh} \in R^{5 \times 3}$, $W_{hh} \in R^{5 \times 5}$, $\vec{h} \in R^5$, $\vec{w}_{hy} \in R^5$.

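More compact illustration

[Figure: $\vec{x} \rightarrow \vec{h} \rightarrow y$, with $W_{xh}$ on the input arrow, $\vec{w}_{hy}$ on the output arrow, and a loop $W_{hh}$ from $\vec{h}$ back to itself.]

$$\vec{h} = f_a(W_{xh} \vec{x} + W_{hh} \vec{h}), \qquad y = \vec{w}_{hy}^T \vec{h}$$

with $\vec{x} \in R^3$, $W_{xh} \in R^{5 \times 3}$, $W_{hh} \in R^{5 \times 5}$, $\vec{h} \in R^5$, $\vec{w}_{hy} \in R^5$.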
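Applying one 5 x 8 matrix to the 8 concatenated inputs is the same computation as applying a 5 x 3 and a 5 x 5 matrix separately and summing. A small numerical check of that equivalence, assuming $f_a = \tanh$ and random weights (both are illustrative choices, not taken from the slides):

```python
import math
import random

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

random.seed(0)
N_X, N_H = 3, 5

W_xh = [[random.uniform(-1, 1) for _ in range(N_X)] for _ in range(N_H)]  # 5 x 3
W_hh = [[random.uniform(-1, 1) for _ in range(N_H)] for _ in range(N_H)]  # 5 x 5
W_cat = [row_x + row_h for row_x, row_h in zip(W_xh, W_hh)]               # 5 x 8

x = [0.5, -1.0, 2.0]
h_prev = [random.uniform(-1, 1) for _ in range(N_H)]

# Form 1: one 5 x 8 matrix applied to the 8 concatenated inputs [x, h].
h_concat = [math.tanh(a) for a in matvec(W_cat, x + h_prev)]

# Form 2: two matrices applied separately, then summed.
h_split = [math.tanh(a + b)
           for a, b in zip(matvec(W_xh, x), matvec(W_hh, h_prev))]

max_gap = max(abs(a - b) for a, b in zip(h_concat, h_split))
print(max_gap)   # zero up to floating-point rounding
```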

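In the general case with K outputs (regression)

[Figure: $\vec{x} \rightarrow \vec{h} \rightarrow \vec{y}$, with weights $W_{xh}$, $W_{hh}$ (loop) and $W_{hy}$.]

$$\vec{y}(\vec{x}) = W_{hy} f_a(W_{xh} \vec{x} + W_{hh} \vec{h})$$

$$\vec{h} = f_a(W_{xh} \vec{x} + W_{hh} \vec{h}), \qquad \vec{y} = W_{hy} \vec{h}$$

with $\vec{x} \in R^3$, $W_{xh} \in R^{5 \times 3}$, $W_{hh} \in R^{5 \times 5}$, $\vec{h} \in R^5$, $W_{hy} \in R^{K \times 5}$.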

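In the general case with K outputs (classification)

[Figure: $\vec{x} \rightarrow \vec{h} \rightarrow \vec{y}$, with weights $W_{xh}$, $W_{hh}$ (loop) and $W_{hy}$.]

$$\vec{y}(\vec{x}) = \mathrm{SMAX}\big(W_{hy} f_a(W_{xh} \vec{x} + W_{hh} \vec{h})\big)$$

$$\vec{h} = f_a(W_{xh} \vec{x} + W_{hh} \vec{h}), \qquad \vec{y} = \mathrm{SMAX}(W_{hy} \vec{h})$$

with $\vec{x} \in R^3$, $W_{xh} \in R^{5 \times 3}$, $W_{hh} \in R^{5 \times 5}$, $\vec{h} \in R^5$, $W_{hy} \in R^{K \times 5}$.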

Simplified illustration

[Figure: $\vec{x} \rightarrow \vec{y}$.]

1 input and 1 output: basic NN

Different configurations for different applications

1 input and N outputs. Ex.: image description, 1 image => N words

N inputs and 1 output. Ex.: text classification, N words => 1 class

N inputs and N outputs. Ex.: classification of video frames, N frames => N classes

M inputs and N outputs. Ex.: French-English translation, M words => N words

Example for N inputs and 1 output:

Grammatical analysis (classification): (ê, t, r, e) => 'verb'

[Figure: the characters $\vec{x}_1$ = 'ê', $\vec{x}_2$ = 't', $\vec{x}_3$ = 'r', $\vec{x}_4$ = 'e' are fed one at a time; the hidden state goes from $\vec{h}_0$ through $\vec{h}_1$, $\vec{h}_2$, $\vec{h}_3$ to $\vec{h}_4$, from which $\vec{y}$ = 'verb' is predicted.]

Since $\vec{x}$ and $\vec{y}$ must be numerical variables, a "one-hot" encoding is generally used.

'a' = [1, 0, 0, ..., 0]
'b' = [0, 1, 0, ..., 0]
'c' = [0, 0, 1, ..., 0]
... (vectors in $R^{256}$)

'verb' = (1, 0, 0, ..., 0)
'noun' = (0, 1, 0, ..., 0)
'adjective' = (0, 0, 1, ..., 0)
... (vectors in $R^M$)

At each step $i$, the hidden state is computed from the current input and the previous hidden state:

$$\vec{h}_i = f_a(W_{xh} \vec{x}_i + W_{hh} \vec{h}_{i-1})$$

For example, $\vec{h}_4 = f_a(W_{xh} \vec{x}_4 + W_{hh} \vec{h}_3)$, and the class is predicted from this last hidden state: $\vec{y} = \mathrm{SMAX}(W_{hy} \vec{h}_4)$.
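The N-inputs, 1-output configuration can be sketched as a loop that consumes one one-hot character per step and only reads the output after the last one. This is an untrained illustration: the weights are random, and the class list, hidden size, $f_a = \tanh$ and $\vec{h}_0 = 0$ are assumptions made for the example.

```python
import math
import random

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(o):
    m = max(o)
    e = [math.exp(v - m) for v in o]
    total = sum(e)
    return [v / total for v in e]

def one_hot(i, size):
    v = [0.0] * size
    v[i] = 1.0
    return v

def rand_matrix(rows, cols, scale=0.1):
    return [[random.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]

random.seed(0)
CLASSES = ["verb", "noun", "adjective"]   # M = 3 here (assumed)
N_IN, N_H = 256, 5                        # R^256 inputs, 5 hidden units (assumed)
W_xh = rand_matrix(N_H, N_IN)
W_hh = rand_matrix(N_H, N_H)
W_hy = rand_matrix(len(CLASSES), N_H)

def classify(word):
    h = [0.0] * N_H                                   # h_0 (assumed zero)
    for ch in word:
        x = one_hot(ord(ch), N_IN)                    # x_i
        a = [p + q for p, q in zip(matvec(W_xh, x), matvec(W_hh, h))]
        h = [math.tanh(v) for v in a]                 # h_i = f_a(W_xh x_i + W_hh h_{i-1})
    return softmax(matvec(W_hy, h))                   # y from the last hidden state only

probs = classify("\u00eatre")   # "être": 4 characters, so y is computed from h_4
print(dict(zip(CLASSES, probs)))
```

With random weights the probabilities are close to uniform; training would be needed before 'verb' stands out.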

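Same idea for N inputs and N outputs:

Ex.: video analysis (regression): video frames => number of pedestrians

[Figure: the frames $\vec{x}_0, \dots, \vec{x}_3$ feed the hidden states $\vec{h}_1, \dots, \vec{h}_4$ (starting from $\vec{h}_0$); the outputs are the pedestrian counts $y_0 = 0$, $y_1 = 1$, $y_2 = 3$, $y_3 = 2$.]

$$\vec{h}_i = f_a(W_{xh} \vec{x}_i + W_{hh} \vec{h}_{i-1}), \qquad y_i = W_{hy} \vec{h}_i$$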

[Figure: the same network; each output has its own loss $L_1, \dots, L_4$, and these losses are summed ($\Sigma$) into the total Loss.]

NOTE: the green layer can be replaced by a convolutional network, which converts an image into a d-dimensional vector.

Another example: character prediction (language model)

Toy alphabet: [a, e, m, s]

Toy "one-hot" representation:

'a' = [1, 0, 0, 0]

'e' = [0, 1, 0, 0]

'm' = [0, 0, 1, 0]

's' = [0, 0, 0, 1]

Goal: train a model to predict the letters of the word "masse".

[Figure: the inputs $\vec{x}_1$ = 'm' = [0 0 1 0], $\vec{x}_2$ = 'a' = [1 0 0 0], $\vec{x}_3$ = 's' = [0 0 0 1], $\vec{x}_4$ = 's' = [0 0 0 1] feed, through $W_{xh}$ and $W_{hh}$, a hidden layer of 3 neurons.]

At each step, the hidden layer and the output are computed as

$$\vec{h}_i = \tanh(W_{xh} \vec{x}_i + W_{hh} \vec{h}_{i-1}), \qquad \vec{y}_i = \mathrm{SMAX}(W_{hy} \vec{h}_i)$$

[Figure: the network is unrolled one character at a time, $\vec{x}_1$ = 'm', $\vec{x}_2$ = 'a', $\vec{x}_3$ = 's', $\vec{x}_4$ = 's'. The hidden layers shown are [-0.3 -0.1 0.9], [1.0 0.3 0.1], [-0.1 -0.5 0.3] and [-0.9 0.2 0.9]. The first softmax output is [.9 .1 0 0]; the other outputs shown are [.1 .5 .4 0], [0 0 .09 .91] and [.3 .2 .4 .1].]

Targets (alphabet: [a, e, m, s])

$\vec{t}_1$ = 'a' = [1 0 0 0]

$\vec{t}_2$ = 's' = [0 0 0 1]

$\vec{t}_3$ = 's' = [0 0 0 1]

$\vec{t}_4$ = 'e' = [0 1 0 0]

[Figure: each output is compared with its target through a loss, $L(\vec{y}_1, \vec{t}_1)$, $L(\vec{y}_2, \vec{t}_2)$, $L(\vec{y}_3, \vec{t}_3)$, ...; the per-step losses are summed ($\Sigma$) into the total Loss.]

$$\text{Loss} = \sum_i L(\vec{y}_i, \vec{t}_i)$$
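The summed loss for the word "masse" can be sketched as follows: the inputs are m, a, s, s and the targets are the next letters a, s, s, e. The cross-entropy loss, the random weights and the zero initial hidden state are assumptions made for this sketch.

```python
import math
import random

ALPHABET = ["a", "e", "m", "s"]

def one_hot(ch):
    return [1.0 if c == ch else 0.0 for c in ALPHABET]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(o):
    m = max(o)
    e = [math.exp(v - m) for v in o]
    total = sum(e)
    return [v / total for v in e]

def rand_matrix(rows, cols, scale=0.1):
    return [[random.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]

random.seed(0)
N_H = 3                          # 3 hidden units, as in the figure
W_xh = rand_matrix(N_H, 4)
W_hh = rand_matrix(N_H, N_H)
W_hy = rand_matrix(4, N_H)

def sequence_loss(inputs, targets):
    h = [0.0] * N_H              # initial hidden state (assumed zero)
    total = 0.0
    for ch, target in zip(inputs, targets):
        a = [p + q for p, q in zip(matvec(W_xh, one_hot(ch)), matvec(W_hh, h))]
        h = [math.tanh(v) for v in a]
        y = softmax(matvec(W_hy, h))
        total += -math.log(y[ALPHABET.index(target)])   # L(y_i, t_i), cross-entropy
    return total

loss = sequence_loss("mass", "asse")
print(loss)
```

Before training, each step is close to a uniform guess over 4 letters, so the total is near $4 \ln 4$; training lowers it.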

At test time: predict the letters one after the other (alphabet: [a, e, m, s])

Step 1: Compute the hidden layer. $\vec{x}_1$ = 'm' = [0 0 1 0] gives, through $W_{xh}$, the hidden layer [-0.3 -0.1 0.9].

Step 2: Compute the output (softmax). $W_{hy}$ followed by the softmax gives [.9 .1 0 0].

Step 3: Select the most probable character: $\vec{y}_1$ = 'a'.

Step 4: Feed the predicted character back in as the next input: 'a' = [1 0 0 0].

And repeat!

Input             Hidden layer         Softmax output   Prediction
'm' = [0 0 1 0]   [-0.3 -0.1  0.9]     [.9 .1 0 0]      'a'
'a' = [1 0 0 0]   [ 0.1 -0.3 -0.4]     [.1 .1 .2 .6]    's'
's' = [0 0 0 1]   [ 0.9 -0.8 -0.7]     [.1 .1 .0 .8]    's'
's' = [0 0 0 1]   [-0.1 -0.7  0.5]     [.0 .9 .1 .0]    'e'
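The four test-time steps amount to a greedy generation loop. The sketch below uses random, untrained weights (an assumption for illustration), so it will not spell "masse"; it only shows how each prediction becomes the next input.

```python
import math
import random

ALPHABET = ["a", "e", "m", "s"]

def one_hot(ch):
    return [1.0 if c == ch else 0.0 for c in ALPHABET]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(o):
    m = max(o)
    e = [math.exp(v - m) for v in o]
    total = sum(e)
    return [v / total for v in e]

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
N_H = 3
W_xh, W_hh, W_hy = rand_matrix(N_H, 4), rand_matrix(N_H, N_H), rand_matrix(4, N_H)

def generate(first_char, n_steps):
    h = [0.0] * N_H                      # initial hidden state (assumed zero)
    ch = first_char
    out = [ch]
    for _ in range(n_steps):
        # Step 1: hidden layer from the current character and previous state.
        a = [p + q for p, q in zip(matvec(W_xh, one_hot(ch)), matvec(W_hh, h))]
        h = [math.tanh(v) for v in a]
        # Step 2: softmax output.
        y = softmax(matvec(W_hy, h))
        # Step 3: most probable character.
        ch = ALPHABET[y.index(max(y))]
        # Step 4: the prediction is the next input (next loop iteration).
        out.append(ch)
    return "".join(out)

text = generate("m", 4)
print(text)
```

Picking the most probable character is what the steps describe; sampling from the softmax distribution instead is a common variant.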

Python code: “mini-char-RNN” by A. Karpathy https://gist.github.com/karpathy/d4dee566867f8291f086

Credit: A. Karpathy, CS231

Text generated once the model is trained

Training on the Linux source code (C)

Text generated once the model is trained

Limits of "one-hot vectors"

Language models often use the "one-hot" encoding.

For characters:

'a' = [1, 0, 0, ..., 0]
'b' = [0, 1, 0, ..., 0]
'c' = [0, 0, 1, ..., 0]
... (vectors in $R^{256}$)

For words:

'grand' = [..., 1, 0, 0, ..., 0]
'grandement' = [..., 0, 1, 0, ..., 0]
'grandeur' = [..., 0, 0, 1, ..., 0]
... (vectors in $R^{10,000}$)

Limits of "one-hot vectors"

Although simple, this encoding has several drawbacks:

1- Memory-inefficient when not compressed

ex.: 10,000 bits to encode the word "I" in a language with 10,000 words!

2- No semantic distance between the codes:

Ex.

distance[one-hot('good'), one-hot('well')] = distance[one-hot('good'), one-hot('sidewalk')]

Whereas we would like a code such that

distance[code('good'), code('well')] << distance[code('good'), code('sidewalk')]

distance[code('Jean'), code('Chantal')] << distance[code('good'), code('sidewalk')]

distance[code('India'), code('Lebanon')] << distance[code('good'), code('sidewalk')]
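The second drawback is easy to verify: any two distinct one-hot vectors are at the same Euclidean distance. A minimal sketch with a hypothetical 3-word vocabulary (the choice of Euclidean distance is also an assumption, since the text does not name a metric):

```python
import math

VOCAB = ["good", "well", "sidewalk"]   # hypothetical toy vocabulary

def one_hot(word):
    return [1.0 if w == word else 0.0 for w in VOCAB]

def distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d_good_well = distance(one_hot("good"), one_hot("well"))
d_good_sidewalk = distance(one_hot("good"), one_hot("sidewalk"))

# Two distinct one-hot vectors differ in exactly two positions,
# so both distances equal sqrt(2) whatever the words mean.
print(d_good_well, d_good_sidewalk)
```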

Limits of the "one-hot vector"

One solution is to use the Word2Vec encoding of [Mikolov et al. '13].

T. Mikolov et al. (2013). "Efficient Estimation of Word Representations in Vector Space", in ICLR 2013.

Very good tutorial: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

Word2Vec relies on 2 fundamental ideas

Idea 1: Dictionary = encoding matrix

Toy example: we want to represent these 8 words by a 4-element code.

Word      "one-hot"           Dictionary (1 row = code for 1 word)
'the'     1 0 0 0 0 0 0 0       2   3   4   5
'quick'   0 1 0 0 0 0 0 0      -1  -3  -2   2
'brown'   0 0 1 0 0 0 0 0      11   6   4  -3
'fox'     0 0 0 1 0 0 0 0      -4   8  -4   4
'jumps'   0 0 0 0 1 0 0 0      24  -6  42  17
'over'    0 0 0 0 0 1 0 0      91  13  14  -5
'lazy'    0 0 0 0 0 0 1 0       0  36   4  56
'dog'     0 0 0 0 0 0 0 1      -1   0   1  35

How do we select the code of a word? By multiplying its one-hot vector by the encoding matrix (the dictionary!).

Ex.: selecting the code of "brown"

$$\begin{pmatrix} 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} 2 & 3 & 4 & 5 \\ -1 & -3 & -2 & 2 \\ 11 & 6 & 4 & -3 \\ -4 & 8 & -4 & 4 \\ 24 & -6 & 42 & 17 \\ 91 & 13 & 14 & -5 \\ 0 & 36 & 4 & 56 \\ -1 & 0 & 1 & 35 \end{pmatrix} = \begin{pmatrix} 11 & 6 & 4 & -3 \end{pmatrix}$$

First layer of a neural network = encoding matrix

[Figure: the one-hot vector $\vec{x}$: 'brown' feeds a first layer whose weights are $W_0 \in R^{8 \times 4}$.]
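The selection by multiplication can be checked directly with the toy dictionary from the slide. The function name `encode` is a hypothetical helper introduced for this sketch.

```python
WORDS = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]

# Toy dictionary from the slide: 1 row = code for 1 word.
DICTIONARY = [
    [2, 3, 4, 5],
    [-1, -3, -2, 2],
    [11, 6, 4, -3],
    [-4, 8, -4, 4],
    [24, -6, 42, 17],
    [91, 13, 14, -5],
    [0, 36, 4, 56],
    [-1, 0, 1, 35],
]

def one_hot(word):
    return [1 if w == word else 0 for w in WORDS]

def encode(word):
    """Row vector (one-hot) times the 8 x 4 dictionary matrix."""
    x = one_hot(word)
    return [sum(x[i] * DICTIONARY[i][j] for i in range(len(WORDS)))
            for j in range(len(DICTIONARY[0]))]

code_brown = encode("brown")
print(code_brown)   # [11, 6, 4, -3], the third row of the dictionary
```

In practice the multiplication is replaced by a direct row lookup, which gives the same result without touching the zeros.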

First layer of a neural network = encoding matrix

$$\text{code}_{\vec{x}} = W_0^T \vec{x}$$

(the row of $W_0$ selected by the one-hot vector $\vec{x}$)

Word2Vec relies on 2 fundamental ideas

Idea 1: Dictionary = encoding matrix

We can therefore use a neural network to compute the contents of the dictionary.

Idea 2: 2 words close together in a text = 2 semantically close words

From a text corpus, we create millions of word pairs.
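One way to build such word pairs is to slide a window over the text and pair each word with its neighbours. The window size and the helper name `make_pairs` are assumptions for this sketch; the text does not say how close two words must be.

```python
def make_pairs(tokens, window=1):
    """Pair each word with every other word at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()
pairs = make_pairs(tokens, window=1)
print(len(pairs), pairs[:4])
```

With a window of 1, this 9-word sentence yields 16 pairs, including ('brown', 'fox'); a real corpus yields millions.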

Word2Vec

[Mikolov et al. '13]

Train a neural network to reproduce the 2nd word of a pair, given the 1st.

[Figure: the input $\vec{x}$: 'brown' (one-hot) goes through $W_0$ and then $W_1$; the target is $\vec{t}$: 'fox' (one-hot).]

Since the output is of "one-hot" type, a softmax (SMAX) is used:

$$\vec{y}(\vec{x}) = \mathrm{SMAX}\big(W_1 f(W_0 \vec{x})\big)$$

Word2Vec

Once the network is trained, use $W_0$ as the dictionary.

This algorithm comes with other details:

• Reduce the occurrence of frequent, semantically weak words (the, of, for, this, or, and, ...)

• Combine words that form an entity (ex.: United Nations)

• Various tricks to simplify/speed up training

[Figure: Word2Vec illustration, from [Ahmia et al. '18].]
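A toy version of this training loop fits in a few lines. The sketch below is not the Word2Vec reference implementation: it assumes a linear hidden layer, a window of 1, a full softmax and plain stochastic gradient descent, and it skips the extra details listed above.

```python
import math
import random

random.seed(0)
tokens = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(tokens))
index = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 4            # 8 words, 4-element codes

# Pairs (word, neighbour) with a window of 1 (assumed).
pairs = [(index[tokens[i]], index[tokens[j]])
         for i in range(len(tokens))
         for j in (i - 1, i + 1) if 0 <= j < len(tokens)]

W0 = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]  # dictionary
W1 = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]  # output layer

def softmax(o):
    m = max(o)
    e = [math.exp(v - m) for v in o]
    total = sum(e)
    return [v / total for v in e]

def train_pair(x_id, t_id, lr):
    h = list(W0[x_id])                          # one-hot x times W0 = row x_id
    y = softmax([sum(w * v for w, v in zip(row, h)) for row in W1])
    loss = -math.log(y[t_id])                   # cross-entropy, one-hot target
    d_o = list(y)
    d_o[t_id] -= 1.0                            # gradient w.r.t. pre-softmax scores
    d_h = [sum(d_o[k] * W1[k][j] for k in range(V)) for j in range(D)]
    for k in range(V):
        for j in range(D):
            W1[k][j] -= lr * d_o[k] * h[j]
    for j in range(D):
        W0[x_id][j] -= lr * d_h[j]
    return loss

def run_epoch(lr=0.1):
    return sum(train_pair(x, t, lr) for x, t in pairs) / len(pairs)

first_loss = run_epoch()
for _ in range(200):
    last_loss = run_epoch()

dictionary = W0   # once trained, W0 is the dictionary: 1 row per word
print(first_loss, last_loss, dictionary[index["brown"]])
```

On such a tiny corpus the learned codes carry little meaning; the point is only that the dictionary is a by-product of training the prediction task.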

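How do we train an RNN?

A story of gradients

Simple classification NN with cross-entropy ($S_M$ is the softmax, $L_{CE}$ the cross-entropy loss):

$$\vec{y}(\vec{x}) = S_M\big(W_1 \tanh(W_0 \vec{x})\big), \qquad L = L_{CE}(\vec{y}, t)$$

Step by step:

$$\vec{h} = \tanh(W_0 \vec{x}), \qquad \vec{o} = W_1 \vec{h}, \qquad \vec{y} = S_M(\vec{o}), \qquad L = L_{CE}(\vec{y}, t)$$

[Figure: $\vec{x} \rightarrow \vec{h} \rightarrow \vec{o} \rightarrow \vec{y} \rightarrow L(\vec{y}, t)$, with $W_0$ between $\vec{x}$ and $\vec{h}$ and $W_1$ between $\vec{h}$ and $\vec{o}$.]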

To train the network, we must compute $\nabla_{W_0} L$ and $\nabla_{W_1} L$.

Chain rule:

$$\nabla_{W_1} L = \frac{\partial L}{\partial \vec{y}} \frac{\partial \vec{y}}{\partial \vec{o}} \frac{\partial \vec{o}}{\partial W_1}$$

$$\nabla_{W_0} L = \frac{\partial L}{\partial \vec{y}} \frac{\partial \vec{y}}{\partial \vec{o}} \frac{\partial \vec{o}}{\partial \vec{h}} \frac{\partial \vec{h}}{\partial W_0}$$

Backpropagation

The factors are accumulated one at a time, from the loss back toward the input:

$$\nabla_{\vec{y}} L = \frac{\partial L}{\partial \vec{y}}$$

$$\nabla_{\vec{o}} L = \frac{\partial L}{\partial \vec{y}} \frac{\partial \vec{y}}{\partial \vec{o}}$$

$$\nabla_{W_1} L = \frac{\partial L}{\partial \vec{y}} \frac{\partial \vec{y}}{\partial \vec{o}} \frac{\partial \vec{o}}{\partial W_1}$$

$$\nabla_{\vec{h}} L = \frac{\partial L}{\partial \vec{y}} \frac{\partial \vec{y}}{\partial \vec{o}} \frac{\partial \vec{o}}{\partial \vec{h}}$$

$$\nabla_{W_0} L = \frac{\partial L}{\partial \vec{y}} \frac{\partial \vec{y}}{\partial \vec{o}} \frac{\partial \vec{o}}{\partial \vec{h}} \frac{\partial \vec{h}}{\partial W_0}$$

Ex.: 3 data points, 3 backpropagations

[Figure: three copies of the same network (shared $W_0$ and $W_1$), one per data point: $\vec{x}_i \rightarrow \vec{h}_i \rightarrow \vec{o}_i \rightarrow \vec{y}_i \rightarrow L_i = L(\vec{y}_i, t_i)$ for $i = 1, 2, 3$.]

Each data point yields its own gradients: $\nabla_{W_1} L_1$, $\nabla_{W_1} L_2$, $\nabla_{W_1} L_3$ and $\nabla_{W_0} L_1$, $\nabla_{W_0} L_2$, $\nabla_{W_0} L_3$.
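The backward pass for this small network can be sketched and checked against a finite-difference estimate. Two choices here are assumptions rather than statements from the slides: the softmax and cross-entropy derivatives are folded into the single well-known expression $\vec{y} - \vec{t}$, and the three per-example gradients are summed.

```python
import math
import random

def softmax(o):
    m = max(o)
    e = [math.exp(v - m) for v in o]
    total = sum(e)
    return [v / total for v in e]

def loss_and_grads(x, t, W0, W1):
    # Forward pass: h = tanh(W0 x), o = W1 h, y = softmax(o), L = CE(y, t).
    h = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W0]
    o = [sum(w * v for w, v in zip(row, h)) for row in W1]
    y = softmax(o)
    loss = -math.log(y[t])
    # Backward pass, one chain-rule factor at a time.
    d_o = list(y)
    d_o[t] -= 1.0                                              # dL/do = y - one_hot(t)
    g_W1 = [[d_o[k] * h[j] for j in range(len(h))] for k in range(len(o))]
    d_h = [sum(d_o[k] * W1[k][j] for k in range(len(o))) for j in range(len(h))]
    d_a = [d_h[j] * (1.0 - h[j] ** 2) for j in range(len(h))]  # tanh' = 1 - tanh^2
    g_W0 = [[d_a[j] * x[i] for i in range(len(x))] for j in range(len(h))]
    return loss, g_W0, g_W1

random.seed(0)
W0 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]   # 4 hidden, 3 inputs
W1 = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]   # 2 classes
data = [([1.0, 0.0, -1.0], 0), ([0.5, 2.0, 0.0], 1), ([-1.0, 1.0, 1.0], 1)]

def total_loss():
    return sum(loss_and_grads(x, t, W0, W1)[0] for x, t in data)

# 3 data points, 3 backpropagations: per-example gradients are summed.
sum_g_W0 = [[0.0] * 3 for _ in range(4)]
sum_g_W1 = [[0.0] * 4 for _ in range(2)]
for x, t in data:
    _, g0, g1 = loss_and_grads(x, t, W0, W1)
    for j in range(4):
        for i in range(3):
            sum_g_W0[j][i] += g0[j][i]
    for k in range(2):
        for j in range(4):
            sum_g_W1[k][j] += g1[k][j]

# Finite-difference check on one entry of W0.
eps = 1e-6
W0[0][0] += eps
loss_plus = total_loss()
W0[0][0] -= 2 * eps
loss_minus = total_loss()
W0[0][0] += eps
numeric = (loss_plus - loss_minus) / (2 * eps)
print(numeric, sum_g_W0[0][0])
```

The two printed numbers should agree to several decimal places, which is a cheap way to catch a mistake in a hand-written backward pass.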
