
Neural Networks

IFT 780

Recurrent Networks

By Pierre-Marc Jodoin

Basic neural network (regression)

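[Figure: the inputs x1, x2, x3, ..., xk and a constant 1 feed a hidden layer (weights W0, activation f_a); the hidden layer feeds the output through the weights W1.]

$y_W(\vec{x}) = W_1\, f_a(W_0 \vec{x})$

$\vec{h} = f_a(W_0 \vec{x}), \qquad y = W_1 \vec{h}$

$f_a$: activation function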

Basic neural network (classification)

[Figure: same network as for regression; each output is exponentiated (e) and then normalized (norm), which together form the Softmax.]

$\vec{y}_W(\vec{x}) = \mathrm{SMAX}\big(W_1\, f_a(W_0 \vec{x})\big)$

$\vec{h} = f_a(W_0 \vec{x}), \qquad \vec{y} = \mathrm{SMAX}(W_1 \vec{h})$

Simplified illustration

[Figure: x → h → y. 1 input and 1 output: basic NN.]
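The forward pass of the basic classification network can be sketched in a few lines of NumPy. This is only an illustration: the layer sizes, the input values, the random weights and the choice of tanh for f_a are assumptions, not values from the slides.

```python
import numpy as np

def softmax(o):
    """Exponentiate then normalize (the 'e' and 'norm' stages of the Softmax)."""
    e = np.exp(o - np.max(o))  # subtracting the max avoids overflow
    return e / e.sum()

def forward(x, W0, W1, f_a=np.tanh):
    """Compute h = f_a(W0 x) and y = SMAX(W1 h)."""
    h = f_a(W0 @ x)
    y = softmax(W1 @ h)
    return h, y

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])   # 3 inputs (hypothetical values)
W0 = rng.normal(size=(5, 3))     # 5 hidden neurons (assumption)
W1 = rng.normal(size=(4, 5))     # 4 classes (assumption)
h, y = forward(x, W0, W1)
print(h.shape, y.shape, y.sum())
```

The output y is a vector of class probabilities: all entries are positive and they sum to 1.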

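Basic neural network (2 classes)

(a bias could also be added)

[Figure: inputs x1, x2, x3; hidden layer with weights W0 and activation f_a; a single output y through the weights w1.]

$y_W(\vec{x}) = \vec{w}_1^{\,T} f_a(W_0 \vec{x})$

$\vec{h} = f_a(W_0 \vec{x}), \qquad y = \vec{w}_1^{\,T} \vec{h}$

$\vec{x} \in \mathbb{R}^{3},\quad W_0 \in \mathbb{R}^{5 \times 3},\quad \vec{h} \in \mathbb{R}^{5},\quad \vec{w}_1 \in \mathbb{R}^{5}$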

Recurrent network: the output of the neurons is fed back into their input

[Figure: the hidden activations h1, ..., h5 are copied and presented next to x1, x2, x3 as inputs of the hidden layer (weights W0, activation f_a); the output y is obtained through the weights w1.]

Here, instead of having 3 inputs, each neuron has 3 + 5 = 8 inputs.

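[Figure: x1, x2, x3 and h1, ..., h5 feed the hidden layer (weights W0, activation f_a); the output y is obtained through the weights w1.]

$y_W(\vec{x}) = \vec{w}_1^{\,T} f_a\big(W_0 [\vec{x}, \vec{h}]\big)$

$\vec{h} = f_a\big(W_0 [\vec{x}, \vec{h}]\big), \qquad y = \vec{w}_1^{\,T} \vec{h}$

$\vec{x} \in \mathbb{R}^{3},\quad [\vec{x}, \vec{h}] \in \mathbb{R}^{8},\quad W_0 \in \mathbb{R}^{5 \times 8},\quad \vec{h} \in \mathbb{R}^{5},\quad \vec{w}_1 \in \mathbb{R}^{5}$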

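To simplify the notation that follows

[Figure: same recurrent network, with W0 renamed W_h and w1 renamed w_y.]

$y_W(\vec{x}) = \vec{w}_y^{\,T} f_a\big(W_h [\vec{x}, \vec{h}]\big)$

$\vec{h} = f_a\big(W_h [\vec{x}, \vec{h}]\big), \qquad y = \vec{w}_y^{\,T} \vec{h}$

$\vec{x} \in \mathbb{R}^{3},\quad [\vec{x}, \vec{h}] \in \mathbb{R}^{8},\quad W_h \in \mathbb{R}^{5 \times 8},\quad \vec{h} \in \mathbb{R}^{5},\quad \vec{w}_y \in \mathbb{R}^{5}$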

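Equivalently

[Figure: x1, x2, x3 enter the hidden layer through W_xh; h1, ..., h5 are fed back through W_hh; the output y is obtained through w_hy.]

$y_W(\vec{x}) = \vec{w}_{hy}^{\,T} f_a\big(W_{xh}\vec{x} + W_{hh}\vec{h}\big)$

$\vec{h} = f_a\big(W_{xh}\vec{x} + W_{hh}\vec{h}\big), \qquad y = \vec{w}_{hy}^{\,T} \vec{h}$

$\vec{x} \in \mathbb{R}^{3},\quad W_{xh} \in \mathbb{R}^{5 \times 3},\quad W_{hh} \in \mathbb{R}^{5 \times 5},\quad \vec{h} \in \mathbb{R}^{5},\quad \vec{w}_{hy} \in \mathbb{R}^{5}$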

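More compact illustration

[Figure: x → h → y, with W_xh on the input arrow, W_hh on the loop from h back to h, and w_hy on the output arrow.]

$y_W(\vec{x}) = \vec{w}_{hy}^{\,T} f_a\big(W_{xh}\vec{x} + W_{hh}\vec{h}\big)$

$\vec{h} = f_a\big(W_{xh}\vec{x} + W_{hh}\vec{h}\big), \qquad y = \vec{w}_{hy}^{\,T} \vec{h}$

$\vec{x} \in \mathbb{R}^{3},\quad W_{xh} \in \mathbb{R}^{5 \times 3},\quad W_{hh} \in \mathbb{R}^{5 \times 5},\quad \vec{h} \in \mathbb{R}^{5},\quad \vec{w}_{hy} \in \mathbb{R}^{5}$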
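A quick numerical check shows that the single-matrix form acting on the concatenated vector [x, h] and the two-matrix form give the same hidden state. The sketch below assumes tanh for f_a and random values; the sizes (3 inputs, 5 hidden neurons) are those of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)           # input, x in R^3
h = rng.normal(size=5)           # previous hidden state, h in R^5
W_h = rng.normal(size=(5, 8))    # single matrix acting on [x, h] in R^8

# Split W_h column-wise: the first 3 columns act on x, the last 5 on h.
W_xh, W_hh = W_h[:, :3], W_h[:, 3:]

h_concat = np.tanh(W_h @ np.concatenate([x, h]))
h_split = np.tanh(W_xh @ x + W_hh @ h)
print(np.allclose(h_concat, h_split))
```

The two formulations differ only in how the weights are stored: W_xh and W_hh are the column blocks of the 5 x 8 matrix.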

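In the general case with K outputs (regression)

[Figure: x → h → y, with W_xh on the input arrow, W_hh on the loop from h back to h, and W_hy on the output arrow.]

$\vec{y}_W(\vec{x}) = W_{hy}\, f_a\big(W_{xh}\vec{x} + W_{hh}\vec{h}\big)$

$\vec{h} = f_a\big(W_{xh}\vec{x} + W_{hh}\vec{h}\big), \qquad \vec{y} = W_{hy} \vec{h}$

$\vec{x} \in \mathbb{R}^{3},\quad W_{xh} \in \mathbb{R}^{5 \times 3},\quad W_{hh} \in \mathbb{R}^{5 \times 5},\quad \vec{h} \in \mathbb{R}^{5},\quad W_{hy} \in \mathbb{R}^{K \times 5}$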

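In the general case with K outputs (classification)

[Figure: x → h → y, with W_xh on the input arrow, W_hh on the loop from h back to h, and W_hy on the output arrow.]

$\vec{y}_W(\vec{x}) = \mathrm{SMAX}\big(W_{hy}\, f_a(W_{xh}\vec{x} + W_{hh}\vec{h})\big)$

$\vec{h} = f_a\big(W_{xh}\vec{x} + W_{hh}\vec{h}\big), \qquad \hat{y} = W_{hy} \vec{h}, \qquad \vec{y} = \mathrm{SMAX}(\hat{y})$

$\vec{x} \in \mathbb{R}^{3},\quad W_{xh} \in \mathbb{R}^{5 \times 3},\quad W_{hh} \in \mathbb{R}^{5 \times 5},\quad \vec{h} \in \mathbb{R}^{5},\quad W_{hy} \in \mathbb{R}^{K \times 5}$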

Simplified illustration

[Figure: x → y. 1 input and 1 output: basic NN.]

Different configurations for different applications

• 1 input and 1 output: basic NN.
• 1 input and N outputs. E.g.: image captioning, 1 image => N words.
• N inputs and 1 output. E.g.: text classification, N words => 1 class.
• N inputs and N outputs. E.g.: video frame classification, N frames => N classes.
• M inputs and N outputs. E.g.: French-to-English translation, M words => N words.

Example for N inputs and 1 output:

Grammatical analysis (classification): (ê, t, r, e) => 'verb'

[Figure: the characters x_1 = 'ê', x_2 = 't', x_3 = 'r', x_4 = 'e' are fed one at a time; the hidden state goes h_0 → h_1 → h_2 → h_3 → h_4; the output y = 'verb' is read from h_4.]

Since x and y must be numerical variables, a "one-hot" encoding is generally used.

$\vec{x} \in \mathbb{R}^{256}$: 'a' = [1, 0, 0, ..., 0], 'b' = [0, 1, 0, ..., 0], 'c' = [0, 0, 1, ..., 0], ...

$\vec{y} \in \mathbb{R}^{M}$: 'verb' = [1, 0, 0, ..., 0], 'noun' = [0, 1, 0, ..., 0], 'adjective' = [0, 0, 1, ..., 0], ...

$\vec{h}_i = f_a\big(W_{xh}\vec{x}_i + W_{hh}\vec{h}_{i-1}\big)$

$\vec{h}_4 = f_a\big(W_{xh}\vec{x}_4 + W_{hh}\vec{h}_3\big)$

$\vec{y} = \mathrm{SMAX}\big(W_{hy}\vec{h}_4\big)$
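The N-inputs, 1-output configuration can be sketched as a loop over the characters followed by a single softmax on the last hidden state. The hidden size, the three classes, the random (untrained) weights, tanh as f_a, and the use of the Latin-1 character code as the one-hot index are all assumptions made for this illustration.

```python
import numpy as np

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def softmax(o):
    e = np.exp(o - np.max(o))
    return e / e.sum()

def classify_sequence(xs, W_xh, W_hh, W_hy, f_a=np.tanh):
    """Apply h_i = f_a(W_xh x_i + W_hh h_{i-1}) over xs, then y = SMAX(W_hy h_N)."""
    h = np.zeros(W_hh.shape[0])            # h_0
    for x in xs:
        h = f_a(W_xh @ x + W_hh @ h)
    return softmax(W_hy @ h)

rng = np.random.default_rng(0)
H = 5                                      # hidden size (assumption)
classes = ["verb", "noun", "adjective"]    # M = 3 classes (assumption)
W_xh = rng.normal(scale=0.1, size=(H, 256))
W_hh = rng.normal(scale=0.1, size=(H, H))
W_hy = rng.normal(scale=0.1, size=(len(classes), H))

word = "\u00eatre"                         # the word 'être', 4 characters
xs = [one_hot(ord(c), 256) for c in word]  # one-hot vectors in R^256
y = classify_sequence(xs, W_xh, W_hh, W_hy)
print(dict(zip(classes, y.round(3))))
```

With untrained weights the probabilities are close to uniform; training would adjust W_xh, W_hh and W_hy so that 'verb' receives the highest probability.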



Same idea for N inputs and N outputs:

E.g.: video analysis (regression): video frames => number of pedestrians

[Figure: the frames x_0, x_1, x_2, x_3 are fed one at a time; hidden states h_0 → h_1 → h_2 → h_3 → h_4; one output per frame: y_0 = 0, y_1 = 1, y_2 = 3, y_3 = 2.]

$\vec{h}_i = f_a\big(W_{xh}\vec{x}_i + W_{hh}\vec{h}_{i-1}\big)$

$\vec{y}_i = W_{hy}\vec{h}_i$

[Figure: each output y_i is compared with its target to give a loss L1, L2, L3, L4; the four losses are summed (Σ) into the total Loss.]

NOTE: the green layer can be replaced by a convolutional network, which converts an image into a d-dimensional vector.
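The N-inputs, N-outputs case with a summed loss can be sketched as follows. The slides do not name the per-step loss L_i, so a squared error is assumed here; the feature size d, the hidden size, the random weights and the random stand-ins for the frame features are also assumptions. Only the four target counts (0, 1, 3, 2) come from the example.

```python
import numpy as np

def rnn_regression_loss(xs, ts, W_xh, W_hh, W_hy, f_a=np.tanh):
    """Return the per-step outputs and the sum of the per-step squared errors."""
    h = np.zeros(W_hh.shape[0])                  # h_0
    ys, total = [], 0.0
    for x, t in zip(xs, ts):
        h = f_a(W_xh @ x + W_hh @ h)             # h_i
        y = W_hy @ h                             # y_i (regression: no softmax)
        ys.append(y)
        total += float(np.sum((y - t) ** 2))     # L_i, accumulated into the total loss
    return ys, total

rng = np.random.default_rng(0)
d, H = 16, 5                                     # feature and hidden sizes (assumptions)
xs = [rng.normal(size=d) for _ in range(4)]      # stand-ins for 4 frame feature vectors
ts = [np.array([v]) for v in (0.0, 1.0, 3.0, 2.0)]  # pedestrian counts of the example
W_xh = rng.normal(scale=0.1, size=(H, d))
W_hh = rng.normal(scale=0.1, size=(H, H))
W_hy = rng.normal(scale=0.1, size=(1, H))
ys, loss = rnn_regression_loss(xs, ts, W_xh, W_hh, W_hy)
print(len(ys), loss)
```

In practice each x_i could be the d-dimensional output of a convolutional network applied to frame i, as the note above suggests.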

Another example: character prediction (language model)

Toy alphabet: [a, e, m, s]

Toy "one-hot" representation:

'a' = [1, 0, 0, 0]

'e' = [0, 1, 0, 0]

'm' = [0, 0, 1, 0]

's' = [0, 0, 0, 1]

Goal: train a model to predict the letters of the word "masse".

Inputs: $\vec{x}_1$ = 'm' = [0, 0, 1, 0], $\vec{x}_2$ = 'a' = [1, 0, 0, 0], $\vec{x}_3$ = 's' = [0, 0, 0, 1], $\vec{x}_4$ = 's' = [0, 0, 0, 1]

Another example: character prediction (language model)

Alphabet: [a, e, m, s]

$\vec{h}_i = \tanh\big(W_{xh}\vec{x}_i + W_{hh}\vec{h}_{i-1}\big)$

$\vec{y}_i = \mathrm{SMAX}\big(W_{hy}\vec{h}_i\big)$

Training: the network reads "mass" and must output "asse", one letter per step.

i   input x_i          hidden h_i            output y_i          target t_i
1   'm' [0, 0, 1, 0]   [-0.3, -0.1, 0.9]     [.9, .1, 0, 0]      'a' [1, 0, 0, 0]
2   'a' [1, 0, 0, 0]   [1.0, 0.3, 0.1]       [.1, .5, .4, 0]     's' [0, 0, 0, 1]
3   's' [0, 0, 0, 1]   [-0.1, -0.5, 0.3]     [0, 0, .09, .91]    's' [0, 0, 0, 1]
4   's' [0, 0, 0, 1]   [-0.9, 0.2, 0.9]      [.3, .2, .4, .1]    'e' [0, 1, 0, 0]

Each output is compared with its target, and the per-step losses are summed:

$\mathrm{Loss} = \sum_{i=1}^{4} L(\vec{y}_i, \vec{t}_i)$

Another example: character prediction (language model)

Alphabet: [a, e, m, s]

At test time: predict the letters one after the other.

Step 1: compute the hidden layer.

Step 2: compute the output (softmax).

Step 3: select the most probable character.

Step 4: inject the predicted character at the input of the network.

And repeat!

input              hidden layer          output (softmax)    prediction
'm' [0, 0, 1, 0]   [-0.3, -0.1, 0.9]     [.9, .1, 0, 0]      y_1 = 'a'
'a' [1, 0, 0, 0]   [0.1, -0.3, -0.4]     [.1, .1, .2, .6]    y_2 = 's'
's' [0, 0, 0, 1]   [0.9, -0.8, -0.7]     [.1, .1, .0, .8]    y_3 = 's'
's' [0, 0, 0, 1]   [-0.1, -0.7, 0.5]     [.0, .9, .1, .0]    y_4 = 'e'
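The four test-time steps can be sketched as a greedy decoding loop. The weights below are random and untrained (an assumption made only so the code runs), so the generated string is arbitrary; a trained model would be expected to continue 'm' with "asse". The hidden size of 3 matches the example, and tanh follows the equations of the example.

```python
import numpy as np

ALPHABET = ["a", "e", "m", "s"]

def one_hot(ch):
    v = np.zeros(len(ALPHABET))
    v[ALPHABET.index(ch)] = 1.0
    return v

def softmax(o):
    e = np.exp(o - np.max(o))
    return e / e.sum()

def generate(first_char, n_steps, W_xh, W_hh, W_hy):
    """Greedy decoding: each predicted character becomes the next input."""
    h = np.zeros(W_hh.shape[0])
    x = one_hot(first_char)
    out = []
    for _ in range(n_steps):
        h = np.tanh(W_xh @ x + W_hh @ h)      # step 1: hidden layer
        y = softmax(W_hy @ h)                 # step 2: output (softmax)
        ch = ALPHABET[int(np.argmax(y))]      # step 3: most probable character
        out.append(ch)
        x = one_hot(ch)                       # step 4: inject it at the input
    return "".join(out)

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(3, 4))                # untrained weights (assumption)
W_hh = rng.normal(size=(3, 3))
W_hy = rng.normal(size=(4, 3))
text = generate("m", 4, W_xh, W_hh, W_hy)
print(text)
```

Taking the argmax at step 3 is the choice described in the slides; sampling from y instead is a common variant that produces more varied text.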

Another example: character prediction (language model)

Python code: "min-char-rnn" by A. Karpathy https://gist.github.com/karpathy/d4dee566867f8291f086

Credit: A. Karpathy, CS231n

[Figure: text generated once the model is trained.]

Training on the Linux source code (C)

[Figure: text generated once the model is trained.]

Limits of "one-hot" vectors

Language models often use the "one-hot" encoding.

For characters, $\vec{x} \in \mathbb{R}^{256}$:

'a' = [1, 0, 0, ..., 0]
'b' = [0, 1, 0, ..., 0]
'c' = [0, 0, 1, ..., 0]
...

For words, $\vec{x} \in \mathbb{R}^{10{,}000}$:

'grand' = [..., 1, 0, 0, ..., 0]
'grandement' = [..., 0, 1, 0, ..., 0]
'grandeur' = [..., 0, 0, 1, ..., 0]
...

Limits of "one-hot" vectors

Although simple, this encoding has several drawbacks.

1- Memory-inefficient when not compressed

E.g.: 10,000 bits to encode the word 'I' in a language with 10,000 words!

2- No semantic distance between the codes:

E.g.

distance[one-hot('good'), one-hot('well')] = distance[one-hot('good'), one-hot('sidewalk')]

Instead, we would like a code such that

distance[code('good'), code('well')] << distance[code('good'), code('sidewalk')]

distance[code('Jean'), code('Chantal')] << distance[code('good'), code('sidewalk')]

distance[code('India'), code('Lebanon')] << distance[code('good'), code('sidewalk')]
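The second drawback is easy to verify numerically: any two distinct one-hot vectors are at the same Euclidean distance, sqrt(2), whatever the words mean. The three-word vocabulary below is a toy assumption.

```python
import numpy as np

vocab = ["good", "well", "sidewalk"]   # toy vocabulary (assumption)
codes = np.eye(len(vocab))             # row i is the one-hot code of vocab[i]

def distance(a, b):
    """Euclidean distance between the one-hot codes of two words."""
    return float(np.linalg.norm(codes[vocab.index(a)] - codes[vocab.index(b)]))

d_close = distance("good", "well")
d_far = distance("good", "sidewalk")
print(d_close, d_far, d_close == d_far)
```

Both distances are equal, so a one-hot code carries no information about how related two words are.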

Limits of the "one-hot" vector

One solution is to use the Word2Vec encoding of [Mikolov et al. '13].

T. Mikolov et al. (2013). "Efficient Estimation of Word Representations in Vector Space", in ICLR 2013.

Very good tutorial: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

Word2Vec builds on 2 fundamental ideas

Idea 1: dictionary = encoding matrix

Toy example: we want to represent these 8 words by a code with 4 elements.

word      "one-hot"            dictionary (1 row = code for 1 word)
'the'     1 0 0 0 0 0 0 0       2   3   4   5
'quick'   0 1 0 0 0 0 0 0      -1  -3  -2   2
'brown'   0 0 1 0 0 0 0 0      11   6   4  -3
'fox'     0 0 0 1 0 0 0 0      -4   8  -4   4
'jumps'   0 0 0 0 1 0 0 0      24  -6  42  17
'over'    0 0 0 0 0 1 0 0      91  13  14  -5
'lazy'    0 0 0 0 0 0 1 0       0  36   4  56
'dog'     0 0 0 0 0 0 0 1      -1   0   1  35

Word2Vec builds on 2 fundamental ideas

How do we select the code of a word? By multiplying its one-hot vector by the encoding matrix (the dictionary!).

E.g.: selecting the code of 'brown'

$$
\begin{bmatrix} 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix}
2 & 3 & 4 & 5 \\
-1 & -3 & -2 & 2 \\
11 & 6 & 4 & -3 \\
-4 & 8 & -4 & 4 \\
24 & -6 & 42 & 17 \\
91 & 13 & 14 & -5 \\
0 & 36 & 4 & 56 \\
-1 & 0 & 1 & 35
\end{bmatrix}
=
\begin{bmatrix} 11 & 6 & 4 & -3 \end{bmatrix}
$$
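The lookup-by-multiplication idea can be checked directly with the toy dictionary of the example. The sketch below only restates the 8 x 4 matrix and the word order given above; the helper name one_hot is hypothetical.

```python
import numpy as np

words = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
D = np.array([
    [ 2,  3,  4,  5],
    [-1, -3, -2,  2],
    [11,  6,  4, -3],
    [-4,  8, -4,  4],
    [24, -6, 42, 17],
    [91, 13, 14, -5],
    [ 0, 36,  4, 56],
    [-1,  0,  1, 35],
])

def one_hot(word):
    v = np.zeros(len(words), dtype=int)
    v[words.index(word)] = 1
    return v

# Multiplying by a one-hot row vector selects the matching row of D.
code = one_hot("brown") @ D
print(code)
```

In practice the multiplication is never carried out: indexing the row, D[words.index("brown")], gives the same code at a fraction of the cost.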

Word2Vec builds on 2 fundamental ideas

First layer of a neural network = encoding matrix

[Figure: the one-hot vector x: 'brown' is the input of a layer with weights W0.]

$W_0 \in \mathbb{R}^{4 \times 8}$

$\mathrm{code}_{\vec{x}} = W_0 \vec{x}$

We can therefore use a neural network to compute the contents of the dictionary.

Idea 2: 2 words that are close in a text = 2 words that are semantically close

Starting from a text corpus, we create millions of word pairs.
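Building the word pairs of Idea 2 can be sketched with a sliding window over the text. The window size of 2 and the example sentence are assumptions chosen for illustration; the slides do not specify them.

```python
def word_pairs(tokens, window=2):
    """Return (center, neighbor) pairs for words at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()
pairs = word_pairs(tokens, window=2)
print(len(pairs))
print(pairs[:4])
```

Each pair provides one training example: the first word is the input x and the second word is the target t.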

Word2Vec

[Mikolov et al. '13]

[Figure: the one-hot vector x: 'brown' goes through W0, then W1, then a softmax (SM); the output is compared with the one-hot target t: 'fox'.]

Train a neural network to reproduce the 2nd word starting from the 1st.

Since the output is of "one-hot" type, a softmax is used.

$\vec{y}_W(\vec{x}) = \mathrm{SMAX}\big(W_1\, f_a(W_0 \vec{x})\big)$

Once trained, use $W_0$ as the dictionary.

Word2Vec

This algorithm comes with other details:

• Reduce the occurrence of frequent, semantically weak words (the, of, for, this, or, and, …)

• Combine words that form a single entity (e.g., United Nations)

• Various tricks to simplify and speed up training

Word2Vec

[Ahmia et al. '18]

How do we train an RNN?

A story of gradients

Classification NN with cross-entropy

[Figure: x → W0 → W1 → y → L(y, t)]

$\vec{y}_W(\vec{x}) = S_M\big(W_1 \tanh(W_0 \vec{x})\big)$

$L = L_{CE}\big(\vec{y}_W(\vec{x}), \vec{t}\big)$

A story of gradients

Simple classification NN with cross-entropy

[Figure: x → W0 → h → W1 → o → y → L(y, t)]

Forward propagation

$\vec{h} = \tanh(W_0 \vec{x})$

$\vec{o} = W_1 \vec{h}$

$\vec{y} = S_M(\vec{o})$

$L = L_{CE}(\vec{y}, \vec{t})$

To train the network, we need to compute $\dfrac{\partial L}{\partial W_0}$ and $\dfrac{\partial L}{\partial W_1}$.

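A story of gradients

Simple classification NN with cross-entropy

[Figure: x → W0 → h → W1 → o → y → L(y, t)]

$\vec{h} = \tanh(W_0 \vec{x}), \qquad \vec{o} = W_1 \vec{h}, \qquad \vec{y} = S_M(\vec{o}), \qquad L = L_{CE}(\vec{y}, \vec{t})$

Chain rule

$\dfrac{\partial L}{\partial W_1} = \dfrac{\partial L}{\partial \vec{y}}\, \dfrac{\partial \vec{y}}{\partial \vec{o}}\, \dfrac{\partial \vec{o}}{\partial W_1}$

$\dfrac{\partial L}{\partial W_0} = \dfrac{\partial L}{\partial \vec{y}}\, \dfrac{\partial \vec{y}}{\partial \vec{o}}\, \dfrac{\partial \vec{o}}{\partial \vec{h}}\, \dfrac{\partial \vec{h}}{\partial W_0}$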
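The chain-rule products can be written out and checked against finite differences. The sketch below uses the standard simplification for a softmax followed by cross-entropy, where the product of dL/dy and dy/do reduces to y - t. The sizes, the random weights and the one-hot target are assumptions.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))
    return e / e.sum()

def loss_and_grads(x, t, W0, W1):
    # Forward propagation
    h = np.tanh(W0 @ x)
    o = W1 @ h
    y = softmax(o)
    L = -float(np.sum(t * np.log(y)))            # cross-entropy
    # Backpropagation (chain rule)
    dL_do = y - t                                # dL/dy * dy/do (softmax + cross-entropy)
    dL_dW1 = np.outer(dL_do, h)                  # ... * do/dW1
    dL_dh = W1.T @ dL_do                         # ... * do/dh
    dL_dW0 = np.outer(dL_dh * (1.0 - h ** 2), x) # ... * dh/dW0, with tanh' = 1 - tanh^2
    return L, dL_dW0, dL_dW1

rng = np.random.default_rng(0)
x = rng.normal(size=3)
t = np.array([0.0, 1.0, 0.0, 0.0])               # one-hot target (assumption)
W0 = rng.normal(size=(5, 3))
W1 = rng.normal(size=(4, 5))
L, g0, g1 = loss_and_grads(x, t, W0, W1)

# Central finite differences on one entry of each matrix
eps = 1e-5
def numeric(W_name, idx):
    Wp = {"W0": W0.copy(), "W1": W1.copy()}
    Wm = {"W0": W0.copy(), "W1": W1.copy()}
    Wp[W_name][idx] += eps
    Wm[W_name][idx] -= eps
    Lp = loss_and_grads(x, t, Wp["W0"], Wp["W1"])[0]
    Lm = loss_and_grads(x, t, Wm["W0"], Wm["W1"])[0]
    return (Lp - Lm) / (2 * eps)

err0 = abs(numeric("W0", (2, 1)) - g0[2, 1])
err1 = abs(numeric("W1", (3, 4)) - g1[3, 4])
print(err0, err1)
```

Both errors should be tiny, which indicates that the analytic gradients agree with the numerical ones for the tested entries.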



A story of gradients

Backpropagation

[Figure: x → W0 → h → W1 → o → y → L(y, t); the gradient is propagated from L back toward the input.]

$\vec{h} = \tanh(W_0 \vec{x}), \qquad \vec{o} = W_1 \vec{h}, \qquad \vec{y} = S_M(\vec{o}), \qquad L = L_{CE}(\vec{y}, \vec{t})$

The gradient is built one factor at a time, starting from the loss:

$\dfrac{\partial L}{\partial \vec{y}}$

$\dfrac{\partial L}{\partial \vec{y}}\, \dfrac{\partial \vec{y}}{\partial \vec{o}}$

$\dfrac{\partial L}{\partial \vec{y}}\, \dfrac{\partial \vec{y}}{\partial \vec{o}}\, \dfrac{\partial \vec{o}}{\partial W_1}$
