Recurrent Neural Networks
clement.chatelain@insa-rouen.fr
14 February 2018
Contents
1 Introduction
2 Recurrent neural networks: principles (Classical RNNs; Training recurrent neural nets)
3 RNN with internal memory (LSTM; BLSTM; MDLSTM; GRUs)
4 Attention-based RNN
5 RNN applications (Gradient recall; Legos)
6 RNN with external memory (Introduction)
Introduction
Introduction
Neural networks are cool:
State-of-the-art performance on almost every pattern recognition / computer vision / AI task
Efficient training algorithms (backprop), modular architectures
But DNNs & CNNs are static architectures
Estimation of a function $f : x \to y$, where:
$x^T = [x_1, x_2, \ldots, x_E] \in \mathbb{R}^E$, $y^T = [y_1, y_2, \ldots, y_S] \in \mathbb{R}^S$
Able to process multidimensional signals by flattening $x$, but require fixed-size $E$ and $S$
Unfortunately, most real-world signals are dynamic:
Text, speech, DNA sequences, handwriting, stock prices, etc.
CHAPTER I. Down the Rabbit-Hole
Alice was beginning to get very tired of sitting by her sister on the bank, and of ...
Dynamic Machine learning Problems
Many machine learning problems involve dynamic input and/or output signals.
Static input: feature sets, images with a fixed, defined size, etc.
Static output: mono/multidimensional classification problems, mono/multidimensional regression problems
Dynamic input: sequences of feature sets, variable-size signals
Dynamic output: sequences of labels/values, mono- or multidimensional
Static input / Static output problems
MNIST: 784 inputs (28²), 10 output classes
ImageNet: 256² input pixels, 5247 output classes
Dynamic input / Static output problems
Sentiment classification: positive or negative?
"This is by far the worst hotel experience i’ve ever had.
the owner overbooked while i was staying there (even though i booked the room two months in advance) and made me move to another room, but that room wasnt even a hotel room!"
"We enjoyed our stay very much."
Variable-size image classification (example images):
→ ”chocolate”   → ”strawberry”
And also: sentence classification, etc.
Static input / Dynamic output problems
Caption generation
Object detection (VOC Challenge)
Dynamic Input / Dynamic Output problems
Handwriting/Speech recognition
→ ”Herr Bürgermaister Hans”
Need for dynamic models to process dynamic signals
Able to handle variable-size input/output signals
Able to model the dependencies between variables = the knowledge
"j'ai malencontrousement marché sur le chion de ma vosrre. cele-ci, tenant brauemryp à son chien, a éle parthaliérement afestée par l'acaident et me demande un dédammagement de taille"
(a noisy French handwriting-recognition output: knowledge of the language is needed to make sense of it)
Two kinds of models:
Graphical models: dependencies are explicitly modeled
Recurrent neural networks: dependencies are implicitly modeled
Non-neural sequence models: graphical models
Hidden Markov Models
- State of the art for sequence modeling in the 80's and 90's (still widely used)
- Generative model; NN/HMM hybrid models
Conditional Random Fields (CRF)
- Discriminative model [Lafferty 2001]
- Require E = S
- Suit labeling tasks
Hidden-CRF, Latent-Dynamic CRF
- Difficult to train
- Do not scale very well
HMM, CRF and HCRF
Recurrent neural networks: principles - Classical RNNs
Recurrence principle (1)
Handle dynamic signals
Signals are processed using a sliding window
Windows (AKA frames) are of a fixed size → they fit static NN architectures
Recurrence principle (2)
Signal dependencies = memory from previous time steps
How to model memory in the NN?
Recurrent connections: y(t) is computed using:
- x(t), the current input
- y(t−1), the output at the previous time step
Recurrent NN ≠ feedforward
Recurrence is generally a hidden-to-hidden link
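A minimal sketch of this recurrence in plain NumPy (sizes and names are illustrative, not from the slides): the same cell combines the current input with the output of the previous time step.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent step: the new output depends on the current
    input x(t) and on the output of the previous time step y(t-1)."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Illustrative sizes: E input features, H hidden (recurrent) units
E, H = 4, 8
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(E, H))   # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(H, H))   # hidden-to-hidden (recurrent) weights
b_h = np.zeros(H)

h = np.zeros(H)                             # empty memory at t = 0
x_t = rng.normal(size=E)                    # one input frame
h = rnn_step(x_t, h, W_xh, W_hh, b_h)       # y(t) from x(t) and y(t-1)
```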
Recurrence principle (3)
Recurrent neural networks may:
- Output a value at each time step ⇒ sequence labeling
- Output a unique value after having read an entire sequence ⇒ sequence classification
- Output a sequence whose size S differs from the input size E ⇒ sequence-to-sequence applications
Unfolded network (1)
Recurrence ↔ unfolded network
Signal from the recurrence = neuron output at the previous time step. Recurrence can be modeled by a duplication of the network fed with the data at t−1, which itself requires a duplication at t−2, etc.
Unfolded network (2)
Non-recurrent modeling of recurrent networks
A recurrent network R is approximated by a deeper non-recurrent network R*
Recurrence is limited:
- Processing a sequence of size T requires T duplications of the network
- Limited memory ⇒ T is limited to a certain value
Parameter sharing:
- The network's parameters are the same over all time steps
- Can be viewed as a form of regularization (as in CNNs)
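Unfolding over a whole sequence is then just a loop applying the same cell T times; the sketch below (again with illustrative sizes) makes the parameter sharing explicit: W_xh, W_hh and b_h are reused at every time step.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

T, E, H = 20, 4, 8                          # sequence length, input size, hidden size
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(E, H))
W_hh = rng.normal(scale=0.1, size=(H, H))
b_h = np.zeros(H)

X = rng.normal(size=(T, E))                 # the input sequence
h = np.zeros(H)
outputs = []
for t in range(T):                          # T duplications of the same network
    h = rnn_step(X[t], h, W_xh, W_hh, b_h)  # shared parameters at every step
    outputs.append(h)
```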
Unfolded network (3)
Temporal structure → spatial structure
Figure credit: http://www.deeplearningbook.org
Left: hidden-to-hidden recurrence; right: output-to-hidden recurrence.
L is the loss that compares the net output with the label at each time step.
Training recurrent neural networks
Backpropagation Through Time (BPTT)
Once unfolded, R* is a classical neural network without recurrence, so classical backprop can be carried out
But the network can be quite deep, depending on T ...
... and the vanishing gradient issues may (re)appear
For these reasons, RNNs were for a long time used only with small T, with limited success.
Let us consider these gradient problems [Pascanu2015]:
- Exploding gradient
- Vanishing gradient
Exploding Gradients
Exploding gradient: a large increase in the gradient norm due to long-term components
The solution(s)
- Old solutions: L1 and L2 regularization
- Echo State Networks [Lukosevicius2009]: recurrent weights are not learned but sampled from handcrafted distributions
- Recent solution for exploding gradients: cap (clip) the gradient norm [Pascanu2015], inspired by [Mikolov2011phd]
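A minimal sketch of this gradient-norm capping trick (the threshold value and the variable names are illustrative assumptions):

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Rescale the gradient so that its norm never exceeds the threshold,
    which prevents a single exploding update from destroying the weights."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad
```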
Vanishing Gradients
Vanishing gradient: the long-term error tends exponentially fast to norm 0
→ temporal correlations become impossible to learn
The solution(s)
- LSTM [Hochreiter97]: a special memory cell (see next section)
- Echo State Networks [Lukosevicius2009]: recurrent weights are not learned but sampled from handcrafted distributions
- Recent solution: "regularization that promotes parameter values such that back-propagated gradients neither increase or decrease too much in magnitude." [Pascanu 2015]
⇒ Using these tricks, recurrent architectures are now trained successfully on many applications, with T > 100.
Alternative to BPTT : RTRL
Real Time Recurrent Learning (RTRL) [Williams 1989]
Neuron j takes as input:
- All the outputs x(t) from the previous layer, weighted by $w_{je}$
- All the outputs y(t−1) from its own layer, weighted by $w_{jj'}$
then:
$y_j(t) = \varphi\left( \sum_{e=0}^{E} w_{je}\, x_e(t) + \sum_{j'=0}^{J} w_{jj'}\, y_{j'}(t-1) \right)$
Training by RTRL:
- Criterion $J = (y_j^d - y_j)^2$
- Compute $\frac{\partial J}{\partial w_{sj}}$, $\frac{\partial J}{\partial w_{je}}$ and $\frac{\partial J}{\partial w_{jj'}}$ to obtain the "classical" gradient
- Second order available
RNN with internal memory - LSTM
LSTM Cells (1)
Long Short Term Memory cells [Hochreiter & Schmidhuber 1997]
Idea: explicitly model the memory of previous observations.
Each neuron has:
- A memory cell that contains a real value
- 3 gates controlling the memory cell: input, output and forget gates
The gates prevent the network from suffering from gradient problems
LSTM Cells (2)
Input gate: $a_\iota^t = \sum_{i=1}^{I} w_{i\iota} x_i^t + \sum_{h=1}^{H} w_{h\iota} b_h^{t-1} + \sum_{c=1}^{C} w_{c\iota} s_c^{t-1}$, $\quad b_\iota^t = f(a_\iota^t)$
Forget gate: $a_\phi^t = \sum_{i=1}^{I} w_{i\phi} x_i^t + \sum_{h=1}^{H} w_{h\phi} b_h^{t-1} + \sum_{c=1}^{C} w_{c\phi} s_c^{t-1}$, $\quad b_\phi^t = f(a_\phi^t)$
Cell: $a_c^t = \sum_{i=1}^{I} w_{ic} x_i^t + \sum_{h=1}^{H} w_{hc} b_h^{t-1}$, $\quad s_c^t = b_\phi^t s_c^{t-1} + b_\iota^t g(a_c^t)$
Output gate: $a_\omega^t = \sum_{i=1}^{I} w_{i\omega} x_i^t + \sum_{h=1}^{H} w_{h\omega} b_h^{t-1} + \sum_{c=1}^{C} w_{c\omega} s_c^t$, $\quad b_\omega^t = f(a_\omega^t)$, and $b_c^t = b_\omega^t\, h(s_c^t)$
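A NumPy sketch of one time step following these equations, vectorised over C cells (here f is the logistic sigmoid and g, h are tanh, the usual choices; the diagonal peephole weights and the dict of weight matrices are illustrative assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, b_prev, s_prev, W):
    """x_t: inputs (I,), b_prev: previous cell outputs (C,), s_prev: previous cell states (C,).
    W holds the input/recurrent weight matrices and the per-cell peephole weights w_c*."""
    i = sigmoid(x_t @ W["xi"] + b_prev @ W["hi"] + W["ci"] * s_prev)  # input gate  b_iota
    f = sigmoid(x_t @ W["xf"] + b_prev @ W["hf"] + W["cf"] * s_prev)  # forget gate b_phi
    g = np.tanh(x_t @ W["xc"] + b_prev @ W["hc"])                     # cell input  g(a_c)
    s = f * s_prev + i * g                                            # new cell state s_c
    o = sigmoid(x_t @ W["xo"] + b_prev @ W["ho"] + W["co"] * s)       # output gate b_omega
    b = o * np.tanh(s)                                                # cell output b_c = b_omega h(s_c)
    return b, s
```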
LSTM Cells (3)
Long Short Term Memory properties
- Gates are sigmoids → differentiable
- Trained with gradients, in a BPTT fashion
- Most of the time: a few LSTM layers followed by a dense layer for classification
- More parameters than for a classical RNN ... but allows T > 1000!
Generative LSTMs
LSTMs (RNNs) can also be used in a generative way!
Standard LSTM architecture, where the net simply learns a mapping function from the current sequence to the next item.
For example: {'g', 'o', 'o'} → 'd'.
At decision time, the net is initialized with a seed = the past. Example from "Alice in Wonderland" (seed in blue):
But no mistake about it: it was neither more nor less than a pig, and she felt that it would be quit e aelin that she was a little want oe toiet ano a grtpersent to the tas a little war th tee the tase oa teettee the had been tinhgtt a little toiee at the cadl in a long tuiee aedun thet sheer was a little tare gereen to be a gentle of the tabdit ...
More difficult problem : Handwriting generation demo
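A sketch of this generation loop at decision time (the predict_next helper and the alphabet are hypothetical; predict_next stands for the trained LSTM's softmax over the next character):

```python
import numpy as np

def generate(predict_next, seed, alphabet, n_chars=200):
    """Start from the seed, then repeatedly sample the next character
    from the distribution predicted by the network and feed it back."""
    text = list(seed)
    for _ in range(n_chars):
        p = predict_next("".join(text))              # probabilities over the alphabet
        text.append(np.random.choice(list(alphabet), p=p))
    return "".join(text)
```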
Bidirectional-LSTM (BLSTM)
A novel connectionist system for unconstrained handwriting recognition. A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, J. Schmidhuber. IEEE PAMI 31 (5), 855-868, 2009
Bidirectional Long Short Term Memory
Sequences are processed from left to right and from right to left; the results are combined to take a local decision given the whole sequence
Training using Connectionist Temporal Classification (CTC) + BPTT. CTC does not require frame-labeled sequences
MDLSTM
A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. NIPS, 2009
Multi-Dimensional Long Short Term Memory
2D version of the BLSTM, suited to images. Images are processed in 4 directions. Can be generalized to more than 2D.
GRUs
J. Chung, C. Gulcehre, K. Cho, Y. Bengio. "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling". arXiv:1412.3555, 2014
LSTMs have too many parameters!
GRUs are a simplified version of the LSTM: the input and forget gates are merged into a single update gate, and there is no output gate
It has been empirically shown that they provide performance similar to the LSTM, with fewer parameters
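A quick way to see the saving, sketched with Keras (the sizes are arbitrary and the exact counts depend on the Keras version and layer options):

```python
from tensorflow import keras

inputs = keras.Input(shape=(None, 32))                     # sequences of 32-dim frames
lstm = keras.Model(inputs, keras.layers.LSTM(64)(inputs))
gru = keras.Model(inputs, keras.layers.GRU(64)(inputs))
print(lstm.count_params())   # 4 gate blocks: 4 * (32*64 + 64*64 + 64) = 24832
print(gru.count_params())    # 3 gate blocks: roughly 3/4 of the LSTM count
```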
Attention based RNN
Attention based RNN (1)
Classical RNN applied to signals
- Slides over 1D signals from left to right
- Requires annotation at the frame level
But the running direction is often not that clear:
- Inclined signals
- Signals that have to be located
- Non-trivial reading order
Attention based RNN (2)
Non-trivial running direction
Where to start? Where to stop? How to slide over the signal?
⇒ Let's learn it! (the generic answer to most ML problems :) )
The RNN should output at each time step:
- The desired output: classification or regression (as for a classical RNN)
- The next position to look at = the attention mechanism
Attention based RNN (3)
General decision scheme:
- Feature extraction on the whole signal: f
- The attention mechanism $\alpha_{(i,j),t}$ computes a mask over f
- The product of f and $\alpha$ feeds a standard RNN
Attention mechanism
- Attention-based location: $\alpha_{(i,j),t} = g(\alpha_{t-1}, c_{t-1})$
- Content-based location: $\alpha_{(i,j),t} = g(f, c_{t-1})$
- Both approaches are often combined: $\alpha_{(i,j),t} = g(f, \alpha_{t-1}, c_{t-1})$
g is generally a learned linear layer + $\varphi$:
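A NumPy sketch of one content-based attention step, with g as a learned linear map followed by a softmax (all shapes and names are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_step(f, c_prev, W_f, W_c, v):
    """f: (N, D) feature vectors over the signal, c_prev: (H,) previous RNN state.
    Returns the attention mask alpha over the N positions and the masked features."""
    scores = np.tanh(f @ W_f + c_prev @ W_c) @ v   # one score per position: g(f, c_{t-1})
    alpha = softmax(scores)                        # attention mask (sums to 1)
    context = alpha @ f                            # product of f and alpha, fed to the RNN
    return alpha, context
```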
RNN Applications - Gradient Recall
Gradient descent
Iterative algorithm: $W_{t=0}$ randomly initialized
- Good direction = the one that lowers the criterion J
- Good step length ($\eta$) = not too small, not too large
$W_{t+1} \leftarrow W_t - \eta \left.\frac{dJ(W)}{dW}\right|_{W_t}$
where: W: parameters; $\eta$: learning rate; $\left.\frac{dJ(W)}{dW}\right|_{W_t}$: the good direction
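The update in code (a one-line sketch; the learning rate value and the toy criterion are arbitrary):

```python
import numpy as np

def sgd_step(W, grad_J, eta=0.01):
    """W(t+1) = W(t) - eta * dJ/dW evaluated at W(t)."""
    return W - eta * grad_J

# Example on J(W) = W^2, whose gradient is 2W: the iterates shrink towards the minimum at 0.
W = np.array(2.0)
for _ in range(10):
    W = sgd_step(W, 2.0 * W, eta=0.1)
```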
Backprop Generalization (1)
(Y. LeCun, cf. the 12/02/2016 course: https://www.college-de-france.fr/site/yann-lecun/course-2016-02-12-14h30.htm)
Feedforward network:
Stacking layers $F_i$ that compute an output $H_i$ from the input $H_{i-1}$ and, for some layers, parameters $W_i$ or the desired output Y. Examples:
- Linear layer: $H_i = F_i(H_{i-1}, W_i) = W_i H_{i-1}$
- Activation layer (f: tanh, ReLU, Softmax, etc.): $H_i = F_i(H_{i-1}) = f(H_{i-1})$
- Loss layer, for example MSE
Shallow network (MLP):
Backprop Generalization (2)
Gradients and Backprop
Abstraction of the net layers. Gradient computation:
$\frac{\partial J}{\partial H_{i-1}} = \frac{\partial J}{\partial H_i} \times \frac{\partial H_i}{\partial H_{i-1}}$
For layers $F_i$ that have parameters, updating the parameters:
$\frac{\partial J}{\partial W_i} = \frac{\partial J}{\partial H_i} \times \frac{\partial H_i}{\partial W_i}$
These two equations are recursively applied from the output down to the input:
- the $\frac{\partial J}{\partial H_i}$ term (in blue on the slide) was computed at the previous step
Backprop Generalization (3)
Instantiation of classical $F_i$:
- Linear layer: $H_i = F_i(H_{i-1}, W_i) = W_i H_{i-1}$, therefore:
  $\frac{\partial H_i}{\partial H_{i-1}} = W_i$ ; $\frac{\partial H_i}{\partial W_i} = H_{i-1}$
- Activation layer: $H_i = F_i(H_{i-1}) = f(H_{i-1})$ with f: tanh, sigmoid, ReLU, Softmax, etc., therefore:
  $\frac{\partial H_i}{\partial H_{i-1}} = f'(H_{i-1})$
- MSE layer: $H_i = F_i(H_{i-1}, Y) = \|H_{i-1} - Y\|^2$, therefore:
  $\frac{\partial H_i}{\partial H_{i-1}} = 2 \times (H_{i-1} - Y)$
Backprop Generalization (4)
Generic backprop for an L-layer network:
foreach sample
  for i = L downto 1
    $\frac{\partial J}{\partial H_{i-1}} \leftarrow \frac{\partial J}{\partial H_i} \times \frac{\partial H_i}{\partial H_{i-1}}$
    $\frac{\partial J}{\partial W_i} \leftarrow \frac{\partial J}{\partial H_i} \times \frac{\partial H_i}{\partial W_i}$   // if necessary
  endfor
end foreach
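A minimal NumPy sketch of this modular scheme, with just a linear layer, a tanh layer and an MSE loss (batch size 1; everything here is illustrative):

```python
import numpy as np

class Linear:
    def __init__(self, n_in, n_out):
        self.W = np.random.randn(n_out, n_in) * 0.1
    def forward(self, H_prev):
        self.H_prev = H_prev
        return self.W @ H_prev                      # H_i = W_i H_{i-1}
    def backward(self, dJ_dH):
        self.dJ_dW = np.outer(dJ_dH, self.H_prev)   # dJ/dW_i = dJ/dH_i * dH_i/dW_i
        return self.W.T @ dJ_dH                     # dJ/dH_{i-1} = dJ/dH_i * dH_i/dH_{i-1}

class Tanh:
    def forward(self, H_prev):
        self.H = np.tanh(H_prev)
        return self.H
    def backward(self, dJ_dH):
        return dJ_dH * (1.0 - self.H ** 2)          # f'(H_{i-1}) for f = tanh

# Forward pass through the stack, then backprop from the loss down to the input.
layers = [Linear(4, 8), Tanh(), Linear(8, 2)]
x, y = np.random.randn(4), np.random.randn(2)
H = x
for layer in layers:
    H = layer.forward(H)
dJ_dH = 2.0 * (H - y)                               # gradient of the MSE loss
for layer in reversed(layers):
    dJ_dH = layer.backward(dJ_dH)
```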
Legos
Using this framework, modular architectures can be built
RNNs can easily be combined with dense layers, CNNs, etc.
A lot of "lego" architectures have been proposed
Modular architectures
Combination of:
- Dense layers (for decision purposes)
- Convolutional layers (to handle images)
- Recurrent layers (to handle time)
- etc.
How to play legos?
- Stack layers, whatever their nature
- Apply efficient training algorithms (backprop) on the whole stack
⇒ Apply deep networks to almost any kind of signal, to tackle every pattern recognition / computer vision / AI task.
Handwriting recognition
Image credit: [puigcerver icdar 2017]
CNN/BLSTM/Dense architecture
- The CNN layers extract features
- The LSTM layers capture the input dependencies
- The LSTM has a 'blank' character (CTC) to deal with S ≠ E
- The dense layer classifies
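A hedged Keras sketch of such a stack (the layer sizes, the 80-character alphabet and the layer counts are assumptions, not the exact architecture of [puigcerver icdar 2017]); the CTC loss and decoding are omitted:

```python
from tensorflow import keras
from tensorflow.keras import layers

n_classes = 80                                   # alphabet size (assumption)
inputs = keras.Input(shape=(128, None, 1))       # text-line image: height 128, variable width
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D((2, 2))(x)
# Collapse the height axis so that the width axis becomes the time axis of the LSTM.
x = layers.Permute((2, 1, 3))(x)                 # (width, height, channels)
x = layers.Reshape((-1, 32 * 64))(x)             # one feature vector per horizontal position
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
outputs = layers.Dense(n_classes + 1, activation="softmax")(x)   # +1 for the CTC blank
model = keras.Model(inputs, outputs)             # trained with a CTC loss (not shown)
```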
Image Captioning (1)
Image credit: [XXX]
CNN/LSTM generative architecture
- The CNN layers extract features (pretrained on ImageNet)
- The LSTM layers generate the output signal (pretrained on a big text corpus)
Image Captioning (2)
Image credit: [deeplearningbook.org]
RNN model
- x (the image representation) is always the same
- the RNN "speaks" until it chooses the word "stop"
Image Captioning (3)
results from the Google system
Image Captioning (4)
results from the Microsoft system
Video Captioning
CNN/LSTM/LSTM model
- The CNN extracts features
- The first LSTM models the input dependencies and outputs a hidden representation of the video
- The second LSTM generates the output sequence (the caption) from this representation
Automatic translation
Question answering
Image credit: https://github.com/farizrahman4u/seq2seq
Question answering, chatbot
Summary of RNN applications
A good summary of RNN applications, from [TUM]:
A lot of successful applications (more or less necessary :) )
CageNet, based on I. Korshunova et al.: Fast Face-Swap Using Convolutional Neural Networks. ICCV 2017:
RNN with external memory - Introduction
Implicit vs. Explicit memory
Implicit memory: vanilla RNN
- Limited dependencies (vanishing gradient)
- Memory implicitly encoded in the weights (namely the recurrent ones)
Explicit memory: LSTM, GRU
- Able to deal with longer dependencies
- Memory implicitly encoded in the weights and explicitly in the hidden states
- Memory compressed into dense vectors:
  - Not compartmentalized
  - Numerical values
  - Still too small ...
⇒ Need for an external memory
Memory augmented neural networks
External memory
- Persistent: extremely long dependencies!
- Much bigger than the limited, encoded internal memory
- Can store not only numerical values
How to deal with an external memory?
- What to read/write? Where to read/write? When to read/write?
⇒ Let's learn it!
Turing Machines
A Turing Machine is made of:
- An infinite tape divided into cells that contain symbols
- A head that can read/write symbols on the tape, and move the tape
- A state register that stores the state of the Turing machine
- A finite instruction table:
  - Either erase or write a symbol
  - Move the head
  - Change the machine state
- A controller that chooses the instruction given the state and the current symbol
Image credit: http://www.storyofmathematics.com
Neural Turing Machines (1)
"Neural turing machines" A Graves, G Wayne, I Danihelka arXiv preprint arXiv:1410.5401
A NTM is a
differentiable Turing Machine
Made of 5 modules : The controller
The addressing module The reading module The writing module The memory
Neural Turing Machines (2)
Unfolded NTM, from https://medium.com/snips-ai/ntm-lasagne-a-library-for-neural-turing-machines-in-lasagne-2cdce6837315
$M_t$: memory at time t; $r_t$: read vector
Neural Turing Machines (3)
The memory:
$M_t$ is an N×M memory at time t. N: number of memory locations, M: vector size.
Every operation is differentiable:
Reading module: $\omega_t$ is the attention that tells where to look:
$r_t = \sum_i \omega_t(i)\, M_t(i)$
The writing module (= update the memory): erase $e_t$, then add $a_t$
- Erase: $\tilde{M}_t(i) = M_{t-1}(i)\,[1 - \omega_t(i)\, e_t]$
- Add: $M_t(i) = \tilde{M}_t(i) + \omega_t(i)\, a_t$
Addressing module for updating $\omega_t$: an attention-based mechanism
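A NumPy sketch of the differentiable read and write operations on the memory (shapes are illustrative; the row width is called W here to avoid clashing with the memory matrix M of the slide):

```python
import numpy as np

def read(M, w):
    """r_t = sum_i w_t(i) M_t(i): attention-weighted sum of the memory rows."""
    # M: (N, W) memory matrix, w: (N,) attention weights summing to 1
    return w @ M

def write(M, w, e, a):
    """Erase then add, both weighted by the attention w."""
    # e, a: (W,) erase and add vectors produced by the controller
    M_tilde = M * (1.0 - np.outer(w, e))    # M~_t(i) = M_{t-1}(i) [1 - w_t(i) e_t]
    return M_tilde + np.outer(w, a)         # M_t(i)  = M~_t(i) + w_t(i) a_t
```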
Memory augmented networks: summary
Related/similar works
- S. Sukhbaatar et al.: End-To-End Memory Networks. NIPS 2015: 2440-2448
- A. Kumar et al.: Ask Me Anything: Dynamic Memory Networks for NLP. ICML 2016: 1378-1387
Applications
- Memorization tasks
- Question/answering problems
Not perfectly mature, but strong potential for future tasks in AI
Bibliography
Some interesting pointers:
- "Understanding LSTM Networks", a very easy-to-understand page about LSTMs: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
- Holger Schwenk's talk about automatic translation at the Collège de France: https://www.college-de-france.fr/site/yann-lecun/
- A page by John Platt on the progress in automatic image captioning: https://blogs.technet.microsoft.com/machinelearning/2014/11/18/rapid-progress-in-automatic-image-captioning/
- The RNN chapter of the Deep Learning Book: http://www.deeplearningbook.org/contents/rnn.html
Practical work:
- Sentiment analysis: https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/
- Generating text: https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/
- Generating haikus: https://github.com/aenrichus/haiku_rnn
- Handwriting recognition: https://git.litislab.fr/TextRecognition/CTCModel