Recurrent Neural Networks
clement.chatelain@insa-rouen.fr
14 February 2018
Contents
1 Introduction
2 Recurrent neural networks: principles (Classical RNNs; Training recurrent neural nets)
3 RNN with internal memory (LSTM; BLSTM; MDLSTM; GRUs)
4 Attention-based RNN
5 RNN applications (Gradient recall; Legos)
6 RNN with external memory (Introduction)
Introduction
Introduction
Neural networks are cool:
State-of-the-art performance on almost every pattern recognition / computer vision / AI task
Efficient training algorithms (backprop), modular architectures
But DNNs & CNNs are static architectures
Estimation of a function $f : x \to y$, where:
$x^T = [x_1, x_2, \ldots, x_E] \in \mathbb{R}^E$, $y^T = [y_1, y_2, \ldots, y_S] \in \mathbb{R}^S$
Able to process multidimensional signals by flattening $x$, but require fixed-size $E$ and $S$
Unfortunately, most real-world signals are dynamic:
Text, speech, DNA sequences, handwriting, stock prices, etc.
CHAPTER I. Down the Rabbit-Hole
Alice was beginning to get very tired of sitting by her sister on the bank, and of ...
Dynamic Machine learning Problems
Many machine learning problems involve dynamic input and/or output signals.
Static input: feature sets, images with a fixed, defined size, etc.
Static output: mono/multidimensional classification problems, mono/multidimensional regression problems
Dynamic input: sequences of feature sets, variable-size signals
Dynamic output: sequences of labels/values, mono- or multidimensional
Static input / Static output problems
MNIST: 784 inputs (28²), 10 output classes
ImageNet: 256² input pixels, 5247 output classes
Dynamic input / Static output problems
Sentiment classification: positive or negative?
"This is by far the worst hotel experience i’ve ever had.
the owner overbooked while i was staying there (even though i booked the room two months in advance) and made me move to another room, but that room wasnt even a hotel room!"
"We enjoyed our stay very much."
Variable-size image classification (example images):
→ ”chocolate”   → ”strawberry”
And also: sentence classification, etc.
Static input / Dynamic output problems
Caption generation
Object detection (VOC Challenge)
Dynamic Input / Dynamic Output problems
Handwriting/Speech recognition
→ ”Herr Bürgermaister Hans”
Need for dynamic models to process dynamic signals
Able to handle variable-size input/output signals
Able to model the dependencies between variables = the knowledge
"j'ai malencontrousement marché sur le chion de ma vosrre. cele-ci, tenant brauemryp à son chien, a éle parthaliérement afestée par l'acaident et me demande un dédammagement de taille"
(a noisy French handwriting-recognition output: knowledge of the language is needed to make sense of it)
Two kinds of models:
Graphical models: dependencies are explicitly modeled
Recurrent neural networks: dependencies are implicitly modeled
Non-neural sequence models: graphical models
Hidden Markov Models
- State of the art for sequence modeling in the 80's and 90's (still widely used)
- Generative model; NN/HMM hybrid models
Conditional Random Fields (CRF)
- Discriminative model [Lafferty 2001]
- Require E = S
- Suit labeling tasks
Hidden-CRF, Latent-Dynamic CRF
- Difficult to train
- Do not scale very well
HMM, CRF and HCRF
Recurrent neural networks: principles - Classical RNNs
Recurrence principle (1)
Handle dynamic signals
Signals are processed using a sliding window
Windows (AKA frames) are of a fixed size → they fit static NN architectures
Recurrence principle (2)
Signal dependencies = memory from previous time steps
How to model memory in the NN?
Recurrent connections: y(t) is computed using:
- x(t), the current input
- y(t−1), the output at the previous time step
Recurrent NN ≠ feedforward
Recurrence is generally a hidden-to-hidden link
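A minimal sketch of this recurrence in plain NumPy (sizes and names are illustrative, not from the slides): the same cell combines the current input with the output of the previous time step.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent step: the new output depends on the current
    input x(t) and on the output of the previous time step y(t-1)."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Illustrative sizes: E input features, H hidden (recurrent) units
E, H = 4, 8
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(E, H))   # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(H, H))   # hidden-to-hidden (recurrent) weights
b_h = np.zeros(H)

h = np.zeros(H)                             # empty memory at t = 0
x_t = rng.normal(size=E)                    # one input frame
h = rnn_step(x_t, h, W_xh, W_hh, b_h)       # y(t) from x(t) and y(t-1)
```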
Recurrence principle (3)
Recurrent neural networks may:
- Output a value at each time step ⇒ sequence labeling
- Output a unique value after having read an entire sequence ⇒ sequence classification
- Output a sequence whose size S differs from the input size E ⇒ sequence-to-sequence applications
Unfolded network (1)
Recurrence ↔ unfolded network
Signal from the recurrence = neuron output at the previous time step. Recurrence can be modeled by a duplication of the network fed with the data at t−1, which itself requires a duplication at t−2, etc.
Unfolded network (2)
Non-recurrent modeling of recurrent networks
A recurrent network R is approximated by a deeper non-recurrent network R*
Recurrence is limited:
- Processing a sequence of size T requires T duplications of the network
- Limited memory ⇒ T is limited to a certain value
Parameter sharing:
- The network's parameters are the same over all time steps
- Can be viewed as a form of regularization (as in CNNs)
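Unfolding over a whole sequence is then just a loop applying the same cell T times; the sketch below (again with illustrative sizes) makes the parameter sharing explicit: W_xh, W_hh and b_h are reused at every time step.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

T, E, H = 20, 4, 8                          # sequence length, input size, hidden size
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(E, H))
W_hh = rng.normal(scale=0.1, size=(H, H))
b_h = np.zeros(H)

X = rng.normal(size=(T, E))                 # the input sequence
h = np.zeros(H)
outputs = []
for t in range(T):                          # T duplications of the same network
    h = rnn_step(X[t], h, W_xh, W_hh, b_h)  # shared parameters at every step
    outputs.append(h)
```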
Unfolded network (3)
Temporal structure → spatial structure
Figure credit: http://www.deeplearningbook.org
Left: hidden-to-hidden recurrence; right: output-to-hidden recurrence.
L is the loss that compares the net output with the label at each time step.
Training recurrent neural networks
Backpropagation Through Time (BPTT)
Once unfolded, R* is a classical neural network without recurrence, so classical backprop can be carried out
But the network can be quite deep, depending on T ...
... and the vanishing gradient issues may (re)appear
For these reasons, RNNs were for a long time used only with small T, with limited success.
Let us consider these gradient problems [Pascanu2015]:
- Exploding gradient
- Vanishing gradient
Exploding Gradients
Exploding gradient: a large increase in the gradient norm due to long-term components
The solution(s)
- Old solutions: L1 and L2 regularization
- Echo State Networks [Lukosevicius2009]: recurrent weights are not learned but sampled from handcrafted distributions
- Recent solution for exploding gradients: cap (clip) the gradient norm [Pascanu2015], inspired by [Mikolov2011phd]
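A minimal sketch of this gradient-norm capping trick (the threshold value and the variable names are illustrative assumptions):

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Rescale the gradient so that its norm never exceeds the threshold,
    which prevents a single exploding update from destroying the weights."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad
```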
Vanishing Gradients
Vanishing gradient: the long-term error tends exponentially fast to norm 0
→ temporal correlations become impossible to learn
The solution(s)
- LSTM [Hochreiter97]: a special memory cell (see next section)
- Echo State Networks [Lukosevicius2009]: recurrent weights are not learned but sampled from handcrafted distributions
- Recent solution: "regularization that promotes parameter values such that back-propagated gradients neither increase or decrease too much in magnitude." [Pascanu 2015]
⇒ Using these tricks, recurrent architectures are now trained successfully on many applications, with T > 100.
Alternative to BPTT : RTRL
Real Time Recurrent Learning (RTRL) [Williams 1989]
Neuron j takes as input:
- All the outputs x(t) from the previous layer, weighted by $w_{je}$
- All the outputs y(t−1) from its own layer, weighted by $w_{jj'}$
then:
$y_j(t) = \varphi\left( \sum_{e=0}^{E} w_{je}\, x_e(t) + \sum_{j'=0}^{J} w_{jj'}\, y_{j'}(t-1) \right)$
Training by RTRL:
- Criterion $J = (y_j^d - y_j)^2$
- Compute $\frac{\partial J}{\partial w_{sj}}$, $\frac{\partial J}{\partial w_{je}}$ and $\frac{\partial J}{\partial w_{jj'}}$ to obtain the "classical" gradient
- Second order available
RNN with internal memory - LSTM
LSTM Cells (1)
Long Short Term Memory cells [Hochreiter & Schmidhuber 1997]
Idea: explicitly model the memory of previous observations.
Each neuron has:
- A memory cell that contains a real value
- 3 gates controlling the memory cell: input, output and forget gates
The gates prevent the network from suffering from gradient problems
LSTM Cells (2)
Input gate: $a_\iota^t = \sum_{i=1}^{I} w_{i\iota} x_i^t + \sum_{h=1}^{H} w_{h\iota} b_h^{t-1} + \sum_{c=1}^{C} w_{c\iota} s_c^{t-1}$, $\quad b_\iota^t = f(a_\iota^t)$
Forget gate: $a_\phi^t = \sum_{i=1}^{I} w_{i\phi} x_i^t + \sum_{h=1}^{H} w_{h\phi} b_h^{t-1} + \sum_{c=1}^{C} w_{c\phi} s_c^{t-1}$, $\quad b_\phi^t = f(a_\phi^t)$
Cell: $a_c^t = \sum_{i=1}^{I} w_{ic} x_i^t + \sum_{h=1}^{H} w_{hc} b_h^{t-1}$, $\quad s_c^t = b_\phi^t s_c^{t-1} + b_\iota^t g(a_c^t)$
Output gate: $a_\omega^t = \sum_{i=1}^{I} w_{i\omega} x_i^t + \sum_{h=1}^{H} w_{h\omega} b_h^{t-1} + \sum_{c=1}^{C} w_{c\omega} s_c^t$, $\quad b_\omega^t = f(a_\omega^t)$, and $b_c^t = b_\omega^t\, h(s_c^t)$
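A NumPy sketch of one time step following these equations, vectorised over C cells (here f is the logistic sigmoid and g, h are tanh, the usual choices; the diagonal peephole weights and the dict of weight matrices are illustrative assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, b_prev, s_prev, W):
    """x_t: inputs (I,), b_prev: previous cell outputs (C,), s_prev: previous cell states (C,).
    W holds the input/recurrent weight matrices and the per-cell peephole weights w_c*."""
    i = sigmoid(x_t @ W["xi"] + b_prev @ W["hi"] + W["ci"] * s_prev)  # input gate  b_iota
    f = sigmoid(x_t @ W["xf"] + b_prev @ W["hf"] + W["cf"] * s_prev)  # forget gate b_phi
    g = np.tanh(x_t @ W["xc"] + b_prev @ W["hc"])                     # cell input  g(a_c)
    s = f * s_prev + i * g                                            # new cell state s_c
    o = sigmoid(x_t @ W["xo"] + b_prev @ W["ho"] + W["co"] * s)       # output gate b_omega
    b = o * np.tanh(s)                                                # cell output b_c = b_omega h(s_c)
    return b, s
```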
LSTM Cells (3)
Long Short Term Memory properties
- Gates are sigmoids → differentiable
- Trained with gradients, in a BPTT fashion
- Most of the time: a few LSTM layers followed by a dense layer for classification
- More parameters than for a classical RNN ... but allows T > 1000!
Generative LSTMs
LSTMs (RNNs) can also be used in a generative way!
Standard LSTM architecture, where the net simply learns a mapping function from the current sequence to the next item.
For example: {'g', 'o', 'o'} → 'd'.
At decision time, the net is initialized with a seed = the past. Example from "Alice in Wonderland" (seed in blue):
But no mistake about it: it was neither more nor less than a pig, and she felt that it would be quit e aelin that she was a little want oe toiet ano a grtpersent to the tas a little war th tee the tase oa teettee the had been tinhgtt a little toiee at the cadl in a long tuiee aedun thet sheer was a little tare gereen to be a gentle of the tabdit ...
More difficult problem : Handwriting generation demo
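A sketch of this generation loop at decision time (the predict_next helper and the alphabet are hypothetical; predict_next stands for the trained LSTM's softmax over the next character):

```python
import numpy as np

def generate(predict_next, seed, alphabet, n_chars=200):
    """Start from the seed, then repeatedly sample the next character
    from the distribution predicted by the network and feed it back."""
    text = list(seed)
    for _ in range(n_chars):
        p = predict_next("".join(text))              # probabilities over the alphabet
        text.append(np.random.choice(list(alphabet), p=p))
    return "".join(text)
```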
Bidirectional-LSTM (BLSTM)
A novel connectionist system for unconstrained handwriting recognition. A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, J. Schmidhuber. IEEE PAMI 31 (5), 855-868, 2009
Bidirectional Long Short Term Memory
Sequences are processed from left to right and from right to left; the results are combined to take a local decision given the whole sequence
Training using Connectionist Temporal Classification (CTC) + BPTT. CTC does not require frame-labeled sequences
MDLSTM
A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. NIPS, 2009
Multi-Dimensional Long Short Term Memory
2D version of the BLSTM, suited to images. Images are processed in 4 directions. Can be generalized to more than 2D.
GRUs
J. Chung, C. Gulcehre, K. Cho, Y. Bengio. "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling". arXiv:1412.3555, 2014
LSTMs have too many parameters!
GRUs are a simplified version of the LSTM: the input and forget gates are merged into a single update gate, and there is no output gate
It has been empirically shown that they provide performance similar to the LSTM, with fewer parameters
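A quick way to see the saving, sketched with Keras (the sizes are arbitrary and the exact counts depend on the Keras version and layer options):

```python
from tensorflow import keras

inputs = keras.Input(shape=(None, 32))                     # sequences of 32-dim frames
lstm = keras.Model(inputs, keras.layers.LSTM(64)(inputs))
gru = keras.Model(inputs, keras.layers.GRU(64)(inputs))
print(lstm.count_params())   # 4 gate blocks: 4 * (32*64 + 64*64 + 64) = 24832
print(gru.count_params())    # 3 gate blocks: roughly 3/4 of the LSTM count
```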
Attention based RNN
Attention based RNN (1)
Classical RNN applied to signals
- Slides over 1D signals from left to right
- Requires annotation at the frame level
But the running direction is often not that clear:
- Inclined signals
- Signals that have to be located
- Non-trivial reading order
Attention based RNN (2)
Non-trivial running direction
Where to start? Where to stop? How to slide over the signal?
⇒ Let's learn it! (the generic answer to most ML problems :) )
The RNN should output at each time step:
- The desired output: classification or regression (as for a classical RNN)
- The next position to look at = the attention mechanism
Attention based RNN (3)
General decision scheme:
- Feature extraction on the whole signal: f
- The attention mechanism $\alpha_{(i,j),t}$ computes a mask over f
- The product of f and $\alpha$ feeds a standard RNN
Attention mechanism
- Attention-based location: $\alpha_{(i,j),t} = g(\alpha_{t-1}, c_{t-1})$
- Content-based location: $\alpha_{(i,j),t} = g(f, c_{t-1})$
- Both approaches are often combined: $\alpha_{(i,j),t} = g(f, \alpha_{t-1}, c_{t-1})$
g is generally a learned linear layer + $\varphi$:
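A NumPy sketch of one content-based attention step, with g as a learned linear map followed by a softmax (all shapes and names are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_step(f, c_prev, W_f, W_c, v):
    """f: (N, D) feature vectors over the signal, c_prev: (H,) previous RNN state.
    Returns the attention mask alpha over the N positions and the masked features."""
    scores = np.tanh(f @ W_f + c_prev @ W_c) @ v   # one score per position: g(f, c_{t-1})
    alpha = softmax(scores)                        # attention mask (sums to 1)
    context = alpha @ f                            # product of f and alpha, fed to the RNN
    return alpha, context
```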
RNN Applications - Gradient Recall
Gradient descent
Iterative algorithm: $W_{t=0}$ randomly initialized
- Good direction = the one that lowers the criterion J
- Good step length ($\eta$) = not too small, not too large
$W_{t+1} \leftarrow W_t - \eta \left.\frac{dJ(W)}{dW}\right|_{W_t}$
where: W: parameters; $\eta$: learning rate; $\left.\frac{dJ(W)}{dW}\right|_{W_t}$: the good direction
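The update in code (a one-line sketch; the learning rate value and the toy criterion are arbitrary):

```python
import numpy as np

def sgd_step(W, grad_J, eta=0.01):
    """W(t+1) = W(t) - eta * dJ/dW evaluated at W(t)."""
    return W - eta * grad_J

# Example on J(W) = W^2, whose gradient is 2W: the iterates shrink towards the minimum at 0.
W = np.array(2.0)
for _ in range(10):
    W = sgd_step(W, 2.0 * W, eta=0.1)
```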
Backprop Generalization (1)
(Y. LeCun, cf. the 12/02/2016 course: https://www.college-de-france.fr/site/yann-lecun/course-2016-02-12-14h30.htm)
Feedforward network:
Stacking layers $F_i$ that compute an output $H_i$ from the input $H_{i-1}$ and, for some layers, parameters $W_i$ or the desired output Y. Examples:
- Linear layer: $H_i = F_i(H_{i-1}, W_i) = W_i H_{i-1}$
- Activation layer (f: tanh, ReLU, Softmax, etc.): $H_i = F_i(H_{i-1}) = f(H_{i-1})$
- Loss layer, for example MSE
Shallow network (MLP):
Backprop Generalization (2)
Gradients and Backprop
Abstraction of the net layers. Gradient computation:
$\frac{\partial J}{\partial H_{i-1}} = \frac{\partial J}{\partial H_i} \times \frac{\partial H_i}{\partial H_{i-1}}$
For layers $F_i$ that have parameters, updating the parameters:
$\frac{\partial J}{\partial W_i} = \frac{\partial J}{\partial H_i} \times \frac{\partial H_i}{\partial W_i}$
These two equations are recursively applied from the output down to the input:
- the $\frac{\partial J}{\partial H_i}$ term (in blue on the slide) was computed at the previous step
Backprop Generalization (3)
Instantiation of classical $F_i$:
- Linear layer: $H_i = F_i(H_{i-1}, W_i) = W_i H_{i-1}$, therefore:
  $\frac{\partial H_i}{\partial H_{i-1}} = W_i$ ; $\frac{\partial H_i}{\partial W_i} = H_{i-1}$
- Activation layer: $H_i = F_i(H_{i-1}) = f(H_{i-1})$ with f: tanh, sigmoid, ReLU, Softmax, etc., therefore:
  $\frac{\partial H_i}{\partial H_{i-1}} = f'(H_{i-1})$
- MSE layer: $H_i = F_i(H_{i-1}, Y) = \|H_{i-1} - Y\|^2$, therefore:
  $\frac{\partial H_i}{\partial H_{i-1}} = 2 \times (H_{i-1} - Y)$
Backprop Generalization (4)
Generic backprop for an L-layer network:
foreach sample
  for i = L downto 1
    $\frac{\partial J}{\partial H_{i-1}} \leftarrow \frac{\partial J}{\partial H_i} \times \frac{\partial H_i}{\partial H_{i-1}}$
    $\frac{\partial J}{\partial W_i} \leftarrow \frac{\partial J}{\partial H_i} \times \frac{\partial H_i}{\partial W_i}$   // if necessary
  endfor
end foreach
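A minimal NumPy sketch of this modular scheme, with just a linear layer, a tanh layer and an MSE loss (batch size 1; everything here is illustrative):

```python
import numpy as np

class Linear:
    def __init__(self, n_in, n_out):
        self.W = np.random.randn(n_out, n_in) * 0.1
    def forward(self, H_prev):
        self.H_prev = H_prev
        return self.W @ H_prev                      # H_i = W_i H_{i-1}
    def backward(self, dJ_dH):
        self.dJ_dW = np.outer(dJ_dH, self.H_prev)   # dJ/dW_i = dJ/dH_i * dH_i/dW_i
        return self.W.T @ dJ_dH                     # dJ/dH_{i-1} = dJ/dH_i * dH_i/dH_{i-1}

class Tanh:
    def forward(self, H_prev):
        self.H = np.tanh(H_prev)
        return self.H
    def backward(self, dJ_dH):
        return dJ_dH * (1.0 - self.H ** 2)          # f'(H_{i-1}) for f = tanh

# Forward pass through the stack, then backprop from the loss down to the input.
layers = [Linear(4, 8), Tanh(), Linear(8, 2)]
x, y = np.random.randn(4), np.random.randn(2)
H = x
for layer in layers:
    H = layer.forward(H)
dJ_dH = 2.0 * (H - y)                               # gradient of the MSE loss
for layer in reversed(layers):
    dJ_dH = layer.backward(dJ_dH)
```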
Legos
Using this framework, modular architectures can be built
RNNs can easily be combined with dense layers, CNNs, etc.
A lot of "lego" architectures have been proposed
Modular architectures
Combination of:
- Dense layers (for decision purposes)
- Convolutional layers (to handle images)
- Recurrent layers (to handle time)
- etc.
How to play legos?
- Stack layers, whatever their nature
- Apply efficient training algorithms (backprop) on the whole stack
⇒ Apply deep networks to almost any kind of signal, to tackle every pattern recognition / computer vision / AI task.
Handwriting recognition
Image credit: [puigcerver icdar 2017]
CNN/BLSTM/Dense architecture
- The CNN layers extract features
- The LSTM layers capture the input dependencies
- The LSTM has a 'blank' character (CTC) to deal with S ≠ E
- The dense layer classifies
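A hedged Keras sketch of such a stack (the layer sizes, the 80-character alphabet and the layer counts are assumptions, not the exact architecture of [puigcerver icdar 2017]); the CTC loss and decoding are omitted:

```python
from tensorflow import keras
from tensorflow.keras import layers

n_classes = 80                                   # alphabet size (assumption)
inputs = keras.Input(shape=(128, None, 1))       # text-line image: height 128, variable width
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D((2, 2))(x)
# Collapse the height axis so that the width axis becomes the time axis of the LSTM.
x = layers.Permute((2, 1, 3))(x)                 # (width, height, channels)
x = layers.Reshape((-1, 32 * 64))(x)             # one feature vector per horizontal position
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
outputs = layers.Dense(n_classes + 1, activation="softmax")(x)   # +1 for the CTC blank
model = keras.Model(inputs, outputs)             # trained with a CTC loss (not shown)
```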
Image Captioning (1)
Image credit: [XXX]
CNN/LSTM generative architecture
- The CNN layers extract features (pretrained on ImageNet)
- The LSTM layers generate the output signal (pretrained on a big text corpus)
Image Captioning (2)
Image credit: [deeplearningbook.org]
RNN model
- x (the image representation) is always the same
- the RNN "speaks" until it chooses the word "stop"
Image Captioning (3)
results from the Google system
Image Captioning (4)
results from the Microsoft system
Video Captioning
CNN/LSTM/LSTM model
- The CNN extracts features
- The first LSTM models the input dependencies and outputs a hidden representation of the video
- The second LSTM generates the output sequence (the caption) from this representation
Automatic translation
Question answering
Image credit: https://github.com/farizrahman4u/seq2seq
Question answering, chatbot
Summary of RNN applications
A good summary of RNN applications, from [TUM]:
A lot of successful applications (more or less necessary :) )
CageNet, based on I. Korshunova et al.: Fast Face-Swap Using Convolutional Neural Networks. ICCV 2017:
RNN with external memory - Introduction
Implicit vs. Explicit memory
Implicit memory: vanilla RNN
- Limited dependencies (vanishing gradient)
- Memory implicitly encoded in the weights (namely the recurrent ones)
Explicit memory: LSTM, GRU
- Able to deal with longer dependencies
- Memory implicitly encoded in the weights and explicitly in the hidden states
- Memory compressed into dense vectors:
  - Not compartmentalized
  - Numerical values
  - Still too small ...
⇒ Need for an external memory
Memory augmented neural networks
External memory
- Persistent: extremely long dependencies!
- Much bigger than the limited, encoded internal memory
- Can store not only numerical values
How to deal with an external memory?
- What to read/write? Where to read/write? When to read/write?
⇒ Let's learn it!
Turing Machines
A Turing Machine is made of:
- An infinite tape divided into cells that contain symbols
- A head that can read/write symbols on the tape, and move the tape
- A state register that stores the state of the Turing machine
- A finite instruction table:
  - Either erase or write a symbol
  - Move the head
  - Change the machine state
- A controller that chooses the instruction given the state and the current symbol
Image credit: http://www.storyofmathematics.com
Neural Turing Machines (1)
"Neural turing machines" A Graves, G Wayne, I Danihelka arXiv preprint arXiv:1410.5401
A NTM is a
differentiable Turing Machine
Made of 5 modules : The controller
The addressing module The reading module The writing module The memory
Neural Turing Machines (2)
Unfolded NTM, from https://medium.com/snips-ai/ntm-lasagne-a-library-for-neural-turing-machines-in-lasagne-2cdce6837315
$M_t$: memory at time t; $r_t$: read vector
Neural Turing Machines (3)
The memory:
$M_t$ is an N×M memory at time t. N: number of memory locations, M: vector size.
Every operation is differentiable:
Reading module: $\omega_t$ is the attention that tells where to look:
$r_t = \sum_i \omega_t(i)\, M_t(i)$
The writing module (= update the memory): erase $e_t$, then add $a_t$
- Erase: $\tilde{M}_t(i) = M_{t-1}(i)\,[1 - \omega_t(i)\, e_t]$
- Add: $M_t(i) = \tilde{M}_t(i) + \omega_t(i)\, a_t$
Addressing module for updating $\omega_t$: an attention-based mechanism
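A NumPy sketch of the differentiable read and write operations on the memory (shapes are illustrative; the row width is called W here to avoid clashing with the memory matrix M of the slide):

```python
import numpy as np

def read(M, w):
    """r_t = sum_i w_t(i) M_t(i): attention-weighted sum of the memory rows."""
    # M: (N, W) memory matrix, w: (N,) attention weights summing to 1
    return w @ M

def write(M, w, e, a):
    """Erase then add, both weighted by the attention w."""
    # e, a: (W,) erase and add vectors produced by the controller
    M_tilde = M * (1.0 - np.outer(w, e))    # M~_t(i) = M_{t-1}(i) [1 - w_t(i) e_t]
    return M_tilde + np.outer(w, a)         # M_t(i)  = M~_t(i) + w_t(i) a_t
```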
Memory augmented networks: summary
Related/similar works
- S. Sukhbaatar et al.: End-To-End Memory Networks. NIPS 2015: 2440-2448
- A. Kumar et al.: Ask Me Anything: Dynamic Memory Networks for NLP. ICML 2016: 1378-1387
Applications
- Memorization tasks
- Question/answering problems
Not perfectly mature, but strong potential for future tasks in AI
Bibliography
Some interesting pointers:
- "Understanding LSTM Networks", a very easy-to-understand page about LSTMs: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
- Holger Schwenk's talk about automatic translation at the Collège de France: https://www.college-de-france.fr/site/yann-lecun/
- A page by John Platt on the progress in automatic image captioning: https://blogs.technet.microsoft.com/machinelearning/2014/11/18/rapid-progress-in-automatic-image-captioning/
- The RNN chapter of the Deep Learning Book: http://www.deeplearningbook.org/contents/rnn.html
Practical work:
- Sentiment analysis: https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/
- Generating text: https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/
- Generating haikus: https://github.com/aenrichus/haiku_rnn
- Handwriting recognition: https://git.litislab.fr/TextRecognition/CTCModel