Statistical Learning: Decision Modeling and Deep Learning (RCP209)
Neural Networks and Deep Learning: Recurrent Neural Networks
Nicolas Thome Prenom.Nom@cnam.fr
http://cedric.cnam.fr/vertigo/Cours/ml2/
Computer Science Department
Conservatoire National des Arts et Métiers (Cnam)
Outline
1 Recurrent Neural Networks
2 Training RNNs
3 Applications of RNNs for Natural Language Processing (NLP)
Motivation: Sequence processing
● Sequence: input $\{x_t\}_{t \in \{1;T\}}$
● Prediction $f(x_t)$ depends on $f(x_{t'})$ for $t' \le t$
Examples: text, time series, DNA, social media, etc.
nicolas.thome@cnam.fr RCP209 / RNNs 1/ 37
Sequence processing: options
● Fully connected network (FCN) on a localized window of size $L$
Sequence processing with FCNs: limitations
● ⊖ Increasing $L$ ⇒ explosion of the number of parameters
● ⊖ Independent decisions between time steps
● ⊖ Cannot handle variable length $L$
Sequence processing: options
● Convolutional neural network (ConvNet) on a localized window of size $L$
Sequence processing with ConvNets
● ⊕ More compact than FCNs, locality, stability (see ConvNet course)
● ⊖ Cannot handle variable length $L$, unless resorting to global pooling, which may be arbitrary
Sequence processing: options
● Structured prediction: explicit modeling of the dependencies between $f(x_t)$ and $f(x_{t'})$, $t' \le t$
● Markov models (generative, $P(x,y)$), Conditional Random Fields (CRF, discriminative) [LMP01]
Sequence processing with CRFs
● ⊕ Can handle variable length $L$
● ⊖ Limited to linear predictors
● ⊖ Complex inference procedure: exact solutions require simplifying assumptions, e.g. Markovian assumption: $f(x_T \mid x_t, t < T) = f(x_T \mid x_{T-1})$
Recurrent Neural Networks (RNNs) [Elm90]
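● Input sequence $\{x_t\}_{t \in \{1;T\}}$, $x_t \in \mathbb{R}^d$
● Internal RNN state $\{h_t\}_{t \in \{1;T\}}$, $h_t \in \mathbb{R}^l$
● RNN cell: $h_t = \phi_t(x_t, h_{t-1})$
Loop: $h_t$ depends on the current input $x_t$ and the previous state $h_{t-1}$
$h_t$: memory of the network ⇔ history up to time $t$
In RNNs, the function $\phi_t = \phi$ is shared across time
[Figure: recurrent RNN view and unfolded RNN view]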
● RNN cell: $h_t = \phi(x_t, h_{t-1})$
$\phi$: linear projection of $x_t$ and $h_{t-1}$, i.e. fully connected layers: $h_t = f(U x_t + W h_{t-1} + b_h)$
$U$: matrix of size $l \times d$, $W$: matrix of size $l \times l$, $b_h$: vector of size $l$
$f$ ← tanh non-linearity
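The recurrence $h_t = f(U x_t + W h_{t-1} + b_h)$ can be sketched in NumPy. This is only an illustration, not the course implementation: the function names `rnn_step` and `rnn_forward`, the random initialization and the zero initial state $h_0$ are assumptions.

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, b_h):
    """One cell update: h_t = tanh(U x_t + W h_{t-1} + b_h)."""
    return np.tanh(U @ x_t + W @ h_prev + b_h)

def rnn_forward(xs, U, W, b_h, h0=None):
    """Apply the same cell (shared U, W, b_h) to every element of the sequence."""
    l = W.shape[0]
    h = np.zeros(l) if h0 is None else h0  # assumption: h_0 = 0
    hs = []
    for x_t in xs:                         # works for any sequence length T
        h = rnn_step(x_t, h, U, W, b_h)
        hs.append(h)
    return np.stack(hs)                    # shape (T, l)

rng = np.random.default_rng(0)
d, l, T = 3, 4, 5                          # input size, state size, sequence length
U = rng.normal(scale=0.1, size=(l, d))
W = rng.normal(scale=0.1, size=(l, l))
b_h = np.zeros(l)
xs = rng.normal(size=(T, d))
hs = rnn_forward(xs, U, W, b_h)            # one state h_t per time step
```

Since the parameters are shared across time, the same `rnn_forward` call accepts sequences of any length $T$ without changing the number of parameters.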
● At each time step $t$, RNN output: $y_t = f'(V h_t + b_y)$
$f'$ ← soft-max if $y_t$ ↔ class probabilities
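As an illustration of the output layer $y_t = f'(V h_t + b_y)$ with a soft-max, here is a minimal NumPy sketch. The names `softmax` and `rnn_output`, the sizes and the random values are assumptions made for the example.

```python
import numpy as np

def softmax(z):
    """Numerically stable soft-max."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def rnn_output(h_t, V, b_y):
    """y_t = softmax(V h_t + b_y): class probabilities at time t."""
    return softmax(V @ h_t + b_y)

rng = np.random.default_rng(0)
l, n_classes = 4, 3                  # state size and number of classes (arbitrary)
V = rng.normal(size=(n_classes, l))
b_y = np.zeros(n_classes)
h_t = np.tanh(rng.normal(size=l))    # stand-in for a state produced by the RNN cell
y_t = rnn_output(h_t, V, b_y)
```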
RNNs modeling power
● Recap: feed-forward neural networks are universal function approximators [Cyb89]
● Expressivity of the mapping between $\{x_t\}_{t \in \{1;T\}}$ and $\{y_t\}_{t \in \{1;T\}}$?
RNNs are universal program approximators [SS95]
Can approximate any computable function, i.e. any Turing machine
RNNs can approximate any measurable sequence-to-sequence mapping [Ham00]
Example: Computing sum with RNNs
Example: Comparing dimension sums with RNNs
● Determining whether the sum of the values of the first dimension is greater than the sum of the values of the second dimension
Solution: compute dim1 − dim2 at each time step, then sum over time
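The "dim1 − dim2, then sum" idea can be written as an RNN with hand-set weights. The sketch below assumes a single hidden unit, an identity activation instead of tanh, and no bias; these choices and the name `compare_dimension_sums` are illustrative, not taken from the slides.

```python
import numpy as np

def compare_dimension_sums(xs):
    """RNN with hand-set weights: h_t = U x_t + W h_{t-1} accumulates dim1 - dim2."""
    U = np.array([[1.0, -1.0]])   # computes dim1 - dim2 for the current input
    W = np.array([[1.0]])         # copies the previous state: running sum
    h = np.zeros(1)
    for x_t in xs:
        h = U @ x_t + W @ h       # identity activation (assumption)
    return bool(h[0] > 0), float(h[0])

xs = np.array([[1.0, 0.5],
               [0.2, 0.9],
               [2.0, 1.0]])
greater, diff = compare_dimension_sums(xs)   # sums are 3.2 and 2.4
```

The final state equals the difference between the two sums, so thresholding it at 0 gives the decision.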
Training RNNs
● RNN: mapping an input sequence $\{x_t\}_{t \in \{1;T\}}$ into $\{y_t\}_{t \in \{1;T\}}$
● Different tasks ⇔ different mappings
many-to-one: sentiment classification, text generation (practical session), time series forecasting, VQA (next course)
one-to-many: image captioning (next course)
many-to-many (parallel): char-nn (predict next character)
many-to-many (sub-part): machine translation (text2text, speech2text), video classification (frame level)
Training RNNs: Formulation
● Comparing the output predictions $\{y_t\}_{t \in \{1;T\}}$ with the supervision $\{y^*_t\}$
Task-dependent, e.g. only $y^*_T$ in many-to-one
● Loss function at time $t$: $L_t(y_t, y^*_t)$, e.g. cross-entropy (classification)
● Total loss function: $L(\{y_t\}, \{y^*_t\}) = \sum_{t=1}^{T} L_t(y_t, y^*_t)$
Credit: Fei-Fei
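A small sketch of the total loss as a sum of per-step cross-entropies, together with the many-to-one case where only the last step is supervised. The predicted probabilities and target class indices below are made-up toy values, and the function names are assumptions.

```python
import numpy as np

def cross_entropy(y_t, target_t):
    """L_t: negative log-probability assigned to the target class index."""
    return -np.log(y_t[target_t])

def total_loss(ys, targets):
    """L = sum over t of L_t(y_t, y*_t)."""
    return sum(cross_entropy(y_t, c) for y_t, c in zip(ys, targets))

# Toy predictions for T = 3 time steps and 3 classes (each row sums to 1)
ys = np.array([[0.7, 0.2, 0.1],
               [0.1, 0.8, 0.1],
               [0.3, 0.3, 0.4]])
targets = [0, 1, 2]

loss = total_loss(ys, targets)                         # supervision at every step
many_to_one_loss = cross_entropy(ys[-1], targets[-1])  # supervision only at t = T
```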
Back-Propagation Through Time (BPTT)
Credit: Fei-Fei
Truncated BPTT
Credit: Fei-Fei
BPTT: Gradient Computation
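● BPTT: computing the gradients $\frac{\partial L_t}{\partial W}$, $\frac{\partial L_t}{\partial U}$, $\frac{\partial L_t}{\partial V}$ (+ biases)
● Unfolded RNN: same spirit as back-propagation in fully connected networks (chain rule)
BUT: parameters $W$, $U$, $V$ are shared across time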
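● Shared parameters $W$, $U$, $V$ across time
⇒ gradients depend on the whole past history
● Ex: for $W$: $\frac{\partial L_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W}$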
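The sum over $k$ in the gradient with respect to $W$ can be checked numerically on a tiny network. The sketch below makes several simplifying assumptions that are not in the slides: no biases, a linear output $y_t = V h_t$, a squared-error loss on the last step only, and $h_0 = 0$. Each loop iteration adds the term for one value of $k$, and the result is compared with finite differences.

```python
import numpy as np

def forward(xs, U, W, V):
    hs = [np.zeros(W.shape[0])]              # h_0 = 0 (assumption)
    for x_t in xs:
        hs.append(np.tanh(U @ x_t + W @ hs[-1]))
    return hs, V @ hs[-1]                    # states h_0..h_t and output y_t

def loss_fn(xs, U, W, V, target):
    _, y = forward(xs, U, W, V)
    return 0.5 * np.sum((y - target) ** 2)

def bptt_grad_W(xs, U, W, V, target):
    hs, y = forward(xs, U, W, V)
    delta = V.T @ (y - target)               # dL_t/dy_t * dy_t/dh_t
    grad_W = np.zeros_like(W)
    for k in range(len(xs), 0, -1):          # k = t, t-1, ..., 1
        a = delta * (1.0 - hs[k] ** 2)       # back through tanh at step k
        grad_W += np.outer(a, hs[k - 1])     # term k: (dL_t/dh_k) * (dh_k/dW)
        delta = W.T @ a                      # propagate to h_{k-1}
    return grad_W

def numerical_grad_W(xs, U, W, V, target, eps=1e-6):
    g = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += eps
            Wm[i, j] -= eps
            g[i, j] = (loss_fn(xs, U, Wp, V, target)
                       - loss_fn(xs, U, Wm, V, target)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
d, l, m, t = 2, 3, 2, 6
U = rng.normal(size=(l, d))
W = rng.normal(scale=0.5, size=(l, l))
V = rng.normal(size=(m, l))
xs, target = rng.normal(size=(t, d)), rng.normal(size=m)
g_bptt = bptt_grad_W(xs, U, W, V, target)
g_num = numerical_grad_W(xs, U, W, V, target)
```

The repeated multiplication by `W.T` inside the loop is the factor $\frac{\partial h_t}{\partial h_k}$, which is where the dependence on the whole past history comes from.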
BPTT: Optimization
● First order methods: RMSProp [TH12], Adam [KB15]
● Second order methods: Hessian-free [MS11]
● Vanishing gradient issues ⇒ next course
BPTT: Bayesian Dropout [GG16]
Applications
● RNNs: state of the art for many sequence processing applications
Applications: RNN for text processing
Deep NLP strategy:
1 Extract the "tokens" from the text input, e.g. characters or words
2 One-hot encoding of tokens
3 Split the text into a "temporal" sequence
4 RNN to model the temporal structure
Option: use an embedding layer on top of the one-hot encoding
NLP and Representation Learning
● Text: extract "tokens", i.e. raw inputs
● How to represent raw inputs, e.g. characters, words, sentences?
● Handcrafted vs learned representations ⇒ "deep embeddings"
One-hot Representations
● Simplest encoding of text inputs: one-hot representation
● Binary vector of vocabulary size $|V|$, with a 1 at the index of the term
● $|V|$ small for chars ($\sim 10$), large for words ($\sim 10^4$), huge for sentences
● Basis for constructing Bag of Words (BoW) models
Handcrafted features used with shallow ML models, e.g. kernel methods
Still very competitive for some NLP tasks, e.g. text topic classification
Can be extended to (bags of) bi-grams, e.g. for language identification
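One-hot encoding and the resulting Bag of Words vector can be sketched as follows. The five-word vocabulary and the example sentence are toy assumptions; a real vocabulary would be built from the corpus.

```python
import numpy as np

vocab = ["the", "hotel", "motel", "is", "nice"]   # toy vocabulary (assumption)
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Binary vector of size |V| with a single 1 at the index of the word."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

def bag_of_words(tokens):
    """BoW vector: sum of the one-hot vectors, i.e. word counts (order is lost)."""
    return sum(one_hot(w) for w in tokens)

bow = bag_of_words("the hotel is nice the motel is nice".split())
```

The BoW vector has the same size $|V|$ whatever the length of the text, which is what makes it usable with shallow models.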
Beyond one-hot Representations
● Limitation: $\langle r(\text{"motel"}), r(\text{"hotel"}) \rangle = 0$
● Text embedding motivation: extract a representation reflecting semantic similarities between text primitives ("tokens")
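The limitation can be made concrete: any two distinct one-hot vectors are orthogonal, whereas dense vectors can encode graded similarity. In the sketch below, the 2-d embedding values are hand-picked for illustration only (hypothetical, not learned from data).

```python
import numpy as np

vocab = ["hotel", "motel", "banana"]
one_hot = np.eye(len(vocab))
r_hotel, r_motel = one_hot[0], one_hot[1]
dot_one_hot = float(r_hotel @ r_motel)     # always 0 for two different words

# Hypothetical dense embeddings, chosen by hand so that hotel and motel are close
E = np.array([[0.9, 0.1],     # hotel
              [0.8, 0.2],     # motel
              [-0.1, 0.95]])  # banana

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_hotel_motel = cosine(E[0], E[1])
sim_hotel_banana = cosine(E[0], E[2])
```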
Word embeddings
● Learn a mapping from the one-hot encoding to a smaller vector space
● General idea: representing a word by means of its neighbors
● Distributional Hypothesis: one of the most successful ideas of modern statistical NLP
● Different approaches: Word2vec [MSC+13], GloVe [PSM14], BERT [DCLT18]
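A mapping from one-hot vectors to a smaller space is a matrix product, which reduces to selecting rows of the embedding matrix. The sketch below uses a random matrix `E` as a stand-in for learned embeddings; sizes and token indices are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 6, 3
E = rng.normal(size=(vocab_size, emb_dim))   # random here, learned in practice

token_ids = np.array([4, 0, 2])              # a sequence of 3 token indices
one_hots = np.eye(vocab_size)[token_ids]     # shape (3, vocab_size)

via_matmul = one_hots @ E                    # linear layer applied to one-hot inputs
via_lookup = E[token_ids]                    # equivalent row lookup, no large product
```

This equivalence is why embedding layers are usually implemented as table lookups rather than as dense layers on one-hot inputs.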
Applications: Many-to-one
● Sentiment classification
Input: "At first exciting but over sentimental" (Token ⇔ word)
Output: -60 = Bad review
Applications: Many-to-one
● Visual Question Answering ⇒ next week
Token ⇔ word
Applications: One to Many
● Image captioning ⇒ next week
Token ⇔ word
[KL15]
Applications: Sequence Generation
● Text (or music) generation, e.g. Char-nn
Input: sequence of characters (Token ⇔ char)
Output: next character
● Many-to-many parallel
● In practice, trained with truncated BPTT (TBPTT) ⇒ use many-to-one (to obtain a prediction from a sufficiently long sequence): predict the next character from the previous $K$ characters
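The many-to-one formulation amounts to sliding a window of $K$ characters over the text and using the following character as the target. A minimal sketch, where the function name `make_training_pairs` and the example string are assumptions:

```python
def make_training_pairs(text, K):
    """Return (previous K characters, next character) pairs for many-to-one training."""
    return [(text[i:i + K], text[i + K]) for i in range(len(text) - K)]

pairs = make_training_pairs("hello world", K=4)
first_context, first_target = pairs[0]    # "hell" -> "o"
last_context, last_target = pairs[-1]     # "worl" -> "d"
```

In a full pipeline, each character of the context would then be one-hot encoded before being fed to the RNN.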
Text Generation - Char-nn
● Char-nn: applied to raw text
Char-nn: learns to correctly spell a given language, although capturing the semantic meaning of sentences is more challenging
Capacity to learn structural/syntactical rules of a language
⇒ applications to generating structured text and source code, e.g. Wikipedia pages, XML, LaTeX, Linux source code (C), etc.
See here for other examples
Practical session: text generation with char-nn
● Poetry generator, trained on "Les fleurs du mal"
● Training: many-to-one with RNNs
● Generation (inference): start from an input sequence, e.g. "souvent, p"
1 Predict the next character
2 Shift the input (by 1 char), and go back to 1
Side note: if a prediction fails, misalignment between train and test ("exposure bias")
To add stochasticity, sample according to the softmax probabilities
Use temperature scaling to control the stochasticity level
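The generation loop with temperature sampling can be sketched as below. This is not the practical session code: `dummy_model` is a hypothetical stand-in for a trained network that returns softmax probabilities, and the three-letter alphabet is a toy assumption. Temperature is applied here by dividing the log-probabilities by the temperature and renormalizing, which is one common way to do it.

```python
import numpy as np

def sample_with_temperature(probs, temperature, rng):
    """Rescale probabilities by a temperature, then sample an index."""
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()
    p = np.exp(logits)
    p /= p.sum()
    return int(rng.choice(len(p), p=p)), p

def generate(predict_proba, seed, n_chars, alphabet, temperature, rng):
    """Predict the next char, append it, shift the input window, repeat."""
    text, K = seed, len(seed)
    for _ in range(n_chars):
        probs = predict_proba(text[-K:])          # last K characters as input
        idx, _ = sample_with_temperature(probs, temperature, rng)
        text += alphabet[idx]
    return text

alphabet = "abc"

def dummy_model(context):
    """Hypothetical model: ignores the context, returns a fixed distribution."""
    return np.array([0.6, 0.3, 0.1])

rng = np.random.default_rng(0)
out = generate(dummy_model, "ab", 5, alphabet, temperature=1.0, rng=rng)
_, p_cold = sample_with_temperature(np.array([0.6, 0.3, 0.1]), 0.1, rng)
_, p_hot = sample_with_temperature(np.array([0.6, 0.3, 0.1]), 10.0, rng)
```

A low temperature concentrates the distribution on the most likely character (close to greedy decoding), while a high temperature flattens it toward uniform sampling.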
Applications: Many to Many
● Many to Many↔Seq2Seq models
● Machine translation text2text
Input: "The agreement on the European Economic Area was signed in August 1992."
Output: "L’accord sur la zone économique européenne a été signé en août 1992."
[BCB14] [OC16]
Many to Many - Machine translation speech2text
● Machine translation speech2text
Input: audio mp3 (speech utterance)
Output: "How much would a woodchuck chuck"
[CJLV15] [OC16]
Attention Mechanisms
● Used to focus the analysis of a sequence on some specific inputs
Examples: translation, image captioning, VQA
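One simple way to "focus on some specific inputs" is to weight each input state by a soft-max over similarity scores with a query vector. The slides do not specify the scoring function; the sketch below assumes plain dot-product scores, and the function name `attend` and the toy values are illustrative.

```python
import numpy as np

def attend(query, states):
    """Weight each input state by its soft-max-normalized dot-product score."""
    scores = states @ query                  # one score per input position
    scores = scores - scores.max()           # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ states               # weighted average of the states
    return weights, context

states = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [5.0, 0.0]])              # toy states for 3 input positions
query = np.array([1.0, 0.0])
weights, context = attend(query, states)     # most weight goes to the third input
```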
References I
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, Neural machine translation by jointly learning to align and translate, CoRR abs/1409.0473 (2014).
William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals, Listen, attend and spell, CoRR abs/1508.01211 (2015).
George Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals and Systems 2 (1989), no. 4, 303–314.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018).
Jeffrey L. Elman, Finding structure in time, Cognitive Science 14 (1990), no. 2, 179–211.
Yarin Gal and Zoubin Ghahramani, A theoretically grounded application of dropout in recurrent neural networks, Proceedings of the 30th International Conference on Neural Information Processing Systems (USA), NIPS'16, Curran Associates Inc., 2016, pp. 1027–1035.
Barbara Hammer, On the approximation capability of recurrent neural networks, Neurocomputing 31 (2000), no. 1-4, 107–123.
Andrej Karpathy, The unreasonable effectiveness of recurrent neural networks, 2015.
Diederik P. Kingma and Jimmy Ba, Adam: A method for stochastic optimization, ICLR, vol. abs/1412.6980, 2015.
Andrej Karpathy and Fei-Fei Li, Deep visual-semantic alignments for generating image descriptions, IEEE
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, Proceedings of the Eighteenth International Conference on Machine Learning (San Francisco, CA, USA), ICML '01, Morgan Kaufmann Publishers Inc., 2001, pp. 282–289.
James Martens and Ilya Sutskever, Learning recurrent neural networks with Hessian-free optimization, Proceedings of the 28th International Conference on Machine Learning (USA), ICML'11, Omnipress, 2011, pp. 1033–1040.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems 26 (C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, eds.), Curran Associates, Inc., 2013, pp. 3111–3119.
Chris Olah and Shan Carter, Attention and augmented recurrent neural networks, Distill (2016).
Jeffrey Pennington, Richard Socher, and Christopher D. Manning, GloVe: Global vectors for word representation, EMNLP, 2014.
H. T. Siegelmann and E. D. Sontag, On the computational power of neural nets, J. Comput. Syst. Sci. 50 (1995), no. 1, 132–150.
T. Tieleman and G. Hinton, RMSprop Gradient Optimization.