Statistical Learning: Decision Modeling and Deep Learning (RCP209)

Neural Networks and Deep Learning: Recurrent Neural Networks

(1)

Nicolas Thome Prenom.Nom@cnam.fr

http://cedric.cnam.fr/vertigo/Cours/ml2/

Computer Science Department

Conservatoire National des Arts et Métiers (Cnam)

(2)

Outline

1 Recurrent Neural Networks

2 Training RNNs

3 Applications of RNNs for Natural Language Processing (NLP)

(3)

Motivation: Sequence processing

● Sequence: input $\{x_t\}_{t \in \{1,\dots,T\}}$

● Prediction $f(x_t)$ depends on $f(x_{t'})$ for $t' \le t$

Examples: text, time series, DNA, social media, etc.

nicolas.thome@cnam.fr RCP209 / RNNs

(4)

Sequence processing: options

● Fully connected network (FCN) on a localized window of size L

(5)

Sequence processing with FCNs: limitations

● ⊖ Increasing L ⇒ explosion in the number of parameters!

● ⊖ Independent decisions between time steps

● ⊖ Cannot handle variable length L

(6)

Sequence processing: options

● Convolutional neural network (ConvNet) on a localized window of size L

(7)

Sequence processing with ConvNets

● ⊕ More compact than FCNs, locality, stability (see ConvNet course)

● ⊖ Cannot handle variable length L, unless resorting to global pooling, which may be arbitrary

(8)

Sequence processing: options

● Structured prediction: explicit modeling of the dependencies between $f(x_t)$ and $f(x_{t'})$, $t' \le t$

● Markov models (generative, $P(x, y)$), Conditional Random Fields (CRF, discriminative) [LMP01]

(9)

Sequence processing with CRFs

● ⊕ Can handle variable length L

● ⊖ Limited to linear predictors

● ⊖ Complex inference procedure: tractable solutions require simplifying assumptions, e.g. the Markovian assumption $f(x_T \mid x_t, t < T) = f(x_T \mid x_{T-1})$

(10)

Recurrent Neural Networks (RNNs) [Elm90]

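● Input sequence $\{x_t\}_{t \in \{1,\dots,T\}}$, $x_t \in \mathbb{R}^d$

● Internal RNN state $\{h_t\}_{t \in \{1,\dots,T\}}$, $h_t \in \mathbb{R}^l$

● RNN cell: $h_t = \varphi_t(x_t, h_{t-1})$

Loop: $h_t$ depends on the current input $x_t$ and the previous state $h_{t-1}$

$h_t$: memory of the network ⇔ history up to time $t$

In RNNs, the function $\varphi_t = \varphi$ is shared across time

[Figure: recurrent RNN view and unfolded RNN view]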

(11)

Recurrent Neural Networks (RNNs) [Elm90]

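● RNN cell: $h_t = \varphi(x_t, h_{t-1})$

$\varphi$: linear projection of $x_t$ and $h_{t-1}$, i.e. fully connected layers: $h_t = f(U x_t + W h_{t-1} + b_h)$

$U$: matrix of size $l \times d$; $W$: matrix of size $l \times l$; $b_h$: vector of size $l$

$f \leftarrow \tanh$ non-linearity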

(12)

Recurrent Neural Networks (RNNs) [Elm90]

● At each time step $t$, RNN output: $y_t = f'(V h_t + b_y)$

$f' \leftarrow$ softmax if $y_t$ represents class probabilities
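The cell and output equations above can be sketched in NumPy as follows. This is a minimal illustration, not the course's reference implementation: the function names, the dimensions, the zero initial state $h_0$ and the random weights are all assumptions made for the example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs, U, W, b_h, V, b_y):
    """Apply h_t = tanh(U x_t + W h_{t-1} + b_h), y_t = softmax(V h_t + b_y).

    xs has shape (T, d). The initial state h_0 is set to zero (assumption).
    Returns the states (T, l) and the outputs (T, c).
    """
    h = np.zeros(W.shape[0])
    hs, ys = [], []
    for x_t in xs:  # the same U, W, V are reused at every time step
        h = np.tanh(U @ x_t + W @ h + b_h)
        hs.append(h)
        ys.append(softmax(V @ h + b_y))
    return np.array(hs), np.array(ys)

# Illustrative sizes: T time steps, input dim d, state dim l, c classes.
rng = np.random.default_rng(0)
T, d, l, c = 5, 3, 4, 2
xs = rng.normal(size=(T, d))
U = rng.normal(scale=0.1, size=(l, d))
W = rng.normal(scale=0.1, size=(l, l))
V = rng.normal(scale=0.1, size=(c, l))
b_h, b_y = np.zeros(l), np.zeros(c)

hs, ys = rnn_forward(xs, U, W, b_h, V, b_y)
print(hs.shape, ys.shape)
```

Because the loop only depends on the current input and the previous state, the same function handles sequences of any length T without changing the number of parameters.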

(13)

RNNs modeling power

● Recap: feed-forward neural networks are universal function approximators [Cyb89]

● Expressiveness of the mapping between $\{x_t\}_{t \in \{1,\dots,T\}}$ and $\{y_t\}_{t \in \{1,\dots,T\}}$?

RNNs are universal program approximators [SS95]

Can approximate any computable function, i.e. can simulate a Turing machine

RNNs can approximate any measurable sequence-to-sequence mapping [Ham00]

(14)

Example: Computing sum with RNNs

(15)

Example: Comparing dimension sum with RNNs

● Determine whether the sum of the values in the first dimension is greater than the sum of the values in the second dimension

Solution: compute dim1 − dim2 at each time step, then sum over time
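One way to realize this with the RNN cell defined earlier is to set the weights by hand. The sketch below assumes an identity activation instead of tanh (so that the state is an exact running sum) and a one-dimensional state; both choices are assumptions for illustration and are not taken from the slides.

```python
import numpy as np

def first_dim_sum_is_larger(xs):
    """Hand-crafted linear RNN: h_t = U x_t + W h_{t-1}.

    U = [1, -1] computes dim1 - dim2 for the current input,
    W = [1] carries the running sum forward unchanged.
    """
    U = np.array([[1.0, -1.0]])
    W = np.array([[1.0]])
    h = np.zeros(1)
    for x_t in xs:
        h = U @ x_t + W @ h  # identity activation (assumption)
    return bool(h[0] > 0)    # final decision read from the last state

xs = np.array([[1.0, 0.5],
               [0.2, 0.9],
               [2.0, 0.1]])  # dim1 sums to 3.2, dim2 sums to 1.5
print(first_dim_sum_is_larger(xs))
```

Setting U = [1] on a one-dimensional input gives the plain running sum in the same way.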

(16)

Outline

2 Training RNNs

(17)

Training RNNs

● RNN: mapping an input sequence $\{x_t\}_{t \in \{1,\dots,T\}}$ to $\{y_t\}_{t \in \{1,\dots,T\}}$

● Different tasks ⇔ different mappings

many-to-one: sentiment classification, text generation (practical session), time series forecasting, VQA (next course)

one-to-many: image captioning (next course)

many-to-many parallel: char-nn (predict next character)

many-to-many (sub-part): machine translation (text2text, speech2text), video classification (frame level)

(18)

Training RNNs: Formulation

● Comparing the output predictions $\{y_t\}_{t \in \{1,\dots,T\}}$ with the supervision $\{y^*_t\}$

Task-dependent, e.g. only $y^*_T$ in many-to-one

(19)

Training RNNs: Formulation

● Loss function at time $t$: $L^t(y_t, y^*_t)$, e.g. cross-entropy (classification)

● Total loss function: $L(\{y_t\}, \{y^*_t\}) = \sum_{t=1}^{T} L^t(y_t, y^*_t)$

Credit: Fei-Fei
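As a small worked example of the total loss, the sketch below sums a per-step cross-entropy over time, and also shows the many-to-one case where only the last step is supervised. The function names and the toy probabilities are illustrative assumptions.

```python
import numpy as np

def cross_entropy(y_t, target_index):
    """L^t: negative log-probability assigned to the correct class."""
    return -np.log(y_t[target_index])

def total_loss(ys, targets):
    """L = sum over t of L^t(y_t, y*_t), with supervision at every step."""
    return sum(cross_entropy(y_t, k) for y_t, k in zip(ys, targets))

def many_to_one_loss(ys, last_target):
    """Only the final prediction y_T is compared with y*_T."""
    return cross_entropy(ys[-1], last_target)

# Toy predictions for T = 2 steps and 2 classes.
ys = np.array([[0.5, 0.5],
               [0.25, 0.75]])
targets = [0, 1]

print(total_loss(ys, targets))           # -log(0.5) - log(0.75)
print(many_to_one_loss(ys, targets[-1])) # -log(0.75)
```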

(20)

Back-Propagation Through Time (BPTT)

Credit: Fei-Fei

(21)

Truncated BPTT

Credit: Fei-Fei

(22)

BPTT: Gradient Computation

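● BPTT: computing the gradients $\frac{\partial L^t}{\partial W}$, $\frac{\partial L^t}{\partial U}$, $\frac{\partial L^t}{\partial V}$ (+ biases)

● Unfolded RNN: same spirit as back-propagation in fully connected networks (chain rule)

BUT: the parameters $W$, $U$, $V$ are shared across time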

(23)

BPTT: Gradient Computation

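● Shared parameters $W$, $U$, $V$ across time

⇒ gradients depend on the whole past history

● Ex. for $W$: $\frac{\partial L^t}{\partial W} = \sum_{k=1}^{t} \frac{\partial L^t}{\partial y_t} \frac{\partial y_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W}$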
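The sum over k can be checked numerically. The sketch below accumulates the gradient of the last-step loss with respect to W by walking backwards through the unfolded network, then compares one entry with a central finite difference. It assumes a tanh cell, a softmax output with cross-entropy loss, a zero initial state, and no biases (omitted for brevity); in the code, the last factor of each term is the derivative of $h_k$ with respect to $W$ with $h_{k-1}$ held fixed.

```python
import numpy as np

def forward(xs, U, W, V):
    """Return all states [h_0, ..., h_T] and the softmax output at step T."""
    hs = [np.zeros(W.shape[0])]
    for x_t in xs:
        hs.append(np.tanh(U @ x_t + W @ hs[-1]))
    logits = V @ hs[-1]
    p = np.exp(logits - logits.max())
    return hs, p / p.sum()

def last_step_loss(xs, target, U, W, V):
    _, y = forward(xs, U, W, V)
    return -np.log(y[target])

def grad_W_bptt(xs, target, U, W, V):
    """dL^T/dW as a sum over k = T, ..., 1 of per-step contributions."""
    hs, y = forward(xs, U, W, V)
    dlogits = y.copy()
    dlogits[target] -= 1.0            # gradient of softmax + cross-entropy
    delta = V.T @ dlogits             # dL^T / dh_T
    dW = np.zeros_like(W)
    for k in range(len(xs), 0, -1):
        da = delta * (1.0 - hs[k] ** 2)  # back through tanh at step k
        dW += np.outer(da, hs[k - 1])    # contribution of W used at step k
        delta = W.T @ da                 # propagate to dL^T / dh_{k-1}
    return dW

rng = np.random.default_rng(0)
T, d, l, c = 6, 3, 4, 2
xs = rng.normal(size=(T, d))
U = rng.normal(scale=0.5, size=(l, d))
W = rng.normal(scale=0.5, size=(l, l))
V = rng.normal(scale=0.5, size=(c, l))
target = 1

dW = grad_W_bptt(xs, target, U, W, V)

# Central finite difference on a single entry of W.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
numeric = (last_step_loss(xs, target, U, W_plus, V)
           - last_step_loss(xs, target, U, W_minus, V)) / (2 * eps)
gap = abs(numeric - dW[0, 1])
print(gap)
```

The repeated multiplication by `W.T` and by the tanh derivative in the backward loop is the product hidden in $\partial h_t / \partial h_k$.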

(24)

BPTT: Optimization

● First-order methods: RMSProp [TH12], Adam [KB15]

● Second-order methods: Hessian-free [MS11]

● Vanishing gradient issues ⇒ next course

(25)

BPTT: Bayesian Dropout [GG16]

(26)

Outline

3 Applications of RNNs for Natural Language Processing (NLP)

(27)

Applications

● RNNs: state of the art for many sequence processing applications

(28)

Applications: RNN for text processing

Deep NLP strategy:

1 Extract "tokens" from the text input, e.g. characters or words

2 One hot encoding of tokens

3 Split the text into a "temporal" sequence

4 RNN to model the temporal structure

Option: use an embedding layer on top of one-hot encoding
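The first three steps of this strategy can be sketched in a few lines. The example below assumes character-level tokens, a toy text and a window length of 4; all of these are illustrative choices.

```python
import numpy as np

text = "hello world"

# Step 1: extract tokens (here, characters).
tokens = list(text)
vocab = sorted(set(tokens))
index = {tok: i for i, tok in enumerate(vocab)}

# Step 2: one-hot encode each token, giving an array of shape (len(text), |V|).
one_hot = np.eye(len(vocab))[[index[tok] for tok in tokens]]

# Step 3: split into "temporal" sequences with a sliding window.
seq_len = 4
sequences = np.array([one_hot[i:i + seq_len]
                      for i in range(len(tokens) - seq_len + 1)])

print(len(vocab), one_hot.shape, sequences.shape)
```

Each element of `sequences` has shape (seq_len, |V|) and can be fed to an RNN one time step at a time (step 4).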

(29)

NLP and Representation Learning

● Text: extract "tokens", i.e. raw inputs

● How to represent raw inputs, e.g. characters, words, sentences?

● Handcrafted vs. learned representations ⇒ "deep embeddings"

(30)

One-hot Representations

● Simplest encoding of text inputs: one-hot representation

● Binary vector whose size is the vocabulary size $|V|$, with a 1 at the index of the term

● $|V|$ is small for characters (∼10), large for words (∼10⁴), huge for sentences

● Basis for constructing Bag of Words (BoW) models

Handcrafted features used with shallow ML models, e.g. kernel methods

Still very competitive for some NLP tasks, e.g. text topic classification

Can be extended to (bags of) bi-grams, e.g. for language identification

(31)

Beyond one-hot Representations

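● Limitation: $\langle r(\text{"motel"}), r(\text{"hotel"}) \rangle = 0$, i.e. the one-hot representations of two distinct words are always orthogonal, however similar their meanings

● Text embedding motivation: extract representations reflecting semantic similarities between text primitives ("tokens")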
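The limitation is easy to verify on a toy vocabulary. In the sketch below, the three-word vocabulary and the helper `r` are assumptions for illustration: every pair of distinct words gets inner product 0, so "motel" is exactly as far from "hotel" as from "banana".

```python
import numpy as np

vocab = ["hotel", "motel", "banana"]

def r(word):
    """One-hot representation of a word over the toy vocabulary."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

print(r("motel") @ r("hotel"))   # 0.0
print(r("motel") @ r("banana"))  # 0.0
print(r("motel") @ r("motel"))   # 1.0
```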

(32)

Word embeddings

● Learn a mapping from the one-hot encoding to a smaller vector space

● General idea: represent a word by means of its neighbors

● Distributional hypothesis: one of the most successful ideas of modern statistical NLP

● Different approaches: Word2vec [MSC⁺13], GloVe [PSM14], BERT [DCLT18]
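The mapping from a one-hot vector to a smaller space is a matrix product, which reduces to a row lookup. The sketch below uses a random matrix `E` as a stand-in for learned embeddings; the sizes and the word index are illustrative assumptions, and it does not implement Word2vec, GloVe or BERT.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10_000, 50

# Embedding matrix: one row per word. Random here; learned in practice.
E = rng.normal(size=(vocab_size, embed_dim))

word_index = 42  # hypothetical index of some word in the vocabulary
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

embedding = one_hot @ E  # shape (embed_dim,)

# Multiplying by a one-hot vector simply selects a row of E.
print(np.allclose(embedding, E[word_index]))
```

This is why embedding layers are usually implemented as index lookups rather than as explicit products with one-hot vectors.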

(33)

Applications: Many-to-one

● Sentiment classification

Input: "At first exciting but over sentimental"

Token ⇔ word

Output: -60 = Bad review

(34)

Applications: Many-to-one

● Visual Question Answering ⇒ next week

Token ⇔ word

(35)

Applications: One to Many

● Image captioning ⇒ next week

Token ⇔ word

[KL15]

(36)

Applications: Sequence Generation

● Text (or music) generation, e.g. Char-nn

Input: sequence of characters (token ⇔ char)

Output: next character

● Many-to-many parallel

● In practice, trained with truncated BPTT (TBPTT) ⇒ use many-to-one (so that each prediction relies on a sufficiently long sequence): predict the next character from the previous K characters
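The many-to-one formulation amounts to cutting the text into (previous K characters, next character) pairs. A minimal sketch, where the function name, the toy text and K = 4 are assumptions for illustration:

```python
def make_training_pairs(text, K):
    """Return (input window of K chars, next char) pairs for many-to-one training."""
    return [(text[i:i + K], text[i + K]) for i in range(len(text) - K)]

pairs = make_training_pairs("hello world", K=4)
for window, next_char in pairs[:3]:
    print(repr(window), "->", repr(next_char))
```

Each window would then be one-hot encoded before being fed to the RNN, and the target is the index of the next character.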

(37)

Text Generation - Char-nn

● Char-nn: applied to raw text

Char-nn learns to spell a given language correctly, although capturing the semantic meaning of sentences is more challenging

Capacity to learn structural/syntactic rules of a language

⇒ applications to generating source code, e.g. Wikipedia pages, XML, LaTeX, Linux source code (C), etc.

See here for other examples

(38)

Practical session: text generation with char-nn

● Poetry generator, trained from "Les fleurs du mal"

● Training: many-to-one with RNNs

● Generation (inference): start from an input sequence, e.g. "souvent, p"

1 Predict the next character

2 Shift the input window by one character, and go back to step 1

Side note: if a prediction is wrong, the following inputs differ from those seen during training (train/test misalignment, "exposure bias")

To add stochasticity, sample from the softmax probabilities

Use temperature scaling to control the level of stochasticity
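The generation loop and temperature sampling can be sketched as follows. The trained network is replaced by `dummy_predict_next`, a hypothetical stand-in that returns uniform probabilities, and the tiny vocabulary is also an assumption; only the loop structure and the temperature rescaling are the point here.

```python
import numpy as np

def sample_with_temperature(probs, temperature, rng):
    """Rescale log-probabilities by 1/temperature, renormalize, then sample.

    Low temperature -> close to argmax; high temperature -> close to uniform.
    """
    logits = np.log(probs + 1e-12) / temperature
    scaled = np.exp(logits - logits.max())
    scaled /= scaled.sum()
    return int(rng.choice(len(probs), p=scaled))

def generate(predict_next, seed, n_chars, vocab, temperature, rng):
    """Step 1: predict the next char. Step 2: shift the window by one, repeat."""
    window, output = seed, seed
    for _ in range(n_chars):
        probs = predict_next(window)
        ch = vocab[sample_with_temperature(probs, temperature, rng)]
        output += ch
        window = window[1:] + ch  # drop the oldest char, append the new one
    return output

vocab = list("abc ")

def dummy_predict_next(window):
    """Hypothetical stand-in for a trained RNN: uniform softmax output."""
    return np.full(len(vocab), 1.0 / len(vocab))

rng = np.random.default_rng(0)
generated = generate(dummy_predict_next, "ab c", n_chars=20,
                     vocab=vocab, temperature=0.8, rng=rng)
print(len(generated))
```

With a real model, `predict_next` would one-hot encode the window, run the RNN, and return the softmax output of the last time step.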

(39)

Applications: Many to Many

● Many to Many↔Seq2Seq models

● Machine translation text2text

Input: "The agreement on the European Economic Area was signed in August 1992."

Output: "L’accord sur la zone économique européenne a été signé en août 1992."

[BCB14] [OC16]

(40)

Many to Many - Machine translation speech2text

● Machine translation speech2text

Input: Audio mp3 (speech utterance)

Output: "How much would a woodchuck chuck"

[CJLV15] [OC16]

(41)

Attention Mechanisms

● Used to focus the analysis of a sequence on some specific inputs

Examples: translation, image captioning, VQA
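The slides do not specify a particular attention formulation. As one possible illustration, the sketch below uses simple dot-product scoring, which is an assumption here: each input state gets a score against a query vector, the scores are normalized with a softmax, and the result is a weighted average of the states that focuses on the best-matching inputs.

```python
import numpy as np

def attend(query, states):
    """Dot-product attention (illustrative choice, not taken from the slides).

    states: array (T, l) of input representations, query: vector (l,).
    Returns the attention weights (T,) and the context vector (l,).
    """
    scores = states @ query                  # one score per input position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over positions
    context = weights @ states               # weighted average of the states
    return weights, context

rng = np.random.default_rng(0)
states = rng.normal(size=(5, 4))  # e.g. RNN states for a 5-token input
query = states[2] * 3.0           # a query aligned with the third state

weights, context = attend(query, states)
print(weights.round(3), context.shape)
```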

(42)

References I

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, Neural machine translation by jointly learning to align and translate, CoRR abs/1409.0473 (2014).

William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals, Listen, attend and spell, CoRR abs/1508.01211 (2015).

George Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals and Systems 2 (1989), no. 4, 303–314.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018).

Jeffrey L. Elman, Finding structure in time, Cognitive Science 14 (1990), no. 2, 179–211.

Yarin Gal and Zoubin Ghahramani, A theoretically grounded application of dropout in recurrent neural networks, Proceedings of the 30th International Conference on Neural Information Processing Systems (USA), NIPS'16, Curran Associates Inc., 2016, pp. 1027–1035.

Barbara Hammer, On the approximation capability of recurrent neural networks, Neurocomputing 31 (2000), no. 1-4, 107–123.

Andrej Karpathy, The unreasonable effectiveness of recurrent neural networks, 2015.

Diederik P. Kingma and Jimmy Ba, Adam: A method for stochastic optimization, ICLR, vol. abs/1412.6980, 2015.

Andrej Karpathy and Fei-Fei Li, Deep visual-semantic alignments for generating image descriptions, IEEE

(43)

References II

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, Proceedings of the Eighteenth International Conference on Machine Learning (San Francisco, CA, USA), ICML '01, Morgan Kaufmann Publishers Inc., 2001, pp. 282–289.

James Martens and Ilya Sutskever, Learning recurrent neural networks with Hessian-free optimization, Proceedings of the 28th International Conference on Machine Learning (USA), ICML'11, Omnipress, 2011, pp. 1033–1040.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems 26 (C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, eds.), Curran Associates, Inc., 2013, pp. 3111–3119.

Chris Olah and Shan Carter, Attention and augmented recurrent neural networks, Distill (2016).

Jeffrey Pennington, Richard Socher, and Christopher D. Manning, GloVe: Global vectors for word representation, EMNLP, 2014.

H. T. Siegelmann and E. D. Sontag, On the computational power of neural nets, J. Comput. Syst. Sci. 50 (1995), no. 1, 132–150.

T. Tieleman and G. Hinton, RMSprop Gradient Optimization.