
(1)

Apprentissage statistique : modélisation décisionnelle et apprentissage profond (RCP209)

Neural Networks and Deep Learning Recurrent Neural Networks

Nicolas Thome Prenom.Nom@cnam.fr

http://cedric.cnam.fr/vertigo/Cours/ml2/

Département Informatique

Conservatoire National des Arts et Métiers (Cnam)

(2)

Outline

1 Recurrent Neural Networks

2 Training RNNs

3 Applications of RNNs for Natural Language Processing (NLP)

(3)

Motivation: Sequence processing

● Sequence: input {x_t}, t ∈ {1, …, T}

● Prediction f(x_t) depends on f(x_{t'}) for t' ≤ t
Text, time series, DNA, social media, etc.


(4)

Sequence processing: options

● Fully connected network (FCN) on a localized window of size L

(5)

Sequence processing with FCNs: limitations

● ⊖ Increasing L ⇒ explosion of the number of parameters!

● ⊖ Independent decisions between time steps

● ⊖ Cannot handle variable sequence length L


(6)

Sequence processing: options

● Convolutional neural network (ConvNet) on a localized window of size L

(7)

Sequence processing with ConvNets

● ⊕ More compact than FCNs, locality, stability (see ConvNet course)

● ⊖ Cannot handle variable length L, or must resort to global pooling, which can be arbitrary


(8)

Sequence processing: options

● Structured prediction: explicit modeling of the dependencies between f(x_t) and f(x_{t'}), t' ≤ t

● Markov models (generative, modeling P(x, y)), Conditional Random Fields (CRF, discriminative) [LMP01]

(9)

Sequence processing with CRFs

● ⊕ Can handle variable length L

● ⊖ Limited to linear predictors

● ⊖ Complex inference procedure; exact solutions require approximations, e.g. the Markovian assumption: f(x_T | x_t, t < T) = f(x_T | x_{T-1})


(10)

Recurrent Neural Networks (RNNs) [Elm90]

● Input sequence {x_t}, t ∈ {1, …, T}, with x_t ∈ R^d

● Internal RNN state {h_t}, t ∈ {1, …, T}, with h_t ∈ R^l

● RNN cell: h_t = φ_t(x_t, h_{t-1})

Loop: h_t depends on the current input x_t and the previous state h_{t-1}
h_t: memory of the network history up to time t
In RNNs, the function φ_t = φ is shared across time

[Figure: recurrent view vs. unfolded view of the RNN]

(11)

Recurrent Neural Networks (RNNs) [Elm90]

● RNN cell: h_t = φ(x_t, h_{t-1})

φ: linear projection of x_t and h_{t-1}, i.e. fully connected layers: h_t = f(U x_t + W h_{t-1} + b_h)

U: matrix of size l×d, W: matrix of size l×l, b_h: vector of size l; f: tanh non-linearity
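
A minimal NumPy sketch of this cell, where the dimensions d and l, the toy sequence and the random initialization are illustrative assumptions:

```python
import numpy as np

d, l = 4, 8                              # input and state dimensions (illustrative)
rng = np.random.default_rng(0)

# Shared parameters across time: U (l x d), W (l x l), b_h (l)
U = rng.normal(scale=0.1, size=(l, d))
W = rng.normal(scale=0.1, size=(l, l))
b_h = np.zeros(l)

def rnn_cell(x_t, h_prev):
    """Elman cell: h_t = tanh(U x_t + W h_{t-1} + b_h)."""
    return np.tanh(U @ x_t + W @ h_prev + b_h)

h = np.zeros(l)                          # initial state h_0
x = rng.normal(size=(5, d))              # toy sequence of length T = 5
for x_t in x:                            # the same U, W, b_h are reused at every step
    h = rnn_cell(x_t, h)
```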


(12)

Recurrent Neural Networks (RNNs) [Elm90]

● At each time step t, RNN output y_t = f(V h_t + b_y)
f: softmax if y_t represents class probabilities
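
Continuing the NumPy sketch above, the output can be read out from the state at each step; V, b_y and the number of classes are assumptions for illustration:

```python
n_classes = 3                                    # illustrative
V = rng.normal(scale=0.1, size=(n_classes, l))   # output projection (n_classes x l)
b_y = np.zeros(n_classes)

def softmax(z):
    z = z - z.max()                              # numerical stability
    e = np.exp(z)
    return e / e.sum()

# One output per time step (many-to-many); keep only the last one for many-to-one
h = np.zeros(l)
outputs = []
for x_t in x:
    h = rnn_cell(x_t, h)
    outputs.append(softmax(V @ h + b_y))         # y_t = softmax(V h_t + b_y)
```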

(13)

RNNs modeling power

● Recap: feed-forward neural networks are universal function approximators [Cyb89]

● Expressivity of the mapping between {x_t} and {y_t}? RNNs are universal program approximators [SS95]

Can approximate any computable function, i.e. any Turing machine

RNNs can approximate any measurable sequence-to-sequence mapping [Ham00]


(14)

Example: Computing sum with RNNs

(15)

Example: Comparing dimension sums with RNNs

● Determine whether the sum of the values of the first dimension is greater than the sum of the values of the second dimension

dim1 − dim2, and then sum
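
As a toy illustration of these two examples, the weights of a linear-activation RNN cell can be set by hand so that the state accumulates dim1 − dim2 and the final sign gives the answer; the specific weight values and the toy sequence are assumptions for illustration:

```python
import numpy as np

# Hand-set Elman cell with identity activation: h_t = U x_t + W h_{t-1}
U = np.array([[1.0, -1.0]])   # 1 x 2: adds x_t[0] - x_t[1] to the state
W = np.array([[1.0]])         # 1 x 1: keeps the running sum

x = np.array([[2.0, 1.0],     # toy sequence, x_t = (dim1_t, dim2_t)
              [0.5, 3.0],
              [4.0, 1.5]])

h = np.zeros(1)
for x_t in x:
    h = U @ x_t + W @ h       # running sum of dim1 - dim2

print("sum(dim1) > sum(dim2)?", bool(h[0] > 0))   # True here: 6.5 > 5.5
```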


(16)

Outline

1 Recurrent Neural Networks

2 Training RNNs

3 Applications of RNNs for Natural Language Processing (NLP)

(17)

Training RNNs

● RNN: mapping an input sequence {x_t}, t ∈ {1, …, T}, into {y_t}, t ∈ {1, …, T}

● Different tasks ⇔ different mappings

many-to-one: sentiment classification, text generation (practical session), time series forecasting, VQA (next course)

one-to-many: image captioning (next course)

many-to-many parallel: char-nn (predict next character)

many-to-many (sub-part): machine translation (text2text, speech2text), video classification (frame level)


(18)

Training RNNs: Formulation

● Comparing output predictions {y_t}, t ∈ {1, …, T}, with the supervision {y*_t}
Task-dependent, e.g. only {y*_T} in many-to-one

(19)

Training RNNs: Formulation

● Loss function at time t: L_t(y_t, y*_t), e.g. cross-entropy (classification)

● Total loss function: L({y_t}, {y*_t}) = Σ_{t=1}^{T} L_t(y_t, y*_t)

Credit: Fei-Fei
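
A minimal PyTorch sketch of this total loss for the many-to-many case where every step is supervised; the shapes and random targets are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

T, n_classes = 5, 3                       # illustrative sizes
torch.manual_seed(0)

logits = torch.randn(T, n_classes)        # V h_t + b_y for t = 1..T (before softmax)
targets = torch.randint(n_classes, (T,))  # ground-truth class y*_t at each step

# L = sum_t L_t(y_t, y*_t), with L_t the cross-entropy at time t
loss = sum(F.cross_entropy(logits[t].unsqueeze(0), targets[t].unsqueeze(0))
           for t in range(T))
# equivalently: F.cross_entropy(logits, targets, reduction='sum')
```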


(20)

Back-Propagation Through Time (BPTT)

Credit: Fei-Fei

(21)

Truncated BPTT

Credit: Fei-Fei
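
One common way to implement truncated BPTT is to cut the computational graph every K steps by detaching the hidden state. A hedged PyTorch sketch, where the toy model, data and truncation length K are assumptions:

```python
import torch
import torch.nn as nn

d, l, K, T = 4, 8, 10, 100             # dims, truncation length, sequence length (illustrative)
rnn = nn.RNN(input_size=d, hidden_size=l, batch_first=True)
head = nn.Linear(l, d)                  # toy prediction head
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=1e-3)

x = torch.randn(1, T, d)                # toy sequence (batch of 1)
target = torch.randn(1, T, d)

h = torch.zeros(1, 1, l)                # initial hidden state
for start in range(0, T, K):
    chunk = x[:, start:start + K]
    y, h = rnn(chunk, h)
    loss = nn.functional.mse_loss(head(y), target[:, start:start + K])
    opt.zero_grad()
    loss.backward()                     # gradients flow only through the last K steps
    opt.step()
    h = h.detach()                      # truncation: stop gradients at the chunk boundary
```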


(22)

BPTT: Gradient Computation

● BPTT: computing the gradients ∂L/∂W, ∂L/∂U, ∂L/∂V (+ biases)

● Unfolded RNN: same spirit as back-prop in fully connected networks (chain rule)
BUT: shared parameters W, U, V across time

(23)

BPTT: Gradient Computation

● Shared parameters W, U, V across time

⇒ gradients depend on the whole past history

● Ex: for W: ∂L_t/∂W = Σ_{k=1}^{t} (∂L_t/∂y_t) (∂y_t/∂h_t) (∂h_t/∂h_k) (∂h_k/∂W)
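
For completeness, a hedged expansion of the ∂h_t/∂h_k factor, assuming the Elman cell h_t = f(U x_t + W h_{t-1} + b_h) introduced earlier; this product of step-to-step Jacobians is what drives the vanishing/exploding gradient issues mentioned on the next slide:

```latex
\frac{\partial L_t}{\partial W}
  = \sum_{k=1}^{t}
    \frac{\partial L_t}{\partial y_t}\,
    \frac{\partial y_t}{\partial h_t}\,
    \left(\prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}\right)
    \frac{\partial h_k}{\partial W},
\qquad
\frac{\partial h_i}{\partial h_{i-1}}
  = \operatorname{diag}\!\bigl(f'(U x_i + W h_{i-1} + b_h)\bigr)\, W .
```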


(24)

BPTT: Optimization

● First order methods:

RMSProp [TH12], Adam [KB15]

● Second order methods: Hessian-free [MS11]

● Vanishing gradient issues ⇒ next course

(25)

BPTT: Bayesian Dropout [GG16]


(26)

Outline

1 Recurrent Neural Networks

2 Training RNNs

3 Applications of RNNs for Natural Language Processing (NLP)

(27)

Applications

● RNNs are state of the art for many sequence processing applications


(28)

Applications: RNN for text processing

Deep NLP strategy:

1 Extract text inputs, "tokens", e.g. characters or words

2 One-hot encoding of tokens

3 Split the text into a "temporal" sequence

4 RNN to model the temporal structure

Option: use an embedding layer on top of one-hot encoding
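
A hedged PyTorch sketch of this pipeline for character tokens; the toy text, window handling and layer sizes are illustrative assumptions, and the embedding layer plays the role of the optional step (it maps token indices directly, instead of multiplying explicit one-hot vectors):

```python
import torch
import torch.nn as nn

text = "hello world, hello rnn"                  # toy corpus
vocab = sorted(set(text))                        # 1. character tokens
stoi = {c: i for i, c in enumerate(vocab)}

ids = torch.tensor([stoi[c] for c in text])      # 3. "temporal" sequence of token indices

class CharRNN(nn.Module):
    def __init__(self, vocab_size, emb=16, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)        # 2./option: embedding instead of explicit one-hot
        self.rnn = nn.RNN(emb, hidden, batch_first=True)  # 4. RNN models the temporal structure
        self.out = nn.Linear(hidden, vocab_size)
    def forward(self, x, h=None):
        e = self.embed(x)
        y, h = self.rnn(e, h)
        return self.out(y), h

model = CharRNN(len(vocab))
logits, _ = model(ids.unsqueeze(0))              # (1, T, vocab_size): one prediction per step
```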

(29)

NLP and Representation Learning

● Text: extract "tokens", i.e. raw inputs

● How to represent raw inputs, e.g. characters, words, sentences?

● Handcrafted vs. learned representations ⇒ "deep embeddings"


(30)

One-hot Representations

● Simplest encoding of text inputs: one-hot representation

● Binary vector of vocabulary size |V|, with a 1 at the term index

● |V| small for chars (∼10), large for words (∼10^4), huge for sentences

● Basis for constructing Bag of Words (BoW) models

Handcrafted features used with shallow ML models, e.g. kernel methods
Still very competitive for some NLP tasks, e.g. text topic classification
Can be extended to (bags of) bi-grams, e.g. for language identification
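
A small sketch of one-hot and bag-of-words encodings; the toy vocabulary is an illustrative assumption:

```python
import numpy as np

vocab = ["hotel", "motel", "the", "near", "airport"]   # toy vocabulary, |V| = 5
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Binary vector of size |V| with a single 1 at the term index."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

def bag_of_words(tokens):
    """BoW: sum of the one-hot vectors of the tokens (here raw counts)."""
    return sum(one_hot(t) for t in tokens if t in index)

print(one_hot("hotel"))                                     # [1. 0. 0. 0. 0.]
print(bag_of_words("the hotel near the airport".split()))   # [1. 0. 2. 1. 1.]
```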

(31)

Beyond one-hot Representations

● Limitation: ⟨r("motel"), r("hotel")⟩ = 0 (one-hot vectors of distinct words are orthogonal, so they encode no similarity)

● Text embedding motivation: extract representations reflecting semantic similarities between text primitives ("tokens")


(32)

Word embeddings

● Learn a mapping from the one-hot encoding to a lower-dimensional vector space

● General idea: representing a word by means of its neighbors

● Distributional Hypothesis: one of the most successful ideas of modern statistical NLP

● Different approaches: Word2vec [MSC+13], GloVe [PSM14], BERT [DCLT18]
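
A minimal sketch of such a mapping as an embedding lookup table; the vocabulary and sizes are illustrative assumptions, and in Word2vec or GloVe the table would be trained so that words appearing in similar contexts end up with nearby vectors:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab = ["hotel", "motel", "airport", "blue"]                    # toy vocabulary
emb = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)   # |V| -> R^8 lookup table

motel = emb(torch.tensor([vocab.index("motel")]))[0]
hotel = emb(torch.tensor([vocab.index("hotel")]))[0]

# After training, semantically close words should have high cosine similarity
# (here the table is randomly initialized, so the value is meaningless).
print(F.cosine_similarity(motel, hotel, dim=0).item())
```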

(33)

Applications: Many-to-one

● Sentiment classification

Input: "At first exciting but over sentimental"

Token: word
Output: bad review


(34)

Applications: Many-to-one

● Visual Question Answering ⇒ next week
Token: word

(35)

Applications: One to Many

● Image captioning ⇒ next week
Token: word

[KL15]


(36)

Applications: Sequence Generation

● Text (or music) generation, e.g. char-nn
Input: sequence of characters (token: char)
Output: next character

● Many-to-many parallel

● In practice, trained with truncated BPTT (TBPTT) ⇒ use many-to-one (to get a prediction conditioned on a sufficiently long context): predict the next character from the previous K characters
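
A sketch of how such many-to-one training pairs can be built from raw text; the value of K and the toy text are assumptions:

```python
text = "chante, ô ma muse"           # toy training text
K = 8                                # context length (illustrative)

# Many-to-one pairs: K previous characters -> next character
pairs = [(text[i:i + K], text[i + K]) for i in range(len(text) - K)]

for context, nxt in pairs[:3]:
    print(repr(context), "->", repr(nxt))
# ('chante, ', 'ô'), ('hante, ô', ' '), ('ante, ô ', 'm')
```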

(37)

Text Generation - Char-nn

● Char-nn: applied to raw text

Char-nn learns to spell a given language correctly, although capturing the semantic meaning of sentences remains more challenging

Capacity to learn structural/syntactic rules of the language

Applications to generating structured text, e.g. Wikipedia pages, XML, LaTeX, Linux source code (C), etc.

See Karpathy's blog post (The unreasonable effectiveness of recurrent neural networks) for other examples


(38)

Practical session: text generation with char-nn

● Poetry generator, trained on "Les fleurs du mal"

● Training: many-to-one with RNNs

● Generation (inference): start from an input sequence, e.g. "souvent, p"

1 Predict the next character

2 Shift the input by one character, and go back to step 1

Side note: if a prediction is wrong, a train/test misalignment appears ("exposure bias")
To add stochasticity, sample w.r.t. the softmax probabilities

Use temperature scaling to control the stochasticity level
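
A hedged sketch of the generation loop with temperature scaling; predict_probs, vocab and K are hypothetical names standing for the trained character-level model, its vocabulary and the context length from the sketches above. A temperature below 1 makes sampling more conservative, above 1 more random:

```python
import numpy as np

def sample_next_char(probs, temperature=1.0, rng=np.random.default_rng()):
    """Sample a character index from softmax probabilities rescaled by a temperature."""
    logits = np.log(probs + 1e-9) / temperature
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(len(p), p=p)

def generate(predict_probs, seed, n_chars=100, temperature=0.5):
    """predict_probs(context) is assumed to return next-char probabilities (e.g. an RNN)."""
    text = seed                              # e.g. "souvent, p"
    for _ in range(n_chars):
        probs = predict_probs(text[-K:])     # 1. predict the next char from the last K chars
        idx = sample_next_char(probs, temperature)
        text += vocab[idx]                   # 2. shift the input and repeat
    return text
```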

(39)

Applications: Many to Many

● Many to Many ↔ Seq2Seq models

● Machine translation text2text

Input: "The agreement on the European Economic Area was signed in August 1992."

Output: "L’accord sur la zone économique européenne a été signé en août 1992."

[BCB14] [OC16]


(40)

Many to Many - Machine translation speech2text

● Machine translation speech2text
Input: audio MP3 (speech utterance)

Output: "How much would a woodchuck chuck"

[CJLV15] [OC16]

(41)

Attention Mechanisms

● Used to focus the analysis of a sequence on some specific inputs:
Translation
Image captioning
VQA
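
As an illustration of the underlying computation, a minimal dot-product attention sketch over a sequence of encoder states; the shapes are assumptions, and the cited models ([BCB14], [OC16]) use learned score functions rather than this plain dot product:

```python
import numpy as np

T, l = 6, 8                           # sequence length and state size (illustrative)
rng = np.random.default_rng(0)
H = rng.normal(size=(T, l))           # encoder states h_1..h_T
q = rng.normal(size=l)                # query, e.g. the current decoder state

scores = H @ q                        # one relevance score per input position
weights = np.exp(scores - scores.max())
weights /= weights.sum()              # softmax: attention distribution over the T inputs

context = weights @ H                 # weighted sum of states, focused on the relevant inputs
```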


(42)

References I

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, Neural machine translation by jointly learning to align and translate, CoRR abs/1409.0473 (2014).

William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals, Listen, attend and spell, CoRR abs/1508.01211 (2015).

George Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals and Systems 2 (1989), no. 4, 303–314.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018).

Jeffrey L. Elman, Finding structure in time, Cognitive Science 14 (1990), no. 2, 179–211.

Yarin Gal and Zoubin Ghahramani, A theoretically grounded application of dropout in recurrent neural networks, Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16), Curran Associates Inc., 2016, pp. 1027–1035.

Barbara Hammer, On the approximation capability of recurrent neural networks, Neurocomputing 31 (2000), no. 1-4, 107–123.

Andrej Karpathy, The unreasonable effectiveness of recurrent neural networks, blog post, 2015.

Diederik P. Kingma and Jimmy Ba, Adam: A method for stochastic optimization, ICLR, 2015 (abs/1412.6980).

Andrej Karpathy and Fei-Fei Li, Deep visual-semantic alignments for generating image descriptions, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

(43)

References II

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01), Morgan Kaufmann Publishers Inc., 2001, pp. 282–289.

James Martens and Ilya Sutskever, Learning recurrent neural networks with Hessian-free optimization, Proceedings of the 28th International Conference on Machine Learning (ICML'11), Omnipress, 2011, pp. 1033–1040.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems 26, Curran Associates, Inc., 2013, pp. 3111–3119.

Chris Olah and Shan Carter, Attention and augmented recurrent neural networks, Distill (2016).

Jeffrey Pennington, Richard Socher, and Christopher D. Manning, GloVe: Global vectors for word representation, EMNLP, 2014.

H. T. Siegelmann and E. D. Sontag, On the computational power of neural nets, J. Comput. Syst. Sci. 50 (1995), no. 1, 132–150.

T. Tieleman and G. Hinton, RMSprop gradient optimization, 2012.
