
On Recurrent and Deep Neural Networks


(1) On Recurrent and Deep Neural Networks. Razvan Pascanu. Advisor: Yoshua Bengio. PhD Defence, Université de Montréal, LISA lab, September 2014.

(2)-(3) Motivation. "A computer once beat me at chess, but it was no match for me at kick boxing" — Emo Phillips. Studying the mechanism behind learning provides a meta-solution for solving tasks.

(4)-(6) Supervised Learning.
- $f_F : \Theta \times D \to T$
- $f_\theta(x) = f_F(\theta, x)$
- $f^\star = \arg\min_{\theta \in \Theta} \mathbb{E}_{x, t \sim \pi}\left[ d\left( f_\theta(x), t \right) \right]$
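To make the notation concrete, here is a minimal sketch that instantiates $f_\theta$, $d$ and the expected loss; the linear model, squared-error distance and synthetic data are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

# Hypothetical instantiation of the supervised-learning notation:
# f_theta(x) = theta . x (a linear model) and d(y, t) = (y - t)^2.
def f(theta, x):
    return x @ theta

def d(y, t):
    return (y - t) ** 2

# The expectation over (x, t) ~ pi is approximated by an average
# over a finite sample drawn from the data distribution.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                      # inputs x in D
true_theta = np.array([1.0, -2.0, 0.5])
T = X @ true_theta + 0.1 * rng.normal(size=1000)    # targets t

theta = np.zeros(3)
empirical_risk = np.mean(d(f(theta, X), T))
print(f"estimated E[d(f_theta(x), t)] = {empirical_risk:.3f}")
```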

(7) Optimization for learning. [Figure: iterative parameter updates, θ[k] → θ[k+1].]
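A minimal sketch of the iterative update pictured on this slide, using plain gradient descent on the empirical risk from the previous example; the learning rate and the model are my own assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_theta = np.array([1.0, -2.0, 0.5])
T = X @ true_theta + 0.1 * rng.normal(size=1000)

theta = np.zeros(3)      # theta[0]
lr = 0.1                 # step size (illustrative)
for k in range(100):
    residual = X @ theta - T
    grad = 2.0 * X.T @ residual / len(X)   # gradient of the mean squared error
    theta = theta - lr * grad              # theta[k+1] = theta[k] - lr * grad
print(theta)             # approaches true_theta
```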

(8) Neural networks. [Figure: a feedforward network with an input layer, a first and a second hidden layer, a last hidden layer, output neurons, and bias units fixed to 1.]

(9) Recurrent neural networks. [Figure: (a) a feedforward network next to (b) a recurrent network, in which a recurrent layer feeds back into itself; both have an input layer, output neurons and bias units fixed to 1.]
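For reference, a minimal sketch of the recurrent update implied by diagram (b); the tanh nonlinearity, weight shapes and scales are conventional assumptions, not taken from the slide.

```python
import numpy as np

def rnn_step(h_prev, x, W, U, b):
    """One step of a vanilla recurrent layer: h(t) = tanh(W h(t-1) + U x(t) + b)."""
    return np.tanh(W @ h_prev + U @ x + b)

rng = np.random.default_rng(0)
n_hid, n_in = 4, 3
W = rng.normal(scale=0.5, size=(n_hid, n_hid))   # recurrent weights
U = rng.normal(scale=0.5, size=(n_hid, n_in))    # input weights
b = np.zeros(n_hid)                              # bias

h = np.zeros(n_hid)
for t in range(10):                              # unroll over a short input sequence
    x_t = rng.normal(size=n_in)
    h = rnn_step(h, x_t, W, U, b)
print(h)
```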

(10) On the number of linear regions of Deep Neural Networks. Razvan Pascanu, Guido Montufar, Kyunghyun Cho and Yoshua Bengio. International Conference on Learning Representations 2014; submitted to the Conference on Neural Information Processing Systems 2014.

(11) Big picture.
- $\mathrm{rect}(x) = \begin{cases} 0, & x < 0 \\ x, & x > 0 \end{cases}$
- Idea: a composition of piece-wise linear functions is a piece-wise linear function.
- Approach: count the number of pieces for a deep versus a shallow model.
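A small numerical illustration of the counting approach; the network sizes and the finite-difference way of detecting pieces are my own choices for the demo. For a 1-D input, the number of linear pieces of a rectifier network can be estimated by counting slope changes.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def rectifier_net(x, weights, biases):
    """A rectifier MLP applied to a batch of scalar inputs x."""
    h = x[:, None]
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)
    return (h @ weights[-1] + biases[-1]).ravel()

def count_linear_pieces(f, lo=-3.0, hi=3.0, n=200001, tol=1e-6):
    """Estimate the number of linear pieces of a scalar piecewise-linear f on [lo, hi]
    by counting where the finite-difference slope changes."""
    x = np.linspace(lo, hi, n)
    y = f(x)
    slopes = np.diff(y) / np.diff(x)
    jumps = np.flatnonzero(np.abs(np.diff(slopes)) > tol)
    # a kink falling between two samples triggers two adjacent jumps; merge them
    n_kinks = 0 if jumps.size == 0 else 1 + int(np.sum(np.diff(jumps) > 1))
    return n_kinks + 1

rng = np.random.default_rng(0)
# Shallow: one hidden layer of 8 rectifiers.  Deep: two hidden layers of 4 rectifiers.
shallow_W = [rng.normal(size=(1, 8)), rng.normal(size=(8, 1))]
shallow_b = [rng.normal(size=8), rng.normal(size=1)]
deep_W = [rng.normal(size=(1, 4)), rng.normal(size=(4, 4)), rng.normal(size=(4, 1))]
deep_b = [rng.normal(size=4), rng.normal(size=4), rng.normal(size=1)]

print("shallow pieces:", count_linear_pieces(lambda x: rectifier_net(x, shallow_W, shallow_b)))
print("deep pieces:   ", count_linear_pieces(lambda x: rectifier_net(x, deep_W, deep_b)))
```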

(12) Single layer models. [Figure: the regions R∅, R1, R2, R12, R13, R23, R123 into which the hyperplanes L1, L2, L3 of a single rectifier layer divide the input space.] Zaslavsky's theorem (1975): an arrangement of $n_{hid}$ hyperplanes divides $\mathbb{R}^{n_{inp}}$ into at most $\sum_{s=0}^{n_{inp}} \binom{n_{hid}}{s}$ regions.
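A one-liner evaluating this bound (the example sizes are arbitrary):

```python
from math import comb

def max_regions_single_layer(n_inp, n_hid):
    """Zaslavsky's bound on the number of regions cut out by n_hid hyperplanes in R^n_inp."""
    return sum(comb(n_hid, s) for s in range(n_inp + 1))

print(max_regions_single_layer(n_inp=2, n_hid=3))   # 7 regions for 3 lines in the plane
print(max_regions_single_layer(n_inp=1, n_hid=8))   # upper bound for the 8-unit shallow net above
```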

(13) Multi-layer models: how would it work? [Figure: the input plane (x0, x1) partitioned into pieces PS1, PS2, PS3, PS4.]

(14) Multi-layer models: how would it work? [Figure: the input space, first layer space and second layer space; the subsets S1, S2, S3, S4 of the input space are mapped onto one another by 1. a fold along the vertical axis and 2. a fold along the horizontal axis.]

(15) Multi-layer models: how would it work? [Figure only.]

(16) Visualizing units. [Figure only.]

(17) Revisiting Natural Gradient for Deep Networks. Razvan Pascanu and Yoshua Bengio. International Conference on Learning Representations 2014.

(18) Gist of this work.
- Natural gradient is a generalized trust region method.
- Hessian-free optimization is natural gradient (for particular pairs of activation functions and error functions).
- Using the empirical Fisher (TONGA) is not equivalent to the same trust region method as natural gradient.
- Natural gradient can be accelerated if we add second order information of the error.
- Natural gradient can use unlabeled data.
- Natural gradient is more robust to changes in the order of the training set.
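To ground the terminology, here is a minimal sketch of one natural gradient step for softmax regression, contrasting a Monte-Carlo estimate of the true Fisher (labels sampled from the model) with the empirical Fisher (observed labels); the model, damping constant and step size are assumptions for the demo, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, C = 500, 5, 3
X = rng.normal(size=(N, d))
true_W = rng.normal(size=(d, C))
T = np.argmax(X @ true_W + rng.gumbel(size=(N, C)), axis=1)   # sampled class labels

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

W = np.zeros((d, C))
P = softmax(X @ W)                                  # model predictions p(y | x)

# Gradient of the average negative log-likelihood, flattened to a vector.
onehot_T = np.eye(C)[T]
grad = (X.T @ (P - onehot_T) / N).ravel()

def per_example_score(labels):
    """d log p(labels | x) / dW for each example, flattened."""
    G = np.einsum('ni,nc->nic', X, np.eye(C)[labels] - P)
    return G.reshape(N, -1)

# True Fisher (Monte Carlo): labels sampled from the model itself.
# Empirical Fisher (as in TONGA): observed labels instead.
sampled_y = np.array([rng.choice(C, p=P[n]) for n in range(N)])
S_true = per_example_score(sampled_y)
S_emp = per_example_score(T)
F_true = S_true.T @ S_true / N
F_emp = S_emp.T @ S_emp / N

damping = 1e-3                                       # keeps the Fisher invertible
lr = 0.1
nat_step = -lr * np.linalg.solve(F_true + damping * np.eye(d * C), grad)
emp_step = -lr * np.linalg.solve(F_emp + damping * np.eye(d * C), grad)
print("cosine(natural step, empirical-Fisher step) =",
      nat_step @ emp_step / (np.linalg.norm(nat_step) * np.linalg.norm(emp_step)))
```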

(19) On the saddle point problem for non-convex optimization. Yann Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli and Yoshua Bengio. Submitted to the Conference on Neural Information Processing Systems 2014.

(20) Existing evidence. Statistical physics (on random Gaussian fields). [Figure: error versus index of critical points, and the distribution of Hessian eigenvalues.]

(21) Existing evidence. Empirical evidence. [Figure: left, training error ε (%) versus index of critical point α; right, the Hessian eigenvalue distribution p(λ) at critical points with training errors 0.32%, 23.49% and 28.23%.]

(22) Problem. Saddle points are attractors of second order dynamics. [Figure: trajectories of Newton, SFN and SGD near a saddle point.]

(23) Solution.
$$\arg\min_{\Delta\theta}\; T_1\{L(\theta)\} \quad \text{s.t.} \quad \big\| T_2\{L(\theta)\} - T_1\{L(\theta)\} \big\| \le \epsilon$$
where $T_1$ and $T_2$ are the first and second order Taylor expansions of $L$ around $\theta$. Using Lagrange multipliers:
$$\Delta\theta = -\frac{\partial L(\theta)}{\partial \theta}\, |\mathbf{H}|^{-1}$$
where $|\mathbf{H}|$ is the Hessian with its eigenvalues replaced by their absolute values.
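A minimal sketch of the resulting saddle-free Newton (SFN) step on a toy quadratic; the test function, damping and number of iterations are illustrative assumptions. Replacing the Hessian's eigenvalues by their absolute values makes negative-curvature directions repel the iterate from the saddle instead of attracting it.

```python
import numpy as np

def saddle_free_newton_step(grad, hessian, damping=1e-3):
    """Delta theta = -|H|^{-1} grad, where |H| has the absolute eigenvalues of H."""
    eigval, eigvec = np.linalg.eigh(hessian)
    abs_inv = eigvec @ np.diag(1.0 / (np.abs(eigval) + damping)) @ eigvec.T
    return -abs_inv @ grad

# Toy saddle: L(a, b) = a^2 - b^2 has a saddle point at the origin.
def grad_L(theta):
    a, b = theta
    return np.array([2 * a, -2 * b])

H = np.array([[2.0, 0.0], [0.0, -2.0]])
theta = np.array([1.0, 0.5])
for _ in range(5):
    theta = theta + saddle_free_newton_step(grad_L(theta), H)
print(theta)   # a shrinks toward the minimum, b grows away from the saddle
```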

(24) Experiments. CIFAR-10. [Figure: training error ε (%) versus number of hidden units and versus number of epochs, and |most negative λ| versus number of epochs, for MSGD, damped Newton and SFN.]

(25) Experiments. [Figure: training error curves of MSGD and SFN on a deep autoencoder and on a recurrent neural network.]

(26) A Neurodynamical Model for Working Memory. Razvan Pascanu, Herbert Jaeger. Neural Networks, 2011.

(27) Gist of this work. [Figure: a reservoir architecture with input units (u), a reservoir (x), output units (y) and working-memory (WM) units (m).]

(28) On the difficulty of training recurrent neural networks. Razvan Pascanu, Tomas Mikolov, Yoshua Bengio. International Conference on Machine Learning 2013.

(29) The exploding gradients problem. [Figure: the recurrent network unrolled in time, with states h(t-1), h(t), h(t+1), inputs x(t-1), x(t), x(t+1) and costs C(t-1), C(t), C(t+1).]
$$\frac{\partial C}{\partial W} = \sum_t \frac{\partial C^{(t)}}{\partial W}, \qquad
\frac{\partial C^{(t)}}{\partial W} = \sum_{k=0}^{t} \frac{\partial C^{(t)}}{\partial h^{(t)}} \frac{\partial h^{(t)}}{\partial h^{(t-k)}} \frac{\partial h^{(t-k)}}{\partial W}, \qquad
\frac{\partial h^{(t)}}{\partial h^{(t-k)}} = \prod_{j=t-k+1}^{t} \frac{\partial h^{(j)}}{\partial h^{(j-1)}}$$

(30) Possible geometric interpretation and norm clipping. Classical view. The error is $(h(50) - 0.7)^2$ for $h(t) = w\,\sigma(h(t-1)) + b$ with $h(0) = 0.5$. [Figure: the resulting error surface over the parameters.]
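A minimal sketch reproducing this toy setup with gradients computed by backpropagation through time and rescaled by gradient norm clipping; the clipping threshold, learning rate and starting parameters are arbitrary choices for the demo.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(w, b, T=50, h0=0.5, target=0.7):
    """Error (h(T) - target)^2 for h(t) = w * sigmoid(h(t-1)) + b, gradients by BPTT."""
    h = [h0]
    for _ in range(T):
        h.append(w * sigmoid(h[-1]) + b)
    loss = (h[-1] - target) ** 2

    dh = 2.0 * (h[-1] - target)      # dL/dh(T)
    gw, gb = 0.0, 0.0
    for t in range(T, 0, -1):        # walk back through time
        gw += dh * sigmoid(h[t - 1])                              # dh(t)/dw
        gb += dh                                                  # dh(t)/db
        dh *= w * sigmoid(h[t - 1]) * (1.0 - sigmoid(h[t - 1]))   # dh(t)/dh(t-1)
    return loss, np.array([gw, gb])

def clip_by_norm(g, threshold=1.0):
    """Rescale the gradient whenever its norm exceeds the threshold (norm clipping)."""
    norm = np.linalg.norm(g)
    return g * (threshold / norm) if norm > threshold else g

w, b, lr = 5.0, -2.0, 0.05
for step in range(200):
    loss, g = loss_and_grad(w, b)
    g = clip_by_norm(g, threshold=1.0)
    w, b = np.array([w, b]) - lr * g
print(f"final loss {loss:.6f} at w={w:.3f}, b={b:.3f}")
```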

(31) The vanishing gradients problem. [Same unrolled graph and gradient decomposition as slide (29); here the Jacobian products $\prod_j \partial h^{(j)} / \partial h^{(j-1)}$ shrink, so terms spanning many time steps contribute almost nothing to the gradient.]

(32) Regularization term.
$$\Omega = \sum_k \Omega_k = \sum_k \left( \frac{\left\| \frac{\partial C}{\partial h^{(k+1)}} \frac{\partial h^{(k+1)}}{\partial h^{(k)}} \right\|}{\left\| \frac{\partial C}{\partial h^{(k+1)}} \right\|} - 1 \right)^2$$
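For the scalar toy RNN above the regularizer can be written out explicitly; this is a hedged sketch, relying on the fact that in the scalar case each ratio of norms reduces to $|w\,\sigma'(h^{(k)})|$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def omega(w, b, T=50, h0=0.5):
    """Regularizer Omega = sum_k (|dh(k+1)/dh(k)| - 1)^2 for the scalar RNN
    h(t) = w * sigmoid(h(t-1)) + b.  In the scalar case the ratio of norms in
    Omega_k collapses to the absolute value of the single Jacobian entry."""
    h = h0
    total = 0.0
    for _ in range(T):
        s = sigmoid(h)
        jac = w * s * (1.0 - s)              # dh(k+1)/dh(k)
        total += (abs(jac) - 1.0) ** 2
        h = w * s + b
    return total

print(omega(w=5.0, b=-2.0))   # penalizes time steps whose Jacobian is far from norm 1
```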

(33) Temporal order. Important symbols: A, B. Distractor symbols: c, d, e, f. Each length-T sequence splits into segments of lengths T/10, 4T/10, T/10 and 4T/10; the two short segments each contain one important symbol and everything else is a distractor. The target is the order of the two important symbols:
de..fAef ccefc..e fAef..e ef..c → AA
edefcAccfef..ceceBedef..fedef → AB
feBefccde..efddcAfccee..cedcd → BA
Bfffede..cffecdBedfd..cedfedc → BB
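A hedged sketch of a generator for this task; the exact segment layout follows the description above and may differ in detail from the paper's generator.

```python
import random

def temporal_order_example(T=100, important="AB", distractors="cdef", rng=random):
    """Generate one sequence for the temporal order task: two important symbols,
    one in each short segment, surrounded by distractors; the label is their order."""
    assert T >= 20
    short, long = T // 10, 4 * T // 10
    seg_lengths = [short, long, short, T - (2 * short + long)]
    symbols = [rng.choice(important), rng.choice(important)]
    seq = []
    for i, length in enumerate(seg_lengths):
        segment = [rng.choice(distractors) for _ in range(length)]
        if i in (0, 2):                       # the short segments carry the important symbols
            segment[rng.randrange(length)] = symbols[i // 2]
        seq.extend(segment)
    return "".join(seq), "".join(symbols)

random.seed(0)
for _ in range(3):
    sequence, label = temporal_order_example(T=40)
    print(label, sequence)
```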

(34)-(36) Results - temporal order task. [Figures: rate of success versus sequence length (50 to 250) for MSGD, MSGD-C (gradient clipping) and MSGD-CR (clipping plus the regularizer), with sigmoid, basic tanh and smart tanh units.]

(37) Results - natural tasks. [Results figure only.]

(38) How to construct Deep Recurrent Neural Networks. Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Yoshua Bengio. International Conference on Learning Representations 2014.

(39) Gist of this work. [Figure: schematic transition diagrams comparing DT-RNN, DOT-RNN, DOT(s)-RNN, the operator view and stacked RNNs, in terms of the connections between x_t, h_{t-1}, h_t, z_t and y_t.]
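As a rough illustration of the distinction (a sketch under my own choice of sizes and nonlinearities, not the paper's exact parameterization): a stacked RNN adds depth by piling recurrent layers, while a deep-transition RNN (DT-RNN) adds depth inside the hidden-to-hidden transition itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 3, 5

def dense(shape):
    return rng.normal(scale=0.3, size=shape)

# Stacked RNN: two recurrent layers; the second consumes the first layer's new state.
W1, U1 = dense((n_hid, n_hid)), dense((n_hid, n_in))
W2, U2 = dense((n_hid, n_hid)), dense((n_hid, n_hid))

def stacked_step(h1, h2, x):
    h1_new = np.tanh(W1 @ h1 + U1 @ x)
    h2_new = np.tanh(W2 @ h2 + U2 @ h1_new)
    return h1_new, h2_new

# Deep-transition RNN: a single recurrent state, but the transition from
# (h(t-1), x(t)) to h(t) goes through an intermediate nonlinear layer.
Wa, Ua = dense((n_hid, n_hid)), dense((n_hid, n_in))
Wb = dense((n_hid, n_hid))

def deep_transition_step(h, x):
    intermediate = np.tanh(Wa @ h + Ua @ x)
    return np.tanh(Wb @ intermediate)

h1 = h2 = h = np.zeros(n_hid)
for t in range(10):
    x_t = rng.normal(size=n_in)
    h1, h2 = stacked_step(h1, h2, x_t)
    h = deep_transition_step(h, x_t)
print("stacked top state:     ", h2)
print("deep-transition state: ", h)
```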

(40)-(45) Overview of contributions.
- The efficiency of deep feedforward models with piece-wise linear activation functions.
- The relationship between a few optimization techniques for deep learning, with a focus on understanding natural gradient.
- The importance of saddle points for optimization algorithms applied to deep learning.
- Training Echo-State Networks to exhibit short term memory.
- Training Recurrent Networks with gradient based methods to exhibit short term memory.
- How one can construct deep Recurrent Networks.


(46) Thank you!

