Statistical Learning: Decision Modeling and Deep Learning (RCP209)
Neural Networks and Deep Learning: History & Modern Deep Learning

(1)

Nicolas Thome

http://cedric.cnam.fr/vertigo/Cours/ml2/

Département Informatique

Conservatoire National des Arts et Métiers (Cnam)

(2)

Outline

1 Deep Learning History
Deep Learning Strengths
Deep Learning Weaknesses
Deep Learning Revival

2 Modern Deep Learning

(3)

Deep Learning: Expressiveness

MLP: Universal Function Approximators

● Neural network with a single hidden layer ⇒ universal approximator
Can approximate any continuous function on compact subsets of Rⁿ [Cybenko, 1989]

E.g. for classification: any decision boundary can be expressed

⇒very rich modeling capacities

(4)

Deep Learning: Expressiveness

● 2 layers, i.e. one hidden layer, are enough... theoretically

BUT: this may require an exponential number of hidden units [Barron, 1993]

● Challenge is NOT fitting training data

Simple models already have very large (infinite) modeling power

● Challenge: optimization, overfitting

(5)

Deep Learning: Expressiveness & Compactness

● Deeper Models: less units required

Functions representable compactly with k layers may require exponential size with k−1 layers [Hastad, 1989, Bengio, 2009]

Digit recognition, from [Goodfellow et al., 2016]

Same modeling power, fewer parameters

⇒better generalization!

(6)

Inductive Bias in Deep Learning

● Deep models: hierarchy of sequential layers

● Layers: fully connected, convolution + non-linearity (together: a convolution layer), pooling

● Convolutional architecture: prior knowledge, aka INDUCTIVE BIAS

Deep learning: feature design ⇒ architecture design

(7)

Inductive Bias in Deep Learning

ConvNets & Prior Distribution

● Prior: imposing a distribution on the fully connected parameters

● Weak prior: high entropy (uncertainty); strong prior: low entropy

● Infinitely strong prior: zero probability on some parameters

● ConvNet ∼ infinitely strong prior on fully connected net weights

● Convolution: local interactions, shared weights ⇒ zero probability elsewhere

(8)

Inductive Bias in Deep Learning

ConvNets as Inductive Bias

● ConvNet ∼ infinitely strong prior on fully connected net weights

● Convolution ⇒ supports learning translation-equivariant features

● Pooling ⇒ supports features invariant (stable) wrt local translations
Very rich modeling capacities: local interactions ⇒ global with depth
Significantly reduces # parameters ⇒ reduces over-fitting

From [Goodfellow et al., 2016]

(9)

Inductive Bias in Deep Learning

ConvNet for Learning Compositions

● Conv/Pool hierarchies: feature composition
Depth: gradual complexity, larger spatial extent
Intuitive processing for hierarchical information modeling
Biological foundations: simple cells, complex cells

(10)

Inductive Bias in Deep Learning

ConvNet for Learning Compositions

● Hierarchical Compositions
Low-level: edges, color
Mid-level: corners, parts
Higher levels: objects, scene concepts

● Distributed Representations: sharing
Lower levels: shared by many classes
Higher levels: more class-specific

(11)

Deep Learning: Representation Learning

Latent representations: learned features

≠ Handcrafted features for the task
≠ Handcrafted kernels in kernel methods (SVM)

Example: classification with K classes

● Last hidden layer: R^L → R^K

● In R^L, linear separation is required

● Deep Learning: learning representations that gradually project data to R^L spaces where linear separation is possible

(12)

Deep Learning & Manifold Untangling

Manifold Untangling

Credit: DiCarlo

● DL: gradually projecting data to R^L spaces where linear separation is possible

● This is the definition of manifold untangling!

● ConvNets: inductive bias making manifold untangling easier!

(13)

Manifold Untangling Visualization

● We want to visualize each layer activation for each class

● How to visualize high-dimensional data?

⇒ Projection to a lower-dimensional (e.g. 2D) space

(14)

t-distributed Stochastic Neighbor Embedding (t-SNE)

● t-SNE [van der Maaten and Hinton, 2008]: non-linear projection

● Intuitively: small distances in the initial space ⇒ small distances in the projected (2D) space
Not global distance preservation, but neighborhood preservation, i.e. preservation of small distances

(15)

t-SNE [van der Maaten and Hinton, 2008]

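● Similarity between points (x_i, x_j) in the initial space, e.g. R^d:

$p_{ij} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma^2\right)}{\sum_{k \neq l} \exp\left(-\|x_k - x_l\|^2 / 2\sigma^2\right)}, \qquad P = \{p_{ij}\}_{(i,j) \in N \times N}$

● Similarity between points (y_i, y_j) in the projected space, e.g. R²:

$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}, \qquad Q = \{q_{ij}\}_{(i,j) \in N \times N}$

● Loss function: Kullback-Leibler divergence KL(P∣∣Q)

$C = KL(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$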
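The similarities and the loss above can be sketched in NumPy as follows. This is a minimal illustration, assuming a single bandwidth σ shared by all points and a normalization over all pairs k ≠ l, as written on the slide; the reference t-SNE implementation instead tunes one σ per point from a perplexity target. The function names and the random data are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(Z):
    # Squared Euclidean distances between all rows of Z.
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def p_matrix(X, sigma=1.0):
    # Gaussian similarities in the initial space, normalized over all pairs k != l.
    W = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W / W.sum()

def q_matrix(Y):
    # Student-t similarities in the projected space, same normalization.
    W = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(W, 0.0)
    return W / W.sum()

def kl_divergence(P, Q, eps=1e-12):
    # C = sum_{i != j} p_ij log(p_ij / q_ij); entries with p_ij = 0 contribute 0.
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))   # toy points in the initial space (assumed data)
Y = rng.normal(size=(20, 2))   # toy 2D projection (assumed data)
P, Q = p_matrix(X), q_matrix(Y)
cost = kl_divergence(P, Q)
```

The cost is zero only when Q matches P, so minimizing it over Y pushes the 2D neighborhoods towards the original ones.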

(16)

t-SNE Visualization: MNIST example

● MNIST dataset: 28×28 grayscale images of digits

● 10 classes ⇔ digit ∈ {0, …, 9}

● Input space dimension: 28²=784

● Projection in 2d (3d) space for visualization

● t-SNE for computing the projection: gradient descent

$\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)\left(1 + \|y_i - y_j\|^2\right)^{-1}$

● Optimization (projection) for a given closed dataset

⇒transductive learning
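A minimal sketch of this optimization, assuming a symmetric similarity matrix P that sums to 1 over all pairs (here a hypothetical random matrix rather than one computed from real data), plain gradient descent, and an arbitrary step size. Practical implementations add momentum and early exaggeration, which are omitted here.

```python
import numpy as np

def sq_dists(Z):
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def q_matrix(Y):
    # Returns the normalized Student-t similarities Q and the unnormalized kernel W.
    W = 1.0 / (1.0 + sq_dists(Y))
    np.fill_diagonal(W, 0.0)
    return W / W.sum(), W

def cost(P, Y):
    Q, _ = q_matrix(Y)
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

def grad(P, Y):
    # dC/dy_i = 4 * sum_j (p_ij - q_ij) (y_i - y_j) / (1 + ||y_i - y_j||^2)
    Q, W = q_matrix(Y)
    coeff = (P - Q) * W                      # shape (N, N)
    diff = Y[:, None, :] - Y[None, :, :]     # shape (N, N, 2)
    return 4.0 * np.sum(coeff[:, :, None] * diff, axis=1)

rng = np.random.default_rng(0)
N = 30
A = rng.random((N, N))
P = A + A.T                    # hypothetical symmetric similarities
np.fill_diagonal(P, 0.0)
P /= P.sum()

Y = 1e-2 * rng.normal(size=(N, 2))   # small random initial projection
history = [cost(P, Y)]
for _ in range(200):
    Y = Y - 1.0 * grad(P, Y)         # step size 1.0 is an arbitrary choice
    history.append(cost(P, Y))
```

Because the positions Y are the optimized variables themselves, a new point cannot be projected without re-running the optimization, which is why the method is transductive.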

(17)

t-SNE Visualization: MNIST example

● Application of t-SNE to the MNIST test set (10,000 images)

● Color⇔class ID

(18)

t-SNE Visualization: MNIST example

● Classes visually appear in 2D space, BUT overlap

● How to measure class separability?

Neighborhood Hit [Paulovich et al., 2008]:

$NH = \frac{\#\,\text{pts in kNN of the same class}}{\#\,\text{pts in kNN}}$
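As an illustration, the neighborhood hit can be computed with a brute-force k-nearest-neighbor search. This is a sketch under assumptions: the score is averaged over all points, k = 5 is an arbitrary choice, and the two toy point clouds are synthetic.

```python
import numpy as np

def neighborhood_hit(Y, labels, k=5):
    # Average, over all points, of the fraction of the k nearest neighbors
    # (the point itself excluded) that share the point's class label.
    diff = Y[:, None, :] - Y[None, :, :]
    D = np.sum(diff ** 2, axis=-1)
    np.fill_diagonal(D, np.inf)            # a point is not its own neighbor
    knn = np.argsort(D, axis=1)[:, :k]     # indices of the k nearest neighbors
    same = labels[knn] == labels[:, None]
    return float(same.mean())

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
separated = rng.normal(size=(100, 2)) + 10.0 * labels[:, None]  # two distant blobs
mixed = rng.normal(size=(100, 2))                               # classes fully overlap
nh_separated = neighborhood_hit(separated, labels)
nh_mixed = neighborhood_hit(mixed, labels)
```

A value close to 1 indicates well-separated classes; with two balanced, fully overlapping classes the score stays near 0.5.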

(19)

t-SNE Visualization: MNIST example

● How to measure class separability?

Fitting ellipses to the points of each class
Non-overlapping ellipses ⇒ linear separability

(20)

80’s: LeNet 5 Model

● Total # parameters∼60000

● Evaluation on MNIST: test error of 0.95%

● Successful deployment for postal code reading in the US

(21)

LeNet 5 Model: Manifold Untangling

[Figure panels: input space; latent space]

(22)

LeNet 5 Model: Manifold Untangling

[Figure panels: latent space (MLP); latent space (LeNet)]

(24)

Deep Neural Networks: Weaknesses & Drawbacks

Criticisms at two main levels

1 Modeling level: Neural Networks ⇔ Black Boxes

2 Training level: ad hoc, expertise, efficiency, guarantees

(25)

Deep Neural Networks: Black Boxes

● Lack of explainability: why this decision?

Hidden units are not directly interpretable ≠ other models, e.g. decision trees, expert systems

⇒ Challenges: human-machine interaction, failure analysis

(26)

Deep Neural Networks: Black Boxes

● Lack of theory for architecture design

● How many layers, neurons?

● Layer type: fully connected, convolution, pooling?

● Trial and error: optimize the architecture on a validation set

⇒Ad hoc, no theory to guide you

(27)

Deep Neural Networks: Training Issues

● Optimization: non-convex objective
No guarantee of reaching the global optimum
Solution depends on initialization
Importance of (random) initialization ⇒ training reproducibility
Stochastic training: noisy gradient
Expertise: ad hoc hyper-parameter tuning: # epochs, decay, optimizers (next week), etc.
Costly tuning

(28)

Deep Neural Networks: Training Issues

● Deep models need huge annotated datasets

⇒Huge models, huge computational demand

⇒ For a long time, it was impossible to train such models with existing resources

● Smaller datasets: inferior predictive performance
Small models: not enough expressive power
Large models: overfit

⇒ Performance below that of handcrafted features

(29)

Deep Learning: Trends and methods in the last four decades

90’s: start of winter for deep learning

● Deep neural nets = "black magic", black boxes
Lack of interpretability
Optimization issues for highly non-convex objective functions

● Golden age of kernel methods
Generalization theory with Support Vector Machines
Extension to non-linear models: kernel trick
Kernels encode prior knowledge (structure) on data
Convex optimization problem

(30)

Deep Learning: Trends and methods in the last four decades

2000’s: Bag of Words Model (BoW)

● Started from the Information Retrieval (IR) community

● Text classification : document as a histogram of word occurrences

● BoW representation as input for powerful classifiers, e.g. SVM

(31)

2000’s: Bag of Words Model

● Adapting the BoW model for visual recognition? ⇒ Bag of Visual Words (BoV)

● Main challenge: definition of visual words unclear!

● Solution: compute a dictionary on local image regions (clustering)
Local regions represented by handcrafted descriptors, e.g. SIFT

● 2000's: BoW + SVM state-of-the-art

● Many works on kernels on BoW, coding & pooling, up to 2012

(33)

Deep Learning: Trends and methods in the last four decades

Deep Learning renewal since 2006

● 2006: new unsupervised learning for Deep Belief Nets (DBN) [Hinton et al., 2006]

● Theoretical results for improving model quality with depth

● Unsupervised training used as init for supervised learning with back-prop

(34)

Deep Learning and ConvNet for Speech Recognition

● First DL breakthrough on large datasets: speech recognition

● Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition, Dahl et al. (2010)

(35)

Deep Learning and ConvNet for Image Classification

● ImageNet ILSVRC Challenge (Stanford):

1,200,000 training images, 1,000 classes, mono-label
Based on the WordNet hierarchy (ontology)

Evaluation: top-5 error

● Up to 2012, leading approaches: BoW + SVM

● ILSVRC'12: the deep revolution ⇒ outstanding success of ConvNets [Krizhevsky et al., 2012]

(36)

2012: the deep revolution

Deep ConvNet success at ILSVRC'12
Two main practical reasons:

1 Huge number of labeled images (10⁶ images)
Possible to train very large models without over-fitting
Larger models can learn rich (semantic) feature hierarchies

2 GPU implementation for training
Relatively cheap and fast GPUs
Training time reduced to 1-2 weeks (up to 50x speed-up)

(37)

AlexNet [Krizhevsky et al., 2012] in ILSVRC’12

● 60,000,000 parameters

● 650,000 neurons - 630,000,000 connections

● 5 convolutional layers, 3 Fully Connected (FC)

Convolution layer: convolution + non-linearity (ReLU) + pooling

Full = FC + non-linearity; final FC: 4096-dim

● Trained on 2 GPUs for a week

(38)

AlexNet [Krizhevsky et al., 2012] in ILSVRC’12

First Convolutional Layer

● Input images: 227x227x3

● Filter (receptive field) size F = 11, stride S = 4

● 96 filters ⇒ output size 55*55*96 = 290,400 neurons, since (227 − 11)/4 + 1 = 55

● Each filter: 11*11*3 = 363 weights + 1 bias = 364 params

N.B.: convolution over the whole feature map depth (cf. LeNet 5 discussion)

● # params: 96 * 364 = 34,944
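The arithmetic above can be checked with two small helper functions. The helper names are hypothetical, and zero padding is assumed for this first layer.

```python
def conv_output_size(input_size, filter_size, stride, padding=0):
    # Spatial output size of a convolution along one dimension.
    return (input_size - filter_size + 2 * padding) // stride + 1

def conv_param_count(filter_size, in_channels, num_filters):
    # Each filter spans the full input depth and has one bias.
    per_filter = filter_size * filter_size * in_channels + 1
    return num_filters * per_filter

out = conv_output_size(227, 11, 4)       # 55
neurons = out * out * 96                 # 290400
params = conv_param_count(11, 3, 96)     # 34944
print(out, neurons, params)
```

The layer has roughly eight times more output neurons than parameters, which is a direct effect of weight sharing.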

(39)

AlexNet [Krizhevsky et al., 2012] in ILSVRC’12

Credit: R. Fergus

(43)

Deep Learning in 2012: Representation Learning

Deep: more semantic features

(44)

AlexNet [Krizhevsky et al., 2012] in ILSVRC’12

● Same global architecture as older nets, e.g. LeNet
Trained with back-prop and stochastic gradient descent

● But bigger (deeper and wider): 60×10⁶ parameters vs 60×10³
Needs more data (10⁶ vs 10⁴)
GPU implementation for fast training

● Also some architectural and optimization improvements (see next):
Non-linearity: ReLU vs sigmoid
Overlapping pooling, Local Response Normalisation (LRN)
Regularization: data augmentation, dropout

(45)

Outline

1 Deep Learning History

2 Modern Deep Learning
Modern Non-Linearities
Modern Training
Modern Architectural Components

(46)

Modern Non-Linear Activation Modules

● Standard non-linear activation functions, e.g. sigmoid, tanh

● Saturating regime

⇒Vanishing gradient: no back-prop

⇒Slow convergence

(47)

Rectified Linear Unit (ReLU)

$\mathrm{ReLU}(z) = \begin{cases} z & \text{if } z \geq 0 \\ 0 & \text{otherwise} \end{cases} \;=\; \max\{0, z\}$

(48)

Rectified Linear Unit (ReLU)

● Reduces vanishing gradient problems ⇒ faster learning / convergence

● Ex: 4-layer ConvNet, CIFAR-10

⇒ ReLU vs tanh: 6x speed-up

From [Krizhevsky et al., 2012]

(49)

Non-Linear Activation Modules

Sigmoid

● Saturation

● Expensive

● Not zero-centered

Tanh

● Saturation

● Expensive

● Zero-centered

ReLU

● No saturation

● Very efficient

● Not zero-centered

● Negative activations ignored

(50)

Non-Linear Activation Modules

● ReLU: 0 for negative inputs⇒blocked gradient

● ReLU variants:

From [Gu et al., 2015]

● Leaky ReLU (LReLU): λ empirically predefined

● Parametric ReLU (PReLU): λ_k learned from data

● Randomized ReLU (RReLU): λ_k^n uniformly sampled

● Exponential Linear Unit (ELU): λ fixed
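A minimal NumPy sketch of some of these activations, following the usual definitions (the default values of λ are assumptions, not values from the slides). It also shows the gradient that flows back for negative inputs: zero for ReLU, λ for Leaky ReLU.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def leaky_relu(z, lam=0.01):
    # Identity for z > 0, small fixed slope lam for z < 0.
    return np.maximum(z, 0.0) + lam * np.minimum(z, 0.0)

def elu(z, lam=1.0):
    # Identity for z > 0, lam * (exp(z) - 1) for z <= 0 (saturates at -lam).
    return np.where(z > 0, z, lam * (np.exp(np.minimum(z, 0.0)) - 1.0))

def relu_grad(z):
    # Gradient is blocked (0) for negative inputs.
    return (z > 0).astype(float)

def leaky_relu_grad(z, lam=0.01):
    # A small gradient lam still flows for negative inputs.
    return np.where(z > 0, 1.0, lam)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(z), leaky_relu(z, 0.1), elu(z))
print(relu_grad(z), leaky_relu_grad(z, 0.1))
```

PReLU uses the same formula as Leaky ReLU with λ treated as a trainable parameter, and RReLU samples λ at random during training.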

(51)

Non-Linear Activations: Conclusion

● ReLU non-linearity: training speed-up
Used in AlexNet at ImageNet'12
Now the default activation for essentially every network

(53)

Pre-Processing Modules

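● Normalization between input neurons known to help training

● Idea: enforcing a fixed distribution

Data set $X = \{x^j\}$, $j \in \{1; N\}$, $x^j = \{x_i^j\} \in R^m$

● Centering: subtraction of the mean $\mu_i$ of every individual feature $x_i^j$

$\mu_i = \frac{1}{N} \sum_{j=1}^{N} x_i^j \;\Rightarrow\; x_i^{N,j} = x_i^j - \mu_i$

● Normalization: centering + division by the standard deviation $\sigma_i$

$\sigma_i^2 = \frac{1}{N} \sum_{j=1}^{N} (x_i^j - \mu_i)^2 \;\Rightarrow\; x_i^{N,j} = \frac{x_i^j - \mu_i}{\sigma_i}$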
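Centering and normalization amount to a few NumPy operations along the sample axis. The data below are synthetic, with arbitrary per-feature means and scales chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# N = 1000 samples, m = 3 features with very different means and scales (toy data).
X = rng.normal(loc=[5.0, -2.0, 0.5], scale=[3.0, 0.1, 1.0], size=(1000, 3))

mu = X.mean(axis=0)        # per-feature mean, computed over the N samples
sigma = X.std(axis=0)      # per-feature std (1/N convention, as in the formula)

X_centered = X - mu
X_normalized = (X - mu) / sigma
```

In practice mu and sigma are estimated on the training set and reused unchanged on validation and test data.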

(54)

Pre-Processing Modules

More advanced processings:

● De-correlation: centering + alignment of the data with the principal axes of the covariance matrix

● Whitening: de-correlation, then division by the std along each axis ⇒ N(0, 1)

● In practice, not used with ConvNets
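A minimal sketch of de-correlation and whitening through an eigen-decomposition of the covariance matrix (PCA whitening). The correlated toy data and the small constant added before the division are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000
# Correlated toy data: a random linear mix of independent Gaussian features.
X = rng.normal(size=(N, 3)) @ rng.normal(size=(3, 3))

Xc = X - X.mean(axis=0)                  # centering
cov = Xc.T @ Xc / N                      # covariance matrix (m x m)
eigvals, eigvecs = np.linalg.eigh(cov)   # principal axes of the data

X_decorrelated = Xc @ eigvecs                        # diagonal covariance
X_white = X_decorrelated / np.sqrt(eigvals + 1e-8)   # unit variance on each axis

cov_white = X_white.T @ X_white / N      # close to the identity matrix
```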

(55)

Weight Initialization

● Non-convex deep learning objective ⇒ parameter init important

● Zero init: all neurons have the same output, thus the same gradient

● Random init with small numbers, e.g. uniform or $W \sim \mathcal{N}(0, \sigma^2)$

● Input layer x with m neurons, output s: $Var[s] = m\, Var[w]\, Var[x]$
Xavier init: $W \sim \frac{1}{\sqrt{m}} \mathcal{N}(0, 1)$, so that Var[s] = Var[x] [Glorot and Bengio, 2010]

(56)

Weight Initialization Example

Activation Histogram

Credit: Sullivan

● 10-layer net, 500 nodes at each layer

● Tanh activation, parameters init $W \sim \mathcal{N}(0, \sigma^2)$

● σ small: activations may collapse to 0, not a good init

(57)

Weight Initialization Example

Activation Histogram

Credit: Sullivan

● 10-layer net, 500 nodes at each layer

● Tanh activation, parameters init $W \sim \mathcal{N}(0, \sigma^2)$

● σ large: activations may saturate at ±1

⇒ vanishing gradient

(58)

Weight Initialization Example

● Tanh activation, Xavier init $W \sim \frac{1}{\sqrt{500}} \mathcal{N}(0, 1)$

Activation Histogram

Credit: Sullivan

● Helps to control activation variance through depth
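The three regimes of this example can be reproduced with a short simulation: a 10-layer, 500-unit tanh network without biases, fed with standard Gaussian inputs. The specific "small" and "large" standard deviations (0.01 and 1.0), the batch size, and the seed are arbitrary choices for this sketch.

```python
import numpy as np

def activation_stds(weight_std, depth=10, width=500, n_samples=1000, seed=0):
    # Propagate random inputs through `depth` tanh layers and record
    # the standard deviation of the activations after each layer.
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_samples, width))
    stds = []
    for _ in range(depth):
        W = weight_std * rng.normal(size=(width, width))
        x = np.tanh(x @ W)
        stds.append(float(x.std()))
    return stds

small = activation_stds(0.01)                  # activations shrink towards 0
large = activation_stds(1.0)                   # activations saturate at +/-1
xavier = activation_stds(1.0 / np.sqrt(500))   # variance roughly preserved

for name, stds in [("small", small), ("large", large), ("xavier", xavier)]:
    print(name, [round(s, 3) for s in stds])
```

With the Xavier scaling the activation spread decreases only slowly with depth, while the other two settings end up at the degenerate values 0 or ±1.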

(59)

Weight Initialization with ReLU

● Input layer x with m neurons, output s: $Var[s] = \frac{m}{2}\, Var[w]\, Var[x]$ [He et al., 2015]

$W \sim \frac{1}{\sqrt{m/2}} \mathcal{N}(0, 1)$
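A small simulation illustrates why ReLU calls for a different scaling. It assumes the standard He et al. (2015) choice Var[w] = 2/m, compared with Var[w] = 1/m (Xavier), on a 10-layer, 500-unit ReLU network without biases; sizes and seed are arbitrary.

```python
import numpy as np

def relu_preactivation_stds(weight_std, depth=10, width=500, n_samples=1000, seed=0):
    # Record the std of the pre-activations s = x W at each layer of a ReLU network.
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_samples, width))
    stds = []
    for _ in range(depth):
        W = weight_std * rng.normal(size=(width, width))
        s = x @ W
        stds.append(float(s.std()))
        x = np.maximum(s, 0.0)     # ReLU discards half of the signal energy
    return stds

m = 500
xavier = relu_preactivation_stds(np.sqrt(1.0 / m))   # variance halves at each layer
he = relu_preactivation_stds(np.sqrt(2.0 / m))       # variance stays roughly constant

print("xavier:", [round(s, 3) for s in xavier])
print("he:    ", [round(s, 3) for s in he])
```

The factor 2 compensates for the fact that ReLU zeroes out about half of the pre-activations.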

(60)

Training: Data-Augmentation

● Jittering, mirroring, color perturbation, rotation, stretching, shearing, lens distortions, etc. of the original images

● Increases # training samples, adds robustness to irrelevant variations

● Done in train AND in test

(61)

Training: Dropout [Hinton et al., 2012]

● Randomly omit each hidden unit with probability p, e.g. p = 0.5

● Regularization technique, limits over-fitting (better generalization)
Prevents co-adaptation, i.e. features only helpful when other specific features are present
May be viewed as averaging over many NNs
Slower convergence

(62)

Training: Dropout [Hinton et al., 2012]

● Training: dropout layer easily differentiable, freezing some weight updates

● What to do at test time?
Sample many different architectures, average output distributions
Faster alternative: use all hidden units, after halving their outgoing weights (for p = 0.5)
Equivalent to the geometric mean in the case of a single hidden layer
Pretty good approximation for multiple layers
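The two test-time options can be compared on a single linear output layer. This is a sketch under assumptions: the hidden activations and output weights are random toy values, and the comparison is made on the linear outputs, where averaging over sampled masks matches the weight-scaling rule in expectation (the geometric-mean statement on the slide concerns the output distributions instead).

```python
import numpy as np

def dropout_train(h, p, rng):
    # Omit each hidden unit independently with probability p.
    mask = rng.random(h.shape) >= p
    return h * mask

def dropout_test(h, p):
    # Use all units, scaled by the keep probability (1 - p);
    # for p = 0.5 this is the same as halving the outgoing weights.
    return h * (1.0 - p)

rng = np.random.default_rng(0)
h = rng.random((1, 8))             # toy hidden activations
W_out = rng.normal(size=(8, 3))    # toy outgoing weights
p = 0.5

# Option 1: sample many thinned networks and average their outputs.
samples = np.stack([dropout_train(h, p, rng) @ W_out for _ in range(20000)])
mc_average = samples.mean(axis=0)

# Option 2: a single pass with rescaled units.
fast = dropout_test(h, p) @ W_out
print(mc_average, fast)
```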

(63)

Dropout: Conclusion

● Dropout: important for limiting over-fitting
Used in AlexNet at ImageNet'12
Common in current architectures, especially in FC layers

(65)

Local Response/Contrast Normalization

● Normalize each value wrt its spatial neighbors N(i, j)

Credit: A. Vedaldi

● Local equalization effect

● Helps learning more invariant representations

⇒regularization, better generalization

(66)

Normalization: Local Feature Normalization

● Normalize value wrt neighbors in different feature maps

● Operates at each spatial position independently

● Feature groups: subset of maps (sliding window)

● ∼Lateral inhibition

(67)

Batch Normalization (BN) [Ioffe and Szegedy, 2015]

● Recap init: fixed input distribution known to help training

● Training deep neural networks: the distribution of hidden layers is unknown and changes over training time ⇒ covariate shift

● Importance of init,e.g.Xavier

● Batch Normalization (BN):

↓importance of init,↓covariate shift

(68)

Batch Normalization (BN) [Ioffe and Szegedy, 2015]

● Normalize the input feature distribution to ∼ N(0, 1)
Normalization across each mini-batch:

$\mu_B = \frac{1}{N} \sum_{i=1}^{N} x_i$

$\sigma_B^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_B)^2$

$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$, with ε for numerical stability

● Is an input feature distribution ∼ N(0, 1) a good idea?

Activations may never "saturate", e.g. sigmoid or tanh

Keeping them in the linear regime makes depth useless: ∼ global linear model

(69)

Batch Normalization (BN) [Ioffe and Szegedy, 2015]

● Scale and shift: $y_i = \gamma \hat{x}_i + \beta$, with (γ, β) trained

● Apply after FC / conv and before non-linearity

● Batch Normalization differentiable

(70)

Batch Normalization (BN) [Ioffe and Szegedy, 2015]

● Applying BN at test time?

⇒Use train set statistics

● BN Strengths
Faster training convergence vs covariate shift
Regularization & generalization

⇒ better performance

● BN Conclusion: especially important for very deep models, e.g. ResNet (see next)
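The BN forward pass of the previous slides can be sketched as follows for a fully connected layer (one statistic per feature, computed over the mini-batch). Passing the training statistics explicitly at test time is a simplification: frameworks usually keep running averages of the mini-batch statistics instead. The function names and ε = 1e-5 are assumptions.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # x: (N, m) mini-batch. Normalize each feature over the batch,
    # then apply the trained scale (gamma) and shift (beta).
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, mu, var

def batch_norm_test(x, gamma, beta, mu, var, eps=1e-5):
    # At test time, reuse statistics estimated on the training data.
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x_train = rng.normal(loc=3.0, scale=2.0, size=(64, 4))   # toy mini-batch
gamma, beta = np.ones(4), np.zeros(4)

y_train, mu, var = batch_norm_train(x_train, gamma, beta)
x_test = rng.normal(loc=3.0, scale=2.0, size=(5, 4))
y_test = batch_norm_test(x_test, gamma, beta, mu, var)
```

With γ = 1 and β = 0 the output is simply the normalized input; training these two parameters lets the network recover any other mean and scale, including the original ones.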

(71)

Pooling Modules

Overlapping Pooling. Ex: pooling size 5, stride s = 2
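A 1D sketch of overlapping max pooling with the sizes of this example (window 5, stride 2): since the stride is smaller than the window, consecutive windows share 3 inputs. The helper name and the toy input are hypothetical.

```python
import numpy as np

def max_pool_1d(x, size, stride):
    # Max over sliding windows of length `size`, moved by `stride`.
    n_out = (len(x) - size) // stride + 1
    return np.array([x[i * stride : i * stride + size].max() for i in range(n_out)])

x = np.arange(11)                  # 0, 1, ..., 10
pooled = max_pool_1d(x, size=5, stride=2)
# Windows: [0..4], [2..6], [4..8], [6..10]
print(pooled)                      # [ 4  6  8 10]
```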

Credit: K. Matsui

(72)

Pooling Modules

Pooling across feature maps

● Aggregation for a given spatial position, between different tensor maps

● Tensor maps (filter output) associated to a given transformation⇒ invariance with max pooling

(73)

Pooling across feature maps

● Ex: scaling [Kanazawa et al., 2014]

(74)

Pooling across feature maps

● Ex: rotation [Marcos et al., 2017]

(75)

Locally Connected vs Convolution Layers

● Locally connected: different features detected across image positions

● Successful in specific contexts, e.g. DeepFace [Taigman et al., 2014]

(76)

References I

[Barron, 1993] Barron, A. R. (1993).

Universal approximation bounds for superpositions of a sigmoidal function.

Information Theory, IEEE Transactions on, 39(3):930–945.

[Bengio, 2009] Bengio, Y. (2009).

Learning deep architectures for AI.

Found. Trends Mach. Learn., 2(1):1–127.

[Bengio and Delalleau, 2011] Bengio, Y. and Delalleau, O. (2011).

On the expressive power of deep architectures.

In Proceedings of the 22nd International Conference on Algorithmic Learning Theory, ALT'11, pages 18–36, Berlin, Heidelberg. Springer-Verlag.

[Cybenko, 1989] Cybenko, G. (1989).

Approximation by superpositions of a sigmoidal function.

Mathematics of control, signals and systems, 2(4):303–314.

[Glorot and Bengio, 2010] Glorot, X. and Bengio, Y. (2010).

Understanding the difficulty of training deep feedforward neural networks.

In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS’10).

[Goodfellow et al., 2016] Goodfellow, I., Bengio, Y., and Courville, A. (2016).

Deep Learning.

MIT Press.

http://www.deeplearningbook.org.

[Gu et al., 2015] Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., Liu, T., Wang, X., and Wang, G. (2015).

Recent advances in convolutional neural networks.

CoRR, abs/1512.07108.

[Hastad, 1989] Hastad, J. (1989).

Almost optimal lower bounds for small depth circuits.

In RANDOMNESS AND COMPUTATION, pages 6–20. JAI Press.

(77)

References II

[He et al., 2015] He, K., Zhang, X., Ren, S., and Sun, J. (2015).

Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification.

In International Conference on Computer Vision (ICCV).

[Hinton et al., 2006] Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006).

A fast learning algorithm for deep belief nets.

Neural Comput., 18(7):1527–1554.

[Hinton et al., 2012] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012).

Improving neural networks by preventing co-adaptation of feature detectors.

CoRR, abs/1207.0580.

[Ioffe and Szegedy, 2015] Ioffe, S. and Szegedy, C. (2015).

Batch normalization: Accelerating deep network training by reducing internal covariate shift.

In Bach, F. R. and Blei, D. M., editors, ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 448–456. JMLR.org.

[Kanazawa et al., 2014] Kanazawa, A., Sharma, A., and Jacobs, D. W. (2014).

Locally scale-invariant convolutional neural networks.

In Deep Learning and Representation Learning Workshop: NIPS 2014.

[Krizhevsky et al., 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012).

ImageNet classification with deep convolutional neural networks.

In Advances in neural information processing systems, pages 1097–1105.

[Marcos et al., 2017] Marcos, D., Volpi, M., Komodakis, N., and Tuia, D. (2017).

Rotation equivariant vector field networks.

In The IEEE International Conference on Computer Vision (ICCV).

[Paulovich et al., 2008] Paulovich, F. V., Nonato, L. G., Minghim, R., and Levkowitz, H. (2008).

Least square projection: A fast high-precision multidimensional projection technique and its application to document mapping.

IEEE Trans. Vis. Comput. Graph., 14(3):564–575.

[Taigman et al., 2014] Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014).

DeepFace: Closing the gap to human-level performance in face verification.

In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

(78)

References III

[van der Maaten and Hinton, 2008] van der Maaten, L. and Hinton, G. E. (2008).

Visualizing high-dimensional data using t-SNE.

Journal of Machine Learning Research, 9:2579–2605.