Master TRIED
Reconnaissance des formes et méthodes neuronales (US330X) - Neural Networks and Deep Learning
Nicolas Thome
Conservatoire National des Arts et Métiers (Cnam), Laboratoire CEDRIC - MSDMA team
1 Modern Deep Architectures
Modern Architectural Components
Modern Macro Architectures
2 Modern Optimization
3 Transfer Learning
Local Response/Contrast Normalization
• Normalize value wrt spatial neighbors N (i,j)
Credit: A. Vedaldi
• Effective at local equalization
Normalization: Local Feature Normalization
• Normalize value wrt neighbors in different feature maps
• Operates at each spatial position independently
• Feature groups: subset of maps (sliding window)
• ∼ Lateral inhibition
Batch Normalization (BN) [Ioffe and Szegedy, 2015]
• Recap on initialization: a fixed, known input distribution helps training
• Training deep neural networks: the distribution of hidden-layer inputs is unknown and changes during training ⇒ covariate shift
• Hence the importance of initialization, e.g. Xavier
• Batch Normalization (BN):
↓ importance of init, ↓ covariate shift
Batch Normalization (BN) [Ioffe and Szegedy, 2015]
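• Normalize input feature distribution to ∼ N(0,1)
Normalization across each mini-batch of size N:
$\mu_B = \frac{1}{N} \sum_{i=1}^{N} x_i$
$\sigma_B^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_B)^2$
$\hat{x}_i = \frac{x_i - \mu_B}{\sigma_B + \epsilon}$, with $\epsilon$ for numerical stability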
• Is an input feature distribution ∼ N(0,1) a good idea?
Activations may never "saturate", e.g. sigmoid or tanh
Staying in the linear regime makes depth useless: ∼ global linear model
Batch Normalization (BN) [Ioffe and Szegedy, 2015]
• Scale and shift: $y_i = \gamma \hat{x}_i + \beta$, with $(\gamma, \beta)$ trained
• Apply after FC / conv and before non-linearity
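The normalization and scale-and-shift steps can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `batch_norm_train`, the value of `eps`, and the toy input are assumptions, and the denominator follows the slide's form $\sigma_B + \epsilon$.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch x of shape (N, D) feature-wise, then scale and shift."""
    mu = x.mean(axis=0)                # mu_B: per-feature mini-batch mean
    sigma = x.std(axis=0)              # sigma_B: per-feature std (divides by N)
    x_hat = (x - mu) / (sigma + eps)   # eps only guards against division by zero
    return gamma * x_hat + beta        # y_i = gamma * x_hat_i + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(64, 4))   # badly scaled activations (toy data)
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.std(axis=0))   # per-feature mean and std after normalization
```

With $\gamma = 1$ and $\beta = 0$ the output is simply the normalized activation; training $(\gamma, \beta)$ lets the network move away from N(0,1) when that helps.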
Batch Normalization (BN) [Ioffe and Szegedy, 2015]
• Applying BN at test time? ⇒ Use training-set statistics
• BN strengths:
Faster training convergence (reduced covariate shift)
Regularization & generalization
⇒ better performance
• BN conclusion: especially important for very deep models, e.g. ResNet (see next)
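One common way to obtain training-set statistics for test time is to keep running averages of the mini-batch mean and standard deviation during training. The sketch below assumes this strategy; the class name `BatchNorm1D`, the `momentum` value, and the toy data are illustrative choices, not taken from the slides.

```python
import numpy as np

class BatchNorm1D:
    """Toy BN layer: batch statistics in training, running statistics at test time."""

    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.gamma, self.beta = np.ones(dim), np.zeros(dim)
        self.running_mu, self.running_sigma = np.zeros(dim), np.ones(dim)
        self.momentum, self.eps = momentum, eps

    def forward(self, x, training):
        if training:
            mu, sigma = x.mean(axis=0), x.std(axis=0)
            # Exponential moving averages approximate the training-set statistics.
            self.running_mu = self.momentum * self.running_mu + (1 - self.momentum) * mu
            self.running_sigma = self.momentum * self.running_sigma + (1 - self.momentum) * sigma
        else:
            mu, sigma = self.running_mu, self.running_sigma
        return self.gamma * (x - mu) / (sigma + self.eps) + self.beta

bn = BatchNorm1D(dim=4)
rng = np.random.default_rng(0)
for _ in range(200):                                   # simulated training batches
    bn.forward(rng.normal(3.0, 5.0, size=(64, 4)), training=True)
single = bn.forward(np.full((1, 4), 3.0), training=False)   # one test sample
```

At test time the output for a single sample no longer depends on which other samples share its batch.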
Padding Modules
• Filter of size H: problem at the borders, over a margin of $\lfloor H/2 \rfloor$ pixels
• Solution 1: reduce processing to the computable area ⇒ decreased output size
• Solution 2: padding, i.e. fill the missing information with (arbitrary) values
Zero-padding, border replication, mirroring, etc.
Zero-Padding
• To avoid shrinking the spatial extent of the network rapidly
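The effect of padding on the output size, and the three padding modes listed above, can be illustrated with NumPy. The helper `conv_output_size` and the 3×3 toy image are assumptions made for this sketch.

```python
import numpy as np

def conv_output_size(n, h, pad=0, stride=1):
    """Output length for an input of length n, filter size h, symmetric padding pad."""
    return (n + 2 * pad - h) // stride + 1

print(conv_output_size(32, 5))          # no padding: the output shrinks
print(conv_output_size(32, 5, pad=2))   # pad = floor(5/2): the size is preserved

img = np.arange(9).reshape(3, 3)
zero = np.pad(img, 1, mode="constant")   # zero-padding
copy = np.pad(img, 1, mode="edge")       # replicate border values
mirror = np.pad(img, 1, mode="reflect")  # mirror around the border
```

Choosing `pad = H // 2` with stride 1 and odd H keeps the spatial size unchanged, which is what prevents the rapid shrinking mentioned above.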
Pooling Modules
Overlapping pooling, e.g. pooling size 5, stride s = 2
Pooling Modules
Pooling across feature maps
• Aggregation for a given spatial position, between different tensor maps
• Tensor maps (filter outputs) associated with a given transformation ⇒ invariance with max pooling
Pooling across feature maps
• Ex: scaling [Kanazawa et al., 2014]
Pooling across feature maps
• Ex: rotation [Marcos et al., 2017]
Locally Connected vs Convolution Layers
• Locally connected: different features detected across image positions
• Successful in specific contexts, e.g. DeepFace [Taigman et al., 2014]
1 Modern Deep Architectures
Modern Macro Architectures
Deep Learning since 2012
More & more data (Facebook: $10^9$ images / day)
ILSVRC since 2012: larger & larger networks
Deep Learning since 2012: ImageNet’13
• [Zeiler and Fergus, 2014] (ZF): archi∼AlexNet’12 (conv+FC)
• With Local Contrast Normalization (LCN) after conv/pool
Deep Learning since 2012: ImageNet’14
ImageNet’14: VGG [Simonyan and Zisserman, 2014]
Still Conv + FC, BUT:
• No pooling between some conv layers
• Convolution stride 1
ImageNet’14: VGG [Simonyan and Zisserman, 2014]
• 3x3 convolutions:
two stacked 3x3 convs ∼ one 5x5 conv (same receptive field, fewer parameters)
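The equivalence between two stacked 3x3 convolutions and one 5x5 convolution can be checked with a short calculation. The helper names and the channel count `C = 64` are assumptions for illustration; biases are ignored.

```python
def conv_params(k, c_in, c_out):
    """Number of weights of a k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def stacked_receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

C = 64                                   # assumed number of channels in and out
rf_two_3x3 = stacked_receptive_field([3, 3])
two_3x3 = 2 * conv_params(3, C, C)
one_5x5 = conv_params(5, C, C)
print(rf_two_3x3, two_3x3, one_5x5)      # 5 73728 102400
```

For the same 5x5 receptive field, the stacked version uses 18C² weights instead of 25C², and inserts one extra non-linearity between the two layers.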
ImageNet’14: GoogLeNet [Szegedy et al., 2015]
• GoogLeNet global archi: three main components
1 ’Stem’∼ [Conv-Pool] Layer
2 Inception modules: Networks in Networks
3 Auxiliary classifiers
GoogLeNet: ’Stem’ ∼ [Conv-Pool] Layers
GoogLeNet: Inception Module
• Inspired from Network in Network (NiN) idea [Lin et al., 2013]
• Each conv layer: linear + non-linearity
• NiN block: hierarchy of conv layers
1×1 conv ∼ MLP ⇒ universal approximator, more expressive power for each patch
GoogLeNet: 1 × 1 Convolution
GoogLeNet: Inception Module
• 1×1, 3×3, 5×5 ⇒multi-scale
• Optimal filter size, pooling or not: learned
• 1×1: dimensionality reduction
⇒reasonable # parameters
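A 1×1 convolution applies the same linear map to the channel vector at every spatial position, which is why it can reduce dimensionality cheaply before a larger filter. The sketch below illustrates this; the tensor shapes (28×28×256 reduced to 64 channels) are assumptions chosen for the example, not the actual GoogLeNet configuration.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: x of shape (H, W, C_in), w of shape (C_in, C_out)."""
    return x @ w   # the same matrix is applied at each spatial position

rng = np.random.default_rng(0)
x = rng.normal(size=(28, 28, 256))
w = rng.normal(size=(256, 64))
reduced = conv1x1(x, w)                  # shape (28, 28, 64)

# Weights of a 5x5 conv producing 256 maps, with and without a 1x1 reduction first.
direct = 5 * 5 * 256 * 256
bottleneck = 1 * 1 * 256 * 64 + 5 * 5 * 64 * 256
print(direct, bottleneck)                # 1638400 425984
```

Under these assumed sizes, the 1×1 reduction cuts the weight count of the 5×5 branch by almost a factor of four.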
ImageNet’14: GoogLeNet
• Two auxiliary classifiers
ImageNet’14: GoogLeNet
• Output classifier: Global Average Pooling (GAP) ⇒ no FC!
Drastic ↓ in # params!
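Global average pooling replaces each feature map by its spatial mean. The sketch below shows the operation and compares the size of a 1000-class linear classifier placed on flattened features versus on GAP features; the 7×7×1024 feature shape is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(7, 7, 1024))     # last conv output (H, W, C), assumed shape
gap = features.mean(axis=(0, 1))             # one value per feature map: shape (1024,)

n_classes = 1000
params_on_flattened = 7 * 7 * 1024 * n_classes   # classifier on flattened maps
params_on_gap = 1024 * n_classes                 # classifier on GAP features
print(gap.shape, params_on_flattened, params_on_gap)
```

Under these assumptions the classifier on GAP features has 49 times fewer weights than the one on flattened maps.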
Deep Learning since 2012
ILSVRC since 2012: larger & larger networks
Issue: Training Deeper Networks
• BUT: deeper nets have worse performance
Ex: plain nets: stacking 3×3 conv
Ex: deeper VGG ⇒ VGG56 < VGG20
Issue: Training Deeper Networks
• Not a generalization issue: training error is also higher!
• General phenomenon, observed in many datasets
Issue: Training Deeper Networks
• 18 layers (left) vs 34 layers (right)
• Deeper counterpart: richer solution space
⇒ should not have higher training error
• Construction: copy from a shallower model
Extra layers: set as identity
Identity Mapping with Plain Blocks
• BUT: optimization challenge ⇒ solvers cannot find this solution when going deeper...
• H(x): desired mapping, to be fit by 2 weight layers
• Multiple non-linear layers ⇒ identity mapping difficult
Identity Mapping with Residual Blocks [He et al., 2016]
• H(x) desired mapping, 2 layers fit F(x) ⇒ H(x) = F(x) + x
• If identity is optimal, easy to set weights to 0
• If optimal mapping closer to identity, easier to find small fluctuations
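The residual formulation can be sketched with two fully connected layers standing in for the two conv layers of the block (an assumption made to keep the example short). With all weights at zero, F(x) = 0 and the block reduces to the identity on non-negative inputs.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, w1, w2):
    """H(x) = F(x) + x, where F is two weight layers with a ReLU in between."""
    f = relu(x @ w1) @ w2      # F(x): the residual to be learned
    return relu(f + x)         # shortcut connection, then non-linearity

rng = np.random.default_rng(0)
x = np.abs(rng.normal(size=(2, 8)))      # non-negative input, e.g. output of a previous ReLU
zero_w = np.zeros((8, 8))
out = residual_block(x, zero_w, zero_w)  # zero weights: the block is the identity
```

A plain block would need its stacked non-linear layers to reproduce x exactly, which is a much harder target for the solver than driving weights toward zero.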
Residual Block
ImageNet’15: Residual Networks (ResNet) [He et al., 2016]
• Ex for VGG-style: 3x3 conv
ResNet-34
Architecture
• Stride 2 at some layers: spatial size /2, # filters x2
⇒same complexity per layer
• When sizes change across a residual connection: 1x1 conv on the shortcut (dotted)
• Trained from scratch, standard hyper-parameters & augmentation
ResNet: Results
CIFAR-10 results
• Deep ResNets trained without difficulty:
Deeper Models ⇒training & testing error lower
ResNet: Results
ImageNet
• Deep ResNets trained without difficulty:
Deeper Models ⇒training & testing error lower
ResNet: Conclusion
• ResNet: training deeper models
• Below 4% top-5 error at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)’15!
2 Modern Optimization
Beyond Stochastic Gradient Descent (SGD)
• Gradient descent optimization:
$w_{t+1} = w_t - \eta \frac{\partial f}{\partial w}(w_t) = w_t - \eta \nabla f(w_t)$
• 1st issue: objective f changes quickly in one direction and slowly in another
• 2nd issue: Stochastic Gradient Descent (SGD)
Beyond Stochastic Gradient Descent (SGD)
• Poor conditioning of the Hessian matrix, i.e. large condition number (largest/smallest singular value)
• Gradient descent: very slow progress along shallow directions
Beyond Stochastic Gradient Descent (SGD)
• Function $f(w) = \sum_{i=1}^{N} f_i(w)$, e.g. $f_i(w) = \ell_{CE}(\hat{y}_i, y_i^*)$
• SGD: approximation of the true gradient, e.g. for a mini-batch of size B:
$\nabla f(w_t) \approx \frac{1}{B} \sum_{i=1}^{B} \frac{\partial f_i}{\partial w}(w_t)$
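The mini-batch approximation can be illustrated on a toy problem. The sketch below uses a squared loss on synthetic linear data in place of the cross-entropy (an assumption made so the gradient has a closed form), and compares the mini-batch average with the average over all N samples.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, B = 10_000, 5, 256                 # dataset size, dimension, mini-batch size (assumed)
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true
w = np.zeros(D)                          # current parameters w_t

def mean_gradient(Xb, yb, w):
    """Average gradient of f_i(w) = 0.5 * (x_i . w - y_i)^2 over the given samples."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

full = mean_gradient(X, y, w)                        # uses all N samples
idx = rng.choice(N, size=B, replace=False)           # one random mini-batch
mini = mean_gradient(X[idx], y[idx], w)              # uses only B samples
rel_err = np.linalg.norm(mini - full) / np.linalg.norm(full)
print(rel_err)                                       # noisy but close estimate
```

The estimate costs B/N of the full computation, at the price of noise that the methods on the following slides try to smooth out.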
Momentum
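• In all SGD variants/improvements: $w_{t+1} = w_t - \Delta_{t+1}$
$\Delta_{t+1}$: update vector $w_t \to w_{t+1}$
Ex: gradient descent: $\Delta_{t+1} = \eta \nabla f(w_t)$
• Momentum: keep a memory of previous gradients, e.g. running average:
$\Delta_{t+1} = \eta \nabla f(w_t) + \gamma \Delta_t$, $\gamma \in [0, 1)$ (e.g. 0.5, 0.9)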
Momentum
• $\Delta_t = v_t$ ∼ velocity, inertia or memory: $w_{t+1} = w_t - v_{t+1}$
Dimensions with oscillating gradient directions ⇒ $v_t$ damped
Dimensions with small but consistent gradient direction ⇒ $v_t$ increased
• More robust to local minima/saddle points, poor conditioning and noisy gradients
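The benefit of momentum under poor conditioning can be seen on a two-dimensional quadratic with one shallow and one steep direction. The objective, learning rate, and step count below are assumptions chosen for the illustration; setting `gamma = 0` recovers plain gradient descent.

```python
import numpy as np

curv = np.array([1.0, 50.0])             # curvatures: condition number 50 (assumed)

def f(w):
    return 0.5 * np.sum(curv * w**2)

def grad(w):
    return curv * w

def run(eta, gamma, steps=200):
    w, delta = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(steps):
        delta = eta * grad(w) + gamma * delta   # Delta_{t+1} = eta * grad + gamma * Delta_t
        w = w - delta                           # w_{t+1} = w_t - Delta_{t+1}
    return f(w)

loss_gd = run(eta=0.02, gamma=0.0)       # plain gradient descent
loss_mom = run(eta=0.02, gamma=0.9)      # same learning rate, with momentum
print(loss_gd, loss_mom)
```

Plain gradient descent is held back by the shallow direction, where consistent gradients let the momentum term build up speed.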
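Nesterov Accelerated Gradient (NAG)
• NAG [Sutskever et al., 2013]
• ∼ Momentum, but compute the gradient at the position predicted by $\Delta_t$:
$\Delta_{t+1} = \eta \nabla f(w_t - \gamma \Delta_t) + \gamma \Delta_t$, $\gamma \sim 0.9$
$w_{t+1} = w_t - \Delta_{t+1}$
[Figure: update vectors, Momentum vs NAG]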
Nesterov Accelerated Gradient (NAG) [Sutskever et al., 2013]
With $x_t = w_t - \gamma \Delta_t$, more convenient update rule:
• $\Delta_{t+1} = \eta \nabla f(x_t) + \gamma \Delta_t$
• $x_{t+1} = x_t + \gamma \Delta_t - (\gamma + 1) \Delta_{t+1}$
• ⊕: anticipatory update ⇒ prevents too large updates and overshooting
• ⊕: increased responsiveness to the landscape of the loss function f
[Figure: trajectories, Momentum vs NAG]
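The two NAG formulations can be checked against each other numerically. The sketch below runs both on a toy quadratic (the objective and hyper-parameters are assumptions) and verifies that $x_t = w_t - \gamma \Delta_t$ holds throughout.

```python
import numpy as np

curv = np.array([1.0, 50.0])             # toy quadratic f(w) = 0.5 * sum(curv * w^2)

def grad(w):
    return curv * w

eta, gamma = 0.01, 0.9
w, delta_w = np.array([1.0, 1.0]), np.zeros(2)   # original (w, Delta) form
x, delta_x = w.copy(), np.zeros(2)               # look-ahead form, x_0 = w_0

for _ in range(50):
    # Original form: gradient evaluated at the predicted position.
    delta_w = eta * grad(w - gamma * delta_w) + gamma * delta_w
    w = w - delta_w
    # Convenient form: gradient evaluated directly at x_t.
    new_delta = eta * grad(x) + gamma * delta_x
    x = x + gamma * delta_x - (gamma + 1) * new_delta
    delta_x = new_delta

print(np.allclose(x, w - gamma * delta_w))       # both forms describe the same iterates
```

The second form is convenient in practice because the gradient is computed at the stored variable itself, with no extra look-ahead evaluation.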
Optimization Schemes: Conclusion
• First-order methods, e.g. Momentum, NAG: better convergence
• Learning rate adaptation:
Adapting updates to each individual parameter $w_i$
⇒ larger/smaller updates depending on the landscape of the cost function
Context: sparse data, varying parameter scales in deep networks (cf. batch norm)
Family of algorithms with adaptive learning rates:
AdaGrad, AdaDelta, RMSProp, Adam
Adaptive Gradient (Adagrad) [Duchi et al., 2011]
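• Adaptive update rule for each parameter, of the form:
$w_i^{t+1} = w_i^t - \Delta_i^{t+1}$
• Definition: $g_i^t = \frac{\partial f}{\partial w_i}(w^t)$, i.e. the gradient along dimension i
• $G_i^t = \sum_{\tau=1}^{t} (g_i^\tau)^2$ ⇒ memory of squared gradients over time
$\sqrt{G_i^t}$: $\ell_2$ norm of the gradients for $\tau \in \{1, \dots, t\}$
$G_i^t \leftrightarrow E[g_i^2]$: 2nd moment, uncentered variance
• Adagrad ($\epsilon \sim 10^{-8}$, $\eta \sim 10^{-2}$):
$\Delta_i^{t+1} = \frac{\eta}{\sqrt{G_i^t + \epsilon}} g_i^t = \eta' g_i^t$
Intuition: large $G_i^t$ ⇒ ↓ $\eta'$, small $G_i^t$ ⇒ ↑ $\eta'$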
Adaptive Gradient (Adagrad) [Duchi et al., 2011]
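• Vectorial form: $w^{t+1} = w^t - \Delta^{t+1}$, with $\Delta^{t+1} = \frac{\eta}{\sqrt{G^t + \epsilon}} \odot g^t$
• SHORTCOMING: very aggressive learning rate decay
• $(g_i^t)^2 \geq 0$ ⇒ $G_i^t$ monotonically increasing
⇒ $\eta' = \frac{\eta}{\sqrt{G_i^t + \epsilon}} \to 0$
• w updates stop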
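The Adagrad rule and its shrinking effective learning rate can be sketched as follows. The toy quadratic objective and the step count are assumptions; $\eta$ and $\epsilon$ use the orders of magnitude given above, with $\epsilon$ placed inside the square root.

```python
import numpy as np

curv = np.array([1.0, 50.0])             # toy quadratic f(w) = 0.5 * sum(curv * w^2)

def f(w):
    return 0.5 * np.sum(curv * w**2)

def grad(w):
    return curv * w

def adagrad(w, eta=1e-2, eps=1e-8, steps=100):
    G = np.zeros_like(w)                 # per-parameter sum of squared gradients
    rates = []
    for _ in range(steps):
        g = grad(w)
        G = G + g**2                     # monotonically increasing
        eta_eff = eta / np.sqrt(G + eps) # per-parameter effective learning rate eta'
        w = w - eta_eff * g
        rates.append(eta_eff)
    return w, np.array(rates)

w0 = np.array([1.0, 1.0])
w_final, rates = adagrad(w0)
print(rates[0], rates[-1])               # effective rates only shrink over time
```

The steep dimension (large gradients) receives a smaller effective rate than the shallow one, but both rates keep decaying, which is the shortcoming noted above.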
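Root Mean Square Propagation (RMSProp) [Tieleman and Hinton, 2012]
• RMSProp idea: compute the average of $(g_i^t)^2$ only over a recent time window:
$\tilde{G}_i^t = \rho \tilde{G}_i^{t-1} + (1 - \rho)(g_i^t)^2$, $\rho \sim 0.9$
$\Delta_i^{t+1} = \frac{\eta}{\sqrt{\tilde{G}_i^t + \epsilon}} g_i^t = \eta' g_i^t$
• RMSProp: same update rule as Adagrad ($\tilde{G}_i^t \leftrightarrow G_i^t$)
⇒ $\tilde{G}_i^t$ NOT monotonically increasing
⇒ less aggressive learning rate decay
• Final parameter update: $w_i^{t+1} = w_i^t - \Delta_i^{t+1}$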
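The difference between the two accumulators can be seen by feeding both a constant gradient: the RMSProp average saturates at $g^2$ while the Adagrad sum keeps growing. The function name `rmsprop_step` and the constant-gradient setup are assumptions made for this sketch.

```python
import numpy as np

def rmsprop_step(w, g, G, eta=1e-2, rho=0.9, eps=1e-8):
    """One RMSProp update; G is the running average of squared gradients."""
    G = rho * G + (1 - rho) * g**2
    w = w - eta / np.sqrt(G + eps) * g
    return w, G

g = np.array([2.0])                      # constant gradient (toy setting)
w = np.array([0.0])
G_rms = np.zeros(1)                      # RMSProp accumulator
G_ada = np.zeros(1)                      # Adagrad accumulator, for comparison
for _ in range(200):
    w, G_rms = rmsprop_step(w, g, G_rms)
    G_ada = G_ada + g**2

print(G_rms, G_ada)                      # bounded average vs ever-growing sum
```

Because the running average stays bounded, the effective learning rate settles around $\eta / |g|$ instead of decaying to zero.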
AdaDelta [Zeiler, 2012]
• Essentially: AdaDelta = RMSProp + momentum
• Compute $g_i^t$ and $\tilde{G}_i^t$ as in RMSProp (local average of previous $(g_i^t)^2$):
$\tilde{G}_i^t = \rho \tilde{G}_i^{t-1} + (1 - \rho)(g_i^t)^2$, $\rho \sim 0.9$
• New AdaDelta update vector:
$u_i^{t+1} = \frac{\sqrt{U_i^t + \epsilon}}{\sqrt{\tilde{G}_i^t + \epsilon}} g_i^t = \eta' g_i^t$
• New term $U_i^t$ in the numerator compared to RMSProp:
∼ Momentum (acceleration term) accumulating prior updates: $U_i^{t+1} = \rho U_i^t + (1 - \rho)(u_i^{t+1})^2$
• Final parameter update:
$w_i^{t+1} = w_i^t - u_i^{t+1}$
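AdaDelta [Zeiler, 2012]
$u_i^{t+1} = \frac{\sqrt{U_i^t + \epsilon}}{\sqrt{\tilde{G}_i^t + \epsilon}} g_i^t = \frac{\mathrm{RMS}(\text{updates})}{\mathrm{RMS}(\text{gradients})} g_i^t$
• Approximate second order (Hessian H diagonal):
$w^{t+1} = w^t - H^{-1} g$ ⇒ $w_i^{t+1} \propto w_i^t - \frac{f'}{f''}$
Approximate $\frac{1}{f''}$ by $\frac{\mathrm{RMS}(\text{updates})}{\mathrm{RMS}(\text{gradients})}$
Homogeneous dimension update: $u_i^{t+1} \sim w$! ≠ SGD, momentum, RMSProp!
• Even no learning rate!
⊕ no learning rate parameter to tune
⊖ no way to design variable update speeds (e.g. fine-tuning)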
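A minimal sketch of the AdaDelta loop, following the order of operations in [Zeiler, 2012] (accumulate gradients, compute the update, accumulate updates, apply). The toy quadratic, the step count, and the value of `eps` are assumptions; note that no learning rate appears.

```python
import numpy as np

curv = np.array([1.0, 50.0])             # toy quadratic f(w) = 0.5 * sum(curv * w^2)

def f(w):
    return 0.5 * np.sum(curv * w**2)

def grad(w):
    return curv * w

def adadelta(w, rho=0.9, eps=1e-6, steps=100):
    G = np.zeros_like(w)                 # running average of squared gradients
    U = np.zeros_like(w)                 # running average of squared updates
    for _ in range(steps):
        g = grad(w)
        G = rho * G + (1 - rho) * g**2
        u = np.sqrt(U + eps) / np.sqrt(G + eps) * g   # RMS(updates) / RMS(gradients) * g
        U = rho * U + (1 - rho) * u**2
        w = w - u
    return w

w0 = np.array([1.0, 1.0])
w_final = adadelta(w0)
print(f(w0), f(w_final))                 # the loss decreases without any learning rate
```

The first updates are tiny because `U` starts at zero; their size is set by `eps` rather than by a learning rate, which illustrates the ⊖ above.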
Adaptive Moment Estimation (Adam) [Kingma and Ba, 2015]
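• Adam: 1st and 2nd gradient moment estimation
• Strong similarities to Adadelta: 2nd gradient moment + momentum
2nd gradient moment ∼ Adadelta/RMSProp with local accumulation:
$v_i^t = \beta_2 v_i^{t-1} + (1 - \beta_2)(g_i^t)^2$
1st gradient moment (mean), vs squared updates in Adadelta:
$m_i^t = \beta_1 m_i^{t-1} + (1 - \beta_1) g_i^t$
Overcomes Adadelta's limitation of moment estimates biased by their initialization:
$\hat{m}_i^t = \frac{m_i^t}{1 - (\beta_1)^t}$, $\hat{v}_i^t = \frac{v_i^t}{1 - (\beta_2)^t}$, where $(\beta)^t$ denotes $\beta$ to the power t
Final update: $w_i^{t+1} = w_i^t - \eta \frac{\hat{m}_i^t}{\sqrt{\hat{v}_i^t} + \epsilon}$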
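The Adam update with bias correction can be sketched as below. The default hyper-parameters are the ones suggested in [Kingma and Ba, 2015]; the toy quadratic and the function name `adam` are assumptions for illustration.

```python
import numpy as np

curv = np.array([1.0, 50.0])             # toy quadratic f(w) = 0.5 * sum(curv * w^2)

def grad(w):
    return curv * w

def adam(w, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=1):
    m = np.zeros_like(w)                 # 1st moment estimate (mean of gradients)
    v = np.zeros_like(w)                 # 2nd moment estimate (uncentered variance)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)       # bias correction for zero initialization
        v_hat = v / (1 - beta2**t)
        w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w

w0 = np.array([1.0, 1.0])
w1 = adam(w0, steps=1)
print(w0 - w1)                           # first step is about eta in every dimension
```

Thanks to the bias correction, the very first step has magnitude close to $\eta$ in each dimension regardless of the gradient scale (here 1 vs 50).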
Learning Rate Adaptation: Conclusion
• Per-dimension update: important in case of sparse data
• Adagrad, RMSProp, Adadelta and Adam: all use 2nd gradient moment (var) on denominator
• Adadelta, Adam: use ∼momentum on numerator
• Adam good default choice in many cases
3 Transfer Learning
Training Deep ConvNets on Small Datasets
• Ex: PASCAL VOC’07: 20 categories, 5000 training ex
• Training VGG from scratch vs handcrafted FV [Perronnin et al., 2010]
• Deep ≪ handcrafted (test mAP, %)
Training Deep ConvNets on Small Datasets
• ImageNet: deep ≫ handcrafted
• VOC’07: deep ≪ handcrafted
Not enough training ex
Complex images
Transfer Learning
• Idea: export knowledge from a source domain to a target domain
Source: good performance, e.g. many samples
Target: more challenging, e.g. few samples
⇒ Deep ConvNet good on ImageNet, but not as good on VOC’07
• Assumption: source and target classes are different but related
Ex: various breeds of cat (tabby, Persian) vs cat
ImageNet: tabby cat, Persian cat
VOC’07: cat
Transferring Representations learned from ImageNet
• Most naive transfer learning approach:
Load a ConvNet model pre-trained on ImageNet, e.g. VGG
Apply the ConvNet on each target dataset image, e.g. VOC
Extract a given layer activation: "Deep Features" (DF)
Deep Features (DF) for Classification
• Deep Feature (DF), e.g. fc7: use it as any visual descriptor
• Fc7: relevant for discriminating target classes, e.g. tabby/persian cat for cat
Deep Features (DF) for Classification
• Deep Feature (DF), e.g. fc7: visual descriptor
DF: highly non-linear feature
Can use a shallow ML model for target class prediction
Deep Features (DF) for Classification
Which layer to use for classification?
• Layers close to classification: specific to ImageNet
• Layers farther from classification: more generic features
⇒ Depends on similarity between target & ImageNet
Deep Features (DF) for Visual Recognition: Conclusion
• Deep Features: ConvNet success beyond ImageNet
⇒ No need for a huge dataset to use / train deep models
• Off-the-shelf features for any visual recognition task
• ImageNet: 1000 classes, large set of visual concepts
Transfer very well even to tasks with large domain shift,e.g.medical images
Deep Features (DF) and Transfer Learning for Classification
• Classification: adapting final layer to match target classes
1 Remove last layer ⇔ ImageNet classes
2 Add transfer layer with K target classes, e.g. K = 20 for VOC’07
Transfer Learning for multi-class Classification
• Multi-class classification: only one class per image, i.e. exclusive labels (e.g. ImageNet, MNIST)
• Ex for transfer in MIT-67 (museum)
Transfer Learning for multi-class Classification
• Transfer layer of K classes + soft-max activation
• $s_i = x_i W + b$, $\hat{y}_k \sim P(k \mid x_i, W, b) = \frac{e^{s_k}}{\sum_{k'=1}^{K} e^{s_{k'}}}$
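The soft-max transfer layer can be sketched in a few lines. The feature dimension (4096, as for fc7), K = 20, and the random weights are assumptions for illustration; subtracting the row maximum is a standard numerical-stability trick that leaves the result unchanged.

```python
import numpy as np

def softmax(s):
    """Row-wise soft-max over the K class scores."""
    s = s - s.max(axis=-1, keepdims=True)    # stability: does not change the output
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4096))               # deep features for 3 images (toy values)
W = 0.01 * rng.normal(size=(4096, 20))       # transfer layer weights, K = 20
b = np.zeros(20)
y_hat = softmax(x @ W + b)                   # class probabilities, one row per image
pred = y_hat.argmax(axis=1)                  # exclusive label: a single class per image
```

Each row of `y_hat` sums to one, which matches the exclusive-label assumption of multi-class classification.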
Multi-label Classification
• Several labels for a given image, e.g. dog + bicycle + car (PASCAL VOC)
• Classification: train K binary models predicting class presence / absence
Transfer Learning for multi-label Classification
• Transfer layer of K classes + sigmoid activation
• $s_i = x_i W + b$, $\hat{y}_k = [1 + e^{-\lambda s_k}]^{-1}$
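For multi-label classification each of the K outputs is an independent sigmoid, so several classes can be active at once. In the sketch below, the feature dimension, K = 20, the random weights, $\lambda = 1$, and the 0.5 decision threshold are all assumptions for illustration.

```python
import numpy as np

def sigmoid(s, lam=1.0):
    """Element-wise sigmoid with slope lam, one independent output per class."""
    return 1.0 / (1.0 + np.exp(-lam * s))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4096))               # deep features for 3 images (toy values)
W = 0.01 * rng.normal(size=(4096, 20))       # transfer layer weights, K = 20
b = np.zeros(20)
y_hat = sigmoid(x @ W + b)                   # per-class presence probabilities
present = y_hat > 0.5                        # assumed threshold: class present or absent
```

Unlike the soft-max case, the rows of `y_hat` are not constrained to sum to one.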
Deep Features (DF) and Transfer Learning for Classification
• Training strategy? Depends on:
a) Volume of the target dataset
b) Semantic proximity wrt ImageNet
Training Strategy: Pure Transfer
Training Strategy: Fine-Tuning
• Lower learning rate for fine-tuned (pre-trained) parameters than for parameters trained from scratch, e.g. 10× smaller
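The per-group learning rates can be sketched with a plain SGD step. The group names, the base learning rate, and the unit gradients are hypothetical; only the factor 10 comes from the slide.

```python
import numpy as np

base_lr = 1e-2                                   # assumed rate for the new layer
lr = {
    "pretrained_layers": base_lr / 10,           # fine-tuned: 10x smaller steps
    "new_transfer_layer": base_lr,               # trained from scratch
}
params = {name: np.ones(3) for name in lr}       # toy parameters
grads = {name: np.ones(3) for name in lr}        # toy gradients of equal size

# One SGD step with a group-specific learning rate.
updated = {name: params[name] - lr[name] * grads[name] for name in params}
step_sizes = {name: np.abs(params[name] - updated[name]).max() for name in params}
print(step_sizes)
```

For equal gradients, the pre-trained parameters move ten times less, which keeps the ImageNet representation close to its starting point while the new layer adapts.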
Training Strategy: Results
• Small datasets [$10^3$, $10^4$], semantically close to ImageNet
• Ex: VOC’07 (20 classes, 5000 ex)
Model               mAP (%)
VGG from scratch    ≈40
Handcrafted FV      ≈70
VGG pure transfer   ≈83
VGG fine-tuning     ≈85
• Fine Tuning >Transfer >>Handcrafted (BoW)
• From scratch low: not enough training data
Training Strategy: Results
• Medium/large datasets (≥ $10^5$), semantically close to ImageNet
• Ex: UPMC-food-101 (101 classes, ≈100,000 ex)
Model               mAP (%)
Handcrafted         ≈25
VGG pure transfer   ≈40
VGG from scratch    ≈53
VGG fine-tuning     ≈65
• From scratch does work (well !)
• Fine Tuning >>From scratch>>Transfer>> Handcrafted
Training Strategy: Results
• Medium/large datasets (≥ $10^5$), semantically far from ImageNet
• Ex: M2CAI’16 challenge: 22 videos, ∼$10^5$ images, 8 classes
Model               Accuracy (%)
VGG pure transfer   ≈60
VGG from scratch    ≈70
VGG fine-tuning     ≈80
• Fine Tuning >> From scratch >> Transfer
• Transfer already good baseline despite big visual content shift
Transfer Learning: Conclusion
                 Small visual shift     Large visual shift
Small dataset    Transfer top layers    Transfer lower layers
Larger dataset   Fine-tune top layers   Fine-tune all layers
• Small dataset, large visual shift: common in medical image classification
• How to implement transfer for localized tasks?
⇒ last course!
References
[Azizpour et al., 2016] Azizpour, H., Razavian, A. S., Sullivan, J., Maki, A., and Carlsson, S. (2016).
Factors of transferability for a generic convnet representation.
IEEE Trans. Pattern Anal. Mach. Intell., 38(9):1790–1802.
[Duchi et al., 2011] Duchi, J., Hazan, E., and Singer, Y. (2011).
Adaptive subgradient methods for online learning and stochastic optimization.
J. Mach. Learn. Res., 12:2121–2159.
[He et al., 2016] He, K., Zhang, X., Ren, S., and Sun, J. (2016).
Deep residual learning for image recognition.
In CVPR.
[Ioffe and Szegedy, 2015] Ioffe, S. and Szegedy, C. (2015).
Batch normalization: Accelerating deep network training by reducing internal covariate shift.
In Bach, F. R. and Blei, D. M., editors, ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 448–456. JMLR.org.
[Kanazawa et al., 2014] Kanazawa, A., Sharma, A., and Jacobs, D. W. (2014).
Locally scale-invariant convolutional neural networks.
In Deep Learning and Representation Learning Workshop: NIPS 2014.
[Kingma and Ba, 2015] Kingma, D. P. and Ba, J. (2015).
Adam: A method for stochastic optimization.
In ICLR, volume abs/1412.6980.
[Marcos et al., 2017] Marcos, D., Volpi, M., Komodakis, N., and Tuia, D. (2017).
Rotation equivariant vector field networks.
In The IEEE International Conference on Computer Vision (ICCV).
[Perronnin et al., 2010] Perronnin, F., Sánchez, J., and Mensink, T. (2010).
Improving the fisher kernel for large-scale image classification.
In Computer Vision–ECCV 2010, pages 143–156. Springer.
[Simonyan and Zisserman, 2014] Simonyan, K. and Zisserman, A. (2014).
Very deep convolutional networks for large-scale image recognition.
CoRR, abs/1409.1556.
[Sutskever et al., 2013] Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013).
On the importance of initialization and momentum in deep learning.
In Proceedings of the 30th International Conference on Machine Learning - Volume 28, ICML’13, pages III–1139–III–1147. JMLR.org.
[Szegedy et al., 2015] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015).
Going deeper with convolutions.
In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9. IEEE.
[Taigman et al., 2014] Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014).
Deepface: Closing the gap to human-level performance in face verification.
In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[Tieleman and Hinton, 2012] Tieleman, T. and Hinton, G. (2012).
RMSprop Gradient Optimization.
[Zeiler and Fergus, 2014] Zeiler, M. and Fergus, R. (2014).
Visualizing and understanding convolutional networks.
In ECCV 2014, volume 8689 of Lecture Notes in Computer Science, pages 818–833. Springer.
[Zeiler, 2012] Zeiler, M. D. (2012).
ADADELTA: an adaptive learning rate method.
CoRR, abs/1212.5701.