Master TRIED Reconnaissance des formes et méthodes neuronales (US330X) - Neural Networks and Deep Learning

(1)

Nicolas Thome

Conservatoire National des Arts et Métiers (Cnam), Laboratoire CEDRIC - équipe MSDMA

(2)

1 Modern Deep Architectures

Modern Architectural Components

Modern Macro Architectures

2 Modern Optimization

3 Transfer Learning

(3)

Local Response/Contrast Normalization

• Normalize value wrt spatial neighbors N (i,j)

Credit: A. Vedaldi

• Local equalizing effective

(4)

Normalization: Local Feature Normalization

• Normalize value wrt neighbors in different feature maps

• Operates at each spatial position independently

• Feature groups: subset of maps (sliding window)

• ∼ Lateral inhibition

(5)

Batch Normalization (BN) [Ioffe and Szegedy, 2015]

• Recap init: fixed input distribution known to help training

• Training deep neural networks: the input distribution of hidden layers is unknown and changes over training time ⇒ covariate shift

• Importance of init, e.g. Xavier

• Batch Normalization (BN):

↓ importance of init, ↓ covariate shift

(6)

Batch Normalization (BN) [Ioffe and Szegedy, 2015]

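• Normalize input feature distribution ∼ $\mathcal{N}(0,1)$

Normalization across each mini-batch of size $N$:

$\mu_B = \frac{1}{N}\sum_{i=1}^{N} x_i$

$\sigma_B^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu_B)^2$

$\hat{x}_i = \frac{x_i - \mu_B}{\sigma_B + \epsilon}$, with $\epsilon$ for numerical stability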

• Is an input feature distribution ∼ $\mathcal{N}(0,1)$ a good idea?

Activations may never "saturate", e.g. sigmoid or tanh

Staying in the linear regime makes depth useless: ∼ global linear model

(7)

Batch Normalization (BN) [Ioffe and Szegedy, 2015]

• Scale and shift: $y_i = \gamma \hat{x}_i + \beta$, with $(\gamma, \beta)$ trained

• Apply after FC / conv and before non-linearity

(8)

Batch Normalization (BN) [Ioffe and Szegedy, 2015]

• Applying BN at test time? ⇒Use train set statistics

• BN Strengths

Faster training convergence (counters covariate shift)

Regularization & generalization

⇒ better performance

• BN Conclusion: especially important for very deep models, e.g. ResNet (see next)
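The BN steps (mini-batch normalization, scale and shift, train-set statistics at test time) can be sketched in NumPy as follows. This is a minimal illustration, not the reference implementation: the function names, the `stats` dictionary and the moving-average rule used to accumulate train-set statistics are assumptions, and the division by $\sigma_B + \epsilon$ follows the formula of these slides.

```python
import numpy as np

def bn_train(x, gamma, beta, stats, momentum=0.9, eps=1e-5):
    """Normalize a mini-batch x of shape (N, D), then scale and shift."""
    mu = x.mean(axis=0)      # mu_B
    sigma = x.std(axis=0)    # sigma_B, sqrt of (1/N) * sum (x_i - mu_B)^2
    # Assumption: train-set statistics are tracked with a moving average.
    stats["mu"] = momentum * stats["mu"] + (1 - momentum) * mu
    stats["sigma"] = momentum * stats["sigma"] + (1 - momentum) * sigma
    x_hat = (x - mu) / (sigma + eps)
    return gamma * x_hat + beta

def bn_test(x, gamma, beta, stats, eps=1e-5):
    """At test time, reuse the statistics accumulated during training."""
    x_hat = (x - stats["mu"]) / (stats["sigma"] + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
D = 4
gamma, beta = np.ones(D), np.zeros(D)   # identity scale/shift for the demo
stats = {"mu": np.zeros(D), "sigma": np.ones(D)}
x = rng.normal(loc=3.0, scale=2.0, size=(64, D))
y = bn_train(x, gamma, beta, stats)     # per-feature mean ~0, std ~1
y_single = bn_test(x[:1], gamma, beta, stats)  # works on a single example
```

With $\gamma = 1$ and $\beta = 0$ the training output is standardized per feature; learned values of $(\gamma, \beta)$ let the network leave the linear regime of the activation if that helps.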

(9)

Padding Modules

• Filter of size $H$: problem at the borders, over a margin of $\lfloor H/2 \rfloor$

• Solution 1: reduce processing to computable area ⇒ decreased output size

• Solution 2: padding, i.e. fill missing info with (arbitrary) values

Zero-padding, recopy, mirror, etc.
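The two solutions can be illustrated on a 1-D signal with `np.pad`. The signal, the filter size and the variable names are hypothetical; the mapping of "recopy" to `mode="edge"` and "mirror" to `mode="reflect"` is an interpretation of the slide.

```python
import numpy as np

x = np.arange(1, 6)        # 1-D signal: [1 2 3 4 5]
H = 3                      # filter size
p = H // 2                 # border margin, floor(H / 2)

# Solution 1: keep only the computable area, so the output shrinks.
valid_len = len(x) - H + 1

# Solution 2: pad by p on each side, so the output keeps the input size.
zero = np.pad(x, p, mode="constant")   # zero-padding
recopy = np.pad(x, p, mode="edge")     # repeat the border value
mirror = np.pad(x, p, mode="reflect")  # mirror around the border value
same_len = len(zero) - H + 1
```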

(10)

Zero-Padding

• To avoid shrinking the spatial extent of the network rapidly

(11)

Pooling Modules

Overlapping pooling, e.g. pooling size 5, stride s = 2

(12)

Pooling Modules

Pooling across feature maps

• Aggregation for a given spatial position, between different tensor maps

• Tensor maps (filter outputs), each associated with a given transformation ⇒ invariance with max pooling
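A minimal sketch of pooling across feature maps: consecutive maps are grouped and the max is taken at each spatial position. The function name and the grouping of consecutive channels are assumptions made for illustration. If each map in a group responds to one transformed version of a pattern (e.g. one rotation), the pooled output does not depend on which version fired.

```python
import numpy as np

def cross_map_max_pool(fmaps, group_size):
    """Max over groups of consecutive maps. fmaps: (C, H, W) -> (C // group_size, H, W)."""
    C, H, W = fmaps.shape
    assert C % group_size == 0, "C must be a multiple of group_size"
    return fmaps.reshape(C // group_size, group_size, H, W).max(axis=1)

rng = np.random.default_rng(0)
fmaps = rng.normal(size=(4, 3, 3))      # 4 maps, i.e. 2 groups of 2
pooled = cross_map_max_pool(fmaps, 2)

# Swapping the maps inside each group leaves the pooled output unchanged.
swapped = fmaps[[1, 0, 3, 2]]
pooled_swapped = cross_map_max_pool(swapped, 2)
```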

(13)

Pooling across feature maps

• Ex: scaling [Kanazawa et al., 2014]

(14)

Pooling across feature maps

• Ex: rotation [Marcos et al., 2017]

(15)

Locally Connected vs Convolution Layers

• Locally connected: different features detected across image positions

• Successful in specific contexts,

e.g. DeepFace [Taigman et al., 2014]

(16)

Modern Architectural Components Modern Macro Architectures

3 Transfer Learning

(17)

Deep Learning since 2012

More & more data (Facebook: 10⁹ images / day)

ILSVRC since 2012: larger & larger networks

(18)

Deep Learning since 2012: ImageNet’13

• [Zeiler and Fergus, 2014] (ZF): archi∼AlexNet’12 (conv+FC)

• With Local Contrast Normalization (LCN) after conv/pool

(19)

Deep Learning since 2012: ImageNet’14

(20)

ImageNet’14: VGG [Simonyan and Zisserman, 2014]

Still Conv + FC, BUT:

• No pooling between some conv layers

• Convolution stride 1

(21)

ImageNet’14: VGG [Simonyan and Zisserman, 2014]

• 3x3 convolutions:

two stacked 3x3 conv ∼ one 5x5 conv (same receptive field, fewer parameters)
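The comparison between two 3x3 convolutions and one 5x5 convolution can be checked with a little arithmetic. The sketch below assumes stride 1, no bias, and the same number of channels `C = 64` at the input and output of every layer; these values and the helper names are illustrative.

```python
def conv_params(k, c_in, c_out):
    """Number of weights of a k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

C = 64  # illustrative channel count
params_two_3x3 = 2 * conv_params(3, C, C)   # 2 * 9 * C^2
params_one_5x5 = conv_params(5, C, C)       # 25 * C^2
rf_two_3x3 = receptive_field([3, 3])
rf_one_5x5 = receptive_field([5])
```

Both options see a 5x5 input window, but the stacked version uses 18 C² weights instead of 25 C² and inserts one extra non-linearity.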

(22)

ImageNet’14: GoogLeNet [Szegedy et al., 2015]

• GoogLeNet global archi: three main components

1 ’Stem’∼ [Conv-Pool] Layer

2 Inception modules: Networks in Networks

3 Auxiliary classifiers

(23)

GoogLeNet: ’Stem’ ∼ [Conv-Pool] Layers

(24)

GoogLeNet: Inception Module

• Inspired from Network in Network (NiN) idea [Lin et al., 2013]

• Each conv layer: linear + non-linearity

• NiN block: hierarchy of conv layers

1×1 conv ∼ MLP ⇒ universal approximator, more expressive power for each patch
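To make the "1×1 conv ∼ MLP" remark concrete, the sketch below shows that a 1×1 convolution applies the same linear layer to the channel vector of every spatial position. Shapes and names are assumptions chosen for the example.

```python
import numpy as np

def conv1x1(x, w, b):
    """1x1 convolution. x: (C_in, H, W), w: (C_out, C_in), b: (C_out,)."""
    return np.einsum("oc,chw->ohw", w, x) + b[:, None, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5, 5))    # 8 input maps
w = rng.normal(size=(3, 8))       # 3 output maps
b = rng.normal(size=3)
y = conv1x1(x, w, b)

# Same result as a fully connected layer applied pixel by pixel.
y_loop = np.empty_like(y)
for i in range(5):
    for j in range(5):
        y_loop[:, i, j] = w @ x[:, i, j] + b
```

Stacking several such layers with non-linearities in between gives a small MLP shared across positions, which is the NiN idea.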

(25)

GoogLeNet: 1 × 1 Convolution

(26)

GoogLeNet: Inception Module

• 1×1, 3×3, 5×5 ⇒multi-scale

• Optimal filter size, pooling or not: learned

• 1×1: dimensionality reduction

⇒reasonable # parameters

(27)

ImageNet’14: GoogLeNet

• Two auxiliary classifiers

(28)

ImageNet’14: GoogLeNet

• Output classifier: Global Average Pooling (GAP) ⇒ no FC!

Drastic ↓ # params!
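A minimal sketch of global average pooling and of the parameter saving it brings. GAP itself has no parameters; the figures `C = 1024`, `H = W = 7` and `K = 1000` are illustrative assumptions used only to count the weights a fully connected layer on the flattened maps would need.

```python
import numpy as np

def global_average_pool(fmaps):
    """Average each feature map over space. fmaps: (C, H, W) -> (C,)."""
    return fmaps.mean(axis=(1, 2))

C, H, W, K = 1024, 7, 7, 1000           # illustrative sizes
fc_on_flattened_maps = C * H * W * K    # weights of an FC layer on C*H*W inputs
gap_params = 0                          # GAP is parameter-free

pooled = global_average_pool(np.ones((C, H, W)))
```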

(29)

Deep Learning since 2012

ILSVRC since 2012: larger & larger networks

(30)

Issue: Training Deeper Networks

• BUT: deeper nets have worse performance

Ex: plain nets: stacking 3×3 conv

Ex: deeper VGG ⇒ VGG56 < VGG20

(31)

Issue: Training Deeper Networks

• Not a generalization issue: training error is also higher!

• General phenomenon, observed in many datasets

(32)

Issue: Training Deeper Networks

• 18 layers (left) vs 34 layers (right)

• Deeper counterpart: richer solution space

⇒ should not have higher training error

• Construction: copy from a shallower model

Extra layers: set as identity

(33)

Identity Mapping with Plain Blocks

• BUT: optimization challenge ⇒ solvers cannot find this solution when going deeper...

• $H(x)$: desired mapping to fit with 2 weight layers

• Multiple non-linear layers ⇒ identity mapping difficult

(34)

Identity Mapping with Residual Blocks [He et al., 2016]

• $H(x)$: desired mapping; the 2 layers fit $F(x)$ ⇒ $H(x) = F(x) + x$

• If identity optimal, easy to set weights as 0

• If optimal mapping closer to identity, easier to find small fluctuations

Residual Block
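The identity argument can be sketched with two small weight matrices. This is a simplified illustration under stated assumptions: dense layers instead of convolutions, no bias, no batch normalization, and no ReLU after the addition. With all weights at zero, the residual block returns its input, while the plain block returns zero.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def plain_block(x, W1, W2):
    """Two weight layers fitting H(x) directly."""
    return W2 @ relu(W1 @ x)

def residual_block(x, W1, W2):
    """Two weight layers fitting F(x); output H(x) = F(x) + x."""
    return plain_block(x, W1, W2) + x

d = 4
x = np.array([1.0, -2.0, 0.5, 3.0])
W_zero = np.zeros((d, d))

h_plain = plain_block(x, W_zero, W_zero)        # all zeros
h_residual = residual_block(x, W_zero, W_zero)  # equals x: identity mapping
```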

(35)

ImageNet’15: Residual Networks (ResNet) [He et al., 2016]

• Ex for VGG-style: 3x3 conv

ResNet-34

Architecture

• Stride 2 at some layers: spatial size /2, # filters x2

⇒same complexity per layer

• When sizes change across a residual connection: 1x1 conv on the shortcut (dotted)

• Trained from scratch, standard hyper-parameters & augmentation
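The "same complexity per layer" remark follows from counting multiplications: halving the spatial size divides the cost by 4, and doubling the number of input and output filters multiplies it by 4. The layer sizes below are illustrative assumptions.

```python
def conv_mults(h, w, c_in, c_out, k=3):
    """Multiplications of a k x k conv producing an h x w x c_out output."""
    return h * w * k * k * c_in * c_out

before = conv_mults(56, 56, 64, 64)     # before the stride-2 layer
after = conv_mults(28, 28, 128, 128)    # spatial size /2, # filters x2
```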

(36)

ImageNet’15: Residual Networks (ResNet) [He et al., 2016]

ResNet-34

(37)

ResNet: Results

CIFAR-10 results

• Deep ResNets trained without difficulty:

Deeper Models ⇒training & testing error lower

(38)

ResNet: Results

ImageNet

• Deep ResNets trained without difficulty:

Deeper Models ⇒training & testing error lower

(39)

ResNet: Conclusion

• ResNet: training deeper models

• Performance below 4% error at the ImageNet Large Scale Visual Recognition Challenge’15!

(40)

2 Modern Optimization

(41)

Beyond Stochastic Gradient Descent (SGD)

• Gradient descent optimization:

$w^{t+1} = w^t - \eta \frac{\partial f}{\partial w}(w^t) = w^t - \eta \nabla f(w^t)$

• 1st issue: objective $f$ changes quickly in one direction and slowly in another

• 2nd issue: Stochastic Gradient Descent (SGD)

(42)

Beyond Stochastic Gradient Descent (SGD)

• Poor conditioning of the Hessian matrix, i.e. large condition number (largest/smallest singular value)

• Gradient descent: very slow progress along shallow

(43)

Beyond Stochastic Gradient Descent (SGD)

• Function $f(w) = \sum_{i=1}^{N} f_i(w)$, e.g. $f_i(w) = \ell_{CE}(\hat{y}_i, y_i^*)$

• SGD: approximation of the true gradient, e.g. for a mini-batch:

$\nabla f(w^t) \approx \frac{1}{B} \sum_{i=1}^{B} \frac{\partial f_i}{\partial w}(w^t)$

(44)

Momentum

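• In all SGD variants/improvements: $w^{t+1} = w^t - \Delta^{t+1}$

$\Delta^{t+1}$: update vector $w^t \to w^{t+1}$

Ex: gradient descent: $\Delta^{t+1} = \eta \nabla f(w^t)$

• Momentum: use a memory of previous gradients, e.g. running average:

$\Delta^{t+1} = \eta \nabla f(w^t) + \gamma \Delta^t$, $\gamma \in [0, 1)$ (e.g. 0.5 or 0.9)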

(45)

Momentum

• $\Delta^t$: $v^t$ ∼ velocity, inertia or memory: $w^{t+1} = w^t - v^{t+1}$

Dimensions with oscillating gradient directions ⇒ $v^t$ damped

Dimensions with small but consistent gradient direction ⇒ $v^t$ increased

• More robust to local minima/saddle points, poor conditioning and noisy gradients
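A small experiment illustrates the effect of momentum under poor conditioning. The objective is a hypothetical quadratic $f(w) = \frac{1}{2}(w_1^2 + 50\,w_2^2)$, and the learning rate, $\gamma$ and number of steps are arbitrary choices for the sketch. Setting $\gamma = 0$ recovers plain gradient descent.

```python
import numpy as np

curv = np.array([1.0, 50.0])    # shallow and steep curvature

def grad(w):
    """Gradient of f(w) = 0.5 * sum(curv * w**2)."""
    return curv * w

def run(eta, gamma, steps=100):
    w = np.array([1.0, 1.0])
    delta = np.zeros(2)
    for _ in range(steps):
        delta = eta * grad(w) + gamma * delta   # update vector
        w = w - delta
    return w

w_gd = run(eta=0.03, gamma=0.0)    # plain gradient descent
w_mom = run(eta=0.03, gamma=0.9)   # momentum
```

On this example plain gradient descent is still far from the minimum along the shallow direction after 100 steps, while the momentum run ends closer to $w = 0$.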

(46)

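Nesterov Accelerated Gradient (NAG)

• NAG [Sutskever et al., 2013]

• ∼ Momentum, but compute the gradient at the position predicted by $\Delta^t$:

$\Delta^{t+1} = \eta \nabla f(w^t - \gamma \Delta^t) + \gamma \Delta^t$, $\gamma \sim 0.9$

$w^{t+1} = w^t - \Delta^{t+1}$

[Figure: update step, Momentum vs NAG]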

(47)

Nesterov Accelerated Gradient (NAG) [Sutskever et al., 2013]

With $x^t = w^t - \gamma \Delta^t$, more convenient update rule:

• $\Delta^{t+1} = \eta \nabla f(x^t) + \gamma \Delta^t$

• $x^{t+1} = x^t + \gamma \Delta^t - (\gamma + 1) \Delta^{t+1}$

• ⊕: anticipatory update ⇒ avoids too large updates and overshooting

• ⊕: Increased responsiveness to the landscape of the loss function $f$

[Figure: trajectories, Momentum vs NAG]
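The two NAG formulations can be checked against each other numerically. The sketch below runs both on a hypothetical quadratic objective with arbitrary $\eta$ and $\gamma$, and verifies that the change of variable $x^t = w^t - \gamma \Delta^t$ holds along the whole trajectory.

```python
import numpy as np

curv = np.array([1.0, 50.0])

def grad(w):
    """Gradient of the hypothetical objective f(w) = 0.5 * sum(curv * w**2)."""
    return curv * w

eta, gamma = 0.01, 0.9
w = np.array([1.0, 1.0])
d_w = np.zeros(2)              # Delta^t for the form in w
x = w - gamma * d_w            # x^0 = w^0 since Delta^0 = 0
d_x = np.zeros(2)              # Delta^t for the form in x
max_gap = 0.0

for _ in range(50):
    # Form 1: gradient evaluated at the look-ahead point w - gamma * Delta
    d_w = eta * grad(w - gamma * d_w) + gamma * d_w
    w = w - d_w
    # Form 2: same update expressed in the variable x
    d_new = eta * grad(x) + gamma * d_x
    x = x + gamma * d_x - (gamma + 1) * d_new
    d_x = d_new
    max_gap = max(max_gap, np.abs(x - (w - gamma * d_w)).max())
```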

(48)

Optimization Schemes: Conclusion

• First-order methods, e.g. Momentum, NAG: better convergence

• Learning Rate Adaptation

Adapting updates to each individual parameter $w_i$

⇒ larger/smaller updates depending on the landscape of the cost function

Context: sparse data, scaling parameter variation in deep networks (cf. batch norm)

Family of algorithms with adaptive learning rates:

AdaGrad, AdaDelta, RMSProp, Adam

(49)

Adaptive Gradient (Adagrad) [Duchi et al., 2011]

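• Adaptive update rule for each parameter, of the form:

$w_i^{t+1} = w_i^t - \Delta_i^{t+1}$

• Definition: $g_i^t = \frac{\partial f}{\partial w_i}(w^t)$, i.e. the gradient along dimension $i$

• $G_i^t = \sum_{\tau=1}^{t} (g_i^\tau)^2$ ⇒ memory of squared gradients over time

$\sqrt{G_i^t}$: $\ell_2$ norm of the gradients for $\tau \in \{1, \dots, t\}$

$G_i^t \leftrightarrow E[g_i^2]$: 2nd moment, uncentered variance

• Adagrad ($\epsilon \sim 10^{-8}$, $\eta \sim 10^{-2}$):

$\Delta_i^{t+1} = \frac{\eta}{\sqrt{G_i^t + \epsilon}} g_i^t = \eta' g_i^t$

Intuition: large $G_i^t$ ⇒ ↓ $\eta'$, small $G_i^t$ ⇒ ↑ $\eta'$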

(50)

Adaptive Gradient (Adagrad) [Duchi et al., 2011]

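• Vectorial form: $w^{t+1} = w^t - \Delta^{t+1}$

$\Delta^{t+1} = \frac{\eta}{\sqrt{G^t + \epsilon}} \odot g^t$

• SHORTCOMING: very aggressive learning rate decay

• $(g_i^t)^2 \geq 0$ ⇒ $G_i^t$ monotonically increasing

⇒ $\eta' = \frac{\eta}{\sqrt{G_i^t + \epsilon}} \to 0$

• $w$ updates stop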
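A minimal Adagrad sketch that records the effective per-parameter learning rate $\eta'$ at each step, to make the decay visible. The quadratic objective and the starting point are hypothetical; $\eta$ and $\epsilon$ use the orders of magnitude given above.

```python
import numpy as np

curv = np.array([1.0, 50.0])

def grad(w):
    """Gradient of the hypothetical objective f(w) = 0.5 * sum(curv * w**2)."""
    return curv * w

def adagrad(w0, eta=1e-2, eps=1e-8, steps=100):
    w = np.array(w0, dtype=float)
    G = np.zeros_like(w)                  # sum of squared gradients
    rates = []
    for _ in range(steps):
        g = grad(w)
        G += g ** 2                       # can only grow
        eta_eff = eta / np.sqrt(G + eps)  # per-parameter learning rate
        w -= eta_eff * g
        rates.append(eta_eff.copy())
    return w, np.array(rates)

w_final, rates = adagrad([1.0, 1.0])
```

Each column of `rates` is non-increasing over time, and the steep dimension (larger gradients) gets the smaller effective learning rate.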

(51)

Root Mean Square Propagation (RMSProp) [Tieleman and Hinton, 2012]

• RMSProp idea: compute the average of $(g_i^t)^2$ using only a recent time window:

$\tilde{G}_i^t = \rho \tilde{G}_i^{t-1} + (1-\rho)(g_i^t)^2$, $\rho \sim 0.9$

$\Delta_i^{t+1} = \frac{\eta}{\sqrt{\tilde{G}_i^t + \epsilon}} g_i^t = \eta' g_i^t$

• RMSProp: same update rule as Adagrad ($\tilde{G}_i^t \leftrightarrow G_i^t$)

⇒ $\tilde{G}_i^t$ NOT monotonically increasing

⇒ Less aggressive learning rate decay

• Final parameter update: $w_i^{t+1} = w_i^t - \Delta_i^{t+1}$
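The difference with Adagrad can be seen on a short, hypothetical sequence of scalar gradients whose magnitude drops: the Adagrad accumulator keeps growing, while the RMSProp running average decreases again. The function name and default values are assumptions for the sketch.

```python
import numpy as np

def rmsprop_step(w, g, G_tilde, eta=1e-3, rho=0.9, eps=1e-8):
    """One RMSProp update; returns the new parameter and running average."""
    G_tilde = rho * G_tilde + (1 - rho) * g ** 2
    delta = eta / np.sqrt(G_tilde + eps) * g
    return w - delta, G_tilde

w, G_tilde, G_sum = 0.0, 0.0, 0.0
rms_hist, ada_hist = [], []
for g in [1.0, 1.0, 1.0, 0.1, 0.1]:   # hypothetical gradient sequence
    w, G_tilde = rmsprop_step(w, g, G_tilde)
    G_sum += g ** 2                   # Adagrad accumulator, for comparison
    rms_hist.append(G_tilde)
    ada_hist.append(G_sum)
```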

(52)

AdaDelta [Zeiler, 2012]

• Essentially: AdaDelta = RMSProp + momentum

• Compute $g_i^t$ and $\tilde{G}_i^t$ ∼ RMSProp: local average of previous $(g_i^t)^2$

$\tilde{G}_i^t = \rho \tilde{G}_i^{t-1} + (1-\rho)(g_i^t)^2$, $\rho \sim 0.9$

• New AdaDelta update vector:

$u_i^{t+1} = \frac{\sqrt{U_i^t + \epsilon}}{\sqrt{\tilde{G}_i^t + \epsilon}} g_i^t = \eta' g_i^t$

• New term $U_i^t$ in the numerator compared to RMSProp:

∼ Momentum (acceleration term) accumulating prior updates

$U_i^{t+1} = \rho U_i^t + (1-\rho)(u_i^{t+1})^2$

• Final parameter update:

$w_i^{t+1} = w_i^t - u_i^{t+1}$

(53)

AdaDelta [Zeiler, 2012]

$u_i^{t+1} = \frac{\sqrt{U_i^t + \epsilon}}{\sqrt{\tilde{G}_i^t + \epsilon}} g_i^t = \frac{\mathrm{RMS}(\text{updates})}{\mathrm{RMS}(\text{gradients})} g_i^t$

• Approximate second order (Hessian $H$ diagonal):

$w^{t+1} = w^t - H^{-1} g$ ⇒ $w_i^{t+1} \propto w_i^t - \frac{f'}{f''}$

Approximate $\frac{1}{f''}$ by $\frac{\mathrm{RMS}(\text{updates})}{\mathrm{RMS}(\text{gradients})}$

Homogeneous dimension update: $u_i^{t+1}$ ∼ $w$! ≠ SGD, momentum, RMSProp!

• Even no learning rate!

⊕ no parameter

⊖ no way to design variable update speeds (e.g. fine-tuning)
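A minimal sketch of one AdaDelta step, written from the update rules above. The function name, the default $\epsilon$ and the test gradients are assumptions. The example also illustrates the "homogeneous dimension" remark: multiplying the gradient by 100 (i.e. changing the units of $f$) leaves the update almost unchanged, which is not the case for SGD.

```python
import numpy as np

def adadelta_step(w, g, G_tilde, U, rho=0.9, eps=1e-6):
    """One AdaDelta update; returns (new w, update u, new G_tilde, new U)."""
    G_tilde = rho * G_tilde + (1 - rho) * g ** 2         # RMS of gradients
    u = np.sqrt(U + eps) / np.sqrt(G_tilde + eps) * g    # no learning rate
    U = rho * U + (1 - rho) * u ** 2                     # RMS of updates
    return w - u, u, G_tilde, U

w0 = np.array([1.0, -1.0])
zeros = np.zeros(2)
g = np.array([10.0, -20.0])     # hypothetical gradient

_, u_small, _, _ = adadelta_step(w0, g, zeros, zeros)
_, u_large, _, _ = adadelta_step(w0, 100.0 * g, zeros, zeros)
```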

(54)

Adaptive Moment Estimation (Adam) [Kingma and Ba, 2015]

• Adam: 1st and 2nd gradient moment estimation

• Strong similarities to Adadelta: 2nd gradient moment + momentum

2nd gradient moment ∼ Adadelta/RMSProp with local accumulation:

$v_i^t = \beta_2 v_i^{t-1} + (1-\beta_2)(g_i^t)^2$

1st gradient moment (mean), vs squared updates in Adadelta:

$m_i^t = \beta_1 m_i^{t-1} + (1-\beta_1) g_i^t$

Overcome Adadelta limitations on biased moment estimates (init), with $\beta^t$ denoting $\beta$ to the power $t$:

$\hat{m}_i^t = \frac{m_i^t}{1-\beta_1^t}$, $\hat{v}_i^t = \frac{v_i^t}{1-\beta_2^t}$

Final update: $w_i^{t+1} = w_i^t - \eta \frac{\hat{m}_i^t}{\sqrt{\hat{v}_i^t} + \epsilon}$
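A minimal Adam sketch following the moment estimates and bias correction above. The default hyper-parameters are those suggested by Kingma and Ba; the quadratic objective and the starting point are hypothetical. Thanks to the bias correction, the very first step has magnitude close to $\eta$ in every dimension, whatever the gradient scale.

```python
import numpy as np

curv = np.array([1.0, 50.0])

def grad(w):
    """Gradient of the hypothetical objective f(w) = 0.5 * sum(curv * w**2)."""
    return curv * w

def adam(w0, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=1):
    w = np.array(w0, dtype=float)
    m = np.zeros_like(w)    # 1st moment estimate
    v = np.zeros_like(w)    # 2nd moment estimate
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)    # bias correction (zero init)
        v_hat = v / (1 - beta2 ** t)
        w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w

w0 = np.array([1.0, -2.0])
w1 = adam(w0, steps=1)
w_long = adam(w0, steps=2000)
```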

(55)

Learning Rate Adaptation: Conclusion

• Per-dimension update: important in case of sparse data

• Adagrad, RMSProp, Adadelta and Adam: all use 2^nd gradient moment (var) on denominator

• Adadelta, Adam: use ∼momentum on numerator

• Adam good default choice in many cases

(56)

3 Transfer Learning

(57)

Training Deep ConvNets on Small Datasets

• Ex: PASCAL VOC’07: 20 categories, 5000 training ex

• Training VGG from scratch vs Handcrafted FV [Perronnin et al., 2010]

• Deep ≪ Handcrafted (test mAP, %)

(58)

Training Deep ConvNets on Small Datasets

• ImageNet: deep ≫ handcrafted

• VOC’07: deep ≪ handcrafted

Not enough training ex

Complex images

(59)

Transfer Learning

• Idea: export knowledge from a source domain to a target domain

Source: good performance, e.g. many samples

Target: more challenging, e.g. few samples

⇒ Deep ConvNet good on ImageNet, but not as good on VOC’07

• Assumption: source and target classes are different but related

Ex: various breeds of cat (tabby, Persian) vs cat

ImageNet: Tabby Cat, Persian Cat; VOC’07: Cat

(60)

Transferring Representations learned from ImageNet

• Most naive transfer learning approach:

Load ConvNet model pre-trained on ImageNet,e.g. VGG

(61)

Transferring Representations learned from ImageNet

• Most naive transfer learning approach:

Load ConvNet model pre-trained on ImageNet

Apply ConvNet on each target dataset image, e.g. VOC

Extract a given layer activation: "Deep Features" (DF)

(62)

Deep Features (DF) for Classification

• Deep Feature (DF), e.g. fc7: use it as any visual descriptor

• Fc7: relevant for discriminating target classes, e.g. tabby/persian cat for cat

(63)

Deep Features (DF) for Classification

• Deep Feature (DF), e.g. fc7: visual descriptor

DF: highly non-linear feature

Can use flat ML model for target class prediction

(64)

Deep Features (DF) for Classification

Which layer to use for classification?

• Layers close to classification: specific to ImageNet

• Layers less close to classification: more generic features

⇒ Depends on similarity between target & ImageNet

(65)

Deep Features (DF) for Visual Recognition: Conclusion

• Deep Features: ConvNet success beyond ImageNet

⇒ No need for a huge dataset to use / train deep models

• Off-the-shelf features for any visual recognition task

• ImageNet: 1000 classes, large set of visual concepts

Transfer very well even to tasks with large domain shift, e.g. medical images

(66)

Deep Features (DF) and Transfer Learning for Classification

• Classification: adapting final layer to match target classes

1 Remove last layer⇔ImageNet classes

(67)

Deep Features (DF) and Transfer Learning for Classification

• Classification: adapting final layer to match target classes

2 Add transfer layer with $K$ target classes, e.g. $K = 20$ for VOC’07

(68)

Transfer Learning for multi-class Classification

• Multi-class Classification: only one class per image, i.e. exclusive labels (e.g.ImageNet, MNIST)

• Ex for transfer in MIT-67 (museum)

(69)

Transfer Learning for multi-class Classification

• Transfer layer of K classes + soft-max activation

• $s_i = x_i W + b$, $\hat{y}_k \sim P(k \mid x_i, W, b) = \frac{e^{s_k}}{\sum_{k'=1}^{K} e^{s_{k'}}}$
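A minimal sketch of the multi-class transfer layer: a linear map on the deep features followed by a soft-max. The feature dimension (4096, as for fc7), the batch size and the random weights are assumptions for illustration; subtracting the row maximum before the exponential is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(s):
    """Row-wise soft-max of scores s with shape (n, K)."""
    z = s - s.max(axis=-1, keepdims=True)   # stability: avoids overflow in exp
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def transfer_layer(x, W, b):
    """s = x W + b, then soft-max: one exclusive class per image."""
    return softmax(x @ W + b)

rng = np.random.default_rng(0)
n, d, K = 5, 4096, 20                # 5 images, fc7-like features, K classes
x = rng.normal(size=(n, d))
W = 0.01 * rng.normal(size=(d, K))   # would be trained on the target dataset
b = np.zeros(K)
y_hat = transfer_layer(x, W, b)
pred = y_hat.argmax(axis=1)          # predicted class index per image
```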

(70)

Multi-label Classification

• Several labels for a given image,e.g.dog + bicycle + car (PASCAL VOC)

• Classification: train $K$ binary models predicting class presence / absence

(71)

Transfer Learning for multi-label Classification

• Transfer layer of K classes + sigmoid activation

• $s_i = x_i W + b$, $\hat{y}_k = [1 + e^{-\lambda s_k}]^{-1}$
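The multi-label counterpart replaces the soft-max with an independent sigmoid per class, so that several labels can be active for the same image. The sketch below is illustrative: the shapes, the random weights and the 0.5 decision threshold are assumptions.

```python
import numpy as np

def sigmoid(s, lam=1.0):
    """Element-wise sigmoid with slope lam."""
    return 1.0 / (1.0 + np.exp(-lam * s))

def multilabel_layer(x, W, b, lam=1.0):
    """s = x W + b, then one sigmoid per class (K binary models)."""
    return sigmoid(x @ W + b, lam)

rng = np.random.default_rng(0)
n, d, K = 5, 4096, 20
x = rng.normal(size=(n, d))
W = 0.01 * rng.normal(size=(d, K))   # would be trained on the target dataset
b = np.zeros(K)
y_hat = multilabel_layer(x, W, b)
present = y_hat > 0.5                # assumed threshold: class presence / absence
```

Unlike the soft-max output, the rows of `y_hat` are not constrained to sum to 1.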

(72)

Deep Features (DF) and Transfer Learning for Classification

• Training strategy? Depends on:

a) Volume of the target dataset

b) Semantic proximity wrt ImageNet

(73)

Training Strategy: Pure Transfer

(74)

Training Strategy: Fine-Tuning

• Lower learning rate for fine-tuned (pre-trained) parameters than for parameters trained from scratch, e.g. a factor of 10 smaller

(75)

Training Strategy: Results

• Small datasets [10³,10⁴], semantically close to ImageNet

• Ex: VOC’07 (20 classes, 5000 ex)

Model               mAP (%)
VGG from scratch    ≈40
Handcrafted FV      ≈70
VGG pure transfer   ≈83
VGG fine-tuning     ≈85

• Fine Tuning >Transfer >>Handcrafted (BoW)

• From scratch low: not enough training data

(76)

Training Strategy: Results

• Medium/large datasets ≥10⁵, semantically close to ImageNet

• Ex: UPMC-food-101 (100 classes, 100000 ex)

Model               mAP (%)
Handcrafted         ≈25
VGG pure transfer   ≈40
VGG from scratch    ≈53
VGG fine-tuning     ≈65

• From scratch does work (well !)

• Fine Tuning >>From scratch>>Transfer>> Handcrafted

(77)

Training Strategy: Results

• Medium/large datasets ≥10⁵, semantically far from ImageNet

• Ex: M2CAI’16 challenge: 22 videos, ∼10⁵ images, 8 classes

Model               Accuracy (%)
VGG pure transfer   ≈60
VGG from scratch    ≈70
VGG fine-tuning     ≈80

• Fine Tuning >> From scratch >> Transfer

• Transfer already good baseline despite big visual content shift

(78)

Transfer Learning: Conclusion

                  Small visual shift     Large visual shift
Small dataset     Transfer top layers    Transfer lower layers
Larger dataset    Fine-tune top layers   Fine-tune all layers

• Small dataset, large visual shift: common in medical image classification

• How to implement transfer for localized tasks?

⇒ last course!

(79)

References I

[Azizpour et al., 2016] Azizpour, H., Razavian, A. S., Sullivan, J., Maki, A., and Carlsson, S. (2016).

Factors of transferability for a generic convnet representation.

IEEE Trans. Pattern Anal. Mach. Intell., 38(9):1790–1802.

[Duchi et al., 2011] Duchi, J., Hazan, E., and Singer, Y. (2011).

Adaptive subgradient methods for online learning and stochastic optimization.

J. Mach. Learn. Res., 12:2121–2159.

[He et al., 2016] He, K., Zhang, X., Ren, S., and Sun, J. (2016).

Deep residual learning for image recognition.

In CVPR.

[Ioffe and Szegedy, 2015] Ioffe, S. and Szegedy, C. (2015).

Batch normalization: Accelerating deep network training by reducing internal covariate shift.

In Bach, F. R. and Blei, D. M., editors, ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 448–456. JMLR.org.

[Kanazawa et al., 2014] Kanazawa, A., Sharma, A., and Jacobs, D. W. (2014).

Locally scale-invariant convolutional neural networks.

In Deep Learning and Representation Learning Workshop: NIPS 2014.

[Kingma and Ba, 2015] Kingma, D. P. and Ba, J. (2015).

Adam: A method for stochastic optimization.

In ICLR, volume abs/1412.6980.

(80)

[Marcos et al., 2017] Marcos, D., Volpi, M., Komodakis, N., and Tuia, D. (2017).

Rotation equivariant vector field networks.

In The IEEE International Conference on Computer Vision (ICCV).

[Perronnin et al., 2010] Perronnin, F., Sánchez, J., and Mensink, T. (2010).

Improving the fisher kernel for large-scale image classification.

In Computer Vision–ECCV 2010, pages 143–156. Springer.

[Simonyan and Zisserman, 2014] Simonyan, K. and Zisserman, A. (2014).

Very deep convolutional networks for large-scale image recognition.

CoRR, abs/1409.1556.

[Sutskever et al., 2013] Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013).

On the importance of initialization and momentum in deep learning.

In Proceedings of the 30th International Conference on Machine Learning - Volume 28, ICML’13, pages III–1139–III–1147. JMLR.org.

[Szegedy et al., 2015] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015).

Going deeper with convolutions.

In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9. IEEE.

[Taigman et al., 2014] Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014).

Deepface: Closing the gap to human-level performance in face verification.

In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[Tieleman and Hinton, 2012] Tieleman, T. and Hinton, G. (2012).

RMSprop Gradient Optimization.

(81)

References III

[Zeiler and Fergus, 2014] Zeiler, M. and Fergus, R. (2014).

Visualizing and understanding convolutional networks.

In Computer Vision–ECCV 2014, volume 8689 of Lecture Notes in Computer Science, pages 818–833. Springer.

[Zeiler, 2012] Zeiler, M. D. (2012).

ADADELTA: an adaptive learning rate method.

CoRR, abs/1212.5701.