Master TRIED Reconnaissance des formes et méthodes neuronales (US330X) - Neural Networks and Deep Learning

Academic year: 2022

(1)

Master TRIED

Reconnaissance des formes et méthodes neuronales (US330X) - Neural Networks and Deep Learning

Nicolas Thome

Conservatoire National des Arts et Métiers (Cnam), Laboratoire CEDRIC - équipe MSDMA

(2)

1 Modern Deep Architectures

Modern Architectural Components

Modern Macro Architectures

2 Modern Optimization

3 Transfer Learning

(3)

Local Response/Contrast Normalization

Normalize value wrt spatial neighbors N (i,j)

Credit: A. Vedaldi

Local equalizing effect

(4)

Normalization: Local Feature Normalization

Normalize value wrt neighbors in different feature maps

Operates at each spatial position independently

Feature groups: subset of maps (sliding window)

∼ Lateral inhibition

(5)

Batch Normalization (BN) [Ioffe and Szegedy, 2015]

Recap init: fixed input distribution known to help training

Training deep neural networks: distribution of hidden layers unknown, change over training time ⇒covariate shift

Importance of init, e.g. Xavier

Batch Normalization (BN):

↓ importance of init, ↓ covariate shift

(6)

Batch Normalization (BN) [Ioffe and Szegedy, 2015]

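Normalize input feature distribution ∼ N(0,1)

Normalization across each mini-batch:

$$\mu_B = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad \sigma_B^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu_B)^2 \qquad \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

ε: for numerical stability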

Is an input feature distribution ∼ N(0,1) a good idea?

Activations may never "saturate", e.g. sigmoid or tanh

Staying in the linear regime: depth becomes useless, the network reduces to a global linear model

(7)

Batch Normalization (BN) [Ioffe and Szegedy, 2015]

Scale and shift: $y_i = \gamma \hat{x}_i + \beta$, with $(\gamma, \beta)$ trained

Apply after FC / conv and before non-linearity
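The normalization and the scale-and-shift step can be sketched in a few lines of plain Python. This is a minimal illustration for one scalar feature in training mode only; the function name `batch_norm` and the value `eps=1e-5` are assumptions made for the example, not taken from the slides.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    n = len(batch)
    mu = sum(batch) / n                              # mini-batch mean
    var = sum((x - mu) ** 2 for x in batch) / n      # mini-batch (biased) variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in batch]
    return [gamma * xh + beta for xh in x_hat]       # y = gamma * x_hat + beta

out = batch_norm([1.0, 2.0, 3.0, 4.0])
scaled = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=3.0)
```

With `gamma=1` and `beta=0` the output has approximately zero mean and unit variance; trained values of `gamma` and `beta` let the network move away from that distribution when it helps.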

(8)

Batch Normalization (BN) [Ioffe and Szegedy, 2015]

Applying BN at test time? ⇒Use train set statistics

BN Strengths

Faster training convergence (reduced covariate shift)

Regularization & generalization: better performance

BN Conclusion: especially important for very deep models, e.g. ResNet (see next)

(9)

Padding Modules

Filter of size H: problem with the borders (⌊H/2⌋ pixels on each side)

Solution 1: reduce processing to the computable area ⇒ decreased output size

Solution 2: padding, i.e. fill missing info with (arbitrary) values

Zero-padding, recopy, mirror, etc

(10)

Zero-Padding

To avoid shrinking the spatial extent of the network rapidly
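A small sketch of how padding affects the output size along one axis. It uses the usual formula floor((n + 2p - h) / s) + 1; the helper names `conv_output_size` and `zero_pad_1d` and the sizes (32, 5) are illustrative assumptions.

```python
def conv_output_size(n, h, padding=0, stride=1):
    """Output length along one axis for input size n and filter size h."""
    return (n + 2 * padding - h) // stride + 1

def zero_pad_1d(signal, p):
    """Add p zeros on each side of a 1-D signal."""
    return [0.0] * p + list(signal) + [0.0] * p

h = 5
no_pad = conv_output_size(32, h)                    # output shrinks by h - 1
same_pad = conv_output_size(32, h, padding=h // 2)  # padding of floor(h/2) keeps the size
padded = zero_pad_1d([1.0, 2.0, 3.0], 2)
```

Without padding, each layer with a filter of size 5 removes 4 positions per axis, so a deep stack shrinks the maps quickly.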

(11)

Pooling Modules

Overlapping Pooling- Ex: pooling size: 5, stride s=2
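The example above (pooling size 5, stride 2) can be sketched in 1-D. Windows overlap because the stride is smaller than the pooling size; the function name and the input values are assumptions for illustration.

```python
def max_pool_1d(signal, size=5, stride=2):
    """Max pooling over sliding windows; windows overlap when stride < size."""
    last_start = len(signal) - size
    return [max(signal[i:i + size]) for i in range(0, last_start + 1, stride)]

pooled = max_pool_1d([1, 3, 2, 9, 4, 0, 7, 5, 6])
# windows: [1,3,2,9,4], [2,9,4,0,7], [4,0,7,5,6]
```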

(12)

Pooling Modules

Pooling across feature maps

Aggregation for a given spatial position, between different tensor maps

Tensor maps (filter output) associated to a given transformation⇒ invariance with max pooling

(13)

Pooling across feature maps

Ex: scaling [Kanazawa et al., 2014]

(14)

Pooling across feature maps

Ex: rotation [Marcos et al., 2017]

(15)

Locally Connected vs Convolution Layers

Locally connected: different features detected across image positions

Successful in specific contexts, e.g. DeepFace [Taigman et al., 2014]

(16)

1 Modern Deep Architectures

Modern Architectural Components

Modern Macro Architectures

2 Modern Optimization

3 Transfer Learning

(17)

Deep Learning since 2012

More & more data (Facebook: $10^9$ images / day)

ILSVRC since 2012: larger & larger networks

(18)

Deep Learning since 2012: ImageNet’13

[Zeiler and Fergus, 2014] (ZF): archi∼AlexNet’12 (conv+FC)

With Local Contrast Normalization (LCN) after conv/pool

(19)

Deep Learning since 2012: ImageNet’14

(20)

ImageNet’14: VGG [Simonyan and Zisserman, 2014]

Still Conv + FC, BUT:

No pooling between some conv layers

Convolution stride 1

(21)

ImageNet’14: VGG [Simonyan and Zisserman, 2014]

3x3 convolutions:

two stacked 3x3 conv ∼ one 5x5 conv (same receptive field)
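A quick arithmetic check of the comparison between two stacked 3x3 convolutions and one 5x5 convolution. The channel count (64 in, 64 out) is an assumption chosen for illustration, and biases are ignored.

```python
def conv_params(k, c_in, c_out):
    """Number of weights of a k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

c = 64  # hypothetical number of channels
two_3x3 = 2 * conv_params(3, c, c)
one_5x5 = conv_params(5, c, c)
rf_two_3x3 = receptive_field([3, 3])
rf_one_5x5 = receptive_field([5])
```

Both options see a 5x5 input window, but the stacked version uses fewer weights (18 C² against 25 C²) and adds one extra non-linearity.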

(22)

ImageNet’14: GoogLeNet [Szegedy et al., 2015]

GoogLeNet global archi: three main components

1 ’Stem’ [Conv-Pool] Layer

2 Inception modules: Networks in Networks

3 Auxiliary classifiers

(23)

GoogLeNet: ’Stem’ ∼ [Conv-Pool] Layers

(24)

GoogLeNet: Inception Module

Inspired from Network in Network (NiN) idea [Lin et al., 2013]

Each conv layer: linear + non-linearity

NiN block: hierarchy of conv layers

1×1 conv ∼ MLP: universal approximator, more expressive power for each patch

(25)

GoogLeNet: 1 × 1 Convolution

(26)

GoogLeNet: Inception Module

1×1, 3×3, 5×5 ⇒multi-scale

Optimal filter size, pooling or not: learned

1×1: dimensionality reduction

⇒reasonable # parameters
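The effect of the 1×1 dimensionality reduction on the parameter count can be checked with simple arithmetic. The channel sizes below (192 input maps, reduction to 16, 32 output maps for the 5×5 branch) are illustrative assumptions, and biases are ignored.

```python
def conv_params(k, c_in, c_out):
    """Number of weights of a k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

c_in, c_reduced, c_out = 192, 16, 32   # hypothetical branch sizes

direct = conv_params(5, c_in, c_out)                 # 5x5 applied on all input maps
bottleneck = (conv_params(1, c_in, c_reduced)        # 1x1 reduction first
              + conv_params(5, c_reduced, c_out))    # then 5x5 on fewer maps
```

Under these assumptions the reduced branch needs roughly ten times fewer weights than the direct 5×5 convolution.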

(27)

ImageNet’14: GoogLeNet

Two auxiliary classifiers

(28)

ImageNet’14: GoogLeNet

Output classifier: Global Average Pooling (GAP), no FC!

Drastic ↓ # params!
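Global average pooling reduces each feature map to a single value, so the classifier no longer sees a large flattened tensor. A minimal sketch on nested lists follows; the function name and the toy maps are assumptions.

```python
def global_average_pooling(feature_maps):
    """feature_maps: list of C maps, each a list of rows. Returns C averages."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]

maps = [
    [[1.0, 2.0], [3.0, 4.0]],   # map 0
    [[0.0, 0.0], [0.0, 8.0]],   # map 1
]
pooled = global_average_pooling(maps)   # one value per map
```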

(29)

Deep Learning since 2012

ILSVRC since 2012: larger & larger networks

(30)

Issue: Training Deeper Networks

BUT: deeper nets have worse performances

Ex: plain nets: stacking 3×3 conv

Ex: deeper VGG: VGG56 < VGG20

(31)

Issue: Training Deeper Networks

Not a generalization issue, training error also higher!

General phenomenon, observed in many datasets

(32)

Issue: Training Deeper Networks

18 layers (left) vs 34 layers (right)

Deeper counterpart: richer solution space

⇒ should not have higher training error

Construction: copy from a shallower model

Extra layers: set as identity

(33)

Identity Mapping with Plain Blocks

BUT: optimization challenge ⇒ solvers cannot find this solution when going deeper...

H(x): desired mapping, to fit with 2 weight layers

Multiple non-linear layers ⇒ identity mapping difficult

(34)

Identity Mapping with Residual Blocks [He et al., 2016]

H(x): desired mapping; the 2 layers fit the residual F(x) ⇒ H(x) = F(x) + x

If identity optimal, easy to set weights as 0

If optimal mapping closer to identity, easier to find small fluctuations

Residual Block
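A minimal sketch of a residual block with two weight layers, to illustrate why the identity is easy to represent: with all weights at zero, F(x) = 0 and the block returns x. The helper names are assumptions, and the sketch omits details of the published architecture (convolutions, batch norm, the final ReLU after the addition).

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def residual_block(x, W1, W2):
    f = matvec(W2, relu(matvec(W1, x)))        # F(x): two weight layers
    return [fi + xi for fi, xi in zip(f, x)]   # H(x) = F(x) + x

x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
identity_out = residual_block(x, zeros, zeros)  # zero weights give the identity
```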

(35)

ImageNet’15: Residual Networks (ResNet) [He et al., 2016]

Ex for VGG-style: 3x3 conv

ResNet-34

Architecture

Stride 2 at some layers: spatial size /2, # filters ×2

⇒ same complexity per layer

When sizes change across a residual connection: 1x1 conv (dotted)

Trained from scratch, standard hyper-parameters & augmentation

(36)

ImageNet’15: Residual Networks (ResNet) [He et al., 2016]

ResNet-34

(37)

ResNet: Results

CIFAR-10 results

Deep ResNets trained without difficulty:

Deeper Models ⇒training & testing error lower

(38)

ResNet: Results

ImageNet

Deep ResNets trained without difficulty:

Deeper Models ⇒training & testing error lower

(39)

ResNet: Conclusion

ResNet: training deeper models

Performance below 4% error at the ImageNet Large Scale Visual Recognition Competition’15!

(40)

1 Modern Deep Architectures

2 Modern Optimization

3 Transfer Learning

(41)

Beyond Stochastic Gradient Descent (SGD)

Gradient descent optimization:

$$w_{t+1} = w_t - \eta \frac{\partial f}{\partial w}(w_t) = w_t - \eta \nabla f(w_t)$$

1st issue: objective f changes quickly in one direction and slowly in another

2nd issue: Stochastic Gradient Descent (SGD) only approximates the true gradient

(42)

Beyond Stochastic Gradient Descent (SGD)

Poor conditioning of the Hessian matrix, i.e. large condition number (largest/smallest singular value)

Gradient descent: very slow progress along shallow directions
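The effect of poor conditioning can be reproduced on a toy quadratic. The objective below (Hessian diag(100, 1), condition number 100) and the step size are assumptions chosen for illustration; the step size must stay below 2/100 for the steep direction to remain stable, which leaves the shallow direction with very small steps.

```python
def gradient_descent(grad, w, lr, steps):
    for _ in range(steps):
        w = [wi - lr * gi for wi, gi in zip(w, grad(w))]
    return w

def grad(w):
    # f(w) = 0.5 * (100 * w[0]**2 + w[1]**2)
    return [100.0 * w[0], w[1]]

w_final = gradient_descent(grad, [1.0, 1.0], lr=0.019, steps=50)
steep_error, shallow_error = abs(w_final[0]), abs(w_final[1])
```

After 50 steps the steep coordinate is close to zero while the shallow coordinate has only covered part of the distance to the minimum.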


(43)

Beyond Stochastic Gradient Descent (SGD)

Function $f(w) = \sum_{i=1}^{N} f_i(w)$, e.g. $f_i(w) = \ell_{CE}(\hat{y}_i, y_i)$

SGD: approximation of the true gradient, e.g. for a mini-batch:

$$\nabla f(w_t) \approx \frac{1}{B}\sum_{i=1}^{B} \frac{\partial f_i}{\partial w}(w_t)$$
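A sketch of the mini-batch estimate on a toy problem. The per-sample loss f_i(w) = 0.5 (w - x_i)², the data distribution and the batch size are assumptions for illustration; the estimate is compared with the gradient averaged over the whole dataset.

```python
import random

def per_sample_grad(w, x):
    # derivative of f_i(w) = 0.5 * (w - x)**2 with respect to w
    return w - x

def minibatch_gradient(w, data, batch_size, rng):
    """Average of per-sample gradients over a random mini-batch."""
    batch = rng.sample(data, batch_size)
    return sum(per_sample_grad(w, x) for x in batch) / batch_size

rng = random.Random(0)
data = [rng.gauss(3.0, 1.0) for _ in range(1000)]

full = sum(per_sample_grad(0.0, x) for x in data) / len(data)   # averaged over all samples
noisy = minibatch_gradient(0.0, data, batch_size=32, rng=rng)   # cheap, noisy estimate
exact = minibatch_gradient(0.0, data, batch_size=len(data), rng=rng)
```

The estimate gets noisier as the batch shrinks and coincides with the full average when the batch contains the whole dataset.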

(44)

Momentum

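In all SGD variants/improvements: $w_{t+1} = w_t - \Delta_{t+1}$

$\Delta_{t+1}$: update vector $w_t \to w_{t+1}$

Ex: Gradient descent: $\Delta_{t+1} = \eta \nabla f(w_t)$

Momentum: use previous gradient memory, e.g. running average:

$$\Delta_{t+1} = \eta \nabla f(w_t) + \gamma \Delta_t, \qquad \gamma \in [0;1[ \ \text{(e.g. 0.5, 0.9)}$$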

(45)

Momentum

$\Delta_t$: $v_t$ ∼ velocity, inertia or memory: $w_{t+1} = w_t - v_{t+1}$

Dimensions with oscillating gradient directions: $v_t$ damped

Dimensions with small but consistent gradient direction: $v_t$ increased

More robust to local minima/saddle points, poor conditioning and noisy gradients
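A minimal sketch of the momentum update on a poorly conditioned quadratic, compared with plain gradient descent (γ = 0). The objective, step size and number of steps are assumptions chosen for illustration.

```python
def grad(w):
    # f(w) = 0.5 * (100 * w[0]**2 + w[1]**2), a poorly conditioned quadratic
    return [100.0 * w[0], w[1]]

def loss(w):
    return 0.5 * (100.0 * w[0] ** 2 + w[1] ** 2)

def run(gamma, lr=0.019, steps=100):
    w, delta = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        # delta_{t+1} = lr * grad f(w_t) + gamma * delta_t
        delta = [lr * g + gamma * d for g, d in zip(grad(w), delta)]
        # w_{t+1} = w_t - delta_{t+1}
        w = [wi - di for wi, di in zip(w, delta)]
    return w

loss_gd = loss(run(gamma=0.0))         # plain gradient descent
loss_momentum = loss(run(gamma=0.9))   # with momentum
```

In this setting the velocity builds up along the shallow direction, where the gradient is small but consistent, so the loss after 100 steps is lower with momentum than without.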

(46)

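Nesterov Accelerated Gradient (NAG)

NAG [Sutskever et al., 2013]

∼ Momentum, but compute the gradient at the position predicted by $\Delta_t$:

$$\Delta_{t+1} = \eta \nabla f(w_t - \gamma \Delta_t) + \gamma \Delta_t, \quad \gamma \sim 0.9 \qquad w_{t+1} = w_t - \Delta_{t+1}$$

[Figure: update step, Momentum vs NAG]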

(47)

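Nesterov Accelerated Gradient (NAG) [Sutskever et al., 2013]

With $x_t = w_t - \gamma \Delta_t$, more convenient update rule:

$$\Delta_{t+1} = \eta \nabla f(x_t) + \gamma \Delta_t \qquad x_{t+1} = x_t + \gamma \Delta_t - (\gamma + 1)\Delta_{t+1}$$

⊕: anticipatory update ⇒ prevents too large updates and overshooting

⊕: Increased responsiveness to the landscape of the loss function f

[Figure: trajectories, Momentum vs NAG]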
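The two ways of writing NAG (tracking w with a look-ahead gradient, or tracking the look-ahead point x = w - γΔ directly) can be checked numerically. The 1-D objective f(w) = 0.5 w² and the values of η and γ are assumptions for illustration.

```python
def grad(w):
    return w  # f(w) = 0.5 * w**2

lr, gamma = 0.1, 0.9

# Form 1: track w_t and delta_t, evaluate the gradient at the look-ahead point
w, delta = 1.0, 0.0
# Form 2: track x_t = w_t - gamma * delta_t directly
x, delta_x = w - gamma * delta, 0.0

for _ in range(20):
    delta_next = lr * grad(w - gamma * delta) + gamma * delta
    w, delta = w - delta_next, delta_next

    dx_next = lr * grad(x) + gamma * delta_x
    x, delta_x = x + gamma * delta_x - (gamma + 1.0) * dx_next, dx_next

gap = abs(x - (w - gamma * delta))   # the two forms should stay in sync
```

The second form is convenient in practice because the gradient is evaluated at the stored variable itself, with no temporary look-ahead copy of the weights.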

(48)

Optimization Schemes: Conclusion

First-order methods, e.g. Momentum, NAG: better convergence

Learning Rate Adaptation

Adapting updates to each individual parameter $w_i$

larger/smaller updates depending on the landscape of the cost function

Context: sparse data, scaling parameter variation in deep networks (cf. batch norm)

Family of algorithms with adaptive learning rates:

AdaGrad, AdaDelta, RMSProp, Adam

(49)

Adaptive Gradient (Adagrad) [Duchi et al., 2011]

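Adaptive update rule for each parameter, of the form:

$$w_i^{t+1} = w_i^t - \Delta_i^{t+1}$$

Definition: $g_i^t = \frac{\partial f}{\partial w_i}(w_i^t)$, i.e. gradient along dimension $i$

$G_i^t = \sum_{\tau=1}^{t} (g_i^\tau)^2$ ⇒ memory of squared gradients over time

$\sqrt{G_i^t}$: $\ell_2$ norm of the gradients over steps $\tau \in \{1; t\}$

$G_i^t \sim E[g_i^2]$: 2nd moment, uncentered variance

Adagrad ($\epsilon \sim 10^{-8}$, $\eta \sim 10^{-2}$):

$$\Delta_i^{t+1} = \frac{\eta}{\sqrt{G_i^t + \epsilon}}\, g_i^t$$

Intuition: large $G_i^t$ ⇒ ↓ η, small $G_i^t$ ⇒ ↑ η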

(50)

Adaptive Gradient (Adagrad) [Duchi et al., 2011]

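Vectorial form: $w_{t+1} = w_t - \Delta_{t+1}$

$$\Delta_{t+1} = \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t$$

SHORTCOMING: very aggressive learning rate decay

$(g_i^t)^2 \geq 0$ ⇒ $G_i^t$ monotonically increasing

⇒ effective learning rate $\frac{\eta}{\sqrt{G_i^t + \epsilon}} \to 0$

⇒ w updates stop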
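A minimal Adagrad sketch that records the effective learning rate of one dimension, to make the decay visible. The toy objective and the function name are assumptions; η and ε use the orders of magnitude quoted for Adagrad.

```python
import math

def adagrad(grad, w, lr=0.01, eps=1e-8, steps=100):
    G = [0.0] * len(w)    # running sum of squared gradients, per dimension
    rates = []            # effective learning rate of dimension 0 over time
    for _ in range(steps):
        g = grad(w)
        G = [Gi + gi * gi for Gi, gi in zip(G, g)]
        eff = [lr / math.sqrt(Gi + eps) for Gi in G]
        w = [wi - ei * gi for wi, ei, gi in zip(w, eff, g)]
        rates.append(eff[0])
    return w, rates

def grad(w):
    # f(w) = 0.5 * (100 * w[0]**2 + w[1]**2)
    return [100.0 * w[0], w[1]]

w_final, rates = adagrad(grad, [1.0, 1.0])
```

Because the accumulator only grows, the recorded rates never increase, which is the aggressive decay described above.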

(51)

Root Mean Square Propagation

(RMSProp) [Tieleman and Hinton, 2012]

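RMSProp idea: compute the average of $(g_i^t)^2$ only over a recent time window:

$$\tilde{G}_i^t = \rho\, \tilde{G}_i^{t-1} + (1-\rho)(g_i^t)^2, \qquad \rho \sim 0.9$$

$$\Delta_i^{t+1} = \frac{\eta}{\sqrt{\tilde{G}_i^t + \epsilon}}\, g_i^t$$

RMSProp: same update rule as Adagrad ($\tilde{G}_i^t \leftrightarrow G_i^t$)

⇒ $\tilde{G}_i^t$ NOT monotonically increasing

⇒ Less aggressive learning rate decay

Final parameter update:

$$w_i^{t+1} = w_i^t - \Delta_i^{t+1}$$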
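A sketch of one RMSProp step for a single parameter, fed with a gradient sequence that shrinks over time. The gradient values, η and ε are assumptions for illustration; the point is that the windowed average can decrease, unlike the Adagrad sum.

```python
import math

def rmsprop_step(w, G, g, lr=0.01, rho=0.9, eps=1e-8):
    G = rho * G + (1.0 - rho) * g * g       # windowed average of squared gradients
    w = w - lr / math.sqrt(G + eps) * g     # Adagrad rule with G replaced by the average
    return w, G

w, G, history = 0.0, 0.0, []
for g in [1.0, 1.0, 1.0, 0.1, 0.1, 0.1]:    # hypothetical gradient sequence
    w, G = rmsprop_step(w, G, g)
    history.append(G)
```

The accumulator rises during the first three steps and falls once the gradients become small, so the effective learning rate can recover.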

(52)

AdaDelta [Zeiler, 2012]

Essentially: AdaDelta = RMSProp + momentum

Compute $g_i^t$ and $\tilde{G}_i^t$ ∼ RMSProp: local average of previous $(g_i^t)^2$:

$$\tilde{G}_i^t = \rho\, \tilde{G}_i^{t-1} + (1-\rho)(g_i^t)^2, \qquad \rho \sim 0.9$$

New AdaDelta update vector:

$$u_i^{t+1} = \frac{\sqrt{U_i^t + \epsilon}}{\sqrt{\tilde{G}_i^t + \epsilon}}\, g_i^t$$

New term $U_i^t$ in the numerator compared to RMSProp:

Momentum (acceleration term) accumulating prior updates:

$$U_i^{t+1} = \rho\, U_i^t + (1-\rho)(u_i^t)^2$$

Final parameter update:

$$w_i^{t+1} = w_i^t - u_i^{t+1}$$
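A minimal AdaDelta sketch for a single parameter. The toy objective f(w) = 0.5 w², the value `eps=1e-6` and the number of steps are assumptions; the accumulator of squared updates is refreshed with the update that was just computed, which follows the ordering in Zeiler's paper and is an assumption about the indexing.

```python
import math

def adadelta(grad, w, rho=0.9, eps=1e-6, steps=50):
    G, U = 0.0, 0.0   # running averages of squared gradients and squared updates
    for _ in range(steps):
        g = grad(w)
        G = rho * G + (1.0 - rho) * g * g
        u = math.sqrt(U + eps) / math.sqrt(G + eps) * g   # RMS(updates) / RMS(gradients) * g
        U = rho * U + (1.0 - rho) * u * u
        w = w - u                                         # no learning rate involved
    return w

w_final = adadelta(lambda w: w, 1.0)
```

No learning rate appears anywhere: the step size is set by the ratio of the two running averages, and `eps` alone controls how fast the first updates grow.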

(53)

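AdaDelta [Zeiler, 2012]

$$u_i^{t+1} = \frac{\sqrt{U_i^t + \epsilon}}{\sqrt{\tilde{G}_i^t + \epsilon}}\, g_i^t = \frac{\text{RMS(updates)}}{\text{RMS(gradients)}}\, g_i^t$$

Approximate second order (Hessian H diagonal):

$$w_{t+1} = w_t - H^{-1} g \;\Rightarrow\; w_i^{t+1} \propto w_i^t - \frac{f'}{f''}$$

Approximate $\frac{1}{f''}$ by $\frac{\text{RMS(updates)}}{\text{RMS(gradients)}}$

Update with homogeneous dimension: $u_i^{t+1} \propto w$, unlike SGD, momentum, RMSProp!

Even no learning rate!

⇒ no parameter to tune, but no way to design variable update speeds (e.g. fine-tuning)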

(54)

Adaptive Moment Estimation (Adam) [Kingma and Ba, 2015]

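Adam: 1st and 2nd gradient moment estimation

Strong similarities to Adadelta: 2nd gradient moment + momentum

2nd gradient moment ∼ Adadelta/RMSProp, with local accumulation:

$$v_i^t = \beta_2 v_i^{t-1} + (1-\beta_2)(g_i^t)^2$$

1st gradient moment (mean), vs squared updates in Adadelta:

$$m_i^t = \beta_1 m_i^{t-1} + (1-\beta_1)\, g_i^t$$

Overcome Adadelta limitations on biased moment estimates (init), with $\beta_1^t$, $\beta_2^t$ denoting β to the power t:

$$\hat{m}_i^t = \frac{m_i^t}{1-\beta_1^t} \qquad \hat{v}_i^t = \frac{v_i^t}{1-\beta_2^t}$$

Final update:

$$w_i^{t+1} = w_i^t - \eta\, \frac{\hat{m}_i^t}{\sqrt{\hat{v}_i^t} + \epsilon}$$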
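A sketch of one Adam step for a single parameter, using the default hyper-parameters suggested by Kingma and Ba (η = 0.001, β1 = 0.9, β2 = 0.999, ε = 1e-8). The function name and the test gradient are assumptions for illustration.

```python
import math

def adam_step(w, m, v, g, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * g        # 1st moment (running mean of gradients)
    v = beta2 * v + (1.0 - beta2) * g * g    # 2nd moment (uncentered variance)
    m_hat = m / (1.0 - beta1 ** t)           # bias correction, t starts at 1
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# First step with a very large gradient: moments start at zero
w1, m1, v1 = adam_step(w=1.0, m=0.0, v=0.0, g=1000.0, t=1)
```

Thanks to the bias correction, the very first step has a magnitude close to the learning rate whatever the scale of the gradient; without it, the zero-initialized moments would distort the early updates.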

(55)

Learning Rate Adaptation: Conclusion

Per-dimension update: important in case of sparse data

Adagrad, RMSProp, Adadelta and Adam: all use 2nd gradient moment (var) on denominator

Adadelta, Adam: use ∼momentum on numerator

Adam good default choice in many cases

(56)

1 Modern Deep Architectures

2 Modern Optimization

3 Transfer Learning

(57)

Training Deep ConvNets on Small Datasets

Ex: PASCAL VOC’07: 20 categories, 5000 training ex

Training VGG from scratch vs handcrafted Fisher Vectors (FV) [Perronnin et al., 2010]

Deep ≪ Handcrafted

[Table: Model, Test mAP (%)]

(58)

Training Deep ConvNets on Small Datasets

ImageNet: deep ≫ handcrafted

VOC’07: deep ≪ handcrafted

Not enough training ex

Complex images

(59)

Transfer Learning

Idea: export knowledge from source domain to target domain

Source: good performances, e.g. many samples

Target: more challenging, e.g. few samples

Deep ConvNet good on ImageNet, but not as good on VOC’07

Assumption: source and target classes different but related

Ex: Various breeds of cat (tabby, persian) vs cat

ImageNet: Tabby Cat, Persian Cat

VOC’07: Cat

(60)

Transferring Representations learned from ImageNet

Most naive transfer learning approach:

Load ConvNet model pre-trained on ImageNet, e.g. VGG

(61)

Transferring Representations learned from ImageNet

Most naive transfer learning approach:

Load ConvNet model pre-trained on ImageNet

Apply ConvNet on each target dataset image, e.g. VOC

Extract a given layer activation: "Deep Features" (DF)

(62)

Deep Features (DF) for Classification

Deep Feature (DF), e.g. fc7: use it as any visual descriptor

Fc7: relevant for discriminating target classes, e.g. tabby/persian cat for cat

(63)

Deep Features (DF) for Classification

Deep Feature (DF), e.g. fc7: visual descriptor

DF: highly non-linear feature

Can use flat ML model for target class prediction

(64)

Deep Features (DF) for Classification

Which layer to use for classification?

Layers close to classification: specific to ImageNet

Layers less close to classification: more generic features

⇒ Depends on similarity between target & ImageNet

(65)

Deep Features (DF) for Visual Recognition: Conclusion

Deep Features: ConvNet success beyond ImageNet

⇒ No need for a huge dataset to use / train deep models

Off-the-shelf features for any visual recognition task

ImageNet: 1000 classes, large set of visual concepts

Transfer very well even to tasks with large domain shift, e.g. medical images


(66)

Deep Features (DF) and Transfer Learning for Classification

Classification: adapting final layer to match target classes

1 Remove last layer (ImageNet classes)

(67)

Deep Features (DF) and Transfer Learning for Classification

Classification: adapting final layer to match target classes

2 Add transfer layer with K target classes, e.g. K = 20 for VOC’07

(68)

Transfer Learning for multi-class Classification

Multi-class Classification: only one class per image, i.e. exclusive labels (e.g. ImageNet, MNIST)

Ex for transfer in MIT-67 (museum)

(69)

Transfer Learning for multi-class Classification

Transfer layer of K classes + soft-max activation

$$s_i = x_i W + b, \qquad \hat{y}_k \sim P(k \mid x_i, W, b) = \frac{e^{s_k}}{\sum_{k'=1}^{K} e^{s_{k'}}}$$
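A minimal sketch of the transfer layer followed by a soft-max, for one feature vector. The feature values, weights and helper names are assumptions for illustration; the maximum score is subtracted before exponentiation, which leaves the result unchanged and avoids overflow.

```python
import math

def linear(x, W, b):
    """s = x W + b for a row vector x (length D) and a D x K weight matrix."""
    return [sum(x[d] * W[d][k] for d in range(len(x))) + b[k] for k in range(len(b))]

def softmax(scores):
    m = max(scores)                              # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

x = [1.0, 2.0]                                   # hypothetical deep feature (D = 2)
W = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]          # hypothetical weights (K = 3 classes)
b = [0.0, 0.0, 0.0]
probs = softmax(linear(x, W, b))
predicted = probs.index(max(probs))
```

The outputs sum to one, which matches the exclusive-label setting: raising the probability of one class lowers the others.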

(70)

Multi-label Classification

Several labels for a given image, e.g. dog + bicycle + car (PASCAL VOC)

Classification: train K binary models predicting class presence / absence

(71)

Transfer Learning for multi-label Classification

Transfer layer of K classes + sigmoid activation

$$s_i = x_i W + b, \qquad \hat{y}_k = \left[1 + e^{-\lambda s_k}\right]^{-1}$$
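A sketch of the sigmoid output for the multi-label case: each class score is squashed independently, so several classes can be present at once. The scores, the 0.5 decision threshold and the default λ = 1 are assumptions for illustration.

```python
import math

def sigmoid(s, lam=1.0):
    return 1.0 / (1.0 + math.exp(-lam * s))

scores = [2.0, -1.0, 0.0]                # hypothetical scores s_k for K = 3 classes
probs = [sigmoid(s) for s in scores]     # one independent probability per class
present = [p > 0.5 for p in probs]       # presence / absence decision per class
```

Unlike the soft-max, these probabilities are not constrained to sum to one.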

(72)

Deep Features (DF) and Transfer Learning for Classification

Training strategy? Depends on:

a) Volume of the target dataset

b) Semantic proximity wrt ImageNet

(73)

Training Strategy: Pure Transfer

(74)

Training Strategy: Fine-Tuning

Decreased learning rate for fine-tuned parameters wrt parameters trained from scratch, e.g. factor 10 ↓
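A framework-free sketch of this rule: one SGD step with a separate learning rate per parameter group. The group names, parameter values and base rate are hypothetical; deep learning libraries usually expose the same idea through per-group optimizer settings.

```python
def sgd_step(params, grads, lrs):
    """One SGD step with a separate learning rate per parameter group."""
    return {
        name: [w - lrs[name] * g for w, g in zip(params[name], grads[name])]
        for name in params
    }

base_lr = 0.01
lrs = {
    "pretrained_layers": base_lr / 10.0,   # fine-tuned weights: 10x smaller steps
    "transfer_layer": base_lr,             # new layer trained from scratch: full rate
}
params = {"pretrained_layers": [0.5, -0.2], "transfer_layer": [0.0, 0.0]}
grads = {"pretrained_layers": [1.0, 1.0], "transfer_layer": [1.0, 1.0]}

new_params = sgd_step(params, grads, lrs)
```

For the same gradient, the pre-trained weights move ten times less than the new transfer layer, which keeps the ImageNet representation from being overwritten early in training.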

(75)

Training Strategy: Results

Small datasets $[10^3, 10^4]$, semantically close to ImageNet

Ex: VOC’07 (20 classes, 5000 ex)

Model                 mAP (%)
VGG from scratch      40
Handcrafted FV        70
VGG pure transfer     83
VGG fine-tuning       85

Fine Tuning > Transfer >> Handcrafted (BoW)

From scratch low: not enough training data

(76)

Training Strategy: Results

Medium/large datasets ($\geq 10^5$), semantically close to ImageNet

Ex: UPMC-food-101 (100 classes, 100000 ex)

Model                 mAP (%)
Handcrafted           ≈25
VGG pure transfer     ≈40
VGG from scratch      ≈53
VGG fine-tuning       ≈65

From scratch does work (well!)

Fine Tuning >> From scratch >> Transfer >> Handcrafted

(77)

Training Strategy: Results

Medium/large datasets ($\geq 10^5$), semantically far from ImageNet

Ex: M2CAI’16 challenge:

22 videos, ∼$10^5$ images, 8 classes

Model                 Accuracy (%)
VGG pure transfer     ≈60
VGG from scratch      ≈70
VGG fine-tuning       ≈80

Fine Tuning >> From scratch >> Transfer

Transfer already good baseline despite big visual content shift

(78)

Transfer Learning: Conclusion

                  Small visual shift      Large visual shift
Small dataset     Transfer top layers     Transfer lower layers
Larger dataset    Fine-tune top layers    Fine-tune all layers

Small dataset, large visual shift: common in medical image classification

How to implement transfer for localized tasks?

⇒ last course!

(79)

References I

[Azizpour et al., 2016] Azizpour, H., Razavian, A. S., Sullivan, J., Maki, A., and Carlsson, S. (2016).

Factors of transferability for a generic convnet representation.

IEEE Trans. Pattern Anal. Mach. Intell., 38(9):1790–1802.

[Duchi et al., 2011] Duchi, J., Hazan, E., and Singer, Y. (2011).

Adaptive subgradient methods for online learning and stochastic optimization.

J. Mach. Learn. Res., 12:2121–2159.

[He et al., 2016] He, K., Zhang, X., Ren, S., and Sun, J. (2016).

Deep residual learning for image recognition.

In CVPR.

[Ioffe and Szegedy, 2015] Ioffe, S. and Szegedy, C. (2015).

Batch normalization: Accelerating deep network training by reducing internal covariate shift.

In Bach, F. R. and Blei, D. M., editors, ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 448–456. JMLR.org.

[Kanazawa et al., 2014] Kanazawa, A., Sharma, A., and Jacobs, D. W. (2014).

Locally scale-invariant convolutional neural networks.

In Deep Learning and Representation Learning Workshop: NIPS 2014.

[Kingma and Ba, 2015] Kingma, D. P. and Ba, J. (2015).

Adam: A method for stochastic optimization.

In ICLR, volume abs/1412.6980.

(80)

[Marcos et al., 2017] Marcos, D., Volpi, M., Komodakis, N., and Tuia, D. (2017).

Rotation equivariant vector field networks.

In The IEEE International Conference on Computer Vision (ICCV).

[Perronnin et al., 2010] Perronnin, F., Sánchez, J., and Mensink, T. (2010).

Improving the fisher kernel for large-scale image classification.

In Computer Vision–ECCV 2010, pages 143–156. Springer.

[Simonyan and Zisserman, 2014] Simonyan, K. and Zisserman, A. (2014).

Very deep convolutional networks for large-scale image recognition.

CoRR, abs/1409.1556.

[Sutskever et al., 2013] Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013).

On the importance of initialization and momentum in deep learning.

In Proceedings of the 30th International Conference on Machine Learning - Volume 28, ICML’13, pages III–1139–III–1147. JMLR.org.

[Szegedy et al., 2015] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015).

Going deeper with convolutions.

In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9. IEEE.

[Taigman et al., 2014] Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014).

Deepface: Closing the gap to human-level performance in face verification.

In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[Tieleman and Hinton, 2012] Tieleman, T. and Hinton, G. (2012).

RMSprop Gradient Optimization.

(81)

References III

[Zeiler and Fergus, 2014] Zeiler, M. and Fergus, R. (2014).

Visualizing and understanding convolutional networks, volume 8689 LNCS of Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pages 818–833.

Springer Verlag, part 1 edition.

[Zeiler, 2012] Zeiler, M. D. (2012).

ADADELTA: an adaptive learning rate method.

CoRR, abs/1212.5701.
