Academic year: 2022

Apprentissage statistique : modélisation décisionnelle et apprentissage profond (RCP209)

Neural Networks and Deep Learning History & Modern Deep Learning

Nicolas Thome

http://cedric.cnam.fr/vertigo/Cours/ml2/

Département Informatique

Conservatoire National des Arts et Métiers (Cnam)

Outline

1 Deep Learning History
  Deep Learning Strengths
  Deep Learning Weaknesses
  Deep Learning Revival

2 Modern Deep Learning

History Modern Deep Learning

Deep Learning: Expressiveness

MLP: Universal Function Approximators

● Neural network with a single hidden layer ⇒ universal approximator
Can approximate any continuous function on compact subsets of $\mathbb{R}^n$ [Cybenko, 1989]

E.g. for classification: any decision boundary can be expressed

⇒ very rich modeling capacity

[email protected] RCP209 / Deep Learning 1/ 64

Deep Learning: Expressiveness

● 2 layers, i.e. one hidden layer, is enough... theoretically:

BUT: exponential number of hidden units [Barron, 1993]

● Challenge is NOT fitting training data

Simple models already have very large (infinite) modeling power

● Challenge: optimization, overfitting

Deep Learning: Expressiveness & Compactness

● Deeper models: fewer units required

Functions representable compactly with $k$ layers may require exponential size with $k-1$ layers [Hastad, 1989, Bengio, 2009]

Digit recognition example, from [Goodfellow et al., 2016]

Same modeling power, fewer parameters

⇒ better generalization!

Inductive Bias in Deep Learning

● Deep models: hierarchy of sequential layers

● Layers: fully connected, convolution + non-linearity (together: a convolution layer), pooling

● Convolutional architecture: prior knowledge, aka INDUCTIVE BIAS

Deep learning: feature design ⇒ architecture design

Inductive Bias in Deep Learning

ConvNets & Prior Distribution

● Prior: imposing a distribution on the parameters of a fully connected net

● Weak prior: high entropy (uncertainty), strong prior: low entropy

● Infinitely strong prior: zero probability on some parameter values

● ConvNet ∼ infinitely strong prior on fully connected net weights

● Convolution: local interactions, shared weights ⇒ zero probability elsewhere
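
To make the "infinitely strong prior" concrete, here is a minimal sketch (an illustration, not part of the original slides) that builds the dense weight matrix a 1D convolution corresponds to: entries outside the local window are exactly zero, and the same kernel values are repeated on every row. The function names and the toy kernel are assumptions chosen for the example.

```python
def conv_as_dense(kernel, input_size):
    """Dense (fully connected) weight matrix equivalent to a 1D 'valid' convolution.

    Row r holds the kernel at offset r; every other entry is exactly zero.
    (As in most deep learning libraries, this is cross-correlation: no kernel flip.)
    """
    k = len(kernel)
    output_size = input_size - k + 1
    weights = [[0.0] * input_size for _ in range(output_size)]
    for r in range(output_size):
        for c in range(k):
            weights[r][r + c] = kernel[c]  # shared weights: same kernel on each row
    return weights


def dense_forward(weights, x):
    """Plain fully connected layer without bias: y = W x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def conv1d(kernel, x):
    """Direct sliding-window computation, for comparison."""
    k = len(kernel)
    return [sum(kernel[c] * x[r + c] for c in range(k)) for r in range(len(x) - k + 1)]


kernel = [1.0, 0.0, -1.0]              # toy edge-like filter (hypothetical values)
x = [0.0, 1.0, 3.0, 6.0, 10.0, 15.0]
W = conv_as_dense(kernel, len(x))

same_output = dense_forward(W, x) == conv1d(kernel, x)
n_free_dense = len(W) * len(W[0])      # parameters of an unconstrained FC layer
n_free_conv = len(kernel)              # parameters actually learned by the convolution
```

On this toy input the constrained dense layer and the convolution give the same output, while the number of free parameters drops from 24 to 3.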

Inductive Bias in Deep Learning

ConvNets as Inductive Bias

● ConvNet ∼ infinitely strong prior on fully connected net weights

● Convolution ⇒ supports learning translation-equivariant features

● Pooling ⇒ supports features invariant (stable) wrt local translations
Very rich modeling capacity: local interactions become global with depth
Significantly reduces # parameters, which reduces over-fitting

From [Goodfellow et al., 2016]

Inductive Bias in Deep Learning

ConvNet for Learning Compositions

● Conv/Pool hierarchies: feature composition
Depth: gradual complexity, larger spatial extent
Intuitive processing for hierarchical information modeling
Biological foundations: simple cells, complex cells

Inductive Bias in Deep Learning

ConvNet for Learning Compositions

● Hierarchical compositions
Low-level: edges, color
Mid-level: corners, parts
Higher levels: objects, scene concepts

● Distributed representations: sharing
Lower levels: shared by many classes
Higher levels: more class-specific

Deep Learning: Representation Learning

Latent representations: learned features

≠ Handcrafted features for the task
≠ Handcrafted kernels in kernel methods (SVM)

X-class classification, $K$ classes

● Last hidden layer: $\mathbb{R}^L \to \mathbb{R}^K$

● In $\mathbb{R}^L$, linear separation is required

● Deep learning: learning representations that gradually project data to $\mathbb{R}^L$ spaces where linear separation is possible

Deep Learning & Manifold Untangling

Manifold Untangling

Credit: DiCarlo

● DL: gradually projecting data to $\mathbb{R}^L$ spaces where linear separation is possible

● This is the definition of manifold untangling!

● ConvNets: inductive bias making manifold untangling easier!

Manifold Untangling Visualization

● We want to visualize each layer activation for each class

● High-dimensional visualization?

⇒ Projection to lower (e.g. 2d) dimensions

t-distributed Stochastic Neighbor Embedding (t-SNE)

● t-SNE [van der Maaten and Hinton, 2008]: non-linear projection

● Intuitively: close distances in initial space ⇒ close distances in projected (2d) space
Not preservation of all distances
Neighborhood preservation, i.e. preservation of small distances

t-SNE [van der Maaten and Hinton, 2008]

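● Similarity between points $(x_i, x_j)$ in initial space, e.g. $\mathbb{R}^d$:

$$p_{ij} = \frac{e^{-\|x_i - x_j\|_2^2}}{\sum_{k \neq l} e^{-\|x_k - x_l\|_2^2}}, \qquad P = \{p_{ij}\}_{(i,j) \in N \times N}$$

● Similarity between points $(y_i, y_j)$ in projected space, e.g. $\mathbb{R}^2$:

$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}, \qquad Q = \{q_{ij}\}_{(i,j) \in N \times N}$$

● Loss function: Kullback-Leibler divergence $KL(P \,\|\, Q)$

$$C = KL(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$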
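
The two similarity definitions and the loss can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: it uses a unit-bandwidth Gaussian kernel as written in the formula above (the published t-SNE algorithm instead tunes a per-point bandwidth from a perplexity setting), and the toy points and function names are hypothetical.

```python
import math


def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def normalize_pairs(weights):
    """Normalize pairwise weights so that they sum to 1 over all pairs k != l."""
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


def similarities_p(xs):
    """p_ij from a Gaussian kernel (unit bandwidth: an assumption of this sketch)."""
    n = len(xs)
    return normalize_pairs({(i, j): math.exp(-sq_dist(xs[i], xs[j]))
                            for i in range(n) for j in range(n) if i != j})


def similarities_q(ys):
    """q_ij from the heavy-tailed kernel (1 + ||y_i - y_j||^2)^-1."""
    n = len(ys)
    return normalize_pairs({(i, j): 1.0 / (1.0 + sq_dist(ys[i], ys[j]))
                            for i in range(n) for j in range(n) if i != j})


def kl_divergence(p, q):
    """C = KL(P || Q) = sum_ij p_ij log(p_ij / q_ij)."""
    return sum(p_ij * math.log(p_ij / q[pair]) for pair, p_ij in p.items() if p_ij > 0.0)


# Toy data (hypothetical): two tight pairs of points in R^3.
xs = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (2.0, 2.0, 2.0), (2.0, 2.1, 2.0)]
good_map = [(0.0, 0.0), (0.1, 0.0), (3.0, 3.0), (3.0, 3.1)]   # neighbors kept together
bad_map = [(0.0, 0.0), (3.0, 3.0), (0.1, 0.0), (3.0, 3.1)]    # neighbors split apart

p = similarities_p(xs)
cost_good = kl_divergence(p, similarities_q(good_map))
cost_bad = kl_divergence(p, similarities_q(bad_map))
```

A 2d layout that keeps the original neighbors together gets a lower cost than one that separates them, which is the behaviour the loss is designed to reward.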

t-SNE Visualization: MNIST example

● MNIST dataset: 28×28 grayscale images of digits

● 10 classes ⇔ digit ∈ {0, ..., 9}

● Input space dimension: $28^2 = 784$

● Projection in 2d (3d) space for visualization

● t-SNE for computing projection: gradient descent

$$\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)\left(1 + \|y_i - y_j\|^2\right)^{-1}$$

● Optimization (projection) for a given closed dataset

⇒ transductive learning
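
As a rough sketch of the optimization, the code below applies plain gradient descent with the t-SNE gradient to a tiny hand-made similarity matrix. The matrix `p`, the initial layout, the learning rate and the number of steps are all hypothetical; practical implementations add momentum and other refinements not shown here.

```python
import math


def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def q_matrix(ys):
    """q_ij = (1 + ||y_i - y_j||^2)^-1, normalized over all pairs k != l."""
    n = len(ys)
    w = [[0.0 if i == j else 1.0 / (1.0 + sq_dist(ys[i], ys[j])) for j in range(n)]
         for i in range(n)]
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w]


def cost(p, ys):
    """C = KL(P || Q) for the current layout ys."""
    q = q_matrix(ys)
    n = len(ys)
    return sum(p[i][j] * math.log(p[i][j] / q[i][j])
               for i in range(n) for j in range(n) if i != j and p[i][j] > 0.0)


def gradient(p, ys):
    """dC/dy_i = 4 * sum_j (p_ij - q_ij) (y_i - y_j) (1 + ||y_i - y_j||^2)^-1."""
    q = q_matrix(ys)
    n, dim = len(ys), len(ys[0])
    grads = []
    for i in range(n):
        g = [0.0] * dim
        for j in range(n):
            if i == j:
                continue
            coeff = 4.0 * (p[i][j] - q[i][j]) / (1.0 + sq_dist(ys[i], ys[j]))
            for d in range(dim):
                g[d] += coeff * (ys[i][d] - ys[j][d])
        grads.append(g)
    return grads


def descent_step(p, ys, learning_rate):
    grads = gradient(p, ys)
    return [[y_d - learning_rate * g_d for y_d, g_d in zip(y, g)]
            for y, g in zip(ys, grads)]


# Hypothetical symmetric P for 4 points: (0, 1) and (2, 3) are neighbors.
p = [[0.00, 0.20, 0.02, 0.03],
     [0.20, 0.00, 0.03, 0.02],
     [0.02, 0.03, 0.00, 0.20],
     [0.03, 0.02, 0.20, 0.00]]
ys = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]  # arbitrary initial layout

cost_before = cost(p, ys)
for _ in range(100):
    ys = descent_step(p, ys, learning_rate=0.2)
cost_after = cost(p, ys)
```

Because the layout `ys` is optimized directly for this fixed set of points, a new point cannot be projected without re-running the optimization, which is the transductive aspect noted above.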

t-SNE Visualization: MNIST example

● Application of t-SNE to the MNIST test set (10,000 images)

● Color ⇔ class ID

t-SNE Visualization: MNIST example

● Classes visually appear in 2d space, BUT overlap

● How to measure class separability?

Neighborhood Hit [Paulovich et al., 2008]:

$$NH = \frac{\#\,\text{pts in } k\text{-nn of the same class}}{\#\,\text{pts in } k\text{-nn}}$$
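
A minimal sketch of the Neighborhood Hit score, assuming Euclidean distance in the projected space and averaging the ratio over all points (both are assumptions of this example; the toy points and labels are hypothetical):

```python
def neighborhood_hit(points, labels, k):
    """Average, over all points, of the fraction of the k nearest neighbors
    (Euclidean distance, the point itself excluded) that share the point's label."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    hits = 0
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: sq_dist(p, points[j]))
        hits += sum(labels[j] == labels[i] for j in others[:k])
    return hits / (k * len(points))


# Toy 2D projection: two well-separated groups of 3 points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels = [0, 0, 0, 1, 1, 1]

nh_separated = neighborhood_hit(points, labels, k=2)          # every neighbor matches
nh_mixed = neighborhood_hit(points, [0, 1, 0, 1, 0, 1], k=2)  # labels interleaved
```

A score of 1 means every neighborhood is pure; lower values indicate overlapping classes.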

t-SNE Visualization: MNIST example

● How to measure class separability?

Fitting ellipses to each class's points
Non-overlapping ellipses ⇒ linear separability

80’s: LeNet 5 Model

● Total # parameters ∼ 60,000

● Evaluation on MNIST: test error of 0.95%

● Successful deployment for postal code reading in the US

LeNet 5 Model: Manifold Untangling

Figure panels: input space | latent space

LeNet 5 Model: Manifold Untangling

Figure panels: latent space (MLP) | latent space (LeNet)

Deep Neural Networks: Weaknesses & Drawbacks

Criticisms at two main levels

1 Modeling level: Neural Networks ⇔ Black Boxes

2 Training level: ad hoc, expertise, efficiency, guarantees

Deep Neural Networks: Black Boxes

● Lack of explainability: why this decision?

Hidden units not directly interpretable, unlike other models, e.g. decision trees, expert systems

⇒ Challenges: human-machine interaction, failure analysis

Deep Neural Networks: Black Boxes

● Lack of theory for architecture design

● How many layers, neurons?

● Layer type: fully connected, convolution, pooling?

● Trial and error: optimize architecture on validation set

⇒ Ad hoc, no theory to guide you

Deep Neural Networks: Training Issues

● Optimization: non-convex objective
No guarantee of reaching the global optimum
Solution dependent on initialization
Importance of (random) initialization, training reproducibility
Stochastic training: noisy gradient
Expertise: ad hoc hyper-parameter tuning: # epochs, decay, optimizers (next week), etc.
Costly tuning

Deep Neural Networks: Training Issues

● Deep models need huge annotated datasets

⇒ Huge models, huge computational demand

⇒ For a long time, impossible to train such models with existing resources

● Smaller datasets: inferior predictive performance
Small models: not enough expressive power
Large models overfit
Performance below handcrafted features

Deep Learning: Trends and methods in the last four decades

90’s: start of winter for deep learning

● Deep neural nets = 'black magic', black boxes
Lack of interpretability
Optimization issues for highly non-convex objective function

● Golden age of kernel methods
Generalization theory with Support Vector Machines
Extension to non-linear models: kernel trick
Kernels encode prior knowledge (structure) on data
Convex optimization problem

Deep Learning: Trends and methods in the last four decades

2000’s: Bag of Words Model (BoW)

● Started from the Information Retrieval (IR) community

● Text classification: document as a histogram of word occurrences

● BoW representation as input for powerful classifiers, e.g. SVM
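
A minimal sketch of the text BoW representation, assuming whitespace tokenization and a small fixed vocabulary (both hypothetical choices for the example):

```python
from collections import Counter


def bag_of_words(document, vocabulary):
    """Histogram of word occurrences over a fixed vocabulary (word order is discarded)."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


vocabulary = ["deep", "kernel", "learning", "svm"]   # hypothetical toy dictionary
doc = "Deep learning is learning deep representations"
histogram = bag_of_words(doc, vocabulary)            # fixed-size vector for a classifier
```

Every document maps to a vector of the same length, whatever its word count, which is what lets a standard classifier such as an SVM take it as input.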

2000’s: Bag of Words Model

● Adapting the BoW model for visual recognition? ⇒ Bag of Visual Words (BoV)

● Main challenge: definition of visual words unclear!

● Solution: compute a dictionary on local image regions (clustering)
Local regions represented by handcrafted descriptors, e.g. SIFT

● 2000’s: BoW + SVM state-of-the-art

● Many works on kernels on BoW, coding & pooling, up to 2012

Deep Learning: Trends and methods in the last four decades

Deep Learning renewal since 2006

● 2006: new unsupervised learning for Deep Belief Nets (DBN) [Hinton et al., 2006]

● Theoretical results for improving model quality with depth

● Unsupervised training used as init for supervised learning with back-prop

Deep Learning and ConvNet for Speech Recognition

● First DL breakthrough on large datasets: speech recognition

● Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition, Dahl et al. (2010)

Deep Learning and ConvNet for Image Classification

● ImageNet ILSVRC Challenge (Stanford):

1,200,000 training images, 1,000 classes, mono-label
Based on WordNet hierarchy (ontology)

Evaluation: top-5 error

● Up to 2012, leading approaches: BoW + SVM

● ILSVRC’12: the deep revolution ⇒ outstanding success of ConvNets [Krizhevsky et al., 2012]
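
The top-5 error counts a prediction as correct when the true label is among the five highest-scoring classes. A minimal sketch, with hypothetical scores over 6 classes:

```python
def top_k_error(scores, true_labels, k=5):
    """Fraction of samples whose true label is not among the k highest-scoring classes."""
    errors = 0
    for sample_scores, label in zip(scores, true_labels):
        ranked = sorted(range(len(sample_scores)),
                        key=lambda c: sample_scores[c], reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return errors / len(true_labels)


# Two samples with the same (made-up) class scores but different true labels.
scores = [[0.30, 0.25, 0.20, 0.15, 0.07, 0.03],
          [0.30, 0.25, 0.20, 0.15, 0.07, 0.03]]
true_labels = [4, 5]

top5 = top_k_error(scores, true_labels, k=5)   # label 4 is in the top 5, label 5 is not
top1 = top_k_error(scores, true_labels, k=1)   # neither label is the top prediction
```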

2012: the deep revolution

Deep ConvNet success at ILSVRC’12
Two main practical reasons:

1 Huge number of labeled images ($10^6$ images)
Possible to train very large models without over-fitting
Larger models make it possible to learn rich (semantic) feature hierarchies

2 GPU implementation for training
Relatively cheap and fast GPUs
Training time reduced to 1-2 weeks (up to 50x speed-up)

AlexNet [Krizhevsky et al., 2012] in ILSVRC’12

● 60,000,000 parameters

● 650,000 neurons - 630,000,000 connections

● 5 convolutional layers, 3 Fully Connected (FC)

Convolution layer: Convolution + non linearity (ReLU) + pooling

Full= FC + non linearity - Final FC: 4096-dim

● Trained on 2 GPUs for a week

AlexNet [Krizhevsky et al., 2012] in ILSVRC’12

First Convolutional Layer

● Input: images of size 227x227x3

● Filter (receptive field) size F = 11, stride S = 4

● 96 filters ⇒ output size 55*55*96 = 290,400 neurons

● Each filter: 11*11*3 = 363 weights + 1 bias = 364 params

N.B.: convolution over the whole feature map depth (cf. LeNet 5 discussion)

● # params: 96 * 364 = 34,944
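
The numbers above follow from the usual output-size formula $(W - F + 2P)/S + 1$. The short sketch below reproduces them; the helper names are hypothetical, and zero padding is assumed since the slide gives none.

```python
def conv_output_size(input_size, filter_size, stride, padding=0):
    """Spatial output size of a convolution: (W - F + 2P) / S + 1."""
    span = input_size - filter_size + 2 * padding
    if span % stride != 0:
        raise ValueError("filter does not tile the input evenly with this stride")
    return span // stride + 1


def conv_param_count(filter_size, in_channels, n_filters):
    """Each filter spans the full input depth and has one bias."""
    return n_filters * (filter_size * filter_size * in_channels + 1)


out_size = conv_output_size(input_size=227, filter_size=11, stride=4)      # 55
n_neurons = out_size * out_size * 96                                       # 290,400
n_params = conv_param_count(filter_size=11, in_channels=3, n_filters=96)   # 34,944
```

The layer has 290,400 output neurons but only 34,944 parameters, because each filter is shared across all 55x55 positions.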

AlexNet [Krizhevsky et al., 2012] in ILSVRC’12

Credit: R. Fergus

Deep Learning in 2012: Representation Learning

Deep: more semantic features

AlexNet [Krizhevsky et al., 2012] in ILSVRC’12

● Same global architecture as older nets, e.g. LeNet
Trained with back-prop and stochastic gradient descent

● But bigger (deeper and wider): $60 \times 10^6$ parameters vs $60 \times 10^3$
Needs more data ($10^6$ vs $10^4$)
GPU implementation for fast training

● Also some architectural and optimization improvements (see next):
Non-linearity: ReLU vs sigmoid
Overlapping pooling, Local Response Normalization (LRN)
Regularization: data augmentation, dropout

Outline

1 Deep Learning History

2 Modern Deep Learning
  Modern Non-Linearities
  Modern Training
  Modern Architectural Components

Modern Non-Linear Activation Modules

● Standard non-linear activation functions, e.g. sigmoid, tanh

● Saturating regime

⇒ Vanishing gradient: no back-prop

⇒ Slow convergence
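
To give an order of magnitude for the vanishing gradient, the sketch below multiplies the local sigmoid derivatives along one path through 10 layers. Weights are deliberately left out to isolate the effect of the non-linearity, so this illustrates the mechanism rather than simulating back-propagation in full; the depth and pre-activation values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # at most 0.25, reached at z = 0


def backprop_factor(pre_activations):
    """Product of the local sigmoid derivatives along one path through the layers."""
    factor = 1.0
    for z in pre_activations:
        factor *= sigmoid_grad(z)
    return factor


best_case = backprop_factor([0.0] * 10)   # 0.25 ** 10, about 9.5e-7
saturated = backprop_factor([5.0] * 10)   # far smaller: each factor is about 6.6e-3
```

Even in the most favourable case the gradient is scaled by about $10^{-6}$ after 10 layers, and in the saturating regime it is many orders of magnitude smaller.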

Rectified Linear Unit (ReLU)

$$\mathrm{ReLU}(z) = \begin{cases} z & \text{if } z \geq 0 \\ 0 & \text{otherwise} \end{cases} = \max\{0, z\}$$

Rectified Linear Unit (ReLU)

● Reducing vanishing gradient problems ⇒ faster learning / convergence

● Ex: 4-layer ConvNet, CIFAR-10

⇒ ReLU vs tanh: x6 speed-up

From [Krizhevsky et al., 2012]

Non-Linear Activation Modules

Sigmoid: saturation; expensive; not zero-centered

Tanh: saturation; expensive; zero-centered

ReLU: no saturation; very efficient; not zero-centered; negative activations ignored

Non-Linear Activation Modules

● ReLU: 0 for negative inputs ⇒ blocked gradient

● ReLU variants:

From [Gu et al., 2015]

● Leaky ReLU (LReLU): $\lambda$ empirically predefined

● Parametric ReLU (PReLU): $\lambda_k$ learned from data

● Randomized ReLU (RReLU): $\lambda_{nk}$ sampled uniformly

● Exponential Linear Unit (ELU): $\lambda$ fixed
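
The variants differ only in how they treat negative inputs. A minimal sketch follows; the default values of $\lambda$ and the sampling range for RReLU are common choices assumed for illustration, not values given in the slides.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def leaky_relu(z, lam=0.01):
    """LReLU / PReLU: slope lam on the negative side (fixed for LReLU, learned for PReLU)."""
    return z if z >= 0 else lam * z


def randomized_relu(z, low=1 / 8, high=1 / 3, rng=random):
    """RReLU at training time: the negative slope is drawn uniformly in [low, high]."""
    return z if z >= 0 else rng.uniform(low, high) * z


def elu(z, lam=1.0):
    """ELU: smooth saturation towards -lam for very negative inputs."""
    return z if z >= 0 else lam * (math.exp(z) - 1.0)
```

All four functions are the identity for $z \geq 0$; unlike ReLU, the variants keep a non-zero output (and gradient) for negative inputs.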

Non-Linear Activations: Conclusion

● ReLU non-linearity: training speed-up
Used in AlexNet at ImageNet’12
Now vanilla activation for essentially every network

Pre-Processing Modules

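● Normalization between input neurons known to help training

● Idea: enforcing a fixed distribution

Data set $X = \{x_j\}$, $j \in \{1; N\}$, $x_j = \{x_{ij}\} \in \mathbb{R}^m$

● Centering: subtraction of the mean $\mu_i$ of every individual feature $x_{ij}$

$$\mu_i = \frac{1}{N} \sum_{j=1}^{N} x_{ij} \;\Rightarrow\; x^N_{ij} = x_{ij} - \mu_i$$

● Normalization: centering + division by the std $\sigma_i$

$$\sigma_i^2 = \frac{1}{N} \sum_{j=1}^{N} (x_{ij} - \mu_i)^2 \;\Rightarrow\; x^N_{ij} = \frac{x_{ij} - \mu_i}{\sigma_i}$$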
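
A minimal sketch of centering and normalization on a toy dataset. The small `eps` added to the denominator is a safeguard against constant features, introduced here and not part of the formulas above; the data values are hypothetical.

```python
import math


def standardize(dataset, eps=1e-12):
    """Center each feature and divide by its standard deviation.

    dataset: list of N samples, each a list of m features.
    """
    n, m = len(dataset), len(dataset[0])
    means = [sum(x[i] for x in dataset) / n for i in range(m)]
    stds = [math.sqrt(sum((x[i] - means[i]) ** 2 for x in dataset) / n) for i in range(m)]
    normalized = [[(x[i] - means[i]) / (stds[i] + eps) for i in range(m)] for x in dataset]
    return normalized, means, stds


# Toy data: N = 4 samples, m = 2 features on very different scales.
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [4.0, 700.0]]
normalized, means, stds = standardize(data)
```

After normalization both features have zero mean and unit variance, so neither dominates only because of its scale.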

Pre-Processing Modules

More advanced pre-processing:

● De-correlation: centering + alignment with the principal axes of the covariance matrix

● Whitening: divide by each std ⇒ $\mathcal{N}(0, 1)$

● In practice, not used with ConvNets

Weight Initialization

● Non-convex deep learning objective ⇒ parameter init important

● Zero-init: all neurons have the same output, thus the same gradient

● Random init with small numbers, e.g. uniform or $W \sim \mathcal{N}(0, \sigma_i)$

● Layer with $m$ input neurons $x$ and output $s$: $\mathrm{Var}[s] = m \,\mathrm{Var}[w]\, \mathrm{Var}[x]$
Xavier init: $W \sim \frac{1}{\sqrt{m}} \mathcal{N}(0, \sigma_i)$ [Glorot and Bengio, 2010]
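
The variance relation can be checked empirically. The sketch below assumes a single linear unit $s = \sum_i w_i x_i$ with independent zero-mean inputs and weights; the fan-in, number of trials and seed are hypothetical.

```python
import math
import random


def output_variance(m, weight_std, n_trials=2000, seed=0):
    """Empirical Var[s] for s = sum_i w_i x_i, x_i ~ N(0, 1), w_i ~ N(0, weight_std^2)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_trials):
        s = sum(rng.gauss(0.0, weight_std) * rng.gauss(0.0, 1.0) for _ in range(m))
        samples.append(s)
    mean = sum(samples) / n_trials
    return sum((s - mean) ** 2 for s in samples) / n_trials


m = 100                                                          # fan-in
var_naive = output_variance(m, weight_std=1.0)                   # about m: variance explodes
var_xavier = output_variance(m, weight_std=1.0 / math.sqrt(m))   # about 1: preserved
```

With unit-variance weights the output variance is about $m$ times the input variance, whereas scaling the weights by $1/\sqrt{m}$ keeps it close to 1.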

Weight Initialization Example

Activation Histogram

Credit: Sullivan

● 10-layer net, 500 nodes at each layer

● Tanh activation, parameters init $W \sim \mathcal{N}(0, \sigma_i)$

● $\sigma_i$ small: activations may collapse to 0, not a good init

Weight Initialization Example

Activation Histogram

Credit: Sullivan

● 10-layer net, 500 nodes at each layer

● Tanh activation, parameters init $W \sim \mathcal{N}(0, \sigma_i)$

● $\sigma_i$ large: activations may saturate at ±1

⇒ vanishing gradient

Weight Initialization Example

● Tanh activation, Xavier init $W \sim \frac{1}{\sqrt{500}} \mathcal{N}(0, 1)$

Activation Histogram

Credit: Sullivan

● Helps to control activation variance through depth

Weight Initialization with ReLU

● Layer with $m$ input neurons $x$ and output $s$: $\mathrm{Var}[s] = \frac{m}{2} \,\mathrm{Var}[w]\, \mathrm{Var}[x]$ [He et al., 2015]

$W \sim \frac{1}{\sqrt{m/2}} \mathcal{N}(0, \sigma_i)$
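
The factor 2 comes from ReLU zeroing half of a symmetric input distribution. A small empirical check, assuming standard normal pre-activations feeding a ReLU followed by one linear unit (fan-in, number of trials and seed are hypothetical):

```python
import math
import random


def relu(z):
    return max(0.0, z)


def variance_after_relu_layer(m, weight_std, n_trials=2000, seed=0):
    """Empirical Var[s] for s = sum_i w_i ReLU(x_i), with pre-activations x_i ~ N(0, 1)."""
    rng = random.Random(seed)
    samples = [sum(rng.gauss(0.0, weight_std) * relu(rng.gauss(0.0, 1.0)) for _ in range(m))
               for _ in range(n_trials)]
    mean = sum(samples) / n_trials
    return sum((s - mean) ** 2 for s in samples) / n_trials


m = 100                                                         # fan-in
var_xavier = variance_after_relu_layer(m, 1.0 / math.sqrt(m))   # about 0.5: halved
var_he = variance_after_relu_layer(m, math.sqrt(2.0 / m))       # about 1: preserved
```

With the $1/\sqrt{m}$ scaling the variance is halved at every ReLU layer, so it shrinks geometrically with depth; the $\sqrt{2/m}$ scaling compensates for this.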

Training: Data-Augmentation

● Jittering, mirroring, color perturbation, rotation, stretching, shearing, lens distortions, etc. of the original images

● Increases # training samples, adds robustness to irrelevant variations

● Done in train AND in test

Training: Dropout [Hinton et al., 2012]

● Randomly omit each hidden unit with probability $p$, e.g. $p = 0.5$

● Regularization technique, limits over-fitting (better generalization)

Prevents co-adaptation, i.e. a feature only helpful when other specific features are present
May be viewed as averaging over many NNs
Slower convergence

Training: Dropout [Hinton et al., 2012]

● Training: dropout layer easily differentiable, freezing some weight updates

● What to do at test time?

Sample many different architectures, average output distributions
Faster alternative: use all hidden units (but after halving their outgoing weights)
Equivalent to the geometric mean in the case of a single hidden layer
Pretty good approximation for multiple layers
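
A minimal sketch of the two regimes, using a list of hypothetical hidden activations. It only illustrates that scaling by the keep probability matches the average activation over many randomly thinned networks; it does not reproduce the geometric-mean result on output distributions.

```python
import random


def dropout_train(activations, p, rng):
    """Training: omit (zero out) each unit independently with probability p."""
    return [0.0 if rng.random() < p else a for a in activations]


def dropout_test(activations, p):
    """Test: keep every unit and scale by the keep probability (1 - p).

    Scaling the activations is equivalent to scaling the outgoing weights;
    for p = 0.5 this is the 'halve the outgoing weights' rule.
    """
    return [(1.0 - p) * a for a in activations]


rng = random.Random(0)
hidden = [1.0, 2.0, 3.0, 4.0]     # hypothetical hidden activations
p = 0.5

# Average of many randomly thinned networks vs. the single scaled network.
n_samples = 20000
mean_thinned = [0.0] * len(hidden)
for _ in range(n_samples):
    for i, a in enumerate(dropout_train(hidden, p, rng)):
        mean_thinned[i] += a / n_samples

scaled = dropout_test(hidden, p)   # [0.5, 1.0, 1.5, 2.0]
```

The single scaled forward pass gives, at a fraction of the cost, the same expected activations as averaging over the sampled networks.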

Dropout: Conclusion

● Dropout: important for limiting over-fitting
Used in AlexNet at ImageNet’12
Common in current architectures, especially in FC layers

Local Response/Contrast Normalization

● Normalize value wrt spatial neighbors $\mathcal{N}(i, j)$

Credit: A. Vedaldi

● Local equalizing effect

● Helps learning more invariant representations

⇒ regularization, better generalization

Normalization: Local Feature Normalization

● Normalize value wrt neighbors in different feature maps

● Operates at each spatial position independently

● Feature groups: subset of maps (sliding window)

● ∼Lateral inhibition

Batch Normalization (BN) [Ioffe and Szegedy, 2015]

● Recap init: fixed input distribution known to help training

● Training deep neural networks: distribution of hidden layers unknown, changes over training time ⇒ covariate shift

● Importance of init, e.g. Xavier

● Batch Normalization (BN): ↓ importance of init, ↓ covariate shift

Batch Normalization (BN) [Ioffe and Szegedy, 2015]

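● Normalize input feature distribution ∼ $\mathcal{N}(0, 1)$
Normalization across each mini-batch:

$$\mu_B = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \sigma_B^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_B)^2, \qquad \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

$\epsilon$: for numerical stability

● Is an input feature distribution ∼ $\mathcal{N}(0, 1)$ a good idea?
Activation may never "saturate", e.g. sigmoid or tanh
Keeping in linear regime: depth useless, global linear model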

Batch Normalization (BN) [Ioffe and Szegedy, 2015]

● Scale and shift: $y_i = \gamma \hat{x}_i + \beta$, $(\gamma, \beta)$ trained

● Apply after FC / conv and before non-linearity

● Batch Normalization differentiable
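
The training-time forward pass of BN for one scalar feature can be sketched as follows. The value of `eps` and the toy batch are assumptions; the running statistics used at test time are not shown.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization of one scalar feature over a mini-batch (training mode)."""
    n = len(batch)
    mu = sum(batch) / n
    var = sum((x - mu) ** 2 for x in batch) / n
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in batch]
    return [gamma * xh + beta for xh in x_hat], mu, var


batch = [2.0, 4.0, 6.0, 8.0]                               # hypothetical pre-activations
normalized, mu, var = batch_norm(batch)                    # mean 0, variance close to 1
rescaled, _, _ = batch_norm(batch, gamma=2.0, beta=1.0)    # mean 1, std close to 2
```

With $\gamma = 1$ and $\beta = 0$ the output is simply the normalized feature; trained values of $(\gamma, \beta)$ let the network move away from the $\mathcal{N}(0, 1)$ regime when that helps.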

Batch Normalization (BN) [Ioffe and Szegedy, 2015]

● Applying BN at test time?

⇒Use train set statistics

● BN Strengths

Faster training convergence vs covariate shift
Regularization & generalization
Better performance

● BN conclusion: especially important for very deep models, e.g. ResNet (see next)

Pooling Modules

Overlapping Pooling. Ex: pooling size 5, stride s = 2

Credit: K. Matsui
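
A minimal 1D sketch of overlapping max pooling with the sizes of the example above (window 5, stride 2), next to classic non-overlapping pooling for comparison. The input values are hypothetical.

```python
def max_pool_1d(values, size, stride):
    """1D max pooling; windows overlap whenever stride < size."""
    return [max(values[start:start + size])
            for start in range(0, len(values) - size + 1, stride)]


signal = [1, 3, 2, 9, 4, 0, 5, 7, 6]                      # hypothetical feature row
overlapping = max_pool_1d(signal, size=5, stride=2)       # windows share 3 values
non_overlapping = max_pool_1d(signal, size=2, stride=2)   # classic disjoint windows
```

With overlapping windows a strong response (the value 9 here) contributes to several neighbouring outputs.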

Pooling Modules

Pooling across feature maps

● Aggregation for a given spatial position, between different tensor maps

● Tensor maps (filter outputs) associated with a given transformation ⇒ invariance with max pooling

Pooling across feature maps

● Ex: scaling [Kanazawa et al., 2014]

Pooling across feature maps

● Ex: rotation [Marcos et al., 2017]

Locally Connected vs Convolution Layers

● Locally connected: different features detected across image positions

● Successful in specific contexts, e.g. DeepFace [Taigman et al., 2014]

References I

[Barron, 1993] Barron, A. R. (1993).

Universal approximation bounds for superpositions of a sigmoidal function.

Information Theory, IEEE Transactions on, 39(3):930–945.

[Bengio, 2009] Bengio, Y. (2009).

Learning deep architectures for AI.

Found. Trends Mach. Learn., 2(1):1–127.

[Bengio and Delalleau, 2011] Bengio, Y. and Delalleau, O. (2011).

On the expressive power of deep architectures.

In Proceedings of the 22nd International Conference on Algorithmic Learning Theory, ALT’11, pages 18–36, Berlin, Heidelberg. Springer-Verlag.

[Cybenko, 1989] Cybenko, G. (1989).

Approximation by superpositions of a sigmoidal function.

Mathematics of control, signals and systems, 2(4):303–314.

[Glorot and Bengio, 2010] Glorot, X. and Bengio, Y. (2010).

Understanding the difficulty of training deep feedforward neural networks.

In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS’10).

[Goodfellow et al., 2016] Goodfellow, I., Bengio, Y., and Courville, A. (2016).

Deep Learning.

MIT Press.

http://www.deeplearningbook.org.

[Gu et al., 2015] Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., Liu, T., Wang, X., and Wang, G. (2015).

Recent advances in convolutional neural networks.

CoRR, abs/1512.07108.

[Hastad, 1989] Hastad, J. (1989).

Almost optimal lower bounds for small depth circuits.

In Randomness and Computation, pages 6–20. JAI Press.

References II

[He et al., 2015] He, K., Zhang, X., Ren, S., and Sun, J. (2015).

Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification.

In International Conference on Computer Vision (ICCV).

[Hinton et al., 2006] Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006).

A fast learning algorithm for deep belief nets.

Neural Comput., 18(7):1527–1554.

[Hinton et al., 2012] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012).

Improving neural networks by preventing co-adaptation of feature detectors.

CoRR, abs/1207.0580.

[Ioffe and Szegedy, 2015] Ioffe, S. and Szegedy, C. (2015).

Batch normalization: Accelerating deep network training by reducing internal covariate shift.

In Bach, F. R. and Blei, D. M., editors, ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 448–456. JMLR.org.

[Kanazawa et al., 2014] Kanazawa, A., Sharma, A., and Jacobs, D. W. (2014).

Locally scale-invariant convolutional neural networks.

In Deep Learning and Representation Learning Workshop: NIPS 2014.

[Krizhevsky et al., 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012).

ImageNet classification with deep convolutional neural networks.

In Advances in neural information processing systems, pages 1097–1105.

[Marcos et al., 2017] Marcos, D., Volpi, M., Komodakis, N., and Tuia, D. (2017).

Rotation equivariant vector field networks.

In The IEEE International Conference on Computer Vision (ICCV).

[Paulovich et al., 2008] Paulovich, F. V., Nonato, L. G., Minghim, R., and Levkowitz, H. (2008).

Least square projection: A fast high-precision multidimensional projection technique and its application to document mapping.

IEEE Trans. Vis. Comput. Graph., 14(3):564–575.

[Taigman et al., 2014] Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014).

DeepFace: Closing the gap to human-level performance in face verification.

In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

References III

[van der Maaten and Hinton, 2008] van der Maaten, L. and Hinton, G. E. (2008).

Visualizing high-dimensional data using t-SNE.

Journal of Machine Learning Research, 9:2579–2605.
