Apprentissage statistique : modélisation décisionnelle et apprentissage profond (RCP209)
Neural Networks and Deep Learning: History & Modern Deep Learning
Nicolas Thome [email protected]
http://cedric.cnam.fr/vertigo/Cours/ml2/
Département Informatique
Conservatoire National des Arts et Métiers (Cnam)
Outline
1 Deep Learning History
  Deep Learning Strengths
  Deep Learning Weaknesses
  Deep Learning Revival
2 Modern Deep Learning
History Modern Deep Learning
Deep Learning: Expressiveness
MLP: Universal Function Approximators
● Neural network with a single hidden layer ⇒ universal approximator
Can approximate any continuous function on compact subsets of $\mathbb{R}^n$ [Cybenko, 1989]
E.g., for classification: any decision boundary can be expressed
⇒ very rich modeling capacity
[email protected] RCP209 / Deep Learning 1/ 64
Deep Learning: Expressiveness
● 2 layers, i.e. one hidden layer, is enough... theoretically
BUT: may require an exponential number of hidden units [Barron, 1993]
● Challenge is NOT fitting training data
Simple models already have very large (infinite) modeling power
● Challenge: optimization, overfitting
Deep Learning: Expressiveness & Compactness
● Deeper models: fewer units required
Functions that are compactly representable with k layers may require exponential size with k−1 layers [Hastad, 1989, Bengio, 2009]
Digit recognition, from [Goodfellow et al., 2016]
Same modeling power, fewer parameters
⇒ better generalization!
Inductive Bias in Deep Learning
● Deep models: hierarchy of sequential layers
● Layers: fully connected, convolution layer (convolution + non-linearity), pooling
● Convolutional architecture: prior knowledge, a.k.a. INDUCTIVE BIAS
Deep learning: feature design ⇒ architecture design
Inductive Bias in Deep Learning
ConvNets & Prior Distribution
● Prior: imposing a distribution on the fully connected parameters
● Weak prior: high entropy (uncertainty); strong prior: low entropy
● Infinitely strong prior: zero probability on some parameters
● ConvNet ∼ infinitely strong prior on fully connected net weights
● Convolution: local interactions, shared weights ⇒ zero probability elsewhere
Inductive Bias in Deep Learning
ConvNets as Inductive Bias
● ConvNet ∼ infinitely strong prior on fully connected net weights
● Convolution ⇒ supports learning translation-equivariant features
● Pooling ⇒ supports features invariant (stable) wrt local translations
Very rich modeling capacity: local interactions ⇒ global with depth
Significantly fewer parameters ⇒ less over-fitting
From [Goodfellow et al., 2016]
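The equivariance and invariance properties above can be checked on a toy 1-D signal. The sketch below is only illustrative: the helper names (`conv1d`, `max_pool1d`, `shift_right`), the kernel and the signals are assumptions, not taken from the course material. Note that pooling invariance is only local: it holds while the feature stays inside the same pooling window.

```python
def conv1d(signal, kernel):
    """Valid 1-D cross-correlation: the same kernel weights are reused at every position."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def max_pool1d(signal, size, stride):
    """Max pooling over windows of `size` samples, moved by `stride`."""
    return [max(signal[i:i + size]) for i in range(0, len(signal) - size + 1, stride)]

def shift_right(signal, n):
    """Translate a signal by n positions, padding with zeros on the left."""
    return [0] * n + signal[:len(signal) - n]

x = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
edge_kernel = [-1, 0, 1]  # hypothetical edge-like filter

# Equivariance: convolving a shifted input gives the shifted feature map.
conv_of_shifted = conv1d(shift_right(x, 2), edge_kernel)
shifted_conv = shift_right(conv1d(x, edge_kernel), 2)

# Local invariance: a one-sample shift inside a pooling window leaves the output unchanged.
spike = [0, 1, 0, 0, 0, 0, 0, 0]
pooled = max_pool1d(spike, size=4, stride=4)
pooled_after_shift = max_pool1d(shift_right(spike, 1), size=4, stride=4)
```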
Inductive Bias in Deep Learning
ConvNet for Learning Compositions
● Conv/Pool hierarchies: feature composition
Depth: gradual complexity, larger spatial extent
Intuitive processing for hierarchical information modeling
Biological foundations: simple cells, complex cells
Inductive Bias in Deep Learning
ConvNet for Learning Compositions
● Hierarchical compositions
Low-level: edges, color
Mid-level: corners, parts
Higher levels: objects, scene concepts
● Distributed representations: sharing
Lower levels: shared by many classes
Higher levels: more class-specific
Deep Learning: Representation Learning
Latent representations: learned features
≠ Handcrafted features for the task
≠ Handcrafted kernels in kernel methods (SVM)
X-class classification, K classes
● Last hidden layer: $\mathbb{R}^L \to \mathbb{R}^K$
● In $\mathbb{R}^L$, linear separation is required
● Deep Learning: learning representations that gradually project data to a space $\mathbb{R}^L$ where linear separation is possible
Deep Learning & Manifold Untangling
Manifold Untangling
Credit: DiCarlo
● DL: gradually projecting data to a space $\mathbb{R}^L$ where linear separation is possible
● This is the definition of manifold untangling!
● ConvNets: inductive bias making manifold untangling easier!
Manifold Untangling Visualization
● We want to visualize each layer activation for each class
● High-dimensional visualization?
⇒ Projection to lower (e.g. 2d) dimensions
t-distributed Stochastic Neighbor Embedding (t-SNE)
● t-SNE [van der Maaten and Hinton, 2008]:
non-linear projection
● Intuitively: close distances in initial space
⇒ close distances in projected (2d) space
Distance preservation
Neighborhood preservation, i.e. small distance preservation
t-SNE [van der Maaten and Hinton, 2008]
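● Similarity between points $(x_i, x_j)$ in the initial space, e.g. $\mathbb{R}^d$:
$$p_{ij} = \frac{e^{-\frac{\|x_i - x_j\|^2}{2\sigma^2}}}{\sum_{k \neq l} e^{-\frac{\|x_k - x_l\|^2}{2\sigma^2}}}, \qquad P = \{p_{ij}\}_{(i,j) \in N \times N}$$
● Similarity between points $(y_i, y_j)$ in the projected space, e.g. $\mathbb{R}^2$:
$$q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}, \qquad Q = \{q_{ij}\}_{(i,j) \in N \times N}$$
● Loss function: Kullback-Leibler divergence $KL(P \| Q)$
$$C = KL(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$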
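As a minimal sketch of the definitions of P, Q and the KL loss, the snippet below computes them on four toy points. It assumes a single global σ as in the formula for $p_{ij}$; the function names and the sample coordinates are illustrative assumptions only.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def normalized(weights):
    """Divide every entry by the sum over all pairs (diagonal entries are 0)."""
    total = sum(map(sum, weights))
    return [[w / total for w in row] for row in weights]

def gaussian_similarities(X, sigma):
    """p_ij: Gaussian similarities in the initial space, normalized over all pairs k != l."""
    n = len(X)
    return normalized([[math.exp(-sq_dist(X[i], X[j]) / (2 * sigma ** 2)) if i != j else 0.0
                        for j in range(n)] for i in range(n)])

def student_similarities(Y):
    """q_ij: heavy-tailed (Student-t) similarities in the projected space."""
    n = len(Y)
    return normalized([[1.0 / (1.0 + sq_dist(Y[i], Y[j])) if i != j else 0.0
                        for j in range(n)] for i in range(n)])

def kl_divergence(P, Q):
    """C = sum_ij p_ij log(p_ij / q_ij), skipping zero entries of P."""
    return sum(p * math.log(p / q)
               for row_p, row_q in zip(P, Q)
               for p, q in zip(row_p, row_q) if p > 0)

# Toy data: two tight pairs in 3-D, and a 2-D layout that does not respect them.
X = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [3.0, 3.0, 0.0], [3.0, 3.1, 0.0]]
Y = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
P = gaussian_similarities(X, sigma=1.0)
Q = student_similarities(Y)
loss = kl_divergence(P, Q)
```

The loss is zero only when Q reproduces P exactly, so a poor 2-D layout such as `Y` gives a strictly positive value.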
t-SNE Visualization: MNIST example
● MNIST dataset: 28×28 grayscale images of digits
● 10 classes ⇔ digits ∈ {0, ..., 9}
● Input space dimension: $28^2 = 784$
● Projection in 2d (3d) space for visualization
● t-SNE for computing projection: gradient descent
$$\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \|y_i - y_j\|^2)^{-1}$$
● Optimization (projection) for a given closed dataset
⇒transductive learning
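The gradient descent on the projected points can be sketched as follows. This is a bare-bones illustration under stated assumptions: a single global σ, plain gradient descent without momentum or early exaggeration, and a hypothetical learning rate `lr`; practical t-SNE implementations add several refinements.

```python
import math

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def normalized(weights):
    total = sum(map(sum, weights))
    return [[w / total for w in row] for row in weights]

def p_matrix(X, sigma):
    n = len(X)
    return normalized([[math.exp(-sq_dist(X[i], X[j]) / (2 * sigma ** 2)) if i != j else 0.0
                        for j in range(n)] for i in range(n)])

def q_matrix(Y):
    n = len(Y)
    return normalized([[1.0 / (1.0 + sq_dist(Y[i], Y[j])) if i != j else 0.0
                        for j in range(n)] for i in range(n)])

def cost(P, Y):
    """KL(P || Q) for the current projection Y."""
    Q = q_matrix(Y)
    n = len(Y)
    return sum(P[i][j] * math.log(P[i][j] / Q[i][j])
               for i in range(n) for j in range(n) if i != j and P[i][j] > 0)

def gradient(P, Y):
    """dC/dy_i = 4 * sum_j (p_ij - q_ij) (y_i - y_j) / (1 + ||y_i - y_j||^2)."""
    Q = q_matrix(Y)
    n, dim = len(Y), len(Y[0])
    grad = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                coeff = 4.0 * (P[i][j] - Q[i][j]) / (1.0 + sq_dist(Y[i], Y[j]))
                for k in range(dim):
                    grad[i][k] += coeff * (Y[i][k] - Y[j][k])
    return grad

def descent_step(P, Y, lr):
    """One plain gradient descent update of all projected points."""
    g = gradient(P, Y)
    return [[y - lr * d for y, d in zip(yi, gi)] for yi, gi in zip(Y, g)]

X = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [3.0, 3.0, 0.0], [3.0, 3.1, 0.0]]
Y = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
P = p_matrix(X, sigma=1.0)
Y_next = descent_step(P, Y, lr=0.05)
```

Because the projection is optimized for this fixed set of points, a new sample cannot simply be passed through a learned function, which is what makes the procedure transductive.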
t-SNE Visualization: MNIST example
● t-SNE applied to the MNIST test set (10,000 images)
● Color⇔class ID
t-SNE Visualization: MNIST example
● Classes visually appear in 2d space, BUT overlap
● How to measure class separability?
Neighborhood Hit [Paulovich et al., 2008]:
$$NH = \frac{\#\,\text{pts in kNN of the same class}}{\#\,\text{pts in kNN}}$$
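A small sketch of the Neighborhood Hit measure, assuming it is averaged over all points and that neighbors are found by brute-force Euclidean distance. The function name, the toy points and the choice k = 2 are illustrative assumptions.

```python
def neighborhood_hit(points, labels, k):
    """Average fraction of each point's k nearest neighbors that share its class."""
    hits = 0
    for i, p in enumerate(points):
        # Rank all other points by squared Euclidean distance to p.
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: sum((a - b) ** 2 for a, b in zip(p, points[j])))
        hits += sum(labels[j] == labels[i] for j in others[:k])
    return hits / (k * len(points))

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
separated = neighborhood_hit(points, [0, 0, 0, 1, 1, 1], k=2)  # classes match the clusters
mixed = neighborhood_hit(points, [0, 1, 0, 1, 0, 1], k=2)      # classes interleaved
```

A value close to 1 means that classes form well-separated groups in the projection; lower values indicate overlap.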
t-SNE Visualization: MNIST example
● How to measure class separability?
Fitting ellipses to each class's points
Non-overlapping ellipses ⇒ linear separability
80’s: LeNet 5 Model
● Total # parameters∼60000
● Evaluation on MNIST: test error of 0.95%
● Successful deployment for postal code reading in the US
LeNet 5 Model: Manifold Untangling
Input space Latent space
LeNet 5 Model: Manifold Untangling
Latent space MLP Latent space LeNet
Deep Neural Networks: Weaknesses & Drawbacks
Criticisms at two main levels
1 Modeling level: Neural Networks⇔Black Boxes
2 Training level: ad hoc, expertise, efficiency, guarantees
Deep Neural Networks: Black Boxes
● Lack of explainability: why this decision?
Hidden units not directly interpretable ≠ others, e.g. decision trees, expert systems
⇒ Challenges: human-machine interaction, failure analysis
Deep Neural Networks: Black Boxes
● Lack of theory for architecture design
● How many layers, neurons?
● Layer type: fully connected, convolution, pooling?
● Trial and error: optimize architecture on validation set
⇒Ad hoc, no theory to guide you
Deep Neural Networks: Training Issues
● Optimization: non-convex objective
No guarantee of reaching the global optimum
Solution depends on initialization
Importance of (random) initialization
⇒ training reproducibility
Stochastic training: noisy gradient
Expertise: ad hoc hyper-parameter tuning:
# epochs, decay, optimizers (next week), etc.
Costly tuning
Deep Neural Networks: Training Issues
● Deep models need huge annotated datasets
⇒Huge models, huge computational demand
⇒ Long impossible to train such models with the resources available
● Smaller datasets: inferior predictive performance
Small models: not enough expressive power
Large models overfit
⇒ Performance below that of handcrafted features
Deep Learning: Trends and methods in the last four decades
90’s: start of winter for deep learning
● Deep neural nets = 'black magic', black boxes
Lack of interpretability
Optimization issues for highly non-convex objective functions
● Golden age of kernel methods
Generalization theory with Support Vector Machines
Extension to non-linear models: kernel trick
Kernels encode prior knowledge (structure) on data
Convex optimization problem
Deep Learning: Trends and methods in the last four decades
2000’s: Bag of Words Model (BoW)
● Started from the Information Retrieval (IR) community
● Text classification: document as a histogram of word occurrences
● BoW representation as input for powerful classifiers, e.g. SVM
2000’s: Bag of Words Model
● Adapting the BoW model for visual recognition? ⇒ Bag of Visual Words (BoV)
● Main challenge: definition of visual words unclear!
● Solution: compute a dictionary on local image regions (clustering)
Local regions represented by handcrafted descriptors, e.g. SIFT
● 2000’s: BoW + SVM state-of-the-art
● Many works on kernels on BoW, coding & pooling, until 2012
Deep Learning: Trends and methods in the last four decades
Deep Learning renewal since 2006
● 2006: new unsupervised learning for Deep Belief Nets (DBN) [Hinton et al., 2006]
● Theoretical results for improving model quality with depth
● Unsupervised training used as init for supervised learning with back-prop
Deep Learning and ConvNet for Speech Recognition
● First DL breakthrough on large datasets: speech recognition
● Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition, Dahl et al. (2010)
Deep Learning and ConvNet for Image Classification
● ImageNet ILSVRC Challenge (Stanford):
1,200,000 training images, 1,000 classes, mono-label
Based on WordNet hierarchy (ontology)
Evaluation: top-5 error
● Up to 2012, leading approaches: BoW + SVM
● ILSVRC’12: the deep revolution ⇒ outstanding success of ConvNets [Krizhevsky et al., 2012]
2012: the deep revolution
Deep ConvNet success at ILSVRC’12
Two main practical reasons:
1 Huge number of labeled images ($10^6$ images)
Possible to train very large models without over-fitting
Larger models can learn rich (semantic) feature hierarchies
2 GPU implementation for training
Relatively cheap and fast GPUs
Training time reduced to 1-2 weeks (up to 50x speed-up)
AlexNet [Krizhevsky et al., 2012] in ILSVRC’12
● 60,000,000 parameters
● 650,000 neurons - 630,000,000 connections
● 5 convolutional layers, 3 Fully Connected (FC)
Convolution layer: Convolution + non linearity (ReLU) + pooling
Full= FC + non linearity - Final FC: 4096-dim
● Trained on 2 GPUs for a week
AlexNet [Krizhevsky et al., 2012] in ILSVRC’12
First Convolutional Layer
● Input images: 227x227x3
● Filter (receptive field) size F = 11, stride S = 4
● 96 filters ⇒ output size 55*55*96 = 290,400 neurons
● Each filter: 11*11*3 = 363 weights + 1 bias = 364 params
N.B.: Convolution over the whole feature map depth (cf. LeNet 5 discussion)
● # params: 96 * 364 = 34,944
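The arithmetic for this first layer can be reproduced with the usual output-size formula $(W - F + 2P)/S + 1$. In the sketch below the helper name and the assumption of zero padding ($P = 0$) are illustrative; the slide does not state the padding explicitly.

```python
def conv_output_size(input_size, filter_size, stride, padding=0):
    """Spatial output size of a convolution: (W - F + 2P) / S + 1."""
    return (input_size + 2 * padding - filter_size) // stride + 1

input_size, depth = 227, 3
filter_size, stride, num_filters = 11, 4, 96

out_size = conv_output_size(input_size, filter_size, stride)   # spatial size of each map
num_neurons = out_size * out_size * num_filters                 # activations in the layer
params_per_filter = filter_size * filter_size * depth + 1       # weights + 1 bias
num_params = num_filters * params_per_filter                    # shared across positions
```

Weight sharing is what keeps the parameter count small: the layer has 290,400 output neurons but fewer than 35,000 parameters.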
AlexNet [Krizhevsky et al., 2012] in ILSVRC’12
Credit: R. Fergus
Deep Learning in 2012: Representation Learning
Deep: more semantic features
AlexNet [Krizhevsky et al., 2012] in ILSVRC’12
● Same global architecture as older nets, e.g. LeNet
Trained with back-prop and stochastic gradient descent
● But bigger (deeper and wider): $60 \cdot 10^6$ parameters vs $60 \cdot 10^3$
Needs more data ($10^6$ vs $10^4$)
GPU implementation for fast training
● Also some architectural and optimization improvements (see next):
Non-linearity: ReLU vs sigmoid
Overlapping pooling (Local Response Normalisation, LRN)
Regularization: data augmentation, dropout
Outline
1 Deep Learning History
2 Modern Deep Learning
  Modern Non-Linearities
  Modern Training
  Modern Architectural Components
Modern Non-Linear Activation Modules
● Standard non-linear activation functions, e.g. sigmoid, tanh
● Saturating regime
⇒Vanishing gradient: no back-prop
⇒Slow convergence
Rectified Linear Unit (ReLU)
$$\text{ReLU}(z) = \begin{cases} z & \text{if } z \geq 0 \\ 0 & \text{otherwise} \end{cases} = \max\{0, z\}$$
Rectified Linear Unit (ReLU)
● Reducing vanishing gradient problems ⇒ faster learning / convergence
● Ex: 4-layer ConvNet, CIFAR-10
⇒ ReLU vs tanh: 6x speed-up
From [Krizhevsky et al., 2012]
Non-Linear Activation Modules
Sigmoid
● Saturation
● Expensive
● Not zero-centered
Tanh
● Saturation
● Expensive
● Zero-centered
ReLU
● No saturation
● Very efficient
● Not zero-centered
● Negative activations ignored
Non-Linear Activation Modules
● ReLU: 0 for negative inputs ⇒ blocked gradient
● ReLU variants:
From [Gu et al., 2015]
● Leaky ReLU (LReLU): λ empirically predefined
● Parametric ReLU (PReLU): $\lambda_k$ learned from data
● Randomized ReLU (RReLU): $\lambda_k^n$ sampled uniformly
● Exponential Linear Unit (ELU): λ fixed
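The variants listed above differ only in how negative inputs are handled. The sketch below uses the common scalar definitions; the default values of λ and the sampling range for RReLU are illustrative assumptions, not values taken from the slides.

```python
import math
import random

def relu(z):
    return max(0.0, z)

def leaky_relu(z, lam=0.01):
    """LReLU: small fixed slope lam for negative inputs (lam chosen by hand)."""
    return z if z >= 0 else lam * z

def prelu(z, lam):
    """PReLU: same form as LReLU, but lam is a parameter learned with the weights."""
    return z if z >= 0 else lam * z

def rrelu(z, low, high, rng):
    """RReLU (training mode): slope sampled uniformly in [low, high] for each example."""
    lam = rng.uniform(low, high)
    return z if z >= 0 else lam * z

def elu(z, lam=1.0):
    """ELU: smooth exponential saturation towards -lam for negative inputs."""
    return z if z >= 0 else lam * (math.exp(z) - 1.0)

rng = random.Random(0)
outputs = {
    "relu": relu(-2.0),
    "leaky_relu": leaky_relu(-2.0),
    "prelu": prelu(-2.0, lam=0.25),
    "rrelu": rrelu(-2.0, 0.125, 0.333, rng),
    "elu": elu(-2.0),
}
```

All variants keep a non-zero gradient for negative inputs, which is the point of replacing the flat negative branch of ReLU.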
Non-Linear Activations: Conclusion
● ReLU non-linearity: training speed-up
Used in AlexNet at ImageNet’12
Now vanilla activation for essentially every network
Pre-Processing Modules
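● Normalization between input neurons known to help training
● Idea: enforcing fixed distribution
Data set $X = \{x_j\}$, $j \in \{1; N\}$, $x_j = \{x_{ij}\} \in \mathbb{R}^m$
● Centering: subtraction of the mean $\mu_i$ of every individual feature $x_{ij}$
$$\mu_i = \frac{1}{N} \sum_{j=1}^{N} x_{ij} \;\Rightarrow\; x^N_{ij} = x_{ij} - \mu_i$$
● Normalization: centering + division by the std $\sigma_i$
$$\sigma_i^2 = \frac{1}{N} \sum_{j=1}^{N} (x_{ij} - \mu_i)^2 \;\Rightarrow\; x^N_{ij} = \frac{x_{ij} - \mu_i}{\sigma_i}$$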
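A minimal sketch of centering and normalization on a tiny dataset, with one row per sample and one column per feature. The function name is illustrative, and the code assumes every feature has non-zero standard deviation.

```python
import math

def standardize(X):
    """Center each feature and divide by its standard deviation (computed over the N samples)."""
    n, m = len(X), len(X[0])
    mu = [sum(x[i] for x in X) / n for i in range(m)]
    sigma = [math.sqrt(sum((x[i] - mu[i]) ** 2 for x in X) / n) for i in range(m)]
    X_norm = [[(x[i] - mu[i]) / sigma[i] for i in range(m)] for x in X]
    return X_norm, mu, sigma

# Two features with very different scales.
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
X_norm, mu, sigma = standardize(X)
```

After normalization both features have zero mean and unit variance, so neither dominates the first layer simply because of its scale. The statistics `mu` and `sigma` would typically be computed on the training set and reused for new data.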
Pre-Processing Modules
More advanced processings:
● De-correlation: centering + alignment with the principal axes of the covariance matrix
● Whitening: additionally divide by each std ⇒ $\mathcal{N}(0, 1)$
● In practice, not used with ConvNets
Weight Initialization
● Non-convex deep learning objective⇒param init important
● Zero-init: all neurons same output, thus same gradient
● Random init with small numbers, e.g. uniform or $W \sim \mathcal{N}(0, \sigma_i)$
● Input layer x with m neurons, output s: $\text{Var}[s] = m \, \text{Var}[w] \, \text{Var}[x]$
Xavier init: $W \sim \frac{1}{\sqrt{m}} \mathcal{N}(0, \sigma_i)$ [Glorot and Bengio, 2010]
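The relation $\text{Var}[s] = m \, \text{Var}[w] \, \text{Var}[x]$ can be checked numerically for a single linear unit $s = \sum_i w_i x_i$. The sketch below is a Monte Carlo estimate under assumptions that are not in the slides: zero-mean, unit-variance Gaussian inputs, m = 100, and 2000 random trials.

```python
import math
import random

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def pre_activation_samples(m, weight_std, trials, rng):
    """Draw s = sum_i w_i * x_i with x_i ~ N(0, 1) and w_i ~ N(0, weight_std^2)."""
    samples = []
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(m)]
        w = [rng.gauss(0.0, weight_std) for _ in range(m)]
        samples.append(sum(wi * xi for wi, xi in zip(w, x)))
    return samples

rng = random.Random(0)
m = 100

# Unit-variance weights: the variance of s grows like m.
var_naive = variance(pre_activation_samples(m, 1.0, 2000, rng))

# Weights scaled by 1/sqrt(m): the variance of s stays close to that of x.
var_xavier = variance(pre_activation_samples(m, 1.0 / math.sqrt(m), 2000, rng))
```

Keeping the variance near 1 at each layer is what prevents activations from shrinking or blowing up as depth increases.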
Weight Initialization Example
Activation Histogram
Credit: Sullivan
● 10-layer net, 500 nodes at each layer
● Tanh activation, parameters init W∼ N (0, σi)
● $\sigma_i$ small: activations collapse to 0, not a good init
● $\sigma_i$ large: activations saturate at ±1
⇒vanishing gradient
Weight Initialization Example
● Tanh activation, Xavier init $W \sim \frac{1}{\sqrt{500}} \mathcal{N}(0, 1)$
Activation Histogram
Credit: Sullivan
● Helps to control activation variance through depth
Weight Initialization with ReLU
● Input layer x with m neurons, output s: $\text{Var}[s] = \frac{m}{2} \text{Var}[w] \text{Var}[x]$ [He et al., 2015]
$$W \sim \frac{1}{\sqrt{m/2}} \mathcal{N}(0, \sigma_i)$$
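The factor 1/2 comes from ReLU zeroing half of a symmetric input distribution. The Monte Carlo sketch below illustrates this under stated assumptions (unit-variance Gaussian pre-activations from the previous layer, m = 100, 2000 trials; names are illustrative): with the $1/\sqrt{m}$ scaling the variance is halved at each layer, while the $1/\sqrt{m/2}$ scaling preserves it.

```python
import math
import random

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def relu_layer_samples(m, weight_std, trials, rng):
    """Draw s = sum_i w_i * relu(z_i), with z_i ~ N(0, 1) the previous pre-activations."""
    samples = []
    for _ in range(trials):
        a = [max(0.0, rng.gauss(0.0, 1.0)) for _ in range(m)]
        w = [rng.gauss(0.0, weight_std) for _ in range(m)]
        samples.append(sum(wi * ai for wi, ai in zip(w, a)))
    return samples

rng = random.Random(0)
m = 100

# 1/sqrt(m) scaling: ReLU halves the variance at every layer.
var_xavier = variance(relu_layer_samples(m, 1.0 / math.sqrt(m), 2000, rng))

# 1/sqrt(m/2) scaling: the factor 2 compensates for the zeroed half.
var_he = variance(relu_layer_samples(m, 1.0 / math.sqrt(m / 2), 2000, rng))
```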
Training: Data-Augmentation
● Jittering, mirroring, color perturbation, rotation, stretching, shearing, lens distortions, etc. applied to the original images
● Increases # training samples, adds robustness to irrelevant variations
● Done in train AND in test
Training: Dropout [Hinton et al., 2012]
● Randomly omit each hidden unit with probability p, e.g. p = 0.5
● Regularization technique, limits over-fitting (better generalization)
Prevents co-adaptation, i.e. a feature only helpful when other specific features are present
May be viewed as averaging over many NNs
Slower convergence
Training: Dropout [Hinton et al., 2012]
● Training: dropout layer easily differentiable, freezing some weight updates
● What to do at test time?
Sample many different architectures, average output distributions
Faster alternative: use all hidden units, but after halving their outgoing weights (for p = 0.5)
Equivalent to the geometric mean in the case of a single hidden layer
Pretty good approximation for multiple layers
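The train/test asymmetry can be sketched on a vector of activations. The code below follows the formulation described here (no rescaling during training, scaling by the keep probability 1 − p at test time); scaling the activations is used as a stand-in for halving the outgoing weights, which is equivalent for the linear layer that follows. Function names are illustrative assumptions.

```python
import random

def dropout_train(activations, p, rng):
    """Training: zero each unit independently with probability p."""
    return [0.0 if rng.random() < p else a for a in activations]

def dropout_test(activations, p):
    """Test: keep all units, scaled by the keep probability 1 - p."""
    return [(1.0 - p) * a for a in activations]

rng = random.Random(0)
hidden = [2.0, 4.0]
p = 0.5

one_mask = dropout_train(hidden, p, rng)

# Averaging over many random masks approaches the deterministic test-time output.
n_samples = 20000
sums = [0.0, 0.0]
for _ in range(n_samples):
    dropped = dropout_train(hidden, p, rng)
    sums = [s + d for s, d in zip(sums, dropped)]
mask_average = [s / n_samples for s in sums]
test_output = dropout_test(hidden, p)
```

Many libraries instead rescale by 1/(1 − p) during training ("inverted dropout") so that nothing needs to change at test time; the expected activation is the same in both conventions.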
Dropout: Conclusion
● Dropout: important for limiting over-fitting
Used in AlexNet at ImageNet’12
Common in current architectures, especially in FC layers
Local Response/Contrast Normalization
● Normalize value wrt spatial neighbors $N(i,j)$
Credit: A. Vedaldi
● Local equalization effect
● Helps learning more invariant representations
⇒regularization, better generalization
Normalization: Local Feature Normalization
● Normalize value wrt neighbors in different feature maps
● Operates at each spatial position independently
● Feature groups: subset of maps (sliding window)
● ∼Lateral inhibition
Batch Normalization (BN) [Ioffe and Szegedy, 2015]
● Recap init: fixed input distribution known to help training
● Training deep neural networks: distribution of hidden layers unknown, changes during training ⇒ covariate shift
● Importance of init, e.g. Xavier
● Batch Normalization (BN):
↓ importance of init, ↓ covariate shift
Batch Normalization (BN) [Ioffe and Szegedy, 2015]
● Normalize input feature distribution ∼ $\mathcal{N}(0, 1)$
Normalization across each mini-batch:
$$\mu_B = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \sigma_B^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_B)^2, \qquad \hat{x}_i = \frac{x_i - \mu_B}{\sigma_B + \epsilon}$$
$\epsilon$: for numerical stability
● Is an input feature distribution ∼ $\mathcal{N}(0, 1)$ a good idea?
Activations may never "saturate", e.g. sigmoid or tanh
Staying in the linear regime: depth useless,
∼ global linear model
Batch Normalization (BN) [Ioffe and Szegedy, 2015]
● Scale and shift: $y_i = \gamma \hat{x}_i + \beta$, $(\gamma, \beta)$ trained
● Apply after FC / conv and before non-linearity
● Batch Normalization differentiable
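A minimal sketch of the BN forward pass at training time for one scalar feature over a mini-batch: normalization with the batch statistics, followed by the trained scale and shift. The function name and the value of `eps` are assumptions; at test time the batch statistics would be replaced by statistics gathered on the training set.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize one feature over a mini-batch, then apply the trained scale and shift."""
    n = len(batch)
    mu = sum(batch) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in batch) / n)
    x_hat = [(x - mu) / (sigma + eps) for x in batch]   # approximately N(0, 1)
    return [gamma * xh + beta for xh in x_hat]          # y_i = gamma * x_hat_i + beta

batch = [1.0, 2.0, 3.0, 4.0]
out = batch_norm(batch, gamma=2.0, beta=3.0)

out_mean = sum(out) / len(out)
out_std = math.sqrt(sum((y - out_mean) ** 2 for y in out) / len(out))
```

The output has mean β and standard deviation close to γ, so the network can learn to undo the normalization if staying in the linear regime of the activation is not desirable.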
Batch Normalization (BN) [Ioffe and Szegedy, 2015]
● Applying BN at test time?
⇒Use train set statistics
● BN Strengths
Faster training convergence (less covariate shift)
Regularization & generalization
⇒ better performance
● BN conclusion: especially important for very deep models, e.g. ResNet (see next)
Pooling Modules
Overlapping pooling. Ex: pooling size 5, stride s = 2
Credit: K. Matsui
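Pooling windows overlap whenever the stride is smaller than the pooling size. The 1-D sketch below uses the size 5 / stride 2 setting of the example; the helper name and the input signal are illustrative assumptions.

```python
def max_pool1d(signal, size, stride):
    """Max over windows of `size` samples, moved by `stride` (overlap when stride < size)."""
    return [max(signal[i:i + size]) for i in range(0, len(signal) - size + 1, stride)]

x = [1, 5, 2, 0, 3, 7, 4, 6, 2]

# Overlapping: windows start at positions 0, 2, 4 and share three samples with their neighbor.
overlapping = max_pool1d(x, size=5, stride=2)

# Non-overlapping reference: stride equal to the window size.
disjoint = max_pool1d(x, size=3, stride=3)
```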
Pooling Modules
Pooling across feature maps
● Aggregation for a given spatial position, between different tensor maps
● Tensor maps (filter output) associated to a given transformation⇒ invariance with max pooling
Pooling across feature maps
● Ex: scaling [Kanazawa et al., 2014]
Pooling across feature maps
● Ex: rotation [Marcos et al., 2017]
Locally Connected vs Convolution Layers
● Locally connected: different features detected across image positions
● Successful in specific contexts, e.g. DeepFace [Taigman et al., 2014]
References
[Barron, 1993] Barron, A. R. (1993).
Universal approximation bounds for superpositions of a sigmoidal function.
Information Theory, IEEE Transactions on, 39(3):930–945.
[Bengio, 2009] Bengio, Y. (2009).
Learning deep architectures for AI.
Found. Trends Mach. Learn., 2(1):1–127.
[Bengio and Delalleau, 2011] Bengio, Y. and Delalleau, O. (2011).
On the expressive power of deep architectures.
In Proceedings of the 22nd International Conference on Algorithmic Learning Theory, ALT’11, pages 18–36, Berlin, Heidelberg. Springer-Verlag.
[Cybenko, 1989] Cybenko, G. (1989).
Approximation by superpositions of a sigmoidal function.
Mathematics of control, signals and systems, 2(4):303–314.
[Glorot and Bengio, 2010] Glorot, X. and Bengio, Y. (2010).
Understanding the difficulty of training deep feedforward neural networks.
In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS’10).
[Goodfellow et al., 2016] Goodfellow, I., Bengio, Y., and Courville, A. (2016).
Deep Learning.
MIT Press.
http://www.deeplearningbook.org.
[Gu et al., 2015] Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., Liu, T., Wang, X., and Wang, G. (2015).
Recent advances in convolutional neural networks.
CoRR, abs/1512.07108.
[Hastad, 1989] Hastad, J. (1989).
Almost optimal lower bounds for small depth circuits.
In Randomness and Computation, pages 6–20. JAI Press.
[He et al., 2015] He, K., Zhang, X., Ren, S., and Sun, J. (2015).
Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification.
In International Conference on Computer Vision (ICCV).
[Hinton et al., 2006] Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006).
A fast learning algorithm for deep belief nets.
Neural Comput., 18(7):1527–1554.
[Hinton et al., 2012] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012).
Improving neural networks by preventing co-adaptation of feature detectors.
CoRR, abs/1207.0580.
[Ioffe and Szegedy, 2015] Ioffe, S. and Szegedy, C. (2015).
Batch normalization: Accelerating deep network training by reducing internal covariate shift.
In Bach, F. R. and Blei, D. M., editors, ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 448–456. JMLR.org.
[Kanazawa et al., 2014] Kanazawa, A., Sharma, A., and Jacobs, D. W. (2014).
Locally scale-invariant convolutional neural networks.
In Deep Learning and Representation Learning Workshop: NIPS 2014.
[Krizhevsky et al., 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012).
ImageNet classification with deep convolutional neural networks.
In Advances in neural information processing systems, pages 1097–1105.
[Marcos et al., 2017] Marcos, D., Volpi, M., Komodakis, N., and Tuia, D. (2017).
Rotation equivariant vector field networks.
In The IEEE International Conference on Computer Vision (ICCV).
[Paulovich et al., 2008] Paulovich, F. V., Nonato, L. G., Minghim, R., and Levkowitz, H. (2008).
Least square projection: A fast high-precision multidimensional projection technique and its application to document mapping.
IEEE Trans. Vis. Comput. Graph., 14(3):564–575.
[Taigman et al., 2014] Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014).
DeepFace: Closing the gap to human-level performance in face verification.
In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[van der Maaten and Hinton, 2008] van der Maaten, L. and Hinton, G. E. (2008).
Visualizing high-dimensional data using t-SNE.
Journal of Machine Learning Research, 9:2579–2605.