On the Challenge of Learning Complex Functions
Yoshua Bengio
May 9th 2006
Thanks to: Pascal Lamblin, François Rivest, Olivier Delalleau, Nicolas Le Roux, Hugo Larochelle
Learning High-Level Abstractions
High-level abstraction = a highly-varying, complex, but structured function
e.g. semantic concept of “chair” is a higher-level abstraction than a recognizer of a particular instance of chair, which is a higher-level abstraction than a V1 oriented edge-detector.
Such high-level abstraction includes a very large set of possible retinal images that can be very different from each other in terms of raw sensory patterns (“Euclidean distance”).
Recently Hot Topics at NIPS
1 “Kernel machines” (best known = Support Vector Machines = SVMs)
• Modern versions of RBF networks, mathematically enhanced linear combinations of grandmother cell outputs
• Fun maths & computer science: mathematical guarantees of optimality due to convexity
• Plenty of new extensions in many areas of machine learning
2 Knowledge-rich probabilistic graphical models :
• Algorithms for inference in probabilistic logic
• Fun maths & computer science: many different complicated models can be devised
• Provides interpretable models, may help humans to understand the data
What’s Wrong ?
1 “Kernel machines” :
• work well if “similarity” is defined smartly, i.e. with a good representation to start with
• math guarantees only apply if the similarity function is fixed; we don’t know good ways to learn it
• curse of dimensionality (our recent work...) = can’t learn highly-varying functions, i.e. high-level abstractions
2 Knowledge-rich probabilistic graphical models :
• Require pre-specifying ALL the concepts of interest : don’t know how to really learn the semantics
• Require specifying a prior distribution on the data through priors on all the relevant variables and their interactions
• Exact inference algorithms are intractable : use approximations or small models
Local Kernels
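Kernel machines are 2-layer architectures whose first layer is “fixed” by taking the examples $x_i$ as prototypes, learning $\alpha_i$ :
$$f(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i)$$
[Diagram: input $x$ feeds first-layer units $K(x,x_1), \ldots, K(x,x_i), \ldots, K(x,x_n)$, whose outputs are combined with weights $\alpha_1, \ldots, \alpha_i, \ldots, \alpha_n$ to give $f(x)$]
$K(x, x_i)$ = grandmother cell
Local means $K(x, x_i) \to 0$ as $\|x - x_i\|$ increases.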
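The two-layer form on this slide can be written out in a few lines. The sketch below is only an illustration: the function names, the bandwidth `sigma`, and the two prototypes are assumptions made for the example, not values from the talk.

```python
import math

def gaussian_kernel(x, xi, sigma=1.0):
    """Local kernel: tends to 0 as the distance between x and xi grows."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_machine(x, prototypes, alphas, sigma=1.0):
    """f(x) = sum_i alpha_i * K(x, x_i), with the training examples as prototypes."""
    return sum(a * gaussian_kernel(x, xi, sigma)
               for a, xi in zip(alphas, prototypes))

# Hypothetical toy data: two prototypes with opposite weights.
prototypes = [(0.0, 0.0), (3.0, 0.0)]
alphas = [1.0, -1.0]

near_first = kernel_machine((0.1, 0.0), prototypes, alphas)   # dominated by the first prototype
far_away = kernel_machine((30.0, 30.0), prototypes, alphas)   # every kernel has decayed to ~0
```

Far from all prototypes the output collapses toward zero, which is the sense in which each first-layer unit behaves like a grandmother cell.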
Local Learning Algorithms
A learned parameter of the model influences the value of the learned function in a local area of the input domain.
With a local kernel machine
$$f(x) = \sum_i \alpha_i K(x, x_i),$$
$\alpha_i$ only influences $f(x)$ for $x$ near $x_i$.
Examples :
• nearest-neighbor algorithms
• local kernel machines
• most non-parametric models except multi-layer neural networks
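Nearest-neighbor classification is the simplest case of a local learner listed above: the prediction at a query point is decided entirely by the closest stored example. A minimal sketch, with a hypothetical function name and made-up data:

```python
def nearest_neighbor_predict(x, examples):
    """Return the label of the stored example closest to x.

    `examples` is a list of (point, label) pairs.
    """
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    _, label = min(examples, key=lambda ex: sq_dist(x, ex[0]))
    return label

# Hypothetical toy data: one stored example per class.
examples = [((0.0, 0.0), -1), ((1.0, 1.0), 1)]
label_near_origin = nearest_neighbor_predict((0.1, 0.2), examples)
label_near_ones = nearest_neighbor_predict((0.9, 0.8), examples)
```

Changing the label of one stored example only changes predictions in the region where that example is the nearest one.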
Mathematical Problem with Local Learning
Theorem
With $K$ the Gaussian kernel and $f(\cdot)$ changing sign at least $2k$ times along some straight line (i.e. that line crosses the decision surface at least $2k$ times), then at least $k$ examples are required.
[Figure: decision surface separating Class −1 from Class 1]
With local kernels, learning a function that has many “bumps” requires as many examples as bumps.
The Curse of Dimensionality
Mathematical problem with classical non-parametric models
May need to have examples for each probable combination of the variables of interest.
OK for 2 or 3 variables (e.g. V1 cells),
⇒ NOT OK for more abstract concepts...
Mathematical Problem with Local Kernels
Theorem
With $K$ the Gaussian kernel, and the goal to learn a maximally changing binary function ($f(x) \neq f(x')$ when $\|x - x'\| = 1$) with $d$ inputs, then at least $2^{d-1}$ examples are required.
⇒ need to cover the space of possibilities with examples
⇒ may require a number of examples exponential in the number of inputs
= strongly negative mathematical results on local kernel machines
Other similar results in (Bengio, Delalleau, Le Roux, NIPS’2005)
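Parity on $d$ bits is the standard example of a maximally changing binary function. The sketch below does not prove the theorem; it only checks, by brute force, the property the theorem relies on (every neighbor at distance 1 has the opposite label) and counts the points in one class. The function names are assumptions made for the illustration.

```python
from itertools import product

def parity(bits):
    """1 if the number of set bits is odd, else 0."""
    return sum(bits) % 2

def is_maximally_changing(d):
    """Check that flipping any single bit always flips the parity label."""
    for bits in product((0, 1), repeat=d):
        for i in range(d):
            neighbor = bits[:i] + (1 - bits[i],) + bits[i + 1:]
            if parity(neighbor) == parity(bits):
                return False
    return True

def count_odd_parity(d):
    """Number of d-bit inputs labelled 1; this equals 2**(d - 1)."""
    return sum(parity(bits) for bits in product((0, 1), repeat=d))
```

With no same-class neighbor at distance 1 anywhere on the hypercube, a learner that generalizes only locally gets no help from nearby examples.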
Is There Hope ?
What kind of architectures would allow representing high-level abstractions without explicitly enumerating all the variations ?
Deep Networks
Some mathematical functions can be represented very efficiently with a deep network, but require many more computational elements with a 1-layer or 2-layer network.
e.g. $d$-bit parity :
• 1 adaptive layer (SVM) : $2^d$ units and parameters required
• 2 adaptive layers (neural net) : $d$ units, $d^2$ parameters
• $d$-layer net : $2d$ units, $5d$ parameters
• recurrent net : 2 units, 5 parameters, $d$ steps
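The contrast between the shallow and the deep (or recurrent) solution to parity can be made concrete without any neural network. The sketch below is not the network from the slide: it only contrasts a lookup table with one entry per input pattern against a sequential computation that reuses one tiny step $d$ times. Function names are assumptions for the illustration.

```python
from itertools import product

def parity_table(d):
    """Shallow view: enumerate every input pattern, 2**d entries in total."""
    return {bits: sum(bits) % 2 for bits in product((0, 1), repeat=d)}

def parity_sequential(bits):
    """Deep / recurrent view: one XOR step per bit, constant-size state."""
    state = 0
    for b in bits:
        state ^= b   # the same small computation is reused at every step
    return state

table = parity_table(4)
agree = all(parity_sequential(bits) == label for bits, label in table.items())
```

The table grows exponentially with `d`, while the sequential version keeps a fixed description and only runs for more steps.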
Mathematical Problem with Gradient-Based Learning of Deep Networks
Theorem
In non-linear dynamical systems that can latch information for long durations, gradients capturing long-term dependencies vanish exponentially with duration.
This also applies to deep neural networks (since recurrent neural networks are equivalent, when unfolded, to a very deep network).
Basic mathematical problem : the gradients become smaller and more diffuse as they are back-propagated.
Also, not clear how to back-propagate gradients accurately through a very deep or recurrent network in the brain.
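A one-unit toy system shows the effect stated in the theorem. The sketch below assumes a scalar recurrence $h_{t+1} = \tanh(w\,h_t)$ with $w = 2$, which latches near a stable fixed point where the local derivative is below 1; the weight, initial state, and number of steps are illustrative choices, not values from the talk.

```python
import math

def latch_gradients(w=2.0, h0=0.5, steps=50):
    """Track d h_t / d h_0 for h_{t+1} = tanh(w * h_t)."""
    h, grad = h0, 1.0
    grads = []
    for _ in range(steps):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)   # chain rule: d h_{t+1} / d h_t
        grads.append(grad)
    return grads

grads = latch_gradients()
first_step, tenth_step, last_step = grads[0], grads[9], grads[-1]
```

Each factor in the product is smaller than 1 once the state has latched, so the sensitivity of the final state to the initial one shrinks geometrically with the number of steps.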
Greedy Learning of Abstractions
Greedily learning simple things first, with higher-level abstractions built on top of lower-level ones, seems like a possibly good strategy and is psychologically plausible.
Coherent with psychological literature starting with Piaget 1952.
We learn baby math before arithmetic before algebra before differential equations...
Also evidence from neurobiology : (Guillery 2005) “Is postnatal neocortical maturation hierarchical ?”.
And several successful machine learning algorithms are constructive, e.g. boosting (Freund & Schapire 1996) adds one group of units (weak learner) at a time (but all on the same layer).
Deep Belief Networks
Geoff Hinton just introduced a deep network model that provides more evidence that this direction is worthwhile :
• unsupervised learning of each layer, each trying to model the distribution of its inputs
• Hebbian-like local update rules
• whole network can be refined with respect to a supervised target if gradients can be propagated
• beating state-of-the-art statistical learning in preliminary experiments on a large benchmark task (MNIST)
Greedy Layer-wise Learning
Supervised greedy layer-wise learning : each added layer takes as input the output of the previous layer and is trained as the hidden layer of a supervised 1-hidden-layer net. Throw away the output weights once the layer is trained.
[Figure: stacked layers x → h1 → h2, with target = y]
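The control flow of greedy layer-wise training is the same whether each layer is trained with or without labels: train one layer, freeze it, feed its outputs to the next. The sketch below shows only that loop. `train_layer` is a hypothetical callback standing for whatever single-layer procedure is used (a supervised 1-hidden-layer net with its output weights discarded, an auto-encoder, and so on), and `truncating_trainer` is a deliberately trivial stand-in so that the example runs.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Encoder = Callable[[Vector], Vector]
LayerTrainer = Callable[[Sequence[Vector], int], Encoder]

def greedy_layerwise(inputs: Sequence[Vector],
                     layer_sizes: Sequence[int],
                     train_layer: LayerTrainer) -> List[Encoder]:
    """Train layers one at a time, each on the outputs of the layers below."""
    encoders: List[Encoder] = []
    current = [list(v) for v in inputs]
    for size in layer_sizes:
        encoder = train_layer(current, size)      # only this layer is trained
        encoders.append(encoder)
        current = [encoder(v) for v in current]   # frozen layer feeds the next one
    return encoders

def encode(encoders: Sequence[Encoder], x: Vector) -> Vector:
    """Pass x through the stacked layers, bottom to top."""
    for encoder in encoders:
        x = encoder(x)
    return x

def truncating_trainer(data: Sequence[Vector], size: int) -> Encoder:
    """Hypothetical placeholder: ignores the data, keeps the first `size` coordinates."""
    return lambda v: list(v[:size])

stack = greedy_layerwise([[1.0, 2.0, 3.0, 4.0]], [3, 2], truncating_trainer)
top_representation = encode(stack, [1.0, 2.0, 3.0, 4.0])
```

Swapping `truncating_trainer` for a real single-layer learner is the only change needed to turn this skeleton into an actual training procedure.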
Unsupervised Learning Guides the Optimization
Simulation results : supervised greedy layer-wise learning does not work as well as unsupervised greedy layer-wise learning (Deep Belief Nets, or network of auto-encoders).
Results : lower TRAINING AND TEST error with the UNSUPERVISED variant !
⇒ suggests unsupervised learning GUIDES the optimization
[Figure: stacked layers x → h1 → h2, with target = h1]
Multiple Modalities Help Each Other
Hypothesis :
multiple modalities can guide each other during learning.
Rationale : shared high-level semantics (the same world behind every modality) favors learning internal representations that capture the underlying structure of the world.
Contradicts machine learning folklore that more input variables imply a more difficult learning problem.
Supporting evidence in machine learning literature :
• Coherence criterion (IMAX) from (Becker & Hinton 1992)
• co-training (Blum & Mitchell 1998)
My particular slant : it helps the OPTIMIZATION process ! (not only a regularizer)
Probabilistic Surprise as Reinforcement
Reinforcement learning with novelty as reinforcement = active unsupervised learning
⇒ potential for much faster learning
Psychological evidence : infant habituation, e.g. (Fantz 1964), (Sirois & Mareschal 2004)
Neurobiological evidence : connection between novelty and dopamine e.g. (Lisman & Grace 2005).
Hypothesis : it is not a low probability event that should be rewarded but rather one that induces changes in the model (i.e. the predicted probabilities were wrong).
e.g. TV white-noise is unpredictable (low prob.) but not really surprising.
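$$\text{reinforcement} \neq \log \mathrm{Prob}(\text{input} \mid \theta)$$
$$\text{reinforcement} = \left\| \frac{\partial \log \mathrm{Prob}(\text{input} \mid \theta)}{\partial \theta} \right\|$$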
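The distinction between low probability and gradient-based surprise can be seen with a one-parameter toy model. The sketch below assumes a Gaussian density whose only learned parameter $\theta$ is the mean, so the gradient of the log-probability with respect to $\theta$ is $(x - \theta)/\sigma^2$. The two scenarios and all numbers are invented for the illustration.

```python
import math

def gaussian_log_prob(x, mean, sigma):
    """log N(x; mean, sigma^2)."""
    return (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
            - (x - mean) ** 2 / (2.0 * sigma ** 2))

def gradient_norm_wrt_mean(x, mean, sigma):
    """| d log N(x; mean, sigma^2) / d mean | = |x - mean| / sigma^2."""
    return abs(x - mean) / sigma ** 2

# Broad "noise-like" model: every observation has low density, none moves the model much.
noise_log_prob = gaussian_log_prob(3.0, mean=0.0, sigma=10.0)
noise_surprise = gradient_norm_wrt_mean(3.0, mean=0.0, sigma=10.0)

# Confident model that turns out to be wrong: higher density, much larger gradient.
wrong_log_prob = gaussian_log_prob(1.0, mean=0.0, sigma=0.5)
wrong_surprise = gradient_norm_wrt_mean(1.0, mean=0.0, sigma=0.5)
```

Here the noise-like observation is the less probable of the two yet produces the smaller gradient, matching the white-noise example on the slide.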
Semi-Supervised Learning
Semi-supervised learning : “labeled” + “unlabeled” examples
e.g. image + speech naming objects in it vs image alone
Unsupervised component in our learning algorithm allows taking advantage of a much larger quantity of unlabeled data.
Seems necessary both for AI and biologically motivated learning algorithms.
Semi-supervised algorithms often generalize better than purely supervised ones that are trained only on the labeled data.
= another motivation for unsupervised component
Predictive Models : Supervised = Unsupervised
With temporal data, unsupervised modeling of the sequence $x_1, x_2, \ldots, x_t, \ldots$ is equivalent to predictive (supervised) modeling of $x_t \mid x_{t-1}, x_{t-2}, \ldots$ :
$$P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_1)$$
Ongoing research : combine static unsupervised algorithms such as Deep Belief Networks with predictive supervised online learning.
Let $z_t$ be the learned internal representation at time $t$.
Hypothesis : using the error signal from prediction of $z_t \mid z_{t-1}$ can help improve/guide training of the static unsupervised model of the $x_t$’s.
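The chain-rule factorization on this slide can be checked numerically on a small example. The sketch below builds an arbitrary joint distribution over binary sequences of length 3 (the weights are made up), derives each conditional from the joint by marginalization, and multiplies the conditionals back together.

```python
from itertools import product

# Hypothetical joint distribution over all binary sequences of length 3.
sequences = list(product((0, 1), repeat=3))
weights = [float(i + 1) for i in range(len(sequences))]
total = sum(weights)
joint = {seq: w / total for seq, w in zip(sequences, weights)}

def prefix_prob(joint, prefix):
    """P(x_1..x_t = prefix), obtained by summing out the remaining variables."""
    n = len(prefix)
    return sum(p for seq, p in joint.items() if seq[:n] == prefix)

def chain_rule_prob(joint, seq):
    """Product over t of P(x_t | x_1..x_{t-1}), each a ratio of prefix probabilities."""
    prob = 1.0
    for t in range(1, len(seq) + 1):
        prob *= prefix_prob(joint, seq[:t]) / prefix_prob(joint, seq[:t - 1])
    return prob

max_gap = max(abs(chain_rule_prob(joint, s) - joint[s]) for s in joint)
```

The product of next-step predictive distributions recovers the joint exactly, which is why a good sequence predictor is also a model of the whole sequence.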
Conclusions
Fundamental mathematical limitations of kernel machines : curse of dimensionality
Much needed : algorithms for learning in deep networks
Supervised gradient descent also limited in deep networks
Layer-wise unsupervised greedy learning offers hope to learn high-level abstractions
The local unsupervised learning somehow appears to guide the optimization.
May be combined with actions, with probabilistic novelty as reinforcement signal
May be improved by combining multiple modalities and temporal dependencies
The Team
Yoshua Bengio Pascal Lamblin François Rivest
Olivier Delalleau Nicolas Le Roux Hugo Larochelle