Learning High-Level Abstractions

(1)

On the Challenge of Learning Complex Functions

Yoshua Bengio

May 9th 2006

Thanks to: Pascal Lamblin, François Rivest, Olivier Delalleau, Nicolas Le Roux, Hugo Larochelle

(2)

Learning High-Level Abstractions

High-level abstraction = a highly-varying, complex, but structured function

(4)

Learning High-Level Abstractions

e.g. semantic concept of “chair” is a higher-level abstraction than a recognizer of a particular instance of chair, which is a higher-level abstraction than a V1 oriented edge-detector.

Such high-level abstraction includes a very large set of possible retinal images that can be very different from each other in terms of raw sensory patterns (“Euclidean distance”).

(5)

Recently Hot Topics at NIPS

1 “Kernel machines” (best known = Support Vector Machines = SVMs)

2 Knowledge-rich probabilistic graphical models :

(7)

Recently Hot Topics at NIPS

1 “Kernel machines” (best known = Support Vector Machines = SVMs)

Modern versions of RBF networks, mathematically enhanced linear combinations of grandmother cell outputs

Fun maths & computer science: mathematical guarantees of optimality due to convexity

Plenty of new extensions in many areas of machine learning

2 Knowledge-rich probabilistic graphical models :

Algorithms for inference in probabilistic logic

Fun maths & computer science: many different complicated models can be devised

Provides interpretable models, may help humans to understand the data

(8)

What’s Wrong ?

1 “Kernel machines” :

2 knowledge-rich probabilistic graphical models :

(10)

What’s Wrong ?

1 “Kernel machines” :

Work well if “similarity” is defined smartly, i.e. with a good representation to start with

Math guarantees only apply if the similarity function is fixed; we don’t know good ways to learn it

Curse of dimensionality (our recent work...) = can’t learn highly-varying functions, i.e. high-level abstractions

2 Knowledge-rich probabilistic graphical models :

Require pre-specifying ALL the concepts of interest : don’t know how to really learn the semantics

Require specifying a prior distribution on the data through priors on all the relevant variables and their interactions

Exact inference algorithms are intractable : use approximations or small models

(11)

Local Kernels

Kernel machines are 2-layer architectures whose first layer is “fixed” by taking the examples x_i as prototypes, learning α_i

(13)

Local Kernels

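The first layer is “fixed” by taking the examples x_i as prototypes; the α_i are learned :

$f(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i)$

[Figure: 2-layer architecture. The input x feeds the units K(x, x_1), ..., K(x, x_i), ..., K(x, x_n), whose outputs are weighted by α_1, ..., α_i, ..., α_n and summed to give f(x).]

K(x, x_i) = grandmother cell

Local means K(x, x_i) → 0 as ||x − x_i|| increases.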
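A minimal sketch of such a 2-layer kernel machine may help make the locality property concrete. It assumes a Gaussian kernel with a hypothetical width `sigma`, and the prototypes and weights below are made-up toy values rather than learned ones.

```python
import math

def gaussian_kernel(x, xi, sigma=1.0):
    """Local kernel: tends to 0 as the distance ||x - xi|| grows."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_machine(x, prototypes, alphas, sigma=1.0):
    """f(x) = sum_i alpha_i * K(x, x_i), with the examples x_i as prototypes."""
    return sum(a * gaussian_kernel(x, xi, sigma)
               for a, xi in zip(alphas, prototypes))

# Hypothetical toy setup: two prototypes with opposite-sign weights.
prototypes = [(0.0, 0.0), (3.0, 0.0)]
alphas = [1.0, -1.0]

near_first = kernel_machine((0.1, 0.0), prototypes, alphas)   # dominated by alpha_1
far_away = kernel_machine((50.0, 50.0), prototypes, alphas)   # every kernel is ~0
print(near_first, far_away)
```

Far from every prototype, all first-layer units are silent, so no α_i has any influence on f(x) there.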

(14)

Local Learning Algorithms

A learned parameter of the model influences the value of the learned function in a local area of the input domain.

(16)

Local Learning Algorithms

With a local kernel machine

$f(x) = \sum_i \alpha_i K(x, x_i)$,

α_i only influences f(x) for x near x_i.

Examples :

nearest-neighbor algorithms

local kernel machines

most non-parametric models except multi-layer neural networks

(18)

Mathematical Problem with Local Learning

Theorem

With K the Gaussian kernel and f(·) changing sign at least 2k times along some straight line (i.e. that line crosses the decision surface at least 2k times), then at least k examples are required.

[Figure: decision surface between Class −1 and Class 1]

With local kernels, learning a function that has many “bumps” requires as many examples as bumps.

(19)

The Curse of Dimensionality

Mathematical problem with classical non-parametric models

(21)

The Curse of Dimensionality

May need to have examples for each probable combination of the variables of interest.

OK for 2 or 3 variables (e.g. V1 cells),

⇒ NOT OK for more abstract concepts...
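As a rough worked count, suppose (purely as an illustrative assumption) that each variable takes 10 distinguishable values. The number of combinations that might each need an example then grows as 10^d with the number of variables d:

```python
def num_combinations(values_per_variable, num_variables):
    """Number of distinct joint configurations of the variables."""
    return values_per_variable ** num_variables

# Hypothetical resolution of 10 values per variable.
for d in (2, 3, 10, 100):
    print(d, num_combinations(10, d))
```

With 2 or 3 variables the count stays in the hundreds or thousands; with 100 variables it far exceeds any realistic number of training examples.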

(22)

Mathematical Problem with Local Kernels

Theorem

With K the Gaussian kernel, and the goal to learn a maximally changing binary function ($f(x) \neq f(x')$ when $|x - x'| = 1$) with d inputs, then at least $2^{d-1}$ examples are required.

(24)

Mathematical Problem with Local Kernels

Theorem

⇒ need to cover the space of possibilities with examples

⇒ may require a number of examples exponential in the number of inputs

= strongly negative mathematical results on local kernel machines

Other similar results in (Bengio, Delalleau, Le Roux, NIPS’2005)

(25)

Is There Hope ?

What kind of architectures would allow representing high-level abstractions without explicitly enumerating all the variations ?

(26)

Deep Networks

Some mathematical functions can be represented very efficiently with a deep network, but require many more computational elements with a 1-layer or 2-layer network.

(28)

Deep Networks

e.g. d-bit parity :

• 1 adaptive layer (SVM) : 2^d units and parameters required

• 2 adaptive layers (neural net) : d units, d² parameters

• d-layer net : 2d units, 5d parameters

• recurrent net : 2 units, 5 parameters, d steps
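The contrast between a deep (or recurrent) computation and a shallow enumeration can be sketched in plain Python. This is not a neural network and does not reproduce the unit counts listed above; it only illustrates, under that simplification, that parity needs a single bit of state when computed in d sequential steps, whereas a lookup-style solution has to cover 2^(d−1) positive patterns.

```python
from itertools import product

def parity_recurrent(bits):
    """d-bit parity with one bit of state, updated once per input bit (d steps)."""
    state = 0
    for b in bits:
        state ^= b  # XOR the running state with the next input bit
    return state

def count_odd_patterns(d):
    """Number of d-bit inputs with parity 1, which an enumeration must cover."""
    return sum(parity_recurrent(bits) for bits in product((0, 1), repeat=d))

for d in (2, 4, 8):
    print(d, count_odd_patterns(d))
```

Flipping any single input bit flips the output, which is what makes parity a maximally varying target for local methods.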

(29)

Mathematical Problem with Gradient-Based Learning of Deep Networks

Theorem

In non-linear dynamical systems that can latch information for long durations, gradients capturing long-term dependencies vanish exponentially with duration.

(31)

Mathematical Problem with Gradient-Based Learning of Deep Networks

Theorem

This also applies to deep neural networks (since recurrent neural networks are equivalent, when unfolded, to a very deep network).

Basic mathematical problem : the gradients become smaller and more diffuse as they are back-propagated.

Also, not clear how to back-propagate gradients accurately through a very deep or recurrent network in the brain.
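A one-unit toy system gives a feel for the effect. The sketch below assumes a scalar recurrence h_t = tanh(w · h_{t−1}) with a hypothetical weight w = 2, which latches onto a stable nonzero state; the gradient of the final state with respect to the initial state is the product of the per-step derivatives, and it shrinks rapidly with the number of steps.

```python
import math

def gradient_through_time(w, h0, steps):
    """Return |d h_T / d h_0| for h_t = tanh(w * h_{t-1}), via the chain rule."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)  # derivative of tanh(w * h_prev) w.r.t. h_prev
    return abs(grad)

# Hypothetical settings: w = 2.0 makes the unit latch near h = 0.96.
grads = [gradient_through_time(2.0, 0.5, T) for T in (1, 5, 10, 20)]
print(grads)
```

Unfolding the same recurrence over T steps is equivalent to a T-layer network with tied weights, which is why the same shrinkage shows up in deep feed-forward networks.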

(33)

Greedy Learning of Abstractions

Greedily learning simple things first, then higher-level abstractions on top of lower-level ones, seems like a possible good strategy and is psychologically plausible.

Coherent with the psychological literature, starting with Piaget 1952.

We learn baby math before arithmetic before algebra before differential equations...

Also evidence from neurobiology : (Guillery 2005) “Is postnatal neocortical maturation hierarchical ?”.

And several successful machine learning algorithms are constructive, e.g. boosting (Freund & Schapire 1996) adds one group of units (weak learner) at a time (but all on the same layer).

(34)

Deep Belief Networks

Geoff Hinton just introduced a deep network model that provides more evidence that this direction is worthwhile :

(35)

Deep Belief Networks

unsupervised learning of each layer, each trying to model distribution of its inputs

(36)

Deep Belief Networks

Hebbian-like local update rules

(38)

Deep Belief Networks

the whole network can be refined with respect to a supervised target if gradients can be propagated

beating state-of-the-art statistical learning in preliminary experiments on a large benchmark task (MNIST)

(39)

Greedy Layer-wise Learning

Supervised greedy layer-wise learning : each added layer, taking as input the output of the previous layer, is trained as the hidden layer of a supervised 1-hidden-layer net. Throw away the output weights once the layer is trained.

[Figure: stacked layers x, h₁, h₂, with target = y]

(40)

Unsupervised Learning Guides the Optimization

Simulation results : supervised greedy layer-wise learning does not work as well as unsupervised greedy layer-wise learning (Deep Belief Nets, or a network of auto-encoders).

[Figure: stacked layers x, h₁, h₂, with target = h₁]

(42)

Unsupervised Learning Guides the Optimization

Results : lower TRAINING AND TEST error with the UNSUPERVISED variant !

⇒ suggests unsupervised learning GUIDES the optimization

[Figure: stacked layers x, h₁, h₂, with target = h₁]

(43)

Multiple Modalities Help Each Other

Hypothesis :

multiple modalities can guide each other during learning.

My particular slant : it helps the OPTIMIZATION process ! (not only a regularizer)

(44)

Multiple Modalities Help Each Other

Hypothesis :

Rationale : shared high-level semantics = same world behind favors learning of internal representations capturing underlying structure of the world.

(46)

Multiple Modalities Help Each Other

Hypothesis :

Contradicts the machine learning folklore that more input variables imply a more difficult learning problem.

Supporting evidence in the machine learning literature :

- Coherence criterion (IMAX) from (Becker & Hinton 1992)

- co-training (Blum & Mitchell 1998)

(47)

Probabilistic Surprise as Reinforcement

Reinforcement learning with novelty as reinforcement

(48)

Probabilistic Surprise as Reinforcement

= active unsupervised learning

⇒ potential for much faster learning

(50)

Probabilistic Surprise as Reinforcement

= active unsupervised learning

Psychological evidence : infant habituation, e.g. (Fantz 1964), (Sirois & Mareschal 2004)

Neurobiological evidence : connection between novelty and dopamine, e.g. (Lisman & Grace 2005).

Hypothesis : it is not a low probability event that should be rewarded but rather one that induces changes in the model (i.e. the predicted probabilities were wrong).

e.g. TV white-noise is unpredictable (low prob.) but not really surprising.

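$\text{reinforcement} \neq \log \mathrm{Prob}(\text{input} \mid \theta)$

$\text{reinforcement} = \left\| \frac{\partial \log \mathrm{Prob}(\text{input} \mid \theta)}{\partial \theta} \right\|$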

(51)

Semi-Supervised Learning

Semi-supervised learning : “labeled” + “unlabeled” examples

e.g. image + speech naming objects in it vs image alone

(53)

Semi-Supervised Learning

The unsupervised component in our learning algorithm allows taking advantage of a much larger quantity of unlabeled data.

Seems necessary both for AI and biologically motivated learning algorithms.

Semi-supervised algorithms often generalize better than purely supervised ones that are trained only on the labeled data.

= another motivation for unsupervised component

(54)

Predictive Models : Supervised = Unsupervised

With temporal data, unsupervised modeling of the sequence x_1, x_2, ..., x_t, ... is equivalent to predictive (supervised) modeling of x_t | x_{t−1}, x_{t−2}, ... :

$P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_1)$
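The factorization can be checked numerically on a small example. The sketch below assumes, as a simplification not made in the slides, a first-order Markov model over binary symbols, so each predictive term depends only on the previous symbol; the probability tables are hypothetical.

```python
import math
from itertools import product

# Hypothetical first-order Markov model over the symbols {0, 1}.
INITIAL = {0: 0.6, 1: 0.4}               # P(x_1)
TRANSITION = {0: {0: 0.9, 1: 0.1},       # P(x_t | x_{t-1})
              1: {0: 0.3, 1: 0.7}}

def sequence_log_prob(seq):
    """log P(x_1..x_T) as a sum of one-step predictive log-probabilities."""
    log_p = math.log(INITIAL[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        log_p += math.log(TRANSITION[prev][cur])
    return log_p

# Summing the joint over all length-4 sequences should give 1 (up to rounding),
# confirming that the product of predictive terms defines a valid joint model.
total = sum(math.exp(sequence_log_prob(s)) for s in product((0, 1), repeat=4))
print(total)
```

Each term of the sum is a supervised prediction problem (predict the next symbol), yet together they define an unsupervised model of whole sequences.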

(56)

Predictive Models : Supervised = Unsupervised

$P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_1)$

Ongoing research : combine static unsupervised algorithms such as Deep Belief Networks with predictive supervised online learning.

Let z_t be the learned internal representations at time t.

Hypothesis : using the error signal from the prediction of z_t | z_{t−1} can help improve/guide training of the static unsupervised model of the x_t’s.

(57)

Conclusions

Fundamental mathematical limitations of kernel machines : curse of dimensionality

(60)

Conclusions

Much needed : algorithms for learning in deep networks

Supervised gradient descent also limited in deep networks

Layer-wise unsupervised greedy learning offers hope to learn high-level abstractions

(61)

Conclusions

The local unsupervised learning somehow appears to guide the optimization.

(63)

Conclusions

May be combined with actions, with probabilistic novelty as reinforcement signal

May be improved by combining multiple modalities and temporal dependencies

(64)

The Team

Yoshua Bengio, Pascal Lamblin, François Rivest,

Olivier Delalleau, Nicolas Le Roux, Hugo Larochelle