
(1)

On the Challenge of Learning Complex Functions

Yoshua Bengio

May 9th 2006

Thanks to: Pascal Lamblin, François Rivest, Olivier Delalleau, Nicolas Le Roux, Hugo Larochelle

(4)

Learning High-Level Abstractions

High-level abstraction = a highly-varying, complex, but structured function.

e.g. the semantic concept of “chair” is a higher-level abstraction than a recognizer of a particular instance of a chair, which is itself a higher-level abstraction than a V1 oriented edge-detector.

Such a high-level abstraction covers a very large set of possible retinal images that can be very different from each other in terms of raw sensory patterns (“Euclidean distance”).

(7)

Recently Hot Topics at NIPS

1. “Kernel machines” (best known: Support Vector Machines, SVMs)

• Modern versions of RBF networks: mathematically enhanced linear combinations of grandmother-cell outputs
• Fun maths & computer science: mathematical guarantees of optimality due to convexity
• Plenty of new extensions in many areas of machine learning

2. Knowledge-rich probabilistic graphical models

• Algorithms for inference in probabilistic logic
• Fun maths & computer science: many different complicated models can be devised
• Provide interpretable models, which may help humans to understand the data

(10)

What’s Wrong?

1. “Kernel machines”:

• work well if “similarity” is defined smartly, i.e. with a good representation to start with
• the mathematical guarantees only apply if the similarity function is fixed, and we don’t know good ways to learn it
• curse of dimensionality (our recent work...) = can’t learn highly-varying functions, i.e. high-level abstractions

2. Knowledge-rich probabilistic graphical models:

• require pre-specifying ALL the concepts of interest: we don’t know how to really learn the semantics
• require specifying a prior distribution on the data through priors on all the relevant variables and their interactions
• exact inference algorithms are intractable: use approximations or small models

(13)

Local Kernels

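Kernel machines are 2-layer architectures whose first layer is “fixed” by taking the examples $x_i$ as prototypes; only the weights $\alpha_i$ are learned:

$$f(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i)$$

[Figure: the input $x$ feeds the units $K(x, x_1), \ldots, K(x, x_i), \ldots, K(x, x_n)$, whose outputs are weighted by $\alpha_1, \ldots, \alpha_i, \ldots, \alpha_n$ and summed to give $f(x)$.]

$K(x, x_i)$ = grandmother cell

Local means $K(x, x_i) \to 0$ as $\|x - x_i\|$ increases.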
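A minimal NumPy sketch of such a kernel machine, assuming a Gaussian kernel with a width parameter `sigma`. The function names, the toy prototypes and the weights are illustrative choices, not taken from the slides. It evaluates f(x) near a prototype and far from all of them, to show the locality property.

```python
import numpy as np

def gaussian_kernel(x, xi, sigma=1.0):
    """K(x, xi) = exp(-||x - xi||^2 / (2 sigma^2)); sigma is an assumed width."""
    diff = np.asarray(x, dtype=float) - np.asarray(xi, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

def kernel_machine(x, prototypes, alphas, sigma=1.0):
    """f(x) = sum_i alpha_i K(x, x_i), with the training examples as prototypes."""
    return sum(a * gaussian_kernel(x, xi, sigma) for a, xi in zip(alphas, prototypes))

prototypes = [[0.0, 0.0], [1.0, 1.0]]   # toy examples x_i
alphas = [1.0, -1.0]                     # toy learned weights alpha_i

near = kernel_machine([0.0, 0.1], prototypes, alphas)    # close to the first prototype
far = kernel_machine([10.0, 10.0], prototypes, alphas)   # far from every prototype
print(near, far)
```

Far from every prototype each "grandmother cell" is essentially silent, so the output is close to zero whatever the weights are.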

(16)

Local Learning Algorithms

A learned parameter of the model influences the value of the learned function in a local area of the input domain.

With a local kernel machine

$$f(x) = \sum_i \alpha_i K(x, x_i),$$

$\alpha_i$ only influences $f(x)$ for $x$ near $x_i$.

Examples:

• nearest-neighbor algorithms
• local kernel machines
• most non-parametric models except multi-layer neural networks

(18)

Mathematical Problem with Local Learning

Theorem

With $K$ the Gaussian kernel, if $f(\cdot)$ changes sign at least $2k$ times along some straight line (i.e. that line crosses the decision surface at least $2k$ times), then at least $k$ examples are required.

[Figure: a decision surface separating Class −1 from Class 1.]

With local kernels, learning a function that has many “bumps” requires as many examples as bumps.
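The statement can be probed numerically. The sketch below is an illustration, not a proof, and rests on several assumptions of its own: the kernel machine is restricted to a line, has no bias term, and uses a common width `sigma`, with random centers and weights. It records the largest number of sign changes observed for a given number of examples.

```python
import numpy as np

def count_sign_changes(values):
    """Number of sign flips between consecutive non-zero entries."""
    signs = np.sign(values)
    signs = signs[signs != 0]
    return int(np.sum(signs[1:] != signs[:-1]))

def random_kernel_machine_on_line(n_examples, rng, sigma=0.5):
    """Sample f(t) = sum_i alpha_i exp(-(t - t_i)^2 / (2 sigma^2)) on a 1-D grid."""
    centers = rng.uniform(-3.0, 3.0, size=n_examples)   # examples placed on the line
    alphas = rng.normal(size=n_examples)                 # arbitrary weights
    grid = np.linspace(-5.0, 5.0, 4001)
    k = np.exp(-(grid[:, None] - centers[None, :]) ** 2 / (2.0 * sigma ** 2))
    return k @ alphas

rng = np.random.default_rng(0)
# Largest number of sign changes seen over 200 random draws, per number of examples.
max_changes = {
    n: max(count_sign_changes(random_kernel_machine_on_line(n, rng)) for _ in range(200))
    for n in (1, 2, 4, 8)
}
print(max_changes)
```

In this setting the number of observed sign changes stays below twice the number of examples, which is what the theorem leads one to expect: many crossings of the decision surface cannot be produced by few Gaussian bumps.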

(21)

The Curse of Dimensionality

Mathematical problem with classical non-parametric models

May need to have examples for each probable combination of the variables of interest.

OK for 2 or 3 variables (e.g. V1 cells),

⇒ NOT OK for more abstract concepts...

(24)

Mathematical Problem with Local Kernels

Theorem

With $K$ the Gaussian kernel, and the goal of learning a maximally changing binary function ($f(x) \neq f(x')$ when $\|x - x'\| = 1$) with $d$ inputs, at least $2^{d-1}$ examples are required.

⇒ need to cover the space of possibilities with examples

⇒ may require a number of examples exponential in the number of inputs

= strongly negative mathematical results on local kernel machines

Other similar results in (Bengio, Delalleau, Le Roux, NIPS’2005)
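To make "maximally changing" concrete: assuming binary inputs, d-bit parity is one such function, since flipping any single bit flips the output. The short sketch below checks that property by brute force and tabulates the $2^{d-1}$ bound from the theorem. The helper names are illustrative.

```python
from itertools import product

def parity(bits):
    """Return 1 if the number of 1-bits is odd, else 0."""
    return sum(bits) % 2

def is_maximally_changing(f, d):
    """Check that flipping any single input bit always changes f."""
    for bits in product((0, 1), repeat=d):
        for i in range(d):
            flipped = bits[:i] + (1 - bits[i],) + bits[i + 1:]
            if f(bits) == f(flipped):
                return False
    return True

def lower_bound_examples(d):
    """The 2^(d-1) lower bound stated in the theorem."""
    return 2 ** (d - 1)

for d in (2, 4, 8, 16):
    # columns: number of inputs, number of input configurations, lower bound
    print(d, 2 ** d, lower_bound_examples(d))
```

The bound is half the number of input configurations, so the required number of examples doubles with every added input.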

(25)

Is There Hope?

What kind of architectures would allow us to represent high-level abstractions without explicitly enumerating all the variations?

(28)

Deep Networks

Some mathematical functions can be represented very efficiently with a deep network, but require many more computational elements with a 1-layer or 2-layer network.

e.g. d-bit parity:

• 1 adaptive layer (SVM): $2^d$ units and parameters required
• 2 adaptive layers (neural net): $d$ units, $d^2$ parameters
• d-layer net: $2d$ units, $5d$ parameters
• recurrent net: 2 units, 5 parameters, $d$ steps

[Figure: diagram of a multi-layer network.]
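The recurrent case can be sketched in a few lines. This is one possible construction with two threshold units applied once per input bit. It is an assumption for illustration, not necessarily the construction behind the counts on the slide, and its parameter count depends on how weights and biases are tallied.

```python
from itertools import product

def step(z):
    """Threshold unit: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0

def recurrent_parity(bits):
    """Run d steps of a 2-unit recurrence whose state holds the parity so far."""
    state = 0
    for x in bits:
        both = step(state + x - 1.5)              # unit 1: fires only if state AND x
        state = step(state + x - 2 * both - 0.5)  # unit 2: state XOR x
    return state

# Exhaustive check against the direct definition for all 6-bit inputs.
all_ok = all(recurrent_parity(b) == sum(b) % 2 for b in product((0, 1), repeat=6))
print(all_ok)
```

The same two units are reused at every step, so depth in time replaces the exponentially many units a shallow architecture would need.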

(31)

Mathematical Problem with Gradient-Based Learning of Deep Networks

Theorem

In non-linear dynamical systems that can latch information for long durations, gradients capturing long-term dependencies vanish exponentially with duration.

This also applies to deep neural networks (since recurrent neural networks are equivalent, when unfolded, to a very deep network).

Basic mathematical problem: the gradients become smaller and more diffuse as they are back-propagated.

Also, it is not clear how gradients could be back-propagated accurately through a very deep or recurrent network in the brain.
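A toy illustration of the theorem, assuming a single recurrent unit $h_t = \tanh(w\,h_{t-1})$ with $w = 2$ (a hypothetical choice large enough for the unit to latch onto one of two stable states). The sketch tracks the state and the derivative of the final state with respect to the initial one.

```python
import numpy as np

def latch_and_gradient(h0, w=2.0, steps=50):
    """Iterate h_t = tanh(w * h_{t-1}); return final state and d h_T / d h_0."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = np.tanh(w * h)
        grad *= w * (1.0 - h ** 2)   # chain rule: d tanh(w h)/d h = w (1 - tanh^2)
    return float(h), float(grad)

for steps in (1, 5, 10, 20, 50):
    state, grad = latch_and_gradient(0.1, steps=steps)
    print(steps, round(state, 4), grad)
```

The unit remembers the sign of its initial state indefinitely, yet the gradient linking the final state to the initial one shrinks geometrically with the number of steps: reliable latching and a usable long-range gradient pull in opposite directions.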

(33)

Greedy Learning of Abstractions

Greedily learning simple things first, then higher-level abstractions on top of lower-level ones, seems like a good possible strategy and is psychologically plausible.

Consistent with the psychological literature starting with Piaget 1952.

We learn baby math before arithmetic before algebra before differential equations...

Also evidence from neurobiology: (Guillery 2005) “Is postnatal neocortical maturation hierarchical?”.

And several successful machine learning algorithms are constructive, e.g. boosting (Freund & Schapire 1996) adds one group of units (weak learner) at a time (but all on the same layer).

(38)

Deep Belief Networks

Geoff Hinton just introduced a deep network model that provides more evidence that this direction is worthwhile:

• unsupervised learning of each layer, each trying to model the distribution of its inputs
• Hebbian-like local update rules
• the whole network can be refined with respect to a supervised target if gradients can be propagated
• beats state-of-the-art statistical learning in preliminary experiments on a large benchmark task (MNIST)

(39)

Greedy Layer-wise Learning

Supervised greedy layer-wise learning: each added layer takes as input the output of the previous layer and is trained as the hidden layer of a supervised 1-hidden-layer net. The output weights are thrown away once the layer is trained.

[Figure: stacked layers x → h1 → h2, with target = y.]
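The procedure can be sketched in NumPy as follows. Everything specific here is an assumption made for illustration: tanh hidden units, a single sigmoid output trained with cross-entropy, plain batch gradient descent, and the toy data, step count and learning rate.

```python
import numpy as np

def train_one_hidden_layer(inputs, targets, n_hidden, rng, steps=200, lr=0.5):
    """Train a 1-hidden-layer net (tanh hidden, sigmoid output) by gradient descent."""
    n, d = inputs.shape
    W = rng.normal(0.0, 0.5, size=(d, n_hidden))   # hidden weights (kept)
    b = np.zeros(n_hidden)
    v = rng.normal(0.0, 0.5, size=n_hidden)        # output weights (thrown away later)
    c = 0.0
    for _ in range(steps):
        h = np.tanh(inputs @ W + b)
        p = 1.0 / (1.0 + np.exp(-(h @ v + c)))
        err = (p - targets) / n                    # gradient of mean cross-entropy at the output
        grad_h = np.outer(err, v) * (1.0 - h ** 2) # back-propagated to the hidden pre-activations
        v -= lr * (h.T @ err)
        c -= lr * err.sum()
        W -= lr * (inputs.T @ grad_h)
        b -= lr * grad_h.sum(axis=0)
    return W, b                                    # output weights v, c are discarded

def greedy_supervised_stack(x, y, layer_sizes, seed=0):
    """Add layers one at a time; each is trained on the previous layer's output."""
    rng = np.random.default_rng(seed)
    layers, rep = [], x
    for n_hidden in layer_sizes:
        W, b = train_one_hidden_layer(rep, y, n_hidden, rng)
        layers.append((W, b))
        rep = np.tanh(rep @ W + b)                 # becomes the input of the next layer
    return layers, rep

data_rng = np.random.default_rng(1)
x = data_rng.normal(size=(20, 4))                  # toy inputs
y = (x[:, 0] > 0).astype(float)                    # toy binary target
layers, top_representation = greedy_supervised_stack(x, y, [5, 3])
```

Each call to `train_one_hidden_layer` sees only the representation produced by the layers already in place, which is what makes the procedure greedy.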

(42)

Unsupervised Learning Guides the Optimization

Simulation results: the supervised greedy variant does not work as well as unsupervised greedy layer-wise learning (Deep Belief Nets, or a network of auto-encoders).

Results: lower TRAINING AND TEST error with the UNSUPERVISED variant!

⇒ suggests unsupervised learning GUIDES the optimization

[Figure: stacked layers x → h1 → h2, with target = h1.]

(46)

Multiple Modalities Help Each Other

Hypothesis:

multiple modalities can guide each other during learning.

Rationale: shared high-level semantics (the same world lies behind every modality) favors learning internal representations that capture the underlying structure of the world.

Contradicts the machine learning folklore that more input variables imply a more difficult learning problem.

Supporting evidence in the machine learning literature:
- coherence criterion (IMAX) from (Becker & Hinton 1992)
- co-training (Blum & Mitchell 1998)

My particular slant: it helps the OPTIMIZATION process! (not only a regularizer)

(50)

Probabilistic Surprise as Reinforcement

Reinforcement learning with novelty as reinforcement = active unsupervised learning

⇒ potential for much faster learning

Psychological evidence: infant habituation, e.g. (Fantz 1964), (Sirois & Mareschal 2004)

Neurobiological evidence: connection between novelty and dopamine, e.g. (Lisman & Grace 2005).

Hypothesis: it is not a low-probability event that should be rewarded but rather one that induces changes in the model (i.e. the predicted probabilities were wrong).

e.g. TV white noise is unpredictable (low prob.) but not really surprising.

$$\text{reinforcement} \neq \log \mathrm{Prob}(\text{input} \mid \theta)$$

$$\text{reinforcement} = \left\| \frac{\partial \log \mathrm{Prob}(\text{input} \mid \theta)}{\partial \theta} \right\|$$

(53)

Semi-Supervised Learning

Semi-supervised learning: “labeled” + “unlabeled” examples, e.g. image + speech naming the objects in it vs. image alone

The unsupervised component in our learning algorithm allows taking advantage of a much larger quantity of unlabeled data.

Seems necessary both for AI and for biologically motivated learning algorithms.

Semi-supervised algorithms often generalize better than purely supervised ones that are trained only on the labeled data.

= another motivation for the unsupervised component

(56)

Predictive Models: Supervised = Unsupervised

With temporal data, unsupervised modeling of the sequence $x_1, x_2, \ldots, x_t, \ldots$ is equivalent to predictive (supervised) modeling of $x_t \mid x_{t-1}, x_{t-2}, \ldots$:

$$P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_1)$$

Ongoing research: combine static unsupervised algorithms such as Deep Belief Networks with predictive supervised online learning.

Let $z_t$ be the learned internal representation at time $t$.

Hypothesis: using the error signal from the prediction of $z_t \mid z_{t-1}$ can help improve/guide the training of a static unsupervised model of the $x_t$'s.
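The factorization of the joint into one-step predictions can be checked on a toy model. The sketch below assumes a first-order Markov chain over two symbols, so each conditional depends only on the previous symbol, which is a simplification of the general product. The probabilities are made-up values.

```python
import numpy as np
from itertools import product

initial = np.array([0.6, 0.4])            # P(x_1), toy values
transition = np.array([[0.9, 0.1],        # P(x_t | x_{t-1}), each row sums to 1
                       [0.3, 0.7]])

def sequence_log_prob(seq):
    """log P(x_1..x_T) as a sum of log one-step predictive probabilities."""
    log_p = np.log(initial[seq[0]])
    for prev, cur in zip(seq[:-1], seq[1:]):
        log_p += np.log(transition[prev, cur])
    return float(log_p)

# Summing the joint over every length-4 sequence should give 1.
total = sum(np.exp(sequence_log_prob(s)) for s in product((0, 1), repeat=4))
print(total)
```

Maximizing the sum of log predictive probabilities over a sequence is therefore the same as maximizing the log-likelihood of the whole sequence, which is the sense in which the supervised and unsupervised views coincide.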

(63)

Conclusions

• Fundamental mathematical limitations of kernel machines: curse of dimensionality
• Much needed: algorithms for learning in deep networks
• Supervised gradient descent also limited in deep networks
• Layer-wise unsupervised greedy learning offers hope to learn high-level abstractions
• The local unsupervised learning somehow appears to guide the optimization
• May be combined with actions, with probabilistic novelty as reinforcement signal
• May be improved by combining multiple modalities and temporal dependencies

(64)

The Team

Yoshua Bengio, Pascal Lamblin, François Rivest, Olivier Delalleau, Nicolas Le Roux, Hugo Larochelle
