• Aucun résultat trouvé

Neural networks

N/A
N/A
Protected

Academic year: 2022

Partager "Neural networks"

Copied!
4
0
0

Texte intégral

(1)

Neural networks

Chapter 20, Section 5

Chapter 20, Section 5 1

Outline

♦ Brains

♦ Neural networks

♦ Perceptrons

♦ Multilayer perceptrons

♦ Applications of neural networks

Chapter 20, Section 5 2

Brains

1011neurons of >20types,1014synapses, 1ms–10ms cycle time Signals are noisy “spike trains” of electrical potential

Axon

Cell body or Soma Nucleus Dendrite

Synapses Axonal arborization Axon from another cell Synapse

Chapter 20, Section 5 3

McCulloch–Pitts “unit”

Output is a “squashed” linear function of the inputs:

ai←g(ini) =g

Σ

jWj,iaj

Output

Σ

Input

Links Activation

Function Input

Function Output

Links a0 = −1 ai = g(ini)

ai ini g

Wj,i W0,i Bias Weight

aj

A gross oversimplification of real neurons, but its purpose is to develop understanding of what networks of simple units can do

Chapter 20, Section 5 4

Activation functions

(a) (b)

+1 +1

ini ini

g(ini) g(ini)

(a) is astep functionorthreshold function (b) is asigmoidfunction1/(1 +e−x)

Changing the bias weightW0,imoves the threshold location

Chapter 20, Section 5 5

Implementing logical functions

AND W0= 1.5

W1 = 1

W2 = 1

OR W2 = 1 W1 = 1

W0= 0.5

NOT W1 = –1 W0= – 0.5

McCulloch and Pitts: every Boolean function can be implemented

Chapter 20, Section 5 6

(2)

Network structures

Feed-forward networks:

–single-layer perceptrons –multi-layer perceptrons

Feed-forward networks implement functions, have no internal state Recurrent networks:

–Hopfield networkshave symmetric weights (Wi,j=Wj,i)

g(x) =sign(x),ai=±1;holographic associative memory –Boltzmann machinesuse stochastic activation functions,

≈MCMC in Bayes nets

– recurrent neural nets have directed cycles with delays

⇒ have internal state (like flip-flops), can oscillate etc.

Chapter 20, Section 5 7

Feed-forward example W

1,3

W

1,4

W

2,3

W

2,4

W

3,5

W

4,5

1

2

3

4

5

Feed-forward network = a parameterized family of nonlinear functions:

a5 =g(W3,5·a3+W4,5·a4)

=g(W3,5·g(W1,3·a1+W2,3·a2) +W4,5·g(W1,4·a1+W2,4·a2)) Adjusting weights changes the function: do learning this way!

Chapter 20, Section 5 8

Single-layer perceptrons

Input

Units Units

Output Wj,i

-4 -2x10 2 4 -4-20 24 x2 0

0.2 0.4 0.6 0.8 1 Perceptron output

Output units all operate separately—no shared weights

Adjusting weights moves the location, orientation, and steepness of cliff

Chapter 20, Section 5 9

Expressiveness of perceptrons

Consider a perceptron withg= step function (Rosenblatt, 1957, 1960) Can represent AND, OR, NOT, majority, etc., but not XOR

Represents alinear separatorin input space:

Σ

jWjxj>0 or W·x>0

(a) x1and x2

1

0

0 1

x1

x2

(b) x1or x2

0 1

1

0 x1

x2

(c) x1xor x2

?

0 1

1

0 x1

x2

Minsky & Papert (1969) pricked the neural network balloon

Chapter 20, Section 5 10

Perceptron learning

Learn by adjusting weights to reduceerroron training set Thesquared errorfor an example with inputxand true outputyis

E=1 2Err2≡1

2(y−hW(x))2,

Perform optimization search by gradient descent:

∂E

∂Wj

=Err×∂Err

∂Wj

=Err× ∂

∂Wj

y−g(

Σ

nj= 0Wjxj)

=−Err×g0(in)×xj

Simple weight update rule:

Wj←Wj+α×Err×g0(in)×xj

E.g., +ve error ⇒ increase network output

⇒ increase weights on +ve inputs, decrease on -ve inputs

Chapter 20, Section 5 11

Perceptron learning contd.

Perceptron learning rule converges to a consistent function for any linearly separable data set

0.4 0.5 0.6 0.7 0.8 0.9 1

0 10 20 30 40 50 60 70 80 90 100

Proportion correct on test set

Training set size - MAJORITY on 11 inputs Perceptron Decision tree

0.4 0.5 0.6 0.7 0.8 0.9 1

0 10 20 30 40 50 60 70 80 90 100

Proportion correct on test set

Training set size - RESTAURANT data Perceptron Decision tree

Perceptron learns majority function easily, DTL is hopeless DTL learns restaurant function easily, perceptron cannot represent it

Chapter 20, Section 5 12

(3)

Multilayer perceptrons

Layers are usually fully connected;

numbers ofhidden unitstypically chosen by hand

Input units Hidden units Output units ai

Wj,i

aj

Wk,j

ak

Chapter 20, Section 5 13

Expressiveness of MLPs

All continuous functions w/ 2 layers, all functions w/ 3 layers

-4 -2x10 2 4 -4-202 4 x2 0

0.2 0.4 0.6 0.8 1 hW(x1, x2)

-4 -2x10 2 4 -4-20 24 x2 0

0.2 0.4 0.6 0.8 1 hW(x1, x2)

Combine two opposite-facing threshold functions to make a ridge Combine two perpendicular ridges to make a bump

Add bumps of various sizes and locations to fit any surface Proof requires exponentially many hidden units (cf DTL proof)

Chapter 20, Section 5 14

Back-propagation learning

Output layer: same as for single-layer perceptron, Wj,i←Wj,i+α×aj×∆i

where∆i=Erri×g0(ini)

Hidden layer:back-propagatethe error from the output layer:

j=g0(inj)X

iWj,ii.

Update rule for weights in hidden layer:

Wk,j←Wk,j+α×ak×∆j.

(Most neuroscientists deny that back-propagation occurs in the brain)

Chapter 20, Section 5 15

Back-propagation derivation

The squared error on a single example is defined as E=1

2

X

i(yi−ai)2,

where the sum is over the nodes in the output layer.

∂E

∂Wj,i = −(yi−ai) ∂ai

∂Wj,i=−(yi−ai)∂g(ini)

∂Wj,i

= −(yi−ai)g0(ini)∂ini

∂Wj,i

=−(yi−ai)g0(ini) ∂

∂Wj,i

X

jWj,iaj

= −(yi−ai)g0(ini)aj=−aji

Chapter 20, Section 5 16

Back-propagation derivation contd.

∂E

∂Wk,j =−X

i(yi−ai) ∂ai

∂Wk,j=−X

i(yi−ai)∂g(ini)

∂Wk,j

=−X

i(yi−ai)g0(ini)∂ini

∂Wk,j

=−X

ii

∂Wk,j

X

jWj,iaj

=−X

iiWj,i ∂aj

∂Wk,j

=−X

iiWj,i∂g(inj)

∂Wk,j

=−X

iiWj,ig0(inj)∂inj

∂Wk,j

=−X

iiWj,ig0(inj) ∂

∂Wk,j

X

kWk,jak

=−X

iiWj,ig0(inj)ak=−akj

Chapter 20, Section 5 17

Back-propagation learning contd.

At eachepoch, sum gradient updates for all examples and apply Training curvefor 100 restaurant examples: finds exact fit

0 2 4 6 8 10 12 14

0 50 100 150 200 250 300 350 400

Total error on training set

Number of epochs

Typical problems: slow convergence, local minima

Chapter 20, Section 5 18

(4)

Back-propagation learning contd.

Learning curve for MLP with 4 hidden units:

0.4 0.5 0.6 0.7 0.8 0.9 1

0 10 20 30 40 50 60 70 80 90 100

Proportion correct on test set

Training set size - RESTAURANT data Decision tree Multilayer network

MLPs are quite good for complex pattern recognition tasks, but resulting hypotheses cannot be understood easily

Chapter 20, Section 5 19

Handwritten digit recognition

3-nearest-neighbor = 2.4% error 400–300–10 unit MLP = 1.6% error

LeNet: 768–192–30–10 unit MLP = 0.9% error

Current best (kernel machines, vision algorithms)≈0.6% error

Chapter 20, Section 5 20

Summary

Most brains have lots of neurons; each neuron≈linear–threshold unit (?) Perceptrons (one-layer networks) insufficiently expressive

Multi-layer networks are sufficiently expressive; can be trained by gradient descent, i.e., error back-propagation

Many applications: speech, driving, handwriting, fraud detection, etc.

Engineering, cognitive modelling, and neural system modelling subfields have largely diverged

Chapter 20, Section 5 21

Références

Documents relatifs

the variables could be the weights of a neural network, and each constraint imposes that the network satisfies the correct input-output relation on one of M.. training

An alternative algorithm is supervised, greedy and layer-wise: train each new hidden layer as the hidden layer of a one-hidden layer supervised neural network NN (taking as input

An alternative algorithm is supervised, greedy and layer-wise: train each new hidden layer as the hidden layer of a one-hidden layer supervised neural network NN (taking as input

The proposed architecture is a “neural network” implementation of a graphical model where all the variables are observed in the training set, with the hidden units playing a

An alternative algorithm is supervised, greedy and layer-wise: train each new hidden layer as the hidden layer of a one-hidden layer supervised neural network (taking as input

This simple model however, does not represent the best performance that can be achieved for these networks, indeed the maximum storage capacity for uncorrelated patterns has been

To enumerate only encapsulations and decapsulations in the length of each word (and thus minimize adaptation functions by finding the shortest word accepted), a transformed automaton

2 A user-defined hyperparameter... Pavia University image. The output layer has only one neuron, and its activa- tion function is the sigmoid. After the training, the two