HAL Id: jpa-00246731
https://hal.archives-ouvertes.fr/jpa-00246731
Submitted on 1 Jan 1993
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Learning algorithms for perceptrons from statistical physics
Mirta Gordon, Pierre Peretto, Dominique Berchier
To cite this version:
Mirta Gordon, Pierre Peretto, Dominique Berchier. Learning algorithms for perceptrons from statistical physics. Journal de Physique I, EDP Sciences, 1993, 3 (2), pp.377-387. 10.1051/jp1:1993140. jpa-00246731
Classification
Physics Abstracts 05.50 02.70 87.10

Learning algorithms for perceptrons from statistical physics

Mirta B. Gordon(1)(*), Pierre Peretto(2) and Dominique Berchier(1)(**)

(1) DRFMC/SPSMS, CEN Grenoble, 85X, 38041 Grenoble Cedex, France
(2) DRFMC/SP2M, CEN Grenoble, 85X, 38041 Grenoble Cedex, France

(Received 22 May 1992, accepted in final form 23 July 1992)
Résumé. We derive learning algorithms for perceptrons from statistical-mechanics considerations. Thermodynamic quantities are taken as cost functions, from which gradient dynamics yield the synaptic efficacies that learn the training set. The rules thus obtained fall into two categories, according to the statistics, Boltzmann or Fermi, used to derive the cost functions. In the limits of zero or infinite temperature, most of the rules found tend towards known algorithms, but at finite temperature new strategies appear, which minimize the number of errors on the training set.
Abstract. Learning algorithms for perceptrons are deduced from statistical mechanics. Thermodynamical quantities are used as cost functions which may be extremalized by gradient dynamics to find the synaptic efficacies that store the learning set of patterns. The learning rules so obtained are classified in two categories, following the statistics used to derive the cost functions, namely, Boltzmann statistics and Fermi statistics. In the limits of zero or infinite temperatures some of the rules behave like already known algorithms, but new strategies for learning are obtained at finite temperatures, which minimize the number of errors on the training set.
I met Rammal for the first time in the DEA programme "Matière et Rayonnement" of the Université de Grenoble. I saw him again quite often afterwards, and I owe to him, in part, my interest in neural networks. Indeed, at the end of 1982, R. Maynard received from P. W. Anderson two preprints, one by Anderson himself on the dynamics of the prebiotic soup, and the other by Hopfield on the Hebbian model of memory in neural networks. He passed them on to me and, after having programmed the dynamics of Anderson's soup, I decided to involve myself further in the theory of neural networks. So, during 1983, R. Maynard, R. Rammal and I formed a small group which met some ten times and in which I presented the ideas that were beginning to emerge in this field. The broad culture of Rammal and Maynard made these meetings moments of intellectual excitement, which are the main reward of a researcher's life. I hope that Rammal would have appreciated the short review article we propose below. I believe he would.
P. Peretto

(*) Also with C.N.R.S.
(**) Supported by Sunburst-Fonds, Board of the Swiss Federal Institutes of Technology, Zürich.
1. Introduction.

One of the present interests in neural networks arises from the fact that they are endowed with generalization capabilities, that is, they are able to extract, from a set of examples (the training set of patterns), the underlying rule that generates them. If good generalization is achieved, the network should give a correct output to an unfamiliar input, an input not belonging to the training set.

The interactions $J_{ij}$ between neurons $j$ and $i$, $1 \le i, j \le N$, and the thresholds $J_{i0}$, fully characterize a neural network of $N$ neurons. Learning consists in modifying these interactions so as to make the response of the network to the patterns of the learning set as close as possible to the corresponding expected outputs. As usual, we shall write $\xi_i$ for the state of neuron $i$, and $\mu$ for the labels of the patterns in the training
set, $1 \le \mu \le P$.

A number of learning algorithms has already been put forward for neural nets [1], most of them stemming from the seminal idea of Hebb, who pointed out the role of correlations of activities in the learning dynamics. However, the original Hebbian law, in which learning a new pattern $\mu$ introduces a modification of the synaptic weights given by [2]:

$$\Delta J_{ij}(\mu) = \varepsilon \, \xi_i^\mu \xi_j^\mu, \qquad \varepsilon > 0 \qquad (1.1)$$

is not optimal, in the sense that it neither allows one to reach the maximal (theoretical) capacity [3, 4] of the network for random patterns, nor is it able to learn without errors when $P$, the number of patterns in the learning set, scales with the number of neurons [5].
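As a toy illustration (not from the paper), rule (1.1) in its perceptron form, with the expected output $\zeta^\mu$ in place of the postsynaptic state, can be sketched in a few lines; the two patterns and the value of $\varepsilon$ are arbitrary choices:

```python
import numpy as np

# Two toy patterns (rows), with their expected outputs.
xi = np.array([[1, -1, 1],
               [1,  1, -1]])        # xi_i^mu, mu = 0, 1
zeta = np.array([1, -1])            # zeta^mu
eps = 1.0                           # learning rate, any eps > 0

# Hebbian rule (1.1), perceptron version: Delta J_i = eps * xi_i^mu * zeta^mu
J = eps * (zeta[:, None] * xi).sum(axis=0)

outputs = np.sign(xi @ J)           # network response to each stored pattern
```

With few mutually uncorrelated patterns the Hebbian couplings already reproduce the expected outputs; it is only when $P$ grows with $N$ that the crosstalk between patterns causes errors.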
Attention then focused on the simplest network one can imagine: the perceptron, which consists of $N$ input neurons acting, through a set of interactions $J_i$, onto a unique output unit. A specific algorithm, namely the Perceptron algorithm, was devised for this type of network [6]. It modifies the synaptic efficacies according to (1.1), but taking into account only those patterns the network gives the wrong answer to. The training set is iteratively tested and the $J_i$ modified until all the patterns are well learnt. It has been proved [6] that if the learning set is linearly separable, that is, if a set of interactions $J_i$ exists such that the response

$$\sigma(\mu) = \mathrm{Sign}\left( \sum_{i=0}^{N} J_i \xi_i^\mu \right) \qquad (1.2)$$

coincides with the expected response $\zeta^\mu$ for all the patterns of the learning set, then the algorithm finds these interactions. In (1.2) we have introduced, as usual, a threshold unit $\xi_0 = -1$, associated with $J_0$, the threshold value. The Perceptron algorithm
is not very efficient, because it only allows for marginal stabilities, and several new learning rules have been proposed which both speed up the search times and improve the stability of solutions with respect to small perturbations of the input signals and/or interactions [7, 8].
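A minimal sketch of the Perceptron algorithm as just described, where only misclassified patterns trigger the Hebbian correction; the OR-like training set, the learning rate and the sweep limit are arbitrary choices:

```python
import numpy as np

def perceptron_learn(xi, zeta, eps=1.0, max_sweeps=100):
    """Perceptron algorithm: apply the Hebbian update (1.1) only to
    patterns the network currently misclassifies."""
    P, N = xi.shape
    J = np.zeros(N)
    for _ in range(max_sweeps):
        errors = 0
        for mu in range(P):
            if zeta[mu] * (xi[mu] @ J) <= 0:     # wrong (or marginal) answer
                J += eps * xi[mu] * zeta[mu]     # Hebbian correction
                errors += 1
        if errors == 0:                          # all patterns well learnt
            return J, True
    return J, False

# A linearly separable toy set; column 0 is the threshold unit xi_0 = -1.
xi = np.array([[-1, 1, 1], [-1, 1, -1], [-1, -1, 1], [-1, -1, -1]])
zeta = np.array([1, 1, 1, -1])                   # OR-like target
J, converged = perceptron_learn(xi, zeta)
```

For a linearly separable set the convergence theorem of [6] guarantees that the loop terminates; the solution found, however, may have only marginal stabilities.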
All these algorithms are able to find a solution when it exists, and it is believed that the best generalization rates correspond to the highest stabilities. Recently, an algorithm that finds the best generalization rates for a perceptron has been proposed [9], but it needs, in its implementation, a network of higher complexity, namely one with an infinite number of hidden units. However, problems to be treated by neural networks are in general more complex than linearly separable ones, and the perceptron architecture is too poor for such performances as good learning and generalization [10]. Therefore, networks with more involved designs are needed.

The problem of learning in these complex architectures is difficult and not completely solved yet. Among the various attempts made so far, constructivistic algorithms seem most promising.
They proceed by adding one new cell to the network at a time, so that learning reduces to learning in the new perceptrons successively created by the construction. The problem of learning remains central. But now one is less interested in finding the solution (which anyway, at least at the stage of the intermediate perceptrons, does not exist) than in finding the best solution, which minimizes the number of errors on the learning set [11, 12]. The relation between minimizing the number of errors on the training set and the generalization performance of such constructivistic algorithms is not clear yet, but it is believed that the smaller the network built up by the learning algorithm, the better its generalization rate [13].
The strategy put forward in this paper is to find a convenient thermodynamical quantity to extremalize, like an energy or any other cost function, from which a learning algorithm may be deduced by gradient dynamics. For this goal to be achieved, we rely upon statistical mechanics. The reason for such a choice is that at finite temperature we have the possibility of accepting solutions which, although not optimal, are nevertheless acceptable. Several approaches which optimize the storing capacities of perceptrons, even in the regime with errors, are possible. We classify them in two categories, following the statistics used to derive the cost functions, namely Boltzmann statistics, which we consider in section 2, and Fermi statistics, discussed in section 3. In the interest of clarity, the learning rules are first deduced without normalizing the synaptic efficacies, the introduction of normalization being postponed to section 4. In section 5 we discuss the different learning rules obtained.

2. Cost functions generated with Boltzmann statistics.

2.1 CANONICAL FREE ENERGY MAXIMIZATION. We consider a perceptron network of $N$ input units which act on a single output via the synaptic efficacies $J_i$, $i = 0, 1, \dots, N$.
The units can only be in one of two states, +1 or -1. Each example $(\xi_1^\mu, \xi_2^\mu, \dots, \xi_N^\mu; \zeta^\mu)$ of the training set consists in the input $(\xi_1^\mu, \xi_2^\mu, \dots, \xi_N^\mu)$, which can be viewed as one of the corners of a hypercube in dimension $N$, and its expected output $\zeta^\mu$. The learning set is stored by the network, defined by its interactions $\{J_i\}$, if the stabilities $\gamma^\mu$ satisfy:

$$\gamma^\mu = \zeta^\mu \sum_{i=0}^{N} J_i \xi_i^\mu > 0 \qquad (2.1)$$

where $\mu = 1, 2, \dots, P$, that is, if all the stabilities are positive quantities.
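The stabilities of (2.1) are straightforward to evaluate; a small sketch with a hand-picked toy set (all values are illustrative, not from the paper):

```python
import numpy as np

def stabilities(J, xi, zeta):
    """Stabilities of eq. (2.1): gamma^mu = zeta^mu * sum_i J_i xi_i^mu.
    The learning set is stored iff every entry is positive."""
    return zeta * (xi @ J)

# Toy check: couplings that store both patterns of a 2-pattern set.
xi = np.array([[1, 1, -1], [1, -1, 1]])
zeta = np.array([1, -1])
J = np.array([0.0, 1.0, -1.0])
gamma = stabilities(J, xi, zeta)
stored = bool(np.all(gamma > 0))
```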
The larger $\gamma^\mu$, the higher the stability of pattern $\mu$.

To derive learning algorithms from Boltzmann statistics we consider each example as one possible state of the learning process. The energy associated with state $\mu$ is

$$E^\mu = \gamma^\mu(\{J_i\}) \qquad (2.2)$$

the stability of the network for that pattern. This definition of energy may look strange because it is lower for the less stable patterns. However, the central idea in building learning dynamics is that if we succeed in stabilizing the least stable pattern, then we are guaranteed that all the other patterns are stable as well.

We suppose that only the states of the training set are accessible to the network during learning. The probability that the network is in state $\mu$ at temperature $T = 1/\beta$ is given by the Boltzmann factor $e^{-\beta E^\mu}/Z$, where

$$Z = \sum_{\mu=1}^{P} e^{-\beta E^\mu} \qquad (2.3)$$

is the partition function. The network spends more time in those states which have lower stabilities and seem more difficult to learn. The free energy $F(\{J_i\}, T)$ of the system at temperature $T$ is given by:

$$F = -T \log Z \qquad (2.4)$$
In order to make the least stable pattern as stable as possible, the synaptic efficacies are modified so as to maximize the free energy. This may be achieved with a gradient dynamics:

$$\frac{dJ_i}{dt} = \varepsilon \frac{\partial F}{\partial J_i} = \varepsilon \sum_{\mu=1}^{P} \frac{e^{-\beta E^\mu}}{Z} \xi_i^\mu \zeta^\mu \qquad (2.5)$$

which may be written in a discrete time approximation as:

$$J_i(t+1) = J_i(t) + \Delta J_i(t), \qquad \Delta J_i(t) = \varepsilon \sum_{\mu=1}^{P} \frac{e^{-\beta E^\mu}}{Z} \xi_i^\mu \zeta^\mu, \qquad \varepsilon > 0 \qquad (2.6)$$

with $\varepsilon$ a small time step. This formula has been proposed in [14]. Each Hebbian term $\xi_i^\mu \zeta^\mu$ appears with a weight equal to the probability of the learning state, which is higher the more unstable the pattern within the learning set (Fig. 1).
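One discrete step of rule (2.6) can be sketched as follows; the temperature, learning rate and random toy set are arbitrary choices, and the shift by the minimal stability inside the exponential is only a standard numerical-stability trick that leaves the normalized weights unchanged:

```python
import numpy as np

def boltzmann_step(J, xi, zeta, beta=2.0, eps=0.1):
    """One discrete step of rule (2.6): every pattern contributes its Hebbian
    term xi_i^mu * zeta^mu weighted by exp(-beta E^mu)/Z, so the least stable
    patterns dominate at low temperature."""
    E = zeta * (xi @ J)                  # stabilities E^mu = gamma^mu, eq. (2.2)
    w = np.exp(-beta * (E - E.min()))    # shifted exponent, same weights after
    w /= w.sum()                         # Boltzmann weights e^{-beta E^mu}/Z (2.3)
    return J + eps * (w[:, None] * zeta[:, None] * xi).sum(axis=0)

rng = np.random.default_rng(1)
xi = rng.choice([-1, 1], size=(6, 10))   # 6 random patterns of 10 units
zeta = rng.choice([-1, 1], size=6)
J = boltzmann_step(np.zeros(10), xi, zeta)
```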
Rule (2.6) interpolates between the usual Hebbian rule in the limit of very high temperatures, for which $e^{-\beta E^\mu}/Z \to 1/P$, and the Minover algorithm with zero stability [7] in the limit of zero temperature, for in that limit (2.6) writes:

$$\Delta J_i(t) = \varepsilon \, \xi_i^{\mu_{\min}} \zeta^{\mu_{\min}} \qquad (2.7)$$

where $\mu_{\min}$ is the least stable pattern of the training set at time step $t$. Thus, $\mu_{\min}$ is the pattern, and the only one, that is learnt at time $t$.

2.2 VARIATIONS ON THE BOLTZMANN STATISTICS THEME.
Learning rule (2.6) has been deduced from the maximization of the free energy. However, in order to increase the stabilities of the patterns of the training set, one may try to maximize the mean energy of the system:

$$\langle E \rangle = -\frac{\partial \log Z}{\partial \beta} = \sum_{\mu} E^\mu \frac{e^{-\beta E^\mu}}{Z} \qquad (2.8)$$

If this quantity is positive at zero $T$, it is clear that all the patterns of the training set are stable, because if $T = 0$ then $\langle E \rangle$ is equal to the stability of the least stable pattern.
Fig. 1. Coefficient of $\xi_i^\mu \zeta^\mu$ vs. $E^\mu$ (in arbitrary units): (-) rule (2.6), (- -) rule (2.9).
Maximizing $\langle E \rangle$ gives the following learning rule:

$$\Delta J_i(t) = \varepsilon \sum_{\mu=1}^{P} \left[ 1 - \beta \left( E^\mu - \langle E \rangle \right) \right] \frac{e^{-\beta E^\mu}}{Z} \xi_i^\mu \zeta^\mu \qquad (2.9)$$

This rule interpolates between Hebb's rule at high $T$ and Minover at $T = 0$, exactly in the same way as (2.6), but at finite temperatures the weights of the most stable patterns are negative: in order to learn the still unstable patterns with this rule, very stable patterns should be unlearned (Fig. 1). This strategy is similar to that of the Adaline algorithm [15, 16], in which the stabilities are forced to be all equal to 1. The Adaline rule modifies the synaptic strengths so as to minimize the following cost function:

$$H = \sum_{\mu} (E^\mu - 1)^2 \qquad (2.10)$$

thus unlearning patterns with stabilities higher than 1 and learning those with lower stabilities.

3. Cost functions generated with Fermi statistics.

3.1 GRAND POTENTIAL MAXIMIZATION. This second approach is quite different. Now we consider the $P$ learning patterns as states of energies $E^\mu$ that can be occupied or empty, like the energy states of a system of noninteracting spinless fermions. We define the occupation numbers $n^\mu = 1$ if state (pattern) $\mu$ is occupied and $n^\mu = 0$ if not. One may view $n^\mu$ as a pointer which tells us whether the response of the network to input $\mu$ is erroneous ($n^\mu = 1$) or not ($n^\mu = 0$). The energy of the system is now:

$$E = \sum_{\mu=1}^{P} E^\mu n^\mu \qquad (3.1)$$

with $E^\mu$ given, as before, by (2.2).
At zero temperature, the energy (3.1) is minimized if all the not-learned patterns $\mu$, which have $E^\mu < 0$, are occupied, i.e. have $n^\mu = 1$, and the learned patterns, having $E^\mu > 0$, are empty. The sum $\sum_{\mu=1}^{P} n^\mu$ is therefore the total number of errors made by the network on the training set.

Fig. 2. Coefficient of $\xi_i^\mu \zeta^\mu$ vs. $E^\mu$ with $\lambda = 0$ (in arbitrary units): (-) rule (3.4), (- -) rule (3.8), (...) rule (3.13).

Consider now the
system in equilibrium with a reservoir of chemical potential $\lambda$ and temperature $T = 1/\beta$. The partition function in the grand canonical ensemble is:

$$\Xi = \sum_{\{n^\mu = 0, 1\}} e^{-\beta \sum_\mu (E^\mu - \lambda) n^\mu} = \prod_{\mu=1}^{P} \left( 1 + e^{-\beta (E^\mu - \lambda)} \right) \qquad (3.2)$$

The grand potential is:

$$\Phi = -T \sum_{\mu=1}^{P} \ln \left( 1 + e^{-\beta (E^\mu - \lambda)} \right) \qquad (3.3)$$
As before, one may look for the set of interactions that maximize the potential (3.3), because indirectly this will increase the pattern stabilities, thus decreasing their population, which is the number of errors. With expression (2.2) for the energy, the resulting learning dynamics is:

$$\frac{dJ_i}{dt} = \varepsilon \frac{\partial \Phi}{\partial J_i} = \varepsilon \sum_{\mu=1}^{P} \frac{1}{2} \left( 1 - \tanh \frac{\beta (E^\mu - \lambda)}{2} \right) \xi_i^\mu \zeta^\mu \qquad (3.4)$$

The hyperbolic tangent vanishes in the limit $T \to \infty$, and for $T \to 0$ we have $\frac{1}{2}\left(1 - \tanh \frac{\beta (E^\mu - \lambda)}{2}\right) \to \frac{1}{2}\left(1 - \mathrm{sign}(E^\mu - \lambda)\right) = \theta\left(-(E^\mu - \lambda)\right)$. Therefore, this algorithm (Figs. 2 and 3) interpolates between the Hebbian rule, in the limit of high $T$, and the Perceptron algorithm with stability $\lambda$ in the limit of zero temperature, where it gives:

$$\frac{dJ_i}{dt} = \varepsilon \sum_{\mu} \theta(\lambda - E^\mu) \, \xi_i^\mu \zeta^\mu \qquad (3.5)$$

Therefore, if a solution to the problem of learning exists, the chemical potential enables one to impose a given stability.
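The weight attached to each pattern by rule (3.4) is a Fermi occupation number; a sketch with arbitrary $\beta$, $\lambda$ and learning rate, which also checks the $T \to 0$ step-function limit of (3.5):

```python
import numpy as np

def fermi_step(J, xi, zeta, beta=2.0, lam=0.5, eps=0.1):
    """One step of rule (3.4): each pattern is weighted by its Fermi
    occupation 0.5 * (1 - tanh(beta * (E^mu - lam) / 2)), i.e. by the
    probability that it is an error at chemical potential lam."""
    E = zeta * (xi @ J)
    w = 0.5 * (1.0 - np.tanh(0.5 * beta * (E - lam)))
    return J + eps * (w[:, None] * zeta[:, None] * xi).sum(axis=0)

# At very large beta (T -> 0) the weight tends to theta(lam - E) of rule (3.5).
E = np.array([-1.0, 0.49, 0.51, 2.0])
w_cold = 0.5 * (1.0 - np.tanh(0.5 * 1e6 * (E - 0.5)))
```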
Fig. 3. Coefficient of $\xi_i^\mu \zeta^\mu$ vs. $E^\mu$.

Learning rule (3.4) maximizes the probability, at temperature $T$, that the network gives the correct output to all the patterns of the learning set. This probability writes:

$$\prod_{\mu=1}^{P} \left( 1 - \langle n^\mu \rangle \right) = \prod_{\mu=1}^{P} \frac{1}{1 + e^{-\beta (E^\mu - \lambda)}} \qquad (3.6)$$

It is straightforward to verify that the maximization of the grand potential (3.3) is equivalent to maximizing the logarithm of the probability (3.6).
3.2 OTHER LEARNING RULES. Within the fermionic formulation, one can try to minimize the mean number of errors,

$$\langle n \rangle = \sum_{\mu=1}^{P} \langle n^\mu \rangle = \sum_{\mu=1}^{P} \frac{1}{2} \left( 1 - \tanh \frac{\beta (E^\mu - \lambda)}{2} \right) \qquad (3.7)$$

It follows the learning dynamics:

$$\frac{dJ_i}{dt} = -\varepsilon \frac{\partial \langle n \rangle}{\partial J_i} = \varepsilon \sum_{\mu=1}^{P} \frac{\beta}{4 \cosh^2 \left( \beta (E^\mu - \lambda)/2 \right)} \xi_i^\mu \zeta^\mu \qquad (3.8)$$

Here again, we find an algorithm that reduces to Hebb's rule at high temperature, but at finite $T$ it prescribes a learning dynamics that gives a higher weight at each time step to those patterns whose stabilities are closer to $\lambda$ (Figs. 2 and 3).
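The bell-shaped weight of rule (3.8) can be checked numerically; the parameter values and the grid are arbitrary choices:

```python
import numpy as np

def error_count_weight(E, beta=4.0, lam=0.0):
    """Weight of rule (3.8): beta / (4 cosh^2(beta (E - lam) / 2)).
    It is -d<n^mu>/dE^mu: maximal at E = lam, vanishing for |E - lam| large."""
    return beta / (4.0 * np.cosh(0.5 * beta * (E - lam)) ** 2)

E = np.linspace(-5, 5, 201)
w = error_count_weight(E)
peak_E = E[np.argmax(w)]        # the peak sits at the chemical potential
```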
This prescription, with $\lambda = 0$, is very similar to that of the $\Delta$-rule [15-17], based upon the minimization of a cost function of the form:

$$H = \sum_{\mu} \left( \zeta^\mu - \sigma(\zeta^\mu E^\mu) \right)^2 \qquad (3.9)$$

where $\sigma$ is a sigmoid function. This rule is nothing else than the popular Backpropagation algorithm [18] applied to a perceptron architecture without hidden units. Taking into account that we are dealing with binary units, which is not the usual case in networks trained with rule (3.9), and choosing $\sigma(\zeta^\mu E^\mu) = \tanh\left(\frac{\beta}{2} \zeta^\mu E^\mu\right) = \zeta^\mu \tanh\left(\frac{\beta}{2} E^\mu\right)$, yields a cost function that counts the square of the mean number of errors on each pattern at $\lambda = 0$:

$$H = \sum_{\mu} \left( 1 - \tanh \frac{\beta E^\mu}{2} \right)^2 = 4 \sum_{\mu} \left( \langle n^\mu \rangle \right)^2 \qquad (3.10)$$

Minimization of (3.10) gives

$$\frac{dJ_i}{dt} = -\varepsilon \frac{\partial H}{\partial J_i} = \varepsilon \sum_{\mu=1}^{P} \frac{\beta \left( 1 - \tanh(\beta E^\mu/2) \right)}{\cosh^2 (\beta E^\mu/2)} \xi_i^\mu \zeta^\mu \qquad (3.11)$$

Here, the weight with which the patterns are learnt is peaked around slightly negative stabilities.

Finally,
as in Boltzmann's formulation, let us consider the rule derived from the maximization of the mean energy of the system:

$$\langle E \rangle = \sum_{\mu=1}^{P} E^\mu \langle n^\mu \rangle \qquad (3.12)$$

thus trying to increase the stabilities mainly of the patterns with $\langle n^\mu \rangle \approx 1$, which are those with lower stabilities. Maximization of $\langle E \rangle$ leads to the following learning rule:

$$\frac{dJ_i}{dt} = \varepsilon \frac{\partial \langle E \rangle}{\partial J_i} = \frac{\varepsilon}{2} \sum_{\mu=1}^{P} \left( 1 - \tanh \frac{\beta E^\mu}{2} \right) \left( 1 - \frac{\beta E^\mu}{2} \left( 1 + \tanh \frac{\beta E^\mu}{2} \right) \right) \xi_i^\mu \zeta^\mu \qquad (3.13)$$
With this rule, patterns with stabilities around $-\lambda$ are given higher weights, and unlearning of patterns with large positive stabilities takes place (Figs. 2 and 3), like in Adaline-type rules.

4. Normalizing the synaptic weights.
Up to now, no attention has been paid to the range of values of the $J_i$. However, we know that it is useful to divide the stabilities by the norm of the interactions, $|J|$, to avoid their increase by a trivial overall scaling of the $J_i$ [19]. Two conventions are possible: either the full vector $J = (J_0, J_1, \dots, J_N)$, with

$$|J| = \left( \sum_{i=0}^{N} J_i^2 \right)^{1/2} \qquad (4.1.A)$$

is considered (convention A), or the threshold is excluded but is constrained to lie inside the hypercube of dimension $N$ whose corners represent all the possible input states of the network (convention B):

$$|\hat{J}| = \left( \sum_{i=1}^{N} J_i^2 \right)^{1/2}, \qquad -\sqrt{N} < J_0 < \sqrt{N} \qquad (4.1.B)$$
With convention A, the $J_i/|J|$ are the director cosines of a hyperplane through the center of the hypercube in $N + 1$ dimensions, the threshold unit providing for the supplementary dimension. With convention B, $J_0$ is the distance from the hyperplane of director cosines $J_i/|\hat{J}|$ to the center of the hypercube in dimension $N$. The definition (2.2) of the energies should be modified:

$$E^\mu = \sum_{i=0}^{N} \frac{J_i}{|J|} \xi_i^\mu \zeta^\mu \qquad (4.2.A)$$

if convention A is adopted, or

$$E^\mu = \sum_{i=1}^{N} \frac{J_i}{|\hat{J}|} \xi_i^\mu \zeta^\mu + J_0 \xi_0^\mu \zeta^\mu \qquad (4.2.B)$$

with convention B. The calculation
proceeds exactly as above, leading to:

$$\frac{dJ_i}{dt} = \varepsilon \sum_{\mu=1}^{P} \frac{e^{-\beta E^\mu}}{Z} \frac{1}{|J|} \left( \xi_i^\mu \zeta^\mu - \frac{J_i}{|J|} E^\mu \right) \qquad (4.3.A)$$

for rule (2.6) with convention A, or to

$$\frac{dJ_i}{dt} = \varepsilon \sum_{\mu=1}^{P} \frac{e^{-\beta E^\mu}}{Z} \frac{1}{|\hat{J}|} \left( \xi_i^\mu \zeta^\mu - \frac{J_i}{|\hat{J}|} \left( E^\mu - J_0 \xi_0^\mu \zeta^\mu \right) \right), \qquad i > 0 \qquad (4.3.B)$$

with convention B. In (4.3.B), a stabilizing term $-\eta J_0^3$ has been added to the threshold dynamics to make sure that the hyperplane will not escape the hypercube.
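With convention A, the normalized stabilities (4.2.A) are invariant under a global rescaling of the couplings, which is the point of the normalization; a toy check with arbitrary values:

```python
import numpy as np

def normalized_stabilities(J, xi, zeta):
    """Stabilities of eq. (4.2.A): E^mu = sum_i (J_i / |J|) xi_i^mu zeta^mu.
    Rescaling J by any positive factor leaves them unchanged."""
    return zeta * (xi @ J) / np.linalg.norm(J)

xi = np.array([[1, -1, 1], [1, 1, 1]])
zeta = np.array([1, -1])
J = np.array([1.0, -2.0, 2.0])
E1 = normalized_stabilities(J, xi, zeta)
E2 = normalized_stabilities(10.0 * J, xi, zeta)   # invariant under scaling
```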
In general, then, the effect of normalizing the stabilities is simply to replace the Hebbian factor in the learning rule without normalization by:

$$\xi_i^\mu \zeta^\mu \;\to\; \frac{1}{|J|} \left( \xi_i^\mu \zeta^\mu - \frac{J_i}{|J|} E^\mu \right) \qquad (4.4.A)$$

if convention A is adopted, or:

$$\xi_i^\mu \zeta^\mu \;\to\; \frac{1}{|\hat{J}|} \left( \xi_i^\mu \zeta^\mu - \frac{J_i}{|\hat{J}|} \left( E^\mu - J_0 \xi_0^\mu \zeta^\mu \right) \right), \qquad i > 0 \qquad (4.4.B)$$

and adding the confining term $-\eta J_0^3$ in the threshold dynamics, in the case of convention B.

5. Discussion and conclusions.
In the preceding sections we have shown that statistical mechanics provides an interesting framework wherein most of the known perceptron learning rules find their place. All the learning algorithms take the general form:

$$\Delta J_i(t) \propto \sum_{\mu} w^\mu \, \xi_i^\mu \zeta^\mu \qquad (5.1)$$

where $w^\mu$, the weight with which each pattern is learned, depends on the specific learning rule.

Now the
problem is to determine which rule is the most efficient.

Let us assume that there is a solution: a set of interactions $\{J_i\}$ exists such that the stabilities $E^\mu$ are all positive. In this case, Boltzmann and Fermi statistics give rules whose weights behave in one of two possible ways: either always positive, like in Perceptron-type rules,

$$w^\mu \sim e^{-k \beta E^\mu} \qquad (5.2)$$

with $k = 1/2$, $1$ or $2$, or like in Adaline-type rules, in which weights are positive for low stabilities and negative for high stabilities:

$$w^\mu \propto \left( 1 - f(\beta E^\mu) \right) e^{-k \beta E^\mu} \qquad (5.3)$$

where $f(\beta E^\mu) = \beta (E^\mu - \langle E \rangle)$ or $f(\beta E^\mu) = \beta E^\mu$, depending on whether we consider Boltzmann or Fermi statistics.

Determining the speed of convergence is another question. As learning starts, in general from tabula rasa or from random synaptic efficacies, some patterns are ill-classified. Convergence might be accelerated if those ill-classified patterns exert a strong force on the hyperplane determined by the $\{J_i\}$, so as to place themselves on the correct side with respect to it. In other words, the weights $w^\mu$ should be larger the more negative the stabilities. This is achieved by rules (2.5), (2.9), (3.4) and (3.8), or their normalized versions like (4.3).
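The qualitative behaviours just listed can be made concrete by evaluating the (unnormalized) weights of rules (2.6), (3.4) and (3.8) on a grid of stabilities; $\beta$, $\lambda = 0$ and the grid are arbitrary choices:

```python
import numpy as np

# Unnormalized pattern weights w(E) for three of the rules, with lam = 0:
E = np.linspace(-4, 4, 9)
beta = 2.0
w_26 = np.exp(-beta * E)                           # rule (2.6), up to 1/Z
w_34 = 0.5 * (1 - np.tanh(0.5 * beta * E))         # rule (3.4)
w_38 = beta / (4 * np.cosh(0.5 * beta * E) ** 2)   # rule (3.8)

# Perceptron-type weights stay positive and keep pushing on very unstable
# patterns, whereas the weight of (3.8) dies off on both sides of E = lam.
```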
However, in general, there is no solution to the learning problem within the perceptron architecture. Then the best we can do is to devise a rule that minimizes the number of errors. Suppose that such a solution $\{J_i\}$ has been found. Then there necessarily exist ill-classified patterns. Let us assume on the other hand that the temperature is so low that the action of patterns with positive stabilities vanishes. If the weights $w^\mu$ do not vanish for negative $E^\mu$, as may be the case in all the rules but (3.8), there will always exist a non-vanishing force exerted on the hyperplane, whatever its position, and the learning procedure will never end.

The conclusion is that solutions that minimize the number of errors can only be reached by using weights which vanish for large positive and large negative stabilities, like in algorithm (3.8) or in backpropagation (3.11). However, it may take an infinitely long time to reach a solution this way, since the force on the hyperplane exerted by patterns with strongly negative $E^\mu$ is vanishingly small. A compromise is then necessary, which combines this type of algorithm with those of perceptron type to speed up the pace of the convergence. Several ways of tackling this problem are possible:

1) The simplest idea is to linearly combine both algorithms, which in the case of $\lambda = 0$ writes:

$$\frac{dJ_i}{dt} = \varepsilon \sum_{\mu=1}^{P} \left( \frac{\beta}{4 \cosh^2 (\beta E^\mu / 2)} + a \frac{e^{-\beta E^\mu}}{Z} \right) \xi_i^\mu \zeta^\mu \qquad (5.4)$$

where the parameter $a$ has to be large enough to actually decrease the convergence times. Learning dynamics (5.4) looks rather involved, but it is the price to pay when the space of solutions is no more convex.

2)
Another way to increase the weight of ill-classified patterns is to use the rule (3.8) and make it asymmetrical by choosing two different temperatures, according to the sign of the stability:

$$\frac{dJ_i}{dt} = \varepsilon \sum_{\mu=1}^{P} \frac{\beta^\mu}{4 \cosh^2 (\beta^\mu E^\mu / 2)} \xi_i^\mu \zeta^\mu \qquad (5.5)$$

with $\beta^\mu = \beta_+$ if $E^\mu > 0$ and $\beta^\mu = \beta_-$ if $E^\mu < 0$, $\beta_+ > \beta_-$. The determination of the parameters $\beta$, $\lambda$ and $a$, and the results of simulations carried out using prescriptions (5.4) and (5.5), are beyond the scope of this article and will be reported elsewhere.

3) Finally,
it may be a good strategy to use the different types of learning algorithms in turn, starting with Minover or any other rule that ensures high stabilities, which will swiftly give the solution to the problem if the solution exists, and then finishing the learning session with a combination such as (5.4) or (5.5) if the solution without errors was not found. The problem is to find the criteria that will tell when to swap from one type of rule to another. This may prove to be a difficult problem.
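The staged strategy of point 3 can be sketched as follows; the schedule, the parameters and the toy training set are all arbitrary choices, with the two phases using the weights of rules (2.6) and (5.4):

```python
import numpy as np

def train_staged(xi, zeta, beta=2.0, eps=0.05, sweeps=200, a=1.0):
    """Staged strategy of point 3 (a sketch, with an arbitrary schedule):
    first run perceptron-type updates with Boltzmann weights, rule (2.6);
    if errors remain, switch to the mixed rule (5.4)."""
    P, N = xi.shape
    J = np.zeros(N)

    def step(J, mixed):
        E = zeta * (xi @ J)
        w = np.exp(-beta * (E - E.min()))
        w /= w.sum()                         # weights of rule (2.6)
        if mixed:                            # add the error-peaked term of (5.4)
            w = beta / (4 * np.cosh(0.5 * beta * E) ** 2) + a * w
        return J + eps * (w[:, None] * zeta[:, None] * xi).sum(axis=0)

    for mixed in (False, True):
        for _ in range(sweeps):
            J = step(J, mixed)
            if np.all(zeta * (xi @ J) > 0):  # no errors left: stop
                return J, True
        # fall through to the mixed rule if errors remain
    return J, False

# Separable toy set (column 0 is the threshold unit xi_0 = -1).
xi = np.array([[-1, 1, 1], [-1, 1, -1], [-1, -1, 1], [-1, -1, -1]])
zeta = np.array([1, 1, 1, -1])
J, ok = train_staged(xi, zeta)
```

A real implementation would also need the stopping criterion discussed above for deciding that the error-free solution does not exist, which this sketch replaces by a fixed sweep budget.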
References

[1] ABBOTT L.F., Network 1 (1990) 105.
[2] HOPFIELD J.J., Proc. Natl. Acad. Sci. U.S.A. 79 (1982) 2554.
[3] COVER T.M., IEEE Trans. Electronic Computers EC-14 (1965) 326.
[4] GARDNER E., Europhys. Lett. 4 (1987) 481.
[5] AMIT D.J., GUTFREUND H. and SOMPOLINSKY H., Ann. Phys. 173 (1987) 30.
[6] MINSKY M. and PAPERT S., Perceptrons (MIT Press, Cambridge, 1988).
[7] KRAUTH W. and MÉZARD M., J. Phys. A 20 (1987) L745.
[8] RUJAN P., Preprint (1992).
[9] OPPER M. and HAUSSLER D., Phys. Rev. Lett. 66 (1991) 2677.
[10] WATKIN T.L.H., RAU A. and BIEHL M., University of Oxford Preprint (1992).
[11] GALLANT S.I., IEEE Trans. Neural Networks 1 (1990) 179.
[12] NABUTOVSKY D. and DOMANY E., Neural Computation 3 (1991) 604.
[13] SOLLA S.A., Neural Networks: from Models to Applications, L. Personnaz and G. Dreyfus Eds. (I.D.S.E.T., Paris, 1989) p. 168.
[14] PERETTO P., Neural Networks 1 (1988) 309.
[15] WIDROW B. and HOFF M.E., WESCON Convention, Report IV (1960) 96.
[16] DIEDERICH S. and OPPER M., Phys. Rev. Lett. 58 (1987) 949.
[17] SEEWER S. and RUJAN P., J. Phys. A: Math. Gen. 25 (1992) L505.
[18] LE CUN Y., Cognitiva '85 (1985) 599.
[19] KRAUTH W., NADAL J.-P. and MÉZARD M., J.