HAL Id: jpa-00246731
https://hal.archives-ouvertes.fr/jpa-00246731
Submitted on 1 Jan 1993
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Learning algorithms for perceptrons from statistical physics
Mirta Gordon, Pierre Peretto, Dominique Berchier
To cite this version:
Mirta Gordon, Pierre Peretto, Dominique Berchier. Learning algorithms for perceptrons from statistical physics. Journal de Physique I, EDP Sciences, 1993, 3 (2), pp.377-387. 10.1051/jp1:1993140. jpa-00246731
Classification
Physics Abstracts 05.50 02.70 87.10

Learning algorithms for perceptrons from statistical physics

Mirta B. Gordon(1)(*), Pierre Peretto(2) and Dominique Berchier(1)(**)

(1) DRFMC/SPSMS, CEN Grenoble, 85X, 38041 Grenoble Cedex, France
(2) DRFMC/SP2M, CEN Grenoble, 85X, 38041 Grenoble Cedex, France

(Received 22 May 1992, accepted in final form 23 July 1992)
Résumé. We derive learning algorithms for perceptrons from statistical-mechanics considerations. Thermodynamic quantities are taken as cost functions, from which gradient dynamics yield the synaptic efficacies that learn the training set. The rules thus obtained fall into two categories, according to the statistics, Boltzmann or Fermi, used to derive the cost functions. In the limits of zero or infinite temperature, most of the rules found tend towards known algorithms, but at finite temperature new strategies appear, which minimize the number of errors on the training set.
Abstract. Learning algorithms for perceptrons are deduced from statistical mechanics. Thermodynamical quantities are used as cost functions which may be extremalized by gradient dynamics to find the synaptic efficacies that store the learning set of patterns. The learning rules so obtained are classified in two categories, following the statistics used to derive the cost functions, namely, Boltzmann statistics and Fermi statistics. In the limits of zero or infinite temperatures some of the rules behave like already known algorithms, but new strategies for learning are obtained at finite temperatures, which minimize the number of errors on the training set.
I met Rammal for the first time in the DEA programme "Matière et Rayonnement" of the Université de Grenoble. I saw him again quite often afterwards, and I owe to him, in part, my interest in neural networks. Indeed, at the end of 1982, R. Maynard received from P. W. Anderson two preprints, one by Anderson himself on the dynamics of the prebiotic soup, and the other by Hopfield on the Hebbian model of memory in neural networks. He passed them on to me and, after having programmed the dynamics of Anderson's soup, I decided to involve myself further in the theory of neural networks. So, during 1983, R. Maynard, R. Rammal and I formed a small group which met some ten times and in which I presented the ideas that were beginning to emerge in this field. The broad culture of Rammal and Maynard made these meetings moments of intellectual excitement, which are the main reward of a researcher's life. I hope that Rammal would have appreciated the short review article we propose below. I believe he would.
P. Peretto

(*) Also with C.N.R.S.
(**) Supported by Sunburst-Fonds, Board of the Swiss Federal Institutes of Technology, Zürich.
1. Introduction.

One of the present interests in neural networks arises from the fact that they are endowed with generalization capabilities, that is, they are able to extract, from a set of examples (the training set of patterns), the underlying rule that generates them. If good generalization is achieved, the network should give a correct output to an unfamiliar input, an input not belonging to the training set.

The interactions $J_{ij}$ between neurons $j$ and $i$, $1 \le i, j \le N$, and the thresholds $J_{i0}$, fully characterize a neural network of $N$ neurons. Learning consists in modifying these interactions so as to make the response of the network to the patterns of the learning set as close as possible to the corresponding expected outputs. As usual, we shall write $\xi_i$ for the state of neuron $i$, and $\mu$ for the labels of the patterns in the training
set, $1 \le \mu \le P$.

A number of learning algorithms has already been put forward for neural nets [1], most of them stemming from the seminal idea of Hebb, who pointed out the role of correlations of activities in the learning dynamics. However, the original Hebbian law, in which learning a new pattern $\mu$ introduces a modification of the synaptic weights given by [2]:

$$\Delta J_{ij}(\mu) = \varepsilon \, \xi_i^\mu \xi_j^\mu, \qquad \varepsilon > 0 \qquad (1.1)$$

is not optimal, in the sense that it neither allows one to reach the maximal (theoretical) capacity [3, 4] of the network for random patterns, nor is it able to learn without errors when $P$, the number of patterns in the learning set, scales with the number of neurons [5].
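As a toy illustration (not from the paper), rule (1.1) in its perceptron form, with the expected output $\zeta^\mu$ in place of the postsynaptic state, can be sketched in a few lines; the two patterns and the value of $\varepsilon$ are arbitrary choices:

```python
import numpy as np

# Two toy patterns (rows), with their expected outputs.
xi = np.array([[1, -1, 1],
               [1,  1, -1]])        # xi_i^mu, mu = 0, 1
zeta = np.array([1, -1])            # zeta^mu
eps = 1.0                           # learning rate, any eps > 0

# Hebbian rule (1.1), perceptron version: Delta J_i = eps * xi_i^mu * zeta^mu
J = eps * (zeta[:, None] * xi).sum(axis=0)

outputs = np.sign(xi @ J)           # network response to each stored pattern
```

With few mutually uncorrelated patterns the Hebbian couplings already reproduce the expected outputs; it is only when $P$ grows with $N$ that the crosstalk between patterns causes errors.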
Attention then focused on the simplest network one can imagine: the perceptron, which consists of $N$ input neurons acting, through a set of interactions $J_i$, onto a unique output unit. A specific algorithm, namely the Perceptron algorithm, was devised for this type of network [6]. It modifies the synaptic efficacies according to (1.1), but taking into account only those patterns the network gives the wrong answer to. The training set is iteratively tested and the $J_i$ modified until all the patterns are well learnt. It has been proved [6] that if the learning set is linearly separable, that is, if a set of interactions $J_i$ exists such that the response

$$\sigma(\mu) = \mathrm{Sign}\left( \sum_{i=0}^{N} J_i \xi_i^\mu \right) \qquad (1.2)$$

coincides with the expected response $\zeta^\mu$ for all the patterns of the learning set, then the algorithm finds these interactions. In (1.2) we have introduced, as usual, a threshold unit $\xi_0 = -1$, associated with $J_0$, the threshold value. The Perceptron algorithm
is not very efficient, because it only allows for marginal stabilities, and several new learning rules have been proposed which both speed up the search times and improve the stability of solutions with respect to small perturbations of the input signals and/or interactions [7, 8].
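A minimal sketch of the Perceptron algorithm as just described, where only misclassified patterns trigger the Hebbian correction; the OR-like training set, the learning rate and the sweep limit are arbitrary choices:

```python
import numpy as np

def perceptron_learn(xi, zeta, eps=1.0, max_sweeps=100):
    """Perceptron algorithm: apply the Hebbian update (1.1) only to
    patterns the network currently misclassifies."""
    P, N = xi.shape
    J = np.zeros(N)
    for _ in range(max_sweeps):
        errors = 0
        for mu in range(P):
            if zeta[mu] * (xi[mu] @ J) <= 0:     # wrong (or marginal) answer
                J += eps * xi[mu] * zeta[mu]     # Hebbian correction
                errors += 1
        if errors == 0:                          # all patterns well learnt
            return J, True
    return J, False

# A linearly separable toy set; column 0 is the threshold unit xi_0 = -1.
xi = np.array([[-1, 1, 1], [-1, 1, -1], [-1, -1, 1], [-1, -1, -1]])
zeta = np.array([1, 1, 1, -1])                   # OR-like target
J, converged = perceptron_learn(xi, zeta)
```

For a linearly separable set the convergence theorem of [6] guarantees that the loop terminates; the solution found, however, may have only marginal stabilities.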
All these algorithms are able to find a solution when it exists, and it is believed that the best generalization rates correspond to the highest stabilities. Recently, an algorithm that finds the best generalization rates for a perceptron has been proposed [9], but it needs, in its implementation, a network of higher complexity, namely one with an infinite number of hidden units. However, problems to be treated by neural networks are in general more complex than linearly separable ones, and the perceptron architecture is too poor for such performances as good learning and generalization [10]. Therefore, networks with more involved designs are needed.

The problem of learning in these complex architectures is difficult and not completely solved yet. Among the various attempts made so far, constructivistic algorithms seem most promising.
They proceed by adding one new cell to the network at a time, so that learning reduces to learning in the new perceptrons successively created by the construction. The problem of learning remains central. But now one is less interested in finding the solution (which anyway, at least at the stage of the intermediate perceptrons, does not exist) than in finding the best solution, which minimizes the number of errors on the learning set [11, 12]. The relation between minimizing the number of errors on the training set and the generalization performance of such constructivistic algorithms is not clear yet, but it is believed that the smaller the network built up by the learning algorithm, the better its generalization rate [13].
The strategy put forward in this paper is to find a convenient thermodynamical quantity to extremalize, like an energy or any other cost function, from which a learning algorithm may be deduced by gradient dynamics. For this goal to be achieved, we rely upon statistical mechanics. The reason for such a choice is that at finite temperature we have the possibility of accepting solutions which, although not optimal, are nevertheless acceptable. Several approaches which optimize the storing capacities of perceptrons, even in the regime with errors, are possible. We classify them in two categories, following the statistics used to derive the cost functions, namely Boltzmann statistics, which we consider in section 2, and Fermi statistics, discussed in section 3. In the interest of clarity, the learning rules are first deduced without normalizing the synaptic efficacies, the introduction of normalization being postponed to section 4. In section 5 we discuss the different learning rules obtained.

2. Cost functions generated with Boltzmann statistics.

2.1 CANONICAL FREE ENERGY MAXIMIZATION. We consider a perceptron network of $N$ input units which act on a single output via the synaptic efficacies $J_i$, $i = 0, 1, \dots, N$.
The units can only be in one of two states, +1 or -1. Each example $(\xi_1^\mu, \xi_2^\mu, \dots, \xi_N^\mu; \zeta^\mu)$ of the training set consists in the input $(\xi_1^\mu, \xi_2^\mu, \dots, \xi_N^\mu)$, which can be viewed as one of the corners of a hypercube in dimension $N$, and its expected output $\zeta^\mu$. The learning set is stored by the network, defined by its interactions $\{J_i\}$, if the stabilities $\gamma^\mu$ satisfy:

$$\gamma^\mu = \zeta^\mu \sum_{i=0}^{N} J_i \xi_i^\mu > 0 \qquad (2.1)$$

where $\mu = 1, 2, \dots, P$, that is, if all the stabilities are positive quantities.
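The stabilities of (2.1) are straightforward to evaluate; a small sketch with a hand-picked toy set (all values are illustrative, not from the paper):

```python
import numpy as np

def stabilities(J, xi, zeta):
    """Stabilities of eq. (2.1): gamma^mu = zeta^mu * sum_i J_i xi_i^mu.
    The learning set is stored iff every entry is positive."""
    return zeta * (xi @ J)

# Toy check: couplings that store both patterns of a 2-pattern set.
xi = np.array([[1, 1, -1], [1, -1, 1]])
zeta = np.array([1, -1])
J = np.array([0.0, 1.0, -1.0])
gamma = stabilities(J, xi, zeta)
stored = bool(np.all(gamma > 0))
```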
The larger $\gamma^\mu$, the higher the stability of pattern $\mu$.

To derive learning algorithms from Boltzmann statistics we consider each example as one possible state of the learning process. The energy associated with state $\mu$ is

$$E^\mu = \gamma^\mu(\{J_i\}) \qquad (2.2)$$

the stability of the network for that pattern. This definition of energy may look strange because it is lower for the less stable patterns. However, the central idea in building learning dynamics is that if we succeed in stabilizing the least stable pattern, then we are guaranteed that all the other patterns are stable as well.

We suppose that only the states of the training set are accessible to the network during learning. The probability that the network is in state $\mu$ at temperature $T = 1/\beta$ is given by the Boltzmann factor $e^{-\beta E^\mu}/Z$, where

$$Z = \sum_{\mu=1}^{P} e^{-\beta E^\mu} \qquad (2.3)$$

is the partition function. The network spends more time in those states which have lower stabilities and seem more difficult to learn. The free energy $F(\{J_i\}, T)$ of the system at temperature $T$ is given by:

$$F = -T \log Z \qquad (2.4)$$
In order to make the least stable pattern as stable as possible, the synaptic efficacies are modified so as to maximize the free energy. This may be achieved with a gradient dynamics:

$$\frac{dJ_i}{dt} = \varepsilon \frac{\partial F}{\partial J_i} = \varepsilon \sum_{\mu=1}^{P} \frac{e^{-\beta E^\mu}}{Z} \xi_i^\mu \zeta^\mu \qquad (2.5)$$

which may be written in a discrete time approximation as:

$$J_i(t+1) = J_i(t) + \Delta J_i(t), \qquad \Delta J_i(t) = \varepsilon \sum_{\mu=1}^{P} \frac{e^{-\beta E^\mu}}{Z} \xi_i^\mu \zeta^\mu, \qquad \varepsilon > 0 \qquad (2.6)$$

with $\varepsilon$ a small time step. This formula has been proposed in [14]. Each Hebbian term $\xi_i^\mu \zeta^\mu$ appears with a weight equal to the probability of the learning state, which is higher the more unstable the pattern within the learning set (Fig. 1).
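One discrete step of rule (2.6) can be sketched as follows; the temperature, learning rate and random toy set are arbitrary choices, and the shift by the minimal stability inside the exponential is only a standard numerical-stability trick that leaves the normalized weights unchanged:

```python
import numpy as np

def boltzmann_step(J, xi, zeta, beta=2.0, eps=0.1):
    """One discrete step of rule (2.6): every pattern contributes its Hebbian
    term xi_i^mu * zeta^mu weighted by exp(-beta E^mu)/Z, so the least stable
    patterns dominate at low temperature."""
    E = zeta * (xi @ J)                  # stabilities E^mu = gamma^mu, eq. (2.2)
    w = np.exp(-beta * (E - E.min()))    # shifted exponent, same weights after
    w /= w.sum()                         # Boltzmann weights e^{-beta E^mu}/Z (2.3)
    return J + eps * (w[:, None] * zeta[:, None] * xi).sum(axis=0)

rng = np.random.default_rng(1)
xi = rng.choice([-1, 1], size=(6, 10))   # 6 random patterns of 10 units
zeta = rng.choice([-1, 1], size=6)
J = boltzmann_step(np.zeros(10), xi, zeta)
```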
Rule (2.6) interpolates between the usual Hebbian rule in the limit of very high temperatures, for which $e^{-\beta E^\mu}/Z \to 1/P$, and the Minover algorithm with zero stability [7] in the limit of zero temperature, for in that limit (2.6) writes:

$$\Delta J_i(t) = \varepsilon \, \xi_i^{\mu_{\min}} \zeta^{\mu_{\min}} \qquad (2.7)$$

where $\mu_{\min}$ is the least stable pattern of the training set at time step $t$. Thus, $\mu_{\min}$ is the pattern, and the only one, that is learnt at time $t$.

2.2 VARIATIONS ON THE BOLTZMANN STATISTICS THEME.
Learning rule (2.6) has been deduced from the maximization of the free energy. However, in order to increase the stabilities of the patterns of the training set, one may try to maximize the mean energy of the system:

$$\langle E \rangle = -\frac{\partial \log Z}{\partial \beta} = \sum_{\mu} E^\mu \frac{e^{-\beta E^\mu}}{Z} \qquad (2.8)$$

If this quantity is positive at zero $T$, it is clear that all the patterns of the training set are stable, because if $T = 0$ then $\langle E \rangle$ is equal to the stability of the least stable pattern.
Fig. 1. Coefficient of $\xi_i^\mu \zeta^\mu$ vs. $E^\mu$ (in arbitrary units): (-) rule (2.6), (- -) rule (2.9).
Maximizing $\langle E \rangle$ gives the following learning rule:

$$\Delta J_i(t) = \varepsilon \sum_{\mu=1}^{P} \left[ 1 - \beta \left( E^\mu - \langle E \rangle \right) \right] \frac{e^{-\beta E^\mu}}{Z} \xi_i^\mu \zeta^\mu \qquad (2.9)$$

This rule interpolates between Hebb's rule at high $T$ and Minover at $T = 0$, exactly in the same way as (2.6), but at finite temperatures the weights of the most stable patterns are negative: in order to learn the still unstable patterns with this rule, very stable patterns should be unlearned (Fig. 1). This strategy is similar to that of the Adaline algorithm [15, 16], in which the stabilities are forced to be all equal to 1. The Adaline rule modifies the synaptic strengths so as to minimize the following cost function:

$$H = \sum_{\mu} (E^\mu - 1)^2 \qquad (2.10)$$

thus unlearning patterns with stabilities higher than 1 and learning those with lower stabilities.

3. Cost functions generated with Fermi statistics.

3.1 GRAND POTENTIAL MAXIMIZATION. This second approach is quite different. Now we consider the $P$ learning patterns as states of energies $E^\mu$ that can be occupied or empty, like the energy states of a system of noninteracting spinless fermions. We define the occupation numbers $n^\mu = 1$ if state (pattern) $\mu$ is occupied and $n^\mu = 0$ if not. One may view $n^\mu$ as a pointer which tells us whether the response of the network to input $\mu$ is erroneous ($n^\mu = 1$) or not ($n^\mu = 0$). The energy of the system is now:

$$E = \sum_{\mu=1}^{P} E^\mu n^\mu \qquad (3.1)$$

with $E^\mu$ given, as before, by (2.2).
At zero temperature, the energy (3.1) is minimized if all the not-learned patterns $\mu$, which have $E^\mu < 0$, are occupied, i.e. have $n^\mu = 1$, and the learned patterns, having $E^\mu > 0$, are empty. The sum $\sum_{\mu=1}^{P} n^\mu$ is therefore the total number of errors made by the network on the training set.

Fig. 2. Coefficient of $\xi_i^\mu \zeta^\mu$ vs. $E^\mu$ with $\lambda = 0$ (in arbitrary units): (-) rule (3.4), (- -) rule (3.8), (...) rule (3.13).

Consider now the
system in equilibrium with a reservoir of chemical potential $\lambda$ and temperature $T = 1/\beta$. The partition function in the grand canonical ensemble is:

$$\Xi = \sum_{\{n^\mu = 0, 1\}} e^{-\beta \sum_\mu (E^\mu - \lambda) n^\mu} = \prod_{\mu=1}^{P} \left( 1 + e^{-\beta (E^\mu - \lambda)} \right) \qquad (3.2)$$

The grand potential is:

$$\Phi = -T \sum_{\mu=1}^{P} \ln \left( 1 + e^{-\beta (E^\mu - \lambda)} \right) \qquad (3.3)$$
As before, one may look for the set of interactions that maximize the potential (3.3), because indirectly this will increase the pattern stabilities, thus decreasing their population, which is the number of errors. With expression (2.2) for the energy, the resulting learning dynamics is:

$$\frac{dJ_i}{dt} = \varepsilon \frac{\partial \Phi}{\partial J_i} = \varepsilon \sum_{\mu=1}^{P} \frac{1}{2} \left( 1 - \tanh \frac{\beta (E^\mu - \lambda)}{2} \right) \xi_i^\mu \zeta^\mu \qquad (3.4)$$

The hyperbolic tangent vanishes in the limit $T \to \infty$, and for $T \to 0$ we have $\frac{1}{2}\left(1 - \tanh \frac{\beta (E^\mu - \lambda)}{2}\right) \to \frac{1}{2}\left(1 - \mathrm{sign}(E^\mu - \lambda)\right) = \theta\left(-(E^\mu - \lambda)\right)$. Therefore, this algorithm (Figs. 2 and 3) interpolates between the Hebbian rule, in the limit of high $T$, and the Perceptron algorithm with stability $\lambda$ in the limit of zero temperature, where it gives:

$$\frac{dJ_i}{dt} = \varepsilon \sum_{\mu} \theta(\lambda - E^\mu) \, \xi_i^\mu \zeta^\mu \qquad (3.5)$$

Therefore, if a solution to the problem of learning exists, the chemical potential enables one to impose a given stability.
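The weight attached to each pattern by rule (3.4) is a Fermi occupation number; a sketch with arbitrary $\beta$, $\lambda$ and learning rate, which also checks the $T \to 0$ step-function limit of (3.5):

```python
import numpy as np

def fermi_step(J, xi, zeta, beta=2.0, lam=0.5, eps=0.1):
    """One step of rule (3.4): each pattern is weighted by its Fermi
    occupation 0.5 * (1 - tanh(beta * (E^mu - lam) / 2)), i.e. by the
    probability that it is an error at chemical potential lam."""
    E = zeta * (xi @ J)
    w = 0.5 * (1.0 - np.tanh(0.5 * beta * (E - lam)))
    return J + eps * (w[:, None] * zeta[:, None] * xi).sum(axis=0)

# At very large beta (T -> 0) the weight tends to theta(lam - E) of rule (3.5).
E = np.array([-1.0, 0.49, 0.51, 2.0])
w_cold = 0.5 * (1.0 - np.tanh(0.5 * 1e6 * (E - 0.5)))
```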
Fig. 3. Coefficient of $\xi_i^\mu \zeta^\mu$ vs. $E^\mu$.

Learning rule (3.4) maximizes the probability, at temperature $T$, that the network gives the correct output to all the patterns of the learning set. This probability writes:

$$\prod_{\mu=1}^{P} \left( 1 - \langle n^\mu \rangle \right) = \prod_{\mu=1}^{P} \frac{1}{1 + e^{-\beta (E^\mu - \lambda)}} \qquad (3.6)$$

It is straightforward to verify that the maximization of the grand potential (3.3) is equivalent to maximizing the logarithm of the probability (3.6).
3.2 OTHER LEARNING RULES. Within the fermionic formulation, one can try to minimize the mean number of errors,

$$\langle n \rangle = \sum_{\mu=1}^{P} \langle n^\mu \rangle = \sum_{\mu=1}^{P} \frac{1}{2} \left( 1 - \tanh \frac{\beta (E^\mu - \lambda)}{2} \right) \qquad (3.7)$$

It follows the learning dynamics:

$$\frac{dJ_i}{dt} = -\varepsilon \frac{\partial \langle n \rangle}{\partial J_i} = \varepsilon \sum_{\mu=1}^{P} \frac{\beta}{4 \cosh^2 \left( \beta (E^\mu - \lambda)/2 \right)} \xi_i^\mu \zeta^\mu \qquad (3.8)$$

Here again, we find an algorithm that reduces to Hebb's rule at high temperature, but at finite $T$ it prescribes a learning dynamics that gives a higher weight at each time step to those patterns whose stabilities are closer to $\lambda$ (Figs. 2 and 3).
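The bell-shaped weight of rule (3.8) can be checked numerically; the parameter values and the grid are arbitrary choices:

```python
import numpy as np

def error_count_weight(E, beta=4.0, lam=0.0):
    """Weight of rule (3.8): beta / (4 cosh^2(beta (E - lam) / 2)).
    It is -d<n^mu>/dE^mu: maximal at E = lam, vanishing for |E - lam| large."""
    return beta / (4.0 * np.cosh(0.5 * beta * (E - lam)) ** 2)

E = np.linspace(-5, 5, 201)
w = error_count_weight(E)
peak_E = E[np.argmax(w)]        # the peak sits at the chemical potential
```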
This prescription, with $\lambda = 0$, is very similar to that of the $\Delta$-rule [15-17], based upon the minimization of a cost function of the form:

$$H = \sum_{\mu} \left( \zeta^\mu - \sigma(\zeta^\mu E^\mu) \right)^2 \qquad (3.9)$$

where $\sigma$ is a sigmoid function. This rule is nothing else than the popular Backpropagation algorithm [18] applied to a perceptron architecture without hidden units. Taking into account that we are dealing with binary units, which is not the usual case in networks trained with rule (3.9), and choosing $\sigma(\zeta^\mu E^\mu) = \tanh\left(\frac{\beta}{2} \zeta^\mu E^\mu\right) = \zeta^\mu \tanh\left(\frac{\beta}{2} E^\mu\right)$, yields a cost function that counts the square of the mean number of errors on each pattern at $\lambda = 0$:

$$H = \sum_{\mu} \left( 1 - \tanh \frac{\beta E^\mu}{2} \right)^2 = 4 \sum_{\mu} \left( \langle n^\mu \rangle \right)^2 \qquad (3.10)$$

Minimization of (3.10) gives

$$\frac{dJ_i}{dt} = -\varepsilon \frac{\partial H}{\partial J_i} = \varepsilon \sum_{\mu=1}^{P} \frac{\beta \left( 1 - \tanh(\beta E^\mu/2) \right)}{\cosh^2 (\beta E^\mu/2)} \xi_i^\mu \zeta^\mu \qquad (3.11)$$

Here, the weight with which the patterns are learnt is peaked around slightly negative stabilities.

Finally,
as in Boltzmann's formulation, let us consider the rule derived from the maximization of the mean energy of the system:

$$\langle E \rangle = \sum_{\mu=1}^{P} E^\mu \langle n^\mu \rangle \qquad (3.12)$$

thus trying to increase the stabilities mainly of the patterns with $\langle n^\mu \rangle \approx 1$, which are those with lower stabilities. Maximization of $\langle E \rangle$ leads to the following learning rule:

$$\frac{dJ_i}{dt} = \varepsilon \frac{\partial \langle E \rangle}{\partial J_i} = \frac{\varepsilon}{2} \sum_{\mu=1}^{P} \left( 1 - \tanh \frac{\beta E^\mu}{2} \right) \left( 1 - \frac{\beta E^\mu}{2} \left( 1 + \tanh \frac{\beta E^\mu}{2} \right) \right) \xi_i^\mu \zeta^\mu \qquad (3.13)$$
With this rule, patterns with stabilities around $-\lambda$ are given higher weights, and unlearning of patterns with large positive stabilities takes place (Figs. 2 and 3), like in Adaline-type rules.

4. Normalizing the synaptic weights.
Up to now, no attention has been paid to the range of values of the $J_i$. However, we know that it is useful to divide the stabilities by the norm of the interactions, $|J|$, to avoid their increase by a trivial overall scaling of the $J_i$ [19]. Two conventions are possible: either the full vector $J = (J_0, J_1, \dots, J_N)$, with

$$|J| = \left( \sum_{i=0}^{N} J_i^2 \right)^{1/2} \qquad (4.1.A)$$

is considered (convention A), or the threshold is excluded but is constrained to lie inside the hypercube of dimension $N$ whose corners represent all the possible input states of the network (convention B):

$$|\hat{J}| = \left( \sum_{i=1}^{N} J_i^2 \right)^{1/2}, \qquad -\sqrt{N} < J_0 < \sqrt{N} \qquad (4.1.B)$$
With convention A, the $J_i/|J|$ are the director cosines of a hyperplane through the center of the hypercube in $N + 1$ dimensions, the threshold unit providing for the supplementary dimension. With convention B, $J_0$ is the distance from the hyperplane of director cosines $J_i/|\hat{J}|$ to the center of the hypercube in dimension $N$. The definition (2.2) of the energies should be modified:

$$E^\mu = \sum_{i=0}^{N} \frac{J_i}{|J|} \xi_i^\mu \zeta^\mu \qquad (4.2.A)$$

if convention A is adopted, or

$$E^\mu = \sum_{i=1}^{N} \frac{J_i}{|\hat{J}|} \xi_i^\mu \zeta^\mu + J_0 \xi_0^\mu \zeta^\mu \qquad (4.2.B)$$

with convention B. The calculation
proceeds exactly as above, leading to:

$$\frac{dJ_i}{dt} = \varepsilon \sum_{\mu=1}^{P} \frac{e^{-\beta E^\mu}}{Z} \frac{1}{|J|} \left( \xi_i^\mu \zeta^\mu - \frac{J_i}{|J|} E^\mu \right) \qquad (4.3.A)$$

for rule (2.6) with convention A, or to

$$\frac{dJ_i}{dt} = \varepsilon \sum_{\mu=1}^{P} \frac{e^{-\beta E^\mu}}{Z} \frac{1}{|\hat{J}|} \left( \xi_i^\mu \zeta^\mu - \frac{J_i}{|\hat{J}|} \left( E^\mu - J_0 \xi_0^\mu \zeta^\mu \right) \right), \qquad i > 0 \qquad (4.3.B)$$

with convention B. In (4.3.B), a stabilizing term $-\eta J_0^3$ has been added to the threshold dynamics to make sure that the hyperplane will not escape the hypercube.
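With convention A, the normalized stabilities (4.2.A) are invariant under a global rescaling of the couplings, which is the point of the normalization; a toy check with arbitrary values:

```python
import numpy as np

def normalized_stabilities(J, xi, zeta):
    """Stabilities of eq. (4.2.A): E^mu = sum_i (J_i / |J|) xi_i^mu zeta^mu.
    Rescaling J by any positive factor leaves them unchanged."""
    return zeta * (xi @ J) / np.linalg.norm(J)

xi = np.array([[1, -1, 1], [1, 1, 1]])
zeta = np.array([1, -1])
J = np.array([1.0, -2.0, 2.0])
E1 = normalized_stabilities(J, xi, zeta)
E2 = normalized_stabilities(10.0 * J, xi, zeta)   # invariant under scaling
```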
In general, then, the effect of normalizing the stabilities is simply to replace the Hebbian factor in the learning rule without normalization by:

$$\xi_i^\mu \zeta^\mu \;\to\; \frac{1}{|J|} \left( \xi_i^\mu \zeta^\mu - \frac{J_i}{|J|} E^\mu \right) \qquad (4.4.A)$$

if convention A is adopted, or:

$$\xi_i^\mu \zeta^\mu \;\to\; \frac{1}{|\hat{J}|} \left( \xi_i^\mu \zeta^\mu - \frac{J_i}{|\hat{J}|} \left( E^\mu - J_0 \xi_0^\mu \zeta^\mu \right) \right), \qquad i > 0 \qquad (4.4.B)$$

and adding the confining term $-\eta J_0^3$ in the threshold dynamics, in the case of convention B.

5. Discussion and conclusions.
In the preceding sections we have shown that statistical mechanics provides an interesting framework wherein most of the known perceptron learning rules find their place. All the learning algorithms take the general form:

$$\Delta J_i(t) \propto \sum_{\mu} w^\mu \, \xi_i^\mu \zeta^\mu \qquad (5.1)$$

where $w^\mu$, the weight with which each pattern is learned, depends on the specific learning rule.

Now the
problem is to determine which rule is the most efficient.

Let us assume that there is a solution: a set of interactions $\{J_i\}$ exists such that the stabilities $E^\mu$ are all positive. In this case, Boltzmann and Fermi statistics give rules whose weights behave in one of two possible ways: either always positive, like in Perceptron-type rules,

$$w^\mu \sim e^{-k \beta E^\mu} \qquad (5.2)$$

with $k = 1/2$, $1$ or $2$, or like in Adaline-type rules, in which weights are positive for low stabilities and negative for high stabilities:

$$w^\mu \propto \left( 1 - f(\beta E^\mu) \right) e^{-k \beta E^\mu} \qquad (5.3)$$

where $f(\beta E^\mu) = \beta (E^\mu - \langle E \rangle)$ or $f(\beta E^\mu) = \beta E^\mu$, depending on whether we consider Boltzmann or Fermi statistics.

Determining the speed of convergence is another question. As learning starts, in general from tabula rasa or from random synaptic efficacies, some patterns are ill-classified. Convergence might be accelerated if those ill-classified patterns exert a strong force on the hyperplane determined by the $\{J_i\}$, so as to place themselves on the correct side with respect to it. In other words, the weights $w^\mu$ should be larger the more negative the stabilities. This is achieved by rules (2.5), (2.9), (3.4) and (3.8), or their normalized versions like (4.3).
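The qualitative behaviours just listed can be made concrete by evaluating the (unnormalized) weights of rules (2.6), (3.4) and (3.8) on a grid of stabilities; $\beta$, $\lambda = 0$ and the grid are arbitrary choices:

```python
import numpy as np

# Unnormalized pattern weights w(E) for three of the rules, with lam = 0:
E = np.linspace(-4, 4, 9)
beta = 2.0
w_26 = np.exp(-beta * E)                           # rule (2.6), up to 1/Z
w_34 = 0.5 * (1 - np.tanh(0.5 * beta * E))         # rule (3.4)
w_38 = beta / (4 * np.cosh(0.5 * beta * E) ** 2)   # rule (3.8)

# Perceptron-type weights stay positive and keep pushing on very unstable
# patterns, whereas the weight of (3.8) dies off on both sides of E = lam.
```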
However, in general, there is no solution to the learning problem within the perceptron architecture. Then the best we can do is to devise a rule that minimizes the number of errors. Suppose that such a solution $\{J_i\}$ has been found. Then there necessarily exist ill-classified patterns. Let us assume on the other hand that the temperature is so low that the action of patterns with positive stabilities vanishes. If the weights $w^\mu$ do not vanish for negative $E^\mu$, as may be the case in all the rules but (3.8), there will always exist a non-vanishing force exerted on the hyperplane, whatever its position, and the learning procedure will never end.

The conclusion is that solutions that minimize the number of errors can only be reached by using weights which vanish for large positive and large negative stabilities, like in algorithm (3.8) or in backpropagation (3.11). However, it may take an infinitely long time to reach a solution this way, since the force on the hyperplane exerted by patterns with strongly negative $E^\mu$ is vanishingly small. A compromise is then necessary, which combines this type of algorithm with those of perceptron type to speed up the pace of the convergence. Several ways of tackling this problem are possible:

1) The simplest idea is to linearly combine both algorithms, which in the case of $\lambda = 0$ writes:

$$\frac{dJ_i}{dt} = \varepsilon \sum_{\mu=1}^{P} \left( \frac{\beta}{4 \cosh^2 (\beta E^\mu / 2)} + a \frac{e^{-\beta E^\mu}}{Z} \right) \xi_i^\mu \zeta^\mu \qquad (5.4)$$

where the parameter $a$ has to be large enough to actually decrease the convergence times. Learning dynamics (5.4) looks rather involved, but it is the price to pay when the space of solutions is no more convex.

2)
Another way to increase the weight of ill-classified patterns is to use the rule (3.8) and make it asymmetrical by choosing two different temperatures, according to the sign of the stability:

$$\frac{dJ_i}{dt} = \varepsilon \sum_{\mu=1}^{P} \frac{\beta^\mu}{4 \cosh^2 (\beta^\mu E^\mu / 2)} \xi_i^\mu \zeta^\mu \qquad (5.5)$$

with $\beta^\mu = \beta_+$ if $E^\mu > 0$ and $\beta^\mu = \beta_-$ if $E^\mu < 0$, $\beta_+ > \beta_-$. The determination of the parameters $\beta$, $\lambda$ and $a$, and the results of simulations carried out using prescriptions (5.4) and (5.5), are beyond the scope of this article and will be reported elsewhere.

3) Finally,
it may be a good strategy to use the different types of learning algorithms in turn, starting with Minover or any other rule that ensures high stabilities, which will swiftly give the solution to the problem if the solution exists, and then finishing the learning session with a combination such as (5.4) or (5.5) if the solution without errors was not found. The problem is to find the criteria that will tell when to swap from one type of rule to another. This may prove to be a difficult problem.
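The staged strategy of point 3 can be sketched as follows; the schedule, the parameters and the toy training set are all arbitrary choices, with the two phases using the weights of rules (2.6) and (5.4):

```python
import numpy as np

def train_staged(xi, zeta, beta=2.0, eps=0.05, sweeps=200, a=1.0):
    """Staged strategy of point 3 (a sketch, with an arbitrary schedule):
    first run perceptron-type updates with Boltzmann weights, rule (2.6);
    if errors remain, switch to the mixed rule (5.4)."""
    P, N = xi.shape
    J = np.zeros(N)

    def step(J, mixed):
        E = zeta * (xi @ J)
        w = np.exp(-beta * (E - E.min()))
        w /= w.sum()                         # weights of rule (2.6)
        if mixed:                            # add the error-peaked term of (5.4)
            w = beta / (4 * np.cosh(0.5 * beta * E) ** 2) + a * w
        return J + eps * (w[:, None] * zeta[:, None] * xi).sum(axis=0)

    for mixed in (False, True):
        for _ in range(sweeps):
            J = step(J, mixed)
            if np.all(zeta * (xi @ J) > 0):  # no errors left: stop
                return J, True
        # fall through to the mixed rule if errors remain
    return J, False

# Separable toy set (column 0 is the threshold unit xi_0 = -1).
xi = np.array([[-1, 1, 1], [-1, 1, -1], [-1, -1, 1], [-1, -1, -1]])
zeta = np.array([1, 1, 1, -1])
J, ok = train_staged(xi, zeta)
```

A real implementation would also need the stopping criterion discussed above for deciding that the error-free solution does not exist, which this sketch replaces by a fixed sweep budget.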
References

[1] ABBOTT L.F., Network 1 (1990) 105.
[2] HOPFIELD J.J., Proc. Natl. Acad. Sci. U.S.A. 79 (1982) 2554.
[3] COVER T.M., IEEE Trans. Electronic Computers EC-14 (1965) 326.
[4] GARDNER E., Europhys. Lett. 4 (1987) 481.
[5] AMIT D.J., GUTFREUND H. and SOMPOLINSKY H., Ann. Phys. 173 (1987) 30.
[6] MINSKY M. and PAPERT S., Perceptrons (MIT Press, Cambridge, 1988).
[7] KRAUTH W. and MÉZARD M., J. Phys. A 20 (1987) L745.
[8] RUJAN P., Preprint (1992).
[9] OPPER M. and HAUSSLER D., Phys. Rev. Lett. 66 (1991) 2677.
[10] WATKIN T.L.H., RAU A. and BIEHL M., University of Oxford Preprint (1992).
[11] GALLANT S.I., IEEE Trans. Neural Networks 1 (1990) 179.
[12] NABUTOVSKY D. and DOMANY E., Neural Computation 3 (1991) 604.
[13] SOLLA S.A., Neural Networks: from Models to Applications, L. Personnaz and G. Dreyfus Eds. (I.D.S.E.T., Paris, 1989) p. 168.
[14] PERETTO P., Neural Networks 1 (1988) 309.
[15] WIDROW B. and HOFF M.E., WESCON Convention, Report IV (1960) 96.
[16] DIEDERICH S. and OPPER M., Phys. Rev. Lett. 58 (1987) 949.
[17] SEEWER S. and RUJAN P., J. Phys. A: Math. Gen. 25 (1992) L505.
[18] LE CUN Y., Cognitiva '85 (1985) 599.
[19] KRAUTH W., NADAL J.-P. and MÉZARD M., J.