HAL Id: jpa-00247330
https://hal.archives-ouvertes.fr/jpa-00247330
Submitted on 1 Jan 1997
HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
Replica Symmetry Breaking and the Kuhn-Tucker Cavity Method in Simple and Multilayer Perceptrons
F. Gerl, U. Krey
To cite this version:
F. Gerl, U. Krey. Replica Symmetry Breaking and the Kuhn-Tucker Cavity Method in Sim- ple and Multilayer Perceptrons. Journal de Physique I, EDP Sciences, 1997, 7 (2), pp.303-327.
�10.1051/jp1:1997147�. �jpa-00247330�
Replica Symlnetry Breaking and the Kuhn-Tucker Cavity Method in Silnple and Multilayer Perceptrons (*)
F. Gerl
(~)
and U.Krey
(~>**)(~) Institut ffiir Theoretische
Physik
der UniversitstG0ttingen,
Bunsenstrasse 9, 30373G6ttingen, Germany
(~) Institut fur
Physik
II der UniversitstRegensburg,
Universitfitsstrasse31,
93040Regensburg.
Germany
(Recei~.ed
loJul,v
1995, re~.ised 29 Noiember 1995,accepted
21 October1996)
PACS.87.lo.+e
General,
theoretical. and mathematicalbiophysics (including logic
ofbiosystems,
quantumbiology,
and relevant aspects ofthermodynamics,
informationtheory, cybernetics,
and bionicsPACS.02.50.-r
Probability
theory, stochastic processes, and statisticsAbstract. Within a Kuhn-Tucker cavity method introduced in a former paper,
we study optimal
stability )earning
forsituations,
where in thereplica
formalism thereplica
symmetry may bebroken, namely (I)
the case of asimple
perceptron above the criticalloading,
and(iii
the case oftwo-layer AND-perceptrons,
if one learns with maximalstability.
We find that the deviation of ourcavity
solution from thereplica
symmetric one in these cases is a clear indication of thenecessity
ofreplica
symmetrybreaking.
In any case the cavity solution tends to underestimate the storagecapabilities
of the networks.1. Introduction
In a recent paper
[ii,
we introduced a new kind ofcavity method,
with which we could solve thelearning problem
for perceptrons withQ-
andQ'
state Potts modelinput
and output neurons.In this
method,
the Kuhn-Tuckerconditions,
which lead tooptimal stability
in AdaTrontype learning
processes, have been built into thecavity
formulation. Insubsequent
papers weextended this method to the
problem
of thegeneralization ability
of a perceptron trained foroptimal stability [2j,
and to theproblem
ofstoring
of correlatedpatterns [3j.
In the
present
paper, weapply
our method to cases where in thereplica
formalism thereplica symmetry
isbroken, namely (I)
toperceptrons
withIsing
neurons above the criticalloading
and
(it)
totwo-layer
ANDperceptrons.
Cavity
ideas were firstapplied
to neural networksby
MAzard [4]; and Kinzel andOpper
[5].The
approach
ofGriniasty [6j,
whose work will be discussed below incomparison
with our ownfindings, employs
ideas introducedby
MAzard. Our formulation of thecavity method,
on the otherhand,
wasinspired by
thejust-mentioned
work of Kinzel andOpper.
In anotheroriginal approach, Wong
[7jemploys
ideas which are related to ours andGriniasty's.
(*)
based on the PhD thesis of F. Gerl,Regensburg,
1994(**)
Author forcorrespondence (e-mail: Uwe.Kreyflphysik.uni-regensburg.de)
©
Les(ditions
dePhysique
19972.
Simple Perceptrons
above the CriticalLoading
We use the definitions as in
[2],
which aresimpler
than those we had to introduce for the Potts model case[lj.
Ourperceptron
has Ninput
neurons k=
I,..., N,
whosepossible
states areSk "
+i,
and oneoutput
neurons',
also withpossible
states +I. Thecouplings leading
from theinput
neurons to theoutput
neuron are assumed to be realnumbers,
which are collected into thecoupling
vector J :=(Ji,
,
JN)
with the Euclideanlength
L :=(J(
=(J)
+.. +J()~/~,
which iskept fixed,
while the componentsJk
areadapted
to certain tasksby "training processes"
(see below).
The relation betweeninput
andoutput
isIN
s'
=
sign ~j Jk
Sk,
(i)
k=1
which can be considered as a
binary
classification of thepossible inputs,
or as answers onqitestions.
Now we assume that there is a
training
set of pinput-output pairs,
where theinputs
are("
.=(((,.. ,(()
for ~=
i,..
, p, and the
corresponding
desiredoutputs
are(P.
Here and in thefollowing,
unless otherwisestated,
we assume that allinput components
and theoutputs
are
independent
randomnumbers,
which take the values +i withequal probability.
The
optimization task,
which theperceptron
then has to fulfill in course of thetraining by adaptation
of the components of thecoupling
vectorJ,
is the minimization of aHamiltonian,
whichperforms
aweighted
count of bad classifications(see below)
of thetraining examples.
Among
these Hamiltonians are that of Gardner and Derrida[8j,
whichsimply
counts the badclassifications, namely
it :=
~j V(EP)
:=
~j
@(~(E~ IL) ). (2)
» v
Here
@(x)
= i for x >0,
= 0else,
and E" is defined as usual as the "oriented field"acting
on
pattern
~,N
l~~ "
(~ ~ ~k~~
,
(~)
k=I
while L
=
(J(
as above. Bad classifications arethose,
where E~IL
is <~c, I.e. for
~ > 0
they
are not
necessarily
wrong but lack apresclibed
amount ofstability,
which is measuredby
~.There are p =. n N such
"questions
withprescribed answers",
I.e. ~ =i,...,
p, and it is assumed that theloading parameter
a remainsfinite,
while N ~ cc and p ~ cc.Furthermore,
thestability
parameter ~ is to bemaximized,
if error-free classification(I.
e. ii =0)
ispossible.
Gardner and Derrida [8j evaluated not
only
the number of errors li above the criticalloading nc(~c),
where forpositive
~c error-free classification is nolonger possible,
butthey
also tried to evaluate the so-calledAlmeida-Thoitlewline aAT(~).
Above thisline,
within thereplica formalism, replica symmetry breaking (RSB)
is necessary.Surprisingly,
Gardner and Derrida found in [8] that OAT > ac for ~ > 0:However,
this was due to a subtleintegration
errorrecently
discoveredby
Bouten[9],
whoproved
that OAT(~)
=ac(~),
asexpected.
Bouten alsoshowed that
replica symmetry
isalways broken,
if the distribution of local fields possesses a gap. So theseresults,
on which we will commentlater, provide
a test for ourcavity approach.
Other
Hamiltonians,
on which we comment at a laterstage,
are(see [10, iii)
thePerceptron
function and the AdaTron function withVIE~)
=
l~ (E~/L)l~
@l~(E~/L)1, (4)
with x
= i and z
=
2, respectively,
2. I. REPRESENTATION oF OPTIMAL COUPLINGS. Instead of
minimizing
theweighted
errorrate
f(a, ~)
:=ii/p,
one can of course also maximize~(a, Ii
for thegiven training
set. Sincewe are above the critical
loading ac(~), f
p of thetraining examples
arebadly
classified.There is however an
exponentially large
numberfit
ofpartitions
of thetraining
set into a"good
fraction" and a "waste fraction" of sizeii Ii
p andf
p,respectively,
from which one has to choose theoptimal
one.Namely, fit
ctexp(c p),
with c =f
Inf (I f) In(I f).
We nevertheless assume, that every one of these combinations has been trained for
optimal stability,
e.g. with the AdaTronalgorithm [12j.
Theoptimal perceptron
with error ratef
isthen
given by
thepartition leading
to maximal ~.The
couplings
of aperceptron
trained foroptimal stability
canalways
beexpressed
in the form[12j
j~
=~ zv(»jv (~)
p/ k
»el(i-jjpj
with the so-called
"embedding strengths"
x" ofpatterns,
which do notbelong
to thefp badly
classified
patterns.
As can be shownusing Lagrangian multipliers [1,12j,
theseembedding strengths
have to fulfill the so called Kuhn-Tuckerconditions,
see below. Without restriction ofgenerality,
these areusually
formulatedby fixing
thelength
L of thecoupling
vector J insuch a way that the
stability
limit for ~ > 0corresponds
to E"=
i,
I.e. L=
~~~
With thisconvention,
which wealways
use in thefollowing,
unless otherwisestated,
the Kuhn-Tuckerconditions are
:
either
(x"
> 0 and E"=
i)
or(x"
= 0 and E" >1). (6)
In
fact,
the AdaTronalgorithm (without overrelaxation)
da"
=
max(-x~,
iE") (sequentially
or inparallel) (7) simply
fixes the x"repeatedly
to values which fulfill(6):
If it converges, the conditions arenecessarily obeyed.
Using
the "oriented correlation" matrixB~~
=
("(~ ~j ((([ (8)
N
~
and the definitions
(3) and(5),
we can write for the oriented field E"E"
=
~j
B""x".(9)
v
With the Kuhn-Tucker conditions we have
finally
~2 jj~p~pu~u_ ~jj~v (i~)
N
P>" P
The basic idea of the
cavity
method here is, to add apattern,
I.e. the"cavity",
to a set ofperfectly
trainedpatterns. By calculating
the necessaryadjustments
to embed thispattern
wegain
valuable information about the wholesystem.
As iniii,
we therefore add a new"question
with desired answer"((°, (°)
to thetraining
set,assuming
onesimple groundstate.
We note that the distribution of the oriented fields
E° acting
on it, before any furtheradaptation
has beenperformed,
which in thefollowing always
will be indicatedby
the',
isGaussian with average 0 and variance
(J(~
=L~
=
~~~.
Of course the error ratef
has toremain constant.
Therefore,
as will be seenbelow,
the beststrategy is,
togive
up and add thenew
pattern
to the "waste fraction of thetraining set",
if the oriented fieldf° acting
on(°
issmaller than a certain number Z < 1. For
self-consistency,
Z is determinedby
where t :=
f° IL
and z :=Z/L.
On the otherhand,
iff°
is >1,
then it is not necessary toembed
f°
in thecouplings,
since itcan be added to the set of those
correctly
classifiedtraining patterns,
which need noexplicit embedding (see [13j) by
the AdaTronalgorithm, [12j. Thus, only
for Z <f°
<I,
I.e.z < t < ~, the new pattern must be embedded in the
couplings,
and theimplementation strength
xP of the otherpatterns
must be correctedby
dxP(see below),
to compensate for the influence of the new pattern.
As we have
just
seen; theparallel
AdaTronalgorithm
would in a first steptry
to embed(°;
if necessary, with the "bare"
embedding
X°= i
f°.
Thisgenerates
a
perturbation BP°X°
of the pattern ~. In a second
parallel
step, all those patterns ~, which are storedexplicitly,
then have to
respond by
d~"=
-B"°x°
to the disturbationby x°,
because the Kuhn-Tucker conditions still have to be fulfilled. Atpattern
0 these correctionsgenerate
a response fieldgx°
=
~j B°~dx~
=
~j (B°~)~x°
,
(12)
»,i~M>oj v.uM>ol
which reditces the effect of the AdaTron step with
x°. Therefore,
one has to enhancex°
= i
f°
by
anamplification factor I/(I
+g) (> I).
Now the(B°")~
areI/N
on average, see(8).
Therefore one
gets immediately
~j (B°")~
= a
P(~~
>0)
=: aeR.(13)
M,j~H>0j
aeR is the
percentage
of exhausteddegrees
offreedom, I.e.,
if pattern 0 is astypical
as the other randompatterns
~ =l,.
, p, one has to
postulate
~
g = -ae~ = -a
P(Z
<E°
<ii
= -a
/
Dt2
(-11. (14)
z
Note that we have constructed our
algorithm
in such a way that the further correctionsteps,
which are necessary toregain
the Kuhn-flicker conditionsexactly,
do notchange
the response atpattern
0 in thethermodynamic
limit ~v.hen convergence is achieved. Aproof
isgiven
in theAppendix,
where we also show that in this limit one can assume that the x~ and BP" arestatistically independent.
With our
approach,
the finalembedding strength
for(°
is thengiven by (i f°) /(i
+g).
Furthermore,
~v.e nowidentify
the distribution ofembedding strengths
of allpatterns
with that ofpattern (°. Putting
the Gaussian distribution of thecavity
fieldfi°,
felt beforetraining,
into
(10) gives
L~
~= oL
/
Vt ~(is)
i + g With L
=
~~~
thisimplies
~
i =
~ /
Dt(~ t)~ (16)
+ §
After
multiplication
with i + g, thisfinally
leads to our '~Kuhn-Tuckercavity
result"~ ~
l
= a
/
Dt+ a~c
/
Dt(~ t) (17)
~
= a
(1
+~2j
vi + °~(e-~~/2 e-z~/2j j~~)
@
This result will now be
compared
with that of thecavity approaches
ofGriniasty [6j,
andlvong [7j.
Their different result isequivalent
to thereplica
formalism in thereplica-symmetric approximation [8,10j;
therefore ~v.e use the suffix "RS" toindicati
their result. We havealready
shown in[ii
that below ac, wherereplica-symmetry
is exact, our Kuhn-Tuckercavity approach
and the RS
approach
agree; however in the presentsituation,
where a is > ac,they disagree.
When
Griniasty
derives the constant ofintegration
in[6j,
he uses asimple
method which inall cases known to us
gives
the correct RS result: The reaction factor g is assumed tovanish,
while at the same time also the B"U with ~
#
v areneglected.
With these twoneglections
oneobtains instead of
equation (10)
~ ~
(19)
~2
=
X(s ~RS
XRSfit
~~~~~~
N
M
Thus,
instead ofequations (17, 18),
the "RS"cavity
result would be~
l = aRs Dt
(~ t)~
Equation (20),
which is identical with the result obtainedby Griniasty
[6j orWong [7j,
agrees with the result of thereplica
calculation of Gardner and Derrida[8,10j
from thereplica-
symmetric approximation;
theexpression
abbreviatedby
M inequation (20) yields
the differ-ence between our acav
(=
a obtained fromEq. (18))
and aRs. Because of z < ~, it is M < 0 and therefore aRs > acav. For z ~ -cc, I.e. forf
~0,
one has M=
0,
and thus for a < acit is aRs
= ncav, as
already
mentioned.Although
the numerical resultsdiffer,
there is aninteresting
formalrelationship
between the basicequations
inWong's approach
[7j and the onepresented
here:equation (6)
in [7jagrees
formally
with ourequation (16),
and our reactionstrength
gcorresponds
to -a X in [7]. However -a x in [7j differs from our g, which isgiven by equation (RI, by
a termcorresponding just
to theexpression
JzI inequation (20).
The difference arises from a d-function contribution toI'(t)
at t = z in[7j,
wherelit)
and t in [7j are the oriented fieldsafter training
andbefore training, respectively. Probably
thisdifference,
whichonly
comes intoplay
above ac, where M is# 0,
is relevant withrespect
to the combinatorialexplosion,
which conflicts with theassumption
of aunique optimum
made in all theapproaches.
Research on thisproblem
is in progress.In
Figure i,
for error rates off
= 0.2 and0.02,
thelearning capacities a(~, f)
= acav
and aRs,
respectively,
as obtained from our Kuhn-flickercavity theory, equation (17),
andOc~
3 aRs
2 ~ °~
i
J=0 02
0
0 2
Fig.
I. For two different error ratesf
the storage capacity o ispresented
as a function of the stability ~ for both estimates(17)
and(20).
with the
replica-symmetric approximation (20) respectively,
areplotted against
thestability
~.
Obviously,
thelearning capacity
obtained with the Kuhn-flickercavity theory
islower, particularly
at small values of ~, i.e. for ~= 0 and
f
= 0.2, acav is
only 10/3,
whereas aRs is 15.53. This means at the same time that forgiven
~ and a, our error ratef
would belarger
than that obtained with
equation (20).
This is astrong
hint that theassumption
of aunique optimum
for the distribution of thosepatterns,
which areput
towaste,
is wrong.Therefore,
one cannot say in
advance,
whichapproach
is better: Rather what has beengained
is thefollowing:
Since bothapproaches
agree belowac(~, f
=
0),
but not forf
>0,
we can use thedifferent results as a criterion for the
necessity
ofreplica-symmetry breaking
forf
>0,
which agrees with the recentrigorous proof
of[9j.
2.2. COMPARISON wiTH A ONE-STEP-RSB CALCULATION.
Majer
et al.[llj
have per-formed a calculation above ac
(~, f
=
0)
within aone-step replica-symmetry-breaking approxi-
mation. From
equation (2), they
obtained thefollowing
result for the free energy/E)
"alla, ~)l
~~~ fi~iw 2x(1 ~wAq)
~~~~~2~u~~~~
+
°
/ Dz0
In/ ~ Dzi
exp
'°
(A
zifi )~
ma - 2
/q
+
exPi-Wxi4~i~R~1
+ 4~1ii
,
(211
with A
= ~
z0@
andAq
= 1- q0. The parameters qo, w and x have to be chosen suchthat the number of errors is maximized. In the limit qo ~ 0 one
regains
the RS result.In
Figure
2 thestability
~ isplotted against
the error ratef
for three different values ofa.
Obviously
there isonly
a small difference between the results of the RS and thei-step
RSBcalculation,
in contrast to ourcavity results,
which differconsiderably
from bothreplica calculations,
I.e. with ourestimate,
thestability
increases much moreslowly
withincreasing
a > ac.
3.0
Ncav
~~s ~/~
KRSB ,I'
/"
~
~/
~~
2.4
~/"
~/~
/
~
~/
~'
~~
II l'~
-~
_~
l-B ~/~
-~~
_~~
~
-~ ~
if
-'~
_~' l.2
--~
-~
-~ ~'
~~~~~
a=2
0.0
0.0 0.1 0.2 0.3 0.4 0.5
f
Fig.
2. Thestability
~ ispresented
as a function of the error ratef
fora =
2.0,
0.8 and 0.4 Resultsare shown for the
cavity-method
and thereplica
calculation in RS and1-step-RSB
approximation.For
f
<I,
we can compare these different estimates withsimple
simulations: For a < 2one can train
perceptrons
tooptimal stability by
means of the AdaTronalgorithm.
If one thenskips
thepattern
with thelargest embedding strength,
I.e. the one which was most difficult to store, and re-learns theremaining patterns,
onegets
an enhancedstability,
which agrees within the error limit with thereplica calculation,
and not with thecavity results,
as we found in the simulations.Thus,
forf
>0,
our Kuhn-flickercavity
methodyields
a non-sufficientapproximation
for~(a, f).
In thefollowing
wetry
to find the reasons for thediscrepancy
and to estimate at thesame time the
quality
of theI-step
RSB calculation. For this purpose, we need the distribution of the oriented fieldst,
whichaccording
toMajer
et al.[iii
iswhere V has
been
efined forsuch
that
the isThe
arameters qo>w
and x aredetermined for given a
and
~from
equation
(22) he ofthe
local fields canbe
determined; in particulara
b-peak
at t
= i is btained, which representsthe
patterns, whichare
the training.
However,our interest
is in the field-distribution
before the dditional training.This tribution can be obtained by
in
[6jto
the presentcase.
ccording to [6j, after
addition of the pattern f°, the
quantities zo/fi
the local orientedfieldsfelt before and
after
the
additionaltraining, respectively.
The
exact valueof
lo esultsfroma
ompromise between the ncreasethe
nergyV(lo
representing
the increase in energy of theii f)p patterns,
which had been stored before the addition of the new pattern.Thus we can determine the field before the corrections
by simply eliminating V(lo)
andlo
in(22):
With the abbreviations uo"
zo/fi
and vi"
zi/li,
theintegrand
inequation (22)
is
+
exP(-Wx) 9(~ /&
(Ho +
vi))
+9(uo
+ vi)ld(t
(Ho +
vi))
~~~~~~~~2j~(~~~~~~~~
~~~ ~~~~~~~
~~~~~ ~~~~~
+
exp( -wx)@(~ /& t)
+9(t ~) (23)
Here the first term describes the
patterns,
which have beensuccessfully
embedded(I.e.
with V =0),
the second termthose,
which arebadly
classified and put to the waste(I.e.
withV = I and
lo zo@ zi@
"0),
and the final term the patterns which are stored withoutembedding.
The denominatorfit
in(23)
is theintegral
over all fields t in(23)
and serves for normalization.Thus,
we have nolonger
a Gaussian field-distribution beforelearning.
In the
cavity picture,
this non-Gaussian distribution of the fieldsacting
on an untrainedpattern
stems from the fact that there are now manyground
states available. If we add apattern
to such amultitude,
everygroundstate
will still see a Gaussian distribution of the local fieldI°.
However theparticular ground
state, which aftertraining
appears as the one with thehighest stability,
islikely
to have ahigher-than-average
local field for the newpattern,
thusbeing
able to store it moreeasily.
Our field-distribution beforelearning
is then effectedby
this selection. At the end of this section we willshortly
comment on how an intrinsiccavity
approach
for RSB should take this selection effect into account.In
Figure 3,
for ~= l and a
=
i,
the local fields arepresented
for the threeapproximations mentioned, namely
for ourcavity method,
and for the RS and RSBIapproximations.
One can see that inRSBI,
the distribution of the local fields beforelearning
the additional pattern,although strongly non-Gaussian,
is stilleverywhere continuous,
and for t = ~ evencontinuously
differentiable.Comparing
the results of ourcavity theory
with the RSapproximation.
one can see that forincreasing
a > ac. on the one
hand,
the error ratef
obtained with the RStheory
is much better than that obtained with ourcavity
method(if
the error rate obtained with RSBI is considered astarget approximation),
seeFigure 2;
whereas. on the other
hand,
the value of thelimiting negative
field value t := to "z/~ (m
-0.5 inFig. 3),
below which patterns are nolonger learnable,
is almostexactly
the same withour
simple cavity approach
and the much morecomplicated
RSBI calculation.If we
accept
the local fields from RSBI as agood approximation,
thenagain
the Kuhn- Tucker conditions(6)
must be fulfilled afterlearning,
and with(17)
we can calculate theloading
aRsB-KT from the RSBIfield-distribution,
calculated with the parameters q, w and x.To this purpose, in the above-mentioned formula one
only
has toreplace
the Gaussian measure Dtby
the RSBIfield-distribution
of thosepatterns,
whichare
explicitly embedded,
I.e. fromto
to~(= ii
inFigure
3. This is related to nRsBi in a similar way as acav is related to aRso 7 P(cavJ
P(Rsi /~h
--P<RSBJ /
0.6 /
0.5 '
Q2 0.4
C-
0.3 '
'I
0.2 ' '
0.I -1, /
0.0
-3 -2 0 2 3
t
Fig.
3. Theprobability densit» P(t)
of the local fields t beforelearning
is presented as a function of t at a= I and ~
= l.
Again
results are sho~vn for thecavity-iuethod
and thereplica
calculation in RS andI-step-RSB
approximation. The distribution aftertraining
is given bypushing
the middlesegment to a
d-peak
at ~ = l in every case. Thecavity-method
vields an error ratef
= o.29092, the
replica
methodgives f
= o.13073 in RS andf
= o.13576 in
1-step-RSB approximation.
Thefraction of
explicitly
learned patterns is o.55042 for thecavity approximation
and o.71062 and o.66745respectively
for thereplica
calculations.For three values of ~, the
resplt fKT
i=fRsBi-KT
of such a calculation ispresented
inFigure
4. Forcomparison
also the results of the twosimple approximations, fcav
andfRs,
arepresented, together
with the RSBIreplica
resultfRsBi Obviously,
theimproved cavity
resultfKT
isonly slightly higher
thanfRsBi,
which isalready
a criterion for thequality
of the RSBIcalculation
compared
with the RStheory.
One expects therefore that the
rigorous
result should lie between ourfKT
as an upper andfRsBi
as a lowerbound,
so that furtherreplica symmetry breaking
steps shouldgive only
a
slight improvement. Precisely,
weexpect
both a slow monotonous increase of the error ratefRsBn
withincreasing
n in a RSBncalculation, [14j,
and a slow monotonous decrease offRsBn-KT,
and at the same time a decrease of the relative number ofexplicitly
embeddedpatterns and of the
averaged embedding strength.
As a consequence, aRsBn-KT, which is calculated from a formulaanalogous
to(17),
increasesslowly.
For n ~ cc, the RSBnreplica
result and the RSBn-KTcavity
result derived from the RSBn field-distribution should agree.We have thus found a method
leading
to an upper bound for the error ratef(a,
~c) as afunction of a and ~c, if the local fields before
re-learning
are known. In[14j,
Fontanari and Theumann derive an upper bound forf
at ~ =0,
evaluated in RSapproximation
at the AT line for finite temperatures: For o$ 50,
our RSBI-KT upper bound islower,
whereas foro > 50 the bound in
Figure
2 of[14]
is lower.That the RSBI itself is not
yet sufficient,
hasalready
been shownby Majer
andEngel [i ii.
Moreover,
also the fact that after our RSBI-KTcavity learning
there is still a gap in thefield-distribution,
which isaccording
to Bouten [9jresponsible
forRSB, points
to this fact.Additionally,
in thisapproximation
the condition is violated that the number ofexplicitly
embeddedpatterns
should not exceed the number N ofcoupling degrees
offreedom,
see equa-tion
(23).
0.5
,,
fca% _""
IRS ,"
)I'
I' /
,' 1'
0.4 fKT
,,'$/~~
,' /
l' /~
l' /~
0.3
w-
is
0.2
,/
,"/
,'$"
-$~ ,'1' /$~
,'~
01 /~~~
n~o5
,w<~
/A~II ,/
~ w0
~'~
l.0 2.0 3.0
a
Fig.
4. The error ratef
ispresented
as a function of o for ~= l.25, 0.5 and 0. The results are
given
for thecavity method,
the RS andI-step-RSB approximation
as well as the resultfKT,
which is calculatedby putting together
the Kuhn-Tucker conditions and the local fields of the RSB-solution.The best estimate is ver»
likely
between theseRSB-graphs,
I.e.fRsB
andfKT,
andprobably
closer to /%SB.In connection with
Wong's
paper [7j it isinteresting
to note thefollowing: Using
the RSBI distribution of fields in connection withequation (6)
in [7j onereproduces self-consistently
the RSBI result for a ofhlajer
andEngel [iii.
Theslight
difference to our result arisesagain
from a contribution of ad-peak,
which ispresent
inWong's approach,
but not in our's. Thiscontribution,
much smaller now, which in turnoriginates
from the mentioned gap, leadsagain
to terms
corresponding formally
to M inequation (20).
Thus once more the difference betweenour results and those obtained with
Wong's approach,
bothstarting
with the above-mentioned RSBI fielddistribution,
shows that also the RSBIcalculation,
albeit agood approximation,
is notyet
exact.The results discussed here have
recently
beensupported by
acomplicated 2-step-RSB
cal- culation:Whyte
andSherrington
find in[15j
that in fact the next step in thereplica breaking
scheme raises the error rate
f,
butonly by
an amount which istypically O(10~~).
This agrees very well with thepredictions
that we could make from ourfindings.
As
already mentioned,
the results of thecavity
methods arenecessarily
insufficient above ac, since the combinatorialpossibilities
to select the "wastepatterns"
are not considered.However,
with our
approach
it should also bepossible
toget equations,
which areequivalent
toRSBI,
withoutusing
thereplica trick,
I.e. as in the seminal book[16j
onspin glasses.
To achieve
this,
one considers a multitude ofground
states, which are ordered in an ultra- metric structure. If one decreases thestability constraints,
the number of differentground
states is assumed to increase
exponentially.
If one now adds apattern
to thisensemble,
alower
embedding strength
is thereforefavoured,
whichgives
ahigher storage capacity
com-pared
to the Gaussian one of ourequation (17).
Thisapproach
has theappealing trait,
that the number ofpatterns,
which have to bemisclassified, gets
lower as moredegrees
of freedom become available for theapproximation.
Within thereplica method,
in contrast, one has to take the "worst" value of the orderparameters
for technical reasons. Work on an intrinsiccavity approach
for a > ac is in progress.inputlayer
with 2N
flexible
weights
hidden
layer
weights
= I
output neuron with threshold 1/2
Fig.
5. The AND-machine with tree structure(NRF).
Theweights
between the intermediatelayer
and the output neuron are fixed. A threshold between o and I at the output-neuron takes care that there
only
is a positive output if both intermediate neurons arepositive.
3. The AND-Machine
For
multilayer perceptrons, replica symmetry breaking
is ageneral phenomenon,
whenoptimal learning capacity
for agiven stability
oroptimal stability
forgiven capacity
arerequired.
Asa consequence, the
optimal capacity
is noteasily
estimated.Griniasty
and Grossman[17j
have calculated thestorage capacity
of the AND-machine and gavearguments,
which made asuppression
ofreplica symmetry breaking
credible.However,
with our method we are able to check theirproposal
and find thatreplica
symmetry is broken.3.i. MODEL DESCRIPTION. The AND-machine is a
simple example
for amulti-layer
per-ceptron. Multilayer perceptrons
have been introduced to overcome the limitations ofsingle layer
perceptrons, which are limited tolinearly separable classifications [18j,
whereas withsufficiently
many units in the additionallayer(s)
between theinput
andoutput units,
everyBoolean function can be
implemented. However,
at the sametime,
theanalytical
treatment of the models becomes much morecomplicated,
and ingeneral,
the space of solutions is nolonger simply
connected. The RSapproximation yields
then results which contradict the ex- act bounds derivedby
Mitchison and Durbin[19j. However,
the necessary RSB calculation israther
complicated
for such models. Amultilayer model,
which has been studied in this way, is the so-called committee machine with 3(or
any other odd numberNh of)
hidden itnits[20, 21j.
Here the intermediate hidden
layer
consists of three neurons, and theoutput
isgiven by
themajority
vote of these hidden neurons. The maximalcapacity
of the system decreases from ac m 4.02 for the RSapproximation
to ac m 3.0 for the RSB-ansatz. This is for the case ofnon-overlapping receptive fields,
seeFigure
5.In contrast to the mentioned case of the committee
machine,
for the AND machine thenumber
Nh
of intermediate neurons isarbitrary (> 2);
here theoutput
unitgives
apositive
vote iff all intermediate neurons vote with +i. This AND machine was at first treated
by Griniasty
and Grossman[17j,
both withnon-overlapping
and also withoverlapping receptive
fields(NRF
and ORF cases,respectively).
Within areplica-symmetric approach,
the authors find for the case of anequal
number ofpatterns
with output +i and i for N= 2 intermediate
neurons a critical
capacity
of ac m 3.5(ac
zi 3.3 withsimulations),
for the case ofoverlapping
receptive fields,
and ac m 3.66 for the NRF caseiii].
At the end of their paper,Griniasty
and Grossman discuss the validitv of their RSapproximation
and findsupport
for theirassumption
that
only
asingle
minimum contributes to their solution.Griniasty
[6jrepeated
the calculation with hiscavity
method andgot
the same results.Wong's
recent resultiii
on thisproblem
will be commented upon below. In thefollowing,
we use our differentcavity
method.3.2. CALCULATION oF THE LEARNING CAPACITY. As we have seen, our method
yields
aconvenient way to
recognize
thenecessity
of RSB.Xdditionally
onegets
arough quantitative estimate,
to which extent the actual solution isapproximated.
At first westudy
the ANDmachine with
Nh
" 2 and
non-overlapping receptive
fields(NRF case).
Since this does notnecessitate an additional effort and the
argumentation
issimplified,
we assume that both subnetworks are trained withoptimal stability,,
aslong
as theoptimal capacity
ac is notyet
reached.lve assume p+
=
a+N
and p-= a-N
patterns
withpositive respectively negative
output.Then the
parameter
b is defined via1 + b
a~ =
j
°.(24)
The
(+)-patterns
must be trained in bothsubnets,
since both intermediate neurons must votepositively,
whereas for the(-)-patterns
one can choose at will a subnet with anegative
vote.From this
fact, Griniasty
and Grossman[iii
derive thesharp
bounds~
§ ac f ~
(25)
for the maximal
capacity
ac, which alsoapplies
withoverlapping receptive
fields(ORF case).
Moreover, recently
Wendemuth[22j
w~s able togeneralize
the method of Mitchison and Durbin[19j
and obtained for the AND-machine as a lower bound thesharper
condition4
(1 hi iog~ (3j ~
°~' ~~~~which also
applies
both to the NRF and the ORF case.Let us consider
fictitiously
allpossibilities
to distribute theresponsibility
for the(- )-patterns
among the
Nh subnetworks;
these patterns,including
the(+)-patterns,
shall then be trained tooptimal stability.
Afterwards we select thatdistribution,
which leads to maximalstability
~ =
min(~ci
~c2).In both subnetworks
=
1,
2 we have Ninput
neurons, and in both subnetworks thecouplings
are defined with
embedding strengths x(
asJ,k
"~x(("((~. (27)
~
»
Here
("
is the desired finaloutput.
Asbefore,
we define the oriented correlation matrixB,
++the
length
L of thecoupling
vectors and the oriented local fields E~y»u I
~v ~u j ~»
~uj~~j
~ p/ >k ~k
k
Li
=
JIJ>
=
j~oxiBi~xi 129)
P>U
E~
=
~jB"~~). (30)
u
The
embedding strengths x(,
for I=
1,
2 and all ~,together
with theEf,
fulfillagain
the KT conditionseither
(x(
> 0 andEf
=
ii
or(x$
= 0 andEf
>1). (31)
Additionally,
we know that for ouroptimal choice,
a(- )-pattern
can have ~positive embedding strength only
at one of thesubnets,
since otherwise the less stable subnet could enhance thestability by reducing
itsunnecessarily positive x$
to 0 andrelearning
the otherpatterns.
Furthermore,
in thethermodynamic limit,
both stabilities should beequal,
I.e.L(
=
L]
=:
L~.
Let us assume
again
that there isonly
onegroundstate,
and add a new pattern.Again
this feelsa
normally
distributed random oriented fieldE)
with varianceL~
in each subnet. A(+)-pattern
must be classified
correctly
in both subnets. So weneed,
in case of@)
<1, positive embedding strengths x)
=(i E)) /(i
+ gi),
where the response factors gi have still to be determined. In contrast, for a(-)-pattern
we can choose the subnet withnegative
intermediateoutput.
Aswe will see
below,
it is best to embed the(-)-pattern
in thatsubnet,
where the oriented field islarger,
so that theembedding strength
can be smaller. The field-distributionP(t)
for thelarger
oriented fieldt,
with normalizedcouplings, (E$~~
=Lt),
can be calculated from the Gaussian distribution of the fields(ti, t2)
of the two sublattices as follows:j
~ t
~~~~ ~~~~~ ~'~~
~ ~~/h
~ ~~~~~
If
again
we assume that x" and BPU areuncorrelated,
then the answer g of the patterns, which hadalready
beenimplemented
and now mustkeep
the KTconditions,
is similar toequation (14), namely
£ (B°~)~
= -ae~ =
-aP(xP
>0). (33)
(xf>o)
Identifying again
the distribution of theembedding strengths
with theprobability
distribu- tion forx°
andnormalizing
thecouplings
toi,
weget
~ ~
g = -a+
/
Via-
/ Vi16(t). (34)
Here the first and second
parts
describe the influence of(+)-
and(-)-patterns, whereby compared
withequation (32)
a factor1/2
~v~as taken intoaccount,
sinceonly
one of the two subnets is needed for(-)-patterns.
Thecapacity
for finite stabilities isagain
calculated from the KT conditions(31) through
~2
~
jj~m~Pu~u
~jj
~M~M=
~j~M (35)
1 fi~ ? fi~ ~ ~ l
vu p p
~
= a
/ dx, x,w(x~)
=
~' /
Vt(o+
+
a-16(t))
(~ct). (36)
+ g
-c~