HAL Id: jpa-00212455
https://hal.archives-ouvertes.fr/jpa-00212455
Submitted on 1 Jan 1990
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of
sci-entific research documents, whether they are
pub-lished or not. The documents may come from
teaching and research institutions in France or
abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est
destinée au dépôt et à la diffusion de documents
scientifiques de niveau recherche, publiés ou non,
émanant des établissements d’enseignement et de
recherche français ou étrangers, des laboratoires
publics ou privés.
Landscape statistics of the binary perceptron
J.F. Fontanari, R. Köberle
To cite this version:
Landscape
statistics of the
binary
perceptron
J. F. Fontanari and R. KöberleInstituto de Fisica e
Quimica
de São Carlos, Universidade de São Paulo, 13560 São Carlos SP,Brasil
(Reçu
le 20 décembre 1989,accepté
sousforme définitive
le 13 mars1990)
Abstract. 2014 The
landscape
of thebinary
perceptron is studiedby
SimulatedAnnealing,
exhaustive search andperforming
random walks on thelandscape.
We find that the number of local minima increasesexponentially
with the number of bonds,becoming
deeper
in thevicinity
of aglobal minimum,
but more and more shallow as we move away from it. The random walker detects asimple dependence
on the size of themapping,
the architectureintroducing
a nontrivialdependence
on the number of steps.Classification
Physics
Abstracts 87.10 64.60C1. Introduction.
The
single-layer
perceptron
has beenobject
of intense theoretical research since Gardner’s seminal paper[1].
The tools for this area of research have beendeveloped
in theanalysis
ofthe infinité range
spin-glasses [2].
Remarkable outcomes of these studies were therederiva-tion of the maximal
storage
capacity
of the network for the randominput/output mapping [1]
]
and the calculation of the distribution of relaxation times for theoptimal
stability
perceptron
learning algorithm
of Krauth and Mezard[3].
Most of the results derived in the 1960’s[4, 5]
and alsorecently [1,
3,
6-9]
concernedperceptrons
with realweights, though
Gardner’sapproach
caneasily
be extended to the case where theweights
are constrained to take ononly
a limited number of values.Clearly, knowing
how the discretisation of theweights
affects theperformance
of theperceptron
isimportant
from thetechnological viewpoint
because somesort of discretisation is
always
made whenimplementing
the network in hardware. There isalso a
biological
motivation forconsidering
discreteweights,
since the finetuning
of thesynaptic
efficacies is not aplausible assumption
forhighly
robustsystems
asbiological
brains.In this paper we consider an extreme case of discretisation where the
weights
areconstrained to take the values ± 1 and we refer to this model as the
binary
perceptron
[6,
10].
The network consists of an
input layer composed of N binary
units= - 1, i = 1, ..., N
each one connected to abinary
output
unit S = ± 1through
theweights
{Ji
= ±1, i
=1,
...,
N } .
Once the values of the units in theinput layer (input pattern)
areknown,
the value of theoutput
unit is determinedby
theequation
The
perceptron’s
task consists oflearning
amapping
between pinput
patterns
and poutput
states. Such
mapping
can be emulatedby
theperceptron
if there exists a vectorJ =
(Jl,
J2, ...,
J N)
such thatthe p inequalities
are
simultaneously
satisfied.Despite
theapparent
simplicity
of thebinary
perceptron
itsanalysis
has been much harder to carry out than for theperceptron
with realweights.
Forinstance,
consider thecomputation
of the maximal
storage
capacity
a c defined as the ratio between the maximal numberof input
patterns
which can becorrectly mapped
into theirrespective
outputs
and the size of theinput
field N. For the real
weights
perceptron
thereplica symmetric
ansatz is exactand,
for the randominput/output mapping, gives
ac = 2[1]
while itgives
a valuelarger
than 1[6],
theInformation theoretic
bound,
for thebinary
perceptron.
Only recently
the issue of the maximalstorage
capacity
of thebinary
perceptron
was settledthrough
extensive numerical simulations[ 11 ]
andanalytical
calculationsusing
Parisi’s ansatz forbreaking
thereplica
symmetry
[12].
Itsvalue, - 0.83,
indicates that the discretisation of theweights
does notcause a serious
damage
to theperceptron
performance. Although
this result asserts that fora 0.83 there exist solution vectors J which
satisfy
theinequalities (1.2),
there is no knownalgorithm
forfinding
them,
in contrast to the case of realweights
where theperceptron
algorithm
isguaranteed
to converge in a finite time to one of the solutions[4].
Lacking
aspecific algorithm
for thebinary
perceptron
we need to resort to heuristic methods to find the solutions of agiven mapping [13].
The process offinding
these solutions can be viewed as a search in theweight
space for theglobal
minima of the energy functionwhich
represents
the number of misassociatedpatterns
(0 , E , p )
and©(x) = 1
forx ± 0 and 0 otherwise. Moreover
by restricting
the moves in theweight
space to one bondflip
one can think ofequation (1.3)
asdefining
an N-dimensionallandscape
whichmay have
manyvalleys
of low energy(local minima)
and zero-energy(solutions).
It’s clear that theperformance
of the heuristicalgorithms
intrying
to minimize E willdepend
upon severalproperties
of the energylandscape
like how mountainous itis,
the number of minima and its energydistribution,
the distance between minima and so on.The
goal
of this paper is tostudy
the statistical structure of the energylandscape
as afunction of N
characterizing
the network architecture and pcharacterizing
themapping
structure. Since we want to
guarantee
that at least one solutionexists,
we consider amapping
where
the p input
patterns
are chosenrandomly
withequal
probability
while theoutputs
aregiven by
where
Ki
= ± 1 at random[10].
Of course, we can choose K = 1 without loss ofgenerality
because
the emi’ s
are random. It isinteresting
that there exists a maximal number ofinput
patterns
Pmax ==== 1.35 N above which theonly
solution to theproblem
isK [10].
We do notpursue this issue in this paper.
The paper is
organized
as follows. In section 2 westudy
the distribution of minima of theand
by
agreedy (gradient descend) algorithm
forlarger
N. In section 3 westudy
thedistribution of
global
minima(solutions)
obtainedby
SimulatedAnnealing [14].
In section 4we
study
a random walk on the energylandscape
to obtain information about thelandscape’s
correlation. The details of this calculation aregiven
in anappendix. Finally,
in section 5 wesummarize our results and
present
someconcluding
remarks.2. Distribution of minima.
In this section we address the issue of the distribution of
minima,
disregarding
whether local orglobal,
of the energylandscape
definedby equation (1.3).
Thealgorithm
we use togenerate
the minima is thefollowing. Starting
from an initial guess vectorJo
with agiven
energy, one
computes
the energy of its Nneighbors (vectors differing by
1 bond fromJo).
The vector with lower energy iskept
and the process isrepeated
until it reaches a vector which has lower energy than any of itsneighbors.
There are severalquestions
of interest whose answers may shed somelight
on the structure of thelandscape :
1. What is the average energy of the minima and how does it
depend
on the distance to the solution vector 1 ?2. What is the average distance between minima ?
3. What is the average number of
steps
the abovealgorithm
takes to reach a minimum ?4. How does the number of minima
depend
on Nand p ?
To
investigate
thesequestions
we haveperformed
simulations for networks andmappings
of différent sizes.
Typically
our results are averages over 20 différentmappings
and 200 initial guess vectors. In thepresentation
of our results we introduce theparameter a
defined as theratio between the size of the
mapping
and the size of thenetwork, a
= p /N.
Table 1 shows the average and the width of the distribution of energy for several values of N and a in the case where the initial guess vector
components
are chosenrandomly
withequal
probability.
Bothquantities,
the average and thewidth,
are dividedby p
to facilitate thecomparisons
between the data. These results indicate that the average energy increaseslinearly
with p while the width increaseswith === p 1/2.
When N increases the average energyapproaches
p /2,
the energy of a randomconfiguration.
This result resembles Kauffman’scomplexity catastrophe [15]
sinceby increasing
thecomplexity
of the network(measured by
N)
one makes the local minima poorer to such apoint
thatthey
are not better than arandomly
chosen
configuration.
Infigure
1 wepresent
the energy distribution of the minima obtainedby
exhaustive search(all
minima werefound)
for N = 15 and several values of a. Notice thatas a increases the average energy tends to a well defined
limit,
whichdepends
onN,
inagreement
with the results of table 1.Table I. -
Average
and width(between
parenthesis) of
the distributionof
energyof
the minimaFig.
1.- Energy
distribution of the minima obtainedby
exhaustive search for N = 15 anda = 1.2
(A ),
2.0(e)
and3.0 (0).
For each curveP (E)
is dividedby Pmax(E)
1.In table II we
present
the results of a similarexperiment
except
that now the initial guessvector
components
are chosenequal
to - 1 withprobability 1/4
so thatJo
differs in averageby
N/4
bonds from the solution vector 1. These results indicate that the average energy of the local minima decreases asthey
come closer to aglobal
minimum.Supporting
thisconjecture
wepresent
infigure
2 the energy distribution of the minima obtainedby
exhaustive search forN = 15 and a =
2,
where we haveseparated
the minima in differentsamples according
to theirHamming
distance(normalized
to 1throughout
thispaper)
to the solution vector 1. Thisrather curious
property
of thebinary
perceptron
energylandscape,
reminiscent of a « MassifCentral » since there is a
high
concentration of very low energy minima around theglobal
minima,
seems to be a characteristic of some classesof landscapes [15].
Infigures
1 and 2 wehave divided P
( E ) by
its maximum valueP max ( E )
1 for each curve so that the data fordifferent a and
d
1 can bepresented
in the same scale.Table II.
Average
and width(between parenthesis) of
the distributionof
energyof
theminima in the case where the components
of Jo
are - 1 withprobability 1/4.
Now we turn to the distribution of the
Hamming
distancesdH
between minima. We have found that the average of thisdistribution,
dH,
isextremely
insensitiveto p
and N :already
forN = 51 it coincides with its value for N = 1 001. The
width, however,
though
insensitiveFig.
2.- Energy
distribution of the minima obtainedby
exhaustive search for N = 15 anda = 2.0 for
d,
1(o),
d1
1/5 (0), dl -- 1/3 (A )
andd1 >
4/5 (+ ).
For each curveP(E)
is dividedby
Pmax(E)
1.decreases with
N - 1/2.
The same comments are also valid for the distribution of distancesbetween the minima and the solution vector
1,
dl.
However,
dH
anddl
depend
on the choiceof the initial guess vector
Jo.
Table III shows thisdependence
for N = 1001 anda = 0.1 with
Jo
chosen inside aHamming sphere
centered on 1 and of radiusequal
to0.1,
0.25 and 0.5. Notice how the minima become closer to each other asthey
approach
thesolution 1 thus
corroborating
the « Massif Central »conjecture.
Table III.
- Average Hamming
distance between minima(dH)
and between the minima andthe solution 1
(il) for
Jo
chosen inside aHamming
sphere
centered on 1of
radius0. l, 0.25
and0.5. The numbers between
parenthesis
are the widthof
the distribution.In
figure
3 we show the number ofsteps
thegreedy algorithm
takes to reach a minimum as afunction of a. This
quantity
scalesnicely
withN 1/2 (we
have tested thisscaling
for N = 15 up to1 001),
increaseslinearly
for small a and seems to saturate forlarge
a. Thisbehavior
might
be a consequence of adecreasing
in the number of minima as aincreases,
since it is clear that as the minima become rarer more time will be needed to find them. To check it we present in
figure
4 the ratio between the number of minima and the total numberFig.
3. - Number of steps to reach a minimum(scaled by
N- l/2
as a function of a forN = 101
1 (e)
and 201(A ).
Fig.
4. - Ratio between the number of minima and the total number of vectors as a function of a for N =7(A),
9(D)
and 15(o).
fixed
N,
its number decreases as a - x and for N = 15 one finds x = 0.73. Notice the lineardecreasing
of the number of minima for small a and the saturation forlarger
a. One could belead to think that
finding
theglobal
minimum becomes easier as a - oo since there are fewertraps
(local minima)
on the way,however,
thepossibility
thelandscape
takes agolf
courseshape
as a increases cannot be dismissed.Indeed,
this seems to be the case : forN = 15 we have measured the
energy gap between the
global
minimum and the first excitation increases with a. We havealso observed that the energy gaps between local minima are much smaller then the ones
stated above so that a
golf
course scenario seems to beappropriate
to describe thelandscape
for verylarge
a. Fromfigure
4 one can also see that the number of minima increasesexponentially
with N.3. Distribution of
global
minima.In this section we focus on the
global
minima of the energylandscape.
We have used SimulatedAnnealing (SA) [14]
to obtain in average 50 different solutions permapping.
Theaverage and the width of the distribution of
dH
arecomputed
and the resultsaveraged
over~ 100
mappings.
However,
there are some inherentproblems
in this method.First,
for pnearPma,,, - 1.35 N there are very few solutions to a
mapping
so one should findpractically
allthem in order to have a
significative
statistics. Sincefinding
solutions in thisregion
has provento be an
exceedingly
difficult task even for apowerful
heuristic as SA we restrict ouranalysis
to the
region a
0.7.Second,
the solutions foundby
SAmight
not be arepresentative
sample
of the space of solutions sincethey
could possess someparticular properties
whichmade them accessible to SA.
However,
avoiding
to collect biased solutions for our statistics was the main reason we have chosen SA among other heuristicalgorithms :
thestochasticity
of SA makes it very
unprobable
to be attracted to somespecial
class of minima.Now we turn to the results of the simulations.
Keeping a
= 0.1 fixed we measure the mean and the width of the distribution ofdH.
Theresults,
presented
in table IV for several values ofN,
show that the mean ispractically
insensitive to N while the width becomes narrower as Nincreases. The
decreasing
width is fittedby
the curve 0.54N - 0.52,
indicating
then that in thethermodynamic
limit the solution space, at least for small a, has a verysimple
structure moresimilar to a
ferromagnet
whoseground
states are related to each otherby
thesymmetry
of themodel than to an infinite range
spin-glass
where there are nosymmetry operators
leading
from one
ground
state to the others. These results also show that thereplica symmetric
ansatz shouldgive
agood description
of thebinary perceptron’s
solution space for small a, inagreement
with the results of[12] though
theproblem
considered there was the randominput/output mapping.
Table IV.
- Average
and widthof
the distributionof
Hamming
distances betweenglobal
minima
for
a = 0.1.One of the
key points
to the calculation of ac for thebinary
perceptron
was to realize thatthe solutions do not become
arbitrarily
close to each other as a - a c. In table V we show thedependence
on a for N = 120 of the mean and the width of the distribution ofdH.
The solutions become closer to each other as a increases and the width decreases veryslowly
with a. As observed in thebeginning
of this section thesharp
reduction in the number of solutions prevents us of furtherincreasing a
and thenverifying
whetherdH
goes to zero orTable V.
- Average
and widthof
the distributionof
Hamming
distances betweenglobal
minima
for
N = 120.4. Random walks on the energy
landscape.
In this section we
study
random walks on the energylandscape
of thebinary
perceptron.
Thequantity
we focus on is the average variation of energy after s randomsteps.
Eachstep
consists of
flipping
onerandomly
chosen bond.Clearly
on onehand,
in a very correlatedlandscape
a random walker would need manysteps
to reach aposition,
i.e. a vectorJ,
whoseenergy differs
significantly
from the initial one. On the otherhand,
in a very uncorrelatedlandscape
our random walker would needonly
a fewsteps
to reach such aposition.
Thusstudying
the variation of energy in a random walk weexpect
to obtain information about thelandscape’s
correlation.The walker starts from a random initial
position,
i.e. from a vector J whosecomponents
± 1 are chosen
randomly
withequal probability.
We first calculate theprobability
distributionfor an energy
change
DE due toflipping m
bonds of J and then express it in terms of thenumber of
steps s
takenby
the random walker.We will be
mainly
interested in the first two moments ofP /1E.
In thefollowing
weonly
present
and discuss theresults,
giving
the details of the calculations in theappendix.
We have foundClearly
the first moment is zero since a bondflip
can increase or decrease the energy of arandom
configuration
withequal probability. Pm
in theequation
for the second moment isgiven by
for m « N and odd N > 1. From this
equation
one can see thatPm
=P m +
1 for odd m.
In the limit m =
f3 N
we can writeP m -+ P ( B )
whereBefore
comparing
thèse results with our simulations m has to be related to the number ofsteps
taken in our random walk. We thus ask for theprobability
Ps, m
ofobtaining
mflipped
bonds in s random
steps.
This is a well knownproblem proposed by
Ehrenfest[16],
which canThe mean number of
flipped
bonds after ssteps
is :The
stationary
solution for the number offlipped
bonds exhibits asharp
Gaussianpeak
ofwidth
N -
1/2 around the mean, so that we canreplace
the number offlipped
bondsby
its meanvalue. In this way we obtain the
dependence
ofPAE
on the number ofsteps
s. We compare( (AE)2 >
,computed by replacing
3
inequation (4.3) by
the r.h.s. ofequation (4.5)
andsubstituting
the result inequation (4.1 b),
with our simulations infigure
5.A very good
convergence with N to the theoreticalprediction
can be seen.Some comments
regarding
theinterpretation
of ( (AE )2 >
are in order. It is aproduct
of twofactors,
the linearp-behavior being displayed
in one of them reflects thedependence
on themapping
via a sum of p randomindependent
variables. The nontrivialdependence
ons/N
is determinedby
theperceptron’s
architecture.Fig.
5. - Second moment(scaled by
p)
of the distribution for AE in terms of the number ofsteps/N
forN = 201
(D)
and 401(o).
The lower curve is the theoreticalprediction.
5. Conclusion.
The aim of this
study
is to understandgeneral
features of thebinary
perceptron
landscape,
which wouldeventually guide
our intuition in the construction ofalgorithms searching
forglobal
minima. Our results are restricted tomapping problems
which have at least onesolution
[10].
Thegeneral problem
offinding
theglobal
minima of thebinary
perceptron
energy isNP-complete
[ 17]
since it isequivalent
tointeger programming although
atypical
problem
may be tractable[18].
Next we summarize our main results. We find that the numberof local minima increases
exponentially
withN,
but far from aglobal
minimumthey
becomeever shallower as
they
proliferate.
In thevicinity
of aglobal
minimum the average energy ofthe local
minima
tends todecrease,
indicating
the existence of a « Massif Central » a lafinding
theglobal
minimum does not become easier since thelandscape
seems to take agolf
courseshape.
Ananalysis
of the distribution of distances betweenglobal
minima obtainedby
Simulated
Annealing
for small « indicates that it tends to a delta function in thethermodynamic
limit.We
sample
the average features of the energylandscape
by
a random walkerstarting
at arandom
position.
We focus on theprobability
distribution for achange
of DE as a function ofthe number of
steps s
takenby
the random walker. The first moment is zero whereas thevariance has a trivial
dependence
on p but a verycomplex
dependence
ons/N
reflecting
theperceptron
architecture.Somewhat
surprisingly,
most of the features we studied were found todepend trivially
on p(it gives
the energyscale),
indicating
that it is theunderlying
network architecture whichcontrols the
landscape
structure. It should beinteresting
to see how thelandscape
featuresstudied in this paper
change
whenmappings
with morecomplex
statistics are considered.Acknowledgements.
JFF is indebted to R. Meir for
illuminating
discussions onperceptrons
and for aid with thesimulations in section 3. JFF also thanks P. Moscato for discussions on Kauffman’s work and
J. J.
Hopfield
for thehospitality
at Division ofChemistry (Caltech),
where this work wasstarted. The research of RK was
partly supported by CNPq
and JFF waspartly
supported by
aCNPq
fellowship during
hisstay
at Caltech.Appendix.
In this
appendix
wepresent
the details of the calculation ofP âE,
theprobability
distribution for an energychange
AE due toflipping
m bonds of a random vector J.Initially
we note that the p fieldswhere
TI r
= Su sui’ are
independent
random variables since theTI r’s
are chosenindependently
for different»’s.
Of course, in order to 1 be a solution of themapping
theq Ms
mustsatisfy
for each g.
A
change
in energy occurs, whenever some of the fieldschange sign. Suppose
thishappens
for n
fields,
out of which n+change
fromplus
tominus,
whereas n- do theopposite,
so thatn = n + + n - and AE = n -
2 n - .
Theprobability
for the occurrence of this energychange
iswith the notation
Ck = n! /k! (n - k )!
and n_ =(n - AE)/2.
Since the p fields hu areindependent,
Pn
isgiven by
where
Pm
is theprobability
for some field h IL tochange sign
under a simultaneousflip
of mbonds of J. In order to
compute
Pm
let usdrop
thepattern
indexes and divide the fieldm
where z contains the m
components
of J to beflipped.
Theprobability
for E
J
77i to
equal
i = 1
Z is
with k =
(m -
z)/2,
where we have used the fact that the presence of the.Ji s
turnsz into a sum of
independent
random variablesdespite
the constraint(A2).
Similarly
thedistribution for z is
given by
with k =
(m - z ) /2.
Pm
will be theprobability
for1 il ( > 1 z 1.
Restricting
ourselves forsimplicity
to oddN > 1 we
get
equation (4.2)
for m « N. The limit m =8 N,
equation (4.3),
is obtainedby
replacing
the binomial distributions ofequations (A6)
and(A7) by
Gaussians. Thusinserting equation (4.2)
or(4.3)
intoequations (A4)
and(A3)
we obtainP 4E.
References
[1]
GARDNER E., J.Phys.
A 21(1988)
257.[2]
MEZARD M., PARISI G., VIRASORO M. A.,Spin
GlassTheory
andBeyond (World
Scientific,Singapore)
1987.[3]
OPPER M.,Phys.
Rev. A 38(1988)
3824;KRAUTH W. and MEZARD M., J.