Landscape statistics of the binary perceptron

(1)

HAL Id: jpa-00212455

https://hal.archives-ouvertes.fr/jpa-00212455

Submitted on 1 Jan 1990

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of

sci-entific research documents, whether they are

pub-lished or not. The documents may come from

teaching and research institutions in France or

abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est

destinée au dépôt et à la diffusion de documents

scientifiques de niveau recherche, publiés ou non,

émanant des établissements d’enseignement et de

recherche français ou étrangers, des laboratoires

publics ou privés.

Landscape statistics of the binary perceptron

J.F. Fontanari, R. Köberle

To cite this version:

(2)

Landscape

statistics of the

_binary

_perceptron

J. F. Fontanari and R. Köberle

Instituto de Fisica e

_Quimica

de São Carlos, Universidade de São Paulo, 13560 São Carlos SP,

Brasil

(Reçu

le 20 décembre 1989,

accepté

sous

_{forme définitive}

le 13 mars

1990)

Abstract. 2014 The

_landscape

of the

_binary

_perceptronis studied

_by

Simulated

_Annealing,

exhaustive search and

_performing

random walks on the

_landscape.

We find that the number of local minima increases

exponentially

with the number of bonds,

becoming

deeper

in the

_vicinity

of a

_{global minimum,}

but more and more shallow as we move _awayfrom it. The random walker detects a

_{simple dependence}

on the size of the

_mapping,

the architecture

_introducing

a nontrivial

dependence

on the number of _steps.

Classification

Physics

Abstracts 87.10 64.60C

1. Introduction.

The

_single-layer

_perceptron

has been

_object

of intense theoretical research since Gardner’s seminal paper

[1].

The tools for this area of research have been

_developed

in the

_analysis

of

the infinité range

spin-glasses [2].

Remarkable outcomes of these studies were the

rederiva-tion of the maximal

_storage

_capacity

of the network for the random

_{input/output mapping [1]}

_]

and the calculation of the distribution of relaxation times for the

_optimal

_stability

_perceptron

learning algorithm

of Krauth and Mezard

_[3].

Most of the results derived in the 1960’s

_{[4, 5]}

and also

_{recently [1,}

_3,

_6-9]

concerned

_perceptrons

with real

_{weights, though}

Gardner’s

approach

can

_easily

be extended to the case where the

_weights

are constrained to take on

_only

a limited number of values.

_{Clearly, knowing}

how the discretisation of the

_weights

affects the

performance

of the

_perceptron

is

_important

from the

_{technological viewpoint}

because some

sort of discretisation is

_always

made when

_implementing

the network in hardware. There is

also a

_biological

motivation for

_considering

discrete

_weights,

since the fine

_tuning

of the

synaptic

efficacies is not a

plausible assumption

for

_highly

robust

_systems

as

_biological

brains.

In this _paperwe consider an extreme case of discretisation where the

_weights

are

constrained to take the values ± 1 and we refer to this model as the

_binary

_perceptron

_[6,

_10].

The network consists of an

_{input layer composed of N binary}

units

_{= - 1, i = 1, ..., N}

each one connected to a

_binary

_output

unit S = ± 1

_through

the

_weights

{Ji

= _±

1, i

=

1,

...,

N } .

Once the values of the units in the

input layer (input pattern)

are

known,

the value of the

_output

unit is determined

_by

the

_equation

(3)

The

_{perceptron’s}

task consists of

_learning

a

_mapping

between _p

_input

_patterns

and p

output

states. Such

_mapping

can be emulated

by

the

_perceptron

if there exists a vector

J =

(Jl,

J2, ...,

J N)

such that

the p inequalities

are

_{simultaneously}

satisfied.

Despite

the

_apparent

_simplicity

of the

_binary

_perceptron

its

_analysis

has been much harder to _carryout than for the

_perceptron

with real

_weights.

For

instance,

consider the

_computation

of the maximal

_storage

_capacity

_{a c}defined as the ratio between the maximal number

_{of input}

patterns

which can be

_{correctly mapped}

into their

_respective

_outputs

and the size of the

_input

field N. For the real

_weights

_perceptron

the

_{replica symmetric}

ansatz is exact

_and,

for the random

_{input/output mapping, gives}

_ac= 2

[1]

while it

gives

_avalue

larger

than 1

[6],

the

Information theoretic

_bound,

for the

_binary

_perceptron.

_{Only recently}

the issue of the maximal

_storage

_capacity

of the

_binary

_perceptron

was settled

_through

extensive numerical simulations

_{[ 11 ]}

and

_analytical

calculations

_using

Parisi’s ansatz for

_breaking

the

_replica

symmetry

[12].

Its

value, - 0.83,

indicates that the discretisation of the

_weights

does not

cause a serious

_damage

to the

_perceptron

_{performance. Although}

this result asserts that for

a 0.83 there exist solution vectors J which

_satisfy

the

_{inequalities (1.2),}

there is no known

algorithm

for

_finding

_them,

in contrast to the case of real

_weights

where the

_perceptron

algorithm

is

_guaranteed

to _{converge in}a finite time to one of the solutions

_[4].

Lacking

a

_{specific algorithm}

for the

_binary

_perceptron

we need to resort to heuristic methods to find the solutions of a

given mapping [13].

The process of

_finding

these solutions can be viewed as a search in the

_weight

space for the

global

minima of the energy function

which

_represents

the number of misassociated

_patterns

_{(0 , E , p )}

and

_{©(x) = 1}

for

x ± 0 and 0 otherwise. Moreover

_{by restricting}

the moves in the

_weight

_spaceto one bond

flip

one can think of

_{equation (1.3)}

as

defining

an N-dimensional

landscape

which

may have

_many

valleys

of low energy

(local minima)

and zero-energy

(solutions).

It’s clear that the

performance

of the heuristic

_algorithms

in

_trying

to minimize E will

_depend

_uponseveral

properties

of the energy

landscape

like how mountainous it

_is,

the number of minima and its energy

distribution,

the distance between minima and so on.

The

_goal

_{of this paper is}to

_study

the statistical structure _{of the energy}

_landscape

as a

function of N

_{characterizing}

the network architecture and p

characterizing

the

mapping

structure. Since we want to

_guarantee

that at least one solution

exists,

we consider a

_mapping

where

_{the p input}

_patterns

are chosen

_randomly

with

_equal

_probability

while the

_outputs

are

given by

where

_Ki

= ± 1 at random

_[10].

Of course, we can choose K = 1 without loss of

generality

because

_{the emi’ s}

are random. It is

_interesting

that there exists a maximal number of

_input

patterns

Pmax ==== 1.35 N above which the

only

solution to the

_problem

is

_{K [10].}

We do not

pursue this issue in this paper.

The paper is

organized

as follows. In section 2 we

study

the distribution of minima of the

(4)

and

_by

a

_{greedy (gradient descend) algorithm}

for

larger

N. In section 3 we

_study

the

distribution of

_global

minima

_(solutions)

obtained

_by

Simulated

_{Annealing [14].}

In section 4

we

_study

a random walk on _{the energy}

_landscape

to obtain information about the

_{landscape’s}

correlation. The details of this calculation are

_given

in an

appendix. Finally,

in section 5 we

summarize our results and

present

some

concluding

remarks.

2. Distribution of minima.

In this section we address the issue of the distribution of

minima,

disregarding

whether local or

global,

of the energy

landscape

defined

_{by equation (1.3).}

The

_algorithm

we use to

generate

the minima is the

_{following. Starting}

from an initial guess vector

_Jo

with a

_given

energy, one

_computes

the energy of its N

neighbors (vectors differing by

1 bond from

Jo).

The vector _{with lower energy is}

_kept

and the process is

repeated

until it reaches a vector which has lower energy than any of its

neighbors.

There are several

_questions

of interest whose answers _mayshed some

_light

on the structure of the

_{landscape :}

1. What is the average energy of the minima and how does it

depend

on the distance to the solution vector 1 ?

2. What is the average distance between minima ?

3. What is the average number of

_steps

the above

_algorithm

takes to reach a minimum ?

4. How does the number of minima

_depend

on N

_{and p ?}

To

_investigate

these

_questions

we have

performed

simulations for networks and

mappings

of différent sizes.

_Typically

our results are _averagesover 20 différent

_mappings

and 200 initial guess vectors. In the

_presentation

of our results we introduce the

parameter a

defined as the

ratio between the size of the

_mapping

and the size of the

network, a

= p /N.

Table 1 _{shows the average and the width of the}distribution of _energyfor several values of N and a in the case _{where the initial guess}vector

_components

are chosen

_randomly

with

_equal

probability.

Both

_quantities,

the average and the

width,

are divided

_{by p}

to facilitate the

comparisons

between the data. These results indicate _{that the average energy increases}

linearly

with p while the width increases

with === p 1/2.

When N increases the average energy

approaches

p /2,

the energy of a random

_{configuration.}

This result resembles Kauffman’s

complexity catastrophe [15]

since

_{by increasing}

the

_complexity

of the network

_{(measured by}

N)

one makes the local minima poorer to such a

_point

that

_they

are not better than a

_randomly

chosen

_{configuration.}

In

_figure

1 we

_present

the _energydistribution of the minima obtained

by

exhaustive search

_(all

minima were

_found)

for N = 15 and several values of a. Notice that

as a increases the _{average energy}tends to a well defined

_limit,

which

_depends

on

N,

in

agreement

with the results of table 1.

Table I. -

Average

and width

(between

parenthesis) of

the distribution

_of

energy

of

the minima

(5)

Fig.

1.

_{- Energy}

distribution of the minima obtained

_by

exhaustive search for N = 15 and

a = 1.2

(A ),

2.0

(e)

and

3.0 (0).

For each _curve

P (E)

is divided

by Pmax(E)

1.

In table II we

_present

the results of a similar

experiment

_except

that now _{the initial guess}

vector

_components

are chosen

equal

to - 1 with

_{probability 1/4}

so that

_Jo

differs in average

by

N/4

bonds from the solution vector 1. These _{results indicate that the average energy of the} local minima decreases as

_they

come closer to a

_global

minimum.

_Supporting

this

_conjecture

we

_present

in

_figure

2 _{the energy distribution of the minima obtained}

_by

exhaustive search for

N = 15 and a =

2,

where we have

_separated

the minima in different

_{samples according}

to their

_Hamming

distance

_(normalized

to 1

_throughout

this

_paper)

to the solution vector 1. This

rather curious

_property

of the

_binary

_perceptron

energy

landscape,

reminiscent of a « Massif

Central » since there is a

_high

concentration of very low energy minima around the

global

minima,

seems to be a characteristic of some classes

of landscapes [15].

In

figures

1 and 2 we

have divided P

_{( E ) by}

its maximum value

_{P max ( E )}

1 for each curve so that the data for

different a and

_d

_{1 can}be

presented

in the same scale.

Table II.

_Average

and width

_{(between parenthesis) of}

the distribution

_of

_energy

_of

the

minima in the case where the components

of Jo

are - 1 with

_{probability 1/4.}

Now we turn to the distribution of the

_Hamming

distances

_dH

between minima. We have found that the average of this

distribution,

dH,

is

_extremely

insensitive

_{to p}

and N :

_already

for

N = 51 it coincides with its value for _N= 1 001. The

width, however,

though

insensitive

(6)

Fig.

2.

- Energy

distribution of the minima obtained

by

exhaustive search for N = 15 and

a = 2.0 for

d,

1

(o),

d1

1/5 (0), dl -- 1/3 (A )

and

d1 >

4/5 (+ ).

For each curve

_P(E)

is divided

by

Pmax(E)

1.

decreases with

N - 1/2.

The same comments are also valid for the distribution of distances

between the minima and the solution vector

_1,

_dl.

_However,

dH

and

_dl

_depend

on the choice

of the initial guess vector

_Jo.

Table III shows this

_dependence

for N = 1001 and

a = 0.1 with

_Jo

chosen inside _a

Hamming sphere

centered _on1 and of radius

equal

_to

0.1,

0.25 and 0.5. Notice how the minima become closer to each other as

_they

_approach

the

solution 1 thus

_{corroborating}

the « Massif Central »

_conjecture.

Table III.

- Average Hamming

distance between minima

_(dH)

and between the minima and

the solution 1

_{(il) for}

_Jo

chosen inside a

_Hamming

sphere

centered on 1

of

radius

0. l, 0.25

and

0.5. The numbers between

_parenthesis

are the width

_of

the distribution.

In

_figure

3 we show the number of

_steps

the

_{greedy algorithm}

takes to reach a minimum as a

function of a. This

quantity

scales

nicely

with

N 1/2 (we

have tested this

scaling

for N _{= 15 up}to

_{1 001),}

increases

_linearly

for small a and seems to saturate for

_large

a. This

behavior

_might

be a consequence of a

_decreasing

in the number of minima as a

_increases,

since it is clear that as the minima become rarer more time will be needed to find them. To check it we _presentin

_figure

4 the ratio between the number of minima and the total number

(7)

Fig.

3. - Number of _steps to reach a minimum

_{(scaled by}

_{N- l/2}

as a function of a for

N = 101

1 (e)

and 201

(A ).

Fig.

4. - Ratio between the number of minima and the total number of vectors _{as a}function of a for N =

7(A),

9

(D)

and 15

takes a

_golf

course

shape

as a increases cannot be dismissed.

_Indeed,

this seems to be the case : for

N = 15 _wehave measured the

(8)

energy gap between the

global

minimum and the first excitation increases with a. We have

also observed that the _{energy gaps between local minima}are much smaller then the ones

stated above so that a

_golf

course scenario seems to be

_appropriate

to describe the

_landscape

for _very

_large

a. From

figure

4 one can also see that the number of minima increases

exponentially

with N.

3. Distribution of

_global

minima.

In this section we focus on the

_global

minima of the _energy

_landscape.

We have used Simulated

_{Annealing (SA) [14]}

to obtain _{in average 50}different solutions per

mapping.

The

average and the width of the distribution of

dH

are

_computed

and the results

_averaged

over

~ 100

_mappings.

_However,

there are some inherent

_problems

in this method.

First,

for p

nearPma,,, - 1.35 N there are _{very few}solutions to a

_mapping

so one should find

practically

all

them in order to have a

_{significative}

statistics. Since

finding

solutions in this

region

has proven

to be an

_exceedingly

difficult task even for a

powerful

heuristic as SA we restrict our

analysis

to the

_{region a}

0.7.

Second,

the solutions found

_by

SA

_might

not be a

_{representative}

sample

of the space of solutions since

they

could possess some

_{particular properties}

which

made them accessible to SA.

However,

avoiding

to collect biased solutions for our statistics was the main reason we have chosen SA _amongother heuristic

_{algorithms :}

the

_{stochasticity}

of SA makes it very

unprobable

to be attracted to some

_special

class of minima.

Now we turn to the results of the simulations.

_{Keeping a}

= 0.1 fixed _we_measurethe _mean and the width of the distribution of

_dH.

The

_results,

_presented

in table IV for several values of

N,

show that the mean is

_practically

insensitive to N while the width becomes narrower as N

increases. The

_decreasing

width is fitted

_by

the curve 0.54

N - 0.52,

_indicating

then that in the

thermodynamic

limit the _{solution space,}at least for small a, has a very

simple

structure more

similar to a

_ferromagnet

whose

_ground

states are related to each other

_by

the

_symmetry

of the

model than to an infinite _range

_spin-glass

where there are no

_{symmetry operators}

_leading

from one

_ground

state to the others. These results also show that the

_{replica symmetric}

ansatz should

give

a

good description

of the

_{binary perceptron’s}

solution space for small a, in

agreement

with the results of

_{[12] though}

the

_problem

considered there was the random

input/output mapping.

Table IV.

_{- Average}

and width

_of

the distribution

_of

_Hamming

distances between

_global

minima

_for

a = 0.1.

One of the

_{key points}

to _{the calculation of ac for the}

_binary

_perceptron

was to realize that

the solutions do not become

_arbitrarily

close to each other as a - _{a c.}In table V we show the

dependence

on a for N = 120 of the _meanand the width of the distribution _of

dH.

The solutions become closer to each other as a increases and the width decreases _very

slowly

with a. As observed in the

_beginning

of this section the

_sharp

reduction in the number of solutions prevents us of further

_{increasing a}

and then

_verifying

whether

_dH

goes to zero or

(9)

Table V.

_{- Average}

and width

_of

the distribution

_of

_Hamming

distances between

_global

minima

_for

N = 120.

4. Random walks on the _energy

_landscape.

In this section we

study

random walks on the energy

landscape

of the

binary

_perceptron.

The

quantity

we focus on is the _{average variation of energy}after s random

_steps.

Each

_step

consists of

_flipping

one

_randomly

chosen bond.

_Clearly

on one

_hand,

in a _verycorrelated

landscape

a random walker would need many

steps

to reach a

_position,

i.e. a vector

_J,

whose

energy differs

significantly

from the initial one. On the other

hand,

in a _veryuncorrelated

landscape

our random walker would need

_only

a few

_steps

to reach such a

_position.

Thus

studying

the _{variation of energy in}a random walk we

_expect

to obtain information about the

landscape’s

correlation.

The walker starts from a random initial

position,

i.e. from a vector J whose

_components

± 1 are chosen

_randomly

with

_{equal probability.}

We first calculate the

_probability

distribution

for an _energy

_change

DE due to

_{flipping m}

bonds of J _{and then express}it in terms of the

number of

_{steps s}

taken

_by

the random walker.

We will be

_mainly

interested in the first two moments of

_{P /1E.}

In the

_following

we

_only

present

and discuss the

_results,

_giving

the details of the calculations in the

_appendix.

We have found

Clearly

the first moment is zero since a bond

flip

can increase or _{decrease the energy of}a

random

_{configuration}

with

_{equal probability. Pm}

in the

_equation

for the second moment is

given by

for m « N and odd N > 1. From this

_equation

one can see that

_Pm

=

P m +

1 for odd m.

In the limit m =

f3 N

_{we can}write

P m -+ P ( B )

where

Before

_comparing

thèse results with our simulations m has to be related to the number of

steps

taken in our random walk. We thus ask for the

_probability

_{Ps, m}

of

_obtaining

m

_flipped

bonds in s random

steps.

This is a well known

_{problem proposed by}

Ehrenfest

_[16],

which can

(10)

The mean number of

flipped

bonds after s

_steps

is :

The

stationary

solution for the number of

_flipped

bonds exhibits a

_sharp

Gaussian

peak

of

width

N -

1/2 around the mean, so that we can

_replace

the number of

_flipped

bonds

_by

its mean

value. In this way we obtain the

dependence

of

_PAE

on the number of

_steps

s. We compare

( (AE)2 >

,

computed by replacing

3

in

equation (4.3) by

the r.h.s. of

equation (4.5)

and

substituting

the result in

_{equation (4.1 b),}

with our simulations in

_figure

5.

_{A very good}

convergence with N to the theoretical

_prediction

can be seen.

Some comments

_regarding

the

_{interpretation}

_{of ( (AE )2 >}

are in order. It is a

product

of two

factors,

the linear

_{p-behavior being displayed}

in one of them reflects the

dependence

on the

mapping

via a sum of _prandom

independent

variables. The nontrivial

dependence

on

s/N

is determined

_by

the

_{perceptron’s}

architecture.

Fig.

5. - Second moment

(scaled by

p)

of the distribution for AE in terms of the number of

_steps/N

for

N = 201

(D)

and 401

(o).

The lower curve is the theoretical

_prediction.

5. Conclusion.

The aim of this

_study

is to understand

_general

features of the

_binary

_perceptron

_landscape,

which would

_{eventually guide}

our intuition in the construction of

_{algorithms searching}

for

global

minima. Our results are restricted to

_{mapping problems}

which have at least one

solution

_[10].

The

_{general problem}

of

_finding

the

_global

minima of the

_binary

_perceptron

energy is

NP-complete

[ 17]

since it is

_equivalent

to

_{integer programming although}

a

typical

problem

may be tractable

[18].

Next we summarize our main results. We find that the number

of local minima increases

_{exponentially}

with

_N,

but far from a

_global

minimum

_they

become

ever shallower as

_they

_proliferate.

In the

vicinity

of a

_global

_{minimum the average energy of}

the local

minima

tends to

_decrease,

_indicating

the existence of a « Massif Central » a la

(11)

finding

the

_global

minimum does not become easier since the

_landscape

seems to take a

_golf

course

_shape.

An

_analysis

of the distribution of distances between

_global

minima obtained

_by

Simulated

_Annealing

for small « indicates that it tends to a delta function in the

thermodynamic

limit.

We

_sample

_{the average features of the energy}

_landscape

_by

a random walker

_starting

at a

random

_position.

We focus on the

_probability

distribution for a

_change

of DE as a function of

the number of

_{steps s}

taken

_by

the random walker. The first moment is zero whereas the

variance has a trivial

_dependence

on _pbut a _very

complex

dependence

on

_s/N

_reflecting

the

perceptron

architecture.

Somewhat

_{surprisingly,}

most of the features we studied were found to

_{depend trivially}

on p

(it gives

the energy

scale),

indicating

that it is the

_underlying

network architecture which

controls the

_landscape

structure. It should be

_interesting

to see how the

_landscape

features

studied in this paper

change

when

_mappings

with more

_complex

statistics are considered.

Acknowledgements.

JFF is indebted to R. Meir for

_illuminating

discussions on

_perceptrons

and for aid with the

simulations in section 3. JFF also thanks P. Moscato for discussions on Kauffman’s work and

J. J.

_Hopfield

for the

_hospitality

at Division of

_{Chemistry (Caltech),}

where this work was

started. The research of RK was

_{partly supported by CNPq}

and JFF was

_partly

_{supported by}

a

CNPq

fellowship during

his

_stay

at Caltech.

Appendix.

In this

_appendix

we

_present

the details of the calculation of

_{P âE,}

the

_probability

distribution for an _energy

_change

AE due to

_flipping

m bonds of a random vector J.

Initially

we note that the _pfields

where

_{TI r}

_{= Su sui’ are}

_independent

random variables since the

_{TI r’s}

are chosen

_{independently}

for different

_»’s.

Of course, in order to 1 be a solution of the

mapping

the

q Ms

must

_satisfy

for each g.

A

_change

in energy occurs, whenever some of the fields

_{change sign. Suppose}

this

_happens

for n

fields,

out _{of which n+}

_change

from

_plus

to

_minus,

whereas n- do the

opposite,

so that

n _{= n +}+ n - and AE = n -

_{2 n - .}

The

_probability

for the occurrence of this _energy

_change

is

with the notation

_{Ck = n! /k! (n - k )!}

and n_ =

(n - AE)/2.

Since the p fields hu are

independent,

Pn

is

_{given by}

where

_Pm

is the

_probability

for some field h IL to

_{change sign}

under a simultaneous

_flip

of m

bonds of J. In order to

_compute

_Pm

let us

drop

the

_pattern

indexes and divide the field

(12)

m

where z contains the m

_components

of J to be

_flipped.

The

_probability

_{for E}

_J

77

i to

equal

i = 1

Z is

with k =

(m -

z)/2,

where we have used the fact that the presence of the

_{.Ji s}

turns

z into a sum of

independent

random variables

despite

the constraint

(A2).

Similarly

the

distribution for z is

given by

with k =

(m - z ) /2.

Pm

will be the

_probability

for

_{1 il ( > 1 z 1.}

_Restricting

ourselves for

_simplicity

to odd

N > 1 we

_get

equation (4.2)

for m « N. The limit m =

8 N,

equation (4.3),

is obtained

Scientific,

Singapore)

1987.

[3]

OPPER M.,

Phys.

Rev. A 38

₍₁₉₈₈₎

₍₁₉₈₃₎

671.

[15]

KAUFFMAN _{S. A.,}Lectures in the Sciences

_{of Complexity,}

Ed. D. Stein

_{(Addison-Wesley,}

Longman)

1989, pp. 527.

[16]

KAC M. and LOGAN J., Fluctuations Eds. E. W. Montroll and J. L Lebowitz

_{(North-Holland,}

Amsterdam)

1979, pp. 1.

[17]

GAREY M. R. and JOHNSON D. _S.,

_Computers

and

_{Intractability :}

A Guide to the

Theory of

NP-Completeness (Freeman,

San

Francisco)

1979.