Replica Symmetry Breaking and the Kuhn-Tucker Cavity Method in Simple and Multilayer Perceptrons

(1)

HAL Id: jpa-00247330

https://hal.archives-ouvertes.fr/jpa-00247330

Submitted on 1 Jan 1997

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Replica Symmetry Breaking and the Kuhn-Tucker Cavity Method in Simple and Multilayer Perceptrons

F. Gerl, U. Krey

To cite this version:

F. Gerl, U. Krey. Replica Symmetry Breaking and the Kuhn-Tucker Cavity Method in Sim- ple and Multilayer Perceptrons. Journal de Physique I, EDP Sciences, 1997, 7 (2), pp.303-327.

�10.1051/jp1:1997147�. �jpa-00247330�

(2)

Replica Symlnetry Breaking and the Kuhn-Tucker Cavity Method in Silnple and Multilayer Perceptrons _(*)

F. Gerl

(~)

^and ^U.

Krey

_(~>**)

(~) Institut ffiir Theoretische

Physik

der Universitst

G0ttingen,

Bunsenstrasse 9, 30373

G6ttingen, Germany

(~) Institut fur

Physik

II der Universitst

Regensburg,

Universitfitsstrasse

31,

93040

Regensburg.

Germany

(Recei~.ed

lo

Jul,v

1995, re~.ised 29 Noiember 1995,

accepted

21 October

1996)

PACS.87.lo.+e

General,

theoretical. and mathematical

biophysics (including logic

of

biosystems,

_quantum

biology,

and relevant aspects of

thermodynamics,

information

theory, cybernetics,

and bionics

PACS.02.50.-r

Probability

theory, stochastic processes, and statistics

Abstract. Within _a Kuhn-Tucker cavity method introduced in _a former _paper,

we study optimal

stability )earning

for

situations,

where in the

replica

formalism the

replica

symmetry _may be

broken, namely (I)

the _case of _a

simple

_perceptron above the critical

loading,

and

(iii

the _case of

two-layer AND-perceptrons,

if _one learns with maximal

stability.

We find that the deviation of _our

cavity

solution from the

replica

symmetric _one in these _cases is _a clear indication of the

necessity

of

replica

symmetry

breaking.

In any case the cavity solution tends to underestimate the storage

capabilities

of the networks.

1. Introduction

In _a recent paper

[ii,

_we introduced _a _new kind of

cavity method,

with which _we could solve the

learning problem

for perceptrons ^with

Q-

and

Q'

state Potts model

input

^and output neurons.

In this

method,

the Kuhn-Tucker

conditions,

which lead to

optimal stability

in AdaTron

type learning

_processes, have been built into the

cavity

formulation. In

subsequent

_papers _we

extended this method to the

problem

of the

generalization ability

of _a perceptron ^trained ^for

optimal stability _[2j,

and to the

problem

^of

storing

of correlated

patterns [3j.

In the

present

paper, we

apply

_our method to _cases where in the

replica

formalism the

replica symmetry

^is

broken, namely (I)

to

perceptrons

with

Ising

_neurons above the critical

loading

and

(it)

to

two-layer

AND

perceptrons.

Cavity

ideas _were first

applied

to neural networks

by

^MAzard _[4]; and Kinzel and

Opper

_[5].

The

approach

of

Griniasty _[6j,

whose work will be discussed below in

comparison

with _our _own

findings, employs

ideas introduced

by

MAzard. Our formulation of the

cavity method,

_on the other

hand,

_was

inspired by

the

just-mentioned

work of Kinzel and

Opper.

In another

original approach, Wong

_[7j

employs

ideas which _are related to _ours and

Griniasty's.

(*)

based _on the PhD thesis of F. Gerl,

Regensburg,

1994

(**)

Author for

correspondence (e-mail: Uwe.Kreyflphysik.uni-regensburg.de)

©

Les

(ditions

_de

Physique

1997

(3)

2.

Simple Perceptrons

above the Critical

Loading

We _use the definitions _as in

[2],

^which are

simpler

than those _we had _to introduce for the Potts model _case

[lj.

Our

perceptron

^has ^N

input

_neurons k

=

I,..., N,

whose

possible

states _are

Sk "

+i,

and _one

output

_neuron

s',

also with

possible

states +I. The

couplings leading

^from the

input

_neurons to the

output

_neuron _are assumed to be real

numbers,

which _are collected _into the

coupling

vector J _:=

(Ji,

,

JN)

with the Euclidean

length

L _:=

(J(

=

(J)

_+.. ₊

J()~/~,

which is

kept fixed,

while the components

Jk

_are

adapted

to certain tasks

by "training processes"

(see below).

The relation between

input

and

output

^is

IN

s'

=

sign ~j _Jk

_Sk

,

(i)

k=1

which _can be considered _as _a

binary

classification of the

possible inputs,

_or as answers on

qitestions.

Now we assume that there is _a

training

set of _p

input-output pairs,

^where ^the

inputs

are

("

_.=

(((,.. ,(()

for _~

=

i,..

, p, and the

corresponding

desired

outputs

are

(P.

Here and in the

following,

unless otherwise

stated,

_we _assume that all

input components

^and ^the

outputs

are

independent

random

numbers,

which take the values +i with

equal probability.

The

optimization task,

which the

perceptron

^then ^has to fulfill in _course of the

training by adaptation

of the components of the

coupling

vector

J,

is the minimization of _a

Hamiltonian,

which

performs

_a

weighted

count of bad classifications

(see below)

of the

training examples.

Among

these Hamiltonians _are that of Gardner and Derrida

[8j,

^which

simply

counts the bad

classifications, namely

it _:=

~j _V(EP)

:=

~j

_@(~

_(E~ _{IL) ).} ₍₂₎

» v

Here

@(x)

= i for _x _>

0,

₌ 0

else,

and E" is defined _as usual _as the "oriented field"

acting

on

pattern

_~,

N

l~~ "

(~ ~ _~k~~

,

(~)

k=I

while L

=

(J(

as above. Bad classifications _are

those,

where E~

IL

is _<

~c, I.e. for

~ > 0

they

are not

necessarily

_wrong but lack _a

presclibed

amount of

stability,

which is measured

by

_~.

There _are _p _=. _n N such

"questions

with

prescribed answers",

I.e. ~ =

i,...,

_p, and it is assumed that the

loading parameter

_a remains

finite,

while N _~ _cc and _p _~ _cc.

Furthermore,

the

stability

parameter ~ is to be

maximized,

if error-free classification

(I.

e. ii ₌

0)

is

possible.

Gardner and Derrida [8j ^evaluated ^not

only

the number of _errors li above the critical

loading nc(~c),

^where ^for

positive

~c error-free classification is _no

longer possible,

but

they

also tried to evaluate the so-called

Almeida-Thoitlewline aAT(~).

Above this

line,

within the

replica formalism, replica symmetry breaking (RSB)

is necessary.

Surprisingly,

Gardner and Derrida found in [8] ^that OAT > ac for _~ _> 0:

However,

this _was due _to _a subtle

integration

_error

recently

discovered

by

Bouten

[9],

^who

proved

that _OAT

(~)

₌

ac(~),

_as

expected.

Bouten also

showed that

replica symmetry

is

always broken,

if the distribution of local fields _possesses _a gap. So these

results,

_on which _we will comment

later, provide

a test for _our

cavity approach.

Other

Hamiltonians,

_on which _we _comment _at _a later

stage,

_are

(see [10, iii)

the

Perceptron

function and the AdaTron function with

VIE~)

=

l~ (E~/L)l~

_@l~

(E~/L)1, ₍₄₎

with _x

= i and _z

=

2, respectively,

(4)

2. I. REPRESENTATION _oF OPTIMAL COUPLINGS. Instead of

minimizing

the

weighted

_error

rate

f(a, ~)

_:=

ii/p,

_one _can of _course also maximize

~(a, Ii

for the

given training

set. Since

we are above the critical

loading ac(~), f

_p of the

training examples

_are

badly

classified.

There is however _an

exponentially large

number

fit

of

partitions

of the

training

set into _a

"good

fraction" and _a "waste fraction" of size

ii Ii

_p and

f

_p,

respectively,

from which _one has to choose the

optimal

_one.

Namely, fit

_ct

exp(c p),

with _c ₌

f

In

f (I f) In(I f).

We nevertheless assume, that _every _one of these combinations has been trained for

optimal stability,

_e.g. with the AdaTron

algorithm [12j.

^The

optimal perceptron

with _error rate

f

is

then

given by

the

partition leading

to maximal _~.

The

couplings

of _a

perceptron

^trained ^for

optimal stability

_can

always

be

expressed

in the form

[12j

j~

=

~ _zv(»jv _(~)

p/ ^k

»el(i-jjpj

with the so-called

"embedding strengths"

x" of

patterns,

^which do not

belong

to the

fp badly

classified

patterns.

^As _can be shown

using Lagrangian multipliers [1,12j,

these

embedding strengths

have to fulfill the _so called Kuhn-Tucker

conditions,

_see below. Without restriction of

generality,

these _are

usually

formulated

by fixing

the

length

L of the

coupling

vector J in

such _a _way that the

stability

limit for _~ > 0

corresponds

to E"

=

i,

I.e. L

=

~~~

_With _this

convention,

which _we

always

_use in the

following,

unless otherwise

stated,

the Kuhn-Tucker

conditions _are

:

either

(x"

_> 0 and E"

=

i)

_or

(x"

₌ 0 and E" _>

1). (6)

In

fact,

the AdaTron

algorithm (without overrelaxation)

da"

=

max(-x~,

i

E") (sequentially

_or in

parallel) (7) simply

fixes the x"

repeatedly

to values which fulfill

(6):

If it converges, the conditions _are

necessarily obeyed.

Using

the "oriented correlation" matrix

B~~

=

("(~ ~j _((([ ₍₈₎

N

~

and the definitions

(3) and(5),

_we can write for the oriented field E"

E"

=

~j

B""x".

(9)

v

With the Kuhn-Tucker conditions _we have

finally

~2 jj~p~pu~u_ ~jj~v _(i~)

N

P>" P

The basic idea of the

cavity

method here is, to add _a

pattern,

I.e. the

"cavity",

to _a set of

perfectly

trained

patterns. By calculating

^the _necessary

adjustments

to embed this

pattern

we

gain

valuable information about the whole

system.

As in

iii,

_we therefore add a new

"question

with desired answer"

((°, (°)

to the

training

set,

assuming

one

simple groundstate.

We note that the distribution of the oriented fields

E° acting

on it, before _any further

adaptation

has been

performed,

which in the

following always

will be indicated

by

the

',

is

(5)

Gaussian with average ⁰ ^and ^variance

(J(~

=

L~

=

~~~.

Of _course the _error rate

f

has to

remain constant.

Therefore,

_as will be _seen

below,

the best

strategy is,

to

give

_up ^and add the

new

pattern

to the "waste fraction of the

training set",

if the oriented field

f° acting

on

(°

_is

smaller than _a certain number Z _< 1. For

self-consistency,

Z is determined

by

where t _:=

f° IL

and _z _:=

Z/L.

On the other

hand,

if

f°

_is _>

_1,

_then _{it is} _not _necessary _to

embed

f°

_in _the

_couplings,

_since _it

can be added to the set of those

correctly

classified

training patterns,

which need _no

explicit embedding (see _{[13j) by}

the AdaTron

algorithm, [12j. Thus, only

for Z _<

f°

_<

_I,

_I.e.

z < t < ~, the _new pattern must be embedded in the

couplings,

and the

implementation strength

xP of the other

patterns

must be corrected

by

dxP

(see below),

to compensate for the influence of the _new pattern.

As _we have

just

_seen; the

parallel

AdaTron

algorithm

would in _a first step

try

to embed

(°;

if necessary, with the "bare"

embedding

X°

= i

f°.

_This

_generates

a

perturbation ^BP°X°

of the pattern _~. ^In a second

parallel

_step, all those patterns ~, which _are stored

explicitly,

then have to

respond by

d~"

=

-B"°x°

_to _the disturbation

by x°,

because the Kuhn-Tucker conditions still have _to be fulfilled. At

pattern

⁰ ^these corrections

generate

a response field

gx°

=

~j B°~dx~

=

~j _(B°~)~x°

,

(12)

»,i~M>oj v.uM>ol

which reditces the effect of the AdaTron step with

x°. Therefore,

_one has to enhance

x°

= i

f°

by

_an

amplification factor I/(I

+

g) (> I).

Now the

(B°")~

_are

I/N

_on _average, _see

(8).

Therefore _one

gets immediately

~j _(B°")~

= a

P(~~

_>

0)

_=: _aeR.

(13)

M,j~H>0j

aeR is the

percentage

^of ^exhausted

degrees

of

freedom, I.e.,

if pattern 0 is _as

typical

_as the other random

patterns

_~ ₌

l,.

, p, one has to

postulate

~

g = -ae~ = -a

P(Z

_<

^E°

<

ii

= -a

/

_Dt

2

(-11. (14)

z

Note that _we have constructed _our

algorithm

in such _a way that the further correction

steps,

which _are necessary ^to

regain

the Kuhn-flicker conditions

exactly,

do not

change

the response at

pattern

0 in the

thermodynamic

limit ~v.hen convergence is achieved. A

proof

is

given

in the

Appendix,

where _we also show that in this limit _one _can _assume that the x~ and BP" _are

statistically independent.

With _our

approach,

the final

embedding strength

for

(°

_is then

given by (i f°) /(i

₊

g).

Furthermore,

_~v.e _now

identify

the distribution of

embedding strengths

of all

patterns

with that of

pattern (°. _Putting

_the Gaussian distribution of the

cavity

field

fi°,

_felt _before

_training,

into

(10) gives

L~

~

= oL

/

Vt ^~

(is)

i ₊ g With L

=

~~~

_this

implies

~

i =

~ /

_Dt

_(~ _t)~ ₍₁₆₎

+ §

(6)

After

multiplication

with i + g, this

finally

leads to our '~Kuhn-Tucker

cavity

result"

~ ~

l

= a

/

_Dt

+ a~c

/

_Dt

_(~ _t) ₍₁₇₎

~

= a

(1

+

~2j

vi ₊ ^°~

(e-~~/2 e-z~/2j _j~~)

@

This result will _now be

compared

with that of the

cavity approaches

of

Griniasty _[6j,

and

lvong [7j.

^Their ^different ^result is

equivalent

to the

replica

formalism in the

replica-symmetric approximation [8,10j;

therefore _~v.e _use the suffix "RS" _to

indicati

_their result. We have

already

shown in

[ii

that below _ac, where

replica-symmetry

is exact, our Kuhn-Tucker

cavity approach

and the RS

approach

_agree; however in the present

situation,

where _a is _> _ac,

they disagree.

When

Griniasty

derives the constant of

integration

in

[6j,

^he uses a

simple

method which in

all _cases known to us

gives

the correct RS result: The reaction factor _g is assumed to

vanish,

while at the _same time also the B"U with _~

#

_v _are

neglected.

With these _two

neglections

_one

obtains instead of

equation (10)

~ ~

(19)

~2

=

X(s ~RS

_XRS

fit

~~~~~~

N

M

Thus,

instead of

equations (17, 18),

the "RS"

cavity

result would be

~

l = aRs Dt

(~ t)~

Equation (20),

which is identical with the result obtained

by Griniasty

_[6j _or

Wong [7j,

agrees with the result of the

replica

calculation of Gardner and Derrida

[8,10j

from the

replica-

symmetric approximation;

the

expression

abbreviated

by

M in

equation (20) yields

the differ-

ence between _our _acav

(=

_a obtained from

Eq. (18))

and _aRs. Because of _z < ~, it is M < 0 and therefore _aRs > acav. For _z _~ _-cc, I.e. for

f

_~

0,

one has M

=

0,

^and ^thus for _a < ac

it is _aRs

= ncav, as

already

mentioned.

Although

the numerical results

differ,

there is _an

interesting

formal

relationship

between the basic

equations

in

Wong's approach

_[7j and the _one

presented

here:

equation (6)

in [7j

agrees

formally

with _our

equation (16),

and _our reaction

strength

_g

corresponds

to -a X in [7]. ^However -a x in [7j differs from _our g, which is

given by equation (RI, by

a term

corresponding just

to the

expression

JzI in

equation (20).

The difference arises from _a d-function contribution to

I'(t)

at t ₌ _z in

[7j,

^where

lit)

and t in [7j are the oriented fields

after training

and

before training, respectively. Probably

this

difference,

which

only

_comes into

play

above _ac, where M is

# 0,

is relevant with

respect

to the combinatorial

explosion,

which conflicts with the

assumption

^of a

unique optimum

made in all the

approaches.

Research _on this

problem

is in progress.

In

Figure i,

^for error rates of

f

₌ 0.2 and

0.02,

the

learning capacities a(~, f)

= acav

and _aRs,

respectively,

_as obtained from _our Kuhn-flicker

cavity theory, equation (17),

and

(7)

Oc~

3 aRs

2 ~ °~

i

J=0 02

0

0 2

Fig.

I. For two different _error _rates

f

the storage capacity o is

presented

_as _a function of the stability _~ for both estimates

(17)

and

(20).

with the

replica-symmetric approximation (20) respectively,

_are

plotted against

the

stability

~.

Obviously,

the

learning capacity

^obtained ^with ^the Kuhn-flicker

cavity theory

^is

lower, particularly

at small values of _~, i.e. for _~

= 0 and

f

= 0.2, _acav is

only 10/3,

whereas _aRs is 15.53. This _means at the _same time that for

given

_~ and _a, _our _error rate

f

would be

larger

than that obtained with

equation (20).

This is _a

strong

hint that the

assumption

of _a

unique optimum

for the distribution of those

patterns,

^which are

put

to

waste,

^is wrong.

Therefore,

one cannot say in

advance,

which

approach

is better: Rather what has been

gained

is the

following:

^Since both

approaches

_agree below

ac(~, f

=

0),

but not for

f

>

0,

we can use the

different results _as _a criterion for the

necessity

of

replica-symmetry breaking

for

f

>

0,

which agrees with the recent

rigorous proof

of

[9j.

2.2. COMPARISON _wiTH _A ONE-STEP-RSB CALCULATION.

Majer

et al.

[llj

have per-

formed _a calculation above _ac

(~, f

=

0)

^within a

one-step replica-symmetry-breaking approxi-

mation. From

equation (2), they

obtained the

following

result for the free _energy

/E)

"

alla, ~)l

~~~ fi~iw _2x(1 ~wAq)

^~

~~~~2~u~~~~

+

°

/ _Dz0

_In

/ ^~ _Dzi

exp

'°

(A

zi

fi _)~

ma - 2

/q

+

exPi-Wxi4~i~R~1

⁺ ^4~1

ii

,

(211

with A

= ~

z0@

^and

^Aq

₌ ^1- _q0. ^The parameters _qo, w and _x have to be chosen such

that the number of _errors is maximized. In the limit _qo _~ 0 _one

regains

the RS result.

In

Figure

2 the

stability

_~ is

plotted against

the _error rate

f

for three different values of

a.

Obviously

there is

only

_a small difference between the results of the RS and the

i-step

RSB

calculation,

in contrast to _our

cavity results,

which differ

considerably

from both

replica calculations,

I.e. with _our

estimate,

the

stability

increases much _more

slowly

with

increasing

a > ac.

(8)

3.0

Ncav

~~s ~/~

KRSB ,I'

/"

~

~/

~~

2.4

~/"

~/~

/

~

~/

~'

~~

II l'~

-~

_~

l-B ~/~

-~~

_~~

~

-~ ~

if

-'~

_~' l.2

--~

-~

-~ ~'

~~~~~

a=2

0.0

0.0 0.1 0.2 0.3 0.4 0.5

f

Fig.

2. The

stability

_~ is

presented

_as _a function of the _error rate

f

for

a =

2.0,

0.8 and 0.4 Results

are shown for the

cavity-method

and the

replica

calculation in RS and

1-step-RSB

approximation.

For

f

<

I,

_we can compare these different estimates with

simple

simulations: For _a _< 2

one can train

perceptrons

to

optimal stability by

_means of the AdaTron

algorithm.

If _one then

skips

the

pattern

with the

largest embedding strength,

I.e. the _one which _was most difficult to store, ^and ^re-learns ^the

remaining patterns,

one

gets

an enhanced

stability,

which agrees within the _error limit with the

replica calculation,

and not with the

cavity results,

_as _we found in the simulations.

Thus,

for

f

_>

0,

_our Kuhn-flicker

cavity

method

yields

_a non-sufficient

approximation

for

~(a, f).

In the

following

_we

_try

to find the _reasons for the

discrepancy

and to estimate at the

same time the

quality

of the

I-step

RSB calculation. For this purpose, we need the distribution of the oriented fields

t,

^which

according

to

Majer

et al.

[iii

is

where V has

been

efined for

such

that

_the ^is

The

arameters qo>

w

and x are

determined for _given a

and

~

from

equation

₍₂₂₎ ^he of

the

local fields can

be

determined; in particular

a

b-peak

at t

₌ i is _btained, _which represents

the

_patterns, which

are

the training.

However,our interest

is in the field-distribution

before the dditional training.

This ^tribution can be obtained by

in

[6j

to

the present

case.

ccording to [6j, after

addition of the pattern f°, the

quantities zo/fi

the local oriented_fields_felt before and

after

the

additional

training, respectively.

The

^exact value

of

lo esultsfrom

a

ompromise between the ncrease

the

nergy

V(lo

(9)

representing

^the increase in energy of the

ii f)p patterns,

which had been stored before the addition of the _new pattern.

Thus _we _can determine the field before the corrections

by simply eliminating V(lo)

and

lo

in

(22):

With the abbreviations _uo

"

zo/fi

and _vi

"

zi/li,

the

integrand

in

equation (22)

is

+

exP(-Wx) 9(~ /&

(Ho ⁺

vi))

₊

9(uo

+ vi

)ld(t

(Ho +

vi))

~~~~~~~~2j~(~~~~~~~~

~ ^~~~~~

^~~~~~ ^~~

~~~

+

exp( -wx)@(~ /& _t)

₊

_9(t _~) ₍₂₃₎

Here the first term describes the

patterns,

^which ^have ^been

successfully

embedded

(I.e.

with V =

0),

the second term

those,

which _are

badly

classified _and put to the waste

(I.e.

with

V = I and

lo zo@ zi@

"

0),

and the final term the patterns which _are stored without

embedding.

The denominator

fit

in

(23)

is the

integral

_over all fields t in

(23)

and _serves for normalization.

Thus,

_we have _no

longer

a Gaussian field-distribution before

learning.

In the

cavity picture,

this non-Gaussian distribution of the fields

acting

_on _an untrained

pattern

stems from the fact that there _are _now _many

ground

states available. If we add _a

pattern

to such _a

multitude,

_every

groundstate

will still _see _a Gaussian distribution of the local field

I°.

_However _the

_particular ground

state, which after

training

_appears _as the _one with the

highest stability,

is

likely

to have _a

higher-than-average

local field for the _new

pattern,

thus

being

able to store it _more

easily.

Our field-distribution before

learning

is then effected

by

this selection. At the end of this section we will

shortly

comment _on how _an intrinsic

cavity

approach

for RSB should take this selection effect into account.

In

Figure 3,

for _~

= l and _a

=

i,

the local fields _are

presented

for the three

approximations mentioned, namely

for _our

cavity method,

and for the RS and RSBI

approximations.

^One can see that in

RSBI,

the distribution of the local fields before

learning

the additional pattern,

although strongly non-Gaussian,

is still

everywhere continuous,

and for t ₌ _~ _even

continuously

differentiable.

Comparing

the results of _our

cavity theory

with the RS

approximation.

_one _can _see that for

increasing

_a _> _ac

. on the _one

hand,

the _error rate

f

obtained with the RS

theory

is much better than that obtained with _our

cavity

method

(if

the _error rate obtained with RSBI is considered _as

target approximation),

_see

Figure 2;

whereas

. on the other

hand,

the value of the

limiting negative

field value t _:= to _"

z/~ (m

-0.5 in

Fig. 3),

below which patterns are no

longer learnable,

is almost

exactly

the _same with

our

simple cavity approach

and the much _more

complicated

RSBI calculation.

If _we

accept

^the local fields from RSBI _as _a

good approximation,

then

again

the Kuhn- Tucker conditions

(6)

must be fulfilled after

learning,

and with

(17)

_we _can calculate the

loading

_aRsB-KT from the RSBI

field-distribution,

calculated with the parameters q, w and _x.

To this _purpose, in the above-mentioned formula _one

only

has _to

replace

the Gaussian _measure Dt

by

the RSBI

field-distribution

_of _those

_patterns,

_which

are

explicitly embedded,

I.e. from

to

~(= ii

in

Figure

3. This is related to nRsBi in _a similar _way _as _acav is related to aRs

(10)

o 7 ^P(cavJ

P(Rsi /~h

--P<RSBJ /

0.6 /

0.5 '

Q2 0.4

C-

0.3 '

'I

0.2 ' '

0.I -1, /

0.0

-3 -2 0 2 3

t

Fig.

3. The

probability densit» P(t)

of the local fields t before

learning

is presented as a function of t at _a

= I and _~

= l.

Again

results _are sho~vn for the

cavity-iuethod

and the

replica

calculation in RS and

I-step-RSB

approximation. The distribution after

training

is given by

pushing

the middle

segment to _a

d-peak

at _~ ₌ l in every case. The

cavity-method

vields _an _error rate

f

= o.29092, the

replica

method

gives f

₌ o.13073 in RS and

f

= o.13576 in

1-step-RSB approximation.

The

fraction of

explicitly

learned patterns is o.55042 for the

cavity approximation

and o.71062 and o.66745

respectively

for the

replica

calculations.

For three values of _~, the

resplt fKT

_i=

fRsBi-KT

of such _a calculation is

presented

in

Figure

4. For

comparison

also the results of the _two

simple approximations, fcav

and

fRs,

_are

presented, together

with the RSBI

replica

result

fRsBi Obviously,

the

improved cavity

result

fKT

is

only slightly higher

than

fRsBi,

which is

already

_a criterion for the

quality

of the RSBI

calculation

compared

with the RS

theory.

One expects therefore that the

rigorous

result should lie between _our

fKT

_as _an _upper and

fRsBi

_as a lower

bound,

_so that further

replica symmetry breaking

steps ^should

give only

a

slight improvement. Precisely,

_we

expect

both _a slow monotonous increase of the _error rate

fRsBn

with

increasing

_n in _a RSBn

calculation, [14j,

and _a slow monotonous decrease of

fRsBn-KT,

and at the _same time _a decrease of the relative number of

explicitly

embedded

patterns and of the

averaged embedding strength.

As _a consequence, aRsBn-KT, which is calculated from _a formula

analogous

to

(17),

increases

slowly.

For _n _~ _cc, the RSBn

replica

result and the RSBn-KT

cavity

result derived from the RSBn field-distribution should _agree.

We have thus found _a method

leading

to an upper bound for the _error rate

f(a,

_~c) _as _a

function of _a and _~c, if the local fields before

re-learning

_are known. In

[14j,

^Fontanari ^and Theumann derive _an upper bound for

f

at _~ ₌

0,

^evaluated ⁱⁿ ^RS

approximation

at the AT line for finite temperatures: ^For o

$ 50,

our RSBI-KT _upper bound is

lower,

whereas for

o > 50 the bound in

Figure

2 of

[14]

^is ^lower.

That the RSBI itself is _not

yet sufficient,

has

already

been shown

by Majer

and

Engel [i ii.

Moreover,

also the fact that after _our RSBI-KT

cavity learning

there is still _a _gap in the

field-distribution,

which is

according

to Bouten [9j

responsible

for

RSB, points

to this fact.

Additionally,

in this

approximation

the condition is violated that the number of

explicitly

embedded

patterns

should not exceed the number N of

coupling degrees

of

freedom,

_see _equa-

tion

(23).

(11)

0.5

,,

fca% _""

IRS _,"

)I'

I' /

,' 1'

0.4 ^fKT

,,'$/~~

,' /

l' /~

0.3

w-

is

0.2

,/

,"/

,'$"

-$~ ,'1' /$~

,'~

01 /~~~

n~o5

,w<~

/A~II ,/

~ w0

~'~

l.0 2.0 3.0

a

Fig.

4. The _error _rate

f

is

presented

_as _a function of _o for _~

= l.25, 0.5 and 0. The results _are

given

for the

cavity method,

the RS and

I-step-RSB approximation

as well _as the result

fKT,

which is calculated

by putting together

the Kuhn-Tucker conditions and the local fields of the RSB-solution.

The best estimate is ver»

likely

between these

RSB-graphs,

I.e.

fRsB

and

fKT,

and

probably

closer to /%SB.

In connection with

Wong's

_paper _[7j it is

interesting

to note the

following: Using

the RSBI distribution of fields in connection with

equation (6)

in [7j one

reproduces self-consistently

the RSBI result for _a of

hlajer

and

Engel [iii.

The

slight

difference to _our result arises

again

from _a contribution of _a

d-peak,

which is

present

in

Wong's approach,

but not in our's. This

contribution,

much smaller _now, which in turn

originates

from the mentioned _gap, leads

again

to terms

corresponding formally

to M in

equation (20).

Thus _once _more the difference between

our results and those obtained with

Wong's approach,

both

starting

with the above-mentioned RSBI field

distribution,

shows that also the RSBI

calculation,

albeit _a

good approximation,

is not

yet

exact.

The results discussed here have

recently

been

supported by

_a

complicated 2-step-RSB

calculation:

Whyte

and

Sherrington

find in

[15j

^that ⁱⁿ ^fact ^the ^next step ⁱⁿ the

replica breaking

scheme raises the _error rate

f,

but

only by

an amount which is

typically O(10~~).

This agrees very well with the

predictions

that _we could make from _our

findings.

As

already mentioned,

the results of the

cavity

methods _are

necessarily

insufficient above _ac, since the combinatorial

possibilities

to select the "waste

patterns"

_are not considered.

However,

with our

approach

it should also be

possible

to

get equations,

which _are

equivalent

to

RSBI,

without

using

the

replica trick,

I.e. _as in the seminal book

[16j

on

spin glasses.

To achieve

this,

_one considers _a multitude of

ground

states, which _are ordered in _an ultra- metric structure. If _one decreases the

stability constraints,

the number of different

ground

states is assumed to increase

exponentially.

If _one _now adds _a

pattern

to this

ensemble,

a

lower

embedding strength

is therefore

favoured,

which

gives

a

higher storage capacity

_com-

pared

to the Gaussian _one of _our

equation (17).

This

approach

has the

appealing trait,

that the number of

patterns,

^which ^have to be

misclassified, _gets

lower _as _more

degrees

^of freedom become available for the

approximation.

Within the

replica method,

in contrast, one has to take the "worst" value of the order

parameters

^for technical _reasons. Work _on _an intrinsic

cavity approach

for _a _> _ac is in progress.

(12)

inputlayer

with 2N

flexible

weights

hidden

layer

weights

= I

output neuron with threshold 1/2

Fig.

5. The AND-machine with tree structure

(NRF).

The

weights

between the intermediate

layer

and the output neuron are fixed. A threshold between o and I _at the output-neuron takes _care that there

only

is _a positive _output if both intermediate _neurons _are

positive.

3. The AND-Machine

For

multilayer perceptrons, replica symmetry breaking

is _a

general phenomenon,

when

optimal learning capacity

for _a

given stability

_or

optimal stability

for

given capacity

_are

required.

As

a consequence, the

optimal capacity

is not

easily

estimated.

Griniasty

and Grossman

[17j

have calculated the

storage capacity

of the AND-machine and gave

arguments,

^which ^made a

suppression

of

replica symmetry breaking

credible.

However,

with _our method _we _are able to check their

proposal

and find that

replica

_symmetry is broken.

3.i. MODEL DESCRIPTION. The AND-machine is _a

simple example

for _a

multi-layer

_per-

ceptron. Multilayer perceptrons

have been introduced to _overcome the limitations of

single layer

perceptrons, ^which are limited to

linearly separable classifications [18j,

whereas with

sufficiently

_many units in the additional

layer(s)

between the

input

and

output units,

_every

Boolean function _can be

implemented. However,

at the _same

time,

the

analytical

treatment of the models becomes much _more

complicated,

and in

general,

the _space of solutions is _no

longer simply

connected. The RS

approximation yields

then results which contradict the _ex- act bounds derived

by

Mitchison and Durbin

[19j. However,

^the necessary RSB calculation is

rather

complicated

for such models. A

multilayer model,

which has been studied in this _way, is the so-called committee machine with 3

(or

_any other odd number

Nh of)

hidden itnits

[20, 21j.

Here the intermediate hidden

layer

^consists ^of ^three neurons, and the

output

is

given by

the

majority

vote of these hidden _neurons. The maximal

capacity

of the system decreases from ac m 4.02 for the RS

approximation

to ac m 3.0 for the RSB-ansatz. This is for the _case of

non-overlapping receptive fields,

_see

Figure

5.

In contrast to the mentioned _case of the committee

machine,

for the AND machine the

number

Nh

of intermediate _neurons is

arbitrary (> 2);

here the

output

unit

gives

a

positive

vote iff all intermediate _neurons vote with +i. This AND machine _was at first treated

by Griniasty

and Grossman

[17j,

both with

non-overlapping

and also with

overlapping receptive

fields

(NRF

and ORF _cases,

respectively).

Within _a

replica-symmetric approach,

the authors find for the _case of _an

equal

number of

patterns

^with output +i and i for N

= 2 intermediate

neurons a critical

capacity

of _ac _m 3.5

(ac

_zi 3.3 with

simulations),

for the _case of

overlapping

receptive fields,

and _ac _m 3.66 for the NRF _case

iii].

At the end of their paper,

Griniasty

and Grossman discuss the validitv of their RS

approximation

and find

support

for their

assumption

(13)

that

only

_a

single

minimum contributes _to their solution.

Griniasty

_[6j

repeated

the calculation with his

cavity

method and

got

the _same results.

Wong's

_recent result

iii

_on this

problem

will be commented _upon below. In the

following,

we use our different

cavity

method.

3.2. CALCULATION oF THE LEARNING CAPACITY. As _we have _seen, _our method

yields

_a

convenient way ^to

recognize

the

necessity

^of RSB.

Xdditionally

_one

gets

a

rough quantitative estimate,

to which extent the actual solution is

approximated.

At first _we

study

the AND

machine with

Nh

" 2 and

non-overlapping receptive

fields

(NRF case).

Since this does not

necessitate _an additional effort and the

argumentation

^is

simplified,

we assume that both subnetworks _are trained with

optimal stability,,

_as

long

as the

optimal capacity

_ac is _not

yet

reached.

lve _assume _p+

=

a+N

and _p-

= a-N

patterns

^with

positive respectively negative

output.

Then the

parameter

b is defined via

1 + b

a~ =

j

^°.

(24)

The

(+)-patterns

_must be trained in both

subnets,

since both intermediate _neurons _must _vote

positively,

whereas for the

(-)-patterns

_one _can choose at will _a subnet with _a

negative

vote.

From this

fact, Griniasty

and Grossman

[iii

derive the

sharp

bounds

~

§ _ac f ^~

(25)

for the maximal

capacity

_ac, which also

applies

with

overlapping receptive

fields

(ORF case).

Moreover, recently

Wendemuth

[22j

w~s able _to

generalize

the method of Mitchison and Durbin

[19j

^and ^obtained ^for ^the AND-machine _as _a lower bound the

sharper

condition

4

(1 hi iog~ (3j ^~

^°~' ^~~~~

which also

applies

both to the NRF and the ORF _case.

Let _us consider

fictitiously

all

possibilities

to distribute the

responsibility

for the

(- )-patterns

among the

Nh subnetworks;

these patterns,

including

the

(+)-patterns,

shall then be trained to

optimal stability.

^Afterwards _we select that

distribution,

which leads to maximal

stability

~ =

min(~ci

_~c2).

In both subnetworks

=

1,

2 _we have N

input

_neurons, and in both subnetworks the

couplings

are defined with

embedding strengths x(

_as

J,k

_"

~x(("((~. ₍₂₇₎

~

»

Here

("

is the desired final

output.

^As

before,

_we define the oriented correlation matrix

B,

++

the

length

L of the

coupling

vectors and the oriented local fields E

~y»u ^I

~v ~u j _~»

_~u

_j~~j

~ p/ ^>k ^~k

k

Li

=

JIJ>

=

j~oxiBi~xi ₁₂₉₎

P>U

E~

=

~jB"~~). (30)

u

(14)

The

embedding strengths x(,

for I

=

1,

2 and all ~,

together

with the

Ef,

fulfill

again

the KT conditions

either

(x(

> 0 and

Ef

=

ii

_or

(x$

₌ 0 and

Ef

>

1). (31)

Additionally,

_we know that for _our

optimal choice,

_a

(- )-pattern

_can have _~

positive embedding strength only

at _one of the

subnets,

since otherwise the less stable subnet could enhance the

stability by reducing

its

unnecessarily positive x$

to 0 and

relearning

the other

patterns.

Furthermore,

in the

thermodynamic limit,

both stabilities should be

equal,

I.e.

L(

=

L]

=:

L~.

Let _us _assume

again

that there is

only

_one

groundstate,

and add _a _new pattern.

Again

this feels

a

normally

distributed random oriented field

E)

with variance

L~

_in _each _subnet. _A

(+)-pattern

must be classified

correctly

in both subnets. So _we

need,

in _case of

@)

<

1, positive embedding strengths x)

₌

(i E)) /(i

+ gi

),

where the response factors _gi have still to be determined. In contrast, for _a

(-)-pattern

_we _can choose the subnet with

negative

intermediate

output.

As

we will _see

below,

it is best to embed the

(-)-pattern

in that

subnet,

where the oriented field is

larger,

so that the

embedding strength

_can be smaller. The field-distribution

P(t)

for the

larger

oriented field

t,

^with ^normalized

couplings, (E$~~

₌

Lt),

_can be calculated from the Gaussian distribution of the fields

(ti, t2)

of the two sublattices _as follows:

j

~ t

~ ~'~~

~ _~~

/h

^~ ^~~

^~~~

If

again

_we _assume that x" and BPU _are

uncorrelated,

then the _answer _g of the patterns, which had

already

been

implemented

and _now must

keep

the KT

conditions,

is similar _to

equation (14), namely

£ _(B°~)~

= -ae~ =

-aP(xP

_>

0). (33)

(xf>o)

Identifying again

the distribution of the

embedding strengths

^with ^the

probability

distribution for

x°

and

normalizing

the

couplings

to

i,

_we

get

~ ~

g = -a+

/

_Vi

a-

/ _Vi16(t). ₍₃₄₎

Here the first and second

parts

^describe the influence of

(+)-

and

(-)-patterns, whereby compared

with

equation (32)

_a factor

1/2

_~v~as taken into

account,

since

only

_one of the two subnets is needed for

(-)-patterns.

The

capacity

for finite stabilities is

again

calculated from the KT conditions

(31) through

~2

~

jj~m~Pu~u

_~

jj

~M~M

=

~j~M (35)

1 fi~ ^? fi~ ^~ ~ ^l

vu p p

~

= a

/ dx, x,w(x~)

=

~' /

_Vt

_(o+

+

a-16(t))

_(~c

t). (36)

+ g

-c~

Replica Symmetry Breaking and the Kuhn-Tucker Cavity Method in Simple and Multilayer Perceptrons

HAL Id: jpa-00247330

https://hal.archives-ouvertes.fr/jpa-00247330

Submitted on 1 Jan 1997

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Replica Symmetry Breaking and the Kuhn-Tucker Cavity Method in Simple and Multilayer Perceptrons

F. Gerl, U. Krey

To cite this version:

F. Gerl, U. Krey. Replica Symmetry Breaking and the Kuhn-Tucker Cavity Method in Sim- ple and Multilayer Perceptrons. Journal de Physique I, EDP Sciences, 1997, 7 (2), pp.303-327.

�10.1051/jp1:1997147�. �jpa-00247330�

Replica Symlnetry Breaking and the Kuhn-Tucker Cavity Method in Silnple and Multilayer Perceptrons (*)

(~)

Krey

Physik

G0ttingen,

G6ttingen, Germany

Physik

Regensburg,

31,

Regensburg.

Germany

(Recei~.ed

Jul,v

accepted

1996)

General,

biophysics (including logic

biosystems,

biology,

thermodynamics,

theory, cybernetics,

Probability

stability )earning

situations,

replica

replica

broken, namely (I)

simple

loading,

(iii

two-layer AND-perceptrons,

stability.

cavity

replica

necessity

replica

breaking.

capabilities

[ii,

cavity method,

learning problem

Q-

Q'

input

method,

conditions,

optimal stability

type learning

cavity

subsequent

problem

generalization ability

optimal stability [2j,

problem

storing

patterns [3j.

present

apply

replica

replica symmetry

broken, namely (I)

perceptrons

Ising

loading

(it)

two-layer

perceptrons.

Cavity

applied

Replica Symlnetry Breaking and the Kuhn-Tucker Cavity Method in Silnple and Multilayer Perceptrons _(*)

optimal stability _[2j,

Griniasty _[6j,

sign ~j _Jk