Ronny Meir, Extensions of a solvable feed forward neural network. Journal de Physique, 1988, 49 (2), pp. 201-213. doi:10.1051/jphys:01988004902020100. HAL: jpa-00210685.

Extensions of a solvable feed forward neural network

Ronny Meir

Department of Electronics, Weizmann Institute of Science, Rehovot 76100, Israel

(Received 9 September 1987, accepted 21 October 1987)

Résumé. - I extend the class of neural-network models with unidirectional connections discussed in a previous publication. I propose three main modifications: a) weighted learning, b) learning of correlated patterns and c) the effect of synaptic noise. Each of the models studied is solved exactly and layer-to-layer recursion relations are obtained.

Abstract. - I extend the class of exactly solvable feed-forward neural networks discussed in a previous publication. Three basic modifications of the original model are proposed: a) learning with weights, b) learning biased patterns and c) the effect of static synaptic noise. Each of the models studied is solved exactly and layer-to-layer recursion relations are obtained.

Classification

Physics Abstracts: 05.20 - 05.40 - 75.10H - 87.30

1. Introduction.

The original Little [1] and Hopfield [2] models of neural networks have been much extended over the past few years, following the work of Amit et al. [3]. These extensions have gone in various directions. The first type of extension was a modification of the learning rules to incorporate effects such as forgetting [4, 5], the storage of correlated patterns [6-9] and more. These models still followed the original Hopfield paradigm in the sense that they consist of a network of symmetrically connected binary variables (spins) with 2-spin interactions of infinite-range type. The method of solution in these cases (when an exact solution exists) is the replica method, following the original work of Amit et al. [3].

Another class of models eliminates the restriction of symmetric bonds [10, 11] inherent to the Hopfield model and the extensions discussed above. Technically, once the bonds are made asymmetric the model is no longer Hamiltonian and the replica method cannot be used. Recently, Derrida and co-workers [12, 13] have shown that one can solve exactly the dynamics of a class of heavily diluted asymmetric neural networks. It turns out that the dilution and asymmetry make the model rather easily soluble.

Another extension we consider is that of layered architectures. This type of model, on which we will focus in what follows, has been studied extensively by computer scientists over the past few decades. The origin of this class of models can be traced back to the idea of the perceptron introduced by Rosenblatt [14] and studied in detail by Minsky and Papert [15], who demonstrated the limits of the single-layer perceptron. In the last few years much work has been done in generalizing the original ideas of Rosenblatt to multi-layered systems. The main feature of this class which distinguishes it from the previous classes is the existence of « hidden units ». These systems usually [16] consist of an input unit, an output unit and intermediate « hidden » units that do the processing. Contact with the external world is made only via the input and output units. No external constraints are placed on the hidden units, and they are used to construct good « internal representations » of the environment. Recently, Rumelhart et al. [17] have found an algorithm called « back propagation » which solves many of the problems encountered in the earlier perceptron models. Multi-layered models with couplings between and within layers were introduced into the physics literature by Huberman et al. [18]. Later, Domany et al. [19] introduced layered feed-forward networks with no couplings within layers.

The dynamics of a model feed-forward neural network have recently been solved by Meir and Domany [20, 21] (to be referred to as MD). This model will be briefly described in the next section.

Article published online by EDP Sciences and available at http://dx.doi.org/10.1051/jphys:01988004902020100

This paper generalizes our previous work in various directions: 1) incorporation of different learning schemes, namely the weighted learning of Nadal et al. [4]; 2) learning biased patterns [7], i.e. patterns whose level of activity differs from 50 %; 3) the effect of static noise in the synapses. The paper is organized as follows. In section 2 I give a detailed description of the network, its architecture and operation, and a detailed description of the extensions considered. The exact analytic solution of the various extensions to the original MD model is given in section 3 in the form of layer-to-layer recursion relations, while section 4 contains an analysis of the results. Section 5 summarizes our findings.

2. Definition of the model.

The original model we studied is the following. Consider L layers; each contains N cells (spins), with a binary variable S_i^l = ±1 associated with cell i of layer l. Each cell is connected to all cells of the neighbouring layers. The bonds are, however, unidirectional: the state of layer l+1 is determined by the state (at the previous time step) of layer l according to a probabilistic rule. The dynamic process is one which sets the layers sequentially: the input corresponds to setting the first layer in an initial state, S^1. At the next time step the second layer is set in state S^2, and so on. The probability that the i-th spin in the (l+1)-th layer has the value S_i^{l+1}, given that on the previous layer l the cells are in state S^l, is taken to be

P(S_i^{l+1} | S^l) = exp(β S_i^{l+1} h_i^{l+1}) / [2 cosh(β h_i^{l+1})] ,    (1)

where

h_i^{l+1} = Σ_j J_ij^l S_j^l    (2)

is the field produced by the spins of layer l at site i of layer l+1. The parameter β = 1/T governs the stochasticity of the dynamics, which is deterministic for T → 0 (or β → ∞) and becomes more stochastic as T increases. The couplings or bonds J_ij^l are chosen according to some prescription, which we took in our original solution [20, 21] to be of the outer-product [1, 2] type,

J_ij^l = (1/N) Σ_{ν=1}^{p_s} ξ_i^{l+1,ν} ξ_j^{l,ν} ,    (3)

where the ξ_i^{l,ν}, ν = 1, ..., p_s, are the stored key patterns.
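The dynamics defined by equations (1)-(3) are straightforward to simulate directly. The short Python sketch below (not from the original paper; all sizes and names are illustrative) propagates a noisy key pattern through one layer, using the outer-product couplings of Eq. (3) and the Glauber-type form assumed in Eq. (1).

```python
import numpy as np

# Minimal sketch of one stochastic layer-to-layer update of the MD network,
# assuming the standard Glauber form of Eq. (1) and the couplings of Eq. (3).
rng = np.random.default_rng(0)

N, p, beta = 500, 50, 20.0                       # cells per layer, patterns, inverse temperature
xi_l  = rng.choice([-1, 1], size=(p, N))         # key patterns on layer l
xi_lp = rng.choice([-1, 1], size=(p, N))         # key patterns on layer l+1 (independent set)

# Outer-product couplings, Eq. (3): J_ij = (1/N) sum_nu xi^{l+1,nu}_i xi^{l,nu}_j
J = xi_lp.T @ xi_l / N

def update_layer(S_l):
    """Set layer l+1 given layer l with P(S) = exp(beta*S*h) / (2*cosh(beta*h))."""
    h = J @ S_l                                  # local fields, Eq. (2)
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * h))
    return np.where(rng.random(N) < p_plus, 1, -1)

# Start from a noisy version of key pattern nu = 0 and propagate one layer.
S1 = xi_l[0] * np.where(rng.random(N) < 0.85, 1, -1)   # ~15 % of the spins flipped
S2 = update_layer(S1)
print("overlap with pattern 0 on layer 2:", S2 @ xi_lp[0] / N)
```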

The extensions we consider in this paper are the following:

1) Weighted learning schemes [4, 13]: here the original learning rule (Eq. (3)) is modified, and each pattern is learned with a weight. This models the effect that recently learned patterns are embedded with a larger weight than « old » patterns. A more detailed discussion of the « philosophy » of this modification can be found in reference [4]. The new couplings in this case take the form

J_ij^l = (1/N) Σ_{ν=1}^{p_s} A(ν/N) ξ_i^{l+1,ν} ξ_j^{l,ν} ,    (4a)

where A(ν/N) obeys the normalization condition [4]

(1/N) Σ_{ν=1}^{p_s} A(ν/N) = K ,    (4b)

where K is a normalization constant independent of N. With this notation, and assuming A(u) to be a decreasing function of u, the most recently stored pattern is the one with ν = 1, and the storage « ancestry » increases with ν. This modification was originally proposed by Mezard et al. [4], who solved the problem for Hamiltonian networks within the framework of replica theory. Derrida and Nadal [13] have also considered this modification of the learning rule for the diluted asymmetric neural networks [12]; they were able to give an exact solution of the dynamics in that case.

2) Biased patterns [7]: following Amit et al. [7], we study the properties of our network when the mean level of activity differs from the 50 % used in MD. Thus, every component ξ_i^{l,μ} of a learned pattern is chosen independently with a probability P(ξ_i^{l,μ}),

P(ξ_i^{l,μ}) = [(1 + a_l)/2] δ(ξ_i^{l,μ} - 1) + [(1 - a_l)/2] δ(ξ_i^{l,μ} + 1) .    (5)

We also adopt the modified form of the coupling proposed by Amit et al. [7],

J_ij^l = (1/N) Σ_{ν=1}^{p_s} (ξ_i^{l+1,ν} - a_{l+1}) (ξ_j^{l,ν} - a_l) ,    (6)

where a_l is the magnetization of the patterns on layer l.

3) Static noise in the synapses [22]: here the couplings are modified so as to include a random part which is not related to the learning process. The couplings in this case take the form

J_ij^l = (1/N) Σ_{ν=1}^{p_s} ξ_i^{l+1,ν} ξ_j^{l,ν} + Δ_ij^l ,    (7)

where the Δ_ij^l are independent, identically distributed Gaussian random variables with zero mean and width Δ/√N. This problem was treated by Sompolinsky [22] in the context of the Hopfield model. (A short illustrative construction of the three modified couplings of this section is sketched below.)
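The three extensions above differ from Eq. (3) only in how the couplings J_ij^l are built. The following sketch (a schematic construction under the stated assumptions, not the paper's code) shows one way to generate each of them; the exponential form used for A(u) is only one admissible decreasing choice, and the values of ε, a and Δ are arbitrary.

```python
import numpy as np

# Illustrative construction of the three modified couplings of this section.
rng = np.random.default_rng(1)
N, p = 500, 50

def hebb(xi_out, xi_in, weights=None):
    """Outer-product couplings; optional per-pattern weights A(nu/N) as in Eq. (4a)."""
    w = np.ones(len(xi_out)) if weights is None else weights
    return (xi_out * w[:, None]).T @ xi_in / N

xi_in  = rng.choice([-1, 1], size=(p, N))
xi_out = rng.choice([-1, 1], size=(p, N))

# 1) Weighted learning: A(u) decreasing in u, e.g. an exponential (illustrative choice).
eps = 1.0
A = np.exp(-eps * np.arange(1, p + 1) / N)
J_weighted = hebb(xi_out, xi_in, weights=A)

# 2) Biased patterns: P(xi = +1) = (1 + a)/2, couplings built from (xi - a), Eq. (6).
a = 0.4
xi_in_b  = np.where(rng.random((p, N)) < (1 + a) / 2, 1, -1)
xi_out_b = np.where(rng.random((p, N)) < (1 + a) / 2, 1, -1)
J_biased = (xi_out_b - a).T @ (xi_in_b - a) / N

# 3) Static synaptic noise: add i.i.d. Gaussian Delta_ij of width Delta/sqrt(N), Eq. (7).
Delta = 0.5
J_noisy = hebb(xi_out, xi_in) + rng.normal(0.0, Delta / np.sqrt(N), size=(N, N))
```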

It should be noted that in all the above extensions we retain the basic feature of our network, i.e. each pattern carries a layer index. This is a central feature that characterizes the class of model neural networks studied in [19]; it has conceptual as well as technical significance. The main point is that while the input representation of the key pattern ν, i.e. ξ_i^{1,ν}, is externally dictated, the network is free to choose the internal as well as the output representations ξ_i^{l,ν}, l > 1.

By exact solution of our model we mean the following. The network is presented with an initial configuration S^1 on the first layer (l = 1). This may be one of the key patterns, a noisy key pattern, a mixture state (one having a finite overlap with several key patterns) or just a random state. This state is characterized by its overlap m_ν^1 with each of the key patterns on the first layer. The overlap m_ν^l is defined by

m_ν^l = (1/N) Σ_i ξ_i^{l,ν} S_i^l

(this definition will be generalized when dealing with biased patterns in Sect. 3). Our solution consists of the calculation of m_ν^l on all subsequent layers. From the recursion relations for m_ν^l we will be able to learn a great deal about the performance of the network. This will be done in section 4 after the solution is presented.
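As a concrete illustration of this definition, the snippet below (illustrative names only) evaluates m_ν^1 = (1/N) Σ_i ξ_i^{1,ν} S_i^1 for a symmetric mixture state: the three condensed overlaps are finite, while the remaining ones are of order 1/√N.

```python
import numpy as np

# Overlap of an initial layer-1 state with each key pattern, m_nu^1 = (1/N) sum_i xi_i^{1,nu} S_i^1.
rng = np.random.default_rng(2)
N, p = 1000, 20
xi1 = rng.choice([-1, 1], size=(p, N))     # key patterns on the first layer

S1 = np.sign(xi1[0] + xi1[1] + xi1[2])     # a symmetric mixture of three key patterns
m1 = xi1 @ S1 / N                          # overlap with every key pattern
print("condensed overlaps:", np.round(m1[:3], 2), " remaining overlaps ~ O(1/sqrt(N))")
```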

3. Exact solution.

This section is divided into four parts. In the first I give the general framework for the solution of feed-forward type networks. We have given a detailed derivation of the original model in MD, but will recapitulate the main steps for the sake of completeness. Our formulation is similar to that proposed by Gardner et al. [24] for the case of the parallel dynamics of the Little and SK models. In each of the remaining three parts I consider one of the extensions of the basic model we have previously solved and derive the exact layer-to-layer recursion relations, which are then analysed in section 4.

3.1 GENERAL FRAMEWORK. - Consider a random assignment of ν = 1, 2, ..., p_s key patterns ξ_i^{l,ν} on each of the L layers of the network. Choose an initial state on the first layer, S^1. The question we ask is: what is the probability P(S^L | S^1) that the dynamic rules (1-2) produce on layer L a state S^L, given the initial state S^1? Note that we must average both over the random assignment of the ξ's and over the probability distribution given in equation (1). The conditional probability to get a configuration S^{l+1} on layer l+1, given the configuration S^l on the previous layer, is obtained by taking the product of equation (1) over all sites:

P_ξ(S^{l+1} | S^l) = Π_{i=1}^{N} exp(β S_i^{l+1} h_i^{l+1}) / [2 cosh(β h_i^{l+1})] ,    (9)

where h_i^{l+1} is given in equation (2). The subscript ξ denotes the dependence of P_ξ on all the key patterns ξ_i^{l,ν}. A sequence of configurations S^1, ..., S^L will be generated by our dynamic rules with the probability

P_ξ(S^2, ..., S^L | S^1) = Π_{l=1}^{L-1} P_ξ(S^{l+1} | S^l) .

In order to obtain the probability for a configuration S^L on layer L, given the initial state S^1, we must sum this over all intermediate layers:

P_ξ(S^L | S^1) = Σ_{S^2, ..., S^{L-1}} Π_{l=1}^{L-1} P_ξ(S^{l+1} | S^l) .

Finally, averaging this quantity, which refers to a given realization of the ξ's, over the probability distribution of the random variables ξ, we obtain the probability P of S^L given S^1:

P(S^L | S^1) = « P_ξ(S^L | S^1) » .

The double averaging sign « » indicates an average over the ξ's. Combining the above equations into a single expression for the probability distribution P(S^L | S^1), we obtain:

This expression is the starting point for the various models considered. We now derive the solution in each case. The basic idea in each solution is to bring the expression for P into the form of an integral of the type ∫ dx exp[N F(x)], from which the layer-to-layer recursion relations are derived from the saddle-point equations ∂F/∂x = 0. To do this we will have to introduce various order parameters in order to calculate the averages over the patterns and the trace over the spin variables.
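For a toy system the sum over intermediate layers can be carried out by brute force, which makes the structure of P(S^L | S^1) concrete before any order parameters are introduced. The sketch below assumes the Glauber form of Eq. (1) and enumerates all intermediate configurations; this is of course only feasible for very small N and L, which is precisely why the paper proceeds via the saddle-point method instead.

```python
import numpy as np
from itertools import product

# Brute-force check of the structure P(S^L | S^1) = sum_{S^2..S^{L-1}} prod_l P(S^{l+1} | S^l)
# for a toy system; couplings, sizes and names are illustrative.
rng = np.random.default_rng(3)
N, L, beta = 3, 4, 1.0
J = [rng.normal(0, 1 / np.sqrt(N), size=(N, N)) for _ in range(L - 1)]  # any fixed couplings

def P_layer(S_next, S_prev, J_l):
    h = J_l @ np.array(S_prev)
    return np.prod(np.exp(beta * np.array(S_next) * h) / (2 * np.cosh(beta * h)))

def P_final(S_L, S_1):
    total = 0.0
    for mids in product(product([-1, 1], repeat=N), repeat=L - 2):   # all intermediate layers
        path = [S_1, *mids, S_L]
        total += np.prod([P_layer(path[l + 1], path[l], J[l]) for l in range(L - 1)])
    return total

S1 = tuple(rng.choice([-1, 1], size=N))
# Summing over all 2^N final states must give 1 (a useful sanity check of the normalization).
print(sum(P_final(SL, S1) for SL in product([-1, 1], repeat=N)))
```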

3.2 WEIGHTED LEARNING SCHEME. - Here we use the form given in equation (4) for the couplings. In this case the expression for the probability distribution takes the form:

In this equation and in what follows we use the abbreviation A_μ = A(μ/N). In all subsequent discussion of the weighted learning scheme I will use the notation p_s = gN. As we show in section 4.1, the number of patterns embedded in the network is not equal to the number of effectively stored patterns, αN. That is, even though μ = 1, 2, ..., p_s patterns appear in the sum (4), only αN ≤ p_s patterns are effectively stored by the network [4]. (In the two other extensions I consider, these two numbers are the same and will be denoted by the standard notation αN.)

To proceed further we need to introduce new variables which will make the calculation of the averages possible. Thus, we introduce the two sets of variables m, m̂ and φ, φ̂ through the relations:

and

At this stage we make the assumption that the initial state S^1 has a finite overlap m_ν with pattern ν and an overlap of order 1/√N with all the others. Doing this we get the following expression for P:

In the expression above we have separated the terms with μ ≠ ν from the term with μ = ν. In this equation and in the remainder of this subsection, whenever μ appears it takes only the values μ ≠ ν. Y is defined in equation (19) below. At this stage I make the following observation. Because I have considered a system of L layers, the variables S^L and ξ_i^{L,μ} corresponding to the last layer appear in a different form from those corresponding to the layers l < L. In what follows I will be interested in the layer-to-layer recursion relations, and clearly the recursion relations for layers l < L do not depend on what goes on in layer L. In order to avoid this asymmetry I will implicitly assume that the size of the system L → ∞ in deriving the recursion relations. Thus, I define the probability distribution P by the equation:

With this proviso Y is given by

In this expression and in what follows, the upper limit of the summations over l will be assumed to be infinity, in accordance with what was said above. The lower limit will be 1, unless otherwise specified.

To proceed further we use the fact that our initial state S^1 has a finite overlap with the pattern ν and overlaps of order 1/√N with the others. We assume that this situation holds at all times (i.e. on all layers), and thus make the following rescaling of the variables in the expression Y for μ ≠ ν:

Using the new variables λ_l and λ̂_l we expand the expression for Y to lowest order in 1/N.

In order to separate the variables λ_l, which carry a pattern index, from the φ_i, which carry a site index, we need to introduce additional variables using the following identities:

Going back to equation (17), we note that we still need to calculate the average over the patterns ξ_i^{l,ν} and the trace over the variables S_i^l. This gives an additional contribution of the form

Combining all the above results into one formula, we obtain:

The integral over the variables m_ν can be done, and we finally obtain:

where

and

The function Z_l is given by

In the limit N → ∞ we can calculate the integral (23) using the saddle-point method. This corresponds to evaluating conditions of the form ∂F/∂x = 0, where x stands for any one of the integration variables. Doing this we obtain the saddle-point equations in the form of recursion relations for the various order parameters introduced. After manipulating these expressions, using techniques similar to those described in MD, we finally obtain the following recursion relations, which are derived in appendix A:

where a_{μ,l+1} is given by

In this equation the average is with respect to a Gaussian random variable z with zero mean and unit variance. The initial conditions are a_{μ,1} = 1, which implies q_1 = K (assuming the normalization condition (4b)).
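The saddle-point step used here is the standard Laplace argument: for N → ∞ an integral of the form ∫ dx exp[N F(x)] is dominated by the stationary point of F. The one-dimensional toy example below (with an illustrative F, not one of the functions of this paper) checks the leading-order formula numerically.

```python
import numpy as np

# Saddle-point (Laplace) evaluation of integral dx exp(N*F(x)): for large N the
# integral is dominated by x* with dF/dx = 0, giving exp(N*F(x*)) * sqrt(2*pi/(N*|F''(x*)|)).
F = lambda x: -(x - 1.0) ** 2 + 0.1 * np.sin(3.0 * x)      # toy F, purely illustrative

x = np.linspace(-5.0, 5.0, 200001)
dx = x[1] - x[0]
x_star = x[np.argmax(F(x))]                                 # stationary point of F
Fpp = (F(x_star + 1e-3) - 2 * F(x_star) + F(x_star - 1e-3)) / 1e-6   # numerical F''(x*)

for N in (10, 100, 1000):
    direct = np.sum(np.exp(N * F(x))) * dx                  # direct quadrature
    saddle = np.exp(N * F(x_star)) * np.sqrt(2 * np.pi / (N * abs(Fpp)))
    print(f"N = {N:5d}   direct/saddle-point = {direct / saddle:.4f}")   # -> 1 as N grows
```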

In the limit N → ∞ we can transform the sum in (29) into an integral. Doing this, and defining r_l = g q_l, we obtain the following recursion relations at zero temperature (β → ∞):

with a_1(u) = 1 and

These equations constitute the solution of this model and will be analysed in the next section. In particular, we will show that the number of effectively stored patterns may be smaller than the number of embedded patterns p_s.

3.3 LEARNING BIASED PATTERNS. - In this subsection we give the exact solution of the model as defined in section 2, with the coupling J_ij^l given in equation (6) and the probability distribution for the random variables ξ_i^{l,μ} given in equation (5). Following Amit et al. [7], I also define the order parameter m in this case to be:

where a_l is the magnetization on layer l. The full overlap m̃ is given by

Following the derivation given in the previous subsection for « learning with weights », and making the obvious generalizations appropriate here, we finally obtain the following expression for the probability distribution P (details of the derivation are given in appendix B):

where

and

The function Z is given by

In these equations α = p_s/N (see the discussion following Eq. (14)). As in the previous section, we calculate the integral by the saddle-point method. The general recursion relations and their detailed derivation are given in appendix B. For the sake of brevity, we give here only the recursion relations at T = 0 and for the case a_l = a, i.e. a independent of the layer. These recursions have the following form:

These equations, together with the initial conditions m_1 = m^1 and q_1 = 1 - a², constitute the solution of the model (at T = 0).

3.4 EFFECT OF STATIC NOISE. - In this subsection I give the solution of the basic MD model with the inclusion of a static (non-learned) component in the couplings. This generalization has been treated in the context of the Hopfield model by Sompolinsky [22], with results similar to those we obtain below. I will derive the recursion relations at zero temperature and analyse them in the next section. The only difference between the derivation at zero temperature and at a finite temperature T is that instead of the function P(S^{l+1} | S^l) appearing in equation (9) we have a theta-function constraining the dynamics. The probability distribution P(S^L | S^1) is given in this case by

In this equation the square brackets represent an average over the Gaussian random variables Δ_ij^l with zero mean and width Δ/√N. Transforming the theta-function into an integral by using the identity

we bring the function P into a form which can easily be evaluated. The integral over the distribution P(Δ_ij^l) is Gaussian and can easily be done. The remaining average over the patterns ξ_i^{l,ν} and the trace over S_i^l are done in a manner analogous to that described in the previous two subsections, and will not be repeated. The final layer-to-layer recursion relations I obtain are:

The initial conditions are, as before, given by m_1 = m^1 and q_1 = 1.

4. Analysis of the solution.

In the previous section we presented the recursion relations for the three generalizations of the basic MD model. This section contains an analysis of these recursion relations. The most important question I wish to address is the asymptotic behaviour of the system: given an initial state which has overlap m_1 with a given pattern ν, what is the behaviour as l → ∞? If the asymptotic overlap m* is finite (and close to 1), we say that the system has recognized pattern ν. If the asymptotic overlap is 0 (or of order 1/√N), the system has lost all trace of its initial state, and thus has not recognized pattern ν. In MD we found that the system undergoes a phase transition as the parameter α is varied. For example, at T = 0 we find that for α > 0.27 the asymptotic overlap is zero, whatever the initial state. This means that the system can no longer function as an associative memory.

I find that a similar type of behaviour persists in the models described above. In the following three subsections I discuss the behaviour of the network for the three generalizations treated above.
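The recursions of section 3 couple m_l to further order parameters such as q_l, but the fixed-point analysis described here can already be illustrated with a single-variable map. The sketch below iterates m_{l+1} = erf(m_l/√(2α)), the zero-temperature recursion of the extremely diluted asymmetric model of Derrida et al. [12], used purely as a stand-in for the MD recursions; the procedure (iterate to the asymptotic overlap, scan α for the loss of the retrieval solution) is the same.

```python
import numpy as np
from math import erf, sqrt

# Fixed-point analysis with a one-variable stand-in map, m_{l+1} = erf(m_l / sqrt(2*alpha))
# (the diluted-model recursion of Derrida et al. [12], NOT the MD recursions of section 3).
def asymptotic_overlap(alpha, m0=1.0, layers=2000):
    m = m0
    for _ in range(layers):
        m = erf(m / sqrt(2 * alpha))
    return m

for alpha in (0.3, 0.6, 0.64, 0.7):
    print(f"alpha = {alpha:4.2f}  ->  m* = {asymptotic_overlap(alpha):.3f}")
# For this particular map the retrieval solution m* > 0 disappears at alpha_c = 2/pi ~ 0.64;
# the MD recursions give the smaller critical value 0.27 quoted above.
```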

4.1 WEIGHTED LEARNING SCHEMES. - Starting from the general recursion relations given in equations (29), (30), we restrict ourselves in this section to the case of the so-called marginalistic learning scheme. This corresponds to the following choice of the function A_μ [4, 13]:

With this choice of A_μ I find two different types of behaviour as a function of the parameter ε. This behaviour is the same as that obtained by Mezard et al. for the case of Hamiltonian networks and by Derrida and Nadal for the case of the diluted asymmetric networks. Following Derrida and Nadal, I define the number of effectively stored patterns to be p_m, and the parameter α = p_m/N. Recall that the total number of patterns embedded in the system was p_s = gN. Similarly to the results of the above authors, I find two regimes as a function of ε. In the first regime (ε < ε_c) the behaviour is of the following nature (using the notation of Derrida and Nadal [13]):

(i) Good learning regime, for g ≤ g*(ε). In this regime all the embedded patterns are effectively stored, i.e. p_m = p_s.

(ii) Forgetting regime, for g*(ε) ≤ g ≤ g_c(ε). In this regime only the most recently stored patterns are effectively stored, while the rest cannot be retrieved; here p_m < p_s.

(iii) Above g_c(ε), no stored pattern is effectively memorized. In this regime α = 0.

In the second regime (ε > ε_c) there is never a complete deterioration regime. As found in the diluted model of Derrida and Nadal, there is an asymptotic finite capacity α as g → ∞. That is, for g < g* the capacity α is equal to g, while for g > g* the limiting capacity α(g → ∞) is finite. This behaviour can be seen in figure 2.

For the marginalistic type of learning described by equation (42), I find the critical value of ε to be ε_c = 1.68... In figure 1 I give the capacity α vs. g in the first regime (for ε = 1 < ε_c). The different types of behaviour described in (i), (ii) and (iii) above can be clearly seen in this figure. Figure 2 depicts the same graph for ε = 2 > ε_c, which is in the second regime in ε. As we can see from figure 2, the curve α(g) levels off to its asymptotic value already at g ≈ 1, and the complete deterioration regime α = 0 is never reached.

Fig. 1. - Weighted learning: effective capacity α vs. g = p_s/N, where p_s is the number of patterns embedded in the network, for ε = 1 < ε_c. Three regimes are seen in the figure, as discussed in the text.

Fig. 2. - Weighted learning: effective capacity α vs. g = p_s/N, where p_s is the number of patterns embedded in the network, for ε = 2 > ε_c. Here only two distinct regimes are observed, as discussed in the text.

4.2 BIASED PATTERNS. - Starting from the zero-temperature recursion relations given in equations (39), I address the problem of the asymptotic overlap m* and the critical value of α as a function of the magnetization a imposed on each layer. Solving equations (39) numerically, I find the curve α_c(a) given in figure 3. As can be expected, we see that the storage capacity decreases with the magnetization a and approaches zero as a → 1. This curve is similar in shape to that obtained by Amit et al. [7] for the case of the Hopfield model with biased patterns. Figure 4 depicts the asymptotic overlap m_c*(a), where m_c* = m*(α_c). Again, one observes that as the magnetization increases the asymptotic overlap decreases and approaches 0 as a → 1. The full overlap, however (Eq. (33b)), m̃ = m + a², is always close to 1, even in the vicinity of α_c.

Fig. 3. - Biased patterns: critical capacity α_c vs. a, the magnetization on each layer.

Fig. 4. - Biased patterns: asymptotic overlap m*(α_c) vs. a, the magnetization on each layer.

4.3 STATIC SYNAPTIC NOISE. - As for the original MD model, the saddle-point equations contain two stable solutions. The solution with m ≈ 1 disappears above a critical value α_c(Δ). I plot this critical value of α as a function of Δ in figure 5. As can be expected, α_c is a monotonically decreasing function of Δ. We observe that retrieval is possible only for Δ < 0.8, which is the same value as obtained by Sompolinsky in the case of the Hopfield model. In figure 6 I plot the asymptotic value of m_c* = m*(α_c) as a function of the noise level Δ. I find that it goes continuously to zero as Δ → Δ_c.

As was found by Sompolinsky [22] in the case of the Hopfield model, I find our layered model to be rather insensitive to static noise. In fact, the network performs rather well even when the width of the noise term is comparable to the width of the Hebb component, which is of order √α/√N.
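This comparison is easy to check numerically: the element-wise spread of the Hebb part of J_ij is √α/√N, so a noise width Δ/√N with Δ of order √α is indeed comparable to the learned part. The snippet below (illustrative parameters) makes the estimate explicit.

```python
import numpy as np

# Element-wise spread of the Hebb couplings is ~ sqrt(alpha)/sqrt(N), so noise of
# width Delta/sqrt(N) with Delta of order sqrt(alpha) is comparable to the learned part.
rng = np.random.default_rng(4)
N, alpha = 1000, 0.2
p = int(alpha * N)
xi_in, xi_out = (rng.choice([-1, 1], size=(p, N)) for _ in range(2))
J_hebb = xi_out.T @ xi_in / N

print("std of Hebb couplings     :", J_hebb.std())
print("sqrt(alpha)/sqrt(N)       :", np.sqrt(alpha / N))
print("noise width Delta/sqrt(N) :", 0.5 / np.sqrt(N), " (Delta = 0.5, illustrative)")
```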

Fig. 5. - Static synaptic noise: critical capacity α_c vs. Δ, the width of the distribution of the random variables Δ_ij^l.

Fig. 6. - Static synaptic noise: asymptotic overlap m*(α_c) vs. Δ, the width of the distribution of the random variables Δ_ij^l.

5. Discussion.

In this paper I have extended the solvable class of feed-forward neural networks defined by Domany et al. [19] and solved by Meir and Domany [21] to include:

a) A weighted learning scheme.

b) Learning of biased patterns.

c) Inclusion of static noise in the synapses.

My analysis has shown that all these modifications can be successfully incorporated into the basic model of MD, thus enlarging its domain of operation. As we have shown, the introduction of weighted learning prevents the abrupt decline in the performance of the network at a sharp value of α (≈ 0.27); the price we pay, of course, is that « anciently » learned patterns are forgotten. We have also shown that simply correlated patterns, as in the biased-pattern case, can be successfully learned and retrieved. Finally, static synaptic noise was found to have little effect on the performance of the network, even when it is comparable in magnitude to the learned component.

The mathematical techniques used in this paper and in MD should be applicable to many types of feed-forward neural networks, and allow one to obtain interesting analytical insight into these models which would not be available from computer simulations.

Of course, the most interesting part of the theory of feed-forward networks, namely the problem of learning, has not been addressed in this paper. We have previously introduced [19] a layered model possessing such a dynamical learning stage, which leads to perfect recall of all key patterns. We note in passing that much recent work is concerned with learning algorithms for feed-forward networks [16]. Numerical work has demonstrated the utility of such systems, but no convergence theorem has been proved for the learning stage, as has been done for the single-layer perceptron [15].

Two interesting unanswered questions in the theory of feed-forward neural networks, which can hopefully be attacked with the techniques described in this paper, are the following:

1) The maximal storage capacity of such systems. This question has recently been addressed by Gardner and Derrida [25] in the framework of Hopfield-type networks. It would be interesting to compare the capacity of the two types of systems.

2) The structure of the attractors, i.e. the topography of the state space. Questions such as the asymptotic overlap of two initially close patterns are of importance in answering this question.

We are currently studying the possibility of using such networks for storing sequences of different periods (in a single network).

I thank E. Domany and H. Orland for many helpful discussions, and E. Domany for his encouragement. This research was supported by the US-Israel Binational Science Foundation, the Israel Academy of Sciences and the Minerva Foundation.

Appendix A.

In this appendix I derive the recursion relations for the weighted learning scheme. The derivation is very similar to that given in appendix A of MD, but is repeated for completeness. The saddle-point equations derived from equation (22) are:

As in MD we need to make an ansatz concerning the solution of the saddle-point equations. Thus, I assume

With this assumption one can show that ρ_l must also vanish. To see this I note that ρ_l contains a term of the form ⟨i λ_{l_0} λ̂_{l_0-1}⟩_{z,φ}. This term is proportional to

Assuming q̂_L = 0 for some large L (which is the number of layers in the system, and which we implicitly assume tends to infinity) and integrating over λ_L gives δ(λ̂_L). In the last term of (A.3), λ̂_L multiplies λ_{L-1}; hence if λ̂_L = 0 and q̂_{L-1} = 0, there remains only one term with λ_{L-1}, and the integral over λ_{L-1} also yields δ(λ̂_{L-1}), and so on, until l = l_0 is reached, for which one gets

Thus q̂_l = 0 yields

However, for ρ_l = 0 it is easy to see that the f_l of equation (27) satisfy the relation

The equation for q_l is obtained by calculating the average ⟨(λ_l)²⟩_{z,φ} appearing in the first equation of (A.1). A straightforward evaluation of this integral yields

Now we must evaluate iφ̂_l. Using the expression for f given in (26), together with equations (A.1) and (A.2), we get

It should be noted that (A.5) holds for ρ_l = 0; in (A.7) we must first take the derivative and then set ρ_l = 0. From (26) it is easy to see that

and we find

Calculating the derivative and substituting this into (A.6), using the first of equations (A.1), yields the recursion relation for q_l. The recursion relation for m_l is obtained by using (A.5) and the last of equations (A.1) to get

Using equation (27) for f_l (with ρ_l = 0) yields the required equation. Putting everything together, I obtain the following recursion relations, which are also displayed in the main body of the paper:

where a_{μ,l+1} is given by

In this equation the average is with respect to a Gaussian random variable z with zero mean and unit variance. The initial conditions are a_{μ,1} = 1, which implies q_1 = K (see Eq. (4b)).

Finally, in order to check the self-consistency of the solution we must demand that indeed q̂_l = 0. To show this we need to evaluate ∂f_{l+1}/∂q_l, which can be shown to yield

Using this result it is simple to check that indeed q̂_l = 0, and our solution is consistent with the starting assumption that led to it.

Appendix B.

In this appendix I derive the recursion relations for the network with biased patterns. As usual, we start from the expression for the probability distribution given in equation (13). Inserting the explicit form for J_ij^l given in equation (6), and going through the same introduction of the variables m, m̂, φ, φ̂ as in equations (15), (16) of the main text, I obtain (note, however, that the definition of m_l has been modified in this case):

In the above expression we have separated the terms with μ > 1 from those with μ = 1. In what follows μ always takes values > 1. Using again the remarks made after equation (17), I implicitly assume L → ∞ and so neglect the boundary term resulting from the term with S^L. As mentioned before, since I am interested only in the recursion relations this makes no difference. Doing this, Y is given by:

which can be averaged over the probability distribution for ξ given in equation (5). This yields:

Rescaling the variables m_l^μ and m̂_l^μ as in equation (20), and defining the variables ρ_l and q_l as in equation (21), we are left with the following expression for P:

The trace over the variables S_i^l can easily be done. This yields the following expression:

Collecting all the terms containing the variables φ, φ̂, we have integrals of the following form:

The integral over the variable φ̂ can be done, and we are left with the following expression for J:

Combining all the above manipulations, we finally obtain the final expression for P given in the main text (Eqs. (34)-(38)).

As was mentioned in the text, the integral is calculated via the saddle-point method. To do this we must set to zero the derivatives of the function F with respect to each one of the integration variables. Doing this we obtain the saddle-point equations:

The equation for m_l can be seen to give

where the average appearing in this equation is given in equation (37). In order to solve the saddle-point equations we need to make an ansatz concerning the solution. Using the experience gained in MD, we assume:

which will be checked for self-consistency at the end of the calculation. With this assumption it is not difficult to check that

as well (see appendix A for details). The equation for q_l is:

With ρ_l = 0 one can also check that the following relation holds:

From (B.10), (B.11) and (B.13) one finds that

From (B.9) and (B.12), with the assumptions (B.10) and their consequences (B.11), (B.13) and (B.14), it is simple to obtain the following recursion relations after some algebra:

In these equations ⟨ ... ⟩ represents an average with respect to the Gaussian random variable z with zero mean and unit variance. Taking the zero-temperature limit β → ∞, we obtain the recursion relations given in equation (39) of the main text. Finally, one can check that the assumptions (B.10) indeed lead to a self-consistent solution. To do this I substitute the solution (B.10) and (B.11) into the saddle-point equations (B.8) and find that they are indeed satisfied.

References

[1] LITTLE, W. A., Math. Biosci. 19 (1975) 101.

[2] HOPFIELD, J. J., Proc. Natl. Acad. Sci. USA 79 (1982) 2554.

[3] AMIT, D. J., GUTFREUND, H. and SOMPOLINSKY, H., Phys. Rev. A 32 (1985) 1007; Phys. Rev. Lett. 55 (1985) 1530; Ann. Phys. 173 (1987) 30.

[4] MEZARD, M., NADAL, J. P. and TOULOUSE, G., J. Phys. France 47 (1986) 1457.

[5] PARISI, G., J. Phys. A 19 (1986) L617.

[6] PERSONNAZ, L., GUYON, I. and DREYFUS, G., J. Phys. Lett. France 46 (1985) L-359.

[7] AMIT, D. J., GUTFREUND, H. and SOMPOLINSKY, H., Phys. Rev. A 35 (1987) 2293.

[8] KANTER, I. and SOMPOLINSKY, H., Phys. Rev. A 35 (1987) 380.

[9] DIEDERICH, S. and OPPER, M., Phys. Rev. Lett. 58 (1987) 949.

[10] HERTZ, J., GRINSTEIN, G. and SOLLA, S., in L. Van Hemmen and I. Morgenstern (eds.), Glassy Dynamics (Berlin: Springer Verlag) 1987.

[11] SOMPOLINSKY, H. and KANTER, I., Phys. Rev. Lett. 57 (1986) 2861.

[12] DERRIDA, B., GARDNER, E. and ZIPPELIUS, A., to be published in Europhys. Lett.

[13] DERRIDA, B. and NADAL, J. P., submitted to J. Stat. Phys.

[14] ROSENBLATT, F., Principles of Neurodynamics (Washington D.C.: Spartan) 1961.

[15] MINSKY, M. and PAPERT, S., Perceptrons (Cambridge, Mass.: MIT Press) 1969.

[16] MCCLELLAND, J. L. and RUMELHART, D. E., Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 2 vols. (Cambridge, Mass.: The MIT Press) 1986.

[17] RUMELHART, D., HINTON, G. and WILLIAMS, R., Nature 323 (1986) 533.

[18] HOGG, T. and HUBERMAN, B., J. Stat. Phys. 41 (1985) 115; Phys. Rev. Lett. 52 (1984) 1024.

[19] DOMANY, E., MEIR, R. and KINZEL, W., Europhys. Lett. 2 (1986) 175.

[20] MEIR, R. and DOMANY, E., Phys. Rev. Lett. 59 (1987) 359 and Europhys. Lett. 4 (1987) 645.

[21] MEIR, R. and DOMANY, E., Phys. Rev. A, in press.

[22] SOMPOLINSKY, H., Phys. Rev. A 34 (1986) 2571.

[23] GARDNER, E., J. Phys. A 19 (1986) L1047.

[24] GARDNER, E., DERRIDA, B. and MOTTISHAW, P., J. Phys. France 48 (1987) 741.

[25] GARDNER, E., Edinburgh University preprint 87/396 and GARDNER, E. and DERRIDA, B., Edinburgh University preprint 87/397.
