J. Phys. I France 2 (1992) 167-180    FEBRUARY 1992

Classification
Physics Abstracts
87.10 02.50 64.60
Learning multi-class classification problems

Timothy L. H. Watkin (1), Albrecht Rau (1), Désiré Bollé (2) and Jort van Mourik (2)

(1) Theoretical Physics, Oxford University, Keble Road, GB-Oxford OX1 3NP, G.B.
(2) Inst. voor Theor. Fysica, K.U. Leuven, B-3001 Leuven, Belgium

(Received 26 August 1991, accepted in final form 20 October 1991)
Abstract. A multi-class perceptron can learn from examples to solve problems whose answer may take several different values. Starting from a general formalism, we consider the learning of rules by a Hebbian algorithm and by a Monte-Carlo algorithm at high temperature. In the benchmark "prototype-problem" we show that a simple rule may be more than an order of magnitude more efficient than the well-known solution, and in the conventional limit is in fact optimal. A multi-class perceptron is significantly more efficient than a more complicated architecture of binary perceptrons.

1. Multi-class perceptrons for multi-class problems.
One recent success in the rapidly expanding field of neural networks has been the learning of a "rule" from examples (correct associations of "questions" and "answers") [1-7]. Analysis so far, however, has been restricted to small classes of possible rules, and in particular to those in which there are only two answers, unlike the generality of real engineering problems [8-10]. In this paper we seek to relax this restriction in a way which preserves the natural symmetries of the problem.

Historically a formal neuron has been allowed two Ising-like states, ±1, by analogy with the on/off nature of a biological neuron [11, 12].
A binary perceptron, for example, consists of N Ising inputs which determine one Ising output, and may learn a set of random N-vector questions, with Ising components, and Ising answers associated according to some rule. In this way algorithms have been devised [1, 2] to teach a perceptron "linearly-separable" rules. Clearly it is a matter of considerable engineering importance to discover how the success of learning schemes changes if the restrictions in this formulation are relaxed. In [5, 6], for example, successive questions are chosen not at random but on the basis of what has already been learnt, and recently a study has been made of how well perceptrons learn problems which are not linearly separable [6]. Here we consider another important generalisation: to problems in which each digit of a question may take Q values and in which the answer to a problem is one of Q' possibilities.
Many scientific applications exist. Bohr and his collaborators [8], for example, have trained a network to predict with 70% accuracy the "secondary structure" of proteins from their local sequence of amino acids. In our notation the Q input states are the amino acids and the Q' = 4 output states correspond to the local stereo-chemical form of the amino acid chain: α-helix, β-turn, β-sheet or random coil. Another technically relevant issue is fault detection, where a network is expected to distinguish classes of technical defects [9]. There is also the classic neural network problem of classifying phonemes from frequency analysis of human speech [10].

May not many outputs be represented by the output combinations of several binary perceptrons? Previous studies have assumed that this is how multi-class classification would proceed, but there are two disadvantages. Firstly, a binary perceptron divides the input space in two and so must separate the possible outputs into two classes, which violates the symmetry of the problem: without a priori knowledge to the contrary we should assume that all answers are equivalent. Secondly, if a greater number of neurons is used, more connections are required, at some engineering cost; we would rather a single neuron performed a more complicated function of its inputs.

Such a neuron already exists in the literature and has been extensively studied in the very different problem of storing patterns [13-15].
Instead of two states in the output there are Q', each of which has the same relationship to each of the others, like the Q' vertices of a (Q' - 1)-dimensional tetrahedron. Similarly there are Q equivalent states for each input. At every time step the neuron calculates a local "field" for each output state, using a different, constant function of its inputs, and enters the state with the highest field. We shall refer to such a neuron as a "multi-class perceptron". (In engineering a similar network is called a "linear machine", although it does not perform a linear function of its inputs, or a "winner-take-all" machine, and is attributed to Nilsson [16]. Physicists have preferred the term "Potts-perceptron" [15].) This multi-class perceptron is not to be confused with a multi-level or graded-response neuron, which only possesses levels of activity between on and off; these states have a completely different symmetry (like a ladder) from that of the multi-class perceptron.
In this paper we will review the theory of learning from examples, explain the formalism of the multi-class perceptron, and briefly describe the ways in which the two may be combined. We then apply multi-class perceptrons to the benchmark problems of learning a rule and of classifying prototypes. For the prototype problem we shall show that a simple learning rule (even with a binary perceptron) is better than the well-known previous solution, and in the conventional limits is actually optimal. One lesson will be that geometrical arguments may let us avoid the extremely difficult algebra of a statistical mechanics formulation. We shall generally find that in multi-class problems a multi-class perceptron is significantly more efficient than a more complicated architecture of binary perceptrons, but we will finally discuss problems in which an even better choice of neuron may be appropriate.

2. The theory of multi-class learning.
2.1 BINARY PERCEPTRON LEARNING. A conventional binary perceptron has N Ising inputs (S_i = ±1), i = 1, ..., N, and one Ising output S_0 given by

    S_0 = σ(S) = sgn(J · S),    (1)

where we have defined the scalar product of two N-vectors x and y as x · y = (1/N) Σ_i x_i y_i. The N-vector J of weights defines the perceptron. If we choose J_i ∈ ℝ it is a spherical binary perceptron, or if J_i ∈ {±1} an Ising binary perceptron. J is normalised to J · J = 1, so that it lies on the surface of the unit sphere.

The perceptron learns from p pairs of N-vector questions ξ^μ, μ = 1, ..., p, and answers ξ_0^μ, associated according to a rule V defined by ξ_0^μ = V(ξ^μ). Each ξ_i^μ is chosen independently and randomly from the set {±1}, and p = αN, where α remains constant as N → ∞. A linearly-separable Boolean function is of the form

    V(ξ^μ) = sgn(B · ξ^μ),    (2)
magnitude.
AnIsing
J istypically
used to learn anIsing B,
and aspherical
J to learn aspherical
B(assumed
to berandomly chosen).
If this is not the case and e-g- anIsing
perceptron has to learna
spherical
perceptron, theproblem
becomes unlearnable and hasrecently
beenanalysed by
[7]. From theexamples
we wish to constructa J which finds the
ff+~
for a random newquestion (~+~,
and it is clear that J = B fulfills this. The chance that J generates theright
answer is thegeneralisation ability
G(clearly
arandomly
chosen J willgive
G=
1/2
ifff+~
hasequal
chance ofbeing +I),
and thus thegoal
of thetraining
process is to minimize thegeneralisation
error Eg = I G,This,
as has beenrecently pointed
out[20, 23],
is notnecessarily equivalent
tominimizing
the number ofexamples
which the perceptrongets
wrong, the"training
error"E(J)
=L
e(-ffJ t~) (31
The simplest learning algorithm [1] sets J according to a variant of the Hebb rule,

    J = (1/(γ√N)) Σ_μ ξ_0^μ ξ^μ,    (4)

where γ is just a factor chosen to normalise J to 1. We will now present a simple geometric argument to analyse how the rule works. Figure 1 shows a projection of the N-dimensional space containing J and B. A randomly chosen ξ^μ has a component, y, in any direction ŷ (i.e. y = √N ξ^μ · ŷ), which is Gaussian distributed with mean zero and standard deviation 1. The components of ξ_0^μ ξ^μ in the B direction add to J constructively, while those in any perpendicular direction do so randomly. Thus after presenting p examples the component of J in the B direction is (α/γ)⟨|y|⟩ = (α/γ)√(2/π) (in the large N limit), while that in each of the perpendicular directions is √(α/N)/γ, the sum of a random walk of αN steps of length 1/N. However, there are N - 1 directions perpendicular to B and to each other, and so the sum of all the components not in the B direction, the component of J perpendicular to B, is √α/γ.

Fig. 1. Plane containing B and J, which lie at angle φ.

Thus, by Pythagoras, γ² = α + 2α²/π. A well-known result is that E_g = φ/π = (1/π) cos⁻¹(B · J) since, as may be seen from figure 1, this is the fraction of the space whose overlap with B and J has different sign. It follows that

    E_g = (1/π) cos⁻¹ √(2α/(π + 2α)),    (5)

a result derived at some length in [1]. It implies that as α → ∞, the asymptotic behaviour of E_g is E_g ∼ 1/√(2πα).
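The geometric result (5) is easy to check numerically. The following sketch (plain NumPy; the sizes, the seed and the variable names are ours, chosen purely for illustration) trains J by the Hebb rule (4) on examples labelled by a random spherical teacher and compares the measured generalisation error with (1/π) cos⁻¹(B · J) and with (5):

```python
import numpy as np

rng = np.random.default_rng(0)
N, alpha = 2000, 4.0                       # illustrative sizes only
p = int(alpha * N)

B = rng.standard_normal(N)
B /= np.linalg.norm(B)                     # spherical teacher, B.B = 1
xi = rng.choice([-1.0, 1.0], size=(p, N))  # random Ising questions
ans = np.sign(xi @ B)                      # teacher answers, eq. (2)

J = (ans[:, None] * xi).sum(axis=0)        # Hebb rule, eq. (4)
J /= np.linalg.norm(J)                     # gamma: put J back on the sphere

print("R = B.J          :", B @ J)
print("E_g from overlap :", np.arccos(B @ J) / np.pi)
print("E_g from eq. (5) :", np.arccos(np.sqrt(2*alpha/(np.pi + 2*alpha))) / np.pi)

S = rng.choice([-1.0, 1.0], size=(20000, N))   # fresh questions
print("E_g measured     :", np.mean(np.sign(S @ J) != np.sign(S @ B)))
```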
An alternative approach ("zero-temperature" learning) is to search the N-dimensional J-space to find the volume which stores every pattern with a stability larger than κ, that is, for spherical J,

    Volume = ∫ dJ δ(J · J - 1) Π_μ θ(ξ_0^μ √N J · ξ^μ - κ).    (6)

As κ is increased to a critical value the volume shrinks continuously to zero, centered on the J-vector with "maximum stability", and algorithms exist which will construct this J, notably the Minover algorithm [19]. Reference [2] has shown that as α → ∞, E_g tends to zero as 1/α.
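For orientation, here is a schematic of the maximum-stability idea behind Minover [19]: repeatedly reinforce the currently least stable example. This is a simplified sketch, not the published algorithm; the starting point, step size, iteration count and function name are our own assumptions.

```python
import numpy as np

def max_stability_sketch(xi, ans, steps=20000):
    """Schematic Minover-style training: at each step find the example with
    the smallest stability ans_mu * (J . xi_mu) and apply a Hebb update to it."""
    p, N = xi.shape
    J = (ans[:, None] * xi).sum(axis=0)      # Hebb start (any start would do)
    J /= np.linalg.norm(J)
    for _ in range(steps):
        stab = ans * (xi @ J)                # stabilities of all examples
        mu = np.argmin(stab)                 # least stable example
        J += ans[mu] * xi[mu] / N            # reinforce it
        J /= np.linalg.norm(J)               # stay on the unit sphere
    return J
```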
For Ising J, however, for which the J-space integral is replaced by a trace over all J configurations, a first-order transition (the thermodynamic transition) to perfect generalisation has been predicted within the framework of the "replica symmetric approximation" to occur at α_c = 1.245. However, the "one-step replica symmetry broken solution" raises the spinodal line for the vanishing of the metastable spin glass phase to α_RSB = 1.628 [17].
Analysis of the zero-temperature algorithm is complicated, however, and it may be treated under the "annealed approximation", which approximates the training error by p times the generalisation error. If a Monte-Carlo algorithm is used to minimize the "training error" at a temperature T (with inverse β), i.e. we use the error as an energy and generate states with Boltzmann probability exp(-βE(J)), then in the high temperature limit ("high-temperature learning") the annealed approximation is exact and the algorithm is equivalent to minimizing a free energy given by [3]

    βf = α̃ E_g - s,    (7)

where α̃ ≡ α/T and s is the entropy (to find the appropriate entropy see, for example, [7]). Since we are working at a high temperature, a number of examples scaling with T is required, but otherwise the same qualitative behaviour has been observed as at low temperatures [3] (gradual learning for spherical perceptrons and first-order transitions to perfect generalisation for Ising ones).
Theoretical predictions made with the high-T formulation are found to agree with simulations for temperatures as low as T ≈ 5. It has been noted [7] that in problems which are not linearly separable, and so may not be learnt exactly by a perceptron, high temperature learning may be the best way to avoid "overfitting", i.e. giving undue significance to unrepresentative examples.
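In the spherical binary case the free energy (7) involves a single order parameter R = B · J, with E_g = (1/π) cos⁻¹ R as in section 2.1 and, up to a constant, entropy s = ½ ln(1 - R²) (the form quoted from [7] later in this paper). A minimal sketch of minimising (7) numerically follows; the grid search and the values of α̃ are purely illustrative:

```python
import numpy as np

def high_T_overlap(alpha_t, grid=200000):
    """Minimise beta*f = alpha_t * E_g(R) - s(R) over the overlap R (eq. 7)
    for a spherical binary perceptron: E_g = arccos(R)/pi, s = 0.5*ln(1-R^2)."""
    R = np.linspace(0.0, 1.0 - 1e-12, grid)
    f = alpha_t * np.arccos(R) / np.pi - 0.5 * np.log1p(-R**2)
    return R[np.argmin(f)]

for alpha_t in (1.0, 5.0, 20.0):              # alpha_t = alpha/T, of order one
    R = high_T_overlap(alpha_t)
    print(alpha_t, R, np.arccos(R) / np.pi)   # E_g falls smoothly: gradual learning
```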
A fundamentally different sort of rule was studied in [20], the "prototype-problem". Instead of a teacher B, we begin with p_0 random, uncorrelated Ising N-vector prototypes q^ρ, ρ = 1, ..., p_0, where p_0 = αN, each of which has a random Ising output q_0^ρ. The correct answer for any input S is now the correct output of the prototype closest in Hamming distance to S. The rule is learnt from examples of each prototype, ξ^{ρλ}, λ = 1, ..., p, chosen at random but with the constraint that q^ρ · ξ^{ρλ} = m. For an extensive number of examples this problem is clearly unlearnable, since no plane can divide the input space so as to correctly answer every possible input. Instead [20] searched the J-space to find the J minimizing the number of incorrectly answered examples, the training error, and considered only the limit of m small, which implies large p, since p should be rescaled as β ≡ m²p/(1 - m²), which remains of order 1. For β less than a critical value, β_c, a J may be found which makes the training energy zero; in this range the consequent generalisation error falls from 1/2 as β rises, but then climbs slightly as β → β_c, since the only J which correctly learns all examples has overfitting. For β > β_c the training error rises smoothly and the generalisation error falls, and both tend to the same value E(m) as β → ∞. Training with a finite training error (equivalent to a finite temperature) eliminates the problem of overfitting, and the asymptotic behaviour of the generalisation error is E_g - E(m) ∼ 1/β.
2.2 THE MULTI-CLASS PERCEPTRON. Only when the concept of a binary perceptron was generalised to many states did it become clear how many assumptions lie hidden in the ±1 notation. We shall work within the formalism recently derived in [15] to define a multi-class perceptron with N Q-state inputs (σ_j), j = 1, ..., N, each σ_j ∈ {1, ..., Q}, and one Q'-state output σ' ∈ {1, ..., Q'}. The local field for output state s' is given by

    h_{s'} = Σ_{j,s} J_j^{s's} m_{s,σ_j},    (8)

where m_{a,b} ≡ Qδ_{a,b} - 1 is the Q-fold multi-class operator, so that the synaptic matrix element J_j^{s's} is the weight of a signal coming from the input j which is in state s on the state s' of the processing unit. σ' is set equal to the state with the highest local field, i.e.

    σ' = { s'_0 | h_{s'_0} > h_{s'}  ∀ s' ≠ s'_0 }.    (9)
Clearly the operations of this perceptron are invariant under the transformation

    J_j^{s's} → J_j^{s's} + u_j^s    ∀ s'    (10)

for any u_j^s, since it alters all the fields by the same amount, and thus there exists a gauge freedom which we shall usually fix by enforcing

    Σ_{s'} J_j^{s's} = 0    ∀ s, j,    (11)

although all choices of fixing the gauge are of course equivalent. We shall also choose

    Σ_s J_j^{s's} = 0    ∀ s', j,    (12)

since any other choice is equivalent to adding a threshold to the field at state s'. Thus in the case of binary inputs, Q = 2, J_j^{s'1} = -J_j^{s'2} for all s', j, so that taking σ_j = ±1 we may rewrite the field simply as

    h_{s'} = Σ_j J_j^{s'} σ_j.    (13)

If the output is binary as well (s' = ±1) then, renaming J^{+1} as J, we obtain

    h_{s'} = s' Σ_j J_j σ_j,    (14)

and the s' maximising this expression is sgn(J · σ).
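In code, the dynamics (8), (9) together with the gauge fixings (11), (12) look as follows. This is a minimal sketch: the sizes, the random couplings, the 0-indexed states and the helper names are placeholders of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
N, Q, Qp = 50, 3, 4                       # inputs, input states, output states

# synaptic matrix J[s', s, j]: weight of input j being in state s on output s'
J = rng.standard_normal((Qp, Q, N))
J -= J.mean(axis=0, keepdims=True)        # gauge fixing (11): sum over s' is 0
J -= J.mean(axis=1, keepdims=True)        # threshold choice (12): sum over s is 0

def output(sigma):
    """Winner-take-all dynamics, eqs. (8) and (9); states are 0-indexed."""
    m = Q * np.eye(Q)[sigma].T - 1.0      # m[s, j] = Q*delta(s, sigma_j) - 1
    h = np.einsum('tsj,sj->t', J, m)      # local field for each output state
    return int(np.argmax(h))              # state with the highest field

sigma = rng.integers(0, Q, size=N)        # a random Q-state question
print("output state:", output(sigma))
```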
A spherical multi-class perceptron has the J_j^{s's} chosen to be real numbers such that each vector (J^{s's}) is normalised to 1, but what is the natural multi-class analogue of an Ising binary perceptron? Should we allow the J_j^{s's} to take many integer values? This point is considered further in the conclusion. In our present work, however, in which we shall only briefly use quantised interactions, we will choose each J_j^{s's} to be just ±1, which, for Q or Q' odd, means we must relinquish (11) and (12). We shall call the result an Ising multi-class perceptron. The generalisation of a multi-class perceptron to N' outputs, or to a fully connected network, for which N' = N and Q' = Q, as in [13, 14], is straightforward.
2.3 MULTI-CLASS PROBLEMS. A multi-class problem is one in which the answer may take several values. It is straightforward to extend the well-known proof for binary perceptrons [21] to show that a two-layer network of multi-class perceptrons may perform any logical function of its inputs. Here we will be concerned only with problems in which questions ξ^μ, strings N digits long with ξ_j^μ ∈ {1, ..., Q}, are associated with answers ξ_0^μ, where ξ_0^μ ∈ {1, ..., Q'}. A single multi-class perceptron can learn exactly problems which are the analogues of "linearly-separable" problems, i.e. those for which there exists a B_j^{s's} which associates the questions ξ^μ with the answers ξ_0^μ via the dynamical rule (8, 9), with (J^{s's}) replaced by (B^{s's}), and the B^{s's} obeying the same gauge fixings (11, 12). We will use a spherical multi-class perceptron when the (B^{s's}) are spherical, and an Ising perceptron when the B_j^{s's} = ±1. Generally multi-class problems are not linearly-separable, of course. A good example is the proximity-problem, in which prototypes may be associated with more than two outputs (this is discussed in more detail in section 4), and multi-class analogues also exist of all the unlearnable problems discussed in [6].
discussed in [fi].As the multi-class
analogue
of(4), following
[13], we introduce:where
m[
~ is the
Q'-fold
multi-classoperator
and7"'
is introduced to normalise each J"' to I. Forthi
case of
Q
= 2 this reduces to
~~' "~R~~'~~"~~~~
~~~~
These rules
obey
the gauge constraints(II)
and(12),
sincethey
are true for every term of thesum over p. However, since we know that the output is unaffected
by
the gaugefixing
we canequivalently,
forQ
= 2,analyse
J~'
=
~,,j Lb,,,<ri», (17)
»
which violates
ill)
but makes it clear that the(f")
affectonly
the J" with s'=
if.
Thusnoise in the
(J")
is uncorrelated for different s'.Gauge
constraint(11)
can be enforced afterlearninf by subtracting,
from eachJ~'~, I/(Q'- I)
times the sum of the other(J"),
and the set(J'
should then be renormalised.In the rest of this work we shall confine ourselves to
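A sketch of the reduced Hebb rule (17) with the after-learning gauge enforcement just described (binary ±1 inputs, 0-indexed answers; the function names and the explicit loop are ours):

```python
import numpy as np

def hebb_multiclass(xi, answers, Qp):
    """Multi-class Hebb rule in the reduced form (17): each example
    contributes only to the weight vector of its own answer."""
    p, N = xi.shape
    J = np.zeros((Qp, N))
    for mu in range(p):
        J[answers[mu]] += xi[mu]
    # enforce gauge (11) afterwards: subtract from each J^{s'} 1/(Q'-1)
    # times the sum of the others, then renormalise each vector
    S = J.sum(axis=0, keepdims=True)
    J = J - (S - J) / (Qp - 1)
    J /= np.linalg.norm(J, axis=1, keepdims=True)
    return J

def classify(J, question):
    """Answer = output state with the highest field, cf. eq. (13)."""
    return int(np.argmax(J @ question))
```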
In the rest of this work we shall confine ourselves to Q = 2 (binary inputs) and consider varying Q'. This is for simplicity and in the belief that problems with many answers are more interesting than those in which inputs take several values (many-valued inputs may be represented anyway by combinations of binary inputs). However, it is straightforward to extend the arguments of the next two sections to higher values of Q.
3. Learning a learnable rule.

An algebraic evaluation of Hebbian learning in the multi-class case is rather difficult, but may be avoided by a generalisation of the geometric argument used in the last section to derive (5). However, there is now a different distribution of overlaps between B^{s'} and those ξ^μ whose answer is s', as may be seen from figure 2 which, for Q' = 3, shows the plane containing the (B^{s'}): examples must fall closer to B^{s'} to be assigned to it. The average of this quantity is

    A(Q') ≡ Q' ⟨ √N (B^{s'} · ξ^μ) Π_{t'≠s'} θ(B^{s'} · ξ^μ - B^{t'} · ξ^μ) ⟩,    (18)

where ⟨·⟩ indicates the pattern average. This can be evaluated using the integral representation of the Heaviside function to give

    A(Q') = Q' √(Q'/(Q'-1)) ∫ Dz z [1 - H(z)]^{Q'-1},    (19)

where Dz ≡ exp(-z²/2) dz/√(2π), H(z) ≡ ∫_z^∞ Dz', and a ≡ B^{s'} · B^{t'} = -1/(Q' - 1) ∀ t' ≠ s'. The components of ξ^μ in perpendicular directions remain the same, however, in the large N limit. It follows that J^{s'} · B^{s'} = Aα/(Q'γ) (since only one example in Q' contributes to each B^{s'}), where, using Pythagoras, the modulus is again given by γ² = (Aα/Q')² + α/Q'. The components of (J^{s'}) perpendicular to the teacher vectors are, as explained, not correlated and so are effectively perpendicular in a high dimensional space. However, enforcing the gauge constraint, as explained at the end of section 2.3, correlates these other directions and gives

    J^{s'} · B^{t'} = R [δ_{s',t'} + a(1 - δ_{s',t'})],    (20)

with

    J^{s'} · J^{t'} = B^{s'} · B^{t'} = δ_{s',t'} + a(1 - δ_{s',t'})    (21)

by symmetry. We define R ≡ J^{s'} · B^{s'} for all s'.
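The average (18) can be checked by simulating the teacher fields directly: in the large N limit the √N B^{t'} · ξ are zero-mean Gaussians whose covariance matrix is the teacher overlap matrix δ + a(1 - δ). A minimal Monte-Carlo sketch (sample sizes, seed and names are ours); for Q' = 2 it should return √(2/π) ≈ 0.798:

```python
import numpy as np

def A_mc(Qp, samples=2_000_000, seed=2):
    """Monte-Carlo estimate of eq. (18): Q' times the mean teacher field over
    the examples assigned to that teacher (i.e. on which it wins)."""
    rng = np.random.default_rng(seed)
    V = np.eye(Qp) - 1.0 / Qp                      # simplex construction:
    V /= np.linalg.norm(V, axis=1, keepdims=True)  # rows overlap at -1/(Q'-1)
    y = rng.standard_normal((samples, Qp)) @ V.T   # fields with covariance B.B
    win = y.argmax(axis=1) == 0                    # examples assigned to s' = 0
    return Qp * np.mean(y[:, 0] * win)

for Qp in (2, 3, 4):
    print(Qp, A_mc(Qp))
```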
How are we to work out the generalisation error? The geometrical argument of section 2, comparing areas on a plane, which it was possible to generalise in [7] to areas on the surface of a sphere, is hard to apply here, since the sphere containing (B^{s'}) and (J^{s'}) is 2(Q' - 1)-dimensional, even after enforcing (11).

Fig. 2. The plane of the three teachers for Q' = 3. The dashed lines show the barriers between their regions.

We resort to an algebraic treatment of the sort used to derive (19), giving the generalisation error

    E_g = 1 - Σ_{a'} ⟨ Π_{b'≠a'} θ(B^{a'} · S - B^{b'} · S) Π_{c'≠a'} θ(J^{a'} · S - J^{c'} · S) ⟩_S
        = 1 - Q' ∫_0^∞ (d^{2(Q'-1)}x / ((2π)^{Q'-1} √(det T))) exp(-½ xᵀ T⁻¹ x),    (22)

where by symmetry the sum gives Q' identical terms and T is the 2(Q' - 1) × 2(Q' - 1) covariance matrix of the (√N-scaled) variables (B^{a'} - B^{b'}) · S and (J^{a'} - J^{c'}) · S, which takes the block form

    T = ( u(I+E)   v(I+E) )
        ( v(I+E)   u(I+E) ),    (23)

i.e. four (Q' - 1) × (Q' - 1) blocks, with I the identity, E the matrix of ones, u ≡ 1 - a and v ≡ R(1 - a). The results of this scheme are plotted in figure 3 for Q' = 2, 3 and 4. In all cases E_g ∼ 1/√α as α → ∞.
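Rather than evaluating the 2(Q' - 1)-dimensional integral (22), one can also sample the teacher and student fields directly from the overlap structure (20), (21). A sketch (the construction of the correlated fields and all names are ours); for Q' = 2 it reproduces cos⁻¹(R)/π:

```python
import numpy as np

def Eg_mc(Qp, R, samples=1_000_000, seed=3):
    """Monte-Carlo version of eq. (22): probability that the teacher and
    student winner-take-all decisions disagree, given overlaps (20), (21)."""
    rng = np.random.default_rng(seed)
    V = np.eye(Qp) - 1.0 / Qp
    V /= np.linalg.norm(V, axis=1, keepdims=True)  # mutual overlap -1/(Q'-1)
    z = rng.standard_normal((samples, Qp))
    w = rng.standard_normal((samples, Qp))
    b = z @ V.T                                    # teacher fields B^{s'}.S
    j = (R * z + np.sqrt(1 - R**2) * w) @ V.T      # student fields, cf. (20)
    return np.mean(b.argmax(axis=1) != j.argmax(axis=1))

print(Eg_mc(2, 0.9), np.arccos(0.9) / np.pi)       # binary check
```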
Although the Q' outputs could be represented by combinations of only ⌈log₂ Q'⌉ binary outputs, it would require Q' binary perceptrons to learn the problem exactly, since the input space must be divided by Q' planes, lying along the planes of the multi-class perceptron. It should be noted that the Hebb rule is unable to generate a set of planes to learn the problem perfectly, however, since, as may be seen from figure 1, it is not clear from a given example which planes should be changed.
A better way to teach the multi-class perceptron would be with the multi-class maximum stability rule, which can be generated by an algorithm given in [15], but analysis of the consequences is cumbersome for Q' > 2. We would expect qualitatively similar results to those above, however, but with E_g ∼ 1/α, as in [2]. To verify this we analyse high temperature learning, as explained in section 2, which in learnable problems is believed to be of qualitatively similar form to the low temperature result [3]. For simplicity we will assume (21), so that there is a single order parameter R ≡ B^{s'} · J^{s'} (taken to be the same for all s'), the cosine of the angle between the space containing (J^{s'}) and that containing (B^{s'}). We will further assume that the J^{s'} · B^{t'} are the same for all s' ≠ t'. We would not expect our choice for the overlaps between the (J^{s'}), which can easily be enforced by adding a term to the training energy, to make a difference to the dynamics, since allowing different overlaps will surely increase E_g without making an extensive contribution to the entropy. Equation (21) should thus be obeyed naturally. In problems where the angle between the teacher vectors is not known in advance, the multi-class perceptron should also naturally evolve to have the best angle between its planes.
The entropy required for the free energy (7) may be obtained by considering the (Q' - 1)-dimensional spaces in which the (J^{s'}) and (B^{s'}) lie; let us say these spaces are at an angle φ. If we choose any Q' - 1 orthogonal vectors in the B-space and for each choose, independently, a vector in the full N-dimensional space at an angle φ, then these new vectors (in the large N limit) will be effectively orthogonal, and so the space spanned by these new vectors is at an angle φ to that of the (B^{s'}), and thus could be that of the (J^{s'}). We know (from [7]) that the entropy associated with choosing a new vector at an angle φ to a given vector is (to within a constant) ½ log(1 - R²), where R = cos φ. Selection of the vectors within the B-space (i.e. the orientation of the J-space and the B-space) does not generate an extensive entropy, so the total entropy for (7) is [(Q' - 1)/2] log(1 - R²).

Fig. 3. Learning with the Hebb rule. This shows the generalisation error using Q' = 2, 3 and 4, as lines 1, 2 and 3 respectively.
The numerical results (for Q' = 2, 3 and 4), where the line for Q' = 2 is the same as in [3], show, as expected, that the spherical multi-class perceptron learns smoothly, with asymptotic E_g ∼ 1/α̃. The multi-class Ising perceptron, at least for Q' = 2, 3, has a first-order transition to perfect generalisation. The result for Q' = 3 is similar to the one for Q' = 2: the generalisation error E(α̃) decreases smoothly from 2/3 with increasing α̃ and jumps discontinuously to zero at α̃ ≈ 2.24.
to zero. To derive this
high
temperature result we had to determine theIsing
entropy, which isthe
logarithm
of the maximal number of statescompatible
with the constraints(20,21). Using
a
simple counting
argument we find for the entropys(Q'
=3)
= a+ In a+ + a- In a- +b+ Inb+
+ b- In b- + 4d In&,(24)
where a+ +
ii
+R) /8 d, b+
+3(1+ R)/8
and d isgiven by
the solution of d~10(1
R~)&~/64
+3(1 R~)&/64 [3(1 R~) /64]~
=0,
for(I R)/8
> d > 0.It is
It is possible to teach the same problem to Q' binary neurons in the same way. In fact, enforcing their weight vectors to have the same overlaps with each other as the (J^{s'}) makes the learning equivalent to that above, but of course at the expense of more neurons and a more complicated output representation.
4. The proximity problem.
Section 2.1 introduced one variety of unlearnable rule: the proximity problem of classifying inputs according to their Hamming distance from p_0 = αN prototypes (q^ρ). We learn the problem using noisy examples ξ^{ρλ}, such that q^ρ · ξ^{ρλ} = m. Reference [20] assumed that the best learning algorithm would be to search the J-space for the weight vector with the smallest training error; they were able to solve the model in the limit of small m, which implies that a large number of examples of each prototype must be presented.
However, it is possible to deduce the optimal learning algorithm using the very original approach of [18], which was developed for a different problem. They considered a highly diluted neural network storing patterns ζ^μ and minimised the output error of the stored patterns after the first time step, when the input patterns were ensembles of their noisy versions with overlap m with the clean patterns. From the work of [24] we know that the output error at the next time step is given by (neglecting additive constants) H(mΛ/√(1 - m²)), where Λ = √N ζ_0 (J · ζ), H(z) ≡ ∫_z^∞ Dz' and Dz ≡ exp(-z²/2) dz/√(2π). [18] showed that the J which minimises this expression for m small is given by the Hebb rule (4). For m close to 1 the optimal J is given by a maximum-stability rule (MSR). For intermediate m the optimal J has to be found numerically using a Maxwell construction.
a Maxwell construction.Reference [20] shows that the
generalisation
error of thebinary
perceptron in the case of theproximity problem
isgiven by H(mA/WW),
where A=
@qoJ
q.Using
theinsight
outlined above we see that
only
forhigh
m is the MSR(used
in[20]) optimal;
the J which minimises thegeneralisation
for small m(the
case [20]mainly considers)
isgiven by
the Hebb rule.This is because the MSR generates narrow,
deep valleys
in the energy surface around eachexample presented,
so that theseexamples
are well stored. The Hebbrule, by
contrast, gen-erates
wider,
butshallower, valleys
so that eachexample
may not beperfectly stored,
but itsinfluence extends over a wider
region.
If the noise ishigh,
soexamples
fall along
way from prototypes, the second rule is to beprefered,
togenerate
avalley
around prototypes, so in thislimit,
the one solvedby
[20], the Hebb rule isoptiinal
forbinary
perceptrons(as
we willverify).
We will assume the same resultapplies
to multi-class perceptrons.We shall
again
choose toanalyse
rule(17) (optionally enforcing (Ii) afterwards,
as in Sect.2),
whichimplies
that each J" is affectedonly by examples
(~~ such that's' =qt. If,
afterlearning,
thealignment
between J" and one suchq~
is called Athen,
from[20],
theoverlap
zbetween J" and a new
example
(~~ has a distribution~~~~~
2x(~- m2)
~~P~l ~~)
~~~~The distribution of
overlaps
between("~
and every other J" is anindependent
Gaussian with zero mean and unitvariance, independent
for different s'. The chance that("~
will beincorrectly classified,
Eg, is the chance that the correct field ishigher
than theQ'-
Iothers,
and so~, ~
Eg - i
/ A
exP
I- Ii "'ill
1£~ YI (26)
It should be noted that this
formula,
for the case ofQ'
= 2
(for
which oneintegral
may beperformed),
differsby
a factor ofvi
from that of [20]. This is because their A is defined asoverlap
with J=
fi(J~ J~),
the normalisedbinary
form. The normalisation factor isfi,
since J~ and J~
are,
by (17),
uncorrelated andthus,
in ahigh
dimensional space,effectively
N°2 LEARNING MULTI-CLASS CLASSIFICATION PROBLEMS 177
perpendicular.
An expression of similar form to (26) was derived in [13] for the very different problem of storing many-valued patterns at zero temperature in Hopfield-like networks.
networks.To calculate the distribution of
A,
consider that as far as a prototype(without
loss ofgenerality
prototypeI, q~)
is concerned,
the J" with
qj
=
s',
is the sum of three components:I) The correct prototype
q~; it)
noise from theexamples
of that prototype;iii)
noise fromother prototypes whose correct output is also s'. Each
example
has a component m in the direction of its prototype, so(I)
ispmq~/7",
andWfi
ina different random direction
(effectively perpendicular
for differentexamples
since p «N)
so that term(it), by Pythagoras,
is
p(1- m2)/7"
Themagnitude
of the sum of these terms is(p2m~
+pi
Im~))~/~ /7"
(iii)
is the sum of poIQ'_independent
vectors of thiskind, and,
since po ocN,
we add them in the same way as the noise terms of the Hebb rule in section3,
to obtain7"
=fi@(P~m~
+p(I m~))~/~,
the factorby
which J" must be normalised. ThenA,
theoverlap
between a pattern and J", is the sum of term(I)
and the random variable which is the component of(iii)
in the direction
(,
a Gaussian of width@,
~~~~~ l~~~'~ ~~ ~i~~~~
~~~~which, rescaling
with#
%pm~ /(I m~)
and oo %a(I m~) /m~, gives
Q,(1_~2) ~
~~~~~ ~~~~'~ ~o~~
~~~~
As m
- 0 and with ao
=
O(I)
we obtainPr(A)
= b(A @/(mfi)),
whichgives
Eg = i
/
Dz
HP-~ z
+
«~fl1/>~ (29)
For
Q'
= 2 this reduces to Eg = I H
(I/ ao(I
+I/#)).
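Equation (29) is a one-dimensional Gaussian integral and is cheap to evaluate. A sketch using Gauss-Hermite quadrature (scipy's erfc supplies H; the constant inside follows the reconstruction of (29) above, and the parameter values and names are ours):

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss
from scipy.special import erfc

def Eg_proximity(Qp, alpha0, beta, nodes=200):
    """Evaluate eq. (29), the small-m Hebb generalisation error of the
    proximity problem, by quadrature over the Gaussian measure Dz."""
    z, w = hermegauss(nodes)                    # weight function exp(-z^2/2)
    w = w / np.sqrt(2.0 * np.pi)                # normalise to the measure Dz
    c = np.sqrt(Qp * beta / (alpha0 * (1.0 + beta)))
    H = 0.5 * erfc((z + c) / np.sqrt(2.0))      # H(x) = int_x^inf Dz'
    return 1.0 - np.sum(w * (1.0 - H) ** (Qp - 1))

b, a0 = 5.0, 1.6
print(Eg_proximity(2, a0, b))                               # quadrature
print(0.5 * erfc(1.0 / np.sqrt(2.0 * a0 * (1.0 + 1.0/b))))  # closed form above
```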
Figure 4 shows a graph from [20] with α_0 = 1.6. Line 1 is their zero-temperature result (with overfitting) and line 2 is their result allowing a training error. The results for Hebb learning with a binary perceptron are shown as line 3; they tend to the same minimum generalisation error, but the Hebb rule typically requires one and a half orders of magnitude fewer examples than the algorithm of [20] to obtain the same generalisation error. Calling E(m) again the minimum training error, we see that, as in [20], E_g - E(m) ∼ 1/β as β → ∞, but clearly the coefficient of 1/β is very much lower for Hebb learning. The points show the results of a single numerical simulation using N = 5000 and m = 0.1.
Figure 5, which has a horizontal scale linear in log(1 + β), compares several values of Q' using the same α_0 = 1.6. The problem becomes harder for a larger number of output states, because inputs must be classified into a greater number of areas; however, the number of prototypes in each class is smaller. It is found that as Q' rises the minimal generalisation error first rises with it and then, at a critical Q'(α_0), falls back, tending to zero as Q' → ∞.
We can compare how the four-output problem is learnt by two binary neurons, encoding the four states as the possible combinations of the two spins (state 1 is (1,1), 2 is (1,-1), etc.). This representation is clearly not unique and breaks the symmetry of the answers. Each of the binary perceptrons learns to give one answer for patterns near two of the prototypes and the other answer for patterns near the remaining two.
Fig. 4. Learning the proximity problem for α_0 = 1.6. Lines 1 and 2, from [20], show respectively the effects of learning by minimising the training error and by fixing it at a finite value. Line 3 is the Hebb result, with experimental points shown for comparison.
Fig. 5. Learning the proximity problem for α_0 = 1.6 using Q' equal to 2 (line 1, as in figure 4), 3 (line 2), 4 (line 3) and 12 (line 4). Line 5 is Hebb learning of the Q' = 4 problem.
= 4 problem