HAL Id: hal-03239021
https://hal.archives-ouvertes.fr/hal-03239021
Submitted on 27 May 2021
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
A Coherence Approach to Learning from Reward.
Application to the Reactive Navigation of a simulated Mobile Robot
Frédéric Davesne, Claude Barret
To cite this version:
Frédéric Davesne, Claude Barret. A Coherence Approach to Learning from Reward. Application to
the Reactive Navigation of a simulated Mobile Robot. 4th European Workshop on Reinforcement
Learning (EWRL’99), Oct 1999, Lugano, Switzerland. ⟨hal-03239021⟩
tive Navigation Learning
Frédéric Davesne and Claude Barret
CEMIF - Systèmes Complexes, 40, Rue du Pelvoux - 91020 EVRY CEDEX - FRANCE,
davesne@cemif.univ-evry.fr
CEMIF - Systèmes Complexes, 40, Rue du Pelvoux - 91020 EVRY CEDEX - FRANCE,
barret@cemif.univ-evry.fr
Abstract. In this paper, a new kind of learning agent, the so-called Constraint based Memory Unit (CbMU), is described. The framework is the incremental building of a complex behaviour, given a set of basic tasks and a set of perceptive constraints that must be fulfilled to achieve the behaviour; the decision problem may be non-Markovian. At each time, one of the basic tasks is executed, so that the complex behaviour is a temporal sequence of elementary tasks.
A CbMU can be modelled as an adaptive switch which learns to choose, among its set of output channels, the one to be activated (given its perceptive data and a short term memory), in order to respect a particular constraint. An output channel may be linked either to the firing of a basic task or to the activation of another CbMU; this allows a hierarchical decisional process, implying different levels of contexts.
The dynamics of the system is learnt by means of a perceptive graph, and the cycles detected by the short term memory of a CbMU are utilised as sub-goals to build internal contexts. The learning procedure of a CbMU is a reinforcement learning inspired algorithm based on a heuristic which does not need internal parameters. It is achieved by a consistency law between the binary values of the connected nodes of the perceptive graph, inspired by the AI minimax algorithm.
In this article, an example of programming with CbMUs is given, using a simulated Khepera robot. The objective is to build a goal-reaching behaviour which is formulated by a high level strategy composed of logical rules using perceptive primitives. Four CbMUs are created, each one dedicated to the exploitation of particular perceptive data, and five basic tasks are utilised.
1 Introduction.
1.1 Development context.
Within the framework of mobile robotics, it is often difficult to establish a relationship between the data perceived by the robot and the behaviour it must achieve according to its input data.
Indeed, the perceptive data may be very noisy or may not be interpreted easily, so that modelling the mapping between perception and action could be a very difficult task. Reinforcement learning methods (Watkins, 1989) have been widely used in that context because they do not need prior knowledge about the process model. Moreover, they theoretically achieve incremental learning and they can cope with a possible inertia of the system. But finding suitable internal parameters for those algorithms is not intuitive and may be a difficult task (Bersini and Gorrini, 1996). Besides, it is not easy to find a compromise between the stability and the robustness of the algorithm and its incremental characteristic. So, the learning stage may be fast, but the amount of time needed to develop a successful experiment is often large. Finally, given that the reinforcement methods need to sufficiently explore the perception space before finding a suitable solution, learning to fit a complex behaviour in a reasonable lapse of time turns out to be impossible without finding out some characteristics of the process, leading to a problem with a significantly decreased perception space.
A solution could be to divide the whole task into coordinated sub-tasks, each one being easier to learn than the complex behaviour. However, the problem is turned into another one: choosing to execute a precise sub-task is often tricky, especially if the choice depends on the perceptual data of the agent. In that case, applying a simple switching is not generally sufficient; the agent has to learn to decide which sub-task is to be executed according to its input data. Moreover, when a failure in the learning process occurs, one has to know whether the mistake is due to a misleading choice of a sub-task or to an internal deficiency of the elected sub-task unit. In the latter case, it could be necessary to modify this unit to make it avoid the same mistake. So, it must have the capacity to learn at any time it is used: this is an important focus of incremental learning methods.
1.2 Overview.
The framework is the incremental building of a complex behaviour, given a fixed set of basic tasks. We suppose that the desired task can be seen as a set of constraints. For example, the cart pole problem (fig. 1) possesses two constraints which must be verified at each time: $X \in [X_{\min}, X_{\max}]$ and $\Theta \in [\Theta_{\min}, \Theta_{\max}]$.
So, a decisional process must be learnt, according to the perceptive constraints, in order to fulfil them at each time. A CbMU is a part of the decisional process. It is an adaptive switch which learns to choose, among its output channels, the one to be activated, given its input data. Here, the learning criterion is the respect of the CbMU constraint. Thus, it differs from the typical reinforcement based methods, although it has some strong links with the reinforcement learning concept: it is a trial/failure method which does not need prior knowledge of the process model and it is incremental.
1.3 Validation of the computing method that uses CbMUs.
A general goal-seeking problem will be computed, in which the obstacle avoidance is performed by a wall-following behaviour. To do so, the mobile robot Khepera (Mondada et al., 1994) simulator written [...] operating systems, will be utilised, which allows to test the robustness of the algorithm in a very noisy perceptive data context. The incremental capability and the learning rapidity of the algorithm will be shown.

Figure 1: The cart pole problem: a typical constraint based issue. (Cart position X bounded by Xmin and Xmax; pole angle Θ bounded by Θmin and Θmax.)
2 Constraint based Memory Unit specifications.
2.1 Main ideas
Framework. In this paper, we suppose that the task which is to be learnt can be achieved with a temporal sequence of a finite set of basic tasks (let p be the number of the basic tasks). Thus, at each time, one of them is executed, in order to fulfil a set of binary constraints. A constraint K can be written like this: $\forall t,\; X_{\min} < X(t)$ or like this: $\forall t,\; X(t) < X_{\max}$, where X is one signal of the continuous input space of the learning agent.
Hierarchical decomposition of the task. Let us consider a very simple task (T): "follow a wall", which is carried out with three basic tasks: "go forward", "move on the left" and "move on the right". The task can be divided into "follow a wall on the left" ($T_1$) OR "follow the wall on the right" ($T_2$). $T_1$ can be expressed like this: "do not bump into a wall on the left" ($T_3$) AND "do not be too far from the wall on your left" ($T_4$). The same decomposition can be done for ($T_2$). The choice between $T_1$ and $T_2$ is context-dependent; one decides to execute one of the two sub-tasks depending on two different contexts: "there is a wall on the left and there is no obstacle on the right" ($C_1$) and "there is a wall on the right and there is no obstacle on the left" ($C_2$); the choice between $T_3$ and $T_4$ is also context-dependent: "Am I going to bump into the wall?" ($C_3$) and "Am I going to be too far from the wall?" ($C_4$). All the contexts can be expressed with constraints.
We notice that the choice among the three basic tasks implies a hierarchical decisional process at [...]

($T$) IF CONTEXT($T$) = $C_1$ DO $T_1$
      ELSE IF CONTEXT($T$) = $C_2$ DO $T_2$
      ELSE This is not a proper context for following a wall
($T_1$) [$K_{T_1}$] IF CONTEXT($T_1$) = $C_3$ DO $T_3$
      ELSE IF CONTEXT($T_1$) = $C_4$ DO $T_4$
      ELSE choose the basic task "go forward"
($T_3$) Choose the basic task "move on the right"
($T_4$) Choose the basic task "move on the left"

$K_{T_1}$ can be expressed with the input signals of the system.
(Some identical sub-tasks can be done for ($T_2$).)
This basic example shows that we have built a program with some a priori knowledge upon what we precisely know about the task (for example, "if I am far from the wall on my left side, move on the left"). The bounds of the contexts and the switches from one context to another have to be learnt. This is done by the CbMUs, each one coping with a particular switch ($C_1 \leftrightarrow C_2$, $C_3 \leftrightarrow C_4$). The decomposition may reduce the input space or the output space for each learning switch.
Thus, knowing the hierarchical decomposition of the constraints and the set of basic tasks, the problem is to shape the different contexts and to learn how the dynamics brings the system from one context to another to fulfil the constraints.
Context specification. We assume that a context is not reduced to an area of the input space but also includes a short term memory (the task may be non-Markovian). Thus, a decision is taken according to the current input signal and the content of the short term memory.
Coarse description of a CbMU. A CbMU has a specific constraint to cope with. Its input space is continuous and is divided into a set of boxes (let n be the dimension of the input space). The CbMU may switch from one sub-task to another one when its input signal moves from one box to another.
The binary constraint of the CbMU is a set of conditions upon some of the components of the input space. For example, in the cart-pole problem, two of the four input components possess a constraint (X and Θ).
The CbMU learning process is based on a coarse learning of the dynamics of the system, by means of a perceptive graph (fig. 2). Each box of the input space is associated to a precise node (round node). The action of choosing a sub-task when entering a box (when the input signal moves from one box to another) is linked to a square node. The arc from a round node to a square node symbolises the choice of the CbMU whereas the arc from a square to a round node represents the response of the dynamics of the system when having chosen a precise [...] reached whenever the constraint is not fulfilled.
When entering a box, a sub-task is selected. The decision is taken regarding the binary quality of each action node.
Consistency law. Each node of the perceptive graph possesses a binary quality (- or +). At the beginning of the learning process, the quality of each node is +, except the quality of the ending node (-). Since we consider the learning of the dynamics as a two-player game (the CbMU and the dynamics), the qualities may be turned to - using a consistency law between the connected nodes, derived from the AI minimax algorithm. This may happen when a new arc is discovered. So, a CbMU may learn (modify its quality values) only when a new feature in the dynamics is discovered.
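A minimal sketch of such a consistency law is given here, under an assumption that this passage does not state explicitly: a choice (square) node is taken to be + only if every observed response leads to a + state (the dynamics plays the "min" player), and a state (round) node is taken to be + only if at least one of its choices is + (the CbMU plays the "max" player). The class and attribute names are hypothetical.

```python
from collections import defaultdict

END = "E"  # ending node, reached when the constraint is violated


class PerceptiveGraph:
    """Binary qualities (True for +, False for -) kept consistent, minimax style."""

    def __init__(self, actions):
        self.actions = tuple(actions)
        # (state, action) -> set of next states observed so far
        self.successors = defaultdict(set)
        self.state_q = defaultdict(lambda: True)   # every node starts at +
        self.state_q[END] = False                  # except the ending node
        self.action_q = defaultdict(lambda: True)

    def add_arc(self, state, action, next_state):
        """Record an observed transition; learning happens only if it is new."""
        if next_state in self.successors[(state, action)]:
            return False
        self.successors[(state, action)].add(next_state)
        self._restore_consistency()
        return True

    def _restore_consistency(self):
        # Qualities only ever flip from + to -, so this loop terminates.
        changed = True
        while changed:
            changed = False
            for (state, action), nexts in self.successors.items():
                # "min" step: a choice is bad if some response leads to a bad state
                if self.action_q[(state, action)] and not all(
                    self.state_q[s] for s in nexts
                ):
                    self.action_q[(state, action)] = False
                    changed = True
                # "max" step: a state is bad if no good choice remains
                if self.state_q[state] and not any(
                    self.action_q[(state, a)] for a in self.actions
                ):
                    self.state_q[state] = False
                    changed = True
```

In this sketch, choices that have never been tried keep their initial + quality, so a state only turns to - once all of its choices have been observed to lead to failure.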
Main hypothesis: the cycles within the perceptive graph are of special interest. Remember that we want to fulfil a constraint at each time. Since the perceptive graph possesses a finite number of nodes, some cycles may appear. Our hypothesis is that the cycles may be used to build the internal contexts of the CbMU.
Let us take the example of the pole-balancing problem with a 1-dimensional input space generated by Θ. The constraint is $\Theta \in [-0.2\ \mathrm{rad}, 0.2\ \mathrm{rad}]$ and the two basic tasks are "push on the left" and "push on the right". [-0.2, 0.2] is divided into 10 states $S_1, \ldots, S_{10}$ (fig. 3). The problem is clearly non-Markovian because we do not know the angular speed of the pole; we cannot build a successful policy if only one action is associated to each state.
Let us consider a short term memory containing the last 5 states reached. If a state appears twice in this memory, a cycle has been performed and a new context is created (with its own policy). We can build a successful policy with the following rule:
(At the beginning of the trial, no special context) IF $\Theta > 0$ PUSH ON THE RIGHT ELSE PUSH ON THE LEFT
(A cycle has been performed) Let $S = [S_{\min}, S_{\max}]$ be the last state in the short term memory. The policy is: IF $S_{\max} > 0$ (IF $\Theta > 0.04$ PUSH ON THE RIGHT ELSE PUSH ON THE LEFT) ELSE (IF $\Theta > -0.04$ PUSH ON THE RIGHT ELSE PUSH ON THE LEFT)
Although this rule is very simple, it allows the pole to be balanced for at least 100000 steps, even with a 15 percent noise upon Θ.
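The rule can be sketched as a small policy object. Several points are assumptions made for the illustration: the state boundaries are those read from figure 3, the memory is taken to record a state each time the signal enters it, and the threshold of the second branch is taken to be -0.04 (the sign is not legible in the source). No pole dynamics is simulated here, so this sketch does not by itself reproduce the reported result.

```python
import bisect
from collections import deque

# State boundaries in rad, as read from figure 3 (S1 = [-0.20, -0.16), ...).
BOUNDS = [-0.20, -0.16, -0.10, -0.06, -0.02, 0.0, 0.02, 0.06, 0.10, 0.16, 0.20]
MEMORY_SIZE = 5
MARGIN = 0.04  # threshold used once a cycle has been performed


def state_of(theta: float) -> int:
    """Return the index (1 to 10) of the state containing theta, clamped."""
    index = bisect.bisect_right(BOUNDS, theta)
    return min(max(index, 1), len(BOUNDS) - 1)


class CycleRulePolicy:
    def __init__(self) -> None:
        self.memory = deque(maxlen=MEMORY_SIZE)  # last states reached
        self.cycle_seen = False

    def act(self, theta: float) -> str:
        state = state_of(theta)
        # Assumption: a state is memorised each time the signal enters it.
        if not self.memory or self.memory[-1] != state:
            self.memory.append(state)
            if list(self.memory).count(state) > 1:
                self.cycle_seen = True  # a state appears twice: cycle performed
        if not self.cycle_seen:
            limit = 0.0  # no special context yet
        else:
            s_max = BOUNDS[self.memory[-1]]  # upper bound of the last state
            limit = MARGIN if s_max > 0 else -MARGIN
        return "push on the right" if theta > limit else "push on the left"
```

The only effect of the context is to shift the switching threshold, which acts as a crude substitute for the unobserved angular speed.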
So, the basic idea is that when a cycle is discovered in a perceptive graph, a special node and a new meta-action are created: the special node means "I have just done this cycle" and the meta-action is the sequence of state/action performed in this cycle. And a context is the combination of the last cycle encountered and an area in the input space.

Figure 2: The perceptive graph of a CbMU. (Left: perceptive graph with round nodes 1 to 5, the ending node E and arcs labelled a or b; right: the 2-dimensional input space and the boundary of the constraint domain.)
Advantages of the proposed method
- The hierarchical decomposition of the task through the different constraints makes it possible to bring in some a priori knowledge, reducing the input or the output space for each learning process. At each time, the decision involves different levels of contexts which filter the input data.
- The CbMUs can cope with Partially Observable Markov Decision Problems (POMDPs): the detection of cycles in the perceptive graph is used to build new internal contexts. A cycle is a kind of sub-goal which is memorised, like in the HQ-Learning method (Wiering and Schmidhuber, 1997). But the number of possible sub-goals does not need to be fixed at the beginning of the learning stage.
- A CbMU is able to adapt itself whenever a new arc is created in its perceptive graph, breaking the consistency law upon the qualities of some nodes.
- There are no internal parameters.
- The learning process is not CPU consuming, because it only consists of adding nodes or arcs and performing min or max operations upon the qualities of the nodes.
Drawbacks of the proposed method
- The learning process is designed for constraint based tasks (no optimal policy).
- The number of input signals must be small to have a reasonable number of nodes.
Figure 3: The pole-balancing problem with a 1-dimensional input space. (State boundaries in rad: -0.20, -0.16, -0.10, -0.06, -0.02, 0, 0.02, 0.06, 0.10, 0.16, 0.20, delimiting the states S1 to S10.)

Figure 4: The external structure of a CbMU. (Inputs: I1, I2, I3, CST, ACT; outputs: O1, O2, V1, V2, FAIL, CNX.)
2.2 External structure of a CbMU.
A CbMU (fig. 4) is a black box composed of three kinds of inputs: the perceptive data, which is a vector $(I_1, \ldots, I_n) \in \mathbb{R}^n$, the CST bit, which is the binary value of the constraint at time t, and the ACT bit, which is the current state of activation of the CbMU. An output channel among the vector $(O_1, \ldots, O_p) \in \{0,1\}^p$ may be fired only if the ACT bit is set to 1 (the CbMU is activated). At each time, one and only one channel may be activated. It represents the choice of the CbMU, given the perceptive data $(I_1, \ldots, I_n)$, in order to respect the constraint given by the CST bit.
The choice leads to a modification of the CbMU environment, so that it changes the values of the input data, leading to a possible change of the CST bit (fig. 5). The externally available information is:
- The binary qualities $(V_1, \ldots, V_p) \in \{0,1\}^p$ associated to the firing of the output channels. If $V_k$ is set to 0, it means that the activation of the channel k is considered to lead (sooner or later) to a non-respect of the constraint CST (see paragraph 2.4).
- The FAIL bit, which is set to 1 if the learning procedure of the CbMU has failed (see paragraph 2.4).
- The CNX bit, which is set to 1 if the connexion to the CbMU is allowed (the ACT bit modification is permitted by the CbMU). The allowance condition is: FAIL=0 and CST=1. If the CNX bit equals 0, the ACT bit is automatically set to 0 (the CbMU disconnects itself).
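The allowance condition on the CNX bit can be written down directly. The sketch below is only an illustration of that condition; the class name `CbMUSignals` and the method `request_activation` are hypothetical and not part of the original description.

```python
from dataclasses import dataclass


@dataclass
class CbMUSignals:
    cst: int      # 1 while the constraint is respected, 0 otherwise
    fail: int     # 1 once the learning procedure has failed
    act: int = 0  # current activation state of the CbMU

    @property
    def cnx(self) -> int:
        """Connexion allowed only if FAIL = 0 and CST = 1."""
        return int(self.fail == 0 and self.cst == 1)

    def request_activation(self, wanted: int) -> int:
        """Set ACT as requested if CNX = 1; otherwise the unit disconnects itself."""
        self.act = wanted if self.cnx else 0
        return self.act
```

A parent unit would call `request_activation(1)` before relying on the outputs of this CbMU, and fall back to another choice when the returned ACT bit is 0.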
Figure 5: Diagram showing the links between a CbMU and its environment, while the CNX bit remains equal to 1. (Loop: input vector I, CbMU, output channel activated, modification of the environment, feedback signal CST.)
2.3 Internal structure of a CbMU.
The CbMU is internally composed of two main items (fig. 6), whose goal is to provide at each time a quality to the firing of each available output channel, given a particular input data:
- a set of perceptual areas $\{Z_1, \ldots, Z_p\}$, each one linked to a particular output channel. Each $Z_k \subseteq \mathbb{R}^{n_k}$, with $n_k \le n$, is connected to some of the input channels of the CbMU and is divided a priori into a set of $b_k$ boxes $Box_k^j$, $j \in \{1, \ldots, b_k\}$, created according to the following set of equations:

$$
\begin{cases}
\forall k \in \{1, \ldots, p\}, \quad \bigcup_{j \in \{1, \ldots, b_k\}} Box_k^j = Z_k \\
\forall \{j, l\} \in \{1, \ldots, b_k\}^2,\; j \neq l, \quad Box_k^j \cap Box_k^l = \emptyset \\
Box_k^j = \{\, I = (I_1, \ldots, I_{n_k}) \;/\; \forall l \in \{1, \ldots, n_k\},\; m_l^j \le I_l < M_l^j \,\}
\end{cases}
$$

  Thus, each box $Box_k^j$ is parameterised by $n_k$ couples of values $(m_l^j, M_l^j)$ which are the boundary values for each perceptive input signal used by the perceptive area $Z_k$ associated with the channel k of the CbMU.
- a set of pre-connected bits, whose initial value is 1, divided into two categories:
  1. the perceptual state bits P, each of which may be associated to a set of p boxes $\{Box_1^{j_1}, \ldots, Box_p^{j_p}\}$.
  2. the choice bits C, each of them pre-connected to a perceptual state bit.
  The pre-existing ending state E corresponds to a non-respect of the CST constraint (CST turns to 0).
The way the perceptual areas are divided is considered to be a priori knowledge: it is not modified during the learning stage of the CbMU.
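As an illustration of the box definition, the sketch below locates the box that contains an input signal, using the half-open bounds m ≤ I < M for each component. The helper `grid_boxes`, which builds a regular partition from per-axis edges, is a hypothetical convenience: the text only requires that the boxes be disjoint and cover the perceptual area.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple

# A box is one (m_l, M_l) pair of boundary values per input component.
Box = List[Tuple[float, float]]


def in_box(signal: Sequence[float], box: Box) -> bool:
    """True if m_l <= I_l < M_l holds for every component l."""
    return all(m <= x < M for x, (m, M) in zip(signal, box))


def find_box(signal: Sequence[float], boxes: Sequence[Box]) -> Optional[int]:
    """Return the 1-based index j of the box containing the signal, or None."""
    for j, box in enumerate(boxes, start=1):
        if in_box(signal, box):
            return j
    return None  # the signal lies outside the perceptual area


def grid_boxes(edges_per_axis: Sequence[Sequence[float]]) -> List[Box]:
    """Build disjoint boxes from sorted edges on each axis (hypothetical helper)."""
    axes = [list(zip(edges[:-1], edges[1:])) for edges in edges_per_axis]
    return [list(combo) for combo in product(*axes)]
```

Because the intervals are half-open, a signal lying exactly on a shared boundary belongs to one box only, which keeps the boxes disjoint.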
Short term memory and cycle detection. It recalls the last 5 action nodes the CbMU has reached. If the last element of the short term memory is equal to one of the four others, a cycle has just been performed and the CbMU switches to a new context. All the contexts are associated to a precise cycle and possess their own perceptive graph; the system switches from a context $K_i$ to a context $K_j$ by performing the cycle associated to $K_j$ in the perceptive graph of $K_i$.
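The cycle test described above can be sketched with a bounded queue. The class name `ShortTermMemory` and the choice to return the cycle as a tuple (so that it can serve as a key identifying a context) are assumptions of this sketch, not details given in the text.

```python
from collections import deque
from typing import Hashable, Optional, Tuple


class ShortTermMemory:
    """Keeps the last `size` action nodes and reports cycles as they close."""

    def __init__(self, size: int = 5) -> None:
        self.nodes = deque(maxlen=size)

    def push(self, node: Hashable) -> Optional[Tuple[Hashable, ...]]:
        """Record a node; return the cycle just performed, or None."""
        self.nodes.append(node)
        earlier = list(self.nodes)[:-1]  # the (at most) four other elements
        if node not in earlier:
            return None
        # Position of the most recent earlier occurrence of the node.
        start = len(earlier) - 1 - earlier[::-1].index(node)
        # The cycle is the sequence of nodes from that occurrence onwards.
        return tuple(earlier[start:])
```

A dictionary keyed by the returned tuple could then map each cycle to its own context and perceptive graph.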
2.4 Learning procedure of a CbMU.
Introduction. The proposed learning algorithm has some strong links with the reinforcement learning concept: it is a trial/failure method, it does not need prior knowledge of the process model, it copes with the temporal credit assignment problem and it is incremental.
However, it is not based on an optimisation method, but on the respect of binary perceptive constraints. Moreover, each CbMU may learn (that is to say, "adapt itself to correct a detected inconsistency between the real facts and the predicted ones") whenever it is activated.
The objective of each pre-connected set of bits is to evaluate the impact of a choice among the $O_k$ on the evolution of the perception signal I received by the CbMU. The binary value of a P bit expresses the quality of the associated perceptive state, that is to say the capability of the CbMU to find a sequence of choices from this state in order to respect the constraint of the CbMU. The binary value of a C bit expresses the quality of a choice from a precise perceptive state.
The learning algorithm is based on two items:
- the on-line building of connexions between the sets of pre-connected bits (the so-called perceptive graph), making an internal representation of the dynamics of the system.
- a consistency law between two connected bits of the perceptive graph, derived from the AI minimax algorithm (Rich, 1983).
The perceptive graph. The objective is to evaluate the impact of a choice among the $O_k$ on the evolution of the perception signal I received by the CbMU. To do so, while the CST bit remains equal to 1, the CbMU possesses at each time a single active perceptual state P. When a failure is detected (CST turns to 0), the CbMU is in the special state E.
Thus, the CbMU $A_i$ has a finite number of states, including an ending state E. The dynamics of the system makes the agent move from one state P to another state P', according to the choice C made