A Coherence Approach to Learning from Reward. Application to the Reactive Navigation of a simulated Mobile Robot


HAL Id: hal-03239021

https://hal.archives-ouvertes.fr/hal-03239021

Submitted on 27 May 2021



Frédéric Davesne, Claude Barret

To cite this version:

Frédéric Davesne, Claude Barret. A Coherence Approach to Learning from Reward. Application to the Reactive Navigation of a simulated Mobile Robot. 4th European Workshop on Reinforcement Learning (EWRL'99), Oct 1999, Lugano, Switzerland. hal-03239021


tive Navigation Learning

Frederic Davesne and Claude Barret

CEMIF - Systemes Complexes, 40, Rue du Pelvoux - 91020 EVRY CEDEX - FRANCE, davesne@cemif.univ-evry.fr

CEMIF - Systemes Complexes, 40, Rue du Pelvoux - 91020 EVRY CEDEX - FRANCE, barret@cemif.univ-evry.fr

Abstract. Within this paper, a new kind of learning agent, the so-called Constraint based Memory Unit (CbMU), is described. The framework is the incremental building of a complex behaviour, given a set of basic tasks and a set of perceptive constraints that must be fulfilled to achieve the behaviour; the decision problem may be non-Markovian. At each time, one of the basic tasks is executed, so that the complex behaviour is a temporal sequence of elementary tasks.

A CbMU can be modelled as an adaptive switch which learns to choose among its set of output channels the one to be activated (given its perceptive data and a short term memory), in order to respect a particular constraint. An output channel may be linked either to the firing of a basic task or to the activation of another CbMU; this allows a hierarchical decisional process, implying different levels of contexts.

The dynamics of the system is learnt by means of a perceptive graph, and the cycles detected by the short term memory of a CbMU are utilised as sub-goals to build internal contexts. The learning procedure of a CbMU is a reinforcement learning inspired algorithm based on a heuristic which does not need internal parameters. It is achieved by a consistency law between the binary values of the connected nodes of the perceptive graph, inspired from the AI minimax algorithm.

In this article, an example of programming with CbMUs is given, using a simulated Khepera robot. The objective is to build a goal-reaching behaviour which is formulated by a high level strategy composed of logical rules using perceptive primitives. Four CbMUs are created, each one dedicated to the exploitation of particular perceptive data, and five basic tasks are utilised.

1 Introduction.

1.1 Development context.

Within the framework of mobile robotics, it is often difficult to establish a relationship between the data perceived by the robot and the behaviour it must achieve according to its input data.

Indeed, the perceptive data may be very noisy or may not be interpreted easily, so that modelling the mapping between perception and action could be a very difficult task. Reinforcement learning methods (Watkins, 1989) have been widely used in that case because they do not need prior knowledge about the process model. Moreover, they theoretically achieve incremental learning and they can cope with a possible inertia of the system. But finding suitable internal parameters for those algorithms is not intuitive and may be a difficult task (Bersini and Gorrini, 1996). Besides, it is not easy to find a compromise between the stability and the robustness of the algorithm and its incremental characteristic. So, the learning stage may be fast, but the amount of time needed to develop a successful experiment is often considerable. Finally, given that the reinforcement methods need to sufficiently explore the perception space before finding a suitable solution, learning to fit a complex behaviour in a reasonable lapse of time turns out to be impossible without finding out some characteristics of the process, leading to a problem with a significantly decreased perception space.

A solution could be to divide the whole task into coordinated sub-tasks, each one being easier to learn than the complex behaviour. However, the problem is turned into another one: choosing to execute a precise sub-task is often tricky, especially if the choice depends on the perceptual data of the agent. In that case, applying a simple switching is not generally sufficient; the agent has to learn to decide which sub-task is to be executed according to its input data. Moreover, when a failure in the learning process occurs, one has to know whether the mistake is due to a misleading choice of a sub-task or to an internal deficiency of the elected sub-task unit. In the latter case, it could be necessary to modify this unit to make it avoid the same mistake. So, it must have the capacity to learn at any time it is used: this is an important focus of incremental learning methods.

1.2 Overview.

The framework is the incremental building of a complex behaviour, given a fixed set of basic tasks. We suppose that the desired task can be seen as a set of constraints. For example, the cart pole problem (fig. 1) possesses two constraints which must be verified at each time: $X \in [X_{min}, X_{max}]$ and $\Theta \in [\Theta_{min}, \Theta_{max}]$.

So, a decisional process must be learnt, according to the perceptive constraints, in order to fulfil them at each time. A CbMU is a part of the decisional process. It is an adaptive switch which learns to choose among its output channels the one to be activated, given its input data. Here, the learning criterion is the respect of the CbMU constraint. Thus, it differs from the typical reinforcement based methods, although it has some strong links with the reinforcement learning concept: it is a trial/failure method which does not need prior knowledge of the process model, and it is incremental.

1.3 Validation of the computing method that uses CbMUs.

A general goal-seeking problem will be computed, in which the obstacle avoidance is performed by a wall-following behaviour. To do so, the mobile robot Khepera (Mondada et al., 1994) simulator written [...] operating systems, will be utilised, which makes it possible to test the robustness of the algorithm in a very noisy perceptive data context. The incremental capability and the learning rapidity of the algorithm will be shown.

Figure 1: The cart pole problem: a typical constraint based issue. (Labels: cart position X between Xmin and Xmax; pole angle Θ between Θmin and Θmax.)

2 Constraint based Memory Unit specifications.

2.1 Main ideas

Framework. In this paper, we suppose that the task which is to be learnt can be achieved with a temporal sequence of a finite set of basic tasks (let p be the number of the basic tasks). Thus, at each time, one of them is executed, in order to fulfil a set of binary constraints. A constraint K can be written like this: $\forall t,\ X_{min} < X(t)$, or like this: $\forall t,\ X(t) < X_{max}$, where X is one signal of the continuous input space of the learning agent.
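The two constraint forms above reduce to a boolean test on one component of the input signal. The following is a minimal Python sketch of that idea, not the authors' implementation: the names `LowerBound`, `UpperBound` and `cst_bit` are hypothetical, and the numeric bounds on the cart position are assumed values chosen only for illustration (the bounds on the angle follow the pole-balancing example given later in this section).

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass(frozen=True)
class LowerBound:
    """Constraint of the form: for all t, x_min < X(t)."""
    index: int    # component of the input vector that plays the role of X
    x_min: float

    def holds(self, signal: Sequence[float]) -> bool:
        return self.x_min < signal[self.index]


@dataclass(frozen=True)
class UpperBound:
    """Constraint of the form: for all t, X(t) < x_max."""
    index: int
    x_max: float

    def holds(self, signal: Sequence[float]) -> bool:
        return signal[self.index] < self.x_max


Constraint = Union[LowerBound, UpperBound]


def cst_bit(constraints: Sequence[Constraint], signal: Sequence[float]) -> int:
    """Return 1 while every constraint holds for the current signal, else 0."""
    return int(all(c.holds(signal) for c in constraints))


# Cart-pole style constraint set on (X, Theta).
# The +/-2.4 bounds on X are an assumption made for this sketch.
cart_pole = [
    LowerBound(0, -2.4), UpperBound(0, 2.4),
    LowerBound(1, -0.2), UpperBound(1, 0.2),
]
```

With these assumed bounds, `cst_bit(cart_pole, (0.0, 0.1))` evaluates to 1 and `cst_bit(cart_pole, (0.0, 0.25))` to 0, because the strict inequality on the angle is violated.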

Hierarchical decomposition of the task. Let us consider a very simple task (T): "follow a wall", which is carried out with three basic tasks: "go forward", "move on the left" and "move on the right". The task can be divided into "follow a wall on the left" ($T_1$) OR "follow the wall on the right" ($T_2$). $T_1$ can be expressed like this: "do not bump into a wall on the left" ($T_3$) AND "do not be too far from the wall on your left" ($T_4$). The same decomposition can be done for ($T_2$). The choice between $T_1$ and $T_2$ is context-dependent; one decides to execute one of the two sub-tasks depending on two different contexts: "there is a wall on the left and there is no obstacle on the right" ($C_1$) and "there is a wall on the right and there is no obstacle on the left" ($C_2$); the choice between $T_3$ and $T_4$ is also context-dependent: "Am I going to bump into the wall?" ($C_3$) and "Am I going to be too far from the wall?" ($C_4$). All the contexts can be expressed with constraints.

We notice that the choice among the three basic tasks implies a hierarchical decisional process at [...]

($T$) IF CONTEXT($T$) = $C_1$ DO $T_1$ ELSE IF CONTEXT($T$) = $C_2$ DO $T_2$ ELSE "This is not a proper context for following a wall"

($T_1$) [$K_{T_1}$] IF CONTEXT($T_1$) = $C_3$ DO $T_3$ ELSE IF CONTEXT($T_1$) = $C_4$ DO $T_4$ ELSE choose the basic task "go forward"

($T_3$) Choose the basic task "move on the right"

($T_4$) Choose the basic task "move on the left"

$K_{T_1}$ can be expressed with the input signals of the system.

(some identical sub-tasks can be done for ($T_2$))

This basic example shows that we have built a program with some a priori knowledge upon what we precisely know about the task (for example, "if I am far from the wall on my left side, move on the left"). The bounds of the contexts and the switches from one context to another have to be learnt. This is done by the CbMUs, each one coping with a particular switch ($C_1 \leftrightarrow C_2$, $C_3 \leftrightarrow C_4$). The decomposition may reduce the input space or the output space for each learning switch.

Thus, knowing the hierarchical decomposition of the constraints and the set of basic tasks, the problem is to shape the different contexts and to learn how the dynamics brings the system from one context to another to fulfil the constraints.
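The rule set for the wall-following task can be written as two nested switches. The Python sketch below is only an illustration under stated assumptions: the contexts are passed in as ready-made labels, whereas in the approach described here their bounds are learnt by the CbMUs; the function names are hypothetical; and the mirrored rules for the right-hand wall are an assumption, since the text only says that the same decomposition can be done for T2.

```python
from typing import Optional

GO_FORWARD = "go forward"
MOVE_LEFT = "move on the left"
MOVE_RIGHT = "move on the right"


def follow_left_wall(context: Optional[str]) -> str:
    """Unit T1: arbitrate between T3 and T4 from the contexts C3 and C4."""
    if context == "C3":        # "Am I going to bump into the wall?"
        return MOVE_RIGHT      # T3
    if context == "C4":        # "Am I going to be too far from the wall?"
        return MOVE_LEFT       # T4
    return GO_FORWARD


def follow_right_wall(context: Optional[str]) -> str:
    """Unit T2: assumed mirror image of T1 (not spelled out in the text)."""
    if context == "C3":
        return MOVE_LEFT
    if context == "C4":
        return MOVE_RIGHT
    return GO_FORWARD


def follow_wall(top_context: Optional[str],
                sub_context: Optional[str] = None) -> Optional[str]:
    """Unit T: choose T1 or T2 from C1/C2.

    Returns None when neither context holds, i.e. "this is not a proper
    context for following a wall".
    """
    if top_context == "C1":
        return follow_left_wall(sub_context)
    if top_context == "C2":
        return follow_right_wall(sub_context)
    return None
```

Each function corresponds to one switch, so each learning unit would only have to deal with the inputs and outputs of its own level.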

Context specification. We assume that a context is not reduced to an area of the input space but also includes a short term memory (the task may be non-Markovian). Thus, a decision is taken according to the current input signal and the content of the short term memory.

Coarse description of a CbMU. A CbMU has a specific constraint to cope with. Its input space is continuous and is divided into a set of boxes (let n be the dimension of the input space). The CbMU may switch from one sub-task to another one when its input signal moves from one box to another.

The binary constraint of the CbMU is a set of conditions upon some of the components of the input space. For example, in the cart-pole problem, two of the four input components possess a constraint (X and Θ).

The CbMU learning process is based on a coarse learning of the dynamics of the system, by means of a perceptive graph (fig. 2). Each box of the input space is associated to a precise node (round node). The action of choosing a sub-task when entering a box (when the input signal moves from one box to another) is linked to a square node. The arc from a round node to a square node symbolises the choice of the CbMU, whereas the arc from a square node to a round node represents the response of the dynamics of the system when having chosen a precise [...] reached whenever the constraint is not fulfilled.

When entering a box, a sub-task is selected. The decision is taken regarding the binary quality of each action node.

Consistency law. Each node of the perceptive graph possesses a binary quality (- or +). At the beginning of the learning process, the quality of each node is +, except the quality of the ending node (-). Since we consider the learning of the dynamics as a two-player game (the CbMU and the dynamics), the qualities may be turned to - using a consistency law between the connected nodes, derived from the AI minimax algorithm. This may happen when a new arc is discovered. So, a CbMU may learn (modify its quality values) only when a new feature in the dynamics is discovered.
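One possible reading of this consistency law can be sketched in Python. The details below are assumptions rather than the authors' algorithm: an action (square) node is taken to be - as soon as one of its observed successors is - (the dynamics plays the minimising side), and a state (round) node is taken to be - only when every basic task from it is - (the CbMU plays the maximising side). The class and method names are hypothetical.

```python
END = "E"  # ending node, reached when the constraint is not fulfilled


class PerceptiveGraph:
    """Binary-quality graph; every node is '+' unless listed in self.minus."""

    def __init__(self, tasks):
        self.tasks = tuple(tasks)   # basic tasks selectable in any state
        self.states = set()         # round nodes discovered so far
        self.outcomes = {}          # (state, task) -> set of observed next states
        self.minus = {END}          # nodes whose quality is '-'

    def quality(self, node) -> bool:
        """True for '+', False for '-'."""
        return node not in self.minus

    def add_arc(self, state, task, next_state) -> bool:
        """Record an observed transition; qualities change only on a new arc."""
        self.states.add(state)
        if next_state != END:
            self.states.add(next_state)
        seen = self.outcomes.setdefault((state, task), set())
        if next_state in seen:
            return False            # nothing new: no learning takes place
        seen.add(next_state)
        self._restore_consistency()
        return True

    def _restore_consistency(self):
        """Propagate '-' qualities until no node changes any more."""
        changed = True
        while changed:
            changed = False
            # Action node: min over the observed responses of the dynamics.
            for action, nexts in self.outcomes.items():
                if action not in self.minus and any(n in self.minus for n in nexts):
                    self.minus.add(action)
                    changed = True
            # State node: max over the tasks (untried tasks are still '+').
            for state in self.states:
                if state not in self.minus and all(
                    (state, t) in self.minus for t in self.tasks
                ):
                    self.minus.add(state)
                    changed = True
```

In this sketch an untried task keeps its initial + quality, so a state is only given up once every task from it has been observed to lead, sooner or later, to the ending node.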

Main hypothesis: the cycles within the perceptive graph are of special interest. Remember that we want to fulfil a constraint at each time. Since the perceptive graph possesses a finite number of nodes, some cycles may appear. Our hypothesis is that the cycles may be used to build the internal contexts of the CbMU.

Let us take the example of the pole-balancing problem with a 1-dimensional input space generated by Θ. The constraint is $\Theta \in [-0.2\ \mathrm{rad}, 0.2\ \mathrm{rad}]$ and the two basic tasks are "push on the left" and "push on the right". [-0.2, 0.2] is divided into 10 states $S_1..S_{10}$ (fig. 3). The problem is clearly non-Markovian because we do not know the angular speed of the pole; we cannot build a successful policy if only one action is associated to each state.

Let us consider a short term memory containing the last 5 states reached. If a state appears twice in this memory, a cycle has been performed and a new context is created (with its own policy). We can build a successful policy with the following rule:

(At the beginning of the trial, no special context) IF Θ > 0 PUSH ON THE RIGHT ELSE PUSH ON THE LEFT

(A cycle has been performed) Let $S = [S_{min}, S_{max}]$ be the last state in the short term memory. The policy is: IF $S_{max}$ > 0 (IF Θ > 0.04 PUSH ON THE RIGHT ELSE PUSH ON THE LEFT) ELSE (IF Θ > -0.04 PUSH ON THE RIGHT ELSE PUSH ON THE LEFT)

Although this rule is very simple, it can balance the pole for at least 100000 steps, even with 15 percent noise upon Θ.
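The rule above can be written down as a small Python sketch. Several points are assumptions made for illustration: the state boundaries are read from the tick marks of figure 3, a state is stored in the memory only when it differs from the previous one, and the "cycle" context is taken to hold for as long as the memory contains a repeated state. The names `state_of` and `CyclePolicy` are hypothetical, and no pole dynamics is simulated here.

```python
from bisect import bisect_right
from collections import deque

# Boundaries of S1..S10 in radians, as read from figure 3.
BOUNDS = [-0.20, -0.16, -0.10, -0.06, -0.02, 0.0, 0.02, 0.06, 0.10, 0.16, 0.20]

RIGHT = "push on the right"
LEFT = "push on the left"


def state_of(theta: float) -> int:
    """Return k (1..10) such that theta lies in S_k = [BOUNDS[k-1], BOUNDS[k])."""
    if not BOUNDS[0] <= theta <= BOUNDS[-1]:
        raise ValueError("constraint violated: theta outside [-0.2, 0.2]")
    return min(bisect_right(BOUNDS, theta), 10)  # 0.20 itself is put in S10


class CyclePolicy:
    """Hand-written rule using a short term memory of the last visited states."""

    def __init__(self, size: int = 5):
        self.memory = deque(maxlen=size)

    def act(self, theta: float) -> str:
        s = state_of(theta)
        # Assumption: only a change of state counts as reaching a state.
        if not self.memory or self.memory[-1] != s:
            self.memory.append(s)
        in_cycle = len(set(self.memory)) < len(self.memory)
        if not in_cycle:
            return RIGHT if theta > 0 else LEFT
        s_max = BOUNDS[self.memory[-1]]          # upper bound of the last state
        threshold = 0.04 if s_max > 0 else -0.04
        return RIGHT if theta > threshold else LEFT
```

The only effect of the memory is to shift the switching threshold away from zero once the pole has been seen coming back to a state it recently left.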

So, the basic idea is that when a cycle is discovered in a perceptive graph, a special node and a new meta-action are created: the special node means "I have just done this cycle" and the meta-action is the sequence of state/action performed in this cycle. And a context is the combination of the last cycle encountered and an area in the input space.

Figure 2: The perceptive graph of a CbMU. (Panels: a 2-dimensional input space with the boundary of the constraint domain; the corresponding perceptive graph with state nodes 1 to 5, ending node E and actions a and b.)

Advantages of the proposed method

- The hierarchical decomposition of the task through the different constraints makes it possible to bring in some a priori knowledge, reducing the input or the output space for each learning process. At each time, the decision involves different levels of contexts which filter the input data.
- The CbMUs can cope with partially observable Markov decision problems (POMDPs): the detection of cycles in the perceptive graph is used to build new internal contexts. A cycle is a kind of sub-goal which is memorised, as in the HQ-Learning method (Wiering and Schmidhuber, 1997). But the number of possible sub-goals does not need to be fixed at the beginning of the learning stage.
- A CbMU is able to adapt itself whenever a new arc is created in its perceptive graph, breaking the consistency law upon the qualities of some nodes.
- There are no internal parameters.
- The learning process is not CPU consuming, because it only consists in adding nodes or arcs and performing min or max operations upon the qualities of the nodes.

Drawbacks of the proposed method

- The learning process is designed for constraint based tasks (no optimal policy).
- The number of input signals must be small to have a reasonable number of nodes.

Figure 3: The pole-balancing problem with a 1-dimensional input space. (The interval [-0.20, 0.20] is split into the states S1 to S10 at the boundaries -0.20, -0.16, -0.10, -0.06, -0.02, 0, 0.02, 0.06, 0.10, 0.16 and 0.20.)

Figure 4: The external structure of a CbMU. (Labels: inputs I1, I2, I3, CST and ACT; outputs O1, O2, V1, V2, FAIL and CNX.)

2.2 External structure of a CbMU.

A CbMU (fig. 4) is a black box composed of three kinds of inputs: the perceptive data, which is a vector $(I_1, \dots, I_n) \in \mathbb{R}^n$, the CST bit, which is the binary value of the constraint at time t, and the ACT bit, which is the current state of activation of the CbMU. An output channel among the vector $(O_1, \dots, O_p) \in \{0,1\}^p$ may be fired only if the ACT bit is set to 1 (the CbMU is activated). At each time, one and only one channel may be activated. It represents the choice of the CbMU, given the perceptive data $(I_1, \dots, I_n)$, in order to respect the constraint given by the CST bit.

The choice leads to a modification of the CbMU environment, so that it changes the values of the input data, leading to a possible change of the CST bit (fig. 5). The externally available information is:

- The binary qualities $(V_1, \dots, V_p) \in \{0,1\}^p$ associated to the firing of the output channels. If $V_k$ is set to 0, it means that the activation of the channel k is considered to lead (sooner or later) to a non-respect of the constraint CST (see paragraph 2.4).
- The FAIL bit, which is set to 1 if the learning procedure of the CbMU has failed (see paragraph 2.4).
- The CNX bit, which is set to 1 if the connexion to the CbMU is allowed (the ACT bit modification is permitted by the CbMU). The allowance condition is: FAIL=0 and CST=1. If the CNX bit equals 0, the ACT bit is automatically set to 0 (the CbMU disconnects itself).
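The relations between the CST, FAIL, CNX and ACT bits can be summarised in a few lines of Python. This is a minimal sketch with hypothetical names (`CbMUPorts`, `update`, `request_activation`, `fire`); it models only the external bits described above and none of the learning.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CbMUPorts:
    """External bits of a CbMU, with no learning inside."""
    cst: int = 1    # 1 while the constraint is respected
    fail: int = 0   # 1 if the learning procedure has failed
    act: int = 0    # 1 while the CbMU is activated

    @property
    def cnx(self) -> int:
        """Connexion allowed only when FAIL = 0 and CST = 1."""
        return int(self.fail == 0 and self.cst == 1)

    def update(self, cst: int, fail: int) -> None:
        """Refresh CST and FAIL; the unit disconnects itself when CNX is 0."""
        self.cst, self.fail = cst, fail
        if not self.cnx:
            self.act = 0

    def request_activation(self, wanted: int) -> int:
        """An ACT modification is accepted only while CNX equals 1."""
        self.act = wanted if self.cnx else 0
        return self.act

    def fire(self, choice: int, p: int) -> List[int]:
        """Output vector (O_1..O_p): exactly one channel set, and only if ACT = 1."""
        outputs = [0] * p
        if self.act:
            outputs[choice] = 1
        return outputs
```

For instance, after `update(cst=0, fail=0)` the ACT bit falls back to 0 and `fire` returns a vector of zeros until a new activation request is accepted.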

Figure 5: Diagram showing the links between a CbMU and its environment, while the CNX bit remains equal to 1. (Loop: the CbMU activates an output channel, which modifies the environment, which returns the input vector I and the feedback signal CST to the CbMU.)

2.3 Internal structure of a CbMU.

The CbMU is internally composed of two main items (fig. 6), whose goal is to provide at each time a quality to the firing of each available output channel, given a particular input data:

- A set of perceptual areas $\{Z_1, \dots, Z_p\}$, each one linked to a particular output channel. Each $Z_k \subset \mathbb{R}^{n_k}$, with $n_k \le n$, is connected to some of the input channels of the CbMU and is divided a priori into a set of $b_k$ boxes $Box_k^j$, $j \in \{1, \dots, b_k\}$, created according to the following set of equations:

$$
\begin{cases}
\forall k \in \{1,\dots,p\}:\ \bigcup_{j \in \{1,\dots,b_k\}} Box_k^j = Z_k \\
\forall \{j,l\} \in \{1,\dots,b_k\}^2,\ j \neq l:\ Box_k^j \cap Box_k^l = \emptyset \\
Box_k^j = \{\, I = (I_1,\dots,I_{n_k}) \ /\ \forall l \in \{1,\dots,n_k\},\ m_l^j \le I_l < M_l^j \,\}
\end{cases}
$$

  Thus, each box $Box_k^j$ is parameterised by $n_k$ couples of values $(m_l^j, M_l^j)$ which are the boundary values for each perceptive input signal used by the perceptive area $Z_k$ associated with the channel k of the CbMU.

- A set of pre-connected bits, whose initial value is 1, divided into two categories:

  1. the perceptual state bits P, each of which may be associated to a set of p boxes $\{Box_1^{j_1}, \dots, Box_p^{j_p}\}$.
  2. the choice bits C, each of them pre-connected to a perceptual state bit.

  The pre-existing ending state E corresponds to a non-respect of the CST constraint (CST turns to 0).

The way the perceptual areas are divided is considered to be a priori knowledge: it is not modified during the learning stage of the CbMU.
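The box definition above amounts to a half-open interval test on each input component. The Python sketch below illustrates it under one assumption: the boxes of an area are generated as a regular grid from per-dimension edge lists, which is only one possible a priori division (the text does not prescribe how the areas are cut). The function names are hypothetical.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple

Box = List[Tuple[float, float]]  # one (m_l, M_l) couple per input component


def in_box(box: Box, signal: Sequence[float]) -> bool:
    """True when m_l <= I_l < M_l holds for every component l."""
    return all(m <= x < big_m for (m, big_m), x in zip(box, signal))


def find_box(boxes: Sequence[Box], signal: Sequence[float]) -> Optional[int]:
    """Index of the box containing the signal, or None if it lies outside the area."""
    for j, box in enumerate(boxes):
        if in_box(box, signal):
            return j
    return None


def grid_boxes(edges_per_dim: Sequence[Sequence[float]]) -> List[Box]:
    """Cut an area into boxes from sorted edge lists, one list per dimension.

    Half-open cells make the boxes pairwise disjoint, and their union is the
    whole area, as required by the first two equations.
    """
    cells = [list(zip(edges[:-1], edges[1:])) for edges in edges_per_dim]
    return [list(combo) for combo in product(*cells)]
```

Because each cell is closed on the left and open on the right, a signal that lies exactly on an internal boundary belongs to one box only.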

Short term memory and cycle detection. It recalls the last 5 action nodes the CbMU has reached. If the last element of the short term memory is equal to one of the four others, a cycle has just been performed and the CbMU switches to a new context. All the contexts are associated to a precise cycle and possess their own perceptive graph; the system switches from a context $K_i$ to a context $K_j$ by performing the cycle associated to $K_j$ in the perceptive graph of $K_i$.
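The cycle test described above fits in a short Python sketch. The class name `ShortTermMemory` is hypothetical, and the choice to report the cycle as the sequence running from the previous occurrence of the repeated node up to the node just before the repeat is an assumption; the text only states when a cycle is detected.

```python
from collections import deque
from typing import Hashable, Optional, Tuple


class ShortTermMemory:
    """Keeps the last `size` action nodes and reports the cycle closed by a push."""

    def __init__(self, size: int = 5):
        self.items = deque(maxlen=size)

    def push(self, node: Hashable) -> Optional[Tuple[Hashable, ...]]:
        """Store a node; return the cycle it closes (oldest first), or None."""
        self.items.append(node)            # the oldest element drops out if full
        earlier = list(self.items)[:-1]    # the (at most four) other elements
        if node not in earlier:
            return None
        # Start the cycle at the most recent earlier occurrence of the node.
        start = len(earlier) - 1 - earlier[::-1].index(node)
        return tuple(earlier[start:])
```

The returned tuple is hashable, so it could serve as the key under which the perceptive graph of the corresponding context is stored.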

2.4 Learning procedure of a CbMU.

Introduction. The proposed learning algorithm has some strong links with the reinforcement learning concept: it is a trial/failure method, it does not need prior knowledge of the process model, it copes with the temporal credit assignment problem and it is incremental.

However, it is not based on an optimisation method, but on the respect of binary perceptive constraints. Moreover, each CbMU may learn (that is to say "adapt itself to correct a detected inconsistency between the real facts and the predicted ones") whenever it is activated.

The objective of each pre-connected set of bits is to evaluate the impact of a choice among the $O_k$ on the evolution of the perception signal I received by the CbMU. The binary value of a P bit expresses the quality of the associated perceptive state, that is to say the capability of the CbMU to find a sequence of choices from this state in order to respect the constraint of the CbMU. The binary value of a C bit expresses the quality of a choice from a precise perceptive state.

The learning algorithm is based on two items:

- the on-line building of connexions between the sets of pre-connected bits (the so-called perceptive graph), making an internal representation of the dynamics of the system.
- a consistency law between two connected bits of the perceptive graph, derived from the AI minimax algorithm (Rich, 1983).

The perceptive graph. The objective is to evaluate the impact of a choice among the $O_k$ on the evolution of the perception signal I received by the CbMU. To do so, while the CST bit remains equal to 1, the CbMU possesses at each time a single active perceptual state P. When a failure is detected (CST turns to 0), the CbMU is in the special state E.

Thus, the CbMU $A_i$ has a finite number of states, including an ending state E. The dynamics of the system makes the agent move from one state P to another state P', according to the choice C made