Sensitivity Analysis in Particle Filters. Application to Policy Optimization in POMDPs


HAL Id: inria-00336203

https://hal.inria.fr/inria-00336203

Submitted on 3 Nov 2008

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.



Pierre Arnaud Coquelin, Romain Deguest, Rémi Munos

To cite this version:

Pierre Arnaud Coquelin, Romain Deguest, Rémi Munos. Sensitivity Analysis in Particle Filters. Application to Policy Optimization in POMDPs. [Research Report] RR-6710, INRIA. 2008. <inria-00336203>


ISSN 0249-6399, ISRN INRIA/RR--6710--FR+ENG

Theme COG

INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE


N° 6710

November 2008


Centre de recherche INRIA Futurs Parc Orsay Université, ZAC des Vignes, 4, rue Jacques Monod, 91893 Orsay Cedex (France)

Telephone: +33 1 72 92 59 00


Theme COG: Cognitive systems

Project-Team SEQUEL

Research report no. 6710, November 2008, 15 pages

Summary: We consider a Partially Observable Markov Decision Process (POMDP) with continuous state, observation and action spaces. Decisions are made using a policy that relies on a particle filter to build a belief function over the current state given past observations. We consider a gradient-type algorithm for optimizing the policy parameters. To this end we carry out a sensitivity analysis of the performance measure with respect to the policy parameters, concentrating on Finite Difference methods. We show that the naive approach suffers from variance explosion, because of the non-differentiability of the resampling step. We propose a variant that solves this problem, and establish the consistency of the resulting estimator.

Keywords: Partially observable Markov decision process, sensitivity analysis, particle filtering, parametric optimization

Pierre-Arnaud Coquelin: CMAP, Ecole Polytechnique, coquelin@cmapx.polytechnique.fr

Romain Deguest: CMAP, Ecole Polytechnique and Columbia University, deguest@cmapx.polytechnique.fr

Rémi Munos: INRIA Lille-Nord Europe, SequeL project, remi.munos@inria.fr


Abstract: Our setting is a Partially Observable Markov Decision Process with continuous state, observation and action spaces. Decisions are based on a Particle Filter for estimating the belief state given past observations. We consider a policy gradient approach for parameterized policy optimization. For that purpose, we investigate sensitivity analysis of the performance measure with respect to the parameters of the policy, focusing on Finite Difference (FD) techniques. We show that the naive FD is subject to variance explosion because of the non-smoothness of the resampling procedure. We propose a more sophisticated FD method which overcomes this problem and establish its consistency.

Key-words: Partially Observable Markov Decision Problems, sensitivity analysis, particle filtering, parametric optimization


1 Introduction

We consider a Partially Observable Markov Decision Problem (POMDP) (see e.g. (Lovejoy, 1991; Kaelbling et al., 1998)) defined by a state process $(X_t)_{t \geq 1} \in X$, an observation process $(Y_t)_{t \geq 1} \in Y$, a decision (or action) process $(A_t)_{t \geq 1} \in A$ which depends on a policy (mapping from all possible observation histories to actions), and a reward function $r : X \to \mathbb{R}$. Our goal is to find a policy $\pi$ that maximizes a performance measure $J(\pi)$, function of future rewards, for example in a finite horizon setting:

$$J(\pi) \stackrel{\text{def}}{=} \mathbb{E}\Big[\sum_{t=1}^{n} r(X_t)\Big]. \qquad (1)$$

Other performance measures (such as in infinite horizon with discounted rewards) could be handled as well. In this paper, we consider the case of continuous state, observation, and action spaces.

The state process is a Markov decision process taking its values in a (measurable) state space $X$, with initial probability measure $\mu \in M(X)$ (i.e. $X_1 \sim \mu$), and which can be simulated using a transition function $F$ and independent random numbers, i.e. for all $t \geq 1$,

$$X_{t+1} = F(X_t, A_t, U_t), \quad \text{with } U_t \overset{\text{i.i.d.}}{\sim} \nu, \qquad (2)$$

where $F : X \times A \times U \to X$ and $(U, \sigma(U), \nu)$ is a probability space. In many practical situations $U = [0,1]^p$ and $U_t$ is a $p$-uple of pseudo random numbers. For simplicity, we adopt the notations $F(x_0, a_0, u) \stackrel{\text{def}}{=} F_\mu(u)$, where $F_\mu$ is the first transition function (i.e. $X_1 = F_\mu(U_0)$ with $U_0 \sim \nu$).

The observation process $(Y_t)_{t \geq 1}$ lies in a (measurable) space $Y$ and is linked with the state process by the conditional probability measure $P(Y_t \in dy_t \mid X_t = x_t) = g(x_t, y_t)\,dy_t$, where $g : X \times Y \to [0,1]$ is the marginal density function of $Y_t$ given $X_t$. We assume that observations are conditionally independent given the state process. Here also, we assume that we can simulate an observation using a transition function $G$ and independent random numbers, i.e. $\forall t \geq 1$, $Y_t = G(X_t, V_t)$, where $V_t \overset{\text{i.i.d.}}{\sim} \nu$ (for the sake of simplicity we consider the same probability space $(U, \sigma(U), \nu)$). Now, the action process $(A_t)_{t \geq 1}$ depends on a policy $\pi$ which assigns to each possible observation history $Y_{1:t}$ (where we adopt the usual notation $1:t$ to denote the collection of integers $s$ such that $1 \leq s \leq t$), an action $A_t \in A$.
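As a minimal sketch of the simulation model above, the following code draws a trajectory from $X_1 = F_\mu(U_0)$, $Y_t = G(X_t, V_t)$ and $X_{t+1} = F(X_t, A_t, U_t)$ with uniform random numbers. The one-dimensional Gaussian model, the noise levels and the helper names (`uniform_open`, `simulate`) are assumptions chosen for illustration and do not come from the report.

```python
import random
from statistics import NormalDist

# Hypothetical 1-D model, chosen only for illustration (not from the report).
_STD_NORMAL = NormalDist()


def F_mu(u):
    """First transition X_1 = F_mu(U_0); here X_1 ~ N(0, 1)."""
    return _STD_NORMAL.inv_cdf(u)


def F(x, a, u):
    """State transition X_{t+1} = F(X_t, A_t, U_t) with u uniform on (0, 1)."""
    return x + a + 0.1 * _STD_NORMAL.inv_cdf(u)


def G(x, v):
    """Observation Y_t = G(X_t, V_t) with v uniform on (0, 1)."""
    return x + 0.5 * _STD_NORMAL.inv_cdf(v)


def uniform_open(rng):
    """Uniform number in the open interval (0, 1), as inv_cdf requires."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def simulate(policy, n, rng):
    """Return the states X_1..X_n and observations Y_1..Y_n under `policy`."""
    xs, ys = [], []
    x = F_mu(uniform_open(rng))
    for _ in range(n):
        y = G(x, uniform_open(rng))
        xs.append(x)
        ys.append(y)
        a = policy(ys)  # the policy maps the observation history to an action
        x = F(x, a, uniform_open(rng))
    return xs, ys
```

For example, `simulate(lambda ys: -0.5 * ys[-1], 5, random.Random(0))` runs five steps under a policy that reacts only to the latest observation. A sample of the criterion (1) is then the sum of `r(x)` over the returned states, for any reward function `r`.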

In this paper we will consider policies that depend on the belief state (also called filtering distribution) conditionally to past observations. The belief state, written $b_t$, belongs to $M(X)$ (the space of all probability measures on $X$) and is defined by $b_t(dx_t, Y_{1:t}) \stackrel{\text{def}}{=} P(X_t \in dx_t \mid Y_{1:t})$, and will be written $b_t(dx_t)$ or even $b_t$ for simplicity when there is no risk of confusion. Because of the Markov property of the state dynamics, the belief state $b_t(\cdot, Y_{1:t})$ is the most informative representation about the current state $X_t$ given the history of past observations $Y_{1:t}$. It represents sufficient statistics for designing an optimal policy in the class of observations-based policies.

The temporal and causal dependencies of the dynamics of a generic POMDP using belief-based policies are summarized in Figure 1 (left): at time $t$, the state $X_t$ is unknown, only $Y_t$ is observed, which makes it possible (at least in theory) to update $b_t$ based on the previous belief $b_{t-1}$. The policy $\pi$ takes as input the belief state $b_t$ and returns an action $A_t$ (the policy may be deterministic or stochastic). However, since the belief state is an infinite dimensional object, and thus cannot be represented in a computer, we first simplify the class of policies that we consider here to be defined over a finite dimensional space of belief-features $f : M(X) \to \mathbb{R}^K$ which represents relevant statistics of the filtering distribution. We write $b_t(f_k)$ for the value of the $k$-th feature (among $K$) (where we use the usual notation $b(f) \stackrel{\text{def}}{=} \int_X f(x)\,b(dx)$ for any function $f$ defined on $X$ and measure $b \in M(X)$), and denote $b_t(f)$ the vector (of size $K$) with components $b_t(f_k)$. Examples of features are: $f(x) = x$ (mean value), $f(x) = x x^\top$ (for the covariance matrix). Other more complex features (e.g. entropy measure) could be used as well. Such a policy $\pi : \mathbb{R}^K \to A$ selects an action $A_t = \pi(b_t(f))$, which in turn, yields a new state $X_{t+1}$.

Except for simple cases, such as in finite-state finite-observation processes (where a Viterbi algorithm could be applied (Rabiner, 1989)), and the case of linear dynamics and Gaussian noise (where a Kalman filter could be used), there is no closed-form representation of the belief state. Thus $b_t$ must be approximated in our general setting. A popular method for approximating the filtering distribution is known as Particle Filters (PF) (also called Interacting Particle Systems or Sequential Monte-Carlo). Such particle-based approaches have been used in many applications (see e.g. (Doucet et al., 2001) and (Del Moral, 2004) for a Feynman-Kac framework), for example for parameter estimation in Hidden Markov Models and control (Andrieu et al., 2004) and mobile robot localization (Fox et al., 2001). A PF approximates the belief state $b_t \in M(X)$ by a set of particles $(x_t^{1:N})$ (points of $X$), which are updated sequentially at each new observation by a transition-selection procedure. In particular, the belief feature $b_t(f)$ is approximated by $\frac{1}{N}\sum_{i=1}^{N} f(x_t^i)$, and the policy is thus a function that takes as input the activation of the feature $f$ at the position of the particles: $A_t = \pi\big(\frac{1}{N}\sum_{i=1}^{N} f(x_t^i)\big)$. For such methods, the general scheme for POMDPs using Particle Filter-based policies is described in Figure 1 (right).
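The particle approximation of the belief features can be sketched as follows for a one-dimensional state, with the two features $f(x) = (x, x^2)$ (mean and second moment). The linear form of the policy and the names `belief_features` and `linear_policy` are assumptions made for this example only.

```python
def belief_features(particles):
    """Approximate b_t(f) by (1/N) * sum_i f(x_i), with f(x) = (x, x^2)."""
    n = len(particles)
    mean = sum(particles) / n
    second_moment = sum(x * x for x in particles) / n
    return (mean, second_moment)


def linear_policy(theta, features):
    """Hypothetical parameterized policy: the action is the dot product theta . features."""
    return sum(th * f for th, f in zip(theta, features))
```

With the particles `[1.0, 3.0]`, the features are the mean `2.0` and the second moment `5.0`, and the policy with `theta = (1.0, 0.5)` returns the action `4.5`.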

In this paper, we consider a class of policies $\pi_\theta$ parameterized by a (multi-dimensional) parameter $\theta$ and we search for the value of $\theta$ that maximizes the resulting criterion $J(\pi_\theta)$, now written $J(\theta)$ for simplicity. We focus on a policy gradient approach: the POMDP is replaced by an optimization problem on the space of policy parameters, and a (stochastic) gradient ascent on $J(\theta)$ is considered. For that purpose (and this is the object of this work) we investigate the estimation of $\nabla J(\theta)$ (where the gradient refers to the derivative w.r.t. $\theta$), with an emphasis on Finite-Difference techniques. There are many works about such policy gradient approaches in the field of Reinforcement Learning, see e.g. (Baxter & Bartlett, 1999), but the policies considered are generally not based on the result of a PF. Here, we explicitly consider a class of policies that are based on a belief state constructed by a PF. Our motivations for investigating this case are based on two facts: (1) the belief state represents sufficient statistics for optimality, as mentioned above; (2) PFs are a very popular and efficient tool for constructing the belief state in continuous domains.
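To fix ideas, gradient ascent driven by a central finite-difference estimate can be sketched on a smooth, noise-free toy criterion. This is only an illustration of the generic scheme: it is not the estimator studied in this report, and the function names, step sizes and toy criterion are assumptions.

```python
def central_fd_gradient(J, theta, h):
    """Central finite-difference estimate of the gradient of J at theta."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += h
        down[k] -= h
        grad.append((J(up) - J(down)) / (2.0 * h))
    return grad


def gradient_ascent(J, theta0, step, h, iterations):
    """Plain gradient ascent on J using the finite-difference gradient."""
    theta = list(theta0)
    for _ in range(iterations):
        grad = central_fd_gradient(J, theta, h)
        theta = [th + step * g for th, g in zip(theta, grad)]
    return theta
```

On the toy criterion `J = lambda th: -(th[0] - 2.0) ** 2 - (th[1] + 1.0) ** 2`, calling `gradient_ascent(J, [0.0, 0.0], 0.1, 1e-3, 200)` approaches the maximizer `(2, -1)`. When `J` is replaced by a noisy simulation estimate, the division by `2h` amplifies the noise, which is where the choice of `h` becomes delicate.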

After recalling the general approach for evaluating the performance of a PF-based policy (Section 2), we describe (in Section 3.1) a naive Finite-Difference (FD) approach (defined by a step size $h$) for estimating $\nabla J(\theta)$. We discuss the bias and variance tradeoff and explain the problem of variance explosion when $h$ is small. This problem is a consequence of the discontinuity of the resampling operation w.r.t. the parameter $\theta$. Our contribution is detailed in Section 3.2: we propose a modified FD estimate for $\nabla J(\theta)$ which (along the random sample path) has bias $O(h^2)$ and variance $O(1/N)$, and thus overcomes the drawback of the previous naive method. An algorithm is described and illustrated in Section 4 on a simple problem where the optimal policy exhibits a tradeoff between greedy reward optimization and localization.

Fig. 1: Left figure: Causal and temporal dependencies in a POMDP. Right figure: PF-based scheme for POMDPs where the belief feature $b_t(f)$ is approximated by $\frac{1}{N}\sum_{i=1}^{N} f(x_t^i)$.

2 Particle Filters (PF)

We first describe a generic PF for estimating the belief state based on past observations. In Subsection 2.1 we detail how to control a real-world POMDP and in Subsection 2.2 how to estimate the performance of a given policy in simulation. In both cases, we assume that the models of the dynamics (state, observation) are known. The basic PF, called Bootstrap Filter, see (Doucet et al., 2001) for details, approximates the belief state $b_n$ by an empirical distribution $b_n^N \stackrel{\text{def}}{=} \sum_{i=1}^{N} w_n^i \delta_{x_n^i}$ (where $\delta$ denotes a Dirac distribution) made of $N$ particles $x_n^{1:N}$. It consists in iterating the two following steps: at time $t$, given observation $y_t$,

• Transition step (also called importance sampling or mutation): a successor particles population $\tilde{x}_t^{1:N}$ is generated according to the state dynamics from the previous population $x_{t-1}^{1:N}$. The (importance sampling) weights $w_t^{1:N} \stackrel{\text{def}}{=} \frac{g(\tilde{x}_t^{1:N}, y_t)}{\sum_{j=1}^{N} g(\tilde{x}_t^j, y_t)}$ are evaluated,

• Selection step: Resample (with replacement) $N$ particles $x_t^{1:N}$ from the set $\tilde{x}_t^{1:N}$ according to the weights $w_t^{1:N}$. We write $x_t^{1:N} \stackrel{\text{def}}{=} \tilde{x}_t^{k_t^{1:N}}$ where $k_t^{1:N}$ are the selection indices.
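The two steps above can be sketched as one function for a one-dimensional state. The toy model (`F_toy`, `g_toy`) is hypothetical and only serves to make the example runnable; the selection step uses multinomial sampling through `random.choices`.

```python
import math
import random


def F_toy(x, a, u):
    """Hypothetical transition: shift by the action plus uniform noise in [-0.5, 0.5)."""
    return x + a + (u - 0.5)


def g_toy(x, y):
    """Hypothetical observation likelihood with values in (0, 1] (unnormalized Gaussian)."""
    return math.exp(-0.5 * (y - x) ** 2)


def bootstrap_step(particles, action, y, F, g, rng):
    """One transition-selection iteration of a bootstrap particle filter."""
    # Transition step: propagate every particle through the state dynamics.
    proposed = [F(x, action, rng.random()) for x in particles]
    # Importance weights: normalized likelihoods of the new observation y.
    likelihoods = [g(x, y) for x in proposed]
    total = sum(likelihoods)  # assumed strictly positive in this sketch
    weights = [lik / total for lik in likelihoods]
    # Selection step: resample N indices with replacement according to the weights.
    n = len(proposed)
    indices = rng.choices(range(n), weights=weights, k=n)
    resampled = [proposed[k] for k in indices]
    return resampled, weights
```

A call such as `bootstrap_step([0.0] * 100, 1.0, 1.2, F_toy, g_toy, random.Random(1))` returns 100 resampled particles, all drawn from the proposed population, together with weights that sum to one.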

Resampling is used to avoid the problem of degeneracy of the algorithm, i.e. that most of the weights decrease to zero. It consists in selecting new particle positions so as to preserve a consistency property (i.e. $\sum_{i=1}^{N} w_t^i \phi(\tilde{x}_t^i) = \mathbb{E}\big[\frac{1}{N}\sum_{i=1}^{N} \phi(x_t^i)\big]$). The simplest version, introduced in (Gordon et al., 1993), chooses the selection indices $k_t^{1:N}$ by an independent sampling from the set $1:N$ according to a multinomial distribution with parameters $w_t^{1:N}$, i.e. $P(k_t^i = j) = w_t^j$, for all $1 \leq i \leq N$. The idea is to replicate the particles in proportion to their weights. Many variants have been proposed in the literature, among which the stratified resampling method (Kitagawa, 1996), which is optimal in terms of variance, see e.g. (Cappé et al., 2005).
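The two resampling schemes mentioned above can be sketched side by side. The stratified version below follows the common textbook form (one uniform draw in each stratum $[i/N, (i+1)/N)$), which is an assumption about the exact variant intended; the function names are illustrative.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def multinomial_indices(weights, rng: random.Random):
    """Independent draws with P(k_i = j) = w_j for every i."""
    n = len(weights)
    return rng.choices(range(n), weights=weights, k=n)


def stratified_indices(weights, rng: random.Random):
    """One draw per stratum [i/N, (i+1)/N), mapped through the cumulative weights."""
    n = len(weights)
    cumulative = list(accumulate(weights))
    total = cumulative[-1]
    indices = []
    for i in range(n):
        u = (i + rng.random()) / n * total
        # First index whose cumulative weight exceeds u (clamped against round-off).
        indices.append(min(bisect_right(cumulative, u), n - 1))
    return indices
```

With weights `[0.5, 0.5, 0.0, 0.0]`, the stratified scheme returns `[0, 0, 1, 1]` whatever the random draws, whereas the multinomial scheme returns a random mixture of the indices 0 and 1. This illustrates the lower variance of the stratified replication counts.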

Convergence issues of $b_n^N(f)$ to $b_n(f)$ (e.g. Law of Large Numbers or Central Limit Theorems) are discussed in (Del Moral, 2004) or (Douc & Moulines, 2008). For our purpose we note that under weak conditions on the feature $f$, we have the consistency property: $b^N(f) \to b(f)$, almost surely.
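The consistency property can be checked numerically on a single filtering step where the exact answer is known. The sketch below assumes a standard Gaussian prior on the state and a Gaussian observation with standard deviation `sigma`, for which the exact posterior mean is $y / (1 + \sigma^2)$; this model and the function names are illustrative assumptions.

```python
import math
import random


def weighted_feature(particles, weights, f):
    """Weighted particle approximation sum_i w_i f(x_i) of b(f)."""
    return sum(w * f(x) for x, w in zip(particles, weights))


def posterior_mean_estimate(y, sigma, n_particles, rng):
    """Estimate E[X | Y = y] for X ~ N(0, 1) and Y = X + N(0, sigma^2) noise."""
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    likelihoods = [math.exp(-0.5 * ((y - x) / sigma) ** 2) for x in particles]
    total = sum(likelihoods)
    weights = [lik / total for lik in likelihoods]
    return weighted_feature(particles, weights, lambda x: x)
```

For `y = 1.0` and `sigma = 1.0` the exact value is 0.5, and the estimate gets closer to it as `n_particles` grows.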

2.1 Control of a real system by a PF-based policy

We describe in Algorithm 1 how one may use a PF-based policy $\pi_\theta$ for the control of a real-world system. Note that from our definition of $F_\mu$, the particles are initialized with: $\tilde{x}_1^{1:N} \overset{\text{iid}}{\sim} \mu$.

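Algorithm 1 Control of a real-world POMDP

for t = 1 to n do
    Observe: $y_t$,
    Particle transition step:
        Set $\tilde{x}_t^{1:N} = F(x_{t-1}^{1:N}, a_{t-1}, u_{t-1}^{1:N})$ with $u_{t-1}^{1:N} \overset{\text{iid}}{\sim} \nu$.
        Set $w_t^{1:N} = \frac{g(\tilde{x}_t^{1:N}, y_t)}{\sum_{j=1}^{N} g(\tilde{x}_t^j, y_t)}$,
    Particle resampling step:
        Set $x_t^{1:N} = \tilde{x}_t^{k_t^{1:N}}$ where $k_t^{1:N}$ are given by the selection step according to the weights $w_t^{1:N}$.
    Select action: $a_t = \pi_\theta\big(\frac{1}{N}\sum_{i=1}^{N} f(x_t^i)\big)$,
end for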
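Algorithm 1 could be sketched in code as below for a scalar state and a scalar feature. The `ToySystem` class stands in for the real system and is purely hypothetical, as are `gaussian_likelihood` and the argument names; the resampling step uses multinomial selection.

```python
import math
import random


def gaussian_likelihood(sd):
    """Hypothetical likelihood g(x, y) with values in (0, 1]."""
    return lambda x, y: math.exp(-0.5 * ((y - x) / sd) ** 2)


class ToySystem:
    """Stand-in for the real system: hidden scalar state, noisy observations."""

    def __init__(self, rng: random.Random):
        self.rng = rng
        self.x = 1.0

    def observe(self):
        return self.x + self.rng.gauss(0.0, 0.3)

    def apply_action(self, a):
        self.x = self.x + a + 0.2 * (self.rng.random() - 0.5)


def pf_control_loop(observe, apply_action, F, F_mu, g, f, policy, n, N, rng):
    """Run n steps of PF-based control and return the selected actions."""
    particles, action, actions = [], None, []
    for t in range(1, n + 1):
        y = observe()  # Observe y_t
        # Particle transition step (the first population is drawn from mu).
        if t == 1:
            proposed = [F_mu(rng.random()) for _ in range(N)]
        else:
            proposed = [F(x, action, rng.random()) for x in particles]
        likelihoods = [g(x, y) for x in proposed]
        total = sum(likelihoods)  # assumed strictly positive in this sketch
        weights = [lik / total for lik in likelihoods]
        # Particle resampling step (multinomial selection).
        indices = rng.choices(range(N), weights=weights, k=N)
        particles = [proposed[k] for k in indices]
        # Select the action from the particle approximation of the feature.
        feature = sum(f(x) for x in particles) / N
        action = policy(feature)
        apply_action(action)
        actions.append(action)
    return actions
```

A run on the toy system passes `system.observe` and `system.apply_action` together with a model of the dynamics, for instance `F=lambda x, a, u: x + a + 0.2 * (u - 0.5)` and `F_mu=lambda u: 4.0 * u - 2.0`. The model given to the filter does not have to be the system itself, only a description of it that is assumed known.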

2.2 Estimation of J(θ) in simulation

Now, for the purpose of policy optimization, one should be capable of evaluating the performance of a policy in simulation. $J(\theta)$, defined by (1), may be estimated in simulation provided that the dynamics of the state and observation are known. Making explicit the dependency w.r.t. the random sample path, written $\omega$ (which accounts for the state and observation stochastic dynamics and the random numbers used in the PF-based policy), we write $J(\theta) = \mathbb{E}_\omega[J_\omega(\theta)]$, where $J_\omega(\theta) \stackrel{\text{def}}{=} \sum_{t=1}^{n} r(X_{t,\omega}(\theta))$, making the dependency of the state w.r.t. $\omega$ and $\theta$ explicit.

Algorithm 2 describes how to evaluate a PF-based policy in simulation. The function returns an estimate, written $J_\omega^N(\theta)$, of $J_\omega(\theta)$. Using previously mentioned asymptotic convergence results for PF, one has $\lim_{N \to \infty} J_\omega^N(\theta) = J_\omega(\theta)$, almost surely (a.s.). In order to approximate $J(\theta)$, one would perform several calls to the algorithm, receiving $J_{\omega_m}^N(\theta)$ (for $1 \leq m \leq M$), and calculate their empirical mean $\frac{1}{M}\sum_{m=1}^{M} J_{\omega_m}^N(\theta)$, which tends to $J(\theta)$ a.s., when $M, N \to \infty$.
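The outer averaging over $M$ independent sample paths can be sketched as follows. Here `evaluate_once` is a hypothetical callable standing for one simulated evaluation that returns $J_{\omega_m}^N(\theta)$, and `noisy_toy_evaluation` is a made-up stand-in used only to exercise the code.

```python
import random


def estimate_J(evaluate_once, theta, M, rng):
    """Empirical mean of M independent evaluations J^N_{omega_m}(theta)."""
    total = 0.0
    for _ in range(M):
        # A fresh generator per call plays the role of the sample path omega_m.
        path_rng = random.Random(rng.getrandbits(64))
        total += evaluate_once(theta, path_rng)
    return total / M


def noisy_toy_evaluation(theta, path_rng):
    """Hypothetical evaluation: true value -(theta - 1)^2 plus unit Gaussian noise."""
    return -(theta - 1.0) ** 2 + path_rng.gauss(0.0, 1.0)
```

With `theta = 3.0` the true value of the toy criterion is -4, and `estimate_J(noisy_toy_evaluation, 3.0, 2000, random.Random(0))` returns a value close to it, with a standard error of order $1/\sqrt{M}$.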
