Sensitivity Analysis in Particle Filters. Application to Policy Optimization in POMDPs


HAL Id: inria-00336203

https://hal.inria.fr/inria-00336203

Submitted on 3 Nov 2008



Pierre Arnaud Coquelin, Romain Deguest, Rémi Munos

To cite this version:

Pierre Arnaud Coquelin, Romain Deguest, Rémi Munos. Sensitivity Analysis in Particle Filters. Application to Policy Optimization in POMDPs. [Research Report] RR-6710, INRIA. 2008. inria-00336203

INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE

Research Report No. 6710, November 2008, Theme COG

ISSN 0249-6399, ISRN INRIA/RR--6710--FR+ENG

Centre de recherche INRIA Futurs Parc Orsay Université, ZAC des Vignes, 4, rue Jacques Monod, 91893 Orsay Cedex (France)

Téléphone : +33 1 72 92 59 00

Pierre-Arnaud Coquelin∗, Romain Deguest†, Rémi Munos‡

Theme COG: Cognitive Systems

Project-Team SEQUEL

Research Report No. 6710, November 2008, 15 pages


∗ CMAP, Ecole Polytechnique, coquelin@cmapx.polytechnique.fr

† CMAP, Ecole Polytechnique and Columbia University, deguest@cmapx.polytechnique.fr

‡ INRIA Lille - Nord Europe, SequeL project, remi.munos@inria.fr


Abstract: Our setting is a Partially Observable Markov Decision Process with continuous state, observation and action spaces. Decisions are based on a Particle Filter for estimating the belief state given past observations. We consider a policy gradient approach for parameterized policy optimization. For that purpose, we investigate sensitivity analysis of the performance measure with respect to the parameters of the policy, focusing on Finite Difference (FD) techniques. We show that the naive FD is subject to variance explosion because of the non-smoothness of the resampling procedure. We propose a more sophisticated FD method which overcomes this problem and establish its consistency.

Key-words: Partially Observable Markov Decision Problems, sensitivity analysis, particle filtering, parametric optimization


1 Introduction

We consider a Partially Observable Markov Decision Problem (POMDP) (see e.g. (Lovejoy, 1991; Kaelbling et al., 1998)) defined by a state process $(X_t)_{t \ge 1} \in X$, an observation process $(Y_t)_{t \ge 1} \in Y$, a decision (or action) process $(A_t)_{t \ge 1} \in A$ which depends on a policy (a mapping from all possible observation histories to actions), and a reward function $r : X \to \mathbb{R}$. Our goal is to find a policy $\pi$ that maximizes a performance measure $J(\pi)$, a function of future rewards, for example in a finite-horizon setting:

$$J(\pi) \stackrel{\text{def}}{=} \mathbb{E}\Big[\sum_{t=1}^{n} r(X_t)\Big]. \tag{1}$$

Other performance measures (such as the infinite-horizon setting with discounted rewards) could be handled as well. In this paper, we consider the case of continuous state, observation, and action spaces.

The state process is a Markov decision process taking its values in a (measurable) state space $X$, with initial probability measure $\mu \in \mathcal{M}(X)$ (i.e. $X_1 \sim \mu$), and which can be simulated using a transition function $F$ and independent random numbers, i.e. for all $t \ge 1$,

$$X_{t+1} = F(X_t, A_t, U_t), \quad \text{with } U_t \overset{\text{i.i.d.}}{\sim} \nu, \tag{2}$$

where $F : X \times A \times U \to X$ and $(U, \sigma(U), \nu)$ is a probability space. In many practical situations $U = [0,1]^p$ and $U_t$ is a $p$-tuple of pseudo-random numbers. For simplicity, we adopt the notation $F(x_0, a_0, u) \stackrel{\text{def}}{=} F_\mu(u)$, where $F_\mu$ is the first transition function (i.e. $X_1 = F_\mu(U_0)$ with $U_0 \sim \nu$).

The observation process $(Y_t)_{t \ge 1}$ lies in a (measurable) space $Y$ and is linked with the state process by the conditional probability measure $P(Y_t \in dy_t \mid X_t = x_t) = g(x_t, y_t)\,dy_t$, where $g : X \times Y \to [0,1]$ is the marginal density function of $Y_t$ given $X_t$. We assume that observations are conditionally independent given the state process. Here also, we assume that we can simulate an observation using a transition function $G$ and independent random numbers, i.e. $\forall t \ge 1$, $Y_t = G(X_t, V_t)$, where $V_t \overset{\text{i.i.d.}}{\sim} \nu$ (for the sake of simplicity we consider the same probability space $(U, \sigma(U), \nu)$). Now, the action process $(A_t)_{t \ge 1}$ depends on a policy $\pi$ which assigns to each possible observation history $Y_{1:t}$ (where we adopt the usual notation $1:t$ to denote the collection of integers $s$ such that $1 \le s \le t$) an action $A_t \in A$.
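The simulation model of equation (2) and of the observation function $G$ can be sketched in a few lines of Python. Everything concrete here is an assumption made for illustration: the one-dimensional forms of `F` and `G`, the choice $\nu$ = Uniform[0, 1], the fixed initial state `x1` (instead of a draw from $\mu$), and the history-based policy passed to `simulate`.

```python
import random

def F(x, a, u):
    # Hypothetical transition function: shift by the action plus bounded noise.
    return x + a + 0.1 * (u - 0.5)

def G(x, v):
    # Hypothetical observation function: the state plus bounded noise.
    return x + 0.2 * (v - 0.5)

def simulate(policy, x1, n, rng):
    """Roll out X_{t+1} = F(X_t, A_t, U_t) and Y_t = G(X_t, V_t) for t = 1..n."""
    x, states, observations, actions = x1, [], [], []
    for _ in range(n):
        observations.append(G(x, rng.random()))   # V_t ~ nu = Uniform[0, 1]
        a = policy(observations)                  # A_t depends on Y_{1:t} only
        states.append(x)
        actions.append(a)
        x = F(x, a, rng.random())                 # U_t ~ nu = Uniform[0, 1]
    return states, observations, actions

# Hypothetical policy: push against the latest observation.
states, observations, actions = simulate(
    lambda history: -0.5 * history[-1], x1=1.0, n=5, rng=random.Random(0)
)
```

Because all randomness enters through the explicit arguments `u` and `v`, rerunning with the same seed reproduces the same sample path, which is the property that finite-difference comparisons along a fixed random sample path rely on.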

In this paper we will consider policies that depend on the belief state (also called filtering distribution) conditionally to past observations. The belief state, written $b_t$, belongs to $\mathcal{M}(X)$ (the space of all probability measures on $X$) and is defined by $b_t(dx_t, Y_{1:t}) \stackrel{\text{def}}{=} P(X_t \in dx_t \mid Y_{1:t})$, and will be written $b_t(dx_t)$ or even $b_t$ for simplicity when there is no risk of confusion. Because of the Markov property of the state dynamics, the belief state $b_t(\cdot, Y_{1:t})$ is the most informative representation about the current state $X_t$ given the history of past observations $Y_{1:t}$. It represents sufficient statistics for designing an optimal policy in the class of observation-based policies.

The temporal and causal dependencies of the dynamics of a generic POMDP using belief-based policies are summarized in Figure 1 (left): at time $t$, the state $X_t$ is unknown and only $Y_t$ is observed, which makes it possible (at least in theory) to update $b_t$ based on the previous belief $b_{t-1}$. The policy $\pi$ takes as input the belief state $b_t$ and returns an action $A_t$ (the policy may be deterministic or stochastic). However, since the belief state is an infinite-dimensional object, and thus cannot be represented in a computer, we first simplify the class of policies that we consider here to be defined over a finite-dimensional space of belief features $f : \mathcal{M}(X) \to \mathbb{R}^K$ which represent relevant statistics of the filtering distribution. We write $b_t(f_k)$ for the value of the $k$-th feature (among $K$) (where we use the usual notation $b(f) \stackrel{\text{def}}{=} \int_X f(x)\,b(dx)$ for any function $f$ defined on $X$ and measure $b \in \mathcal{M}(X)$), and denote by $b_t(f)$ the vector (of size $K$) with components $b_t(f_k)$. Examples of features are: $f(x) = x$ (mean value), $f(x) = x'x$ (for the covariance matrix). Other more complex features (e.g. entropy measure) could be used as well. Such a policy $\pi : \mathbb{R}^K \to A$ selects an action $A_t = \pi(b_t(f))$, which in turn yields a new state $X_{t+1}$.

Except for simple cases, such as finite-state finite-observation processes (where a Viterbi algorithm could be applied (Rabiner, 1989)) and the case of linear dynamics and Gaussian noise (where a Kalman filter could be used), there is no closed-form representation of the belief state. Thus $b_t$ must be approximated in our general setting. A popular method for approximating the filtering distribution is known as Particle Filters (PF) (also called Interacting Particle Systems or Sequential Monte-Carlo). Such particle-based approaches have been used in many applications (see e.g. (Doucet et al., 2001) and (Del Moral, 2004) for a Feynman-Kac framework), for example for parameter estimation in Hidden Markov Models and control (Andrieu et al., 2004) and mobile robot localization (Fox et al., 2001). A PF approximates the belief state $b_t \in \mathcal{M}(X)$ by a set of particles $(x_t^{1:N})$ (points of $X$), which are updated sequentially at each new observation by a transition-selection procedure. In particular, the belief feature $b_t(f)$ is approximated by $\frac{1}{N}\sum_{i=1}^{N} f(x_t^i)$, and the policy is thus a function that takes as input the activation of the feature $f$ at the position of the particles: $A_t = \pi\big(\frac{1}{N}\sum_{i=1}^{N} f(x_t^i)\big)$. For such methods, the general scheme for POMDPs using Particle Filter-based policies is described in Figure 1 (right).
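As a small illustration of the particle approximation $\frac{1}{N}\sum_{i=1}^{N} f(x_t^i)$, the sketch below computes two features of a one-dimensional particle set, $f(x) = x$ and $f(x) = x^2$ (the scalar case of $x'x$), and feeds them to a policy. The linear form of `policy` and the parameter values are assumptions chosen for illustration only.

```python
def belief_features(particles):
    """Particle estimate of b_t(f) for f(x) = x and f(x) = x * x (1-D case)."""
    n = len(particles)
    mean = sum(particles) / n
    second_moment = sum(x * x for x in particles) / n
    return mean, second_moment

def policy(theta, features):
    # Hypothetical linear policy pi_theta : R^K -> A with K = 2 features.
    return sum(th * f for th, f in zip(theta, features))

particles = [0.0, 1.0, 2.0, 3.0]
feats = belief_features(particles)    # (1.5, 3.5)
action = policy((-1.0, 0.0), feats)   # -1.5
```

The policy never sees the particles themselves, only the $K$ averaged feature values, which is what keeps its input finite-dimensional.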

In this paper, we consider a class of policies $\pi_\theta$ parameterized by a (multi-dimensional) parameter $\theta$ and we search for the value of $\theta$ that maximizes the resulting criterion $J(\pi_\theta)$, now written $J(\theta)$ for simplicity. We focus on a policy gradient approach: the POMDP is replaced by an optimization problem on the space of policy parameters, and a (stochastic) gradient ascent on $J(\theta)$ is considered. For that purpose (and this is the object of this work) we investigate the estimation of $\nabla J(\theta)$ (where the gradient $\nabla$ refers to the derivative w.r.t. $\theta$), with an emphasis on Finite-Difference techniques. There are many works about such policy gradient approaches in the field of Reinforcement Learning, see e.g. (Baxter & Bartlett, 1999), but the policies considered are generally not based on the result of a PF. Here, we explicitly consider a class of policies that are based on a belief state constructed by a PF. Our motivations for investigating this case are based on two facts: (1) the belief state represents sufficient statistics for optimality, as mentioned above; (2) PFs are a very popular and efficient tool for constructing the belief state in continuous domains.

After recalling the general approach for evaluating the performance of a PF-based policy (Section 2), we describe (in Section 3.1) a naive Finite-Difference (FD) approach (defined by a step size $h$) for estimating $\nabla J(\theta)$. We discuss the bias and variance tradeoff and explain the problem of variance explosion when $h$ is small. This problem is a consequence of the discontinuity of the resampling operation w.r.t. the parameter $\theta$. Our contribution is detailed in Section 3.2: we propose a modified FD estimate for $\nabla J(\theta)$ which (along the random sample path) has bias $O(h^2)$ and variance $O(1/N)$, and thus overcomes the drawback of the previous naive method. An algorithm is described and illustrated in Section 4 on a simple problem where the optimal policy exhibits a tradeoff between greedy reward optimization and localization.
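To see where a bias and variance tradeoff in the step size $h$ can come from, consider a generic central finite difference for a scalar parameter. This is a textbook form used for orientation only, not necessarily the estimator defined in Section 3.1. The symbol $\hat J$ is a hypothetical unbiased noisy evaluation of $J$ with variance $\sigma^2$, the two evaluations are assumed independent, and $J$ is assumed sufficiently smooth.

```latex
\[
\widehat{\nabla J}_h(\theta) = \frac{\hat J(\theta+h) - \hat J(\theta-h)}{2h},
\qquad
\mathbb{E}\big[\widehat{\nabla J}_h(\theta)\big] - \nabla J(\theta)
  = \frac{h^2}{6}\, J'''(\theta) + O(h^4),
\qquad
\mathrm{Var}\big[\widehat{\nabla J}_h(\theta)\big] = \frac{\sigma^2}{2h^2}.
\]
```

Under these assumptions, decreasing $h$ shrinks the bias like $h^2$ while the variance grows like $1/h^2$, so neither can be driven to zero without the other degrading.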

[Figure 1: two diagrams over time steps $t-1$, $t$, $t+1$. Left panel nodes: State $X_t$, Observation $Y_t$, Belief $b_t$, Belief features $b_t(f)$, Policy $\pi_\theta$, Action $A_t$, Reward $r_t$. Right panel nodes: State $X_t$, Observation $Y_t$, Particles $x_t^{1:N}$, Features $f(x_t^{1:N})$, Policy $\pi_\theta$, Action $A_t$, Reward $r_t$.]

Fig. 1. Left figure: Causal and temporal dependencies in a POMDP. Right figure: PF-based scheme for POMDPs where the belief feature $b_t(f)$ is approximated by $\frac{1}{N}\sum_{i=1}^{N} f(x_t^i)$.

2 Particle Filters (PF)

We first describe a generic PF for estimating the belief state based on past observations. In Subsection 2.1 we detail how to control a real-world POMDP and in Subsection 2.2 how to estimate the performance of a given policy in simulation. In both cases, we assume that the models of the dynamics (state, observation) are known. The basic PF, called Bootstrap Filter (see (Doucet et al., 2001) for details), approximates the belief state $b_n$ by an empirical distribution $b_n^N \stackrel{\text{def}}{=} \sum_{i=1}^{N} w_n^i \delta_{x_n^i}$ (where $\delta$ denotes a Dirac distribution) made of $N$ particles $x_n^{1:N}$. It consists in iterating the two following steps: at time $t$, given observation $y_t$,

Transition step (also called importance sampling or mutation): a successor particle population $\tilde{x}_t^{1:N}$ is generated according to the state dynamics from the previous population $x_{t-1}^{1:N}$. The (importance sampling) weights $w_t^{1:N} \stackrel{\text{def}}{=} \dfrac{g(\tilde{x}_t^{1:N}, y_t)}{\sum_{j=1}^{N} g(\tilde{x}_t^{j}, y_t)}$ are evaluated.

Selection step: Resample (with replacement) $N$ particles $x_t^{1:N}$ from the set $\tilde{x}_t^{1:N}$ according to the weights $w_t^{1:N}$. We write $x_t^{1:N} \stackrel{\text{def}}{=} \tilde{x}_t^{k_t^{1:N}}$ where $k_t^{1:N}$ are the selection indices.
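The two steps above can be sketched as a single function. The transition function `F`, the Gaussian-shaped likelihood `g` (left unnormalized, since the constant cancels in the weights), and the choice $\nu$ = Uniform[0, 1] are assumptions for illustration. The selection step uses multinomial resampling through `random.Random.choices`.

```python
import math
import random

def F(x, a, u):
    # Hypothetical transition function driven by u ~ Uniform[0, 1].
    return x + a + (u - 0.5)

def g(x, y, sigma=0.5):
    # Hypothetical observation likelihood of y given state x, up to a constant.
    return math.exp(-0.5 * ((y - x) / sigma) ** 2)

def bootstrap_step(particles, a_prev, y, rng):
    """One transition + selection step; returns (resampled particles, weights)."""
    n = len(particles)
    # Transition step: propagate every particle through the state dynamics.
    proposed = [F(x, a_prev, rng.random()) for x in particles]
    # Importance weights: likelihood of y at each proposed particle, normalized.
    # (No guard here against all likelihoods underflowing to zero.)
    lik = [g(x, y) for x in proposed]
    total = sum(lik)
    weights = [l / total for l in lik]
    # Selection step: draw N indices k^{1:N} with P(k^i = j) = w^j.
    indices = rng.choices(range(n), weights=weights, k=n)
    return [proposed[k] for k in indices], weights

new_particles, weights = bootstrap_step([0.0] * 50, a_prev=0.0, y=0.2,
                                        rng=random.Random(1))
```

After resampling, the particles carry equal weight, so belief features are plain averages over `new_particles`.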

Resampling is used to avoid the problem of degeneracy of the algorithm, i.e. that most of the weights decrease to zero. It consists in selecting new particle positions so as to preserve a consistency property (i.e. $\sum_{i=1}^{N} w_t^i \phi(\tilde{x}_t^i) = \mathbb{E}\big[\frac{1}{N}\sum_{i=1}^{N} \phi(x_t^i)\big]$). The simplest version, introduced in (Gordon et al., 1993), chooses the selection indices $k_t^{1:N}$ by independent sampling from the set $1:N$ according to a multinomial distribution with parameters $w_t^{1:N}$, i.e. $P(k_t^i = j) = w_t^j$ for all $1 \le i \le N$. The idea is to replicate the particles in proportion to their weights. Many variants have been proposed in the literature, among which the stratified resampling method (Kitagawa, 1996), which is optimal in terms of variance, see e.g. (Cappé et al., 2005).
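The difference between the two resampling schemes mentioned above can be made concrete with a short sketch. The stratified version shown is one common formulation (one uniform draw per stratum $[i/N, (i+1)/N)$, inverted through the cumulative weights) and is an assumption here. The cited references should be consulted for the exact variant they analyze.

```python
import random

def multinomial_indices(weights, rng):
    """Independent draws with P(k^i = j) = w^j for every i."""
    n = len(weights)
    return rng.choices(range(n), weights=weights, k=n)

def stratified_indices(weights, rng):
    """One draw per stratum [i/N, (i+1)/N), mapped through the cumulative weights."""
    n = len(weights)
    indices, cumulative, j = [], weights[0], 0
    for i in range(n):
        u = (i + rng.random()) / n
        # Advance to the first index whose cumulative weight reaches u.
        while u > cumulative and j < n - 1:
            j += 1
            cumulative += weights[j]
        indices.append(j)
    return indices

rng = random.Random(0)
uniform_case = stratified_indices([0.25, 0.25, 0.25, 0.25], rng)
degenerate_case = multinomial_indices([0.0, 1.0, 0.0], rng)
```

With equal weights the stratified scheme keeps every particle exactly once, whereas multinomial resampling would typically duplicate some and drop others. This is the intuition behind its lower variance.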

Convergence issues of $b_n^N(f)$ to $b_n(f)$ (e.g. Law of Large Numbers or Central Limit Theorems) are discussed in (Del Moral, 2004) or (Douc & Moulines, 2008). For our purpose we note that under weak conditions on the feature $f$, we have the consistency property: $b^N(f) \to b(f)$, almost surely.

2.1 Control of a real system by a PF-based policy

We describe in Algorithm 1 how one may use a PF-based policy $\pi_\theta$ for the control of a real-world system. Note that from our definition of $F_\mu$, the particles are initialized with $\tilde{x}_1^{1:N} \overset{\text{iid}}{\sim} \mu$.

Algorithm 1 Control of a real-world POMDP

for $t = 1$ to $n$ do

    Observe: $y_t$,

    Particle transition step:

        Set $\tilde{x}_t^{1:N} = F(x_{t-1}^{1:N}, a_{t-1}, u_{t-1}^{1:N})$ with $u_{t-1}^{1:N} \overset{\text{iid}}{\sim} \nu$. Set $w_t^{1:N} = \dfrac{g(\tilde{x}_t^{1:N}, y_t)}{\sum_{j=1}^{N} g(\tilde{x}_t^{j}, y_t)}$,

    Particle resampling step:

        Set $x_t^{1:N} = \tilde{x}_t^{k_t^{1:N}}$ where $k_t^{1:N}$ are given by the selection step according to the weights $w_t^{1:N}$.

    Select action: $a_t = \pi_\theta\big(\frac{1}{N}\sum_{i=1}^{N} f(x_t^i)\big)$,

end for

2.2 Estimation of $J(\theta)$ in simulation

Now, for the purpose of policy optimization, one should be capable of evaluating the performance of a policy in simulation. $J(\theta)$, defined by (1), may be estimated in simulation provided that the dynamics of the state and observation are known. Making explicit the dependency w.r.t. the random sample path, written $\omega$ (which accounts for the state and observation stochastic dynamics and the random numbers used in the PF-based policy), we write $J(\theta) = \mathbb{E}_\omega[J_\omega(\theta)]$, where $J_\omega(\theta) \stackrel{\text{def}}{=} \sum_{t=1}^{n} r(X_{t,\omega}(\theta))$, making the dependency of the state w.r.t. $\omega$ and $\theta$ explicit.

Algorithm 2 describes how to evaluate a PF-based policy in simulation. The function returns an estimate, written $J_\omega^N(\theta)$, of $J_\omega(\theta)$. Using previously mentioned asymptotic convergence results for PF, one has $\lim_{N \to \infty} J_\omega^N(\theta) = J_\omega(\theta)$, almost surely (a.s.). In order to approximate $J(\theta)$, one would perform several calls to the algorithm, receiving $J_{\omega_m}^N(\theta)$ (for $1 \le m \le M$), and calculate their empirical mean $\frac{1}{M}\sum_{m=1}^{M} J_{\omega_m}^N(\theta)$, which tends to $J(\theta)$ a.s. when $M, N \to \infty$.
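The averaging over $M$ simulated sample paths can be sketched as follows. This is only a guess at the general shape of such an evaluation, not a transcription of Algorithm 2. The model functions `F`, `G`, `g`, the reward `r`, the initial law, the policy form, and the choice $\nu = N(0, 1)$ are all hypothetical.

```python
import math
import random

def F(x, a, u):
    return x + a + 0.1 * u                 # hypothetical dynamics, u ~ N(0, 1)

def G(x, v):
    return x + 0.5 * v                     # hypothetical observation, v ~ N(0, 1)

def g(x, y):
    return math.exp(-0.5 * ((y - x) / 0.5) ** 2)   # likelihood up to a constant

def r(x):
    return -x * x                          # hypothetical reward

def policy(theta, feature):
    return -theta * feature                # hypothetical pi_theta on the mean feature

def run_once(theta, n, N, rng):
    """One simulated sample path omega; returns sum_{t=1}^n r(X_t)."""
    x = rng.gauss(1.0, 0.5)                               # X_1 ~ mu (hypothetical)
    proposed = [rng.gauss(1.0, 0.5) for _ in range(N)]    # initial particles ~ mu
    total = 0.0
    for _ in range(n):
        total += r(x)
        y = G(x, rng.gauss(0.0, 1.0))                     # simulated observation
        lik = [g(p, y) for p in proposed]
        if sum(lik) == 0.0:                               # guard against underflow
            lik = [1.0] * N
        particles = rng.choices(proposed, weights=lik, k=N)   # multinomial selection
        a = policy(theta, sum(particles) / N)             # feature f(x) = x
        x = F(x, a, rng.gauss(0.0, 1.0))
        proposed = [F(p, a, rng.gauss(0.0, 1.0)) for p in particles]
    return total

def estimate_J(theta, M, n, N, rng):
    """Empirical mean of M independent simulated returns."""
    return sum(run_once(theta, n, N, rng) for _ in range(M)) / M

value = estimate_J(0.5, M=5, n=10, N=50, rng=random.Random(0))
```

Seeding `rng` fixes the sample paths $\omega_1, \dots, \omega_M$, so two calls with the same seed and the same `theta` return the same estimate.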
