HAL Id: inria-00336203
https://hal.inria.fr/inria-00336203
Submitted on 3 Nov 2008
Sensitivity Analysis in Particle Filters. Application to Policy Optimization in POMDPs
Pierre Arnaud Coquelin, Romain Deguest, Rémi Munos
To cite this version:
Pierre Arnaud Coquelin, Romain Deguest, Rémi Munos. Sensitivity Analysis in Particle Filters.
Application to Policy Optimization in POMDPs. [Research Report] RR-6710, INRIA. 2008. <inria-00336203>
Research report
ISSN 0249-6399, ISRN INRIA/RR--6710--FR+ENG
Theme COG
INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE
N° 6710
November 2008
Centre de recherche INRIA Futurs Parc Orsay Université, ZAC des Vignes, 4, rue Jacques Monod, 91893 Orsay Cedex (France)
Telephone: +33 1 72 92 59 00
Sensitivity Analysis in Particle Filters. Application to Policy Optimization in POMDPs
Pierre-Arnaud Coquelin∗, Romain Deguest†, Rémi Munos‡
Theme COG: Cognitive systems
Project-Team SEQUEL
Research Report no. 6710, November 2008, 15 pages
∗ CMAP, Ecole Polytechnique, coquelin@cmapx.polytechnique.fr
† CMAP, Ecole Polytechnique and Columbia University, deguest@cmapx.polytechnique.fr
‡ INRIA Lille - Nord Europe, SequeL project, remi.munos@inria.fr
Abstract: Our setting is a Partially Observable Markov Decision Process with continuous state, observation and action spaces. Decisions are based on a Particle Filter for estimating the belief state given past observations. We consider a policy gradient approach for parameterized policy optimization. For that purpose, we investigate sensitivity analysis of the performance measure with respect to the parameters of the policy, focusing on Finite Difference (FD) techniques. We show that the naive FD is subject to variance explosion because of the non-smoothness of the resampling procedure. We propose a more sophisticated FD method which overcomes this problem and establish its consistency.
Key-words: Partially Observable Markov Decision Problems, sensitivity analysis, particle filtering, parametric optimization
1 Introduction
We consider a Partially Observable Markov Decision Problem (POMDP) (see e.g. (Lovejoy, 1991; Kaelbling et al., 1998)) defined by a state process $(X_t)_{t \ge 1} \in X$, an observation process $(Y_t)_{t \ge 1} \in Y$, a decision (or action) process $(A_t)_{t \ge 1} \in A$ which depends on a policy (a mapping from all possible observation histories to actions), and a reward function $r : X \to \mathbb{R}$. Our goal is to find a policy $\pi$ that maximizes a performance measure $J(\pi)$, a function of future rewards, for example in a finite horizon setting:
\[ J(\pi) \stackrel{\text{def}}{=} \mathbb{E}\Big[ \sum_{t=1}^{n} r(X_t) \Big]. \tag{1} \]
Other performance measures (such as in infinite horizon with discounted rewards) could be handled as well. In this paper, we consider the case of continuous state, observation, and action spaces.
The state process is a Markov decision process taking its values in a (measurable) state space $X$, with initial probability measure $\mu \in M(X)$ (i.e. $X_1 \sim \mu$), and which can be simulated using a transition function $F$ and independent random numbers, i.e. for all $t \ge 1$,
\[ X_{t+1} = F(X_t, A_t, U_t), \quad \text{with } U_t \overset{\text{i.i.d.}}{\sim} \nu, \tag{2} \]
where $F : X \times A \times U \to X$ and $(U, \sigma(U), \nu)$ is a probability space. In many practical situations $U = [0,1]^p$ and $U_t$ is a $p$-tuple of pseudo-random numbers. For simplicity, we adopt the notation $F(x_0, a_0, u) \stackrel{\text{def}}{=} F_\mu(u)$, where $F_\mu$ is the first transition function (i.e. $X_1 = F_\mu(U_0)$ with $U_0 \sim \nu$).
The observation process $(Y_t)_{t \ge 1}$ lies in a (measurable) space $Y$ and is linked with the state process by the conditional probability measure $P(Y_t \in dy_t \mid X_t = x_t) = g(x_t, y_t)\, dy_t$, where $g : X \times Y \to [0,1]$ is the marginal density function of $Y_t$ given $X_t$. We assume that observations are conditionally independent given the state process. Here also, we assume that we can simulate an observation using a transition function $G$ and independent random numbers, i.e. $\forall t \ge 1$, $Y_t = G(X_t, V_t)$, where $V_t \overset{\text{i.i.d.}}{\sim} \nu$ (for the sake of simplicity we consider the same probability space $(U, \sigma(U), \nu)$). Now, the action process $(A_t)_{t \ge 1}$ depends on a policy $\pi$ which assigns to each possible observation history $Y_{1:t}$ (where we adopt the usual notation $1:t$ to denote the collection of integers $s$ such that $1 \le s \le t$) an action $A_t \in A$.
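As an illustration of equation (2) and of the observation model, the following minimal Python sketch simulates a state and observation trajectory from explicit uniform random numbers, taking $U = [0,1]$. The scalar linear-Gaussian model, the bodies of `F`, `G` and `F_mu`, the helper `_gauss_from_uniform`, and all constants are assumptions chosen for illustration only; they are not taken from the report.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # used to map uniform numbers to Gaussian noise


def _gauss_from_uniform(u):
    """Map u in [0, 1] to a standard normal draw by inverse-CDF sampling."""
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep inv_cdf away from 0 and 1
    return _STD_NORMAL.inv_cdf(u)


def F_mu(u):
    """Hypothetical first transition: X_1 = F_mu(U_0), here X_1 ~ N(0, 1)."""
    return _gauss_from_uniform(u)


def F(x, a, u):
    """Hypothetical transition function: X_{t+1} = F(X_t, A_t, U_t)."""
    return x + a + 0.1 * _gauss_from_uniform(u)


def G(x, v):
    """Hypothetical observation function: Y_t = G(X_t, V_t)."""
    return x + 0.5 * _gauss_from_uniform(v)


def simulate(actions, rng):
    """Simulate (X_t, Y_t) for t = 1..n under a fixed action sequence."""
    x = F_mu(rng.random())
    states, observations = [], []
    for a in actions:
        states.append(x)
        observations.append(G(x, rng.random()))
        x = F(x, a, rng.random())
    return states, observations


states, observations = simulate([0.1] * 5, random.Random(0))
```

Writing the noise as an explicit argument, rather than drawing it inside `F` and `G`, keeps the whole trajectory a deterministic function of the random numbers, which is the viewpoint adopted in the text.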
In this paper we will consider policies that depend on the belief state (also called filtering distribution) conditionally to past observations. The belief state, written $b_t$, belongs to $M(X)$ (the space of all probability measures on $X$) and is defined by $b_t(dx_t, Y_{1:t}) \stackrel{\text{def}}{=} P(X_t \in dx_t \mid Y_{1:t})$, and will be written $b_t(dx_t)$ or even $b_t$ for simplicity when there is no risk of confusion. Because of the Markov property of the state dynamics, the belief state $b_t(\cdot, Y_{1:t})$ is the most informative representation about the current state $X_t$ given the history of past observations $Y_{1:t}$. It represents sufficient statistics for designing an optimal policy in the class of observations-based policies.
The temporal and causal dependencies of the dynamics of a generic POMDP using belief-based policies are summarized in Figure 1 (left): at time $t$, the state $X_t$ is unknown and only $Y_t$ is observed, which makes it possible (at least in theory) to update $b_t$ based on the previous belief $b_{t-1}$. The policy $\pi$ takes as input the belief state $b_t$ and returns an action $A_t$ (the policy may be deterministic or stochastic). However, since the belief state is an infinite-dimensional object, and thus cannot be represented in a computer, we first simplify the class of policies that we consider here to be defined over a finite-dimensional space of belief features $f : M(X) \to \mathbb{R}^K$ which represent relevant statistics of the filtering distribution. We write $b_t(f_k)$ for the value of the $k$-th feature (among $K$) (where we use the usual notation $b(f) \stackrel{\text{def}}{=} \int_X f(x)\, b(dx)$ for any function $f$ defined on $X$ and measure $b \in M(X)$), and denote by $b_t(f)$ the vector (of size $K$) with components $b_t(f_k)$. Examples of features are: $f(x) = x$ (mean value), $f(x) = x'x$ (for the covariance matrix). Other more complex features (e.g. entropy measure) could be used as well. Such a policy $\pi : \mathbb{R}^K \to A$ selects an action $A_t = \pi(b_t(f))$, which in turn yields a new state $X_{t+1}$.
Except for simple cases, such as finite-state finite-observation processes (where a Viterbi algorithm could be applied (Rabiner, 1989)) and the case of linear dynamics and Gaussian noise (where a Kalman filter could be used), there is no closed-form representation of the belief state. Thus $b_t$ must be approximated in our general setting. A popular method for approximating the filtering distribution is known as Particle Filters (PF) (also called Interacting Particle Systems or Sequential Monte-Carlo). Such particle-based approaches have been used in many applications (see e.g. (Doucet et al., 2001) and (Del Moral, 2004) for a Feynman-Kac framework), for example for parameter estimation in Hidden Markov Models and control (Andrieu et al., 2004) and mobile robot localization (Fox et al., 2001). A PF approximates the belief state $b_t \in M(X)$ by a set of particles $(x_t^{1:N})$ (points of $X$), which are updated sequentially at each new observation by a transition-selection procedure. In particular, the belief feature $b_t(f)$ is approximated by $\frac{1}{N} \sum_{i=1}^{N} f(x_t^i)$, and the policy is thus a function that takes as input the activation of the feature $f$ at the positions of the particles: $A_t = \pi\big(\frac{1}{N} \sum_{i=1}^{N} f(x_t^i)\big)$. For such methods, the general scheme for POMDPs using Particle Filter-based policies is described in Figure 1 (right).
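The particle approximation of the belief features can be sketched in a few lines of Python. This is a minimal illustration for a scalar state with the two features $f(x) = (x, x^2)$; the linear form of `policy` and the values of `theta` are hypothetical and only serve to show how an action is computed from the averaged features.

```python
def belief_features(particles):
    """Approximate b_t(f) by (1/N) * sum_i f(x_t^i) for f(x) = (x, x^2)."""
    n = len(particles)
    mean = sum(particles) / n
    second_moment = sum(x * x for x in particles) / n
    return [mean, second_moment]


def policy(features, theta):
    """Hypothetical linear policy pi_theta acting on the feature vector."""
    return sum(th * f for th, f in zip(theta, features))


particles = [0.0, 1.0, 2.0, 3.0]
features = belief_features(particles)          # [1.5, 3.5]
action = policy(features, theta=[-1.0, 0.0])   # -1.5
```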
In this paper, we consider a class of policies $\pi_\theta$ parameterized by a (multi-dimensional) parameter $\theta$ and we search for the value of $\theta$ that maximizes the resulting criterion $J(\pi_\theta)$, now written $J(\theta)$ for simplicity. We focus on a policy gradient approach: the POMDP is replaced by an optimization problem on the space of policy parameters, and a (stochastic) gradient ascent on $J(\theta)$ is considered. For that purpose (and this is the object of this work) we investigate the estimation of $\nabla J(\theta)$ (where the gradient $\nabla$ refers to the derivative w.r.t. $\theta$), with an emphasis on Finite-Difference techniques. There are many works about such policy gradient approaches in the field of Reinforcement Learning, see e.g. (Baxter & Bartlett, 1999), but the policies considered are generally not based on the result of a PF. Here, we explicitly consider a class of policies that are based on a belief state constructed by a PF. Our motivations for investigating this case are based on two facts: (1) the belief state represents sufficient statistics for optimality, as mentioned above; (2) PFs are a very popular and efficient tool for constructing the belief state in continuous domains.
After recalling the general approach for evaluating the performance of a PF-based policy (Section 2), we describe (in Section 3.1) a naive Finite-Difference (FD) approach (defined by a step size $h$) for estimating $\nabla J(\theta)$. We discuss the bias and variance tradeoff and explain the problem of variance explosion when $h$ is small. This problem is a consequence of the discontinuity of the resampling operation w.r.t. the parameter $\theta$. Our contribution is detailed in Section 3.2: we propose a modified FD estimate for $\nabla J(\theta)$ which (along the random sample path) has bias $O(h^2)$ and variance $O(1/N)$, and thus overcomes the drawback of the previous naive method. An algorithm is described and illustrated in Section 4 on a simple problem where the optimal policy exhibits a tradeoff between greedy reward optimization and localization.
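To make the bias and variance tradeoff concrete, the worked equations below consider a generic central finite difference for a scalar parameter. This is a textbook estimator given for orientation only, and not necessarily the exact estimator analyzed in Section 3. The smoothness of $J$ and the noise model (two independent estimates with common variance $\sigma^2$) are assumptions made for the illustration.

```latex
% Bias: assuming J is three times continuously differentiable,
% a Taylor expansion around \theta gives
\[
  \frac{J(\theta + h) - J(\theta - h)}{2h} = J'(\theta) + O(h^2).
\]
% Variance: if J(\theta + h) and J(\theta - h) are replaced by two
% independent noisy estimates, each with variance \sigma^2, then
\[
  \operatorname{Var}\!\left[
    \frac{\hat{J}(\theta + h) - \hat{J}(\theta - h)}{2h}
  \right]
  = \frac{\sigma^2 + \sigma^2}{4h^2}
  = \frac{\sigma^2}{2h^2}.
\]
```

Under these assumptions the bias vanishes as $h \to 0$ while the variance grows like $1/h^2$, unless the two estimates are strongly positively correlated so that their noise largely cancels in the difference.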
[Figure 1: two diagrams over time steps $t-1$, $t$, $t+1$. Left panel nodes: state $X$, observation $Y$, belief $b$, belief features $b(f)$, policy $\pi_\theta$, action $A$, reward $r$. Right panel nodes: state $X$, observation $Y$, particles $x^{1:N}$, features $f(x^{1:N})$, policy $\pi_\theta$, action $A$, reward $r$.]
Fig. 1. Left figure: Causal and temporal dependencies in a POMDP. Right figure: PF-based scheme for POMDPs where the belief feature $b_t(f)$ is approximated by $\frac{1}{N} \sum_{i=1}^{N} f(x_t^i)$.
2 Particle Filters (PF)
We first describe a generic PF for estimating the belief state based on past observations. In Subsection 2.1 we detail how to control a real-world POMDP and in Subsection 2.2 how to estimate the performance of a given policy in simulation. In both cases, we assume that the models of the dynamics (state, observation) are known. The basic PF, called Bootstrap Filter, see (Doucet et al., 2001) for details, approximates the belief state $b_n$ by an empirical distribution $b_n^N \stackrel{\text{def}}{=} \sum_{i=1}^{N} w_n^i \delta_{x_n^i}$ (where $\delta$ denotes a Dirac distribution) made of $N$ particles $x_n^{1:N}$. It consists in iterating the two following steps: at time $t$, given observation $y_t$,
• Transition step (also called importance sampling or mutation): a successor particle population $\tilde{x}_t^{1:N}$ is generated according to the state dynamics from the previous population $x_{t-1}^{1:N}$. The (importance sampling) weights $w_t^{1:N} \stackrel{\text{def}}{=} \frac{g(\tilde{x}_t^{1:N}, y_t)}{\sum_{j=1}^{N} g(\tilde{x}_t^j, y_t)}$ are evaluated.
• Selection step: resample (with replacement) $N$ particles $x_t^{1:N}$ from the set $\tilde{x}_t^{1:N}$ according to the weights $w_t^{1:N}$. We write $x_t^{1:N} \stackrel{\text{def}}{=} \tilde{x}_t^{k_t^{1:N}}$ where $k_t^{1:N}$ are the selection indices.
Resampling is used to avoid the problem of degeneracy of the algorithm, i.e. that most of the weights decrease to zero. It consists in selecting new particle positions so as to preserve a consistency property (i.e. $\sum_{i=1}^{N} w_t^i \phi(\tilde{x}_t^i) = \mathbb{E}\big[\frac{1}{N} \sum_{i=1}^{N} \phi(x_t^i)\big]$). The simplest version, introduced in (Gordon et al., 1993), chooses the selection indices $k_t^{1:N}$ by independent sampling from the set $1:N$ according to a multinomial distribution with parameters $w_t^{1:N}$, i.e. $P(k_t^i = j) = w_t^j$, for all $1 \le i \le N$. The idea is to replicate the particles in proportion to their weights. Many variants have been proposed in the literature, among which the stratified resampling method (Kitagawa, 1996), which is optimal in terms of variance, see e.g. (Cappé et al., 2005).
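The two selection schemes mentioned above can be sketched as follows. This is a minimal Python illustration using only the standard library; the function names and the example weights are hypothetical, and the stratified variant shown is the common "one uniform draw per stratum of the cumulative weights" construction, given here as an assumption about the intended scheme rather than as the exact procedure of the cited references.

```python
import random


def multinomial_resample(weights, rng):
    """Draw N indices independently with P(k_i = j) = weights[j]."""
    n = len(weights)
    return rng.choices(range(n), weights=weights, k=n)


def stratified_resample(weights, rng):
    """Draw one index per stratum [i/N, (i+1)/N) of the cumulative weights."""
    n = len(weights)
    indices = []
    j, cumulative = 0, weights[0]
    for i in range(n):
        u = (i + rng.random()) / n  # one uniform draw inside stratum i
        # Advance to the first particle whose cumulative weight reaches u.
        while u > cumulative and j < n - 1:
            j += 1
            cumulative += weights[j]
        indices.append(j)
    return indices


rng = random.Random(0)
weights = [0.1, 0.2, 0.3, 0.4]
k_multinomial = multinomial_resample(weights, rng)
k_stratified = stratified_resample(weights, rng)
```

Both functions return selection indices (0-based here), so the resampled population is obtained by indexing the proposed particles with them.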
Convergence issues of $b_n^N(f)$ to $b_n(f)$ (e.g. Law of Large Numbers or Central Limit Theorems) are discussed in (Del Moral, 2004) or (Douc & Moulines, 2008). For our purpose we note that under weak conditions on the feature $f$, we have the consistency property: $b^N(f) \to b(f)$, almost surely.
2.1 Control of a real system by a PF-based policy
We describe in Algorithm 1 how one may use a PF-based policy $\pi_\theta$ for the control of a real-world system. Note that from our definition of $F_\mu$, the particles are initialized with $\tilde{x}_1^{1:N} \overset{\text{iid}}{\sim} \mu$.
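Algorithm 1 Control of a real-world POMDP
for $t = 1$ to $n$ do
    Observe: $y_t$,
    Particle transition step:
        Set $\tilde{x}_t^{1:N} = F(x_{t-1}^{1:N}, a_{t-1}, u_{t-1}^{1:N})$ with $u_{t-1}^{1:N} \overset{\text{iid}}{\sim} \nu$.
        Set $w_t^{1:N} = \frac{g(\tilde{x}_t^{1:N}, y_t)}{\sum_{j=1}^{N} g(\tilde{x}_t^j, y_t)}$,
    Particle resampling step:
        Set $x_t^{1:N} = \tilde{x}_t^{k_t^{1:N}}$ where $k_t^{1:N}$ are given by the selection step according to the weights $w_t^{1:N}$.
    Select action: $a_t = \pi_\theta\big(\frac{1}{N} \sum_{i=1}^{N} f(x_t^i)\big)$,
end for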
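Algorithm 1 can be sketched in Python as below. This is a minimal illustration under stated assumptions: a scalar state, Gaussian noise in place of the generic noise distribution $\nu$, an unnormalized Gaussian kernel for `g`, the single feature $f(x) = x$, multinomial selection, and a hypothetical policy `pi_theta`. The `ToySystem` class is only a stand-in for the real system so that the sketch runs; the callbacks `get_observation` and `apply_action` are hypothetical names for the interface to that system.

```python
import math
import random


def F(x, a, u):
    """Hypothetical state transition, with u a standard normal draw."""
    return x + a + 0.1 * u


def g(x, y):
    """Hypothetical observation density (unnormalized Gaussian, std 0.5)."""
    return math.exp(-0.5 * ((y - x) / 0.5) ** 2)


def f(x):
    """Belief feature f(x) = x, so the averaged feature is the particle mean."""
    return x


def pi_theta(feature, theta):
    """Hypothetical policy: steer the estimated state toward theta."""
    return 0.5 * (theta - feature)


def pf_control(get_observation, apply_action, n, N, theta, rng):
    """Run n steps of Algorithm 1 with a bootstrap filter of N particles."""
    particles, actions = [], []
    for t in range(1, n + 1):
        y = get_observation()  # Observe y_t
        # Particle transition step (draw from mu = N(0, 1) at t = 1).
        if t == 1:
            proposed = [rng.gauss(0.0, 1.0) for _ in range(N)]
        else:
            proposed = [F(x, actions[-1], rng.gauss(0.0, 1.0)) for x in particles]
        weights = [g(x, y) for x in proposed]
        total = sum(weights)
        if total > 0.0:
            weights = [w / total for w in weights]
        else:
            weights = [1.0 / N] * N  # assumption: uniform fallback on underflow
        # Particle resampling step (multinomial selection).
        particles = rng.choices(proposed, weights=weights, k=N)
        # Select action from the averaged feature.
        a = pi_theta(sum(f(x) for x in particles) / N, theta)
        apply_action(a)
        actions.append(a)
    return actions


class ToySystem:
    """Stand-in for the real system, used only to make the sketch runnable."""

    def __init__(self, rng):
        self.rng = rng
        self.x = rng.gauss(0.0, 1.0)

    def get_observation(self):
        return self.x + 0.5 * self.rng.gauss(0.0, 1.0)

    def apply_action(self, a):
        self.x = F(self.x, a, self.rng.gauss(0.0, 1.0))


system = ToySystem(random.Random(1))
actions = pf_control(system.get_observation, system.apply_action,
                     n=20, N=200, theta=2.0, rng=random.Random(2))
```

The controller never reads `system.x`: it only sees observations, which mirrors the real-world setting where the state is hidden.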
2.2 Estimation of J(θ) in simulation
Now, for the purpose of policy optimization, one should be capable of evaluating the performance of a policy in simulation. $J(\theta)$, defined by (1), may be estimated in simulation provided that the dynamics of the state and observation are known. Making explicit the dependency w.r.t. the random sample path, written $\omega$ (which accounts for the state and observation stochastic dynamics and the random numbers used in the PF-based policy), we write $J(\theta) = \mathbb{E}_\omega[J_\omega(\theta)]$, where $J_\omega(\theta) \stackrel{\text{def}}{=} \sum_{t=1}^{n} r(X_{t,\omega}(\theta))$, making the dependency of the state w.r.t. $\omega$ and $\theta$ explicit.
Algorithm 2 describes how to evaluate a PF-based policy in simulation. The function returns an estimate, written $J_\omega^N(\theta)$, of $J_\omega(\theta)$. Using the previously mentioned asymptotic convergence results for PF, one has $\lim_{N \to \infty} J_\omega^N(\theta) = J_\omega(\theta)$, almost surely (a.s.). In order to approximate $J(\theta)$, one would perform several calls to the algorithm, receiving $J_{\omega_m}^N(\theta)$ (for $1 \le m \le M$), and calculate their empirical mean $\frac{1}{M} \sum_{m=1}^{M} J_{\omega_m}^N(\theta)$, which tends to $J(\theta)$ a.s. when $M, N \to \infty$.
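The empirical-mean estimate of $J(\theta)$ can be sketched as follows. This is a minimal Python illustration and not a transcription of Algorithm 2: the scalar model, the reward $r(x) = -(x - 1)^2$, the Gaussian noise, the multinomial selection, and the policy $a = \theta\,(1 - \text{feature})$ are all assumptions chosen so that the example runs on its own.

```python
import math
import random


def simulate_return(theta, n, N, rng):
    """One simulated sample path: returns an estimate of J_omega(theta)."""
    x = rng.gauss(0.0, 1.0)                              # X_1 ~ mu
    particles = [rng.gauss(0.0, 1.0) for _ in range(N)]  # initial particles ~ mu
    total_reward = 0.0
    for _ in range(n):
        total_reward += -(x - 1.0) ** 2                  # hypothetical r(X_t)
        y = x + 0.5 * rng.gauss(0.0, 1.0)                # Y_t = G(X_t, V_t)
        # Importance weights from an unnormalized Gaussian observation kernel.
        weights = [math.exp(-0.5 * ((y - p) / 0.5) ** 2) for p in particles]
        if sum(weights) == 0.0:
            weights = [1.0] * N                          # assumption: uniform fallback
        particles = rng.choices(particles, weights=weights, k=N)  # selection step
        feature = sum(particles) / N                     # (1/N) sum_i f(x_t^i), f(x) = x
        a = theta * (1.0 - feature)                      # hypothetical pi_theta
        x = x + a + 0.1 * rng.gauss(0.0, 1.0)            # X_{t+1} = F(X_t, A_t, U_t)
        particles = [p + a + 0.1 * rng.gauss(0.0, 1.0) for p in particles]
    return total_reward


def estimate_J(theta, n=10, N=100, M=50, seed=0):
    """Empirical mean of M simulated returns, each using N particles."""
    rng = random.Random(seed)
    returns = [simulate_return(theta, n, N, rng) for _ in range(M)]
    return sum(returns) / M


J_half = estimate_J(0.5)
```

Fixing `seed` makes the estimate a deterministic function of `theta`, which is convenient when comparing nearby parameter values.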