Lexicographic refinements in stationary possibilistic Markov Decision Processes ✩

Nahla Ben Amor a,∗, Zeineb El Khalfi a,b,∗∗, Hélène Fargier b,∗, Régis Sabbadin c,∗

a LARODEC, University of Tunis, Tunisia
b IRIT, UPS-CNRS, Université de Toulouse 3, 118 route de Narbonne, F-31062 Toulouse, France
c MIAT, UR 875, Université de Toulouse, INRA, F-31320 Castanet-Tolosan, France
Abstract

Keywords: Markov Decision Process; Possibility theory; Lexicographic comparisons; Possibilistic qualitative utilities

Possibilistic Markov Decision Processes offer a compact and tractable way to represent and solve problems of sequential decision under qualitative uncertainty. Even though appealing for its ability to handle qualitative problems, this model suffers from the drowning effect that is inherent to possibilistic decision theory. The present paper¹ proposes to escape the drowning effect by extending to stationary possibilistic MDPs the lexicographic preference relations defined by Fargier and Sabbadin [13] for non-sequential decision problems. We propose a value iteration algorithm and a policy iteration algorithm to compute policies that are optimal for these new criteria. The practical feasibility of these algorithms is then experimented on different samples of possibilistic MDPs.
1. Introduction
The classical paradigm for sequential decision making under uncertainty is the expected utility-based Markov Decision Processes (MDPs) framework [3,21], which assumes that the uncertain effects of actions can be represented by probability distributions and that utilities are additive. But the EU model does not suit problems where uncertainty and preferences are ordinal in essence.
Alternatives to the EU-based model have been proposed to handle ordinal preferences/uncertainty. Remaining within the probabilistic, quantitative framework while considering ordinal preferences has led to quantile-based approaches [15,18,27,29,33]. Purely ordinal approaches to sequential decision under uncertainty have also been considered. In particular, possibilistic MDPs [1,6,22,24] form a purely qualitative decision model with an ordinal evaluation of plausibility and preference. In this model, uncertainty about the consequences of actions is represented by possibility distributions and utilities are also ordinal. The decision criteria are either the pessimistic qualitative utility or its optimistic counterpart [9]. Such degrees can be either elicited from experts, or obtained by automatic learning approaches [23].
✩ This paper is part of the Virtual special issue on the 14th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty (ECSQARU 2017), edited by Alessandro Antonucci, Laurence Cholvy and Odile Papini.
* Corresponding authors.
** Corresponding author at: IRIT, UPS-CNRS, 118 route de Narbonne, 31062 Toulouse, France.
E-mail addresses: nahla.benamor@gmx.com (N. Ben Amor), zeineb.khalfi@gmail.com (Z. El Khalfi), fargier@irit.fr (H. Fargier), regis.sabbadin@inra.fr (R. Sabbadin).
1 This paper is an extended and revised version of two conference papers [4,5]. It includes the full proofs of the propositions presented in these preliminary papers, new algorithms (based on policy iteration) and new experiments.
However, it is now well known that possibilistic decision criteria suffer from a drowning effect [13]: plausible enough bad or good consequences may completely blur the comparison between policies that would otherwise be clearly differentiable.

In [13], Fargier and Sabbadin have proposed lexicographic refinements of possibilistic criteria for the one-step decision case, in order to remedy the drowning effect. This work has recently been extended to (finite horizon) possibilistic decision trees [4]. In the present paper, we propose to study the interest of the lexicographic preference relations for stationary possibilistic Markov Decision Processes, a model that is more compact than decision trees and not limited to a finite horizon.

The paper is structured as follows: the next section recalls the background about possibilistic decision theory and stationary possibilistic MDPs, including the drowning effect problem. Section 3 defines the lexicographic comparison of policies and presents a value iteration algorithm which computes a nearly optimal strategy in a limited number of iterations. Then, Section 4 proposes a lexicographic value iteration algorithm and a lexicographic policy iteration algorithm using approximation of utility functions. Lastly, Section 5 presents our experimental results.
2. Background and notations
2.1. Basics of possibilistic decision theory
Most of the available decision models refer to probability theory for the representation of uncertainty [20,25]. Despite its success, probability theory is not appropriate when numerical information is not available. When information about uncertainty cannot be quantified in a probabilistic way, possibility theory [8,34] is a natural field to consider. The basic component of this theory is the notion of possibility distribution. It is a representation of a state of knowledge of an agent about the state of the world. A possibility distribution π is a mapping from the universe of discourse S (the set of all the possible worlds) to a bounded linearly ordered scale L exemplified (without loss of generality) by the unit interval [0,1]; we denote the function by π : S → [0,1]. For a state s ∈ S, π(s) = 1 means that realization s is totally possible and π(s) = 0 means that s is an impossible state. It is generally assumed that there exists at least one state s which is totally possible: π is then said to be normalized. In the possibilistic framework, extreme forms of knowledge can be captured, namely:

• Complete knowledge, i.e. ∃s s.t. π(s) = 1 and ∀s′ ≠ s, π(s′) = 0.
• Total ignorance, i.e. ∀s ∈ S, π(s) = 1 (all values in S are possible).

From π one can compute the possibility measure Π(A) and the necessity measure N(A) of any event A ⊆ S:

Π(A) = sup_{s∈A} π(s),    N(A) = 1 − Π(Ā) = 1 − sup_{s∉A} π(s).

Measure Π(A) evaluates to which extent A is consistent with the knowledge represented by π, while N(A) corresponds to the extent to which ¬A is impossible and thus evaluates at which level A is certainly implied by the knowledge.

In decision theory, acts are functions f : S → X, where X is a finite set of outcomes. In possibilistic decision making, an act f can be viewed as a possibility distribution π_f over X [9], where π_f(x) = Π(f^{−1}(x)). In a single stage decision making problem, a utility function u : X → U maps outcomes to utility values in a totally ordered scale U = {u_1, ..., u_n}. This function models the attractiveness of each outcome for the decision-maker.
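As a quick illustration (ours, not from the paper), the two measures can be computed in a few lines of Python from a possibility distribution stored as a dictionary; the state names are hypothetical:

```python
def possibility(pi, A):
    """Possibility measure: to which extent event A is consistent with pi."""
    return max(pi[s] for s in A)

def necessity(pi, A):
    """Necessity measure: to which extent A is certainly implied by pi."""
    complement = set(pi) - set(A)
    return 1 - max((pi[s] for s in complement), default=0)

# A normalized possibility distribution over three states.
pi = {"s1": 1.0, "s2": 0.2, "s3": 0.0}
print(possibility(pi, {"s1", "s2"}))  # 1.0
print(necessity(pi, {"s1", "s2"}))    # 1 - pi(s3) = 1.0
```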
Under the assumption that the utility scale and the possibility scale are commensurate and purely ordinal (i.e. U = L), Dubois and Prade [9,7] have proposed pessimistic and optimistic decision criteria.
First, the pessimistic criterion was originally proposed by Whalen [30] and it generalizes the Wald criterion [28]. It suits cautious decision makers who are happy when bad consequences are hardly plausible. It summarizes to what extent it is certain (i.e. necessary according to measure N) that the act reaches a good utility. The definition of the pessimistic criterion is as follows [10]:

Definition 1. Given a possibility distribution π over a set of states S and a utility function u on the set of consequences X, the pessimistic utility of an act f is defined by:

u_pes(f) = min_{x_j∈X} max(u(x_j), 1 − π_f(x_j)) = min_{s_i∈S} max(u(f(s_i)), 1 − π(s_i)).   (1)

Therefore, we can compare two acts f and g on the basis of their pessimistic utilities:

f ≽_{u_pes} g ⇔ u_pes(f) ≥ u_pes(g).
The second criterion is the optimistic possibilistic criterion originally proposed by Yager [32,31]. This criterion captures the behavior of an adventurous decision maker who is happy as soon as at least one good consequence is highly plausible. It summarizes to what extent it is possible that an act reaches a good utility. The definition of this criterion is as follows [10]:

Definition 2. Given a possibility distribution π over a set of states S and a utility function u on a set of consequences X, the optimistic utility of an act f is defined by:

u_opt(f) = max_{x_j∈X} min(u(x_j), π_f(x_j)) = max_{s_i∈S} min(u(f(s_i)), π(s_i)).   (2)

Hence, we can compare two acts f and g on the basis of their optimistic utilities:

f ≽_{u_opt} g ⇔ u_opt(f) ≥ u_opt(g).
Example 1. Let S = {s1, s2} and f and g be two acts whose utilities of consequences in the states s1 and s2 are listed in the following table, as well as the degrees of possibility of s1 and s2:

           s1    s2
u(f(s))    0.3   0.5
u(g(s))    0.4   0.6
π          1     0.2

Comparing f and g with respect to the pessimistic criterion, we get:
• u_pes(f) = min(max(0.3, 0), max(0.5, 0.8)) = 0.3,
• u_pes(g) = min(max(0.4, 0), max(0.6, 0.8)) = 0.4.
Thus, g ≽_{u_pes} f.

Let us now compare the two acts with respect to the optimistic criterion:
• u_opt(f) = max(min(0.3, 1), min(0.5, 0.2)) = 0.3,
• u_opt(g) = max(min(0.4, 1), min(0.6, 0.2)) = 0.4.
Thus, g ≽_{u_opt} f.
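To make Definitions 1 and 2 concrete, the following short Python sketch (our illustration, not the authors' code) evaluates both criteria on the acts of Example 1:

```python
def u_pes(pi, utility):
    """Pessimistic utility: min over states of max(u(f(s)), 1 - pi(s))."""
    return min(max(u, 1 - p) for u, p in zip(utility, pi))

def u_opt(pi, utility):
    """Optimistic utility: max over states of min(u(f(s)), pi(s))."""
    return max(min(u, p) for u, p in zip(utility, pi))

pi = [1.0, 0.2]   # possibility degrees of s1 and s2
f = [0.3, 0.5]    # utilities of f's consequences in s1, s2
g = [0.4, 0.6]    # utilities of g's consequences in s1, s2

print(u_pes(pi, f), u_pes(pi, g))  # 0.3 0.4 -> g preferred pessimistically
print(u_opt(pi, f), u_opt(pi, g))  # 0.3 0.4 -> g preferred optimistically
```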
It is important to note that while transition probabilities can be estimated through simulations of the process, transition possibilities may not. On the other hand, experts may be involved for the elicitation of the possibility degrees and utilities of transitions. In the possibilistic framework, utility and uncertainty levels can be elicited jointly, by comparison of possibilistic lotteries, for example (e.g. by using certainty equivalents, as in [11]). Simulation can also be used jointly with expert evaluation when the underlying process is too costly to simulate a large number of times: simulation may be used to generate samples on which expert elicitation is applied. Another option is to use possibilistic reinforcement learning procedures (for more details see [23]), in particular model-based reinforcement learning algorithms. The latter use a uniform simulation of trajectories (with random choice of actions) in order to generate an approximation of the possibilistic decision model.
2.2. Stationary Possibilistic Markov Decision Processes
A stationary Possibilistic Markov Decision Process (ΠMDP) [22] is defined by:
• A finite set S of states;
• A finite set A of actions; A_s denotes the set of actions available in state s;
• A possibilistic transition function: for each action a ∈ A_s and each state s ∈ S, the possibility distribution π(s′|s,a) evaluates to what extent each s′ is a possible successor of s when action a is applied;
• A utility function µ: µ(s) is the intermediate satisfaction degree obtained in state s.

The uncertainty about the effect of an action a taken in state s is captured by a possibility distribution π(·|s,a). In the present paper, we consider stationary problems, i.e. problems in which the states, the actions and the transition functions do not depend on the stage of the problem. Such a possibilistic MDP may define a graph where states are represented by circles, each state s is labeled with a utility degree, and actions are represented by squares. An edge linking an action to a state denotes a possible transition and is labeled by the possibility of that state given the action is executed.

[Fig. 1. The stationary ΠMDP of Example 2.]
Example 2. Let us suppose that a "Rich and Unknown" person runs a startup company. Initially, s/he must choose between Saving money (Sav) or Advertising (Adv) and may then get Rich (R) or Poor (P) and Famous (F) or Unknown (U). In the other states, Sav is the only possible action. Fig. 1 shows the stationary ΠMDP that captures this problem, formally described as follows:

S = {RU, RF, PU},  A_RU = {Adv, Sav},  A_RF = A_PU = {Sav},
π(PU|RU,Sav) = 0.2,  π(RU|RU,Sav) = π(RF|RU,Adv) = π(RF|RF,Sav) = π(RU|RF,Sav) = 1,
µ(RU) = 0.5,  µ(RF) = 0.7,  µ(PU) = 0.3.

Solving a stationary MDP consists in finding a (stationary) policy, i.e. a function δ : S → A, which is optimal with respect to a decision criterion. In the possibilistic case, as in the probabilistic case, the value of a policy depends on the utility and on the likelihood of its trajectories. Formally, let Δ be the set of all policies that can be built for the ΠMDP (the set of all the functions that associate an element of A_s to each s). Each δ ∈ Δ defines a list of scenarios called trajectories.
Each trajectory τ is a sequence of states and actions, i.e. τ = (s0, a0, s1, ..., s_{t−1}, a_{t−1}, s_t). To simplify notations, we will associate the vector v_τ = (µ0, π1, µ1, π2, ..., π_t, µ_t) to each trajectory τ, where π_{i+1} = π(s_{i+1}|s_i, a_i) is the possibility degree to reach the state s_{i+1} at t = i+1, applying the action a_i at t = i, and µ_i = µ(s_i) is the utility obtained in the i-th state s_i of the trajectory.
The possibility and the utility of trajectory τ, given that δ is applied from s0, are defined by:

π(τ|s0, δ) = min_{i=1..t} π(s_i | s_{i−1}, δ(s_{i−1}))   and   µ(τ) = min_{i=0..t} µ(s_i).   (3)
Two criteria, an optimistic and a pessimistic one, can then be used to evaluate δ [24,9]:

u_opt(δ, s0) = max_τ min{π(τ|s0, δ), µ(τ)},   (4)
u_pes(δ, s0) = min_τ max{1 − π(τ|s0, δ), µ(τ)}.   (5)
The policies optimizing these criteria can be computed by applying, for every state s and time step i = 0, ..., t, the following counterparts of the Bellman updates [22]:

u_opt(s, i) ← max_{a∈A_s} min{µ(s), max_{s′∈S} min(π(s′|s,a), u_opt(s′, i+1))},   (6)
u_pes(s, i) ← max_{a∈A_s} min{µ(s), min_{s′∈S} max(1 − π(s′|s,a), u_pes(s′, i+1))},   (7)
δ_opt(s, i) ← argmax_{a∈A_s} min{µ(s), max_{s′∈S} min(π(s′|s,a), u_opt(s′, i+1))},   (8)
δ_pes(s, i) ← argmax_{a∈A_s} min{µ(s), min_{s′∈S} max(1 − π(s′|s,a), u_pes(s′, i+1))},   (9)

where we set, arbitrarily, u_opt(s′, t+1) = 1 and u_pes(s′, t+1) = 1.
These updates have allowed the definition of a (possibilistic) value iteration algorithm (see Algorithm 1 for the optimistic variant of this algorithm), which converges to an optimal policy in polytime [22]. This algorithm proceeds by iterated modifications of a possibilistic value function Q(s,a) which evaluates the "utility" (pessimistic or optimistic) of performing a in s.

Another algorithm, (possibilistic) Policy Iteration (Algorithm 2 for the optimistic variant), is proposed in [22] for solving possibilistic stationary, infinite horizon MDPs. Policy Iteration alternates steps of evaluation of the current policy with steps of greedy improvement of the current policy.
Algorithm 1: VI-MDP: Possibilistic (Optimistic) Value Iteration.
Data: A stationary ΠMDP
Result: A policy δ optimal for u_opt
1  begin
2    foreach s ∈ S do u_opt(s) ← µ(s);
3    repeat
4      foreach s ∈ S do
5        u_old(s) ← u_opt(s);
6        foreach a ∈ A do
7          Q(s,a) ← min{µ(s), max_{s′∈S} min(π(s′|s,a), u_opt(s′))};
8        u_opt(s) ← max_a Q(s,a);
9        δ(s) ← argmax_a Q(s,a);
10   until u_opt(s) == u_old(s) for each s;
11   return δ;
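A direct transcription of Algorithm 1 might look as follows (a Python sketch under our own data representation; pi, mu and the state/action encoding are those assumed in the earlier snippets, not the authors' Java implementation):

```python
def possibilistic_value_iteration(states, actions, pi, mu):
    """Optimistic possibilistic value iteration (Algorithm 1)."""
    u = {s: mu[s] for s in states}          # line 2: initialize with utilities
    delta = {}
    while True:
        u_old = dict(u)                     # line 5
        for s in states:
            q = {}
            for a in actions[s]:            # line 7: possibilistic Q-value
                q[a] = min(mu[s],
                           max(min(p, u_old[sp]) for sp, p in pi[(s, a)].items()))
            best = max(q, key=q.get)        # lines 8-9: greedy choice
            u[s], delta[s] = q[best], best
        if u == u_old:                      # line 10: convergence test
            return delta, u

states = ["RU", "RF", "PU"]
actions = {"RU": ["Sav", "Adv"], "RF": ["Sav"], "PU": ["Sav"]}
mu = {"RU": 0.5, "RF": 0.7, "PU": 0.3}
pi = {("RU", "Sav"): {"RU": 1.0, "PU": 0.2}, ("RU", "Adv"): {"RF": 1.0},
      ("RF", "Sav"): {"RF": 1.0, "RU": 1.0}, ("PU", "Sav"): {"PU": 1.0}}
print(possibilistic_value_iteration(states, actions, pi, mu))
```

On Example 2, both actions in RU yield u_opt(RU) = 0.5, which is precisely the tie discussed in Section 2.3.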
Algorithm 2: PI-MDP: Possibilistic (Optimistic) Policy Iteration.
Data: A stationary ΠMDP
Result: A policy δ optimal for u_opt
1  begin
2    // Initialization of δ and u_opt
3    foreach s ∈ S do
4      δ(s) ← choose any a_s ∈ A_s;
5      u_opt(s) ← µ(s);
6    repeat
7      // Evaluation of δ until stabilization of u_opt
8      repeat
9        foreach s ∈ S do
10         u_old(s) ← u_opt(s);
11         u_opt(s) ← min{µ(s), max_{s′∈S} min(π(s′|s,δ(s)), u_old(s′))};
12       until u_opt == u_old;
13     // Improvement of δ
14     foreach s ∈ S do
15       δ_old(s) ← δ(s);
16       δ(s) ← argmax_{a∈A} min{µ(s), max_{s′∈S} min(π(s′|s,a), u_opt(s′))};
17   until δ(s) == δ_old(s) for each s;
18   // stabilization of δ
19   return δ;
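Again as an illustration only, Algorithm 2 can be sketched in Python on top of the same hypothetical data structures as before:

```python
def possibilistic_policy_iteration(states, actions, pi, mu):
    """Optimistic possibilistic policy iteration (Algorithm 2)."""
    delta = {s: actions[s][0] for s in states}   # arbitrary initial policy
    while True:
        # Evaluation of delta until stabilization of u_opt (lines 8-12)
        u = {s: mu[s] for s in states}
        while True:
            u_old = dict(u)
            for s in states:
                u[s] = min(mu[s], max(min(p, u_old[sp])
                                      for sp, p in pi[(s, delta[s])].items()))
            if u == u_old:
                break
        # Greedy improvement of delta (lines 14-16)
        delta_old = dict(delta)
        for s in states:
            delta[s] = max(actions[s],
                           key=lambda a: min(mu[s], max(min(p, u[sp])
                                             for sp, p in pi[(s, a)].items())))
        if delta == delta_old:               # line 17: stabilization of delta
            return delta, u
```

Note that the improvement step always considers the currently prescribed action, so the policy value can never decrease between two iterations.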
2.3. The drowning effect in stationary sequential decision problems
Unfortunately, possibilistic utilities suffer from an important drawback called the drowning effect: plausible enough bad or good consequences may completely blur the comparison between acts that would otherwise be clearly differentiated; as a consequence, an optimal policy δ is not necessarily Pareto efficient. Recall that a policy δ is Pareto efficient when no other policy δ′ dominates it (i.e. there is no policy δ′ such that (i) ∀s ∈ S, u_pes(δ′,s) ≽ u_pes(δ,s) and (ii) ∃s ∈ S s.t. u_pes(δ′,s) ≻ u_pes(δ,s)). The following example shows that it can simultaneously happen that δ′ dominates δ and u_pes(δ) = u_pes(δ′).
Example 3. The ΠMDP of Example 2 admits two policies δ and δ′:
• δ(RU) = Sav; δ(PU) = Sav; δ(RF) = Sav;
• δ′(RU) = Adv; δ′(PU) = Sav; δ′(RF) = Sav.

Consider a fixed horizon H = 2:
• δ has 3 trajectories: τ1 = (RU, PU, PU) with v_τ1 = (0.5, 0.2, 0.3, 1, 0.3); τ2 = (RU, RU, PU) with v_τ2 = (0.5, 1, 0.5, 0.2, 0.3); τ3 = (RU, RU, RU) with v_τ3 = (0.5, 1, 0.5, 1, 0.5);
• δ′ has 2 trajectories: τ4 = (RU, RF, RF) with v_τ4 = (0.5, 1, 0.7, 1, 0.7); τ5 = (RU, RF, RU) with v_τ5 = (0.5, 1, 0.7, 1, 0.5).

Thus u_opt(δ, RU) = u_opt(δ′, RU) = 0.5. However, δ′ seems better than δ since it provides utility 0.5 for sure, while δ provides a bad utility (0.3) in some non-impossible trajectories (τ1 and τ2). τ3, which is good and totally possible, "drowns" τ1 and τ2: δ is considered as good as δ′.

3. Bounded iterations solutions to lexicographic finite horizon ΠMDPs
Possibilistic decision criteria, especially pessimistic and optimistic utilities, are simple and realistic as illustrated in Section 2, but they have an important shortcoming: the principle of Pareto efficiency is violated since these criteria suffer from the drowning effect. Indeed, one decision may dominate another one while not being strictly preferred. In order to overcome the drowning effect, some refinements of possibilistic utilities have been proposed in the non-sequential case, such as the lexicographic refinements proposed by [12,13]. These refinements are fully in accordance with ordinal utility theory and satisfy the principle of Pareto dominance, which is why we have chosen to focus on them.

The present section defines an extension of lexicographic refinements to finite horizon possibilistic Markov decision processes and proposes a value iteration algorithm that looks for policies optimal with respect to these criteria.
3.1. Lexi-refinements of ordinal aggregations
In ordinal (i.e. min-based and max-based) aggregation, a solution to the drowning effect based on leximin and leximax comparisons has been proposed by [19]. It has then been extended to non-sequential decision making under uncertainty [13] and, in the sequential case, to decision trees [4]. Let us first recall the basic definition of these two preference relations. For any two vectors t and t′ of length m built on the scale L:

t ≽_lmin t′ iff ∀i, t_{σ(i)} = t′_{σ(i)}, or ∃i*, ∀i < i*, t_{σ(i)} = t′_{σ(i)} and t_{σ(i*)} > t′_{σ(i*)},   (10)
t ≽_lmax t′ iff ∀i, t_{µ(i)} = t′_{µ(i)}, or ∃i*, ∀i < i*, t_{µ(i)} = t′_{µ(i)} and t_{µ(i*)} > t′_{µ(i*)},   (11)

where, for any vector v (here, v = t or v = t′), v_{µ(i)} (resp. v_{σ(i)}) is the i-th best (resp. worst) element of v.
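Since leximin and leximax simply compare sorted copies of the vectors, they are easy to express in Python; the sketch below (ours, with hypothetical function names) returns a three-way comparison result:

```python
def leximin_cmp(t, tp):
    """Compare t and tp by leximin: sort increasingly, compare worst-first."""
    a, b = sorted(t), sorted(tp)
    return (a > b) - (a < b)   # 1 if t >_lmin tp, -1 if <, 0 if indifferent

def leximax_cmp(t, tp):
    """Compare t and tp by leximax: sort decreasingly, compare best-first."""
    a, b = sorted(t, reverse=True), sorted(tp, reverse=True)
    return (a > b) - (a < b)

# (0.5, 0.7, 0.7, 1, 1) >_lmin (0.5, 0.5, 0.5, 1, 1), as in Example 4 below.
print(leximin_cmp((0.5, 0.7, 0.7, 1, 1), (0.5, 0.5, 0.5, 1, 1)))  # 1
```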
[13,4] have extended these procedures to the comparison of matrices built on L, defining preference relations ≽_{lmin(lmax)} and ≽_{lmax(lmin)}:

A ≽_{lmin(lmax)} B ⇔ ∀j, a_{(lmax,j)} ∼_lmax b_{(lmax,j)}, or ∃i s.t. ∀j > i, a_{(lmax,j)} ∼_lmax b_{(lmax,j)} and a_{(lmax,i)} ≻_lmax b_{(lmax,i)},   (12)
A ≽_{lmax(lmin)} B ⇔ ∀j, a_{(lmin,j)} ∼_lmin b_{(lmin,j)}, or ∃i s.t. ∀j < i, a_{(lmin,j)} ∼_lmin b_{(lmin,j)} and a_{(lmin,i)} ≻_lmin b_{(lmin,i)},   (13)

where a_{(⋆,i)} (resp. b_{(⋆,i)}) is the i-th largest sub-vector of A (resp. B) according to ⋆ ∈ {lmax, lmin}.
Like in (finite-horizon) possibilistic decision trees [4], our idea is to identify the strategies of the MDP with the matrices of their trajectories, and to compare such matrices with a ≽_{lmax(lmin)} (resp. ≽_{lmin(lmax)}) procedure for the optimistic (resp. pessimistic) case.
3.2. Lexicographic comparisons of policies
Let us first define lexicographic comparisons of policies over a given horizon E.

A trajectory over horizon E being a sequence of states and actions, any stationary policy can be identified with a matrix where each line corresponds to a distinct trajectory of length E. In the optimistic case each line corresponds to a vector v_τ = (µ0, π1, µ1, π2, ..., π_E, µ_E) and in the pessimistic case to w_τ = (µ0, 1−π1, µ1, 1−π2, ..., 1−π_E, µ_E). This allows us to define the comparison of trajectories using leximax and leximin as follows:
τ ≽_lmin τ′ iff (µ0, π1, ..., π_E, µ_E) ≽_lmin (µ′0, π′1, ..., π′_E, µ′_E),   (14)
τ ≽_lmax τ′ iff (µ0, 1−π1, ..., 1−π_E, µ_E) ≽_lmax (µ′0, 1−π′1, ..., 1−π′_E, µ′_E).   (15)

Note that the above preference relations implicitly depend on the horizon E, and the same holds for the comparison of stationary policies. We leave aside any reference to E as the dependence will be clear from the context. Using (14) and (15), we can compare policies by:
δ ≽_{lmax(lmin)} δ′ iff ∀i, τ_{µ(i)} ∼_lmin τ′_{µ(i)}, or ∃i*, ∀i < i*, τ_{µ(i)} ∼_lmin τ′_{µ(i)} and τ_{µ(i*)} ≻_lmin τ′_{µ(i*)},   (16)
δ ≽_{lmin(lmax)} δ′ iff ∀i, τ_{σ(i)} ∼_lmax τ′_{σ(i)}, or ∃i*, ∀i < i*, τ_{σ(i)} ∼_lmax τ′_{σ(i)} and τ_{σ(i*)} ≻_lmax τ′_{σ(i*)},   (17)

where τ_{µ(i)} (resp. τ′_{µ(i)}) is the i-th best trajectory of δ (resp. δ′) according to ≽_lmin, and τ_{σ(i)} (resp. τ′_{σ(i)}) is the i-th worst trajectory of δ (resp. δ′) according to ≽_lmax.
Hence, the utility degree of a policy δ can be represented by a matrix U_δ with n lines, s.t. n is the number of trajectories, and m = 2E+1 columns. Indeed, comparing two policies w.r.t. ≽_{lmax(lmin)} (resp. ≽_{lmin(lmax)}) consists in first ordering the two corresponding matrices of trajectories as follows:

• order the elements of each trajectory (i.e. the elements of each line) in increasing order w.r.t. ≽_lmin (resp. in decreasing order w.r.t. ≽_lmax),
• then order all the trajectories: the lines of each policy are arranged lexicographically top-down in decreasing order (resp. top-down in increasing order).

Then, it is enough to lexicographically compare the two new matrices of trajectories, denoted U_δ (resp. U_δ′), element by element. The first pair of different elements determines the best matrix/policy. Note that the ordered matrix U_δ (resp. U_δ′) can be seen as the utility of applying policy δ (resp. δ′) over a length-E horizon.
Example 4. Let us consider the counter-example of Example 3, with the same ΠMDP of Example 2. We consider, once again, the policies δ and δ′ defined by:
• δ(RU) = Sav; δ(PU) = Sav; δ(RF) = Sav;
• δ′(RU) = Adv; δ′(PU) = Sav; δ′(RF) = Sav.

For horizon H = 2:
• δ has 3 trajectories: τ1 = (RU, PU, PU) with v_τ1 = (0.5, 0.2, 0.3, 1, 0.3); τ2 = (RU, RU, PU) with v_τ2 = (0.5, 1, 0.5, 0.2, 0.3); τ3 = (RU, RU, RU) with v_τ3 = (0.5, 1, 0.5, 1, 0.5). The matrix of trajectories is:

U_δ =
| 0.5  0.2  0.3  1    0.3 |      (after sorting each line)      | 0.2  0.3  0.3  0.5  1 |
| 0.5  1    0.5  0.2  0.3 |                ∼                    | 0.2  0.3  0.5  0.5  1 |
| 0.5  1    0.5  1    0.5 |                                     | 0.5  0.5  0.5  1    1 |

So, the ordered matrix of trajectories (lines ranked in decreasing order) is:

U_δ =
| 0.5  0.5  0.5  1    1 |
| 0.2  0.3  0.5  0.5  1 |
| 0.2  0.3  0.3  0.5  1 |

• δ′ has 2 trajectories: τ4 = (RU, RF, RF) with v_τ4 = (0.5, 1, 0.7, 1, 0.7); τ5 = (RU, RF, RU) with v_τ5 = (0.5, 1, 0.7, 1, 0.5). The ordered matrix of trajectories is:

U_δ′ =
| 0.5  0.7  0.7  1  1 |
| 0.5  0.5  0.7  1  1 |

Given the two ordered matrices U_δ and U_δ′, δ and δ′ are indifferent for optimistic utility since the two first (i.e. top-left) elements of the matrices are equal, i.e. u_opt(δ) = u_opt(δ′) = 0.5. For lmax(lmin) we compare successively the next elements (left to right, then top to bottom) until we find a pair of different values. In particular, the second element of the first (i.e. the best) trajectory of δ′ is strictly greater than the second element of the first trajectory of δ (0.7 > 0.5). So, the first trajectory of δ′ is strictly preferred to the first trajectory of δ according to ≽_lmin. We deduce that δ′ is strictly preferred to δ:

δ′ ≻_{lmax(lmin)} δ since (0.5, 0.7, 0.7, 1, 1) ≻_lmin (0.5, 0.5, 0.5, 1, 1).
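The ordering and comparison steps of Example 4 can be reproduced with a few lines of Python (our sketch; lists of lists stand in for the matrices of trajectories):

```python
def ordered_matrix(U):
    """Sort each line increasingly, then rank the lines top-down in
    decreasing order, as in the lmax(lmin) procedure."""
    return sorted((sorted(row) for row in U), reverse=True)

def lmax_lmin_cmp(U, V):
    """Scan the ordered matrices element by element, line by line; the first
    pair of different values decides (on a tie, fewer trajectories win)."""
    A, B = ordered_matrix(U), ordered_matrix(V)
    for ra, rb in zip(A, B):
        if ra != rb:
            return 1 if ra > rb else -1
    return (len(B) > len(A)) - (len(A) > len(B))

U_delta  = [[0.5, 0.2, 0.3, 1, 0.3], [0.5, 1, 0.5, 0.2, 0.3], [0.5, 1, 0.5, 1, 0.5]]
U_deltap = [[0.5, 1, 0.7, 1, 0.7], [0.5, 1, 0.7, 1, 0.5]]
print(lmax_lmin_cmp(U_deltap, U_delta))  # 1: delta' strictly preferred
```

The tie-breaking rule on matrices of different heights anticipates the convention of Section 3.3: when the scanned prefixes are indifferent, the matrix with fewer trajectories is preferred.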
The following propositions can be shown, concerning the fixed horizon comparison of stationary policies. Note again that the dependence on E is left implicit.

Proposition 1.
If u_opt(δ) > u_opt(δ′) then δ ≻_{lmax(lmin)} δ′.
If u_pes(δ) > u_pes(δ′) then δ ≻_{lmin(lmax)} δ′.

Proposition 2. ≽_{lmax(lmin)} and ≽_{lmin(lmax)} satisfy the principle of Pareto efficiency.
Now, in order to design dynamic programming algorithms, i.e. to extend the value iteration algorithm to lexicographic comparison, we show that the comparison of policies is a preorder and satisfies the principle of strict monotonicity, defined as follows for any optimization criterion O: ∀δ, δ′, δ″ ∈ Δ,

δ ≽_O δ′ ⇔ δ + δ″ ≽_O δ′ + δ″,

where δ (resp. δ′) and δ″ denote two disjoint sets of trajectories and δ + δ″ (resp. δ′ + δ″) is the set of trajectories that gathers the ones of δ (resp. δ′) and the ones of δ″.

Then, adding or removing identical trajectories to two sets of trajectories does not change their comparison by ≽_{lmax(lmin)} (resp. ≽_{lmin(lmax)}).

Proposition 3. Relations ≽_{lmin(lmax)} and ≽_{lmax(lmin)} are complete, transitive and satisfy the principle of strict monotonicity.

Note that u_opt and u_pes satisfy only a weak form of monotonicity, since the addition or the removal of trajectories may transform a strict preference into an indifference if u_opt or u_pes is used.
Let us define the complementary MDP (S, A, π, µ̄) of a given ΠMDP (S, A, π, µ), where µ̄(s) = 1 − µ(s), ∀s ∈ S. The complementary MDP simply gives complementary utilities. From the definitions of ≽_lmax and ≽_lmin, we can check that:

Proposition 4. τ ≽_lmax τ′ ⇔ τ̄′ ≽_lmin τ̄ and δ ≽_{lmin(lmax)} δ′ ⇔ δ̄′ ≽_{lmax(lmin)} δ̄, where τ̄ and δ̄ are obtained by replacing µ with µ̄ in the trajectory/ΠMDP.

Therefore, all the results which we will prove for ≽_{lmax(lmin)} also hold for ≽_{lmin(lmax)}, if we take care to apply them to complementary policies. Since considering ≽_{lmax(lmin)} involves less cumbersome expressions (no 1 − ·), we will give the results for this criterion. A consequence of Proposition 4 is that the results hold for the pessimistic criterion as well.

This monotonicity of the lmin(lmax) and lmax(lmin) criteria is sufficient to allow us to use a dynamic programming algorithm such as value iteration or policy iteration [2]. The algorithms we propose in the present paper perform explicit Bellman updates in the lexicographic framework (lines 12–13 of Algorithm 3, lines 10–11 of Algorithm 4, line 11 of Algorithm 5); the correctness of their use is proved in Propositions 6 to 10.
3.3. Basic operations on matrices of trajectories
Before going further, in order to give more explicit and compact descriptions of the algorithms and the proofs, let us introduce the following notations and some basic operations on matrices (typically, on the matrix U(s) representing trajectories issued from state s). Abusing notations slightly, we identify trajectories τ (resp. policies) with their v_τ vectors (resp. matrices of v_τ vectors) when there is no ambiguity. For any matrix U, [U]_{l,c} denotes the restriction of U to its first l lines and first c columns, and U_{i,j} denotes the element at line i and column j.

• Composition: Let U be an a × b matrix and N_1, ..., N_a be a series of a matrices of dimension n_i × c (they all share the same number of columns). The composition of U with (N_1, ..., N_a), denoted U × (N_1, ..., N_a), is a matrix of dimension (Σ_{1≤i≤a} n_i) × (b + c). For any i ≤ a, j ≤ n_i, the ((Σ_{i′<i} n_{i′}) + j)-th line of U × (N_1, ..., N_a) is the concatenation of the i-th line of U and the j-th line of N_i.
The composition U × (N_1, ..., N_a) is done in O(n·m) operations, where n = Σ_{1≤i≤a} n_i and m = b + c. The matrix U(s), matrix of trajectories out of state s when making decision a, is typically the composition of the matrix U = ((π(s′|s,a), µ(s′)), s′ ∈ succ(s,a)) with the matrices N_{s′} = U(s′). This procedure adds two columns to each matrix U(s′), filled with π(s′|s,a) and µ(s′), the possibility degree and the utility of reaching s′; then the matrices are vertically concatenated to get the matrix U(s) when making decision a. It is then possible to lexicographically compare the resulting matrices in order to get the optimal action in state s (see the sketch after this list).
• Ordering matrices: Let U be an n × m matrix; U^{lmaxlmin} is the matrix obtained by ordering the elements of the lines of U in increasing order and the lines of U according to lmax in decreasing order (see Example 4). This operation allows us to order the matrices of trajectories Q(s,a) of every action, in order to compare them and choose the optimal decision. The complexity of the operation depends on the sorting algorithm: if we use QuickSort, then ordering the elements within a line is performed in O(m·log(m)) operations, and the inter-ranking of the lines is done in O(n·log(n)·m) operations. Hence, the overall complexity is O(n·m·log(n·m)).

• Comparison of ordered matrices: Given two ordered matrices U^{lmaxlmin} and V^{lmaxlmin}, we say that U^{lmaxlmin} > V^{lmaxlmin} iff ∃i, j such that ∀i′ < i, ∀j′, U^{lmaxlmin}_{i′,j′} = V^{lmaxlmin}_{i′,j′}, and ∀j′ < j, U^{lmaxlmin}_{i,j′} = V^{lmaxlmin}_{i,j′}, and U^{lmaxlmin}_{i,j} > V^{lmaxlmin}_{i,j}. U^{lmaxlmin} ∼ V^{lmaxlmin} iff they are identical (comparison complexity: O(n·m)). Once matrices Q(s,a) are ordered, the lexicographic comparison of two decisions is performed by scanning the elements of their matrices, line by line from the first one. The first pair of different values determines the best matrix, and the corresponding best action a is selected (see Example 4).
If the policies (or sub-policies) have different numbers of trajectories, the comparison of two matrices is based on the number of trajectories of the shortest matrix. Two cases may arise:

• If we have a strict preference between the two matrices before reaching the last line of the shortest matrix, we get a strict preference between the policies (or between the sub-policies).
• If we have an indifference up to the last line, the shortest matrix is the best for the lexicographic criterion, since it expresses less uncertainty in the corresponding policy (or in the sub-policy).
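As announced above, here is a small Python sketch of the composition operation (our own encoding: matrices are lists of lists; pi and mu are as in the earlier snippets, and U_next maps each successor to its already-computed matrix of trajectory suffixes):

```python
def compose(U, Ns):
    """Compose an a x b matrix U with matrices N_1..N_a: line i of U is
    concatenated with every line of N_i, and the results are stacked."""
    return [u_row + n_row for u_row, N in zip(U, Ns) for n_row in N]

def trajectory_matrix(s, a, pi, mu, U_next):
    """Build U(s) for decision a: two new columns (pi(s'|s,a), mu(s'))
    composed with the matrices U(s') of the successors."""
    succ = list(pi[(s, a)])
    U = [[pi[(s, a)][sp], mu[sp]] for sp in succ]
    return compose(U, [U_next[sp] for sp in succ])
```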
3.4. Bounded iterations lexicographic value iteration
In this section, we propose an iterative value iteration-type algorithm (Algorithm 3). This algorithm follows the same principle as in the possibilistic case (Eqs. (6)–(9)): repeated Bellman updates are performed successively E times. This algorithm will provide an approximation of a lexi-optimal strategy in the infinite horizon case (by considering the policy returned for the first time step). This algorithm is sub-optimal for any fixed E, but we will see in Section 4 that, letting E grow, an optimal lexicographic policy will be obtained for finite E.

We propose two versions of the value iteration algorithm: the first one computes the optimal policy with respect to the lmax(lmin) criterion and the second one provides the optimal policy with respect to the lmin(lmax) criterion. In this paper, we present and detail only the first algorithm, since the second is very similar.²
Algorithm 3: Bounded iterations lmax(lmin)-value iteration (BI-VI).
Data: A possibilistic MDP and maximum number of iterations E
Result: The δ_E strategy obtained after E iterations
1  begin
2    e ← 0;
3    foreach s ∈ S do U(s) ← ((µ(s)));
4    foreach s ∈ S, a ∈ A do
5      TU_{s,a} ← T_{s,a} × ((µ(s′)), s′ ∈ succ(s,a));
6    repeat
7      e ← e + 1;
8      foreach s ∈ S do
9        U_old(s) ← U(s);
10       Q* ← ((0));
11       foreach a ∈ A do
12         Future ← (U_old(s′), s′ ∈ succ(s,a)); // gather the matrices provided by the successors of s
13         Q(s,a) ← (TU_{s,a} × Future)^{lmaxlmin};
14         if Q* ≤_{lmaxlmin} Q(s,a) then
15           Q* ← Q(s,a);
16           δ(s) ← a;
17       U(s) ← Q*;
18   until e == E;
19   return δ_E = δ;
This algorithm is an iterative procedure that performs a prescribed number of updates, E, of the utility of each state, represented by a finite matrix of trajectories, using the utilities of the neighboring states.

At stage 1 ≤ e ≤ E, the procedure updates the utility of every state s ∈ S as follows:

• For each action a ∈ A, a matrix Q(s,a) is built to evaluate the "utility" of performing a in s at stage e: this is done by combining TU_{s,a} (combination of the transition matrix T_{s,a} = π(·|s,a) and the utilities µ(s′) of the states s′ that may follow s when a is executed) with the matrices U_old(s′) of trajectories provided by these s′ at the previous stage. The matrix Q(s,a) is then ordered (the operation is made less complex by the fact that the matrices U_old(s′) have already been ordered at stage e−1).
• The lmax(lmin) comparison is performed on the fly to memorize the best Q(s,a).
• The value of state s at stage e, U(s), is the one given by the action a which provides the best Q(s,a). δ is updated, U is memorized (and U_old can be discarded).
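Putting the previous pieces together, the update loop of Algorithm 3 can be sketched as follows (our illustration, reusing compose and lmax_lmin_cmp from the earlier snippets; as a small deviation from the pseudocode, U(s) stores only the (π, µ) suffix of each trajectory, since the prefix µ(s) is common to all actions compared at s and does not affect the choice):

```python
def bi_vi(states, actions, pi, mu, E):
    """Bounded iterations lmax(lmin) value iteration (Algorithm 3 sketch)."""
    U = {s: [[]] for s in states}            # one empty trajectory suffix per state
    delta = {}
    for _ in range(E):                       # E Bellman updates (lines 6-18)
        U_old = dict(U)
        for s in states:
            q_best = None
            for a in actions[s]:             # lines 11-16: build and compare Q(s,a)
                succ = list(pi[(s, a)])
                TU = [[pi[(s, a)][sp], mu[sp]] for sp in succ]
                Q = compose(TU, [U_old[sp] for sp in succ])
                if q_best is None or lmax_lmin_cmp(Q, q_best) > 0:
                    q_best, delta[s] = Q, a
            U[s] = q_best                    # line 17: memorize the best matrix
    return delta
```

On Example 2 with E = 2 this returns δ(RU) = Adv, i.e. the lexicographically optimal choice of Example 4, where plain optimistic value iteration left a tie.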
Time and space complexities of this algorithm are nevertheless expensive, since it eventually memorizes all the trajectories. At each step e, the size of U(s) may grow to b^e · (2·e+1), where b is the maximal number of possible successors of an action; the overall complexity of the algorithm is O(|S|·|A|·E·b^E), which is a problem.

Algorithm 3 is provided with a number of iterations, E. Does it converge when E tends to infinity? That is, are the returned policies identical for any E exceeding a given threshold? Before answering (positively) this question in Section 4.4, we are going to define bounded utility matrices solutions to lexicographic possibilistic MDPs. These solution concepts will be useful to answer the above question.
4. Bounded utility solutions to lexicographic ΠMDPs
We have just proposed a lexicographic value iteration algorithm for the computation of lexicographic policies based on the whole matrices of trajectories. As a consequence, the spatial/temporal complexity of the algorithm is exponential in the number of iterations. This section presents an alternative way to get lexicographic policies. Rather than limiting the size of the matrices of trajectories by limiting the number of iterations, we propose to "forget" the less significant part of the matrices of utility and to decide only based on the most significant (l,c) sub-matrices – we "bound" the utility matrices. We propose in the present section two algorithms based on this idea, namely a value iteration and a policy iteration algorithm.
4.1. Bounded lexicographic comparisons of utility matrices
Recall that, for any matrix U, [U]_{l,c} denotes the restriction of U to its first l lines and first c columns. Notice now that, at any stage e and for any state s, [U(s)]_{1,1} (i.e. the top-left value in U(s)) is precisely equal to u_opt(s). We have seen that making the choices on this basis is not discriminant enough. On the other hand, taking the whole matrix into account is discriminant, but exponentially costly. Hence the idea of considering more than one line and one column, but less than the whole matrix – namely the first l lines and c columns of U_t(s)^{lmaxlmin}; hence the definition of the following preference:

δ ≥_{lmaxlmin,l,c} δ′ iff [δ^{lmaxlmin}]_{l,c} ≥ [δ′^{lmaxlmin}]_{l,c}.   (18)

≥_{lmaxlmin,1,1} corresponds to ≽_opt and ≥_{lmaxlmin,+∞,+∞} corresponds to ≥_{lmaxlmin}.
The following proposition shows that this approach is sound and that ≻_{lmaxlmin,l,c} refines u_opt:

Proposition 5.
• For any l, l′, c such that l′ > l, δ ≻_{lmaxlmin,l,c} δ′ ⇒ δ ≻_{lmaxlmin,l′,c} δ′.
• For any l, c, δ ≻_opt δ′ ⇒ δ ≻_{lmaxlmin,l,c} δ′.

In other words, the order over the policies is refined for a fixed c when l increases. It tends to ≻_lmaxlmin when c = 2·E+1 and l tends to b^E.

Notice that the combinatorial explosion is due to the number of lines (the number of columns is bounded by 2·E+1), hence we shall bound the number of considered lines only.
Up to this point, the comparison by ≥_{lmaxlmin,l,c} is made on the basis of the first l lines and c columns of the full matrices of trajectories. This obviously does not reduce their size. The important following proposition allows us to make the (l,c) reduction of the ordered matrices at each step (after each composition), and not only at the very end, thus keeping space and time complexities polynomial.

Proposition 6. Let U be an a × b matrix and N_1, ..., N_a be a series of a matrices of dimension a_i × c. It holds that:

[(U × (N_1, ..., N_a))^{lmaxlmin}]_{l,c} = [(U × ([N_1^{lmaxlmin}]_{l,c}, ..., [N_a^{lmaxlmin}]_{l,c}))^{lmaxlmin}]_{l,c}.
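In code, the (l,c) bound is just a slice of the ordered matrix; the following sketch (ours, reusing ordered_matrix and compose from the earlier snippets) shows the truncation operator and the step-wise reduction that Proposition 6 licenses:

```python
def bound(U, l, c):
    """[U]_{l,c}: keep the first l lines and c columns of an ordered matrix."""
    return [row[:c] for row in U[:l]]

def bounded_compose(U, Ns, l, c):
    """Compose with already-bounded successor matrices, then re-order and
    re-bound: by Proposition 6 this equals bounding only at the very end."""
    bounded_Ns = [bound(ordered_matrix(N), l, c) for N in Ns]
    return bound(ordered_matrix(compose(U, bounded_Ns)), l, c)
```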
4.2. Bounded utility lexicographic value iteration
It is now easy to design a generalization of the possibilistic value iteration algorithm (Algorithm 1) by keeping a submatrix of each current value matrix – namely the first l lines and c columns. We call this algorithm Bounded Utility Value Iteration (BU-VI) (see Algorithm 4).
Algorithm 4: Bounded Utility lmax(lmin) Value Iteration (BU-VI).
Data: A possibilistic MDP, bounds (l,c); δ, the policy built by the algorithm, is a global variable
Result: A policy δ optimal for ≽_{lmaxlmin,l,c}
1  begin
2    foreach s ∈ S do U(s) ← ((µ(s)));
3    foreach s ∈ S, a ∈ A do
4      TU_{s,a} ← T_{s,a} × ((µ(s′)), s′ ∈ succ(s,a));
5    repeat
6      foreach s ∈ S do
7        U_old(s) ← U(s);
8        Q* ← ((0));
9        foreach a ∈ A do
10         Future ← (U_old(s′), s′ ∈ succ(s,a)); // gather the matrices provided by the successors of s
11         Q(s,a) ← [(TU_{s,a} × Future)^{lmaxlmin}]_{l,c};
12         if Q* ≤_{lmaxlmin} Q(s,a) then
13           Q* ← Q(s,a);
14           δ(s) ← a;
15       U(s) ← Q*;
16   until U(s) == U_old(s) for each s;
17   return δ;
When the horizon of the MDP is finite, this algorithm provides in polynomial time a policy that is always at least as good as the one provided by u_opt (according to lmax(lmin)) and tends to lexicographic optimality when c = 2·E+1 and l tends to b^E.

Let us now study the time complexity. The number of iterations is bounded by the size of the set of possible matrices of trajectories, which is in O(|S|·|A|·E). One iteration of the algorithm requires composition, ordering and comparison operations on b matrices of size (l,c). Since the composition and comparison of matrices are linear operations, the complexity of one iteration in the worst case is in b·(l·c)·log(l·c). Therefore, the complexity of the algorithm is in O(|S|·|A|·E·b·(l·c)·log(l·c)).

When the horizon of the MDP is not finite, equations (16) and (17) are not enough to rank-order the policies: the length of the trajectories may be infinite, as well as their number. This problem is well known in classical probabilistic MDPs, where a discount factor is used to attenuate the influence of later utility degrees – thus allowing the convergence of the algorithm [21]. On the contrary, classical ΠMDPs do not need any discount factor, and Value Iteration, based on the evaluation for l = c = 1, converges in the infinite horizon case [22]. In a sense, this limitation to l = c = 1 plays the role of a discount factor – but a very drastic one. Extending the comparison by using ≥_{lmaxlmin,l,c} with larger (l,c), as shown below, allows us to use a less drastic discount.

In other terms, ≥_{lmaxlmin,l,c} can be used in the infinite case, as shown by the following proposition.
Proposition 7 (Bounded utility lmax(lmin)-policy evaluation converges). Let U_t(s) be the matrix issued from s at instant t when a strategy δ is executed. It holds that:

∀l, c, ∃t such that ∀t′ ≥ t, (U_t)^{lmaxlmin}_{l,c}(s) = (U_{t′})^{lmaxlmin}_{l,c}(s) ∀s.

Hence there exists a stage t where the value of a policy becomes stable if computed with the bounded utility lmax(lmin) evaluation algorithm. This criterion is thus soundly defined and can be used in the infinite horizon case (and of course in the finite horizon case).

The number of iterations of Algorithm 4 is not explicitly bounded, but the convergence of the algorithm is guaranteed – this is a direct consequence of Proposition 7.

Corollary 1 (Bounded utility lmax(lmin)-value iteration converges). ∀l, c, ∃t such that, ∀t′ ≥ t, (U_t)^{lmaxlmin}_{l,c}(s) = (U_{t′})^{lmaxlmin}_{l,c}(s) ∀s.
The overall complexity of bounded utility lmax(lmin)-value iteration (Algorithm 4) is bounded by O(|S|·|A|·|L|·b·(l·c)·log(l·c)).
4.3. Bounded utility lexicographic policy iteration
In Ref. [17], Howard shows that a policy often becomes optimal long before the convergence of the value estimates. That is why Puterman [21] has proposed a policy iteration algorithm. This algorithm has been adapted to possibilistic MDPs by [22].

Likewise, we propose a (bounded utility) lexicographic policy iteration algorithm (Algorithm 5), denoted here BU-PI, that alternates improvement and evaluation phases, as any policy iteration algorithm.
Algorithm 5: lmax(lmin)-Bounded Utility Policy Iteration.
Data: A possibilistic MDP, bounds (l,c)
Result: A policy δ* optimal when l, c grow
1  begin
2    // Arbitrary initialization of δ on S
3    foreach s ∈ S do δ(s) ← choose any a_s ∈ A_s;
4    repeat
5      // Evaluation of δ until stabilization of U
6      foreach s ∈ S do U(s) ← ((µ(s)));
7      repeat
8        foreach s ∈ S do
9          U_old(s) ← U(s);
10         Future ← (U(s′), s′ ∈ succ(s, δ(s))); // gather the matrices of the successors of s given δ
11         U(s) ← [(TU_{s,δ(s)} × Future)^{lmaxlmin}]_{l,c};
12       until U(s) == U_old(s) for each s;
13     δ_old ← δ;
14     // Improvement of δ
15     foreach s ∈ S do
16       // compute the utility of the strategy playing a (for each a), given what was chosen for the other states
17       foreach a ∈ A do
18         Future ← (U(s′), s′ ∈ succ(s,a)); Q(s,a) ← [(TU_{s,a} × Future)^{lmaxlmin}]_{l,c};
19       // update the choice of an action for s
20       δ(s) ← argmax^{lmax(lmin)}_{a∈A} Q(s,a);
21   until δ == δ_old;
22   return δ;
In line 3 of Algorithm 5, an arbitrary initial policy is chosen. The algorithm then proceeds by evaluating the current policy, through successive updates of the value function (lines 8 to 11); the convergence of this evaluation is easily derived from that of the bounded utility lmax(lmin)-value iteration algorithm. Then the algorithm enters the improvement phase: lines 17–18 compute Q(s,a), the (bounded lexicographic) utility of playing action a in state s and then applying policy δ_old in subsequent states (the policy computed during the last iteration); as usual in Policy Iteration style algorithms, the updated policy (δ) is then obtained by greedily improving the current action, which is done in line 20. Since the actions considered at line 20 do include the one prescribed by δ_old, either nothing is changed, and the algorithm stops, or the new policy, δ, is better than the previous one, δ_old.
Proposition 8. Bounded utility lmax(lmin)-policy iteration converges to an optimal policy for ≽_{lmaxlmin,l,c} in finite time.

Policy iteration (Algorithm 5) converges and is guaranteed to find a policy optimal for the (l,c) lexicographic criterion in finite time, and usually in a few iterations. As for the algorithmic complexity of the classical, stochastic, policy iteration algorithm (which is still not well understood [16]), a tight worst-case complexity bound of lexicographic policy iteration is hard to obtain. Therefore, we provide an upper bound of this complexity.

The policy iteration algorithm never visits a policy twice: in the worst case, the number of iterations before convergence is exponential, but it is dominated by the number of distinct policies. So, the number of iterations of this algorithm is dominated by |A|^{|S|}. Besides, each iteration has a cost: the evaluation phase relies on a bounded utility value iteration algorithm that costs O(|S|·|A|·|L|·b·(l·c)·log(l·c)) when many actions are possible at a given step, and costs O(|S|·|L|·b·(l·c)·log(l·c)) here because one action is selected (by the current policy) for each state. Thus, the overall complexity of the algorithm is in O(|A|^{|S|}·|S|·|L|·b·(l·c)·log(l·c)).
4.4. Back to lexicographic value iteration: from finite to infinite horizon ΠMDPs
The bounded iterations algorithm defined in Section 3 (Algorithm 3, BI-VI) can be used for both finite horizon and infinite horizon MDPs, because it fixes a number of iterations E; if E is low, the policy reached is not necessarily optimal – the algorithm is an approximation algorithm.

Now, exploiting the above propositions, we are able to show that the bounded iterations lmax(lmin) value iteration algorithm (Algorithm 3) converges when E tends to infinity. To do so, we first prove the following proposition:
Proposition 9. Let an arbitrary stationary ΠMDP be given. Then, there exist two positive natural numbers (l*, c*) such that for any pair (δ, δ′) of arbitrary policies, any state s ∈ S, and any pair (l,c) such that l ≥ l* and c ≥ c*:

δ(s) ≻_{lmaxlmin,l*,c*} δ′(s) ⇔ δ(s) ≻_{lmaxlmin,l,c} δ′(s).
Now, this proposition can be used to prove the convergence of the bounded iterations lmax(lmin)-value iteration algorithm. For this, let us define ≻_lmaxlmin =_def ≻_{lmaxlmin,l*,c*}, the unique preference relation between policies that results from Proposition 9.
Proposition 10. If we let δ_E be the policy returned by Algorithm 3 for any fixed E, we can show that the sequence (δ_E) converges and that there exists a finite E* such that:

lim_{E→∞} δ_E = δ_{E*}.

Furthermore, δ_{E*} is optimal with respect to ≻_lmaxlmin.
The sequence of policies obtained for BI-VI (Algorithm 3) when E tends to infinity converges. Furthermore, the limit is attained for a finite (but unknown in advance) E*. Alternately, it is also attained for the BU-VI and BU-PI algorithms, with finite but unknown (l*, c*).

Now, let us summarize the theoretical results that we have obtained so far. We have shown that possibilistic utilities (optimistic and pessimistic) are special cases of bounded lexicographic utilities, which can be represented by matrices. Possibilistic utilities are obtained when l = c = 1.

The possibilistic value iteration and policy iteration algorithms can be extended to compute policies which are optimal according to ≻_{lmaxlmin,l,c}.

Finally, if infinite horizon lexicographic optimal policies are defined as the limiting policies obtained from a non-bounded lexicographic value iteration algorithm, we have shown that such policies can be computed by applying our bounded utility lmax(lmin) value iteration algorithm and that only a finite number of iterations (even though not known in advance) is required.
5. Experiments

In order to evaluate the previous algorithms, we propose, in the following, two experimental analyses: in the first one we compare the bounded iterations value iteration algorithm (Algorithm 3) with the bounded utility one, and in the second we compare the bounded utility lexicographic policy iteration algorithm with the bounded utility lexicographic value iteration one. The algorithms have been implemented in Java and the experiments have been performed on an Intel Core i5 processor computer (1.70 GHz) with 8 GB DDR3L of RAM.
5.1. Bounded utility vs bounded iterations value iteration
Experimental protocol. We now compare the performance of bounded utility lexicographic value iteration (BU-VI) as an approximation of lexicographic value iteration (BI-VI) for finite horizon problems, in the lmax(lmin) variant. Because the horizon is finite, the number of steps of BI-VI can be set equal to the horizon, and the algorithm provides a solution optimal according to lmax(lmin). BU-VI, on the other side, limits the size of the matrices, and can lead to sub-optimal solutions.

We evaluate the performance of the algorithms by carrying out simulations on randomly generated finite horizon ΠMDPs with 25 states – we generate five series of problems, letting E vary from 5 to 25. The number of actions in each state is equal to 4. The output of each action is a distribution on two states randomly sampled (i.e. the branching factor is equal to 2). The utility values are uniformly randomly sampled in the set L = {0.1, 0.3, 0.5, 0.7, 1}. Conditional possibilities relative to decisions should be normalized. To this end, one choice is fixed to possibility degree 1 and the possibility degree of the other one is uniformly sampled in L. For each experiment, 100 ΠMDPs are generated. The two algorithms are compared w.r.t. two measures: (i) CPU time and (ii) pairwise success rate (Success), i.e. the percentage of optimal solutions provided by BU-VI with fixed (l,c) w.r.t. the lmax(lmin) criterion in its full generality.
[Fig. 2. Bounded utility lexicographic value iteration vs lexicographic value iteration.]
Table 1. Average CPU time (in seconds) and average number of iterations.

Bounded utility policy iteration
(l,c)                          (2,2)   (4,4)   (6,6)   (10,10)
CPU time (s)                   0.029   0.042   0.064   0.091
Average number of iterations   3.2     4.33    5.6     9.7

Bounded utility value iteration
(l,c)                          (2,2)   (4,4)   (6,6)   (10,10)
CPU time (s)                   0.03    0.052   0.082   0.1
Average number of iterations   6.75    9.25    16.11   20.2
The higher Success, the more effective the cutting of matrices with BU-VI; the lower this rate, the more important the drowning effect.
Results. Fig. 2(a) presents the average execution CPU time for the two algorithms. Obviously, for both BI-VI and BU-VI, the execution time increases with the horizon. Also, we observe that the CPU time of BU-VI increases according to the values of (l,c) but it remains affordable, as the maximal CPU time is lower than 1 s for MDPs with 25 states and 4 actions when (l,c) = (40,40) and E = 25. Unsurprisingly, we can check that BU-VI (regardless of the values of (l,c)) is faster than BI-VI, especially when the horizon increases: the manipulation of (l,c)-matrices is obviously less expensive than the one of full matrices. The saving increases with the horizon.

As for the success rate, the results are described in Fig. 2(b). It appears that BU-VI provides a very good approximation, especially when increasing (l,c). It provides the same optimal solution as BI-VI in about 90% of cases with (l,c) = (200,200). Moreover, even when the success rate of BU-VI decreases (when E increases), the quality of approximation is still good: never less than 70% of optimal actions returned, with E = 25. These experiments conclude in favor of bounded value iteration: the quality of its approximated solutions is comparable with that of the unbounded version for high (l,c) and increases when (l,c) increases, while it is much faster.
5.2. Bounded utility lexicographic policy iteration vs bounded utility lexicographic value iteration
Experimental protocol. In what follows, we evaluate the performances of bounded utility lexicographic policy iteration (BU-PI) and bounded utility lexicographic value iteration (BU-VI), in the lmax(lmin) variant. We evaluate the performance of the algorithms on randomly generated ΠMDPs as those of Section 5.1, with |S| = 25 and |A_s| = 4, ∀s.

We ran the two algorithms for different values of (l,c) (100 ΠMDPs are considered in each sample). For each of the two algorithms we measure the CPU time needed to converge. We also measure the average number of value iterations for BU-VI and the average number of policy iterations for BU-PI.
Results. Table 1 presents the average execution CPU time and the average number of iterations for the two algorithms.

Obviously, for both BU-PI and BU-VI, the execution time increases according to the values of (l,c) but it remains affordable, as the maximal CPU time is lower than 0.1 s for MDPs with 25 states and 4 actions when (l,c) = (10,10). It appears that BU-PI (regardless of the values of (l,c)) is slightly faster than BU-VI.

Consider now the number of iterations. At each iteration, BU-PI considers one policy, explicitly, and updates it at line 20. And so does value iteration: for each state, the current policy is updated at line 14. Table 1 shows that BU-PI always considers fewer policies than BU-VI. This experiment provides empirical evidence in favor of policy iteration over value iteration, as the former converges to the approximate solution faster. However, this conclusion may vary with the experiments, so both algorithms are worth considering when tackling a given problem.