
HAL Id: hal-02124080

https://hal.archives-ouvertes.fr/hal-02124080

Submitted on 9 May 2019


Lexicographic refinements in stationary possibilistic Markov Decision Processes

Nahla Ben Amor, Zeineb El Khalfi, Hélène Fargier, Régis Sabbadin


This is a publisher's version published in: http://oatao.univ-toulouse.fr/22626

To cite this version: Ben Amor, Nahla and El Khalfi, Zeineb and Fargier, Hélène and Sabbadin, Régis. Lexicographic refinements in stationary possibilistic Markov Decision Processes. (2018) International Journal of Approximate Reasoning, 103, 343-363. ISSN 0888-613X

Official URL / DOI: https://doi.org/10.1016/j.ijar.2018.10.011

Open Archive Toulouse Archive Ouverte (OATAO)


Lexicographic refinements in stationary possibilistic Markov Decision Processes

Nahla Ben Amor a,∗, Zeineb El Khalfi a,b,∗∗, Hélène Fargier b,∗, Régis Sabbadin c,∗

a LARODEC, University of Tunis, Tunisia
b IRIT, UPS-CNRS, Université de Toulouse 3, 118 route de Narbonne, F-31062 Toulouse, France
c MIAT, UR 875, Université de Toulouse, INRA, F-31320 Castanet-Tolosan, France

Abstract

Keywords: Markov Decision Process; Possibility theory; Lexicographic comparisons; Possibilistic qualitative utilities

Possibilistic Markov Decision Processes offer a compact and tractable way to represent and solve problems of sequential decision under qualitative uncertainty. Even though appealing for its ability to handle qualitative problems, this model suffers from the drowning effect that is inherent to possibilistic decision theory. The present paper1 proposes to escape the drowning effect by extending to stationary possibilistic MDPs the lexicographic preference relations defined by Fargier and Sabbadin [13] for non-sequential decision problems. We propose a value iteration algorithm and a policy iteration algorithm to compute policies that are optimal for these new criteria. The practical feasibility of these algorithms is then experimented on different samples of possibilistic MDPs.

This paper is part of the Virtual special issue on the 14th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty (ECSQARU 2017), edited by Alessandro Antonucci, Laurence Cholvy and Odile Papini.
* Corresponding authors.
** Corresponding author at: IRIT, UPS-CNRS, 118 route de Narbonne, 31062 Toulouse, France.
E-mail addresses: nahla.benamor@gmx.com (N. Ben Amor), zeineb.khalfi@gmail.com (Z. El Khalfi), fargier@irit.fr (H. Fargier), regis.sabbadin@inra.fr (R. Sabbadin).
1 This paper is an extended and revised version of two conference papers [4,5]. It includes the full proofs of the propositions presented in these preliminary papers, new algorithms (based on policy iteration) and new experiments.

1. Introduction

The classical paradigm for sequential decision making under uncertainty is the expected utility-based Markov Decision Processes (MDPs) framework [3,21], which assumes that the uncertain effects of actions can be represented by probability distributions and that utilities are additive. But the EU model does not suit problems where uncertainty and preferences are ordinal in essence.

Alternatives to the EU-based model have been proposed to handle ordinal preferences/uncertainty. Remaining within the probabilistic, quantitative, framework while considering ordinal preferences has led to quantile-based approaches [15,18,27,29,33]. Purely ordinal approaches to sequential decision under uncertainty have also been considered. In particular, possibilistic MDPs [1,6,22,24] form a purely qualitative decision model with an ordinal evaluation of plausibility and preference. In this model, uncertainty about the consequences of actions is represented by possibility distributions and utilities are also ordinal. The decision criteria are either the pessimistic qualitative utility or its optimistic counterpart [9]. Such degrees can be either elicited from experts, or obtained by automatic learning approaches [23]. However, it is now well known that possibilistic decision criteria suffer from a drowning effect [13]: plausible enough bad or good consequences may completely blur the comparison between policies that would otherwise be clearly differentiable.

In [13], Fargier and Sabbadin have proposed lexicographic refinements of possibilistic criteria for the one-step decision case, in order to remedy the drowning effect. This work has recently been extended to (finite horizon) possibilistic decision trees [4]. In the present paper, we propose to study the interest of the lexicographic preference relations for stationary possibilistic Markov Decision Processes, a model that is more compact than decision trees and not limited to a finite horizon.

The paper is structured as follows: the next section recalls the background about possibilistic decision theory and stationary possibilistic MDPs, including the drowning effect problem. Section 3 defines the lexicographic comparison of policies and presents a value iteration algorithm which computes a nearly optimal strategy in a limited number of iterations. Then, Section 4 proposes a lexicographic value iteration algorithm and a lexicographic policy iteration algorithm using approximation of utility functions. Lastly, Section 5 presents our experimental results.

2. Background and notations

2.1. Basics of possibilistic decision theory

Most available decision models refer to probability theory for the representation of uncertainty [20,25]. Despite its success, probability theory is not appropriate when numerical information is not available. When information about uncertainty cannot be quantified in a probabilistic way, possibility theory [8,34] is a natural field to consider. The basic component of this theory is the notion of possibility distribution. It is a representation of a state of knowledge of an agent about the state of the world. A possibility distribution π is a mapping from the universe of discourse S (the set of all the possible worlds) to a bounded linearly ordered scale L exemplified (without loss of generality) by the unit interval [0, 1]; we denote the function by π : S → [0, 1].

For a state s ∈ S, π(s) = 1 means that realization s is totally possible and π(s) = 0 means that s is an impossible state. It is generally assumed that there exists at least one state s which is totally possible: π is then said to be normalized.

In the possibilistic framework, extreme forms of knowledge can be captured, namely:

• Complete knowledge, i.e. ∃s s.t. π(s) = 1 and ∀s′ ≠ s, π(s′) = 0.
• Total ignorance, i.e. ∀s ∈ S, π(s) = 1 (all values in S are possible).

From π one can compute the possibility measure Π(A) and the necessity measure N(A) of any event A ⊆ S:

Π(A) = sup_{s ∈ A} π(s),    N(A) = 1 − Π(Ā) = 1 − sup_{s ∉ A} π(s).

Measure Π(A) evaluates to which extent A is consistent with the knowledge represented by π, while N(A) corresponds to the extent to which ¬A is impossible and thus evaluates at which level A is certainly implied by the knowledge.

In decision theory, acts are functions f : S → X, where X is a finite set of outcomes. In possibilistic decision making, an act f can be viewed as a possibility distribution π_f over X [9], where π_f(x) = Π(f⁻¹(x)). In a single stage decision making problem, a utility function u : X → U maps outcomes to utility values in a totally ordered scale U = {u_1, ..., u_n}. This function models the attractiveness of each outcome for the decision-maker.

Under the assumption that the utility scale and the possibility scale are commensurate and purely ordinal (i.e. U = L), Dubois and Prade [9,7] have proposed pessimistic and optimistic decision criteria.

First, the pessimistic criterion was originally proposed by Whalen [30] and it generalizes the Wald criterion [28]. It suits cautious decision makers who are happy when bad consequences are hardly plausible. It summarizes to what extent it is certain (i.e. necessary according to measure N) that the act reaches a good utility. The definition of the pessimistic criterion is as follows [10]:

Definition 1. Given a possibility distribution π over a set of states S and a utility function u on the set of consequences X, the pessimistic utility of an act f is defined by:

u_pes(f) = min_{x_j ∈ X} max(u(x_j), 1 − π_f(x_j)) = min_{s_i ∈ S} max(u(f(s_i)), 1 − π(s_i)).    (1)

Therefore, we can compare two acts f and g on the basis of their pessimistic utilities:

f ⪰_{u_pes} g ⇔ u_pes(f) ≥ u_pes(g).


The second criterion is the optimistic possibilistic criterion originally proposed by Yager [32,31]. This criterion captures the behavior of an adventurous decision maker who is happy as soon as at least one good consequence is highly plausible. It summarizes to what extent it is possible that an act reaches a good utility. The definition of this criterion is as follows [10]:

Definition 2. Given a possibility distribution π over a set of states S and a utility function u on a set of consequences X, the optimistic utility of an act f is defined by:

u_opt(f) = max_{x_j ∈ X} min(u(x_j), π_f(x_j)) = max_{s_i ∈ S} min(u(f(s_i)), π(s_i)).    (2)

Hence, we can compare two acts f and g on the basis of their optimistic utilities:

f ⪰_{u_opt} g ⇔ u_opt(f) ≥ u_opt(g).

Example 1. Let S = {s1, s2} and let f and g be two acts whose utilities of consequences in the states s1 and s2 are listed in the following table, as well as the degrees of possibility of s1 and s2:

            s1    s2
u(f(s))     0.3   0.5
u(g(s))     0.4   0.6
π           1     0.2

Comparing f and g with respect to the pessimistic criterion, we get:

u_pes(f) = min(max(0.3, 0), max(0.5, 0.8)) = 0.3,
u_pes(g) = min(max(0.4, 0), max(0.6, 0.8)) = 0.4.

Thus, g ⪰_{u_pes} f.

Let us now compare the two acts with respect to the optimistic criterion:

u_opt(f) = max(min(0.3, 1), min(0.5, 0.2)) = 0.3,
u_opt(g) = max(min(0.4, 1), min(0.6, 0.2)) = 0.4.

Thus, g ⪰_{u_opt} f.
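To make the two criteria concrete, here is a minimal Python sketch (not part of the paper, whose experiments use a Java implementation) that reproduces the computations of Example 1; the dictionary-based encoding of acts and possibility degrees is an illustrative assumption.

```python
# Pessimistic and optimistic possibilistic utilities (Definitions 1 and 2),
# checked on the data of Example 1.

def u_pes(act, pi):
    """u_pes(f) = min_s max(u(f(s)), 1 - pi(s))."""
    return min(max(act[s], 1 - pi[s]) for s in pi)

def u_opt(act, pi):
    """u_opt(f) = max_s min(u(f(s)), pi(s))."""
    return max(min(act[s], pi[s]) for s in pi)

pi = {"s1": 1.0, "s2": 0.2}   # possibility degrees of the states
f  = {"s1": 0.3, "s2": 0.5}   # u(f(s)) for each state
g  = {"s1": 0.4, "s2": 0.6}   # u(g(s)) for each state

print(u_pes(f, pi), u_pes(g, pi))   # 0.3 0.4 -> g preferred pessimistically
print(u_opt(f, pi), u_opt(g, pi))   # 0.3 0.4 -> g preferred optimistically
```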

It is important to note that while transition probabilities can be estimated through simulations of the process, transition possibilities may not. On the other hand, experts may be involved for the elicitation of the possibility degrees and utilities of transitions. In the possibilistic framework, utility and uncertainty levels can be elicited jointly, by comparison of possibilistic lotteries, for example (e.g. by using certainty equivalents, as in [11]). Simulation can also be used jointly with expert evaluation when the underlying process is too costly to simulate a large number of times: simulation may be used to generate samples on which expert elicitation is applied. Another option is to use a possibilistic reinforcement learning procedure (for more details see [23]), in particular a model-based reinforcement learning algorithm. The latter uses a uniform simulation of trajectories (with random choice of actions) in order to generate an approximation of the possibilistic decision model.

2.2. Stationary Possibilistic Markov Decision Processes

A stationary Possibilistic Markov Decision Process (ΠMDP) [22] is defined by:

• A finite set S of states;
• A finite set A of actions; A_s denotes the set of actions available in state s;
• A possibilistic transition function: for each action a ∈ A_s and each state s ∈ S, the possibility distribution π(s′|s, a) evaluates to what extent each s′ is a possible successor of s when action a is applied;
• A utility function µ: µ(s) is the intermediate satisfaction degree obtained in state s.


Fig. 1. The stationary ΠMDP of Example 2.

Example 2. Let us suppose that a "Rich and Unknown" person runs a startup company. Initially, s/he must choose between Saving money (Sav) or Advertising (Adv) and may then get Rich (R) or Poor (P) and Famous (F) or Unknown (U). In the other states, Sav is the only possible action. Fig. 1 shows the stationary ΠMDP that captures this problem, formally described as follows:

S = {RU, RF, PU},
A_RU = {Adv, Sav}, A_RF = A_PU = {Sav},
π(PU|RU, Sav) = 0.2,
π(RU|RU, Sav) = π(RF|RU, Adv) = π(RF|RF, Sav) = π(RU|RF, Sav) = 1,
µ(RU) = 0.5, µ(RF) = 0.7, µ(PU) = 0.3.
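For illustration, the ΠMDP of Example 2 can be encoded with plain Python dictionaries as in the following sketch (an assumption of this presentation, not the paper's implementation). The self-loop π(PU|PU, Sav) = 1 is not listed above; it is only inferred here from trajectory τ1 of Example 3 below.

```python
# A possible plain-Python encoding of the stationary ΠMDP of Example 2.

S = ["RU", "RF", "PU"]
A = {"RU": ["Adv", "Sav"], "RF": ["Sav"], "PU": ["Sav"]}

# pi[(s, a)] maps each possible successor s' to its possibility degree
pi = {
    ("RU", "Sav"): {"RU": 1.0, "PU": 0.2},
    ("RU", "Adv"): {"RF": 1.0},
    ("RF", "Sav"): {"RF": 1.0, "RU": 1.0},
    ("PU", "Sav"): {"PU": 1.0},   # assumption: inferred from Example 3
}

mu = {"RU": 0.5, "RF": 0.7, "PU": 0.3}   # intermediate satisfaction degrees
```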

Solving a stationary MDP consists in finding a (stationary) policy, i.e. a function δ : S → A, which is optimal with respect to a decision criterion. In the possibilistic case, as in the probabilistic case, the value of a policy depends on the utility and on the likelihood of its trajectories. Formally, let Δ be the set of all policies that can be built for the ΠMDP (the set of all the functions that associate an element of A_s to each s). Each δ ∈ Δ defines a list of scenarios called trajectories. Each trajectory τ is a sequence of states and actions, i.e. τ = (s_0, a_0, s_1, ..., s_{t−1}, a_{t−1}, s_t).

To simplify notations, we will associate the vector v_τ = (µ_0, π_1, µ_1, π_2, ..., π_t, µ_t) to each trajectory τ, where π_{i+1} = π(s_{i+1}|s_i, a_i) is the possibility degree to reach the state s_{i+1} at t = i + 1 when applying the action a_i at t = i, and µ_i = µ(s_i) is the utility obtained in the i-th state s_i of the trajectory.

The possibility and the utility of trajectory τ, given that δ is applied from s_0, are defined by:

π(τ | s_0, δ) = min_{i=1...t} π(s_i | s_{i−1}, δ(s_{i−1}))   and   µ(τ) = min_{i=0...t} µ(s_i).    (3)

Two criteria, an optimistic and a pessimistic one, can then be used to evaluate δ [24,9]:

u_opt(δ, s_0) = max_τ min{π(τ | s_0, δ), µ(τ)},    (4)

u_pes(δ, s_0) = min_τ max{1 − π(τ | s_0, δ), µ(τ)}.    (5)
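As a sketch of Eqs. (3)-(5), the following hypothetical Python functions enumerate the trajectories of a stationary policy δ from s_0 over a finite horizon and aggregate them optimistically and pessimistically; they assume the dictionary encoding sketched after Example 2.

```python
# Enumerate trajectories of a stationary policy and evaluate u_opt / u_pes.

def trajectories(pi, mu, s0, delta, horizon):
    """Yield (possibility, utility) for every trajectory of `horizon` steps from s0."""
    def rec(s, poss, util, steps):
        if steps == 0:
            yield poss, util
        else:
            a = delta[s]
            for s_next, p in pi[(s, a)].items():
                yield from rec(s_next, min(poss, p), min(util, mu[s_next]), steps - 1)
    yield from rec(s0, 1.0, mu[s0], horizon)

def u_opt_policy(pi, mu, s0, delta, horizon):
    return max(min(p, u) for p, u in trajectories(pi, mu, s0, delta, horizon))

def u_pes_policy(pi, mu, s0, delta, horizon):
    return min(max(1.0 - p, u) for p, u in trajectories(pi, mu, s0, delta, horizon))

# With delta = {"RU": "Sav", "RF": "Sav", "PU": "Sav"} and horizon 2,
# u_opt_policy(pi, mu, "RU", delta, 2) returns 0.5, as in Example 3.
```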

The policies optimizing these criteria can be computed by applying, for every state s and time step i = 0, ..., t, the following counterparts of the Bellman updates [22]:

u_opt(s, i) ← max_{a ∈ A_s} min{ µ(s), max_{s′ ∈ S} min(π(s′|s, a), u_opt(s′, i+1)) },    (6)

u_pes(s, i) ← max_{a ∈ A_s} min{ µ(s), min_{s′ ∈ S} max(1 − π(s′|s, a), u_pes(s′, i+1)) },    (7)

δ_opt(s, i) ← arg max_{a ∈ A_s} min{ µ(s), max_{s′ ∈ S} min(π(s′|s, a), u_opt(s′, i+1)) },    (8)

δ_pes(s, i) ← arg max_{a ∈ A_s} min{ µ(s), min_{s′ ∈ S} max(1 − π(s′|s, a), u_pes(s′, i+1)) },    (9)

where we set, arbitrarily, u_opt(s′, t+1) = 1 and u_pes(s′, t+1) = 1.

These updates have allowed the definition of a (possibilistic) value iteration algorithm (see Algorithm 1 for the optimistic variant of this algorithm), which converges to an optimal policy in polytime [22].

This algorithm proceeds by iterated modifications of a possibilistic value function Q(s, a) which evaluates the "utility" (pessimistic or optimistic) of performing a in s.


Algorithm 1: VI-MDP: Possibilistic (Optimistic) Value Iteration.
Data: A stationary ΠMDP
Result: A policy δ optimal for u_opt
1  begin
2    foreach s ∈ S do u_opt(s) ← µ(s);
3    repeat
4      foreach s ∈ S do
5        u_old(s) ← u_opt(s);
6        foreach a ∈ A do
7          Q(s, a) ← min{ µ(s), max_{s′ ∈ S} min(π(s′|s, a), u_opt(s′)) };
8        u_opt(s) ← max_a Q(s, a);
9        δ(s) ← arg max_a Q(s, a);
10   until u_opt(s) == u_old(s) for each s;
11   return δ;
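A rough Python transcription of Algorithm 1 is sketched below (the paper's experiments rely on a Java implementation, so this is only illustrative); it assumes the dictionary encoding used in the earlier sketches and performs synchronous sweeps, which still reach the same fixed point.

```python
# Possibilistic (optimistic) value iteration, in the spirit of Algorithm 1.

def possibilistic_value_iteration(S, A, pi, mu):
    u = {s: mu[s] for s in S}              # line 2: u_opt(s) <- mu(s)
    delta = {s: A[s][0] for s in S}        # arbitrary initial choice
    while True:
        u_old = dict(u)
        for s in S:
            best_a, best_q = None, -1.0
            for a in A[s]:
                # line 7: Q(s,a) = min(mu(s), max_{s'} min(pi(s'|s,a), u(s')))
                q = min(mu[s],
                        max(min(pi[(s, a)].get(sp, 0.0), u_old[sp]) for sp in S))
                if q > best_q:
                    best_a, best_q = a, q
            u[s], delta[s] = best_q, best_a   # lines 8-9
        if u == u_old:                        # line 10: stabilization
            return delta, u
```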

Algorithm 2: PI-MDP: Possibilistic (Optimistic) Policy Iteration.
Data: A stationary ΠMDP
Result: A policy δ optimal for u_opt
1  begin
2    // Initialization of δ and u_opt
3    foreach s ∈ S do
4      δ(s) ← choose any a_s ∈ A_s;
5      u_opt(s) ← µ(s);
6    repeat
7      // Evaluation of δ until stabilization of u_opt
8      repeat
9        foreach s ∈ S do
10         u_old(s) ← u_opt(s);
11         u_opt(s) ← min{ µ(s), max_{s′ ∈ S} min(π(s′|s, δ(s)), u_old(s′)) };
12     until u_opt == u_old;
13     // Improvement of δ
14     foreach s ∈ S do
15       δ_old(s) ← δ(s);
16       δ(s) ← arg max_{a ∈ A} min{ µ(s), max_{s′ ∈ S} min(π(s′|s, a), u_opt(s′)) };
17   until δ(s) == δ_old(s) for each s;
18   // stabilization of δ
19   return δ;

2.3. The drowning effect in stationary sequential decision problems

Unfortunately, possibilistic utilities suffer from an important drawback called the drowning effect: plausible enough bad or good consequences may completely blur the comparison between acts that would otherwise be clearly differentiated; as a consequence, an optimal policy δ is not necessarily Pareto efficient. Recall that a policy δ is Pareto efficient when no other policy δ′ dominates it (i.e. there is no policy δ′ such that (i) ∀s ∈ S, u_pes(δ′, s) ⪰ u_pes(δ, s) and (ii) ∃s ∈ S s.t. u_pes(δ′, s) ≻ u_pes(δ, s)). The following example shows that it can simultaneously happen that δ′ dominates δ and u_pes(δ) = u_pes(δ′).

Example 3. The ΠMDP of Example 2 admits two policies δ and δ′:

• δ(RU) = Sav; δ(PU) = Sav; δ(RF) = Sav;
• δ′(RU) = Adv; δ′(PU) = Sav; δ′(RF) = Sav.

Consider a fixed horizon H = 2:

• δ has 3 trajectories:
  τ1 = (RU, PU, PU) with v_τ1 = (0.5, 0.2, 0.3, 1, 0.3);
  τ2 = (RU, RU, PU) with v_τ2 = (0.5, 1, 0.5, 0.2, 0.3);
  τ3 = (RU, RU, RU) with v_τ3 = (0.5, 1, 0.5, 1, 0.5).
• δ′ has 2 trajectories:
  τ4 = (RU, RF, RF) with v_τ4 = (0.5, 1, 0.7, 1, 0.7);
  τ5 = (RU, RF, RU) with v_τ5 = (0.5, 1, 0.7, 1, 0.5).

Thus u_opt(δ, RU) = u_opt(δ′, RU) = 0.5. However, δ′ seems better than δ since it provides utility 0.5 for sure while δ provides a bad utility (0.3) in some non-impossible trajectories (τ1 and τ2). τ3, which is good and totally possible, "drowns" τ1 and τ2: δ is considered as good as δ′.

3. Bounded iterations solutions to lexicographic finite horizon ΠMDPs

Possibilistic decision criteria, especially pessimistic and optimistic utilities, are simple and realistic, as illustrated in Section 2, but they have an important shortcoming: the principle of Pareto efficiency is violated, since these criteria suffer from the drowning effect. Indeed, one decision may dominate another one while not being strictly preferred. In order to overcome the drowning effect, some refinements of possibilistic utilities have been proposed in the non-sequential case, such as the lexicographic refinements proposed by [12,13]. These refinements are fully in accordance with ordinal utility theory and satisfy the principle of Pareto dominance, which is why we have chosen to focus on them.

The present section defines an extension of lexicographic refinements to finite horizon possibilistic Markov decision processes and proposes a value iteration algorithm that looks for policies optimal with respect to these criteria.

3.1. Lexi-refinements of ordinal aggregations

In ordinal (i.e. min-based and max-based) aggregation, a solution to the drowning effect based on leximin and leximax comparisons has been proposed by [19]. It has then been extended to non-sequential decision making under uncertainty [13] and, in the sequential case, to decision trees [4]. Let us first recall the basic definition of these two preference relations. For any two vectors t and t′ of length m built on the scale L:

t ⪰_lmin t′ iff ∀i, t_σ(i) = t′_σ(i), or ∃i*, ∀i < i*, t_σ(i) = t′_σ(i) and t_σ(i*) > t′_σ(i*),    (10)

t ⪰_lmax t′ iff ∀i, t_µ(i) = t′_µ(i), or ∃i*, ∀i < i*, t_µ(i) = t′_µ(i) and t_µ(i*) > t′_µ(i*),    (11)

where, for any vector v (here, v = t or v = t′), v_µ(i) (resp. v_σ(i)) is the i-th best (resp. worst) element of v.
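In Python, the leximin and leximax comparisons of Eqs. (10)-(11) reduce to lexicographic comparison of sorted copies of the vectors; the following sketch (illustrative, not from the paper) makes this explicit.

```python
# Leximin / leximax comparison of vectors on a common scale.

def leximin_key(v):
    """Sort increasingly: the worst components are compared first."""
    return tuple(sorted(v))

def leximax_key(v):
    """Sort decreasingly: the best components are compared first."""
    return tuple(sorted(v, reverse=True))

def leximin_prefers(t, t_prime):
    """True iff t is strictly preferred to t' for leximin."""
    return leximin_key(t) > leximin_key(t_prime)

def leximax_prefers(t, t_prime):
    """True iff t is strictly preferred to t' for leximax."""
    return leximax_key(t) > leximax_key(t_prime)

# (0.5, 0.5, 1) and (0.5, 1, 0.5) are leximin-indifferent, while
# (0.5, 0.7, 1) is strictly leximin-preferred to (0.5, 0.5, 1):
# the worst values tie (0.5), the second worst discriminates (0.7 > 0.5).
```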

[13,4] have extended these procedures to the comparison of matrices built on L, defining preference relations ⪰_lmin(lmax) and ⪰_lmax(lmin):

A ⪰_lmin(lmax) B ⇔ ∀j, a_(lmax,j) ∼_lmax b_(lmax,j), or ∃i s.t. ∀j > i, a_(lmax,j) ∼_lmax b_(lmax,j) and a_(lmax,i) ≻_lmax b_(lmax,i),    (12)

A ⪰_lmax(lmin) B ⇔ ∀j, a_(lmin,j) ∼_lmin b_(lmin,j), or ∃i s.t. ∀j < i, a_(lmin,j) ∼_lmin b_(lmin,j) and a_(lmin,i) ≻_lmin b_(lmin,i),    (13)

where a_(r,i) (resp. b_(r,i)) is the i-th largest sub-vector of A (resp. B) according to r ∈ {lmax, lmin}.

Like in (finite-horizon) possibilistic decision trees [4], our idea is to identify the strategies of the MDP with the matrices of their trajectories, and to compare such matrices with a ⪰_lmax(lmin) (resp. ⪰_lmin(lmax)) procedure for the optimistic (resp. pessimistic) case.

3.2. Lexicographic comparisons of policies

Let us first define lexicographic comparisons of policies over a given horizon E.

A trajectory over horizon E being a sequence of states and actions, any stationary policy can be identified with a matrix where each line corresponds to a distinct trajectory of length E. In the optimistic case each line corresponds to a vector v_τ = (µ_0, π_1, µ_1, π_2, ..., π_E, µ_E) and in the pessimistic case to w_τ = (µ_0, 1−π_1, µ_1, 1−π_2, ..., 1−π_E, µ_E).

This allows us to define the comparison of trajectories using leximax and leximin as follows:

τ ⪰_lmin τ′ iff (µ_0, π_1, ..., π_E, µ_E) ⪰_lmin (µ′_0, π′_1, ..., π′_E, µ′_E),    (14)

τ ⪰_lmax τ′ iff (µ_0, 1 − π_1, ..., 1 − π_E, µ_E) ⪰_lmax (µ′_0, 1 − π′_1, ..., 1 − π′_E, µ′_E).    (15)


Using (14) and (15), we can compare policies by:

δ ⪰_lmax(lmin) δ′ iff ∀i, τ_µ(i) ∼_lmin τ′_µ(i), or ∃i*, ∀i < i*, τ_µ(i) ∼_lmin τ′_µ(i) and τ_µ(i*) ≻_lmin τ′_µ(i*),    (16)

δ ⪰_lmin(lmax) δ′ iff ∀i, τ_σ(i) ∼_lmax τ′_σ(i), or ∃i*, ∀i < i*, τ_σ(i) ∼_lmax τ′_σ(i) and τ_σ(i*) ≻_lmax τ′_σ(i*),    (17)

where τ_µ(i) (resp. τ′_µ(i)) is the i-th best trajectory of δ (resp. δ′) according to ⪰_lmin, and τ_σ(i) (resp. τ′_σ(i)) is the i-th worst trajectory of δ (resp. δ′) according to ⪰_lmax.

Hence, the utility degree of a policy δ can be represented by a matrix U_δ with n lines, where n is the number of trajectories, and m = 2E + 1 columns. Indeed, comparing two policies w.r.t. ⪰_lmax(lmin) (resp. ⪰_lmin(lmax)) consists in first ordering the two corresponding matrices of trajectories as follows:

• order the elements of each trajectory (i.e. the elements of each line) in increasing order w.r.t. ⪰_lmin (resp. in decreasing order w.r.t. ⪰_lmax);
• then order all the trajectories: the lines of each policy are arranged lexicographically top-down in decreasing order (resp. top-down in increasing order).

Then, it is enough to lexicographically compare the two new matrices of trajectories, denoted U_δ (resp. U_δ′), element by element. The first pair of different elements determines the best matrix/policy. Note that the ordered matrix U_δ (resp. U_δ′) can be seen as the utility of applying policy δ (resp. δ′) over a length E horizon.
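The ordering and comparison procedure just described can be sketched as follows for the lmax(lmin) case (an illustrative Python fragment, not the paper's code); Example 4 below can be checked against it.

```python
# Order a matrix of trajectory vectors for lmax(lmin) and compare two matrices.

def order_lmaxlmin(matrix):
    rows = [tuple(sorted(row)) for row in matrix]  # sort each line increasingly (leximin rearrangement)
    rows.sort(reverse=True)                        # best trajectories first
    return rows

def lmaxlmin_prefers(U, V):
    """True iff U is strictly better than V; scanning stops at the shorter matrix,
    the tie-breaking rule for matrices of different sizes is omitted here."""
    for row_u, row_v in zip(order_lmaxlmin(U), order_lmaxlmin(V)):
        for x, y in zip(row_u, row_v):
            if x != y:
                return x > y
    return False   # identical on the scanned part -> indifference

# On the trajectory matrices of Example 4 below,
# lmaxlmin_prefers(U_delta_prime, U_delta) evaluates to True.
```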

Example 4. Let us consider again the counter-example of Example 3, with the same ΠMDP of Example 2. We consider, once again, the policies δ and δ′ defined by:

• δ(RU) = Sav; δ(PU) = Sav; δ(RF) = Sav;
• δ′(RU) = Adv; δ′(PU) = Sav; δ′(RF) = Sav.

For horizon H = 2:

• δ has 3 trajectories:
  τ1 = (RU, PU, PU) with v_τ1 = (0.5, 0.2, 0.3, 1, 0.3);
  τ2 = (RU, RU, PU) with v_τ2 = (0.5, 1, 0.5, 0.2, 0.3);
  τ3 = (RU, RU, RU) with v_τ3 = (0.5, 1, 0.5, 1, 0.5).

The matrix of trajectories is:

U_δ =
  0.5  0.2  0.3  1    0.3
  0.5  1    0.5  0.2  0.3
  0.5  1    0.5  1    0.5

which, after sorting each line increasingly, becomes

  0.2  0.3  0.3  0.5  1
  0.2  0.3  0.5  0.5  1
  0.5  0.5  0.5  1    1

So, the ordered matrix of trajectories is:

U_δ =
  0.5  0.5  0.5  1    1
  0.2  0.3  0.3  0.5  1
  0.2  0.3  0.5  0.5  1

• δ′ has 2 trajectories:
  τ4 = (RU, RF, RF) with v_τ4 = (0.5, 1, 0.7, 1, 0.7);
  τ5 = (RU, RF, RU) with v_τ5 = (0.5, 1, 0.7, 1, 0.5).

The ordered matrix of trajectories is:

U_δ′ =
  0.5  0.7  0.7  1  1
  0.5  0.5  0.7  1  1

Given the two ordered matrices U_δ and U_δ′, δ and δ′ are indifferent for optimistic utility since the two first (i.e. top-left) elements of the matrices are equal, i.e. u_opt(δ) = u_opt(δ′) = 0.5. For lmax(lmin) we compare successively the next elements (left to right, then top to bottom) until we find a pair of different values. In particular, the second element of the first (i.e. the best) trajectory of δ′ is strictly greater than the second element of the first trajectory of δ (0.7 > 0.5). So, the first trajectory of δ′ is strictly preferred to the first trajectory of δ according to ⪰_lmin. We deduce that δ′ is strictly preferred to δ:

δ′ ≻_lmax(lmin) δ since (0.5, 0.7, 0.7, 1, 1) ≻_lmin (0.5, 0.5, 0.5, 1, 1).


Proposition 1.

If u_opt(δ) > u_opt(δ′) then δ ≻_lmax(lmin) δ′.
If u_pes(δ) > u_pes(δ′) then δ ≻_lmin(lmax) δ′.

Proposition 2. ⪰_lmax(lmin) and ⪰_lmin(lmax) satisfy the principle of Pareto efficiency.

Now, in order to design dynamic programming algorithms, i.e. to extend the value iteration algorithm to lexicographic comparison, we show that the comparison of policies is a preorder and satisfies the principle of strict monotonicity, defined as follows for any optimization criterion O: ∀δ, δ′, δ″ ∈ Δ,

δ ⪰_O δ′ ⟺ δ + δ″ ⪰_O δ′ + δ″,

where δ (resp. δ′) and δ″ denote two disjoint sets of trajectories and δ + δ″ (resp. δ′ + δ″) is the set of trajectories that gathers the ones of δ (resp. δ′) and the ones of δ″.

Then, adding or removing identical trajectories to two sets of trajectories does not change their comparison by ⪰_lmax(lmin) (resp. ⪰_lmin(lmax)).

Proposition 3. Relations ⪰_lmin(lmax) and ⪰_lmax(lmin) are complete, transitive and satisfy the principle of strict monotonicity.

Note that u_opt and u_pes satisfy only a weak form of monotonicity, since the addition or the removal of trajectories may transform a strict preference into an indifference if u_opt or u_pes is used.

Let us define the complementary MDP (S, A, π, µ̄) of a given ΠMDP (S, A, π, µ), where µ̄(s) = 1 − µ(s), ∀s ∈ S. The complementary MDP simply gives complementary utilities. From the definitions of ⪰_lmax and ⪰_lmin, we can check that:

Proposition 4. τ ⪰_lmax τ′ ⇔ τ̄′ ⪰_lmin τ̄, and δ ⪰_lmin(lmax) δ′ ⇔ δ̄′ ⪰_lmax(lmin) δ̄,

where τ̄ and δ̄ are obtained by replacing µ with µ̄ in the trajectory/ΠMDP.

Therefore, all the results which we will prove for ⪰_lmax(lmin) also hold for ⪰_lmin(lmax), if we take care to apply them to complementary policies. Since considering ⪰_lmax(lmin) involves less cumbersome expressions (no 1 − ·), we will give the results for this criterion. A consequence of Proposition 4 is that the results hold for the pessimistic criterion as well.

This monotonicity of the lmin(lmax) and lmax(lmin) criteria is sufficient to allow us to use a dynamic programming algorithm such as value iteration or policy iteration [2]. The algorithms we propose in the present paper perform explicit Bellman updates in the lexicographic framework (lines 12-13 of Algorithms 3 and 4, line 11 of Algorithm 5); the correctness of their use is proved in Propositions 6 to 10.

3.3. Basic operations on matrices of trajectories

Before going further, in order to give more explicit and compact descriptions of the algorithms and the proofs, let us introduce the following notations and some basic operations on matrices (typically, on the matrix U(s) representing trajectories issued from state s). Abusing notations slightly, we identify trajectories τ (resp. policies) with their v_τ vectors (resp. matrices of v_τ vectors) when there is no ambiguity. For any matrix U, [U]_{l,c} denotes the restriction of U to its first l lines and first c columns, and U_{i,j} denotes the element at line i and column j.

Composition: Let U be an a × b matrix and N_1, ..., N_a be a series of a matrices of dimension n_i × c (they all share the same number of columns). The composition of U with (N_1, ..., N_a), denoted U × (N_1, ..., N_a), is a matrix of dimension (Σ_{1≤i≤a} n_i) × (b + c). For any i ≤ a, j ≤ n_i, the ((Σ_{i′<i} n_{i′}) + j)-th line of U × (N_1, ..., N_a) is the concatenation of the i-th line of U and the j-th line of N_i.

The composition U × (N_1, ..., N_a) is done in O(n·m) operations, where n = Σ_{1≤i≤a} n_i and m = b + c. The matrix U(s), the matrix of trajectories out of state s when making decision a, is typically the concatenation of the matrix U = ((π(s′|s, a), µ(s′)), s′ ∈ succ(s, a)) with the matrices N_{s′} = U(s′). This procedure adds two columns to each matrix U(s′), filled with π(s′|s, a) and µ(s′), the possibility degree and the utility of reaching s′; then the matrices are vertically concatenated to get the matrix U(s) when making decision a. Then it is possible to lexicographically compare the resulting matrices in order to get the optimal action in state s.

Ordering matrices: Let U be an n × m matrix; U^lmaxlmin is the matrix obtained by ordering the elements of the lines of U in increasing order, and then ordering the lines themselves top-down in decreasing lexicographic order (as described in Section 3.2).


Comparison of ordered matrices: Given two ordered matrices U^lmaxlmin and V^lmaxlmin, we say that U^lmaxlmin > V^lmaxlmin iff ∃i, j such that ∀i′ < i, ∀j′, U^lmaxlmin_{i′,j′} = V^lmaxlmin_{i′,j′}, and ∀j′ < j, U^lmaxlmin_{i,j′} = V^lmaxlmin_{i,j′}, and U^lmaxlmin_{i,j} > V^lmaxlmin_{i,j}. U^lmaxlmin ∼ V^lmaxlmin iff they are identical (comparison complexity: O(n·m)). Once the matrices Q(s, a) are ordered, the lexicographic comparison of two decisions is performed by scanning the elements of their matrices, line by line from the first one. The first pair of different values determines the best matrix and the best corresponding action a is selected (see Example 4).

If the policies (or sub-policies) have different numbers of trajectories, the comparison of two matrices is based on the number of trajectories of the shortest matrix. Two cases may arise:

• If we have a strict preference between the two matrices before reaching the last line of the shortest matrix, we get a strict preference between the policies (or between the sub-policies).
• If we have an indifference up to the last line, the shortest matrix is the best for the lexicographic criterion, since it expresses less uncertainty in the corresponding policy (or in the sub-policy).
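The Composition operation above can be sketched in a few lines of Python (an illustrative layout where matrices are lists of rows; not the paper's implementation):

```python
# Composition of a matrix U (a x b) with a list of matrices Ns (each n_i x c).

def compose(U, Ns):
    """Prefix the i-th line of U to every line of Ns[i] and stack the results."""
    assert len(U) == len(Ns)
    result = []
    for u_row, N in zip(U, Ns):
        for n_row in N:
            result.append(list(u_row) + list(n_row))
    return result

# Typical use during an update of state s under action a: U holds one line
# (pi(s'|s,a), mu(s')) per successor s', and Ns[i] is the trajectory matrix U(s')
# already computed for that successor; the composed matrix is then ordered and,
# in the bounded algorithms of Section 4, truncated to its first l lines and c columns.
```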

3.4. Bounded iterations lexicographic value iteration

In this section, we propose an iterative value iteration-type algorithm (Algorithm 3). This algorithm follows the same principle as in the possibilistic case (Eqs. (6)-(9)). Repeated Bellman updates are performed successively E times. This algorithm will provide an approximation of a lexicographically optimal strategy in the infinite horizon case (by considering the policy returned for the first time step). This algorithm is sub-optimal for any fixed E, but we will see in Section 4 that, letting E grow, an optimal lexicographic policy will be obtained for finite E.

We propose two versions of the value iteration algorithm: the first one computes the optimal policy with respect to the lmax(lmin) criterion and the second one provides the optimal policy with respect to the lmin(lmax) criterion. In this paper, we present and detail only the first algorithm, since the second is very similar.2

Algorithm 3: Bounded iterations lmax(lmin)-value iteration (BI-VI).
Data: A possibilistic MDP and maximum number of iterations E
Result: The δ_E strategy obtained after E iterations
1  begin
2    e ← 0;
3    foreach s ∈ S do U(s) ← ((µ(s)));
4    foreach s ∈ S, a ∈ A do
5      TU_{s,a} ← T_{s,a} × ((µ(s′)), s′ ∈ succ(s, a));
6    repeat
7      e ← e + 1;
8      foreach s ∈ S do
9        U_old(s) ← U(s);
10       Q ← ((0));
11       foreach a ∈ A do
12         Future ← (U_old(s′), s′ ∈ succ(s, a));  // gather the matrices provided by the successors of s
13         Q(s, a) ← (TU_{s,a} × Future)^lmaxlmin;
14         if Q ≺_lmaxlmin Q(s, a) then
15           Q ← Q(s, a);
16           δ(s) ← a;
17       U(s) ← Q(s, δ(s));
18   until e == E;
19   δ(s) ← arg max_a Q(s, a);
20   return δ_E = δ;

This algorithm is an iterative procedure that performs a prescribed number of updates, E, of the utility of each state, represented by a finite matrix of trajectories, using the utilities of the neighboring states.

At stage 1 ≤ e ≤ E, the procedure updates the utility of every state s ∈ S as follows:

• For each action a ∈ A, a matrix Q(s, a) is built to evaluate the "utility" of performing a in s at stage e: this is done by combining TU_{s,a} (combination of the transition matrix T_{s,a} = π(·|s, a) and the utilities µ(s′) of the states s′ that may follow when a is executed) with the matrices U_old(s′) of trajectories provided by these s′ at the previous stage. The matrix Q(s, a) is then ordered (the operation is made less complex by the fact that the matrices U_old(s′) have already been ordered at e − 1).
• The lmax(lmin) comparison is performed on the fly to memorize the best Q(s, a).
• The value of state s at stage e, U(s), is the one given by the action a which provides the best Q(s, a). δ is updated, U is memorized (and U_old can be discarded).

Time and space complexities of this algorithm are nevertheless expensive, since it eventually memorizes all the trajectories. At each step e, its size may grow to b^e · (2·e + 1), where b is the maximal number of possible successors of an action; the overall complexity of the algorithm is O(|S| · |A| · E · b^E), which is a problem.

Algorithm 3 is provided with a number of iterations, E. Does it converge when E tends to infinity? That is, are the returned policies identical for any E exceeding a given threshold? Before answering (positively) this question in Section 4.4, we are going to define bounded utility matrix solutions to lexicographic possibilistic MDPs. These solution concepts will be useful to answer the above question.

4. Bounded utility solutions to lexicographic ΠMDPs

We have just proposed a lexicographic value iteration algorithm for the computation of lexicographic policies based on the whole matrices of trajectories. As a consequence, the spatial/temporal complexity of the algorithm is exponential in the number of iterations. This section presents an alternative way to get lexicographic policies. Rather than limiting the size of the matrices of trajectories by limiting the number of iterations, we propose to "forget" the less significant part of the matrices of utility and to decide only based on the most significant (l, c) sub-matrices - we "bound" the utility matrices. We propose in the present section two algorithms based on this idea, namely a value iteration and a policy iteration algorithm.

4.1. Bounded lexicographic comparisons of utility matrices

Recall that, for any matrix U, [U]_{l,c} denotes the restriction of U to its first l lines and first c columns. Notice now that, at any stage e and for any state s, [U(s)]_{1,1} (i.e. the top left value in U(s)) is precisely equal to u_opt(s). We have seen that making the choices on this basis is not discriminant enough. On the other hand, taking the whole matrix into account is discriminant, but exponentially costly. Hence the idea of considering more than one line and one column, but less than the whole matrix - namely the first l lines and c columns of U_t(s)^lmaxlmin; hence the definition of the following preference:

δ ≥_{lmaxlmin,l,c} δ′ iff [δ^lmaxlmin]_{l,c} ≥ [δ′^lmaxlmin]_{l,c}.    (18)

≥_{lmaxlmin,1,1} corresponds to ⪰_opt and ≥_{lmaxlmin,+∞,+∞} corresponds to ≥_lmaxlmin.

The following proposition shows that this approach is sound and that ≻_{lmaxlmin,l,c} refines u_opt:

Proposition 5.

For any l, l′, c such that l′ > l, δ ≻_{lmaxlmin,l,c} δ′ ⇒ δ ≻_{lmaxlmin,l′,c} δ′.
For any l, c, δ ≻_opt δ′ ⇒ δ ≻_{lmaxlmin,l,c} δ′.

In other words, the order over the policies is refined, for a fixed c, when l increases. It tends to ≻_lmaxlmin when c = 2·E + 1 and l tends to b^E.

Notice that the combinatorial explosion is due to the number of lines (the number of columns is bounded by 2·E + 1), hence we shall bound the number of considered lines only.

Up to this point, the comparison by ≥_{lmaxlmin,l,c} is made on the basis of the first l lines and c columns of the full matrices of trajectories. This does obviously not reduce their size. The important following proposition allows us to make the (l, c) reduction of the ordered matrices at each step (after each composition), and not only at the very end, thus keeping space and time complexities polynomial.

Proposition 6. Let U be an a × b matrix and N_1, ..., N_a be a series of a matrices of dimension a_i × c. It holds that:

[(U × (N_1, ..., N_a))^lmaxlmin]_{l,c} = [(U × ([N_1^lmaxlmin]_{l,c}, ..., [N_a^lmaxlmin]_{l,c}))^lmaxlmin]_{l,c}.
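Operationally, the (l, c) bounding combines the ordering of Section 3.3 with a simple truncation; the sketch below (illustrative, with matrices as lists of rows) shows the [·]_{l,c} restriction that, by Proposition 6, can already be applied to the successors' matrices after every composition.

```python
# Bounded lmax(lmin) utility: order a trajectory matrix, keep l lines and c columns.

def order_lmaxlmin(matrix):
    rows = [tuple(sorted(row)) for row in matrix]  # sort each line increasingly
    rows.sort(reverse=True)                        # best lines first
    return rows

def bound(matrix, l, c):
    """Ordered restriction [U]_{l,c}: first l lines and first c columns."""
    return [row[:c] for row in order_lmaxlmin(matrix)[:l]]
```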

4.2. Bounded utility lexicographic value iteration


Algorithm 4: Bounded Utility lmax(lmin) Value Iteration (BU-VI).
Data: A possibilistic MDP, bounds (l, c); δ, the policy built by the algorithm, is a global variable
Result: A policy δ optimal for ⪰_{lmaxlmin,l,c}
1  begin
2    foreach s ∈ S do U(s) ← ((µ(s)));
3    foreach s ∈ S, a ∈ A do
4      TU_{s,a} ← T_{s,a} × ((µ(s′)), s′ ∈ succ(s, a));
5    repeat
6      foreach s ∈ S do
7        U_old(s) ← U(s);
8        Q ← ((0));
9        foreach a ∈ A do
10         Future ← (U_old(s′), s′ ∈ succ(s, a));  // gather the matrices provided by the successors of s
11         Q(s, a) ← [(TU_{s,a} × Future)^lmaxlmin]_{l,c};
12         if Q ≺_lmaxlmin Q(s, a) then
13           Q ← Q(s, a);
14           δ(s) ← a;
15       U(s) ← Q(s, δ(s));
16   until U(s) == U_old(s) for each s;
17   δ(s) ← arg max_a Q(s, a);
18   U(s) ← max_a Q(s, a);
19   return δ;

When the horizon of the MDP is finite, this algorithm provides in polynomial time a policy that is always at least as good as the one provided by u_opt (according to lmax(lmin)) and tends to lexicographic optimality when c = 2·E + 1 and l tends to b^E.

Let us now study the time complexity. The number of iterations is bounded by the size of the set of possible matrices of trajectories, which is in O(|S| · |A| · E). One iteration of the algorithm requires composition, ordering and comparing operations on b matrices of size (l, c). Since the composition and comparison of matrices are linear operations, the complexity of one iteration in the worst case is in b · (l·c) · log(l·c). Therefore, the complexity of the algorithm is in O(|S| · |A| · E · b · (l·c) · log(l·c)).

When the horizon of the MDP is not finite, equations (16) and (17) are not enough to rank-order the policies. The length of the trajectories may be infinite, as well as their number. This problem is well known in classical probabilistic MDPs, where a discount factor is used to attenuate the influence of later utility degrees - thus allowing the convergence of the algorithm [21]. On the contrary, classical ΠMDPs do not need any discount factor and Value Iteration, based on the evaluation for l = c = 1, converges in the infinite horizon case [22]. In a sense, this limitation to l = c = 1 plays the role of a discount factor - but a very drastic one. Extending the comparison by using ≥_{lmaxlmin,l,c} with larger (l, c), as shown below, allows us to use a less drastic discount.

In other terms, ≥_{lmaxlmin,l,c} can be used in the infinite case, as shown by the following proposition.

Proposition 7 (Bounded utility lmax(lmin)-policy evaluation converges). Let U_t(s) be the matrix issued from s at instant t when a strategy δ is executed. It holds that:

∀l, c, ∃t*, such that ∀t ≥ t*, (U_t)^lmaxlmin_{l,c}(s) = (U_{t*})^lmaxlmin_{l,c}(s), ∀s.

Hence there exists a stage t* where the value of a policy becomes stable if computed with the bounded utility lmax(lmin) evaluation algorithm. This criterion is thus soundly defined and can be used in the infinite horizon case (and of course in the finite horizon case).

The number of iterations of Algorithm 4 is not explicitly bounded, but the convergence of the algorithm is guaranteed - this is a direct consequence of Proposition 7.

Corollary 1 (Bounded utility lmax(lmin)-value iteration converges). ∀l, c, ∃t* such that, ∀t ≥ t*, (U_t)^lmaxlmin_{l,c}(s) = (U_{t*})^lmaxlmin_{l,c}(s), ∀s.

The overall complexity of bounded utility lmax(lmin)-value iteration (Algorithm 4) is bounded by O(|S| · |A| · |L| · b · (l·c) · log(l·c)).


4.3. Bounded utility lexicographic policy iteration

In Ref. [17], Howard shows that a policy often becomes optimal long before the convergence of the value estimates. That is why Puterman [21] has proposed a policy iteration algorithm. This algorithm has been adapted to possibilistic MDPs by [22].

Likewise, we propose a (bounded utility) lexicographic policy iteration algorithm (Algorithm 5), denoted here BU-PI, that alternates improvement and evaluation phases, as any policy iteration algorithm.

Algorithm 5: lmax(lmin)-Bounded Utility Policy Iteration.
Data: A possibilistic MDP, bounds (l, c)
Result: A policy δ* optimal when l, c grow
1  begin
2    // Arbitrary initialization of δ on S
3    foreach s ∈ S do δ(s) ← choose any a_s ∈ A_s;
4    repeat
5      // Evaluation of δ
6      foreach s ∈ S do U(s) ← µ(s);
7      repeat
8        foreach s ∈ S do
9          U_old(s) ← U(s);
10         // Gather the matrices of the successors of s given δ
11         Future ← (U(s′), s′ ∈ succ(s, δ(s)));  U(s) ← [(TU_{s,δ(s)} × Future)^lmaxlmin]_{l,c};
12       until U(s) == U_old(s) for each s;
13     δ_old ← δ;
14     // Improvement of δ
15     foreach s ∈ S do
16       // Compute the utility of the strategy playing a (for each a), given what was chosen for the other states
17       foreach a ∈ A do
18         Future ← (U(s′), s′ ∈ succ(s, a));  Q(s, a) ← [(TU_{s,a} × Future)^lmaxlmin]_{l,c};
19       // Update the choice of an action for s
20       δ(s) ← arg max^{lmax(lmin)}_{a ∈ A} Q(s, a);
21   until δ == δ_old;
22   return δ;

In line 3 of Algorithm 5, an arbitrary initial policy is chosen. The algorithm then proceeds by evaluating the current policy, through successive updates of the value function (lines 8 to 11); the convergence of this evaluation is easily derived from that of the bounded utility lmax(lmin)-value iteration algorithm. Then the algorithm enters the improvement phase: lines 17-18 compute Q(s, a), the (bounded lexicographic) utility of playing action a in state s and then applying policy δ_old in subsequent states (the policy computed during the last iteration); as usual in policy iteration style algorithms, the updated policy (δ) is then obtained by greedily improving the current action, which is done in line 20. Since the actions considered at line 20 do include the one prescribed by δ_old, either nothing is changed, and the algorithm stops, or the new policy, δ, is better than the previous one, δ_old.

Proposition 8. Bounded utility lmax(lmin)-policy iteration converges to an optimal policy for ⪰_{lmaxlmin,l,c} in finite time.

Policy iteration (Algorithm 5) converges and is guaranteed to find a policy optimal for the (l, c) lexicographic criterion in finite time, and usually in a few iterations. As for the algorithmic complexity of the classical, stochastic, policy iteration algorithm (which is still not well understood [16]), a tight worst-case complexity bound for lexicographic policy iteration is hard to obtain. Therefore, we provide an upper bound on this complexity.

The policy iteration algorithm never visits a policy twice: in the worst case, the number of iterations before convergence is exponential, but it is dominated by the number of distinct policies. So, the complexity of this algorithm is dominated by O(|A|^|S|). Besides, each iteration has a cost, the evaluation phase relying on a bounded utility value iteration algorithm that costs O(|S| · |A| · |L| · b · (l·c) · log(l·c)) when many actions are possible at a given step, and costs O(|S| · |L| · b · (l·c) · log(l·c)) here because one action is selected (by the current policy) for each state. Thus, the overall complexity of the algorithm is in O(|A|^|S| · |S| · |L| · b · (l·c) · log(l·c)).


4.4. Back to lexicographic value iteration: from finite to infinite horizon ΠMDPs

The bounded iterations algorithm defined in Section 3 (Algorithm 3, BI-VI) can be used for both finite horizon and infinite horizon MDPs, because it fixes a number of iterations E; if E is low, the policy reached is not necessarily optimal - the algorithm is an approximation algorithm.

Now, exploiting the above propositions, we are able to show that the bounded iterations lmax(lmin) value iteration algorithm (Algorithm 3) converges when E tends to infinity. To do so, we first prove the following proposition:

Proposition 9. Let an arbitrary stationary ΠMDP be given. Then, there exist two positive natural numbers (l*, c*), such that for any pair (δ, δ′) of arbitrary policies, any state s ∈ S, and any pair (l, c) such that l ≥ l* and c ≥ c*,

δ(s) ≻_{lmaxlmin,l,c} δ′(s) ⇔ δ(s) ≻_{lmaxlmin,l*,c*} δ′(s).

Now, this proposition can be used to prove the convergence of the bounded iterations lmax(lmin)-value iteration algorithm. For this, let us define ≻_lmaxlmin =_def ≻_{lmaxlmin,l*,c*}, the unique preference relation between policies that results from Proposition 9.

Proposition 10. If we let δ_E be the policy returned by Algorithm 3 for any fixed E, we can show that the sequence (δ_E) converges and that there exists a finite E*, such that:

lim_{E→∞} δ_E = δ_{E*}.

Furthermore, δ_{E*} is optimal with respect to ≻_lmaxlmin.

The sequence of policies obtained by BI-VI (Algorithm 3) when E tends to infinity converges. Furthermore, the limit is attained for a finite (but unknown in advance) E. Alternately, it is also attained by the BU-VI and BU-PI algorithms, with finite but unknown (l, c).

Now, let us summarize the theoretical results that we have obtained so far. We have shown that possibilistic utilities (optimistic and pessimistic) are special cases of bounded lexicographic utilities, which can be represented by matrices. Possibilistic utilities are obtained when l = c = 1.

The possibilistic value iteration and policy iteration algorithms can be extended to compute policies which are optimal according to ≻_{lmaxlmin,l,c}.

Finally, if infinite horizon lexicographic optimal policies are defined as the limiting policies obtained from a non-bounded lexicographic value iteration algorithm, we have shown that such policies can be computed by applying our bounded utility lmax(lmin) value iteration algorithm and that only a finite number of iterations (even though not known in advance) is required.

5. Experiments

In order to evaluate the previous algorithms, we propose, in the following, two experimental analyses: in the first one we compare the bounded iterations value iteration algorithm (Algorithm 3) with the bounded utility one, and in the second we compare the bounded utility lexicographic policy iteration algorithm with the bounded utility lexicographic value iteration one. The algorithms have been implemented in Java and the experiments have been performed on an Intel Core i5 processor computer (1.70 GHz) with 8 GB DDR3L of RAM.

5.1. Bounded utility vs bounded iterations value iteration

Experimental protocol. We now compare the performance of bounded utility lexicographic value iteration (BU-VI) as an approximation of lexicographic value iteration (BI-VI) for finite horizon problems, in the lmax(lmin) variant. Because the horizon is finite, the number of steps of BI-VI can be set equal to the horizon, and the algorithm then provides a solution optimal according to lmax(lmin). BU-VI, on the other side, limits the size of the matrices, and can lead to sub-optimal solutions.

We evaluate the performance of the algorithms by carrying out simulations on randomly generated finite horizon ΠMDPs.


Fig. 2. Bounded utility lexicographic value iteration vs lexicographic value iteration.

Table 1
Average CPU time (in seconds) and average number of iterations.

Bounded utility policy iteration
(l, c)                          (2,2)    (4,4)    (6,6)    (10,10)
CPU time (s)                    0.029    0.042    0.064    0.091
Average number of iterations    3.2      4.33     5.6      9.7

Bounded utility value iteration
(l, c)                          (2,2)    (4,4)    (6,6)    (10,10)
CPU time (s)                    0.03     0.052    0.082    0.1
Average number of iterations    6.75     9.25     16.11    20.2

The higher this rate, the more important the effectiveness of cutting matrices with BU-VI; the lower this rate, the more important the drowning effect.

Results. Fig. 2(a) presents the average execution CPU time for the two algorithms. Obviously, for both BI-VI and BU-VI, the execution time increases with the horizon. Also, we observe that the CPU time of BU-VI increases according to the values of (l, c) but it remains affordable, as the maximal CPU time is lower than 1 s for MDPs with 25 states and 4 actions when (l, c) = (40, 40) and E = 25. Unsurprisingly, we can check that BU-VI (regardless of the values of (l, c)) is faster than BI-VI, especially when the horizon increases: the manipulation of (l, c)-matrices is obviously less expensive than the one of full matrices. The saving increases with the horizon.

As for the success rate, the results are described in Fig. 2(b). It appears that BU-VI provides a very good approximation, especially when increasing (l, c). It provides the same optimal solution as BI-VI in about 90% of cases, with (l, c) = (200, 200). Moreover, even when the success rate of BU-VI decreases (when E increases), the quality of approximation is still good: never less than 70% of optimal actions returned, with E = 25. These experiments conclude in favor of bounded value iteration: the quality of its approximated solutions is comparable with that of the unbounded version for high (l, c) and increases when (l, c) increase, while it is much faster.

5.2. Bounded utility lexicographic policy iteration vs bounded utility lexicographic value iteration

Experimental protocol. In what follows we evaluate the performances of bounded utility lexicographic policy iteration (BU-PI) and bounded utility lexicographic value iteration (BU-VI), in the lmax(lmin) variant. We evaluate the performance of the algorithms on randomly generated ΠMDPs as those of Section 5.1, with |S| = 25 and |A_s| = 4, ∀s.

We ran the two algorithms for different values of (l, c) (100 ΠMDPs are considered in each sample). For each of the two algorithms we measure the CPU time needed to converge. We also measure the average number of value iterations for BU-VI and the average number of policy iterations for BU-PI.

Results. Table 1 presents the average execution CPU time and the average number of iterations for the two algorithms. Obviously, for both BU-PI and BU-VI, the execution time increases according to the values of (l, c) but it remains affordable, as the maximal CPU time is lower than 0.1 s for MDPs with 25 states and 4 actions when (l, c) = (10, 10). It appears that BU-PI (regardless of the values of (l, c)) is slightly faster than BU-VI.

Consider now the number of iterations. At each iteration, BU-PI considers one policy, explicitly, and updates it at line 20. And so does value iteration: for each state, the current policy is updated at line 15. Table 1 shows that BU-PI always considers fewer policies than BU-VI. This experiment provides empirical evidence in favor of policy iteration.
