

This is a publisher's version published in: http://oatao.univ-toulouse.fr/22626
Official URL / DOI: https://doi.org/10.1016/j.ijar.2018.10.011

To cite this version: Ben Amor, Nahla and El Khalfi, Zeineb and Fargier, Hélène and Sabbadin, Régis. Lexicographic refinements in stationary possibilistic Markov Decision Processes. (2018) International Journal of Approximate Reasoning, 103, 343-363. ISSN 0888-613X


Lexicographic refinements in stationary possibilistic Markov Decision Processes

Nahla Ben Amor a,∗, Zeineb El Khalfi a,b,∗∗, Hélène Fargier b,∗, Régis Sabbadin c,∗

a LARODEC, University of Tunis, Tunisia
b IRIT, UPS-CNRS, Université de Toulouse 3, 118 route de Narbonne, F-31062 Toulouse, France
c MIAT, UR 875, Université de Toulouse, INRA, F-31320 Castanet-Tolosan, France

Abstract

Keywords: Markov Decision Process; Possibility theory; Lexicographic comparisons; Possibilistic qualitative utilities

Possibilistic Markov Decision Processes offer a compact and tractable way to represent and solve problems of sequential decision under qualitative uncertainty. Even though appealing for its ability to handle qualitative problems, this model suffers from the drowning effect that is inherent to possibilistic decision theory. The present paper¹ proposes to escape the drowning effect by extending to stationary possibilistic MDPs the lexicographic preference relations defined by Fargier and Sabbadin [13] for non-sequential decision problems. We propose a value iteration algorithm and a policy iteration algorithm to compute policies that are optimal for these new criteria. The practical feasibility of these algorithms is then experimented on different samples of possibilistic MDPs.

1. Introduction

The classical paradigm for sequential decision making under uncertainty is the expected utility-based Markov Decision Processes (MDPs) framework [3,21], which assumes that the uncertain effects of actions can be represented by probability distributions and that utilities are additive. But the EU model does not suit problems where uncertainty and preferences are ordinal in essence.

Alternatives to the EU-based model have been proposed to handle ordinal preferences/uncertainty. Remaining within the probabilistic, quantitative framework while considering ordinal preferences has led to quantile-based approaches [15,18,27,29,33]. Purely ordinal approaches to sequential decision under uncertainty have also been considered. In particular, possibilistic MDPs [1,6,22,24] form a purely qualitative decision model with an ordinal evaluation of plausibility and preference. In this model, uncertainty about the consequences of actions is represented by possibility distributions and utilities are also ordinal. The decision criteria are either the pessimistic qualitative utility or its optimistic counterpart [9]. Such degrees can be either elicited from experts, or obtained by automatic learning approaches [23].

This paper is part of the Virtual special issue on the 14th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty (ECSQARU 2017), edited by Alessandro Antonucci, Laurence Cholvy and Odile Papini.

* Corresponding authors.
** Corresponding author at: IRIT, UPS-CNRS, 118 route de Narbonne, 31062 Toulouse, France.
E-mail addresses: nahla.benamor@gmx.com (N. Ben Amor), zeineb.khalfi@gmail.com (Z. El Khalfi), fargier@irit.fr (H. Fargier), regis.sabbadin@inra.fr (R. Sabbadin).

1 This paper is an extended and revised version of two conference papers [4,5]. It includes the full proofs of the propositions presented in these preliminary papers, new algorithms (based on policy iteration) and new experiments.

However, it is now well known that possibilistic decision criteria suffer from a drowning effect [13]: plausible enough bad or good consequences may completely blur the comparison between policies that would otherwise be clearly differentiable.

In [13], Fargier and Sabbadin have proposed lexicographic refinements of possibilistic criteria for the one-step decision case, in order to remedy the drowning effect. This work has recently been extended to (finite horizon) possibilistic decision trees [4]. In the present paper, we propose to study the interest of the lexicographic preference relations for stationary possibilistic Markov Decision Processes, a model that is more compact than decision trees and not limited to a finite horizon.

The paper is structured as follows. The next section recalls the background about possibilistic decision theory and stationary possibilistic MDPs, including the drowning effect problem. Section 3 defines the lexicographic comparison of policies and presents a value iteration algorithm which computes a nearly optimal strategy in a limited number of iterations. Then, Section 4 proposes a lexicographic value iteration algorithm and a lexicographic policy iteration algorithm using approximation of utility functions. Lastly, Section 5 presents our experimental results.

2. Background and notations

2.1. Basics of possibilistic decision theory

Most available decision models refer to probability theory for the representation of uncertainty [20,25]. Despite its success, probability theory is not appropriate when numerical information is not available. When information about uncertainty cannot be quantified in a probabilistic way, possibility theory [8,34] is a natural field to consider. The basic component of this theory is the notion of possibility distribution. It is a representation of a state of knowledge of an agent about the state of the world. A possibility distribution π is a mapping from the universe of discourse S (the set of all the possible worlds) to a bounded linearly ordered scale L, exemplified (without loss of generality) by the unit interval [0,1]; we denote the function by π : S → [0,1].

For a state s ∈ S, π(s) = 1 means that realization s is totally possible and π(s) = 0 means that s is an impossible state. It is generally assumed that there exists at least one state s which is totally possible: π is then said to be normalized.

In the possibilistic framework, extreme forms of knowledge can be captured, namely:
• Complete knowledge, i.e. ∃s s.t. π(s) = 1 and ∀s′ ≠ s, π(s′) = 0.
• Total ignorance, i.e. ∀s ∈ S, π(s) = 1 (all values in S are possible).

From π one can compute the possibility measure Π(A) and the necessity measure N(A) of any event A ⊆ S:

Π(A) = sup_{s∈A} π(s),    N(A) = 1 − Π(Ā) = 1 − sup_{s∉A} π(s).

Measure Π(A) evaluates to which extent A is consistent with the knowledge represented by π, while N(A) corresponds to the extent to which ¬A is impossible and thus evaluates at which level A is certainly implied by the knowledge.
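These two measures are straightforward to compute on a finite universe. The following minimal sketch (Python, with an illustrative distribution that is not taken from the paper) makes the definitions concrete:

```python
# Possibility and necessity of an event A (a set of states), for a
# possibility distribution pi: state -> degree in [0, 1].
def possibility(pi, A):
    # Pi(A) = sup_{s in A} pi(s); max is enough on a finite universe
    return max(pi[s] for s in A)

def necessity(pi, A):
    # N(A) = 1 - Pi(complement of A)
    complement = [s for s in pi if s not in A]
    return 1 - max(pi[s] for s in complement) if complement else 1.0

# Illustrative (hypothetical) distribution over three states
pi = {"s1": 1.0, "s2": 0.4, "s3": 0.1}
print(possibility(pi, {"s1", "s2"}))  # 1.0
print(necessity(pi, {"s1", "s2"}))    # 0.9 = 1 - pi(s3)
```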

In decision theory, acts are functions f : S → X, where X is a finite set of outcomes. In possibilistic decision making, an act f can be viewed as a possibility distribution π_f over X [9], where π_f(x) = Π(f⁻¹(x)). In a single stage decision making problem, a utility function u : X → U maps outcomes to utility values in a totally ordered scale U = {u_1, ..., u_n}. This function models the attractiveness of each outcome for the decision-maker.

Under the assumption that the utility scale and the possibility scale are commensurate and purely ordinal (i.e. U = L), Dubois and Prade [9,7] have proposed pessimistic and optimistic decision criteria.

First, the pessimistic criterion was originally proposed by Whalen [30] and it generalizes the Wald criterion [28]. It suits cautious decision makers who are happy when bad consequences are hardly plausible. It summarizes to what extent it is certain (i.e. necessary according to measure N) that the act reaches a good utility. The definition of the pessimistic criterion is as follows [10]:

Definition 1. Given a possibility distribution π over a set of states S and a utility function u on the set of consequences X, the pessimistic utility of an act f is defined by:

u_pes(f) = min_{x_j ∈ X} max(u(x_j), 1 − π_f(x_j)) = min_{s_i ∈ S} max(u(f(s_i)), 1 − π(s_i)).    (1)

Therefore, we can compare two acts f and g on the basis of their pessimistic utilities:

f ≽_upes g ⟺ u_pes(f) ≥ u_pes(g).

The second criterion is the optimistic possibilistic criterion, originally proposed by Yager [32,31]. This criterion captures the behavior of an adventurous decision maker who is happy as soon as at least one good consequence is highly plausible. It summarizes to what extent it is possible that an act reaches a good utility. The definition of this criterion is as follows [10]:

Definition 2. Given a possibility distribution π over a set of states S and a utility function u on a set of consequences X, the optimistic utility of an act f is defined by:

u_opt(f) = max_{x_j ∈ X} min(u(x_j), π_f(x_j)) = max_{s_i ∈ S} min(u(f(s_i)), π(s_i)).    (2)

Hence, we can compare two acts f and g on the basis of their optimistic utilities:

f ≽_uopt g ⟺ u_opt(f) ≥ u_opt(g).

Example 1. Let S = {s_1, s_2} and f and g be two acts whose utilities of consequences in the states s_1 and s_2 are listed in the following table, as well as the degrees of possibility of s_1 and s_2:

            s_1    s_2
u(f(s))     0.3    0.5
u(g(s))     0.4    0.6
π           1      0.2

Comparing f and g with respect to the pessimistic criterion, we get:
• u_pes(f) = min(max(0.3, 0), max(0.5, 0.8)) = 0.3,
• u_pes(g) = min(max(0.4, 0), max(0.6, 0.8)) = 0.4.
Thus, g ≽_upes f.

Let us now compare the two acts with respect to the optimistic criterion:
• u_opt(f) = max(min(0.3, 1), min(0.5, 0.2)) = 0.3,
• u_opt(g) = max(min(0.4, 1), min(0.6, 0.2)) = 0.4.
Thus, g ≽_uopt f.
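As a sanity check, Definitions 1 and 2 can be replayed on Example 1 with a few lines of code. This is only an illustrative sketch (the dictionaries below are ours):

```python
# Example 1: utilities of the two acts in each state, and state possibilities
pi  = {"s1": 1.0, "s2": 0.2}
u_f = {"s1": 0.3, "s2": 0.5}
u_g = {"s1": 0.4, "s2": 0.6}

def u_pes(u_act, pi):
    # u_pes(f) = min_s max(u(f(s)), 1 - pi(s))   (Definition 1)
    return min(max(u_act[s], 1 - pi[s]) for s in pi)

def u_opt(u_act, pi):
    # u_opt(f) = max_s min(u(f(s)), pi(s))       (Definition 2)
    return max(min(u_act[s], pi[s]) for s in pi)

print(u_pes(u_f, pi), u_pes(u_g, pi))  # 0.3 0.4 -> g preferred pessimistically
print(u_opt(u_f, pi), u_opt(u_g, pi))  # 0.3 0.4 -> g preferred optimistically
```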

It is important to note that while transition probabilities can be estimated through simulations of the process, transition possibilities may not. On the other hand, experts may be involved for the elicitation of the possibility degrees and utilities of transitions. In the possibilistic framework, utility and uncertainty levels can be elicited jointly, by comparison of possibilistic lotteries, for example (e.g. by using certainty equivalents, as in [11]). Simulation can also be used jointly with expert evaluation when the underlying process is too costly to simulate a large number of times: simulation may be used to generate samples on which expert elicitation is applied. Another option is to use a possibilistic reinforcement learning procedure (for more details see [23]), in particular a model-based reinforcement learning algorithm. The latter uses a uniform simulation of trajectories (with random choice of actions) in order to generate an approximation of the possibilistic decision model.

2.2. Stationary Possibilistic Markov Decision Processes

A stationary Possibilistic Markov Decision Process (Π-MDP) [22] is defined by:
• A finite set S of states;
• A finite set A of actions; A_s denotes the set of actions available in state s;
• A possibilistic transition function: for each action a ∈ A_s and each state s ∈ S, the possibility distribution π(s′|s,a) evaluates to what extent each s′ is a possible successor of s when action a is applied;
• A utility function µ: µ(s) is the intermediate satisfaction degree obtained in state s.

The uncertainty about the effect of an action a taken in state s is captured by a possibility distribution π(·|s,a). In the present paper, we consider stationary problems, i.e. problems in which the states, the actions and the transition functions do not depend on the stage of the problem. Such a possibilistic MDP may be depicted as a graph where states are represented by circles, each state s is labeled with a utility degree, and actions are represented by squares. An edge linking an action to a state denotes a possible transition and is labeled by the possibility of that state given that the action is executed.

Fig. 1. The stationary Π-MDP of Example 2.

Example 2. Let us suppose that a "Rich and Unknown" person runs a startup company. Initially, s/he must choose between Saving money (Sav) or Advertising (Adv) and may then get Rich (R) or Poor (P) and Famous (F) or Unknown (U). In the other states, Sav is the only possible action. Fig. 1 shows the stationary Π-MDP that captures this problem, formally described as follows:

S = {RU, RF, PU},
A_RU = {Adv, Sav}, A_RF = A_PU = {Sav},
π(PU|RU, Sav) = 0.2,
π(RU|RU, Sav) = π(RF|RU, Adv) = π(RF|RF, Sav) = π(RU|RF, Sav) = 1,
µ(RU) = 0.5, µ(RF) = 0.7, µ(PU) = 0.3.
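For the sketches used alongside the examples of this paper, this Π-MDP can be encoded with plain dictionaries. This is only one possible (hypothetical) encoding, not the authors' Java implementation; the self-loop of PU under Sav is read off Fig. 1 (it also appears in the trajectories of Example 3):

```python
# States, available actions and utilities of Example 2
S  = ["RU", "RF", "PU"]
A  = {"RU": ["Adv", "Sav"], "RF": ["Sav"], "PU": ["Sav"]}
mu = {"RU": 0.5, "RF": 0.7, "PU": 0.3}

# pi[s][a][s2] = possibility of reaching s2 from s with action a
# (pairs that are not listed are impossible, i.e. degree 0)
pi = {
    "RU": {"Sav": {"RU": 1.0, "PU": 0.2}, "Adv": {"RF": 1.0}},
    "RF": {"Sav": {"RF": 1.0, "RU": 1.0}},
    "PU": {"Sav": {"PU": 1.0}},  # self-loop, totally possible
}
```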

Solving a stationary MDP consists in finding a (stationary) policy, i.e. a function δ : S → A, which is optimal with respect to a decision criterion. In the possibilistic case, as in the probabilistic case, the value of a policy depends on the utility and on the likelihood of its trajectories. Formally, let Δ be the set of all policies that can be built for the Π-MDP (the set of all the functions that associate an element of A_s to each s). Each δ ∈ Δ defines a list of scenarios called trajectories. Each trajectory τ is a sequence of states and actions, i.e. τ = (s_0, a_0, s_1, ..., s_{t−1}, a_{t−1}, s_t).

To simplify notations, we will associate the vector v_τ = (µ_0, π_1, µ_1, π_2, ..., π_t, µ_t) to each trajectory τ, where π_{i+1} = π(s_{i+1}|s_i, a_i) is the possibility degree to reach the state s_{i+1} at t = i+1 when applying the action a_i at t = i, and µ_i = µ(s_i) is the utility obtained in the i-th state s_i of the trajectory.

The possibility and the utility of trajectory τ, given that δ is applied from s_0, are defined by:

π(τ|s_0, δ) = min_{i=1..t} π(s_i | s_{i−1}, δ(s_{i−1}))   and   µ(τ) = min_{i=0..t} µ(s_i).    (3)

Two criteria, an optimistic and a pessimistic one, can then be used to evaluate δ [24,9]:

u_opt(δ, s_0) = max_τ min{π(τ|s_0, δ), µ(τ)},    (4)

u_pes(δ, s_0) = min_τ max{1 − π(τ|s_0, δ), µ(τ)}.    (5)
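Read literally, Eqs. (3)–(5) can be evaluated by enumerating the trajectories of a policy over a finite horizon. The following sketch does exactly that on the dictionaries introduced after Example 2 (names and structure are ours, and the enumeration is of course exponential in the horizon):

```python
def trajectories(pi, delta, s0, horizon):
    """Yield (possibility, visited states) for every trajectory of policy delta from s0."""
    if horizon == 0:
        yield 1.0, [s0]
        return
    a = delta[s0]
    for s_next, p in pi[s0][a].items():
        for p_rest, states in trajectories(pi, delta, s_next, horizon - 1):
            yield min(p, p_rest), [s0] + states   # pi(tau) is the min of the transition degrees

def u_opt_policy(pi, mu, delta, s0, horizon):
    # Eq. (4): u_opt(delta, s0) = max_tau min(pi(tau), mu(tau)); mu(tau) = min of visited utilities
    return max(min(p, min(mu[s] for s in states))
               for p, states in trajectories(pi, delta, s0, horizon))

def u_pes_policy(pi, mu, delta, s0, horizon):
    # Eq. (5): u_pes(delta, s0) = min_tau max(1 - pi(tau), mu(tau))
    return min(max(1 - p, min(mu[s] for s in states))
               for p, states in trajectories(pi, delta, s0, horizon))

# The two policies of Example 3 (Section 2.3) both get optimistic value 0.5 from RU:
delta1 = {"RU": "Sav", "RF": "Sav", "PU": "Sav"}
delta2 = {"RU": "Adv", "RF": "Sav", "PU": "Sav"}
print(u_opt_policy(pi, mu, delta1, "RU", 2), u_opt_policy(pi, mu, delta2, "RU", 2))  # 0.5 0.5
```

This tie between two policies that are intuitively not equivalent is precisely the drowning effect discussed in Section 2.3.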

The policies optimizing these criteria can be computed by applying, for every state s and time step i = 0, ..., t, the following counterparts of the Bellman updates [22]:

u_opt(s, i) ← max_{a ∈ A_s} min{µ(s), max_{s′ ∈ S} min(π(s′|s,a), u_opt(s′, i+1))},    (6)

u_pes(s, i) ← max_{a ∈ A_s} min{µ(s), min_{s′ ∈ S} max(1 − π(s′|s,a), u_pes(s′, i+1))},    (7)

δ_opt(s, i) ← arg max_{a ∈ A_s} min{µ(s), max_{s′ ∈ S} min(π(s′|s,a), u_opt(s′, i+1))},    (8)

δ_pes(s, i) ← arg max_{a ∈ A_s} min{µ(s), min_{s′ ∈ S} max(1 − π(s′|s,a), u_pes(s′, i+1))}.    (9)

Here we set, arbitrarily, u_opt(s′, t+1) = 1 and u_pes(s′, t+1) = 1.

These updates have allowed the definition of a (possibilistic) value iteration algorithm (see Algorithm 1 for the optimistic variant of this algorithm) which converges to an optimal policy in polytime [22]. This algorithm proceeds by iterated modifications of a possibilistic value function Q(s,a) which evaluates the "utility" (pessimistic or optimistic) of performing a in s.

Another algorithm, (possibilistic) Policy Iteration (Algorithm 2 for the optimistic variant), is proposed in [22] for solving possibilistic stationary, infinite horizon MDPs. Policy Iteration alternates steps of evaluation of the current policy with steps of greedy improvement of the current policy.

Algorithm 1: VI-MDP: Possibilistic (Optimistic) Value Iteration.
Data: A stationary Π-MDP
Result: A policy δ optimal for u_opt
1  begin
2    foreach s ∈ S do u_opt(s) ← µ(s);
3    repeat
4      foreach s ∈ S do
5        u_old(s) ← u_opt(s);
6        foreach a ∈ A do
7          Q(s,a) ← min{µ(s), max_{s′∈S} min(π(s′|s,a), u_opt(s′))};
8        u_opt(s) ← max_a Q(s,a);
9        δ(s) ← arg max_a Q(s,a);
10   until u_opt(s) == u_old(s) for each s;
11   return δ;

Algorithm 2: PI-MDP: Possibilistic (Optimistic) Policy Iteration.
Data: A stationary Π-MDP
Result: A policy δ optimal for u_opt
1  begin
2    // Initialization of δ and u_opt
3    foreach s ∈ S do
4      δ(s) ← choose any a ∈ A_s;
5      u_opt(s) ← µ(s);
6    repeat
7      // Evaluation of δ until stabilization of u_opt
8      repeat
9        foreach s ∈ S do
10         u_old(s) ← u_opt(s);
11         u_opt(s) ← min{µ(s), max_{s′∈S} min(π(s′|s, δ(s)), u_old(s′))};
12       until u_opt == u_old;
13     // Improvement of δ
14     foreach s ∈ S do
15       δ_old(s) ← δ(s);
16       δ(s) ← arg max_{a∈A} min{µ(s), max_{s′∈S} min(π(s′|s,a), u_opt(s′))};
17   until δ(s) == δ_old(s) for each s;
18   // stabilization of δ
19   return δ;
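A compact sketch of the optimistic updates of Algorithm 1 (Eq. (6)), runnable on the toy encoding given after Example 2; the function and variable names are ours:

```python
def possibilistic_value_iteration(S, A, pi, mu):
    """Optimistic possibilistic value iteration, in the spirit of Algorithm 1."""
    u = {s: mu[s] for s in S}
    while True:
        u_old = dict(u)
        delta = {}
        for s in S:
            best_a, best_q = None, -1.0
            for a in A[s]:
                # Q(s,a) = min(mu(s), max_{s'} min(pi(s'|s,a), u_old(s')))
                q = min(mu[s], max(min(p, u_old[s2]) for s2, p in pi[s][a].items()))
                if q > best_q:
                    best_a, best_q = a, q
            u[s], delta[s] = best_q, best_a
        if u == u_old:
            return u, delta

u_opt_values, delta = possibilistic_value_iteration(S, A, pi, mu)
print(u_opt_values)  # {'RU': 0.5, 'RF': 0.7, 'PU': 0.3} on the Example 2 encoding
```

Note that this sketch uses the previous value function u_old for all updates of a sweep, which is sufficient for the example; Algorithm 1 as written may reuse freshly updated values within a sweep.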

2.3. The drowning effect in stationary sequential decision problems

Unfortunately, possibilistic utilities suffer from an important drawback called the drowning effect: plausible enough bad or good consequences may completely blur the comparison between acts that would otherwise be clearly differentiated; as a consequence, an optimal policy δ is not necessarily Pareto efficient. Recall that a policy δ is Pareto efficient when no other policy δ′ dominates it (i.e. there is no policy δ′ such that (i) ∀s ∈ S, u_pes(δ′, s) ≥ u_pes(δ, s) and (ii) ∃s ∈ S s.t. u_pes(δ′, s) > u_pes(δ, s)). The following example shows that it can simultaneously happen that δ′ dominates δ and u_pes(δ) = u_pes(δ′).

Example 3. The Π-MDP of Example 2 admits two policies δ and δ′:

• δ(RU) = Sav; δ(PU) = Sav; δ(RF) = Sav;
• δ′(RU) = Adv; δ′(PU) = Sav; δ′(RF) = Sav.

Consider a fixed horizon H = 2:

• δ has 3 trajectories:
  τ_1 = (RU, PU, PU) with v_τ1 = (0.5, 0.2, 0.3, 1, 0.3);
  τ_2 = (RU, RU, PU) with v_τ2 = (0.5, 1, 0.5, 0.2, 0.3);
  τ_3 = (RU, RU, RU) with v_τ3 = (0.5, 1, 0.5, 1, 0.5).
• δ′ has 2 trajectories:
  τ_4 = (RU, RF, RF) with v_τ4 = (0.5, 1, 0.7, 1, 0.7);
  τ_5 = (RU, RF, RU) with v_τ5 = (0.5, 1, 0.7, 1, 0.5).

Thus u_opt(δ, RU) = u_opt(δ′, RU) = 0.5. However, δ′ seems better than δ since it provides utility 0.5 for sure, while δ provides a bad utility (0.3) in some non-impossible trajectories (τ_1 and τ_2). Trajectory τ_3, which is good and totally possible, "drowns" τ_1 and τ_2: δ is considered as good as δ′.

3. Bounded iterations solutions to lexicographic finite horizon Π-MDPs

Possibilistic decision criteria, especially pessimistic and optimistic utilities, are simple and realistic, as illustrated in Section 2, but they have an important shortcoming: the principle of Pareto efficiency is violated since these criteria suffer from the drowning effect. Indeed, one decision may dominate another one while not being strictly preferred. In order to overcome the drowning effect, some refinements of possibilistic utilities have been proposed in the non-sequential case, such as the lexicographic refinements proposed by [12,13]. These refinements are fully in accordance with ordinal utility theory and satisfy the principle of Pareto dominance, which is why we have chosen to focus on them.

The present section defines an extension of lexicographic refinements to finite horizon possibilistic Markov decision processes and proposes a value iteration algorithm that looks for policies optimal with respect to these criteria.

3.1. Lexi-refinements of ordinal aggregations

In ordinal (i.e. min-based and max-based) aggregation, a solution to the drowning effect based on leximin and leximax comparisons has been proposed by [19]. It has then been extended to non-sequential decision making under uncertainty [13] and, in the sequential case, to decision trees [4]. Let us first recall the basic definition of these two preference relations. For any two vectors t and t′ of length m built on the scale L:

t ≽_lmin t′ iff ∀i, t_σ(i) = t′_σ(i), or ∃i* s.t. ∀i < i*, t_σ(i) = t′_σ(i) and t_σ(i*) > t′_σ(i*),    (10)

t ≽_lmax t′ iff ∀i, t_µ(i) = t′_µ(i), or ∃i* s.t. ∀i < i*, t_µ(i) = t′_µ(i) and t_µ(i*) > t′_µ(i*),    (11)

where, for any vector v (here, v = t or v = t′), v_µ(i) (resp. v_σ(i)) is the i-th best (resp. worst) element of v.
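Operationally, both relations amount to sorting the two vectors and comparing them componentwise. A possible illustrative encoding (Python; the helper names are ours):

```python
def leximin_key(v):
    # worst components first: sort increasingly
    return sorted(v)

def leximax_key(v):
    # best components first: sort decreasingly
    return sorted(v, reverse=True)

def prefers_leximin(t, t2):
    """t >=_lmin t2 for two vectors of the same length (Python compares lists lexicographically)."""
    return leximin_key(t) >= leximin_key(t2)

def prefers_leximax(t, t2):
    return leximax_key(t) >= leximax_key(t2)

# leximin breaks a tie that min() cannot: both vectors have minimum 0.2
print(prefers_leximin((0.2, 0.8, 0.5), (0.2, 0.4, 0.5)))  # True: (0.2,0.5,0.8) beats (0.2,0.4,0.5)
print(min((0.2, 0.8, 0.5)) == min((0.2, 0.4, 0.5)))       # True: plain min sees no difference
```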

[13,4] have extended these procedures to the comparison of matrices built on L, defining preference relations ≽_lmin(lmax) and ≽_lmax(lmin):

A ≽_lmin(lmax) B ⟺ ∀j, a_(lmax,j) ∼_lmin b_(lmax,j), or ∃i s.t. ∀j > i, a_(lmax,j) ∼_lmin b_(lmax,j) and a_(lmax,i) ≻_lmin b_(lmax,i),    (12)

A ≽_lmax(lmin) B ⟺ ∀j, a_(lmin,j) ∼_lmax b_(lmin,j), or ∃i s.t. ∀j < i, a_(lmin,j) ∼_lmax b_(lmin,j) and a_(lmin,i) ≻_lmax b_(lmin,i),    (13)

where a_(⋆,i) (resp. b_(⋆,i)) is the i-th largest sub-vector of A (resp. B) according to ⋆ ∈ {lmax, lmin}.

Like in (finite-horizon) possibilistic decision trees [4], our idea is to identify the strategies of the MDP with the matrices of their trajectories, and to compare such matrices with a ≽_lmax(lmin) (resp. ≽_lmin(lmax)) procedure for the optimistic (resp. pessimistic) case.

3.2. Lexicographic comparisons of policies

Let us first define lexicographic comparisons of policies over a given horizon E.

A trajectory over horizon E being a sequence of states and actions, any stationary policy can be identified with a matrix where each line corresponds to a distinct trajectory of length E. In the optimistic case each line corresponds to a vector v_τ = (µ_0, π_1, µ_1, π_2, ..., π_E, µ_E) and in the pessimistic case to w_τ = (µ_0, 1−π_1, µ_1, 1−π_2, ..., 1−π_E, µ_E).

This allows us to define the comparison of trajectories using leximax and leximin as follows:

τ ≽_lmin τ′ iff (µ_0, π_1, ..., π_E, µ_E) ≽_lmin (µ′_0, π′_1, ..., π′_E, µ′_E),    (14)

τ ≽_lmax τ′ iff (µ_0, 1−π_1, ..., 1−π_E, µ_E) ≽_lmax (µ′_0, 1−π′_1, ..., 1−π′_E, µ′_E).    (15)

Note that the above preference relations implicitly depend on the horizon E and the same holds for the comparison of stationary policies. We leave aside any reference to E as the dependence will be clear from the context.

Using (14) and (15), we can compare policies by:

δ ≽_lmax(lmin) δ′ iff ∀i, τ_µ(i) ∼_lmin τ′_µ(i), or ∃i* s.t. ∀i < i*, τ_µ(i) ∼_lmin τ′_µ(i) and τ_µ(i*) ≻_lmin τ′_µ(i*),    (16)

δ ≽_lmin(lmax) δ′ iff ∀i, τ_σ(i) ∼_lmax τ′_σ(i), or ∃i* s.t. ∀i < i*, τ_σ(i) ∼_lmax τ′_σ(i) and τ_σ(i*) ≻_lmax τ′_σ(i*),    (17)

where τ_µ(i) (resp. τ′_µ(i)) is the i-th best trajectory of δ (resp. δ′) according to ≽_lmin, and τ_σ(i) (resp. τ′_σ(i)) is the i-th worst trajectory of δ (resp. δ′) according to ≽_lmax.

Hence, the utility degree of a policy δ can be represented by a matrix U_δ with n lines, where n is the number of trajectories, and m = 2E+1 columns. Indeed, comparing two policies w.r.t. ≽_lmax(lmin) (resp. ≽_lmin(lmax)) consists in first ordering the two corresponding matrices of trajectories as follows:

• order the elements of each trajectory (i.e. the elements of each line) in increasing order w.r.t. ≽_lmin (resp. in decreasing order w.r.t. ≽_lmax),
• then order all the trajectories: the lines of each policy are arranged lexicographically top-down in decreasing order (resp. top-down in increasing order).

Then, it is enough to lexicographically compare the two new matrices of trajectories, denoted U_δ (resp. U_δ′), element by element. The first pair of different elements determines the best matrix/policy. Note that the ordered matrix U_δ (resp. U_δ′) can be seen as the utility of applying policy δ (resp. δ′) over a length-E horizon.

Example 4. Let us consider the counter-example of Example 3, with the same Π-MDP of Example 2. We consider, once again, the policies δ and δ′ defined by:

• δ(RU) = Sav; δ(PU) = Sav; δ(RF) = Sav;
• δ′(RU) = Adv; δ′(PU) = Sav; δ′(RF) = Sav.

For horizon H = 2:

• δ has 3 trajectories:
  τ_1 = (RU, PU, PU) with v_τ1 = (0.5, 0.2, 0.3, 1, 0.3);
  τ_2 = (RU, RU, PU) with v_τ2 = (0.5, 1, 0.5, 0.2, 0.3);
  τ_3 = (RU, RU, RU) with v_τ3 = (0.5, 1, 0.5, 1, 0.5).

The matrix of trajectories is:

U_δ =
  0.5  0.2  0.3  1    0.3
  0.5  1    0.5  0.2  0.3
  0.5  1    0.5  1    0.5

which, once each line is sorted in increasing order, becomes

  0.2  0.3  0.3  0.5  1
  0.2  0.3  0.5  0.5  1
  0.5  0.5  0.5  1    1

So, the ordered matrix of trajectories (lines arranged best-first) is:

U_δ =
  0.5  0.5  0.5  1    1
  0.2  0.3  0.5  0.5  1
  0.2  0.3  0.3  0.5  1

• δ′ has 2 trajectories:
  τ_4 = (RU, RF, RF) with v_τ4 = (0.5, 1, 0.7, 1, 0.7);
  τ_5 = (RU, RF, RU) with v_τ5 = (0.5, 1, 0.7, 1, 0.5).

The ordered matrix of trajectories is:

U_δ′ =
  0.5  0.7  0.7  1  1
  0.5  0.5  0.7  1  1

Given the two ordered matrices U_δ and U_δ′, δ and δ′ are indifferent for optimistic utility since the two first (i.e. top-left) elements of the matrices are equal, i.e. u_opt(δ) = u_opt(δ′) = 0.5. For lmax(lmin) we compare successively the next elements (left to right, then top to bottom) until we find a pair of different values. In particular, the second element of the first (i.e. the best) trajectory of δ′ is strictly greater than the second element of the first trajectory of δ (0.7 > 0.5). So, the first trajectory of δ′ is strictly preferred to the first trajectory of δ according to ≽_lmin. We deduce that δ′ is strictly preferred to δ:

δ′ ≻_lmax(lmin) δ since (0.5, 0.7, 0.7, 1, 1) ≻_lmin (0.5, 0.5, 0.5, 1, 1).

The following propositions can be shown, concerning the fixed horizon comparison of stationary policies. Note again that the dependence on E is left implicit.

Proposition 1.
If u_opt(δ) > u_opt(δ′) then δ ≻_lmax(lmin) δ′.
If u_pes(δ) > u_pes(δ′) then δ ≻_lmin(lmax) δ′.

Proposition 2. ≽_lmax(lmin) and ≽_lmin(lmax) satisfy the principle of Pareto efficiency.

Now, in order to design dynamic programming algorithms, i.e. to extend the value iteration algorithm to lexicographic comparison, we show that the comparison of policies is a preorder and satisfies the principle of strict monotonicity, defined as follows for any optimization criterion O: ∀δ, δ′, δ′′ ∈ Δ,

δ ≽_O δ′ ⟺ δ + δ′′ ≽_O δ′ + δ′′,

where δ (resp. δ′) and δ′′ denote two disjoint sets of trajectories and δ + δ′′ (resp. δ′ + δ′′) is the set of trajectories that gathers the ones of δ (resp. δ′) and the ones of δ′′.

Then, adding or removing identical trajectories to two sets of trajectories does not change their comparison by ≽_lmax(lmin) (resp. ≽_lmin(lmax)).

Proposition 3. Relations ≽_lmin(lmax) and ≽_lmax(lmin) are complete, transitive and satisfy the principle of strict monotonicity.

Note that u_opt and u_pes satisfy only a weak form of monotonicity, since the addition or the removal of trajectories may transform a strict preference into an indifference if u_opt or u_pes is used.

Let us define the complementary MDP (S, A, π, µ̄) of a given Π-MDP (S, A, π, µ), where µ̄(s) = 1 − µ(s), ∀s ∈ S. The complementary MDP simply gives complementary utilities. From the definitions of ≽_lmax and ≽_lmin, we can check that:

Proposition 4. τ ≽_lmax τ′ ⟺ τ̄′ ≽_lmin τ̄, and δ ≽_lmin(lmax) δ′ ⟺ δ̄′ ≽_lmax(lmin) δ̄,

where τ̄ and δ̄ are obtained by replacing µ with µ̄ in the trajectory/Π-MDP.

Therefore, all the results which we will prove for ≽_lmax(lmin) also hold for ≽_lmin(lmax), if we take care to apply them to complementary policies. Since considering ≽_lmax(lmin) involves less cumbersome expressions (no 1 − ·), we will give the results for this criterion. A consequence of Proposition 4 is that the results hold for the pessimistic criterion as well.

This monotonicity of the lmin(lmax) and lmax(lmin) criteria is sufficient to allow us to use a dynamic programming algorithm such as value iteration or policy iteration [2]. The algorithms we propose in the present paper perform explicit Bellman updates in the lexicographic framework (lines 12–13 of Algorithms 3 and 4, line 11 of Algorithm 5); the correctness of their use is proved in Propositions 6 to 10.

3.3. Basic operations on matrices of trajectories

Before going further, in order to give more explicit and compact descriptions of the algorithms and the proofs, let us introduce the following notations and some basic operations on matrices (typically, on the matrix U(s) representing trajectories issued from state s). Abusing notations slightly, we identify trajectories τ (resp. policies) with their v_τ vectors (resp. matrices of v_τ vectors) when there is no ambiguity. For any matrix U, [U]_{l,c} denotes the restriction of U to its first l lines and first c columns, and U_{i,j} denotes the element at line i and column j.

• Composition: Let U be an a×b matrix and N_1, ..., N_a be a series of a matrices of dimension n_i × c (they all share the same number of columns). The composition of U with (N_1, ..., N_a), denoted U × (N_1, ..., N_a), is a matrix of dimension (Σ_{1≤i≤a} n_i) × (b+c). For any i ≤ a, j ≤ n_i, the ((Σ_{i′<i} n_{i′}) + j)-th line of U × (N_1, ..., N_a) is the concatenation of the i-th line of U and the j-th line of N_i.
The composition U × (N_1, ..., N_a) is done in O(n·m) operations, where n = Σ_{1≤i≤a} n_i and m = b+c. The matrix U(s) of trajectories out of state s when making decision a is typically the concatenation of the matrix U = ((π(s′|s,a), µ(s′)), s′ ∈ succ(s,a)) with the matrices N_{s′} = U(s′). This procedure adds two columns to each matrix U(s′), filled with π(s′|s,a) and µ(s′), the possibility degree and the utility of reaching s′; then the matrices are vertically concatenated to get the matrix U(s) when making decision a. It is then possible to lexicographically compare the resulting matrices in order to get the optimal action in state s (see the sketch after this list).

• Ordering matrices: Let U be an n×m matrix; U^lmaxlmin is the matrix obtained by ordering the elements of the lines of U in increasing order and the lines of U according to lmax in decreasing order (see Example 4). This operation allows us to compare the matrices of trajectories Q(s,a) of every action, in order to choose the optimal decision. The complexity of the operation depends on the sorting algorithm: if we use QuickSort, then ordering the elements within a line is performed in O(m·log(m)) operations, and the inter-ranking of the lines is done in O(n·log(n)·m) operations. Hence, the overall complexity is O(n·m·log(n·m)).

• Comparison of ordered matrices: Given two ordered matrices U^lmaxlmin and V^lmaxlmin, we say that U^lmaxlmin > V^lmaxlmin iff ∃(i,j) such that ∀i′ < i, ∀j′, U^lmaxlmin_{i′,j′} = V^lmaxlmin_{i′,j′}, ∀j′ < j, U^lmaxlmin_{i,j′} = V^lmaxlmin_{i,j′}, and U^lmaxlmin_{i,j} > V^lmaxlmin_{i,j}. U^lmaxlmin ∼ V^lmaxlmin iff they are identical (comparison complexity: O(n·m)). Once the matrices Q(s,a) are ordered, the lexicographic comparison of two decisions is performed by scanning the elements of their matrices, line by line from the first one. The first pair of different values determines the best matrix, and the best corresponding action a is selected (see Example 4).

If the policies (or sub-policies) have different numbers of trajectories, the comparison of two matrices is based on the number of trajectories of the shortest matrix. Two cases may arise:

• If we have a strict preference between the two matrices before reaching the last line of the shortest matrix, we get a strict preference between the policies (or between the sub-policies).
• If we have an indifference up to the last line, the shortest matrix is the best for the lexicographic criterion, since it expresses less uncertainty in the corresponding policy (or in the sub-policy).
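Here is a minimal sketch of the composition operation and of the comparison of ordered matrices, including the shortest-matrix rule above (illustrative Python; order_lmaxlmin is the helper defined after Example 4):

```python
def compose(U, Ns):
    """U x (N_1, ..., N_a): the i-th line of U is prefixed to every line of N_i, all lines stacked."""
    return [list(U[i]) + list(line) for i in range(len(U)) for line in Ns[i]]

def compare_ordered(U, V):
    """Return 1, -1 or 0, scanning two ordered matrices line by line over the shortest one;
    on a complete tie, the matrix with fewer lines (less uncertainty) is preferred."""
    for row_u, row_v in zip(U, V):
        if row_u != row_v:
            return 1 if row_u > row_v else -1
    if len(U) != len(V):
        return 1 if len(U) < len(V) else -1
    return 0

# one line (pi(s'|s,a), mu(s')) per successor, composed with that successor's matrix:
print(compose([[1.0, 0.7], [0.2, 0.3]], [[[0.7, 1.0]], [[0.3, 1.0]]]))
# [[1.0, 0.7, 0.7, 1.0], [0.2, 0.3, 0.3, 1.0]]
```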

3.4. Bounded iterations lexicographic value iteration

In this section, we propose an iterative, value iteration-type algorithm (Algorithm 3). This algorithm follows the same principle as in the possibilistic case (Eqs. (6)–(9)). Repeated Bellman updates are performed successively E times. This algorithm will provide an approximation of a lexi-optimal strategy in the infinite horizon case (by considering the policy returned for the first time step). This algorithm is sub-optimal for any fixed E, but we will see in Section 4 that, letting E grow, an optimal lexicographic policy will be obtained for finite E.

We propose two versions of the value iteration algorithm: the first one computes the optimal policy with respect to the lmax(lmin) criterion and the second one provides the optimal policy with respect to the lmin(lmax) criterion. In this paper, we present and detail only the first algorithm, since the second is very similar.²

Algorithm 3: Bounded iterations lmax(lmin)-value iteration (BI-VI).
Data: A possibilistic MDP and maximum number of iterations E
Result: The δ_E strategy obtained after E iterations
1  begin
2    e ← 0;
3    foreach s ∈ S do U(s) ← ((µ(s)));
4    foreach s ∈ S, a ∈ A do
5      TU_{s,a} ← T_{s,a} × ((µ(s′)), s′ ∈ succ(s,a));
6    repeat
7      e ← e + 1;
8      foreach s ∈ S do
9        U_old(s) ← U(s);
10       Q ← ((0));
11       foreach a ∈ A do
12         Future ← (U_old(s′), s′ ∈ succ(s,a)); // Gather the matrices provided by the successors of s
13         Q(s,a) ← (TU_{s,a} × Future)^lmaxlmin;
14         if Q ≺_lmaxlmin Q(s,a) then
15           Q ← Q(s,a);
16           δ(s) ← a
17       U(s) ← Q(s,δ(s))
18   until e == E;
19   δ(s) ← arg max_a Q(s,a)
20   return δ_E = δ;

This algorithm is an iterative procedure that performs a prescribed number of updates, E, of the utility of each state, represented by a finite matrix of trajectories, using the utilities of the neighboring states.

At stage 1 ≤ e ≤ E, the procedure updates the utility of every state s ∈ S as follows:

• For each action a ∈ A, a matrix Q(s,a) is built to evaluate the "utility" of performing a in s at stage e: this is done by combining TU_{s,a} (combination of the transition matrix T_{s,a} = π(·|s,a) and the utilities µ(s′) of the states s′ that may follow when a is executed) with the matrices U_old(s′) of trajectories provided by these s′ at the previous stage. The matrix Q(s,a) is then ordered (the operation is made less complex by the fact that the matrices U_old(s′) have already been ordered at stage e−1).
• The lmax(lmin) comparison is performed on the fly to memorize the best Q(s,a).
• The value of state s at stage e, U(s), is the one given by the action a which provides the best Q(s,a). δ is updated, U is memorized (and U_old can be discarded).

Time and space complexities of this algorithm are nevertheless expensive, since it eventually memorizes all the trajectories. At each step e the size of a matrix may grow to b^e · (2·e+1), where b is the maximal number of possible successors of an action; the overall complexity of the algorithm is O(|S| · |A| · E · b^E), which is a problem.

Algorithm 3 is provided with a number of iterations, E. Does it converge when E tends to infinity? That is, are the returned policies identical for any E exceeding a given threshold? Before answering (positively) this question in Section 4.4, we are going to define bounded utility matrices solutions to lexicographic possibilistic MDPs. These solution concepts will be useful to answer the above question.

4. Bounded utility solutions to lexicographic Π-MDPs

We have just proposed a lexicographic value iteration algorithm for the computation of lexicographic policies based on the whole matrices of trajectories. As a consequence, the spatial/temporal complexity of the algorithm is exponential in the number of iterations. This section presents an alternative way to get lexicographic policies. Rather than limiting the size of the matrices of trajectories by limiting the number of iterations, we propose to "forget" the less significant part of the matrices of utility and to decide only on the basis of the most significant (l,c) sub-matrices – we "bound" the utility matrices. We propose in the present section two algorithms based on this idea, namely a value iteration and a policy iteration algorithm.

4.1. Bounded lexicographic comparisons of utility matrices

Recall that, for any matrix U, [U]_{l,c} denotes the restriction of U to its first l lines and first c columns. Notice now that, at any stage e and for any state s, [U(s)]_{1,1} (i.e. the top-left value in U(s)) is precisely equal to u_opt(s). We have seen that making the choices on this basis is not discriminant enough. On the other hand, taking the whole matrix into account is discriminant, but exponentially costly. Hence the idea of considering more than one line and one column, but less than the whole matrix – namely the first l lines and c columns of U_t(s)^lmaxlmin; hence the definition of the following preference:

δ ≥_{lmaxlmin,l,c} δ′ iff [δ^lmaxlmin]_{l,c} ≥ [δ′^lmaxlmin]_{l,c}.    (18)

≥_{lmaxlmin,1,1} corresponds to ≽_opt and ≥_{lmaxlmin,+∞,+∞} corresponds to ≥_lmaxlmin.

The following proposition shows that this approach is sound and that ≻_{lmaxlmin,l,c} refines u_opt:

Proposition 5.
For any l, l′, c such that l′ > l, δ ≻_{lmaxlmin,l,c} δ′ ⇒ δ ≻_{lmaxlmin,l′,c} δ′.
For any l, c, δ ≻_opt δ′ ⇒ δ ≻_{lmaxlmin,l,c} δ′.

In other words, the order over the policies is refined, for a fixed c, when l increases. It tends to ≻_lmaxlmin when c = 2·E+1 and l tends to b^E.

Notice that the combinatorial explosion is due to the number of lines (the number of columns is bounded by 2·E+1), hence we shall bound the number of considered lines only.

Up to this point, the comparison by ≥_{lmaxlmin,l,c} is made on the basis of the first l lines and c columns of the full matrices of trajectories. This obviously does not reduce their size. The important following proposition allows us to make the (l,c) reduction of the ordered matrices at each step (after each composition), and not only at the very end, thus keeping space and time complexities polynomial.

Proposition 6. Let U be an a×b matrix and N_1, ..., N_a be a series of a matrices of dimension a_i × c. It holds that:

[(U × (N_1, ..., N_a))^lmaxlmin]_{l,c} = [(U × ([N_1^lmaxlmin]_{l,c}, ..., [N_a^lmaxlmin]_{l,c}))^lmaxlmin]_{l,c}.
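In code, the (l,c) bound is just a slice of an ordered matrix, and Proposition 6 says it can be applied after every composition rather than only at the very end. A sketch (reusing order_lmaxlmin and compose from the earlier sketches; the names are ours):

```python
def bound(matrix, l, c):
    """[U]_{l,c}: keep the first l lines and the first c columns of an ordered matrix."""
    return [row[:c] for row in matrix[:l]]

def bounded_update(TU, futures, l, c):
    """One bounded update: truncate the successors' matrices, compose, reorder, truncate again.
    By Proposition 6 this yields the same (l,c) sub-matrix as composing the full matrices."""
    futures = [bound(order_lmaxlmin(F), l, c) for F in futures]
    return bound(order_lmaxlmin(compose(TU, futures)), l, c)
```

With l = c = 1 the update keeps a single value per state, which is exactly the optimistic utility mentioned above.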

4.2. Bounded utility lexicographic value iteration

It is now easy to design a generalization of the possibilistic value iteration algorithm (Algorithm 1) by keeping a submatrix of each current value matrix – namely the first l lines and c columns. We call this algorithm Bounded Utility Value Iteration (BU-VI) (see Algorithm 4).

Algorithm 4: Bounded Utility lmax(lmin) Value Iteration (BU-VI).
Data: A possibilistic MDP, bounds (l,c); δ, the policy built by the algorithm, is a global variable
Result: A policy δ optimal for ≽_{lmaxlmin,l,c}
1  begin
2    foreach s ∈ S do U(s) ← ((µ(s)));
3    foreach s ∈ S, a ∈ A do
4      TU_{s,a} ← T_{s,a} × ((µ(s′)), s′ ∈ succ(s,a));
5    repeat
6      foreach s ∈ S do
7        U_old(s) ← U(s);
8        Q ← ((0));
9        foreach a ∈ A do
10         // Gather the matrices provided by the successors of s
11         Future ← (U_old(s′), s′ ∈ succ(s,a));
12         Q(s,a) ← [(TU_{s,a} × Future)^lmaxlmin]_{l,c};
13         if Q ≺_lmaxlmin Q(s,a) then
14           Q ← Q(s,a);
15           δ(s) ← a
16       U(s) ← Q(s,δ(s))
17   until U(s) == U_old(s) for each s;
18   return δ;

When the horizon of the MDP is finite, this algorithm provides in polynomial time a policy that is always at least as good as the one provided by u_opt (according to lmax(lmin)) and tends to lexicographic optimality when c = 2·E+1 and l tends to b^E.

Let us now study the time complexity. The number of iterations is bounded by the size of the set of possible matrices of trajectories, which is in O(|S| · |A| · E). One iteration of the algorithm requires composition, ordering and comparison operations on b matrices of size (l,c). Since the composition and comparison of matrices are linear operations, the complexity of one iteration in the worst case is in b · (l·c) · log(l·c). Therefore, the complexity of the algorithm is in O(|S| · |A| · E · b · (l·c) · log(l·c)).

When the horizon of the MDP is not finite, equations (16) and (17) are not enough to rank-order the policies. The length of the trajectories may be infinite, as well as their number. This problem is well known in classical probabilistic MDPs, where a discount factor is used to attenuate the influence of later utility degrees – thus allowing the convergence of the algorithm [21]. On the contrary, classical Π-MDPs do not need any discount factor and Value Iteration, based on the evaluation for l = c = 1, converges in the infinite horizon case [22]. In a sense, this limitation to l = c = 1 plays the role of a discount factor – but a very drastic one. Extending the comparison by using ≥_{lmaxlmin,l,c} with larger (l,c), as shown below, allows the use of a less drastic discount.

In other terms, ≥_{lmaxlmin,l,c} can be used in the infinite case, as shown by the following proposition.

Proposition 7 (Bounded utility lmax(lmin)-policy evaluation converges). Let U_t(s) be the matrix issued from s at instant t when a strategy δ is executed. It holds that:

∀l, c, ∃t* such that ∀t ≥ t*, [(U_t)^lmaxlmin]_{l,c}(s) = [(U_{t*})^lmaxlmin]_{l,c}(s), ∀s.

Hence there exists a stage t* where the value of a policy becomes stable if computed with the bounded utility lmax(lmin) evaluation algorithm. This criterion is thus soundly defined and can be used in the infinite horizon case (and of course in the finite horizon case).

The number of iterations of Algorithm 4 is not explicitly bounded, but the convergence of the algorithm is guaranteed – this is a direct consequence of Proposition 7.

Corollary 1 (Bounded utility lmax(lmin)-value iteration converges). ∀l, c, ∃t* such that, ∀t ≥ t*, [(U_t)^lmaxlmin]_{l,c}(s) = [(U_{t*})^lmaxlmin]_{l,c}(s), ∀s.

The overall complexity of bounded utility lmax(lmin)-value iteration (Algorithm 4) is bounded by O(|S| · |A| · |L| · b · (l·c) · log(l·c)).

4.3. Bounded utility lexicographic policy iteration

In Ref. [17], Howard shows that a policy often becomes optimal long before the convergence of the value estimates. That is why Puterman [21] has proposed a policy iteration algorithm. This algorithm has been adapted to possibilistic MDPs by [22].

Likewise, we propose a (bounded utility) lexicographic policy iteration algorithm (Algorithm 5), denoted here BU-PI, that alternates improvement and evaluation phases, as any policy iteration algorithm.

Algorithm 5: lmax(lmin)-Bounded Utility Policy Iteration.
Data: A possibilistic MDP, bounds (l,c)
Result: A policy δ* optimal when l, c grow
1  begin
2    // Arbitrary initialization of δ on S
3    foreach s ∈ S do δ(s) ← choose any a ∈ A_s;
4    repeat
5      // Evaluation of δ
6      foreach s ∈ S do U(s) ← ((µ(s)));
7      repeat
8        foreach s ∈ S do
9          U_old(s) ← U(s);
10         // Gather the matrices of the successors of s given δ
11         Future ← (U(s′), s′ ∈ succ(s, δ(s)));  U(s) ← [(TU_{s,δ(s)} × Future)^lmaxlmin]_{l,c};
12       until U(s) == U_old(s) for each s;
13     δ_old ← δ;
14     // Improvement of δ
15     foreach s ∈ S do
16       // Compute the utility of the strategy playing a (for each a), given what was chosen for the other states
17       foreach a ∈ A do
18         Future ← (U(s′), s′ ∈ succ(s,a));  Q(s,a) ← [(TU_{s,a} × Future)^lmaxlmin]_{l,c};
19       // Update the choice of an action for s
20       δ(s) ← arg max^{lmax(lmin)}_{a∈A} Q(s,a);
21   until δ == δ_old;
22   return δ;

In line 3 of Algorithm 5, an arbitrary initial policy is chosen. The algorithm then proceeds by evaluating the current policy, through successive updates of the value function (lines 8 to 11); the convergence of this evaluation is easily derived from that of the bounded utility lmax(lmin)-value iteration algorithm. Then the algorithm enters the improvement phase: lines 17–18 compute Q(s,a), the (bounded lexicographic) utility of playing action a in state s and then applying policy δ_old in subsequent states (the policy computed during the last iteration); as usual in Policy Iteration style algorithms, the updated policy (δ) is then obtained by greedily improving the current action, which is done in line 20. Since the actions considered at line 20 do include the one prescribed by δ_old, either nothing is changed, and the algorithm stops, or the new policy, δ, is better than the previous one δ_old.

Proposition 8. Bounded utility lmax(lmin)-policy iteration converges to an optimal policy for ≽_{lmaxlmin,l,c} in finite time.

Policy iteration (Algorithm 5) converges and is guaranteed to find a policy optimal for the (l,c) lexicographic criterion in finite time, and usually in a few iterations. As for the algorithmic complexity of the classical, stochastic, policy iteration algorithm (which is still not well understood [16]), a tight worst-case complexity bound for lexicographic policy iteration is hard to obtain. Therefore, we provide an upper bound of this complexity.

The policy iteration algorithm never visits a policy twice: in the worst case, the number of iterations before convergence is exponential but it is dominated by the number of distinct policies. So, the complexity of this algorithm is dominated by O(|A|^|S|). Besides, each iteration has a cost: the evaluation phase relies on a bounded utility value iteration algorithm that costs O(|S| · |A| · |L| · b · (l·c) · log(l·c)) when many actions are possible at a given step, and costs O(|S| · |L| · b · (l·c) · log(l·c)) here because one action is selected (by the current policy) for each state. Thus, the overall complexity of the algorithm is in O(|A|^|S| · |S| · |L| · b · (l·c) · log(l·c)).

4.4. Back to lexicographic value iteration: from finite to infinite horizon Π-MDPs

The bounded iterations algorithm defined in Section 3 (Algorithm 3, BI-VI) can be used for both finite horizon and infinite horizon MDPs, because it fixes a number of iterations E; if E is low, the policy reached is not necessarily optimal – the algorithm is an approximation algorithm.

Now, exploiting the above propositions, we are able to show that the bounded iterations lmax(lmin) value iteration algorithm (Algorithm 3) converges when E tends to infinity. To do so, we first prove the following proposition:

Proposition 9. Let an arbitrary stationary Π-MDP be given. Then, there exist two positive natural numbers (l*, c*) such that, for any pair (δ, δ′) of arbitrary policies, any state s ∈ S, and any pair (l,c) such that l ≥ l* and c ≥ c*,

δ(s) ≻_{lmaxlmin,l,c} δ′(s) ⟺ δ(s) ≻_{lmaxlmin,l*,c*} δ′(s).

Now, this proposition can be used to prove the convergence of the bounded iterations lmax(lmin)-value iteration algorithm. For this, let us define ≻*_lmaxlmin =_def ≻_{lmaxlmin,l*,c*}, the unique preference relation between policies that results from Proposition 9.

Proposition 10. If we let δ_E be the policy returned by Algorithm 3 for any fixed E, we can show that the sequence (δ_E) converges and that there exists a finite E* such that:

lim_{E→∞} δ_E = δ_{E*}.

Furthermore, δ_{E*} is optimal with respect to ≻*_lmaxlmin.

The sequence of policies obtained by BI-VI (Algorithm 3) when E tends to infinity thus converges. Furthermore, the limit is attained for a finite (but unknown in advance) E. Alternately, it is also attained by the BU-VI and BU-PI algorithms, with finite but unknown (l,c).

Now, let us summarize the theoretical results that we have obtained so far. We have shown that possibilistic utilities (optimistic and pessimistic) are special cases of bounded lexicographic utilities, which can be represented by matrices. Possibilistic utilities are obtained when l = c = 1.

The possibilistic value iteration and policy iteration algorithms can be extended to compute policies which are optimal according to ≻_{lmaxlmin,l,c}.

Finally, if infinite horizon lexicographic optimal policies are defined as the limiting policies obtained from a non-bounded lexicographic value iteration algorithm, we have shown that such policies can be computed by applying our bounded utility lmax(lmin) value iteration algorithm and that only a finite number of iterations (even though not known in advance) is required.

5. Experiments

In order to evaluate the previous algorithms, we propose, in the following, two experimental analyses: in the first one we compare the bounded iterations value iteration algorithm (Algorithm 3) with the bounded utility one, and in the second we compare the bounded utility lexicographic policy iteration algorithm with the bounded utility lexicographic value iteration one. The algorithms have been implemented in Java and the experiments have been performed on an Intel Core i5 processor computer (1.70 GHz) with 8 GB DDR3L of RAM.

5.1. Bounded utility vs bounded iterations value iteration

Experimental protocol. We now compare the performance of bounded utility lexicographic value iteration (BU-VI) as an approximation of lexicographic value iteration (BI-VI) for finite horizon problems, in the lmax(lmin) variant. Because the horizon is finite, the number of steps of BI-VI can be set equal to the horizon and the algorithm provides a solution optimal according to lmax(lmin). BU-VI, on the other side, limits the size of the matrices, and can lead to sub-optimal solutions.

We evaluate the performance of the algorithms by carrying out simulations on randomly generated finite horizon Π-MDPs with 25 states – we generate five series of problems, letting E vary from 5 to 25. The number of actions in each state is equal to 4. The output of each action is a distribution on two states randomly sampled (i.e. the branching factor is equal to 2). The utility values are uniformly randomly sampled in the set L = {0.1, 0.3, 0.5, 0.7, 1}. Conditional possibilities relative to decisions should be normalized; to this end, one choice is fixed to possibility degree 1 and the possibility degree of the other one is uniformly sampled in L. For each experiment, 100 Π-MDPs are generated. The two algorithms are compared w.r.t. two measures: (i) CPU time and (ii) pairwise success rate (Success), i.e. the percentage of optimal solutions provided by BU-VI with fixed (l,c) w.r.t. the lmax(lmin) criterion in its full generality. The higher Success, the more effective the cutting of matrices with BU-VI; the lower this rate, the more important the drowning effect.
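For reference, random instances of this kind are easy to reproduce; the following generator is our own reconstruction of the protocol, not the authors' code:

```python
import random

L = [0.1, 0.3, 0.5, 0.7, 1]

def random_pi_mdp(n_states=25, n_actions=4, branching=2):
    """Random stationary possibilistic MDP in the spirit of the experimental protocol."""
    S = list(range(n_states))
    mu = {s: random.choice(L) for s in S}          # utilities uniformly drawn from L
    pi = {}
    for s in S:
        pi[s] = {}
        for a in range(n_actions):
            succ = random.sample(S, branching)     # two random successors
            # normalization: one successor is totally possible, the other degree is drawn from L
            pi[s][a] = {succ[0]: 1.0, succ[1]: random.choice(L)}
    return S, mu, pi
```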


Fig. 2. Bounded utility lexicographic value iteration vs lexicographic value iteration.

Table 1
Average CPU time (in seconds) and average number of iterations.

Bounded utility policy iteration
(l,c)                          (2,2)   (4,4)   (6,6)   (10,10)
CPU time (s)                   0.029   0.042   0.064   0.091
Average number of iterations   3.2     4.33    5.6     9.7

Bounded utility value iteration
(l,c)                          (2,2)   (4,4)   (6,6)   (10,10)
CPU time (s)                   0.03    0.052   0.082   0.1
Average number of iterations   6.75    9.25    16.11   20.2

Results. Fig. 2(a) presents the average execution CPU time for the two algorithms. Obviously, for both BI-VI and BU-VI, the execution time increases with the horizon. Also, we observe that the CPU time of BU-VI increases according to the values of (l,c) but remains affordable, as the maximal CPU time is lower than 1 s for MDPs with 25 states and 4 actions when (l,c) = (40,40) and E = 25. Unsurprisingly, we can check that BU-VI (regardless of the values of (l,c)) is faster than BI-VI, especially when the horizon increases: the manipulation of (l,c)-matrices is obviously less expensive than that of full matrices. The saving increases with the horizon.

As for the success rate, the results are described in Fig. 2(b). It appears that BU-VI provides a very good approximation, especially when increasing (l,c). It provides the same optimal solution as BI-VI in about 90% of cases with (l,c) = (200,200). Moreover, even when the success rate of BU-VI decreases (when E increases), the quality of the approximation is still good: never less than 70% of optimal actions returned, with E = 25. These experiments conclude in favor of bounded value iteration: the quality of its approximate solutions is comparable with that of the unbounded version for high (l,c) and increases when (l,c) increase, while it is much faster.

5.2. Bounded utility lexicographic policy iteration vs bounded utility lexicographic value iteration

Experimental protocol. In what follows we evaluate the performances of bounded utility lexicographic policy iteration (BU-PI) and bounded utility lexicographic value iteration (BU-VI), in the lmax(lmin) variant. We evaluate the performance of the algorithms on randomly generated Π-MDPs like those of Section 5.1, with |S| = 25 and |A_s| = 4, ∀s.

We ran the two algorithms for different values of (l,c) (100 Π-MDPs are considered in each sample). For each of the two algorithms we measure the CPU time needed to converge. We also measure the average number of value iterations for BU-VI and the average number of policy iterations for BU-PI.

Results. Table 1 presents the average execution CPU time and the average number of iterations for the two algorithms.

Obviously, for both BU-PI and BU-VI, the execution time increases according to the values of (l,c) but it remains affordable, as the maximal CPU time is lower than 0.1 s for MDPs with 25 states and 4 actions when (l,c) = (10,10). It appears that BU-PI (regardless of the values of (l,c)) is slightly faster than BU-VI.

Consider now the number of iterations. At each iteration, BU-PI considers one policy, explicitly, and updates it at line 20. And so does value iteration: for each state, the current policy is updated at line 15. Table 1 shows that BU-PI always considers fewer policies than BU-VI. This experiment provides empirical evidence in favor of policy iteration over value iteration, as the former converges to the approximate solution faster. However, this conclusion may vary with the experiments, so both algorithms are worth considering when tackling a given problem.

