
HAL Id: hal-02124080

https://hal.archives-ouvertes.fr/hal-02124080

Submitted on 9 May 2019


Lexicographic refinements in stationary possibilistic Markov Decision Processes

Nahla Ben Amor, Zeineb El Khalfi, Hélène Fargier, Régis Sabbadin


This is a publisher's version published in: http://oatao.univ-toulouse.fr/22626

To cite this version: Ben Amor, Nahla and El Khalfi, Zeineb and Fargier, Hélène and Sabbadin, Régis. Lexicographic refinements in stationary possibilistic Markov Decision Processes. (2018) International Journal of Approximate Reasoning, 103, 343-363. ISSN 0888-613X

Official URL / DOI: https://doi.org/10.1016/j.ijar.2018.10.011

Open Archive Toulouse Archive Ouverte (OATAO)


Lexicographic refinements in stationary possibilistic Markov Decision Processes

Nahla Ben Amor a,∗, Zeineb El Khalfi a,b,∗∗, Hélène Fargier b,∗, Régis Sabbadin c,∗

a LARODEC, University of Tunis, Tunisia
b IRIT, UPS-CNRS, Université de Toulouse 3, 118 route de Narbonne, F-31062 Toulouse, France
c MIAT, UR 875, Université de Toulouse, INRA, F-31320 Castanet-Tolosan, France

Abstract

Keywords: Markov Decision Process; Possibility theory; Lexicographic comparisons; Possibilistic qualitative utilities

Possibilistic Markov Decision Processes offer a compact and tractable way to represent and solve problems of sequential decision under qualitative uncertainty. Even though appealing for its ability to handle qualitative problems, this model suffers from the drowning effect that is inherent to possibilistic decision theory. The present paper1 proposes to escape the drowning effect by extending to stationary possibilistic MDPs the lexicographic preference relations defined by Fargier and Sabbadin [13] for non-sequential decision problems. We propose a value iteration algorithm and a policy iteration algorithm to compute policies that are optimal for these new criteria. The practical feasibility of these algorithms is then experimented on different samples of possibilistic MDPs.

This paper is part of the Virtual special issue on the 14th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty (ECSQARU 2017), edited by Alessandro Antonucci, Laurence Cholvy and Odile Papini.
* Corresponding authors.
** Corresponding author at: IRIT, UPS-CNRS, 118 route de Narbonne, 31062 Toulouse, France.
E-mail addresses: nahla.benamor@gmx.com (N. Ben Amor), zeineb.khalfi@gmail.com (Z. El Khalfi), fargier@irit.fr (H. Fargier), regis.sabbadin@inra.fr (R. Sabbadin).
1 This paper is an extended and revised version of two conference papers [4,5]. It includes the full proofs of the propositions presented in these preliminary papers, new algorithms (based on policy iteration) and new experiments.

1. Introduction

The classical paradigm for sequential decision making under uncertainty is the expected utility-based Markov Decision Processes (MDPs) framework [3,21], which assumes that the uncertain effects of actions can be represented by probability distributions and that utilities are additive. But the EU model does not suit problems where uncertainty and preferences are ordinal in essence.

Alternatives to the EU-based model have been proposed to handle ordinal preferences/uncertainty. Remaining within the probabilistic, quantitative, framework while considering ordinal preferences has led to quantile-based approaches [15,18,27,29,33]. Purely ordinal approaches to sequential decision under uncertainty have also been considered. In particular, possibilistic MDPs [1,6,22,24] form a purely qualitative decision model with an ordinal evaluation of plausibility and preference. In this model, uncertainty about the consequences of actions is represented by possibility distributions and utilities are also ordinal. The decision criteria are either the pessimistic qualitative utility or its optimistic counterpart [9]. Such degrees can be either elicited from experts, or obtained by automatic learning approaches [23]. However, it is now well known that possibilistic decision criteria suffer from a drowning effect [13]: plausible enough bad or good consequences may completely blur the comparison between policies that would otherwise be clearly differentiable.

In [13], Fargier and Sabbadin have proposed lexicographic refinements of possibilistic criteria for the one-step decision case, in order to remedy the drowning effect. This work has recently been extended to (finite horizon) possibilistic decision trees [4]. In the present paper, we propose to study the interest of the lexicographic preference relations for stationary possibilistic Markov Decision Processes, a model that is more compact than decision trees and not limited to a finite horizon.

The paper is structured as follows: the next section recalls the background about possibilistic decision theory and stationary possibilistic MDPs, including the drowning effect problem. Section 3 defines the lexicographic comparison of policies and presents a value iteration algorithm which computes a nearly optimal strategy in a limited number of iterations. Then, Section 4 proposes a lexicographic value iteration algorithm and a lexicographic policy iteration algorithm using approximation of utility functions. Lastly, Section 5 presents our experimental results.

2. Background and notations

2.1. Basics of possibilistic decision theory

Most available decision models refer to probability theory for the representation of uncertainty [20,25]. Despite its success, probability theory is not appropriate when numerical information is not available. When information about uncertainty cannot be quantified in a probabilistic way, possibility theory [8,34] is a natural field to consider. The basic component of this theory is the notion of possibility distribution. It is a representation of a state of knowledge of an agent about the state of the world. A possibility distribution π is a mapping from the universe of discourse S (the set of all the possible worlds) to a bounded linearly ordered scale L exemplified (without loss of generality) by the unit interval [0, 1]; we denote the function by π : S → [0, 1].

For a state s ∈ S, π(s) = 1 means that realization s is totally possible and π(s) = 0 means that s is an impossible state. It is generally assumed that there exists at least one state s which is totally possible: π is then said to be normalized.

In the possibilistic framework, extreme forms of knowledge can be captured, namely:

• Complete knowledge, i.e. ∃s s.t. π(s) = 1 and ∀s′ ≠ s, π(s′) = 0.
• Total ignorance, i.e. ∀s ∈ S, π(s) = 1 (all values in S are possible).

From π one can compute the possibility measure Π(A) and the necessity measure N(A) of any event A ⊆ S:

Π(A) = sup_{s ∈ A} π(s),    N(A) = 1 − Π(Ā) = 1 − sup_{s ∉ A} π(s).

Measure Π(A) evaluates to which extent A is consistent with the knowledge represented by π, while N(A) corresponds to the extent to which ¬A is impossible and thus evaluates at which level A is certainly implied by the knowledge.

In decision theory, acts are functions f : S → X, where X is a finite set of outcomes. In possibilistic decision making, an act f can be viewed as a possibility distribution π_f over X [9], where π_f(x) = Π(f⁻¹(x)). In a single stage decision making problem, a utility function u : X → U maps outcomes to utility values in a totally ordered scale U = {u_1, ..., u_n}. This function models the attractiveness of each outcome for the decision-maker.

Under the assumption that the utility scale and the possibility scale are commensurate and purely ordinal (i.e. U = L), Dubois and Prade [9,7] have proposed pessimistic and optimistic decision criteria.

First, the pessimistic criterion was originally proposed by Whalen [30] and it generalizes the Wald criterion [28]. It suits cautious decision makers who are happy when bad consequences are hardly plausible. It summarizes to what extent it is certain (i.e. necessary according to measure N) that the act reaches a good utility. The definition of the pessimistic criterion is as follows [10]:

Definition 1. Given a possibility distribution π over a set of states S and a utility function u on the set of consequences X, the pessimistic utility of an act f is defined by:

u_pes(f) = min_{x_j ∈ X} max(u(x_j), 1 − π_f(x_j)) = min_{s_i ∈ S} max(u(f(s_i)), 1 − π(s_i)).    (1)

Therefore, we can compare two acts f and g on the basis of their pessimistic utilities:

f ⪰_{u_pes} g ⇔ u_pes(f) ≥ u_pes(g).


The second criterion is the optimistic possibilistic criterion originally proposed by Yager [32,31]. This criterion captures the behavior of an adventurous decision maker who is happy as soon as at least one good consequence is highly plausible. It summarizes to what extent it is possible that an act reaches a good utility. The definition of this criterion is as follows [10]:

Definition 2. Given a possibility distribution π over a set of states S and a utility function u on a set of consequences X, the optimistic utility of an act f is defined by:

u_opt(f) = max_{x_j ∈ X} min(u(x_j), π_f(x_j)) = max_{s_i ∈ S} min(u(f(s_i)), π(s_i)).    (2)

Hence, we can compare two acts f and g on the basis of their optimistic utilities:

f ⪰_{u_opt} g ⇔ u_opt(f) ≥ u_opt(g).

Example 1. Let S = {s1, s2} and let f and g be two acts whose utilities of consequences in the states s1 and s2 are listed in the following table, as well as the degrees of possibility of s1 and s2:

            s1    s2
u(f(s))     0.3   0.5
u(g(s))     0.4   0.6
π           1     0.2

Comparing f and g with respect to the pessimistic criterion, we get:

u_pes(f) = min(max(0.3, 0), max(0.5, 0.8)) = 0.3,
u_pes(g) = min(max(0.4, 0), max(0.6, 0.8)) = 0.4.

Thus, g ⪰_{u_pes} f.

Let us now compare the two acts with respect to the optimistic criterion:

u_opt(f) = max(min(0.3, 1), min(0.5, 0.2)) = 0.3,
u_opt(g) = max(min(0.4, 1), min(0.6, 0.2)) = 0.4.

Thus, g ⪰_{u_opt} f.
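To make the two criteria concrete, here is a minimal Python sketch (not part of the paper, whose experiments use a Java implementation) that reproduces the computations of Example 1; the dictionary-based encoding of acts and possibility degrees is an illustrative assumption.

```python
# Pessimistic and optimistic possibilistic utilities (Definitions 1 and 2),
# checked on the data of Example 1.

def u_pes(act, pi):
    """u_pes(f) = min_s max(u(f(s)), 1 - pi(s))."""
    return min(max(act[s], 1 - pi[s]) for s in pi)

def u_opt(act, pi):
    """u_opt(f) = max_s min(u(f(s)), pi(s))."""
    return max(min(act[s], pi[s]) for s in pi)

pi = {"s1": 1.0, "s2": 0.2}   # possibility degrees of the states
f  = {"s1": 0.3, "s2": 0.5}   # u(f(s)) for each state
g  = {"s1": 0.4, "s2": 0.6}   # u(g(s)) for each state

print(u_pes(f, pi), u_pes(g, pi))   # 0.3 0.4 -> g preferred pessimistically
print(u_opt(f, pi), u_opt(g, pi))   # 0.3 0.4 -> g preferred optimistically
```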

It is important to note that while transition probabilities can be estimated through simulations of the process, transition possibilities may not. On the other hand, experts may be involved for the elicitation of the possibility degrees and utilities of transitions. In the possibilistic framework, utility and uncertainty levels can be elicited jointly, by comparison of possibilistic lotteries, for example (e.g. by using certainty equivalents, as in [11]). Simulation can also be used jointly with expert evaluation when the underlying process is too costly to simulate a large number of times: simulation may be used to generate samples on which expert elicitation is applied. Another option is to use a possibilistic reinforcement learning procedure (for more details see [23]), in particular a model-based reinforcement learning algorithm. The latter uses a uniform simulation of trajectories (with random choice of actions) in order to generate an approximation of the possibilistic decision model.

2.2. Stationary Possibilistic Markov Decision Processes

A stationary Possibilistic Markov Decision Process (ΠMDP) [22] is defined by:

• A finite set S of states;
• A finite set A of actions; A_s denotes the set of actions available in state s;
• A possibilistic transition function: for each action a ∈ A_s and each state s ∈ S, the possibility distribution π(s′|s, a) evaluates to what extent each s′ is a possible successor of s when action a is applied;
• A utility function µ: µ(s) is the intermediate satisfaction degree obtained in state s.


Fig. 1. The stationary ΠMDP of Example 2.

Example 2. Let us suppose that a "Rich and Unknown" person runs a startup company. Initially, s/he must choose between Saving money (Sav) or Advertising (Adv) and may then get Rich (R) or Poor (P) and Famous (F) or Unknown (U). In the other states, Sav is the only possible action. Fig. 1 shows the stationary ΠMDP that captures this problem, formally described as follows:

S = {RU, RF, PU},
A_RU = {Adv, Sav}, A_RF = A_PU = {Sav},
π(PU|RU, Sav) = 0.2,
π(RU|RU, Sav) = π(RF|RU, Adv) = π(RF|RF, Sav) = π(RU|RF, Sav) = 1,
µ(RU) = 0.5, µ(RF) = 0.7, µ(PU) = 0.3.
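For illustration, the ΠMDP of Example 2 can be encoded with plain Python dictionaries as in the following sketch (an assumption of this presentation, not the paper's implementation). The self-loop π(PU|PU, Sav) = 1 is not listed above; it is only inferred here from trajectory τ1 of Example 3 below.

```python
# A possible plain-Python encoding of the stationary ΠMDP of Example 2.

S = ["RU", "RF", "PU"]
A = {"RU": ["Adv", "Sav"], "RF": ["Sav"], "PU": ["Sav"]}

# pi[(s, a)] maps each possible successor s' to its possibility degree
pi = {
    ("RU", "Sav"): {"RU": 1.0, "PU": 0.2},
    ("RU", "Adv"): {"RF": 1.0},
    ("RF", "Sav"): {"RF": 1.0, "RU": 1.0},
    ("PU", "Sav"): {"PU": 1.0},   # assumption: inferred from Example 3
}

mu = {"RU": 0.5, "RF": 0.7, "PU": 0.3}   # intermediate satisfaction degrees
```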

Solving a stationary MDP consists in finding a (stationary) policy, i.e. a function δ : S → A, which is optimal with respect to a decision criterion. In the possibilistic case, as in the probabilistic case, the value of a policy depends on the utility and on the likelihood of its trajectories. Formally, let Δ be the set of all policies that can be built for the ΠMDP (the set of all the functions that associate an element of A_s to each s). Each δ ∈ Δ defines a list of scenarios called trajectories. Each trajectory τ is a sequence of states and actions, i.e. τ = (s_0, a_0, s_1, ..., s_{t−1}, a_{t−1}, s_t).

To simplify notations, we will associate the vector v_τ = (µ_0, π_1, µ_1, π_2, ..., π_t, µ_t) to each trajectory τ, where π_{i+1} = π(s_{i+1}|s_i, a_i) is the possibility degree to reach the state s_{i+1} at t = i + 1 when applying the action a_i at t = i, and µ_i = µ(s_i) is the utility obtained in the i-th state s_i of the trajectory.

The possibility and the utility of trajectory τ, given that δ is applied from s_0, are defined by:

π(τ | s_0, δ) = min_{i=1...t} π(s_i | s_{i−1}, δ(s_{i−1}))   and   µ(τ) = min_{i=0...t} µ(s_i).    (3)

Two criteria, an optimistic and a pessimistic one, can then be used to evaluate δ [24,9]:

u_opt(δ, s_0) = max_τ min{π(τ | s_0, δ), µ(τ)},    (4)

u_pes(δ, s_0) = min_τ max{1 − π(τ | s_0, δ), µ(τ)}.    (5)
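As a sketch of Eqs. (3)-(5), the following hypothetical Python functions enumerate the trajectories of a stationary policy δ from s_0 over a finite horizon and aggregate them optimistically and pessimistically; they assume the dictionary encoding sketched after Example 2.

```python
# Enumerate trajectories of a stationary policy and evaluate u_opt / u_pes.

def trajectories(pi, mu, s0, delta, horizon):
    """Yield (possibility, utility) for every trajectory of `horizon` steps from s0."""
    def rec(s, poss, util, steps):
        if steps == 0:
            yield poss, util
        else:
            a = delta[s]
            for s_next, p in pi[(s, a)].items():
                yield from rec(s_next, min(poss, p), min(util, mu[s_next]), steps - 1)
    yield from rec(s0, 1.0, mu[s0], horizon)

def u_opt_policy(pi, mu, s0, delta, horizon):
    return max(min(p, u) for p, u in trajectories(pi, mu, s0, delta, horizon))

def u_pes_policy(pi, mu, s0, delta, horizon):
    return min(max(1.0 - p, u) for p, u in trajectories(pi, mu, s0, delta, horizon))

# With delta = {"RU": "Sav", "RF": "Sav", "PU": "Sav"} and horizon 2,
# u_opt_policy(pi, mu, "RU", delta, 2) returns 0.5, as in Example 3.
```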

The policies optimizing these criteria can be computed by applying, for every state s and time step i = 0, ..., t, the following counterparts of the Bellman updates [22]:

u_opt(s, i) ← max_{a ∈ A_s} min{ µ(s), max_{s′ ∈ S} min(π(s′|s, a), u_opt(s′, i+1)) },    (6)

u_pes(s, i) ← max_{a ∈ A_s} min{ µ(s), min_{s′ ∈ S} max(1 − π(s′|s, a), u_pes(s′, i+1)) },    (7)

δ_opt(s, i) ← arg max_{a ∈ A_s} min{ µ(s), max_{s′ ∈ S} min(π(s′|s, a), u_opt(s′, i+1)) },    (8)

δ_pes(s, i) ← arg max_{a ∈ A_s} min{ µ(s), min_{s′ ∈ S} max(1 − π(s′|s, a), u_pes(s′, i+1)) },    (9)

where we set, arbitrarily, u_opt(s′, t+1) = 1 and u_pes(s′, t+1) = 1.

These updates have allowed the definition of a (possibilistic) value iteration algorithm (see Algorithm 1 for the optimistic variant of this algorithm), which converges to an optimal policy in polytime [22].

This algorithm proceeds by iterated modifications of a possibilistic value function Q(s, a) which evaluates the "utility" (pessimistic or optimistic) of performing a in s.


Algorithm 1: VI-MDP: Possibilistic (Optimistic) Value Iteration.
Data: A stationary ΠMDP
Result: A policy δ optimal for u_opt
1  begin
2    foreach s ∈ S do u_opt(s) ← µ(s);
3    repeat
4      foreach s ∈ S do
5        u_old(s) ← u_opt(s);
6        foreach a ∈ A do
7          Q(s, a) ← min{ µ(s), max_{s′ ∈ S} min(π(s′|s, a), u_opt(s′)) };
8        u_opt(s) ← max_a Q(s, a);
9        δ(s) ← arg max_a Q(s, a);
10   until u_opt(s) == u_old(s) for each s;
11   return δ;
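A rough Python transcription of Algorithm 1 is sketched below (the paper's experiments rely on a Java implementation, so this is only illustrative); it assumes the dictionary encoding used in the earlier sketches and performs synchronous sweeps, which still reach the same fixed point.

```python
# Possibilistic (optimistic) value iteration, in the spirit of Algorithm 1.

def possibilistic_value_iteration(S, A, pi, mu):
    u = {s: mu[s] for s in S}              # line 2: u_opt(s) <- mu(s)
    delta = {s: A[s][0] for s in S}        # arbitrary initial choice
    while True:
        u_old = dict(u)
        for s in S:
            best_a, best_q = None, -1.0
            for a in A[s]:
                # line 7: Q(s,a) = min(mu(s), max_{s'} min(pi(s'|s,a), u(s')))
                q = min(mu[s],
                        max(min(pi[(s, a)].get(sp, 0.0), u_old[sp]) for sp in S))
                if q > best_q:
                    best_a, best_q = a, q
            u[s], delta[s] = best_q, best_a   # lines 8-9
        if u == u_old:                        # line 10: stabilization
            return delta, u
```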

Algorithm 2: PI-MDP: Possibilistic (Optimistic) Policy Iteration.
Data: A stationary ΠMDP
Result: A policy δ optimal for u_opt
1  begin
2    // Initialization of δ and u_opt
3    foreach s ∈ S do
4      δ(s) ← choose any a_s ∈ A_s;
5      u_opt(s) ← µ(s);
6    repeat
7      // Evaluation of δ until stabilization of u_opt
8      repeat
9        foreach s ∈ S do
10         u_old(s) ← u_opt(s);
11         u_opt(s) ← min{ µ(s), max_{s′ ∈ S} min(π(s′|s, δ(s)), u_old(s′)) };
12     until u_opt == u_old;
13     // Improvement of δ
14     foreach s ∈ S do
15       δ_old(s) ← δ(s);
16       δ(s) ← arg max_{a ∈ A} min{ µ(s), max_{s′ ∈ S} min(π(s′|s, a), u_opt(s′)) };
17   until δ(s) == δ_old(s) for each s;
18   // stabilization of δ
19   return δ;

2.3. The drowning effect in stationary sequential decision problems

Unfortunately, possibilistic utilities suffer from an important drawback called the drowning effect: plausible enough bad or good consequences may completely blur the comparison between acts that would otherwise be clearly differentiated; as a consequence, an optimal policy δ is not necessarily Pareto efficient. Recall that a policy δ is Pareto efficient when no other policy δ′ dominates it (i.e. there is no policy δ′ such that (i) ∀s ∈ S, u_pes(δ′, s) ⪰ u_pes(δ, s) and (ii) ∃s ∈ S s.t. u_pes(δ′, s) ≻ u_pes(δ, s)). The following example shows that it can simultaneously happen that δ′ dominates δ and u_pes(δ) = u_pes(δ′).

Example 3. The ΠMDP of Example 2 admits two policies δ and δ′:

• δ(RU) = Sav; δ(PU) = Sav; δ(RF) = Sav;
• δ′(RU) = Adv; δ′(PU) = Sav; δ′(RF) = Sav.

Consider a fixed horizon H = 2:

• δ has 3 trajectories:
  τ1 = (RU, PU, PU) with v_τ1 = (0.5, 0.2, 0.3, 1, 0.3);
  τ2 = (RU, RU, PU) with v_τ2 = (0.5, 1, 0.5, 0.2, 0.3);
  τ3 = (RU, RU, RU) with v_τ3 = (0.5, 1, 0.5, 1, 0.5).
• δ′ has 2 trajectories:
  τ4 = (RU, RF, RF) with v_τ4 = (0.5, 1, 0.7, 1, 0.7);
  τ5 = (RU, RF, RU) with v_τ5 = (0.5, 1, 0.7, 1, 0.5).

Thus u_opt(δ, RU) = u_opt(δ′, RU) = 0.5. However, δ′ seems better than δ since it provides utility 0.5 for sure while δ provides a bad utility (0.3) in some non-impossible trajectories (τ1 and τ2). τ3, which is good and totally possible, "drowns" τ1 and τ2: δ is considered as good as δ′.

3. Bounded iterations solutions to lexicographic finite horizon ΠMDPs

Possibilistic decision criteria, especially pessimistic and optimistic utilities, are simple and realistic, as illustrated in Section 2, but they have an important shortcoming: the principle of Pareto efficiency is violated, since these criteria suffer from the drowning effect. Indeed, one decision may dominate another one while not being strictly preferred. In order to overcome the drowning effect, some refinements of possibilistic utilities have been proposed in the non-sequential case, such as the lexicographic refinements proposed by [12,13]. These refinements are fully in accordance with ordinal utility theory and satisfy the principle of Pareto dominance, which is why we have chosen to focus on them.

The present section defines an extension of lexicographic refinements to finite horizon possibilistic Markov decision processes and proposes a value iteration algorithm that looks for policies optimal with respect to these criteria.

3.1. Lexi-refinements of ordinal aggregations

In ordinal (i.e. min-based and max-based) aggregation, a solution to the drowning effect based on leximin and leximax comparisons has been proposed by [19]. It has then been extended to non-sequential decision making under uncertainty [13] and, in the sequential case, to decision trees [4]. Let us first recall the basic definition of these two preference relations. For any two vectors t and t′ of length m built on the scale L:

t ⪰_lmin t′ iff ∀i, t_σ(i) = t′_σ(i), or ∃i*, ∀i < i*, t_σ(i) = t′_σ(i) and t_σ(i*) > t′_σ(i*),    (10)

t ⪰_lmax t′ iff ∀i, t_µ(i) = t′_µ(i), or ∃i*, ∀i < i*, t_µ(i) = t′_µ(i) and t_µ(i*) > t′_µ(i*),    (11)

where, for any vector v (here, v = t or v = t′), v_µ(i) (resp. v_σ(i)) is the i-th best (resp. worst) element of v.
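In Python, the leximin and leximax comparisons of Eqs. (10)-(11) reduce to lexicographic comparison of sorted copies of the vectors; the following sketch (illustrative, not from the paper) makes this explicit.

```python
# Leximin / leximax comparison of vectors on a common scale.

def leximin_key(v):
    """Sort increasingly: the worst components are compared first."""
    return tuple(sorted(v))

def leximax_key(v):
    """Sort decreasingly: the best components are compared first."""
    return tuple(sorted(v, reverse=True))

def leximin_prefers(t, t_prime):
    """True iff t is strictly preferred to t' for leximin."""
    return leximin_key(t) > leximin_key(t_prime)

def leximax_prefers(t, t_prime):
    """True iff t is strictly preferred to t' for leximax."""
    return leximax_key(t) > leximax_key(t_prime)

# (0.5, 0.5, 1) and (0.5, 1, 0.5) are leximin-indifferent, while
# (0.5, 0.7, 1) is strictly leximin-preferred to (0.5, 0.5, 1):
# the worst values tie (0.5), the second worst discriminates (0.7 > 0.5).
```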

[13,4] have extended these procedures to the comparison of matrices built on L, defining preference relations ⪰_lmin(lmax) and ⪰_lmax(lmin):

A ⪰_lmin(lmax) B ⇔ ∀j, a_(lmax,j) ∼_lmax b_(lmax,j), or ∃i s.t. ∀j > i, a_(lmax,j) ∼_lmax b_(lmax,j) and a_(lmax,i) ≻_lmax b_(lmax,i),    (12)

A ⪰_lmax(lmin) B ⇔ ∀j, a_(lmin,j) ∼_lmin b_(lmin,j), or ∃i s.t. ∀j < i, a_(lmin,j) ∼_lmin b_(lmin,j) and a_(lmin,i) ≻_lmin b_(lmin,i),    (13)

where a_(r,i) (resp. b_(r,i)) is the i-th largest sub-vector of A (resp. B) according to r ∈ {lmax, lmin}.

Like in (finite-horizon) possibilistic decision trees [4], our idea is to identify the strategies of the MDP with the matrices of their trajectories, and to compare such matrices with a ⪰_lmax(lmin) (resp. ⪰_lmin(lmax)) procedure for the optimistic (resp. pessimistic) case.

3.2. Lexicographic comparisons of policies

Let us first define lexicographic comparisons of policies over a given horizon E.

A trajectory over horizon E being a sequence of states and actions, any stationary policy can be identified with a matrix where each line corresponds to a distinct trajectory of length E. In the optimistic case each line corresponds to a vector v_τ = (µ_0, π_1, µ_1, π_2, ..., π_E, µ_E) and in the pessimistic case to w_τ = (µ_0, 1−π_1, µ_1, 1−π_2, ..., 1−π_E, µ_E).

This allows us to define the comparison of trajectories using leximax and leximin as follows:

τ ⪰_lmin τ′ iff (µ_0, π_1, ..., π_E, µ_E) ⪰_lmin (µ′_0, π′_1, ..., π′_E, µ′_E),    (14)

τ ⪰_lmax τ′ iff (µ_0, 1 − π_1, ..., 1 − π_E, µ_E) ⪰_lmax (µ′_0, 1 − π′_1, ..., 1 − π′_E, µ′_E).    (15)


Using (14) and (15), we can compare policies by:

δ ⪰_lmax(lmin) δ′ iff ∀i, τ_µ(i) ∼_lmin τ′_µ(i), or ∃i*, ∀i < i*, τ_µ(i) ∼_lmin τ′_µ(i) and τ_µ(i*) ≻_lmin τ′_µ(i*),    (16)

δ ⪰_lmin(lmax) δ′ iff ∀i, τ_σ(i) ∼_lmax τ′_σ(i), or ∃i*, ∀i < i*, τ_σ(i) ∼_lmax τ′_σ(i) and τ_σ(i*) ≻_lmax τ′_σ(i*),    (17)

where τ_µ(i) (resp. τ′_µ(i)) is the i-th best trajectory of δ (resp. δ′) according to ⪰_lmin, and τ_σ(i) (resp. τ′_σ(i)) is the i-th worst trajectory of δ (resp. δ′) according to ⪰_lmax.

Hence, the utility degree of a policy δ can be represented by a matrix U_δ with n lines, where n is the number of trajectories, and m = 2E + 1 columns. Indeed, comparing two policies w.r.t. ⪰_lmax(lmin) (resp. ⪰_lmin(lmax)) consists in first ordering the two corresponding matrices of trajectories as follows:

• order the elements of each trajectory (i.e. the elements of each line) in increasing order w.r.t. ⪰_lmin (resp. in decreasing order w.r.t. ⪰_lmax);
• then order all the trajectories: the lines of each policy are arranged lexicographically top-down in decreasing order (resp. top-down in increasing order).

Then, it is enough to lexicographically compare the two new matrices of trajectories, denoted U_δ (resp. U_δ′), element by element. The first pair of different elements determines the best matrix/policy. Note that the ordered matrix U_δ (resp. U_δ′) can be seen as the utility of applying policy δ (resp. δ′) over a length E horizon.
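The ordering and comparison procedure just described can be sketched as follows for the lmax(lmin) case (an illustrative Python fragment, not the paper's code); Example 4 below can be checked against it.

```python
# Order a matrix of trajectory vectors for lmax(lmin) and compare two matrices.

def order_lmaxlmin(matrix):
    rows = [tuple(sorted(row)) for row in matrix]  # sort each line increasingly (leximin rearrangement)
    rows.sort(reverse=True)                        # best trajectories first
    return rows

def lmaxlmin_prefers(U, V):
    """True iff U is strictly better than V; scanning stops at the shorter matrix,
    the tie-breaking rule for matrices of different sizes is omitted here."""
    for row_u, row_v in zip(order_lmaxlmin(U), order_lmaxlmin(V)):
        for x, y in zip(row_u, row_v):
            if x != y:
                return x > y
    return False   # identical on the scanned part -> indifference

# On the trajectory matrices of Example 4 below,
# lmaxlmin_prefers(U_delta_prime, U_delta) evaluates to True.
```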

Example 4. Let us consider again the counter-example of Example 3, with the same ΠMDP of Example 2. We consider, once again, the policies δ and δ′ defined by:

• δ(RU) = Sav; δ(PU) = Sav; δ(RF) = Sav;
• δ′(RU) = Adv; δ′(PU) = Sav; δ′(RF) = Sav.

For horizon H = 2:

• δ has 3 trajectories:
  τ1 = (RU, PU, PU) with v_τ1 = (0.5, 0.2, 0.3, 1, 0.3);
  τ2 = (RU, RU, PU) with v_τ2 = (0.5, 1, 0.5, 0.2, 0.3);
  τ3 = (RU, RU, RU) with v_τ3 = (0.5, 1, 0.5, 1, 0.5).

The matrix of trajectories is:

U_δ =
  0.5  0.2  0.3  1    0.3
  0.5  1    0.5  0.2  0.3
  0.5  1    0.5  1    0.5

which, after sorting each line increasingly, becomes

  0.2  0.3  0.3  0.5  1
  0.2  0.3  0.5  0.5  1
  0.5  0.5  0.5  1    1

So, the ordered matrix of trajectories is:

U_δ =
  0.5  0.5  0.5  1    1
  0.2  0.3  0.3  0.5  1
  0.2  0.3  0.5  0.5  1

• δ′ has 2 trajectories:
  τ4 = (RU, RF, RF) with v_τ4 = (0.5, 1, 0.7, 1, 0.7);
  τ5 = (RU, RF, RU) with v_τ5 = (0.5, 1, 0.7, 1, 0.5).

The ordered matrix of trajectories is:

U_δ′ =
  0.5  0.7  0.7  1  1
  0.5  0.5  0.7  1  1

Given the two ordered matrices U_δ and U_δ′, δ and δ′ are indifferent for optimistic utility since the two first (i.e. top-left) elements of the matrices are equal, i.e. u_opt(δ) = u_opt(δ′) = 0.5. For lmax(lmin) we compare successively the next elements (left to right, then top to bottom) until we find a pair of different values. In particular, the second element of the first (i.e. the best) trajectory of δ′ is strictly greater than the second element of the first trajectory of δ (0.7 > 0.5). So, the first trajectory of δ′ is strictly preferred to the first trajectory of δ according to ⪰_lmin. We deduce that δ′ is strictly preferred to δ:

δ′ ≻_lmax(lmin) δ since (0.5, 0.7, 0.7, 1, 1) ≻_lmin (0.5, 0.5, 0.5, 1, 1).


Proposition 1.

If u_opt(δ) > u_opt(δ′) then δ ≻_lmax(lmin) δ′.
If u_pes(δ) > u_pes(δ′) then δ ≻_lmin(lmax) δ′.

Proposition 2. ⪰_lmax(lmin) and ⪰_lmin(lmax) satisfy the principle of Pareto efficiency.

Now, in order to design dynamic programming algorithms, i.e. to extend the value iteration algorithm to lexicographic comparison, we show that the comparison of policies is a preorder and satisfies the principle of strict monotonicity, defined as follows for any optimization criterion O: ∀δ, δ′, δ″ ∈ Δ,

δ ⪰_O δ′ ⟺ δ + δ″ ⪰_O δ′ + δ″,

where δ (resp. δ′) and δ″ denote two disjoint sets of trajectories and δ + δ″ (resp. δ′ + δ″) is the set of trajectories that gathers the ones of δ (resp. δ′) and the ones of δ″.

Then, adding or removing identical trajectories to two sets of trajectories does not change their comparison by ⪰_lmax(lmin) (resp. ⪰_lmin(lmax)).

Proposition 3. Relations ⪰_lmin(lmax) and ⪰_lmax(lmin) are complete, transitive and satisfy the principle of strict monotonicity.

Note that u_opt and u_pes satisfy only a weak form of monotonicity, since the addition or the removal of trajectories may transform a strict preference into an indifference if u_opt or u_pes is used.

Let us define the complementary MDP (S, A, π, µ̄) of a given ΠMDP (S, A, π, µ), where µ̄(s) = 1 − µ(s), ∀s ∈ S. The complementary MDP simply gives complementary utilities. From the definitions of ⪰_lmax and ⪰_lmin, we can check that:

Proposition 4. τ ⪰_lmax τ′ ⇔ τ̄′ ⪰_lmin τ̄, and δ ⪰_lmin(lmax) δ′ ⇔ δ̄′ ⪰_lmax(lmin) δ̄,

where τ̄ and δ̄ are obtained by replacing µ with µ̄ in the trajectory/ΠMDP.

Therefore, all the results which we will prove for ⪰_lmax(lmin) also hold for ⪰_lmin(lmax), if we take care to apply them to complementary policies. Since considering ⪰_lmax(lmin) involves less cumbersome expressions (no 1 − ·), we will give the results for this criterion. A consequence of Proposition 4 is that the results hold for the pessimistic criterion as well.

This monotonicity of the lmin(lmax) and lmax(lmin) criteria is sufficient to allow us to use a dynamic programming algorithm such as value iteration or policy iteration [2]. The algorithms we propose in the present paper perform explicit Bellman updates in the lexicographic framework (lines 12-13 of Algorithms 3 and 4, line 11 of Algorithm 5); the correctness of their use is proved in Propositions 6 to 10.

3.3. Basic operations on matrices of trajectories

Before going further, in order to give more explicit and compact descriptions of the algorithms and the proofs, let us introduce the following notations and some basic operations on matrices (typically, on the matrix U(s) representing trajectories issued from state s). Abusing notations slightly, we identify trajectories τ (resp. policies) with their v_τ vectors (resp. matrices of v_τ vectors) when there is no ambiguity. For any matrix U, [U]_{l,c} denotes the restriction of U to its first l lines and first c columns, and U_{i,j} denotes the element at line i and column j.

Composition: Let U be an a × b matrix and N_1, ..., N_a be a series of a matrices of dimension n_i × c (they all share the same number of columns). The composition of U with (N_1, ..., N_a), denoted U × (N_1, ..., N_a), is a matrix of dimension (Σ_{1≤i≤a} n_i) × (b + c). For any i ≤ a, j ≤ n_i, the ((Σ_{i′<i} n_{i′}) + j)-th line of U × (N_1, ..., N_a) is the concatenation of the i-th line of U and the j-th line of N_i.

The composition U × (N_1, ..., N_a) is done in O(n·m) operations, where n = Σ_{1≤i≤a} n_i and m = b + c. The matrix U(s), the matrix of trajectories out of state s when making decision a, is typically the concatenation of the matrix U = ((π(s′|s, a), µ(s′)), s′ ∈ succ(s, a)) with the matrices N_{s′} = U(s′). This procedure adds two columns to each matrix U(s′), filled with π(s′|s, a) and µ(s′), the possibility degree and the utility of reaching s′; then the matrices are vertically concatenated to get the matrix U(s) when making decision a. Then it is possible to lexicographically compare the resulting matrices in order to get the optimal action in state s.

Ordering matrices: Let U be an n × m matrix; U^lmaxlmin is the matrix obtained by ordering the elements of the lines of U in increasing order, and then ordering the lines themselves top-down in decreasing lexicographic order (as described in Section 3.2).


Comparison of ordered matrices: Given two ordered matrices U^lmaxlmin and V^lmaxlmin, we say that U^lmaxlmin > V^lmaxlmin iff ∃i, j such that ∀i′ < i, ∀j′, U^lmaxlmin_{i′,j′} = V^lmaxlmin_{i′,j′}, and ∀j′ < j, U^lmaxlmin_{i,j′} = V^lmaxlmin_{i,j′}, and U^lmaxlmin_{i,j} > V^lmaxlmin_{i,j}. U^lmaxlmin ∼ V^lmaxlmin iff they are identical (comparison complexity: O(n·m)). Once the matrices Q(s, a) are ordered, the lexicographic comparison of two decisions is performed by scanning the elements of their matrices, line by line from the first one. The first pair of different values determines the best matrix and the best corresponding action a is selected (see Example 4).

If the policies (or sub-policies) have different numbers of trajectories, the comparison of two matrices is based on the number of trajectories of the shortest matrix. Two cases may arise:

• If we have a strict preference between the two matrices before reaching the last line of the shortest matrix, we get a strict preference between the policies (or between the sub-policies).
• If we have an indifference up to the last line, the shortest matrix is the best for the lexicographic criterion, since it expresses less uncertainty in the corresponding policy (or in the sub-policy).
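The Composition operation above can be sketched in a few lines of Python (an illustrative layout where matrices are lists of rows; not the paper's implementation):

```python
# Composition of a matrix U (a x b) with a list of matrices Ns (each n_i x c).

def compose(U, Ns):
    """Prefix the i-th line of U to every line of Ns[i] and stack the results."""
    assert len(U) == len(Ns)
    result = []
    for u_row, N in zip(U, Ns):
        for n_row in N:
            result.append(list(u_row) + list(n_row))
    return result

# Typical use during an update of state s under action a: U holds one line
# (pi(s'|s,a), mu(s')) per successor s', and Ns[i] is the trajectory matrix U(s')
# already computed for that successor; the composed matrix is then ordered and,
# in the bounded algorithms of Section 4, truncated to its first l lines and c columns.
```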

3.4. Bounded iterations lexicographic value iteration

In this section, we propose an iterative value iteration-type algorithm (Algorithm 3). This algorithm follows the same principle as in the possibilistic case (Eqs. (6)-(9)). Repeated Bellman updates are performed successively E times. This algorithm will provide an approximation of a lexicographically optimal strategy in the infinite horizon case (by considering the policy returned for the first time step). This algorithm is sub-optimal for any fixed E, but we will see in Section 4 that, letting E grow, an optimal lexicographic policy will be obtained for finite E.

We propose two versions of the value iteration algorithm: the first one computes the optimal policy with respect to the lmax(lmin) criterion and the second one provides the optimal policy with respect to the lmin(lmax) criterion. In this paper, we present and detail only the first algorithm, since the second is very similar.2

Algorithm 3: Bounded iterations lmax(lmin)-value iteration (BI-VI).
Data: A possibilistic MDP and maximum number of iterations E
Result: The δ_E strategy obtained after E iterations
1  begin
2    e ← 0;
3    foreach s ∈ S do U(s) ← ((µ(s)));
4    foreach s ∈ S, a ∈ A do
5      TU_{s,a} ← T_{s,a} × ((µ(s′)), s′ ∈ succ(s, a));
6    repeat
7      e ← e + 1;
8      foreach s ∈ S do
9        U_old(s) ← U(s);
10       Q ← ((0));
11       foreach a ∈ A do
12         Future ← (U_old(s′), s′ ∈ succ(s, a));  // gather the matrices provided by the successors of s
13         Q(s, a) ← (TU_{s,a} × Future)^lmaxlmin;
14         if Q ≺_lmaxlmin Q(s, a) then
15           Q ← Q(s, a);
16           δ(s) ← a;
17       U(s) ← Q(s, δ(s));
18   until e == E;
19   δ(s) ← arg max_a Q(s, a);
20   return δ_E = δ;

This algorithm is an iterative procedure that performs a prescribed number of updates, E, of the utility of each state, represented by a finite matrix of trajectories, using the utilities of the neighboring states.

At stage 1 ≤ e ≤ E, the procedure updates the utility of every state s ∈ S as follows:

• For each action a ∈ A, a matrix Q(s, a) is built to evaluate the "utility" of performing a in s at stage e: this is done by combining TU_{s,a} (combination of the transition matrix T_{s,a} = π(·|s, a) and the utilities µ(s′) of the states s′ that may follow when a is executed) with the matrices U_old(s′) of trajectories provided by these s′ at the previous stage. The matrix Q(s, a) is then ordered (the operation is made less complex by the fact that the matrices U_old(s′) have already been ordered at e − 1).
• The lmax(lmin) comparison is performed on the fly to memorize the best Q(s, a).
• The value of state s at stage e, U(s), is the one given by the action a which provides the best Q(s, a). δ is updated, U is memorized (and U_old can be discarded).

Time and space complexities of this algorithm are nevertheless expensive, since it eventually memorizes all the trajectories. At each step e, its size may grow to b^e · (2·e + 1), where b is the maximal number of possible successors of an action; the overall complexity of the algorithm is O(|S| · |A| · E · b^E), which is a problem.

Algorithm 3 is provided with a number of iterations, E. Does it converge when E tends to infinity? That is, are the returned policies identical for any E exceeding a given threshold? Before answering (positively) this question in Section 4.4, we are going to define bounded utility matrix solutions to lexicographic possibilistic MDPs. These solution concepts will be useful to answer the above question.

4. Bounded utility solutions to lexicographic ΠMDPs

We have just proposed a lexicographic value iteration algorithm for the computation of lexicographic policies based on the whole matrices of trajectories. As a consequence, the spatial/temporal complexity of the algorithm is exponential in the number of iterations. This section presents an alternative way to get lexicographic policies. Rather than limiting the size of the matrices of trajectories by limiting the number of iterations, we propose to "forget" the less significant part of the matrices of utility and to decide only based on the most significant (l, c) sub-matrices - we "bound" the utility matrices. We propose in the present section two algorithms based on this idea, namely a value iteration and a policy iteration algorithm.

4.1. Bounded lexicographic comparisons of utility matrices

Recall that, for any matrix U, [U]_{l,c} denotes the restriction of U to its first l lines and first c columns. Notice now that, at any stage e and for any state s, [U(s)]_{1,1} (i.e. the top left value in U(s)) is precisely equal to u_opt(s). We have seen that making the choices on this basis is not discriminant enough. On the other hand, taking the whole matrix into account is discriminant, but exponentially costly. Hence the idea of considering more than one line and one column, but less than the whole matrix - namely the first l lines and c columns of U_t(s)^lmaxlmin; hence the definition of the following preference:

δ ≥_{lmaxlmin,l,c} δ′ iff [δ^lmaxlmin]_{l,c} ≥ [δ′^lmaxlmin]_{l,c}.    (18)

≥_{lmaxlmin,1,1} corresponds to ⪰_opt and ≥_{lmaxlmin,+∞,+∞} corresponds to ≥_lmaxlmin.

The following proposition shows that this approach is sound and that ≻_{lmaxlmin,l,c} refines u_opt:

Proposition 5.

For any l, l′, c such that l′ > l, δ ≻_{lmaxlmin,l,c} δ′ ⇒ δ ≻_{lmaxlmin,l′,c} δ′.
For any l, c, δ ≻_opt δ′ ⇒ δ ≻_{lmaxlmin,l,c} δ′.

In other words, the order over the policies is refined, for a fixed c, when l increases. It tends to ≻_lmaxlmin when c = 2·E + 1 and l tends to b^E.

Notice that the combinatorial explosion is due to the number of lines (the number of columns is bounded by 2·E + 1), hence we shall bound the number of considered lines only.

Up to this point, the comparison by ≥_{lmaxlmin,l,c} is made on the basis of the first l lines and c columns of the full matrices of trajectories. This does obviously not reduce their size. The important following proposition allows us to make the (l, c) reduction of the ordered matrices at each step (after each composition), and not only at the very end, thus keeping space and time complexities polynomial.

Proposition 6. Let U be an a × b matrix and N_1, ..., N_a be a series of a matrices of dimension a_i × c. It holds that:

[(U × (N_1, ..., N_a))^lmaxlmin]_{l,c} = [(U × ([N_1^lmaxlmin]_{l,c}, ..., [N_a^lmaxlmin]_{l,c}))^lmaxlmin]_{l,c}.
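Operationally, the (l, c) bounding combines the ordering of Section 3.3 with a simple truncation; the sketch below (illustrative, with matrices as lists of rows) shows the [·]_{l,c} restriction that, by Proposition 6, can already be applied to the successors' matrices after every composition.

```python
# Bounded lmax(lmin) utility: order a trajectory matrix, keep l lines and c columns.

def order_lmaxlmin(matrix):
    rows = [tuple(sorted(row)) for row in matrix]  # sort each line increasingly
    rows.sort(reverse=True)                        # best lines first
    return rows

def bound(matrix, l, c):
    """Ordered restriction [U]_{l,c}: first l lines and first c columns."""
    return [row[:c] for row in order_lmaxlmin(matrix)[:l]]
```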

4.2. Bounded utility lexicographic value iteration


Algorithm 4: Bounded Utility lmax(lmin) Value Iteration (BU-VI).
Data: A possibilistic MDP, bounds (l, c); δ, the policy built by the algorithm, is a global variable
Result: A policy δ optimal for ⪰_{lmaxlmin,l,c}
1  begin
2    foreach s ∈ S do U(s) ← ((µ(s)));
3    foreach s ∈ S, a ∈ A do
4      TU_{s,a} ← T_{s,a} × ((µ(s′)), s′ ∈ succ(s, a));
5    repeat
6      foreach s ∈ S do
7        U_old(s) ← U(s);
8        Q ← ((0));
9        foreach a ∈ A do
10         Future ← (U_old(s′), s′ ∈ succ(s, a));  // gather the matrices provided by the successors of s
11         Q(s, a) ← [(TU_{s,a} × Future)^lmaxlmin]_{l,c};
12         if Q ≺_lmaxlmin Q(s, a) then
13           Q ← Q(s, a);
14           δ(s) ← a;
15       U(s) ← Q(s, δ(s));
16   until U(s) == U_old(s) for each s;
17   δ(s) ← arg max_a Q(s, a);
18   U(s) ← max_a Q(s, a);
19   return δ;

When the horizon of the MDP is finite, this algorithm provides in polynomial time a policy that is always at least as good as the one provided by u_opt (according to lmax(lmin)) and tends to lexicographic optimality when c = 2·E + 1 and l tends to b^E.

Let us now study the time complexity. The number of iterations is bounded by the size of the set of possible matrices of trajectories, which is in O(|S| · |A| · E). One iteration of the algorithm requires composition, ordering and comparing operations on b matrices of size (l, c). Since the composition and comparison of matrices are linear operations, the complexity of one iteration in the worst case is in b · (l·c) · log(l·c). Therefore, the complexity of the algorithm is in O(|S| · |A| · E · b · (l·c) · log(l·c)).

When the horizon of the MDP is not finite, equations (16) and (17) are not enough to rank-order the policies. The length of the trajectories may be infinite, as well as their number. This problem is well known in classical probabilistic MDPs, where a discount factor is used to attenuate the influence of later utility degrees - thus allowing the convergence of the algorithm [21]. On the contrary, classical ΠMDPs do not need any discount factor and Value Iteration, based on the evaluation for l = c = 1, converges in the infinite horizon case [22]. In a sense, this limitation to l = c = 1 plays the role of a discount factor - but a very drastic one. Extending the comparison by using ≥_{lmaxlmin,l,c} with larger (l, c), as shown below, allows us to use a less drastic discount.

In other terms, ≥_{lmaxlmin,l,c} can be used in the infinite case, as shown by the following proposition.

Proposition 7 (Bounded utility lmax(lmin)-policy evaluation converges). Let U_t(s) be the matrix issued from s at instant t when a strategy δ is executed. It holds that:

∀l, c, ∃t*, such that ∀t ≥ t*, (U_t)^lmaxlmin_{l,c}(s) = (U_{t*})^lmaxlmin_{l,c}(s), ∀s.

Hence there exists a stage t* where the value of a policy becomes stable if computed with the bounded utility lmax(lmin) evaluation algorithm. This criterion is thus soundly defined and can be used in the infinite horizon case (and of course in the finite horizon case).

The number of iterations of Algorithm 4 is not explicitly bounded, but the convergence of the algorithm is guaranteed - this is a direct consequence of Proposition 7.

Corollary 1 (Bounded utility lmax(lmin)-value iteration converges). ∀l, c, ∃t* such that, ∀t ≥ t*, (U_t)^lmaxlmin_{l,c}(s) = (U_{t*})^lmaxlmin_{l,c}(s), ∀s.

The overall complexity of bounded utility lmax(lmin)-value iteration (Algorithm 4) is bounded by O(|S| · |A| · |L| · b · (l·c) · log(l·c)).


4.3. Bounded utility lexicographic policy iteration

In Ref. [17], Howard shows that a policy often becomes optimal long before the convergence of the value estimates. That is why Puterman [21] has proposed a policy iteration algorithm. This algorithm has been adapted to possibilistic MDPs by [22].

Likewise, we propose a (bounded utility) lexicographic policy iteration algorithm (Algorithm 5), denoted here BU-PI, that alternates improvement and evaluation phases, as any policy iteration algorithm.

Algorithm 5: lmax(lmin)-Bounded Utility Policy Iteration.
Data: A possibilistic MDP, bounds (l, c)
Result: A policy δ* optimal when l, c grow
1  begin
2    // Arbitrary initialization of δ on S
3    foreach s ∈ S do δ(s) ← choose any a_s ∈ A_s;
4    repeat
5      // Evaluation of δ
6      foreach s ∈ S do U(s) ← µ(s);
7      repeat
8        foreach s ∈ S do
9          U_old(s) ← U(s);
10         // Gather the matrices of the successors of s given δ
11         Future ← (U(s′), s′ ∈ succ(s, δ(s)));  U(s) ← [(TU_{s,δ(s)} × Future)^lmaxlmin]_{l,c};
12       until U(s) == U_old(s) for each s;
13     δ_old ← δ;
14     // Improvement of δ
15     foreach s ∈ S do
16       // Compute the utility of the strategy playing a (for each a), given what was chosen for the other states
17       foreach a ∈ A do
18         Future ← (U(s′), s′ ∈ succ(s, a));  Q(s, a) ← [(TU_{s,a} × Future)^lmaxlmin]_{l,c};
19       // Update the choice of an action for s
20       δ(s) ← arg max^{lmax(lmin)}_{a ∈ A} Q(s, a);
21   until δ == δ_old;
22   return δ;

In line 3 of Algorithm 5, an arbitrary initial policy is chosen. The algorithm then proceeds by evaluating the current policy, through successive updates of the value function (lines 8 to 11); the convergence of this evaluation is easily derived from that of the bounded utility lmax(lmin)-value iteration algorithm. Then the algorithm enters the improvement phase: lines 17-18 compute Q(s, a), the (bounded lexicographic) utility of playing action a in state s and then applying policy δ_old in subsequent states (the policy computed during the last iteration); as usual in policy iteration style algorithms, the updated policy (δ) is then obtained by greedily improving the current action, which is done in line 20. Since the actions considered at line 20 do include the one prescribed by δ_old, either nothing is changed, and the algorithm stops, or the new policy, δ, is better than the previous one, δ_old.

Proposition 8. Bounded utility lmax(lmin)-policy iteration converges to an optimal policy for ⪰_{lmaxlmin,l,c} in finite time.

Policy iteration (Algorithm 5) converges and is guaranteed to find a policy optimal for the (l, c) lexicographic criterion in finite time, and usually in a few iterations. As for the algorithmic complexity of the classical, stochastic, policy iteration algorithm (which is still not well understood [16]), a tight worst-case complexity bound for lexicographic policy iteration is hard to obtain. Therefore, we provide an upper bound on this complexity.

The policy iteration algorithm never visits a policy twice: in the worst case, the number of iterations before convergence is exponential, but it is dominated by the number of distinct policies. So, the complexity of this algorithm is dominated by O(|A|^|S|). Besides, each iteration has a cost, the evaluation phase relying on a bounded utility value iteration algorithm that costs O(|S| · |A| · |L| · b · (l·c) · log(l·c)) when many actions are possible at a given step, and costs O(|S| · |L| · b · (l·c) · log(l·c)) here because one action is selected (by the current policy) for each state. Thus, the overall complexity of the algorithm is in O(|A|^|S| · |S| · |L| · b · (l·c) · log(l·c)).


4.4. Back to lexicographic value iteration: from finite to infinite horizon ΠMDPs

The bounded iterations algorithm defined in Section 3 (Algorithm 3, BI-VI) can be used for both finite horizon and infinite horizon MDPs, because it fixes a number of iterations E; if E is low, the policy reached is not necessarily optimal - the algorithm is an approximation algorithm.

Now, exploiting the above propositions, we are able to show that the bounded iterations lmax(lmin) value iteration algorithm (Algorithm 3) converges when E tends to infinity. To do so, we first prove the following proposition:

Proposition 9. Let an arbitrary stationary ΠMDP be given. Then, there exist two positive natural numbers (l*, c*), such that for any pair (δ, δ′) of arbitrary policies, any state s ∈ S, and any pair (l, c) such that l ≥ l* and c ≥ c*,

δ(s) ≻_{lmaxlmin,l,c} δ′(s) ⇔ δ(s) ≻_{lmaxlmin,l*,c*} δ′(s).

Now, this proposition can be used to prove the convergence of the bounded iterations lmax(lmin)-value iteration algorithm. For this, let us define ≻_lmaxlmin =_def ≻_{lmaxlmin,l*,c*}, the unique preference relation between policies that results from Proposition 9.

Proposition 10. If we let δ_E be the policy returned by Algorithm 3 for any fixed E, we can show that the sequence (δ_E) converges and that there exists a finite E*, such that:

lim_{E→∞} δ_E = δ_{E*}.

Furthermore, δ_{E*} is optimal with respect to ≻_lmaxlmin.

The sequence of policies obtained by BI-VI (Algorithm 3) when E tends to infinity converges. Furthermore, the limit is attained for a finite (but unknown in advance) E. Alternately, it is also attained by the BU-VI and BU-PI algorithms, with finite but unknown (l, c).

Now, let us summarize the theoretical results that we have obtained so far. We have shown that possibilistic utilities (optimistic and pessimistic) are special cases of bounded lexicographic utilities, which can be represented by matrices. Possibilistic utilities are obtained when l = c = 1.

The possibilistic value iteration and policy iteration algorithms can be extended to compute policies which are optimal according to ≻_{lmaxlmin,l,c}.

Finally, if infinite horizon lexicographic optimal policies are defined as the limiting policies obtained from a non-bounded lexicographic value iteration algorithm, we have shown that such policies can be computed by applying our bounded utility lmax(lmin) value iteration algorithm and that only a finite number of iterations (even though not known in advance) is required.

5. Experiments

In order to evaluate the previous algorithms, we propose, in the following, two experimental analyses: in the first one we compare the bounded iterations value iteration algorithm (Algorithm 3) with the bounded utility one, and in the second we compare the bounded utility lexicographic policy iteration algorithm with the bounded utility lexicographic value iteration one. The algorithms have been implemented in Java and the experiments have been performed on an Intel Core i5 processor computer (1.70 GHz) with 8 GB DDR3L of RAM.

5.1. Bounded utility vs bounded iterations value iteration

Experimental protocol. We now compare the performance of bounded utility lexicographic value iteration (BU-VI) as an approximation of lexicographic value iteration (BI-VI) for finite horizon problems, in the lmax(lmin) variant. Because the horizon is finite, the number of steps of BI-VI can be set equal to the horizon, and the algorithm then provides a solution optimal according to lmax(lmin). BU-VI, on the other side, limits the size of the matrices, and can lead to sub-optimal solutions.

We evaluate the performance of the algorithms by carrying out simulations on randomly generated finite horizon ΠMDPs.


Fig. 2. Bounded utility lexicographic value iteration vs lexicographic value iteration.

Table 1
Average CPU time (in seconds) and average number of iterations.

Bounded utility policy iteration
(l, c)                          (2,2)    (4,4)    (6,6)    (10,10)
CPU time (s)                    0.029    0.042    0.064    0.091
Average number of iterations    3.2      4.33     5.6      9.7

Bounded utility value iteration
(l, c)                          (2,2)    (4,4)    (6,6)    (10,10)
CPU time (s)                    0.03     0.052    0.082    0.1
Average number of iterations    6.75     9.25     16.11    20.2

The higher this rate, the more important the effectiveness of cutting matrices with BU-VI; the lower this rate, the more important the drowning effect.

Results. Fig. 2(a) presents the average execution CPU time for the two algorithms. Obviously, for both BI-VI and BU-VI, the execution time increases with the horizon. Also, we observe that the CPU time of BU-VI increases according to the values of (l, c) but it remains affordable, as the maximal CPU time is lower than 1 s for MDPs with 25 states and 4 actions when (l, c) = (40, 40) and E = 25. Unsurprisingly, we can check that BU-VI (regardless of the values of (l, c)) is faster than BI-VI, especially when the horizon increases: the manipulation of (l, c)-matrices is obviously less expensive than the one of full matrices. The saving increases with the horizon.

As for the success rate, the results are described in Fig. 2(b). It appears that BU-VI provides a very good approximation, especially when increasing (l, c). It provides the same optimal solution as BI-VI in about 90% of cases, with (l, c) = (200, 200). Moreover, even when the success rate of BU-VI decreases (when E increases), the quality of approximation is still good: never less than 70% of optimal actions returned, with E = 25. These experiments conclude in favor of bounded value iteration: the quality of its approximated solutions is comparable with that of the unbounded version for high (l, c) and increases when (l, c) increase, while it is much faster.

5.2. Bounded utility lexicographic policy iteration vs bounded utility lexicographic value iteration

Experimental protocol. In what follows we evaluate the performances of bounded utility lexicographic policy iteration (BU-PI) and bounded utility lexicographic value iteration (BU-VI), in the lmax(lmin) variant. We evaluate the performance of the algorithms on randomly generated ΠMDPs as those of Section 5.1, with |S| = 25 and |A_s| = 4, ∀s.

We ran the two algorithms for different values of (l, c) (100 ΠMDPs are considered in each sample). For each of the two algorithms we measure the CPU time needed to converge. We also measure the average number of value iterations for BU-VI and the average number of policy iterations for BU-PI.

Results. Table 1 presents the average execution CPU time and the average number of iterations for the two algorithms. Obviously, for both BU-PI and BU-VI, the execution time increases according to the values of (l, c) but it remains affordable, as the maximal CPU time is lower than 0.1 s for MDPs with 25 states and 4 actions when (l, c) = (10, 10). It appears that BU-PI (regardless of the values of (l, c)) is slightly faster than BU-VI.

Consider now the number of iterations. At each iteration, BU-PI considers one policy, explicitly, and updates it at line 20. And so does value iteration: for each state, the current policy is updated at line 15. Table 1 shows that BU-PI always considers fewer policies than BU-VI. This experiment provides empirical evidence in favor of policy iteration.
