
Continuous-Time Markov Decision Processes


constraint by removing all actions that do not contain a low price. However, in our opinion it is no longer possible to decompose the DP. In the decomposed LP formulation, we show how this action, and other combinations of actions, can be removed by adding simple linear constraints. We also discuss the generic problem of arbitrarily reducing the action space by adding a set of linear constraints to the decomposed LP. Not surprisingly, this problem is difficult, and the approach is not appropriate if many actions are removed arbitrarily. When new constraints are added to the decomposed LP, it is also not clear whether deterministic policies remain dominant or not. We even prove that, given a set of linear constraints added to the LP, determining whether the LP admits a deterministic policy as a solution is NP-complete. However, for some simple action reductions, we show that deterministic policies remain dominant. We finally present numerical results comparing LP and DP formulations (decomposed or not).

Before presenting the organization of the paper, we briefly mention some related literature addressing MDPs with large state spaces (Bertsekas and Castañon, 1989; Tsitsiklis and Van Roy, 1996), which tries to fight the curse of dimensionality, i.e. the exponential growth of the state space size with some parameter of the problem. Another issue, less often tackled, appears when the state space is relatively small but the action space is very large. Hu et al. (2007) propose a randomized search method for solving infinite-horizon discounted-cost discrete-time MDPs with uncountable action spaces.

The rest of the paper is organized as follows. We first address the average cost problem. Section A.2 recalls the definition of a CTMDP and formulates it as a QP. We also link the QP formulation with the classic LP and DP formulations. In Section A.3, we properly define D-CTMDPs and show how the DP and LP can be decomposed for this class of problems. Section A.4 discusses the problem of reducing the action space by adding valid constraints to the decomposed LP. Section A.5 numerically compares the computation times of decomposed and non-decomposed formulations for both LP and DP algorithms, on the dynamic pricing problem. Finally, Section A.6 explains how our results can be adapted to a discounted cost criterion.

A.2 Continuous-Time Markov Decision Processes

In this section, we recall some classic results on CTMDPs that will be useful to present our contributions.

A.2.1 Definition

We slightly adapt the definition of a CTMDP given by Guo and Hernández-Lerma (2009). A CTMDP is a stochastic control problem defined by a 5-tuple

( S, A, λ_{s,t}(a), h_s(a), r_{s,t}(a) )

with the following components:

• S is a finite set of states;

• A is a finite set of actions, and A(s) is the set of actions available from state s ∈ S;

• λ_{s,t}(a) is the transition rate to go from state s to state t with action a ∈ A(s);

• h_s(a) is the reward rate while staying in state s with action a ∈ A(s);

• r_{s,t}(a) is the instant reward to go from state s to state t with action a ∈ A(s).

Instant rewards r_{s,t}(a) can be included in the reward rates h_s(a) by an easy transformation, and reciprocally. Therefore, for ease of presentation, we will use the aggregated reward rate

h̃_s(a) := h_s(a) + Σ_{t∈S} λ_{s,t}(a) r_{s,t}(a).
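As an illustration, a minimal Python sketch of this aggregation (the dictionary-based data structures are ours, chosen for illustration only):

```python
def aggregated_reward_rates(states, actions, rates, h, r):
    """Compute h_tilde_s(a) = h_s(a) + sum_t lambda_{s,t}(a) * r_{s,t}(a).

    rates[(s, a, t)] : transition rate lambda_{s,t}(a)
    h[(s, a)]        : reward rate h_s(a)
    r[(s, a, t)]     : instant reward r_{s,t}(a)
    """
    h_tilde = {}
    for s in states:
        for a in actions[s]:  # actions[s] plays the role of A(s)
            h_tilde[(s, a)] = h[(s, a)] + sum(
                rates.get((s, a, t), 0.0) * r.get((s, a, t), 0.0)
                for t in states
            )
    return h_tilde
```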

We first consider the objective of maximizing the average reward over an infinite horizon; the discounted case will be discussed in Section A.6. We restrict our attention to stationary policies, which are dominant for this problem (Bertsekas, 2005b), and define the following policy classes:

• A (randomized stationary) policy p sets, for each state s ∈ S, the probability p_s(a) of selecting action a ∈ A(s), with Σ_{a∈A(s)} p_s(a) = 1.

• A deterministic policy p sets, for each state s, one action to select: ∀s ∈ S, ∃a ∈ A(s) such that p_s(a) = 1 and ∀b ∈ A(s) \ {a}, p_s(b) = 0.

• A strictly randomized policy has at least one state where the action is chosen randomly: ∃s ∈ S, ∃a ∈ A(s) such that 0 < p_s(a) < 1.
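For instance, with hypothetical numbers: if A(s) = {a₁, a₂} for some state s, then p_s(a₁) = 1, p_s(a₂) = 0 is deterministic in s, while p_s(a₁) = 0.3, p_s(a₂) = 0.7 is strictly randomized in s.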

The best average-reward policy p with gain g is the solution of the following Quadratic Program (QP), together with a vector ̟ (the symbol is read "variant pi" or "pomega").

̟_s is to be interpreted as the stationary probability of state s ∈ S under policy p.

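A sketch of such a quadratic program, written in the notation above, is the following (the exact grouping of the constraints (A.1a)-(A.1f) discussed below may differ):

\begin{align*}
\max_{p,\,\varpi}\quad & \sum_{s\in S} \varpi_s \sum_{a\in A(s)} p_s(a)\,\tilde h_s(a) \\
\text{s.t.}\quad & \sum_{t\in S} \sum_{a\in A(t)} \varpi_t\, p_t(a)\, \lambda_{t,s}(a) \;=\; \varpi_s \sum_{a\in A(s)} p_s(a) \sum_{t\in S} \lambda_{s,t}(a), \qquad \forall s\in S, \\
& \sum_{s\in S} \varpi_s = 1, \qquad \varpi_s \ge 0 \quad \forall s \in S, \\
& \sum_{a\in A(s)} p_s(a) = 1, \qquad p_s(a) \ge 0 \quad \forall s\in S,\ a\in A(s).
\end{align*}

The objective is the gain g, and the bilinear products ̟_s p_s(a) are what make the program quadratic rather than linear.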

QP (A.1) is a natural way to formulate a stationary MDP. It is easy to see that this formulation computes the best average-reward stationary policy. First, Equations (A.1e) and (A.1f) define the space of admissible stationary randomized policies p. Secondly, for a given policy p, Equations (A.1b)-(A.1d) compute the stationary distribution ̟ of the induced Continuous-Time Markov Chain, whose gain is expressed by Equation (A.1a).

As we will show in Section A.2.3, the usual LP formulation (A.4) can be derived directly from QP (A.1) by simple substitutions of variables. However, in the literature it is classically derived from the linearization of the dynamic programming optimality equations given in the following subsection.
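A minimal sketch of that substitution, assuming the standard change of variables (our notation, introduced here for illustration):

\[
x_s(a) := \varpi_s\, p_s(a), \qquad \varpi_s = \sum_{a\in A(s)} x_s(a), \qquad p_s(a) = \frac{x_s(a)}{\sum_{b\in A(s)} x_s(b)} \quad \text{whenever } \sum_{b\in A(s)} x_s(b) > 0.
\]

Under this change of variables, the objective and the balance constraints of the QP become linear in x, which gives the occupation-measure form of the LP.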

A.2.2 Optimality equations

A CTMDP can be transformed into a discrete-time MDP through a uniformization process (Lippman, 1975). In state s, we uniformize the CTMDP with rate Λ_s := Σ_{t∈S} Λ_{s,t}, where Λ_{s,t} := max_{a∈A(s)} λ_{s,t}(a). This uniformization rate greatly simplifies the optimality equations and linear programs.
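As a small numerical illustration (the rates below are hypothetical, not taken from the dynamic pricing example): suppose state s has two actions a and b and two possible successor states t₁ and t₂, with λ_{s,t₁}(a) = 2, λ_{s,t₁}(b) = 3, λ_{s,t₂}(a) = 1 and λ_{s,t₂}(b) = 0. Then

\[
\Lambda_{s,t_1} = \max(2,3) = 3, \qquad \Lambda_{s,t_2} = \max(1,0) = 1, \qquad \Lambda_s = 3 + 1 = 4,
\]

so both actions in s are uniformized at the common rate Λ_s = 4, the unused rate mass being carried by a fictitious self-transition.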

In the rest of the paper, under relatively general conditions (typically for unichain models, see e.g. Bertsekas (2005b)), we assume that the optimal average reward g is independent of the initial state and is, together with an associated differential reward vector v, the unique solution of Bellman's optimality equations:

g / Λ_s = T v(s) − v(s),   ∀s ∈ S,   (A.2)

with operator T defined by Equation (A.3). These optimality equations can be used to compute the best MDP policy. For instance, the value iteration algorithm roughly consists in defining a sequence of value functions v_{n+1} = T(v_n) that provides a stationary ε-optimal deterministic policy in a number of iterations depending on the desired ε.
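For a CTMDP uniformized at rate Λ_s, the operator typically takes the following form (a sketch under this standard assumption; the exact expression of (A.3) may differ in its handling of self-transitions):

\[
T v(s) \;=\; \max_{a\in A(s)} \frac{1}{\Lambda_s} \Big[ \tilde h_s(a) + \sum_{t\neq s} \lambda_{s,t}(a)\, v(t) + \Big( \Lambda_s - \sum_{t\neq s} \lambda_{s,t}(a) \Big) v(s) \Big],
\]

so that Equation (A.2) amounts to g = max_{a∈A(s)} [ h̃_s(a) + Σ_{t≠s} λ_{s,t}(a) (v(t) − v(s)) ] for every s ∈ S. The sketch below illustrates such a value iteration in Python; for simplicity it uses a single global uniformization rate instead of the state-dependent Λ_s, and its dictionary-based data structures are ours, chosen for illustration only:

```python
def value_iteration(states, actions, rates, h_tilde, eps=1e-6, max_iter=100_000):
    """Average-reward value iteration on a globally uniformized CTMDP (sketch).

    states  : list of states S
    actions : dict  s -> list of available actions A(s)
    rates   : dict (s, a, t) -> transition rate lambda_{s,t}(a), for t != s
    h_tilde : dict (s, a) -> aggregated reward rate
    Assumes a unichain model, as in the paper; returns an approximation of
    the optimal gain g and a greedy deterministic policy.
    """
    # Global uniformization rate, slightly inflated so that every state keeps
    # a self-loop in the uniformized chain (this guarantees aperiodicity).
    Lam = 1.05 * max(sum(rates.get((s, a, t), 0.0) for t in states if t != s)
                     for s in states for a in actions[s])

    def q_value(v, s, a):
        # One-step value of action a in state s in the uniformized chain.
        exit_rate = sum(rates.get((s, a, t), 0.0) for t in states if t != s)
        return (h_tilde[(s, a)] / Lam
                + sum(rates.get((s, a, t), 0.0) / Lam * v[t]
                      for t in states if t != s)
                + (1.0 - exit_rate / Lam) * v[s])

    v = {s: 0.0 for s in states}
    g = 0.0
    ref = states[0]  # reference state used to keep the iterates bounded
    for _ in range(max_iter):
        tv = {s: max(q_value(v, s, a) for a in actions[s]) for s in states}
        diffs = [tv[s] - v[s] for s in states]
        # Span-based bounds on the gain: Lam*min(diffs) <= g <= Lam*max(diffs).
        lo, hi = Lam * min(diffs), Lam * max(diffs)
        g = 0.5 * (lo + hi)
        if hi - lo < eps:
            break
        v = {s: tv[s] - tv[ref] for s in states}  # relative value iteration step

    policy = {s: max(actions[s], key=lambda a: q_value(v, s, a)) for s in states}
    return g, policy
```

The iteration stops when the span of T v_n − v_n is small enough, which yields the ε-optimal deterministic policy mentioned above.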

A.2.3 Linear programming formulation

Since Manne (1960), we know that it is possible to compute the best average-reward policy through an LP formulation. From Equations (A.2) and (A.3), one can show that the optimal average reward g is the solution of the following program:

g = min g
s.t.  g / Λ_s ≥ T v(s) − v(s),   ∀s ∈ S.

Linearizing the max function in operator T leads to the following LP formulation and its dual counterpart.
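To fix ideas, here is a sketch of what such a primal/dual pair typically looks like once the max in T has been linearized and each constraint has been multiplied by Λ_s (the exact scaling used in (A.4) may differ):

\begin{align*}
\text{(primal)}\quad \min_{g,\,v}\quad & g \\
\text{s.t.}\quad & g + \sum_{t\neq s} \lambda_{s,t}(a)\,\big(v(s) - v(t)\big) \;\ge\; \tilde h_s(a), \qquad \forall s\in S,\ a\in A(s);
\end{align*}

\begin{align*}
\text{(dual)}\quad \max_{x\ge 0}\quad & \sum_{s\in S}\sum_{a\in A(s)} \tilde h_s(a)\, x_s(a) \\
\text{s.t.}\quad & \sum_{a\in A(s)} x_s(a) \sum_{t\neq s} \lambda_{s,t}(a) \;=\; \sum_{t\neq s}\sum_{a\in A(t)} x_t(a)\, \lambda_{t,s}(a), \qquad \forall s\in S, \\
& \sum_{s\in S}\sum_{a\in A(s)} x_s(a) = 1.
\end{align*}

The dual variable x_s(a) plays the role of the occupation measure ̟_s p_s(a), which is precisely the substitution of variables linking this LP to QP (A.1).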
