$\frac{\gamma}{1-\gamma}\,\epsilon$, which is significant in the usual situation when $\gamma$ is close to 1. Given Bellman operators that can only be computed with some error $\epsilon$, a surprising consequence of this result is that the problem of "computing an approximately optimal non-stationary policy" is much simpler than that of "computing an approximately optimal stationary policy", and even slightly simpler than that of "approximately computing the value of some fixed policy", since this last problem only has a guarantee of $\frac{1}{1-\gamma}\,\epsilon$.

Mean-variance optimization problems resembling ours have been studied in the literature. For example, Guo, Ye, and Yin (2012) consider a mean-variance optimization problem, but subject to a constraint on the vector of expected rewards starting from each state, which results in a simpler problem, amenable to a policy iteration approach. Collins (1997) provides an apparently exponential-time algorithm for a variant of our problem, and Tamar, Di-Castro, and Mannor (2012) propose a policy gradient approach that aims at a locally optimal solution. Expressions for the variance of the discounted reward for stationary policies were developed in Sobel (1982). However, these expressions are quadratic in the underlying transition probabilities, and do not lead to convex optimization problems. Similarly, much of the earlier literature on the problem (see Kawai (1987) and Huang and Kallenberg (1994) for a unified approach) provides various mathematical programming formulations. In general, these formulations either deal with problems that differ qualitatively from ours, focusing on the variation of reward from its average (Filar, Kallenberg, & Lee, 1989; White, 1992), or are nonconvex, and therefore do not address the issue of polynomial-time solvability, which is our focus. Indeed, we are not aware of any complexity results on mean-variance optimization problems. We finally note some interesting variance bounds obtained by Arlotto, Gans, and Steele (2013).

b IRIT, UPS-CNRS, 118 route de Narbonne, 31062 Toulouse, France c MIAT, UR 875, Université de Toulouse, INRA, F-31320 Castanet-Tolosan, France
Abstract
Possibilistic decision theory was proposed twenty years ago and has had several extensions since then. Even though appealing for its ability to handle qualitative decision problems, possibilistic decision theory suffers from an important drawback. Qualitative possibilistic utility criteria compare acts through min and max operators, which leads to a drowning effect. To overcome this lack of decision power of the theory, several refinements have been proposed. Lexicographic refinements are particularly appealing since they allow one to benefit from the Expected Utility background while remaining qualitative. This article aims at extending lexicographic refinements to sequential decision problems, i.e., to possibilistic decision trees and possibilistic Markov decision processes, when the horizon is finite. We present two criteria that refine qualitative possibilistic utilities and provide dynamic programming algorithms for calculating lexicographically optimal policies.

2 Related Work
The Kearns algorithm [13] is the classic ‘sparse sampling algorithm’ for large, infinite horizon, discounted MDPs. It constructs a ‘near optimal’ scheduler piecewise, by approximating the best action from a current state using a stochastic depth-first search. Importantly, optimality is with respect to rewards, not probability (as required by standard model checking tasks). The algorithm can work with large, potentially infinite state MDPs because it explores a probabilistically bounded search space. This, however, is exponential in the discount. To find the action with the greatest expected reward in the current state, the algorithm recursively estimates the rewards of successive states, up to some maximum depth defined by the discount and desired error. Actions are enumerated while probabilistic choices are explored by sampling, with the number of samples set as a parameter. The error is specified as a maximum difference between consecutive estimates, allowing the discount to guarantee that the algorithm will eventually terminate.
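To make the recursion concrete, the following Python sketch shows the sparse-sampling idea in its simplest form: to choose an action, each action's value is estimated by sampling a fixed number of successor states and recursing to a fixed depth. The generative-model interface (`mdp.actions`, `mdp.sample`) and all names are assumptions made for illustration; they are not taken from [13].

```python
def sparse_sampling_value(state, depth, mdp, gamma, num_samples):
    """Estimate the optimal value of `state` by recursive sparse sampling.

    `mdp` is assumed to expose `actions(state)` and a generative model
    `sample(state, action) -> (next_state, reward)`; these names are
    illustrative, not from the cited paper.
    """
    if depth == 0:
        return 0.0  # truncate the recursion at the chosen horizon
    best = float("-inf")
    for action in mdp.actions(state):
        total = 0.0
        # Probabilistic choices are explored by sampling successor states.
        for _ in range(num_samples):
            next_state, reward = mdp.sample(state, action)
            total += reward + gamma * sparse_sampling_value(
                next_state, depth - 1, mdp, gamma, num_samples)
        best = max(best, total / num_samples)
    return best


def sparse_sampling_action(state, depth, mdp, gamma, num_samples):
    """Return the action with the greatest estimated expected reward."""
    def q_estimate(action):
        total = 0.0
        for _ in range(num_samples):
            next_state, reward = mdp.sample(state, action)
            total += reward + gamma * sparse_sampling_value(
                next_state, depth - 1, mdp, gamma, num_samples)
        return total / num_samples

    return max(mdp.actions(state), key=q_estimate)
```

The cost of one call is roughly (number of actions × number of samples) raised to the depth, which is the probabilistically bounded, discount-dependent search space mentioned above.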

Although the mean-payoff parity and the priority weighted function are both generalizations of parity and mean-payoff functions, they have radically different properties. The main difference is that using the mean-payoff parity function does not guarantee the existence of pure stationary optimal strategies, and the controller may even need an infinite amount of memory to play optimally [3]. On the other hand, the use of a priority mean-payoff function guarantees the existence of optimal strategies that are pure and stationary (cf. Theorem 2 in this paper or also [12]). Another difference between the mean-payoff parity and the priority mean-payoff function arises when we consider the stochastic framework. The mean-payoff parity function may have value −∞, and as soon as this occurs with positive probability the expected payoff is −∞. Hence, if there is a non-zero probability that the parity condition is violated, the controller of a mean-payoff parity MDP becomes totally indifferent to the mean-payoff evaluation of rewards. Such a phenomenon does not occur in a priority mean-payoff MDP. Thus, when MDPs are used to model stochastic systems with both fairness assumptions and quantitative constraints, using a priority mean-payoff function guarantees that the expected payoff always depends on both the qualitative (parity) and quantitative (mean-payoff) aspects of the specification.

II. RELATED WORK
The classic algorithms to solve MDPs are ‘policy iteration’ and ‘value iteration’ [31]. Model checking algorithms for MDPs may use value iteration applied to probabilities [1, Ch. 10] or solve the same problem using linear programming [3]. The principal challenge of finding optimal schedulers is what has been described as the ‘curse of dimensionality’ [2] and the ‘state explosion problem’ [7]. In essence, these two terms refer to the fact that the number of states of a system increases exponentially with respect to the number of interacting components and state variables. This phenomenon has motivated the design of lightweight sampling algorithms that find ‘near optimal’ schedulers to optimise rewards in discounted MDPs [19], but the standard model checking problem of finding extremal probabilities in non-discounted MDPs is significantly more challenging. Since nondeterministic and probabilistic choices are interleaved in an MDP, schedulers are typically of the same order of complexity as the system as a whole and may be infinite. As a result, previous SMC algorithms for MDPs have considered only memoryless schedulers or have other limitations.
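As an illustration of value iteration applied to probabilities, the sketch below computes maximal reachability probabilities on an explicitly given MDP. The data layout (`trans[(s, a)]` as a list of probability/successor pairs) is an assumption made for the example, not the representation used by any particular model checker.

```python
def max_reachability_probabilities(states, actions, trans, target,
                                   tol=1e-8, max_iter=100_000):
    """Value iteration for the maximal probability of reaching `target`.

    `states` is an iterable of states, `actions(s)` lists the actions
    enabled in s, and `trans[(s, a)]` is a list of (probability, successor)
    pairs; these names are illustrative assumptions.
    """
    value = {s: (1.0 if s in target else 0.0) for s in states}
    for _ in range(max_iter):
        delta = 0.0
        for s in states:
            if s in target:
                continue
            best = 0.0
            for a in actions(s):
                # Expected value of taking action a in state s.
                q = sum(p * value[s2] for p, s2 in trans[(s, a)])
                best = max(best, q)
            delta = max(delta, abs(best - value[s]))
            value[s] = best
        if delta < tol:
            break
    return value
```

Replacing the inner loop by a linear-programming formulation gives the alternative solution method mentioned in [3]; both operate on the full state space and therefore suffer directly from state explosion.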

There are two natural perspectives to this work. First, as far as the infinite horizon case is concerned, other types of lexicographic refinements could be proposed. One of these options could be to avoid the duplication of the set of transitions that occur several times in a single trajectory and consider only those which are observed. A second perspective of this work will be to define reinforcement learning [14] type algorithms for P-MDPs. Such algorithms would use samplings of the trajectories instead of full dynamic programming or quantile-based reinforcement learning approaches [7].

Section 2 is devoted to the definition of the controlled non-linear branching processes and to the proof of preliminary properties. Using the criteria of [CV15a], we also state in Section 2 and prove in Section 5 that for any Markov control α, the controlled branching process $X^{x,\alpha}$ admits a unique quasi-stationary distribution $\pi^{\alpha}$ with absorption rate $\lambda^{\alpha} > 0$, and that the conditional distributions converge exponentially and uniformly in total variation to the QSD. We extend in Section 3 the problem of infinite horizon control (1) to positive values of β, and we also state our main results on infinite horizon control, among which the fact that, if $f \geq 0$ and $f(0, \cdot) \equiv 0$, then, when $\beta \to \lambda^* := \inf_{\alpha \in \mathcal{A}_M} \lambda^{\alpha}$

Campus de Beaulieu 35042 Rennes, FRANCE. 12 October 1993
Abstract
We characterize the conditions under which an absorbing Markovian finite process (in discrete or continuous time) can be transformed into a new aggregated process conserving the Markovian property, whose states are elements of a given partition of the original state space. To obtain this characterization, a key tool is the quasi-stationary distribution associated with absorbing processes. It allows the absorbing case to be related to the irreducible one. We are able to calculate the set of all initial distributions of the starting process leading to an aggregated homogeneous Markov process by means of a finite algorithm. Finally, it is shown that the continuous time case can always be reduced to the discrete one using the uniformization technique.
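The paper's characterization rests on quasi-stationary distributions and is not reproduced here; as a simpler, related illustration of aggregating a Markov chain over a partition, the sketch below checks the classical (strong) lumpability condition, which requires every state of a block to have the same total transition probability into each other block. It is only a stand-in, not the weak-lumpability criterion developed in the paper.

```python
import numpy as np

def is_strongly_lumpable(P, partition, tol=1e-10):
    """Check ordinary (strong) lumpability of transition matrix P with
    respect to a partition of the state space.

    This classical condition is simpler than the quasi-stationary-based
    characterization of the paper and is shown only as an illustration of
    state aggregation.
    """
    for block in partition:
        for target in partition:
            # Probability of jumping from each state of `block` into `target`.
            row_sums = [P[s, list(target)].sum() for s in block]
            if max(row_sums) - min(row_sums) > tol:
                return False
    return True

# Example: a 3-state absorbing chain where states {0, 1} are aggregated.
P = np.array([[0.5, 0.3, 0.2],
              [0.3, 0.5, 0.2],
              [0.0, 0.0, 1.0]])
print(is_strongly_lumpable(P, [{0, 1}, {2}]))  # True: each row gives 0.8 / 0.2
```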


Before unmanned aircraft can fly safely in civil airspace, robust airborne collision avoidance systems must be developed. Instead of hand-crafting a collision avoidance algorithm for every combination of sensor and aircraft configuration, we investigate the automatic generation of collision avoidance algorithms given models of aircraft dynamics, sensor performance, and intruder behavior. By formulating the problem of collision avoidance as a Markov Decision Process (MDP) for sensors that provide precise localization of the intruder aircraft, or a Partially Observable Markov Decision Process (POMDP) for sensors that have positional uncertainty or limited field-of-view constraints, generic MDP/POMDP solvers can be used to generate avoidance strategies that optimize a cost function that balances flight-plan deviation with collision. Experimental results demonstrate the suitability of such an approach using four different sensor modalities and a parametric aircraft performance model.

Remark 3. The optimistic part of the algorithm allows a deep exploration of the MDP. At the same time, it biases the expression maximized by $\hat{\pi}$ in (4) towards near-optimal actions of the deterministic realizations. Under the assumptions of Theorem 1, the bias becomes insignificant.
Remark 4. Notice that we do not use the optimistic properties of the algorithm in the analysis. The analysis only uses the "safe" part of the SOP planning, i.e. the fact that one sample out of two is devoted to expanding the shallowest nodes. An analysis of the benefit of the optimistic part of the algorithm, similar to the analyses carried out in [9, 2], would be much more involved and is deferred to future work. However, the impact of the optimistic part of the algorithm is essential in practice, as shown in the numerical results.
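The following sketch illustrates the budget split described in Remark 4: expansions alternate between the shallowest open leaves (the "safe" half used in the analysis) and the leaves with the largest optimistic value (the half that matters in practice). The node interface (`expand`, `optimistic_value`) is hypothetical and the loop is only a schematic rendering of the idea, not the SOP algorithm itself.

```python
import heapq

def alternate_expansion(root, expand, optimistic_value, budget):
    """Spend half the expansion budget on the shallowest open leaves and the
    other half on the leaves with the largest optimistic upper bound.

    `expand(node)` is assumed to return the children of `node`, and
    `optimistic_value(node)` an upper bound on the return achievable through
    it; both names are illustrative assumptions.
    """
    by_depth = [(0, 0, root)]                        # (depth, tiebreak, node)
    by_bound = [(-optimistic_value(root), 0, root)]  # (-bound, tiebreak, node)
    counter = 1
    expanded = set()
    for step in range(budget):
        queue = by_depth if step % 2 == 0 else by_bound  # alternate the two criteria
        node = None
        while queue:
            _, _, candidate = heapq.heappop(queue)
            if id(candidate) not in expanded:   # lazily skip already-expanded leaves
                node = candidate
                break
        if node is None:
            break  # no open leaf left under this criterion
        expanded.add(id(node))
        depth = getattr(node, "depth", 0)
        for child in expand(node):
            child.depth = depth + 1
            heapq.heappush(by_depth, (child.depth, counter, child))
            heapq.heappush(by_bound, (-optimistic_value(child), counter, child))
            counter += 1
    return expanded
```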

First we describe how the local minimum was found, which also shows that the approach of a finite-state controller with policy gradient is quite effective for this problem. The initial policy has equal action probabilities for all internal-state and observation pairs, and has 0.2 as the internal-transition parameter. At each iteration, the gradient is estimated from a simulated sample trajectory of length 20000 (a moderate number for the size of this problem), without using any estimates from previous iterations. Denoting the estimate by $\hat{\nabla}\eta$, we then project $-\hat{\nabla}\eta$ onto the feasible direction set and update the policy parameter by a small constant step along the projected direction. We used GPOMDP in this procedure, mainly because it needs less computation. The initial policy has average cost −0.234. The cost monotonically decreases, and within 4000 iterations the policy gets into the neighborhood of a local minimum, oscillating around it afterwards, with average costs in the interval [−0.366, −0.361] for the last 300 iterations. As a comparison, the optimal (liminf) average cost of this POMDP is bounded below by −0.460, which is computed using an approximation scheme from [YB04].
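A minimal sketch of the parameter update just described, assuming a single action-probability vector as the parameter: the negated gradient estimate is projected onto a feasible direction set and the parameters are moved by a small constant step. The projection used here (onto the tangent space of the probability simplex, ignoring the non-negativity boundary) is a simplified stand-in, not the projection used in the experiment, and the parameterization in the experiment is richer (one distribution per internal-state/observation pair).

```python
import numpy as np

def project_to_simplex_tangent(direction):
    """Project a direction onto the tangent space of the probability simplex
    (components sum to zero). A simplified stand-in for the feasible
    direction set mentioned above; the non-negativity boundary is ignored."""
    return direction - direction.mean()

def projected_gradient_step(theta, grad_estimate, step_size=1e-3):
    """Move the parameters a small constant step along the projection of the
    negated gradient estimate, as in the procedure described above."""
    direction = project_to_simplex_tangent(-np.asarray(grad_estimate, dtype=float))
    return np.asarray(theta, dtype=float) + step_size * direction

# Toy usage: three actions, some gradient estimate from a sampled trajectory.
theta = np.array([0.4, 0.3, 0.3])
grad_hat = np.array([0.5, -0.2, 0.1])
print(projected_gradient_step(theta, grad_hat))  # still sums to 1
```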


1 Introduction
Our goal is to find good, though not necessarily optimal, solutions for large, factored Markov decision processes. We present an approximate algorithm, DetH*, which applies two types of leverage to the problem: it shortens the horizon using an automatically generated temporal hierarchy and it reduces the effective size of the state space through state aggregation. DetH* uses connectivity heuristics to break the state space into a number of macro-states. It then assumes that transitions between these macro-states are deterministic, allowing it to quickly compute a top-level policy mapping macro-states to macro-states. Once this policy has been computed, DetH* solves for policies in each macro-state independently. We are able to construct and solve these hierarchies significantly faster than solving the original problem.
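The top-level step can be pictured as follows: once transitions between macro-states are assumed deterministic, the macro-level policy reduces to a shortest-path computation over macro-states. The sketch below does this with a backward breadth-first search; the data layout and names are illustrative assumptions, and the subsequent per-macro-state solving (e.g. value iteration restricted to each macro-state, with the chosen neighbouring macro-state as the local goal) is omitted.

```python
from collections import deque

def top_level_policy(macro_states, macro_edges, goal_macro):
    """Compute a top-level policy under the simplifying assumption that
    transitions between macro-states are deterministic.

    `macro_edges[m]` is assumed to list the macro-states reachable from `m`;
    the result maps each macro-state to the neighbouring macro-state to head
    for next. Names and layout are illustrative, not the authors' code.
    """
    # Backward BFS from the goal, so each macro-state points one step closer.
    policy = {goal_macro: goal_macro}
    frontier = deque([goal_macro])
    while frontier:
        current = frontier.popleft()
        for m in macro_states:
            if m not in policy and current in macro_edges[m]:
                policy[m] = current   # from m, head towards `current`
                frontier.append(m)
    return policy   # macro-states absent from the dict cannot reach the goal
```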
