On the Use of Non-Stationary Policies for Infinite-Horizon Discounted Markov Decision Processes

With non-stationary policies, the asymptotic performance bound of Value Iteration with per-iteration errors bounded by ε improves to γ/(1−γ)·ε, which is significant in the usual situation when γ is close to 1. Given Bellman operators that can only be computed with some error ε, a surprising consequence of this result is that the problem of “computing an approximately optimal non-stationary policy” is much simpler than that of “computing an approximately optimal stationary policy”, and even slightly simpler than that of “approximately computing the value of some fixed policy”, since this last problem only has a guarantee of 1/(1−γ)·ε.
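For reference, a hedged LaTeX summary of the three guarantees being compared; the (1−γ)⁻² bound for the last stationary policy is the classical approximate Value Iteration bound (stated here as an assumption, with constants omitted), and the value-function symbols are illustrative notation rather than the paper's.

```latex
% Hedged summary (constants omitted): per-iteration error bounded by \epsilon.
\[
\underbrace{\lVert v_* - v_{\pi_k}\rVert_\infty \;\lesssim\; \frac{\gamma}{(1-\gamma)^2}\,\epsilon}_{\text{last stationary policy (classical bound)}}
\qquad
\underbrace{\lVert v_* - v_{\pi_{k,m}}\rVert_\infty \;\lesssim\; \frac{\gamma}{1-\gamma}\,\epsilon}_{\text{periodic non-stationary policy}}
\qquad
\underbrace{\lVert v_\pi - \hat v_\pi\rVert_\infty \;\le\; \frac{1}{1-\gamma}\,\epsilon}_{\text{approximate policy evaluation}}
\]
```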

Algorithmic aspects of mean–variance optimization in Markov decision processes

Mean-variance optimization problems resembling ours have been studied in the literature. For example, Guo, Ye, and Yin (2012) consider a mean-variance optimization problem, but subject to a constraint on the vector of expected rewards starting from each state, which results in a simpler problem, amenable to a policy iteration approach. Collins (1997) provides an apparently exponential-time algorithm for a variant of our problem, and Tamar, Di-Castro, and Mannor (2012) propose a policy gradient approach that aims at a locally optimal solution. Expressions for the variance of the discounted reward for stationary policies were developed by Sobel (1982). However, these expressions are quadratic in the underlying transition probabilities and do not lead to convex optimization problems. Similarly, much of the earlier literature on the problem (see Kawai (1987) and Huang and Kallenberg (1994) for a unified approach) provides various mathematical programming formulations. In general, these formulations either deal with problems that differ qualitatively from ours, focusing on the variation of reward from its average (Filar, Kallenberg, & Lee, 1989; White, 1992), or are nonconvex, and therefore do not address the issue of polynomial-time solvability, which is our focus. Indeed, we are not aware of any complexity results on mean-variance optimization problems. We finally note some interesting variance bounds obtained by Arlotto, Gans, and Steele (2013).

Lexicographic refinements in possibilistic decision trees and finite-horizon Markov decision processes

The second perspective of this work, not unrelated, is to develop simulation-based algorithms for finding lexicographic solutions to MDPs. Reinforcement Learning algorithms [41] make it possible to solve large MDPs by making use of simulated trajectories of states and actions. It is not immediate to develop RL algorithms for possibilistic MDPs, since no unique stochastic transition function corresponds to a possibility distribution [42]. However, uniform simulation of trajectories (with random choice of actions) may be used to generate an approximation of the possibilistic decision tree (provided that both the transition possibilities and the utility of the leaf are given with each simulated trajectory). So, interleaving simulations and lexicographic dynamic programming may lead to RL-type algorithms for approximating lexicographically optimal policies for (large) possibilistic MDPs.

Lexicographic refinements in stationary possibilistic Markov Decision Processes

The present section defines an extension of lexicographic refinements to finite-horizon possibilistic Markov decision processes and proposes a value iteration algorithm that looks for policies optimal with respect to these criteria.

3.1. Lexi-refinements of ordinal aggregations

In ordinal (i.e. min-based and max-based) aggregation, a solution to the drowning effect based on leximin and leximax comparisons has been proposed by [19]. It has then been extended to non-sequential decision making under uncertainty [13] and, in the sequential case, to decision trees [4]. Let us first recall the basic definition of these two preference relations, for any two vectors t and t′ of length m built on the scale L:
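The excerpt stops before stating the two relations, so, for reference, here is a reconstruction of the standard leximin and leximax definitions (an assumption based on the usual definitions, not a quotation from the paper):

```latex
% Standard leximin/leximax orderings on vectors of length m over a totally ordered
% scale L (reconstruction of the usual definitions).
% Let t_{(1)} \le \dots \le t_{(m)} be the components of t sorted in non-decreasing
% order, and t_{[1]} \ge \dots \ge t_{[m]} the components sorted in non-increasing order.
\[
t \succ_{\mathrm{leximin}} t' \iff \exists k \le m:\;
\forall i < k,\; t_{(i)} = t'_{(i)} \;\text{ and }\; t_{(k)} > t'_{(k)},
\]
\[
t \succ_{\mathrm{leximax}} t' \iff \exists k \le m:\;
\forall i < k,\; t_{[i]} = t'_{[i]} \;\text{ and }\; t_{[k]} > t'_{[k]}.
\]
```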

Lexicographic refinements in possibilistic decision trees and finite-horizon Markov decision processes

Possibilistic decision theory was proposed twenty years ago and has had several extensions since then. Even though it is appealing for its ability to handle qualitative decision problems, possibilistic decision theory suffers from an important drawback. Qualitative possibilistic utility criteria compare acts through min and max operators, which leads to a drowning effect. To overcome this lack of decision power of the theory, several refinements have been proposed. Lexicographic refinements are particularly appealing since they make it possible to benefit from the Expected Utility background while remaining qualitative. This article aims at extending lexicographic refinements to sequential decision problems, i.e., to possibilistic decision trees and possibilistic Markov decision processes, when the horizon is finite. We present two criteria that refine qualitative possibilistic utilities and provide dynamic programming algorithms for calculating lexicographically optimal policies.

Lightweight Verification of Markov Decision Processes with Rewards

The Kearns algorithm is the classic ‘sparse sampling algorithm’ for large, infinite-horizon, discounted MDPs. It constructs a ‘near optimal’ scheduler piecewise, by approximating the best action from the current state using a stochastic depth-first search. The algorithm can work with large, potentially infinite-state MDPs because it explores a probabilistically bounded search space. This, however, is exponential in the discount. To find the action with the greatest expected reward in the current state, the algorithm recursively estimates the rewards of successive states, up to some maximum depth defined by the discount and the desired error. Actions are enumerated, while probabilistic choices are explored by sampling, with the number of samples set as a parameter. By iterating local exploration with probabilistic sampling, the discount guarantees that the algorithm eventually converges. The stopping criterion is that successive estimates differ by less than some error threshold.

Scalable Verification of Markov Decision Processes

The Kearns algorithm [13] is the classic ‘sparse sampling algorithm’ for large, infinite-horizon, discounted MDPs. It constructs a ‘near optimal’ scheduler piecewise, by approximating the best action from the current state using a stochastic depth-first search. Importantly, optimality is with respect to rewards, not probability (as required by standard model checking tasks). The algorithm can work with large, potentially infinite-state MDPs because it explores a probabilistically bounded search space. This, however, is exponential in the discount. To find the action with the greatest expected reward in the current state, the algorithm recursively estimates the rewards of successive states, up to some maximum depth defined by the discount and the desired error. Actions are enumerated, while probabilistic choices are explored by sampling, with the number of samples set as a parameter. The error is specified as a maximum difference between consecutive estimates, allowing the discount to guarantee that the algorithm will eventually terminate.
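To make the recursion concrete, here is a minimal Python sketch of the sparse-sampling estimate described above; the generative model `simulate(state, action) -> (next_state, reward)` and the `actions(state)` enumerator are assumed interfaces for illustration, not code from either paper.

```python
def sparse_sampling_value(simulate, actions, state, depth, width, gamma):
    """Estimate the optimal discounted value of `state` by recursive sampling.

    `depth` is the maximum look-ahead implied by the discount and target error;
    `width` is the number of samples drawn per action (both set as parameters).
    """
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in actions(state):                 # actions are enumerated exhaustively
        total = 0.0
        for _ in range(width):               # probabilistic choices are sampled
            next_state, reward = simulate(state, a)
            total += reward + gamma * sparse_sampling_value(
                simulate, actions, next_state, depth - 1, width, gamma)
        best = max(best, total / width)
    return best

def sparse_sampling_action(simulate, actions, state, depth, width, gamma):
    """Pick the action whose sampled one-step look-ahead value is largest."""
    def q(a):
        total = 0.0
        for _ in range(width):
            s2, r = simulate(state, a)
            total += r + gamma * sparse_sampling_value(
                simulate, actions, s2, depth - 1, width, gamma)
        return total / width
    return max(actions(state), key=q)
```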

Limits of Multi-Discounted Markov Decision Processes

Although the mean-payoff parity and the priority weighted functions are both generalizations of the parity and mean-payoff functions, they have radically different properties. The main difference is that using the mean-payoff parity function does not guarantee the existence of pure stationary optimal strategies, and the controller may even need an infinite amount of memory to play optimally [3]. On the other hand, the use of a priority mean-payoff function guarantees the existence of optimal strategies that are pure and stationary (cf. Theorem 2 in this paper, or also [12]). Another difference between the mean-payoff parity and the priority mean-payoff function arises when we consider the stochastic framework. The mean-payoff parity function may take the value −∞, and as soon as this occurs with positive probability the expected payoff is −∞. Hence, if there is a non-zero probability that the parity condition is violated, the controller of a mean-payoff parity MDP becomes totally indifferent to the mean-payoff evaluation of rewards. Such a phenomenon does not occur in a priority mean-payoff MDP. Thus, when MDPs are used to model stochastic systems with both fairness assumptions and quantitative constraints, using a priority mean-payoff function guarantees that the expected payoff always depends on both the qualitative (parity) and quantitative (mean-payoff) aspects of the specification.

Smart Sampling for Lightweight Verification of Markov Decision Processes

The classic algorithms to solve MDPs are ‘policy iteration’ and ‘value iteration’ [31]. Model checking algorithms for MDPs may use value iteration applied to probabilities [1, Ch. 10] or solve the same problem using linear programming [3]. The principal challenge of finding optimal schedulers is what has been described as the ‘curse of dimensionality’ [2] and the ‘state explosion problem’ [7]. In essence, these two terms refer to the fact that the number of states of a system increases exponentially with respect to the number of interacting components and state variables. This phenomenon has motivated the design of lightweight sampling algorithms that find ‘near optimal’ schedulers to optimise rewards in discounted MDPs [19], but the standard model checking problem of finding extremal probabilities in non-discounted MDPs is significantly more challenging. Since nondeterministic and probabilistic choices are interleaved in an MDP, schedulers are typically of the same order of complexity as the system as a whole and may be infinite. As a result, previous SMC algorithms for MDPs have considered only memoryless schedulers or have other limitations.
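As a concrete reference for ‘value iteration applied to probabilities’, here is a minimal sketch that computes maximum reachability probabilities in an explicit MDP; the dictionary-based encoding (`states`, `actions`, `trans`, `targets`) is an assumption for illustration, not the interface of the cited model checkers.

```python
def max_reachability_probabilities(states, actions, trans, targets, eps=1e-8):
    """Value iteration for Pr_max(reach `targets`) in an explicit MDP.

    states: iterable of states; actions(s): available actions in s;
    trans(s, a): list of (probability, successor) pairs; targets: set of states.
    These names are illustrative assumptions, not a specific tool's API.
    """
    value = {s: (1.0 if s in targets else 0.0) for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s in targets:
                continue
            # maximise over nondeterministic choices, average over probabilistic ones
            best = max(
                (sum(p * value[s2] for p, s2 in trans(s, a)) for a in actions(s)),
                default=0.0,
            )
            delta = max(delta, abs(best - value[s]))
            value[s] = best
        if delta < eps:          # stop when successive sweeps barely change the values
            return value
```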

Efficient Policies for Stationary Possibilistic Markov Decision Processes

There are two natural perspectives to this work. First, as far as the infinite-horizon case is concerned, other types of lexicographic refinements could be proposed. One of these options could be to avoid the duplication of the set of transitions that occur several times in a single trajectory and to consider only those which are observed. A second perspective of this work will be to define reinforcement learning [14] type algorithms for P-MDPs. Such algorithms would use samplings of the trajectories instead of full dynamic programming, or quantile-based reinforcement learning approaches [7].

On the link between infinite horizon control and quasi-stationary distributions

Section 2 is devoted to the definition of the controlled non-linear branching processes and to the proof of preliminary properties. Using the criteria of [CV15a], we also state in Section 2 and prove in Section 5 that, for all Markov controls α, the controlled branching process X^{x,α} admits a unique quasi-stationary distribution π^α with absorption rate λ^α > 0, and that the conditional distributions converge exponentially and uniformly in total variation to the QSD. We extend in Section 3 the problem of infinite-horizon control (1) to positive values of β, and we also state our main results on infinite-horizon control, among which the fact that, if f ≥ 0 and f(0, ·) ≡ 0, then, when β → λ* := inf_{α∈A_M} λ^α, …
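For context, the standard definition of a quasi-stationary distribution and its absorption rate, written here in generic notation (a reminder, not a quotation from the paper):

```latex
% Standard definition (generic notation): for an absorbed Markov process (X_t)
% with absorption time \tau, a probability measure \pi on the non-absorbed states
% is a quasi-stationary distribution (QSD) if conditioning on survival leaves it invariant:
\[
\mathbb{P}_{\pi}\!\left(X_t \in A \mid t < \tau\right) = \pi(A)
\qquad \text{for all measurable } A \text{ and all } t \ge 0 .
\]
% Starting from a QSD, the absorption time is exponentially distributed; its rate
% \lambda > 0 is the absorption rate associated with \pi:
\[
\mathbb{P}_{\pi}\!\left(t < \tau\right) = e^{-\lambda t}.
\]
```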

Exact aggregation of absorbing Markov processes using quasi-stationary distribution

We characterize the conditions under which an absorbing finite Markov process (in discrete or continuous time) can be transformed into a new aggregated process conserving the Markovian property, whose states are elements of a given partition of the original state space. To obtain this characterization, a key tool is the quasi-stationary distribution associated with absorbing processes. It allows the absorbing case to be related to the irreducible one. We are able to calculate the set of all initial distributions of the starting process leading to an aggregated homogeneous Markov process by means of a finite algorithm. Finally, it is shown that the continuous-time case can always be reduced to the discrete one using the uniformization technique.

A Stochastic Minimax Optimal Control Problem on Markov Chains with Infinite Horizon


Collision Avoidance for Unmanned Aircraft using Markov Decision Processes

Before unmanned aircraft can fly safely in civil airspace, robust airborne collision avoidance systems must be developed. Instead of hand-crafting a collision avoidance algorithm for every combination of sensor and aircraft configuration, we investigate the automatic generation of collision avoidance algorithms given models of aircraft dynamics, sensor performance, and intruder behavior. By formulating the problem of collision avoidance as a Markov Decision Process (MDP) for sensors that provide precise localization of the intruder aircraft, or a Partially Observable Markov Decision Process (POMDP) for sensors that have positional uncertainty or limited field-of-view constraints, generic MDP/POMDP solvers can be used to generate avoidance strategies that optimize a cost function balancing flight-plan deviation against the risk of collision. Experimental results demonstrate the suitability of such an approach using four different sensor modalities and a parametric aircraft performance model.

Aggregating Optimistic Planning Trees for Solving Markov Decision Processes

Remark 3. The optimistic part of the algorithm allows a deep exploration of the MDP. At the same time, it biases the expression maximized by π̂ in (4) towards near-optimal actions of the deterministic realizations. Under the assumptions of Theorem 1, the bias becomes insignificant.

Remark 4. Notice that we do not use the optimistic properties of the algorithm in the analysis. The analysis only uses the “safe” part of the SOP planning, i.e. the fact that one sample out of two is devoted to expanding the shallowest nodes. An analysis of the benefit of the optimistic part of the algorithm, similar to the analyses carried out in [9, 2], would be much more involved and is deferred to future work. However, the impact of the optimistic part of the algorithm is essential in practice, as shown in the numerical results.

Approximate solution methods for partially observable Markov and semi-Markov decision processes

First we describe how the local minimum was found, which also shows that the approach of a finite-state controller with policy gradient is quite effective for this problem. The initial policy has equal action probabilities for all internal-state and observation pairs, and has 0.2 as the internal-transition parameter. At each iteration, the gradient is estimated from a simulated sample trajectory of length 20000 (a moderate number for the size of this problem), without using any estimates from previous iterations. Denoting the estimate by ∇̂η, we then project −∇̂η onto the feasible direction set, and update the policy parameter by a small constant step along the projected direction. We used GPOMDP in this procedure, mainly because it needs less computation. The initial policy has average cost −0.234. The cost decreases monotonically, and within 4000 iterations the policy gets into the neighborhood of a local minimum, oscillating around it afterwards, with average costs in the interval [−0.366, −0.361] for the last 300 iterations. As a comparison, the optimal (liminf) average cost of this POMDP is bounded below by −0.460, which is computed using an approximation scheme from [YB04].
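A minimal sketch of the projected policy-gradient loop described above, assuming helper functions `estimate_gradient` (e.g. a GPOMDP-style estimator run on a fresh simulated trajectory) and `project` (projection onto the feasible direction set); these names and signatures are illustrative assumptions, not code from the paper.

```python
import numpy as np

def projected_policy_gradient(theta0, estimate_gradient, project, step=1e-3,
                              iterations=4000, trajectory_length=20000):
    """Minimise average cost by small constant steps along projected gradient directions.

    theta0: initial policy parameters (e.g. a finite-state controller's parameters).
    estimate_gradient(theta, trajectory_length): returns an estimate of grad(eta)
        from one simulated trajectory (no reuse of previous estimates).
    project(theta, direction): projects a descent direction onto the feasible
        direction set at theta, so parameter constraints stay satisfied.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iterations):
        grad = estimate_gradient(theta, trajectory_length)  # fresh estimate each iteration
        direction = project(theta, -grad)                   # feasible descent direction
        theta = theta + step * direction                    # small constant step
    return theta
```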

On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems

The regret is measured with respect to the strategy of playing a policy with the highest expected reward, and it grows as the logarithm of T. More recently, finite-time bounds for the regret have been derived (see Auer et al. (2002); Audibert et al. (2007)). Though the stationary formulation of the MABP makes it possible to address the exploration-versus-exploitation challenge in an intuitive and elegant way, it may fail to be adequate for modelling an evolving environment where the reward distributions undergo changes in time. As an example, in the cognitive medium radio access problem (Lai et al., 2007), a user wishes to opportunistically exploit the availability of an empty channel in a multiple-channel system; the reward is the availability of the channel, whose distribution is unknown to the user. Another application is the real-time optimization of websites by targeting relevant content at individuals and maximizing the general interest by learning and serving the most popular content (such situations have been considered in the recent Exploration versus Exploitation (EvE) PASCAL challenge by Hartland et al. (2006); see also Koulouriotis and Xanthopoulos (2008) and the references therein). These examples illustrate the limitations of stationary MAB models: the probability that a given channel is available is likely to change in time, and the news stories a visitor of a website is most likely to be interested in vary in time.

Incorporating Bayesian networks in Markov Decision Processes

The prescribed inspection type for the first time period, in the case of using the BN, is i2, whereas it is i3, which has a smaller SD (and is hence costlier), in the case of using a transition matrix. This result arises because the BN contributes to the decision process during each time period by introducing the relevant available information; thus, there is less of a need for costly inspection techniques. The expected direct costs decrease sharply after the first period, because the initial belief state of the pavement is very poor. Starting from the second time period, a continuous gradual decrease is observed, ending with zero cost for the final period, because the manager of the pavement is then less concerned with potential future costs. This result may change if a specified state at the end of the planning horizon were imposed on the manager. A total of 100 simulations were performed to test the obtained prescribed IMR strategy for the two cases: (1) using a simple transition matrix; and (2) using a BN. The evolution of the state of the pavement was generated randomly by using the two degradation models. For the belief state obtained at the beginning of each time period, the IMR strategy prescribed by the solution of the problem was implemented, using the BN methodology and the mean transition matrix methodology respectively. The results obtained by using the IMR strategy of the BN had an average expected cost of US$11.78 per m³ and an SD of US$1.36 per m³. The results obtained by using the mean transition matrix IMR strategy had an average expected cost of US$12.32 per m³ and an SD of US$2.12 per m³. These findings confirm the results obtained by applying the proposed methodology. Also, the SD obtained by using the DBN methodology is smaller than that obtained by using the classical transition matrix.

DetH*: Approximate Hierarchical Solution of Large Markov Decision Processes

Our goal is to find good, though not necessarily optimal, solutions for large, factored Markov decision processes. We present an approximate algorithm, DetH*, which applies two types of leverage to the problem: it shortens the horizon using an automatically generated temporal hierarchy, and it reduces the effective size of the state space through state aggregation. DetH* uses connectivity heuristics to break the state space into a number of macro-states. It then assumes that transitions between these macro-states are deterministic, allowing it to quickly compute a top-level policy mapping macro-states to macro-states. Once this policy has been computed, DetH* solves for policies in each macro-state independently. We are able to construct and solve these hierarchies significantly faster than solving the original problem.
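A rough sketch of the two-level structure described above, under stated assumptions: macro-states and inter-macro-state costs are already given, the top level plans deterministically over macro-states, and each macro-state is then solved independently. The Dijkstra-based top-level planner and the `solve_macro` helper are illustrative assumptions, not the DetH* heuristics themselves.

```python
from heapq import heappush, heappop

def top_level_policy(macro_states, macro_edges, goal_macro):
    """Plan over macro-states as if their transitions were deterministic.

    macro_edges: dict mapping a macro-state to {successor_macro: cost}; all
    successors are assumed to appear in macro_states. Returns, for each
    reachable macro-state, the successor macro-state to head for (a
    shortest-path policy toward goal_macro).
    """
    dist = {goal_macro: 0.0}
    successor = {goal_macro: goal_macro}
    # Build reversed edges so Dijkstra computes cost-to-go toward the goal.
    reverse = {m: {} for m in macro_states}
    for m, succs in macro_edges.items():
        for m2, cost in succs.items():
            reverse[m2][m] = cost
    frontier = [(0.0, goal_macro)]
    while frontier:
        d, m = heappop(frontier)
        if d > dist.get(m, float("inf")):
            continue
        for prev, cost in reverse[m].items():
            nd = d + cost
            if nd < dist.get(prev, float("inf")):
                dist[prev] = nd
                successor[prev] = m        # from `prev`, head toward macro-state `m`
                heappush(frontier, (nd, prev))
    return successor

def solve_hierarchy(macro_states, macro_edges, goal_macro, solve_macro):
    """Compute the top-level policy, then solve each macro-state independently.

    solve_macro(macro, target_macro) is an assumed per-macro-state solver,
    e.g. value iteration restricted to that macro-state's underlying states.
    """
    policy = top_level_policy(macro_states, macro_edges, goal_macro)
    return {m: solve_macro(m, policy[m]) for m in macro_states if m in policy}
```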