The approach in Markov decision processes revisited

The multidisciplinary open archive HAL is intended for the deposit and dissemination of research-level scientific documents, whether published or not, originating from educational institutions […]

Collision Avoidance for Unmanned Aircraft using Markov Decision Processes

[Figure: sensor inputs (radar, EO/IR, ADS-B).] Such a system would take as input models of the flight dynamics, intruder behavior, and sensor characteristics and attempt to optimize the avoidance strategy so that a predefined cost function is minimized. The cost function could take into account competing objectives, such as flight plan adherence and avoiding collision. One way to formulate a problem involving the optimal control of a stochastic system is as a Markov Decision Process (MDP), or more generally as a Partially Observable Markov Decision Process (POMDP) to also account for observation uncertainty. POMDPs have been studied in the operations research and artificial intelligence communities, but only in the past few years have generic POMDP solution methods been developed that can approximately solve problems with moderate to large state spaces in reasonable time. In this work we investigate the feasibility of applying state-of-the-art MDP and POMDP solution methods to the problem of collision avoidance.
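As a rough illustration of what formulating such a control problem as a finite MDP looks like computationally, the sketch below runs generic value iteration over a discretized state space; the state/action discretization and the cost structure are assumptions for illustration, not the paper's model.

```python
# Generic finite-MDP value iteration (illustrative sketch only; the actual
# collision-avoidance states, actions, and costs would come from discretized
# flight dynamics, intruder behavior, and sensor models, none of which is
# modeled here).
import numpy as np

def value_iteration(P, C, gamma=0.95, tol=1e-6):
    """P: (A, S, S) transition probabilities; C: (S, A) immediate costs
    (e.g. flight-plan deviation plus a collision penalty)."""
    A, S, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = C + gamma * np.einsum("asz,z->sa", P, V)   # expected cost-to-go
        V_new = Q.min(axis=1)                          # minimize the cost
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmin(axis=1)             # value and greedy policy
        V = V_new
```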

Constrained Markov Decision Processes with Total Expected Cost Criteria

4. MAIN RESULTS
Below we identify a linear program that allows one to compute the optimal value and an optimal stationary policy for CMDP. A similar result was already available in [1] but required the strong assumption that s(β, u) is finite for every u. This excludes the shortest path problem, in which policies that include cycles may have infinite cost. In order to handle such situations we note that one may assume throughout that c and d are uniformly bounded below by some positive constant c. The following lemma shows that although we do not assume that all policies have finite occupation measures, those having infinite occupation measures are not optimal. It further shows that one may restrict the search for solutions of CMDP to stationary policies.
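The occupation-measure linear program referred to above can be sketched concretely. The snippet below is a minimal illustration under simplifying assumptions (a single constraint cost d with bound V, initial distribution beta, absorption implicit in the transition matrices); it is not the paper's exact program and all names are illustrative.

```python
# Minimal occupation-measure LP sketch for a constrained MDP with total
# expected cost (illustrative; not the exact program of the paper).
# P[a] is the S x S transition matrix of action a (rows may sum to < 1,
# modelling absorption), c and d are S x A cost matrices, beta is the
# initial distribution, V bounds the expected total constraint cost.
import numpy as np
from scipy.optimize import linprog

def solve_cmdp(P, c, d, beta, V):
    S, A = c.shape
    obj = c.flatten()                      # minimize sum_{x,a} rho(x,a) c(x,a)
    # Flow conservation: sum_a rho(y,a) - sum_{x,a} P[a][x,y] rho(x,a) = beta(y)
    A_eq = np.zeros((S, S * A))
    for y in range(S):
        for a in range(A):
            A_eq[y, y * A + a] += 1.0
            for x in range(S):
                A_eq[y, x * A + a] -= P[a][x, y]
    # Constraint cost: sum_{x,a} rho(x,a) d(x,a) <= V
    res = linprog(obj, A_ub=d.flatten()[None, :], b_ub=[V],
                  A_eq=A_eq, b_eq=beta, bounds=(0, None), method="highs")
    rho = res.x.reshape(S, A)
    # A stationary policy: randomize in proportion to the occupation measure.
    policy = rho / np.maximum(rho.sum(axis=1, keepdims=True), 1e-12)
    return res.fun, policy
```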

A Learning Design Recommendation System Based on Markov Decision Processes

The learning object s′ is reached from s after the transition a. ‖TS(Teacher), TS(s′)‖ is a distance factor between the teacher's teaching styles and the teaching styles of the learning object s′. Consequently, ‖LS(User), LS(s′)‖ represents a distance factor between the learning styles of a learner or a group of learners and the learning styles associated with the learning object s′. Usually in an MDP, a policy π is defined based on the reward function to help the decision maker, in our case our prediction method, to make the right decision. The ILD-RS policy π is defined as follows:
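Since the defining formula is not reproduced in the excerpt, here is a purely illustrative sketch of how a reward built from such distance factors, and a greedy policy over it, might look; the vector encodings of TS/LS and the equal weighting are assumptions, not the ILD-RS definition.

```python
# Illustrative only: a reward for a candidate learning object that penalizes
# the distance between the teacher's and the object's teaching styles (TS)
# and between the learner's and the object's learning styles (LS).
# Encodings and weights are assumed, not taken from the paper.
import numpy as np

def reward(ts_teacher, ls_user, ts_obj, ls_obj, w_ts=0.5, w_ls=0.5):
    d_ts = np.linalg.norm(np.asarray(ts_teacher) - np.asarray(ts_obj))
    d_ls = np.linalg.norm(np.asarray(ls_user) - np.asarray(ls_obj))
    return -(w_ts * d_ts + w_ls * d_ls)     # smaller distances, larger reward

def greedy_policy(successors, ts_teacher, ls_user, styles):
    """Pick the successor learning object with the highest reward.
    styles[s] = (TS(s), LS(s)); successors is a list of candidate objects."""
    return max(successors, key=lambda s: reward(ts_teacher, ls_user, *styles[s]))
```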

DetH*: Approximate Hierarchical Solution of Large Markov Decision Processes

There are two important related lines of work, focusing either on factoring or on decomposition of large MDPs. Improvements on factored MDP solvers continue to treat the MDP as a single problem, but find more compact [Sanner and McAllester, 2005; Sanner et al., 2010] or approximate [St-Aubin et al., 2000] representations. Recent work on topological and focused topological value iteration [Dai and Goldsmith, 2007; Dai et al., 2009] is similar to ours in that it decomposes a large MDP based on the connectivity of the states. However, TVI and FTVI cannot exploit a factored representation and, in a well-connected domain, they are not guaranteed to find any hierarchy at all. FTVI has been run successfully on some smaller factored problems, but requires knowledge of an initial state to reduce the state space size. Moreover, the size of the value function output by TVI or FTVI is necessarily the size of the optimal value function and, therefore, these cannot be run in domains where the representation of the optimal value function is prohibitively large. By relaxing the requirement that the algorithm output the optimal policy, we can find a good approximate policy even in some large problems where the optimal value function is not compactly representable. Our work is also similar to MAXQ [Dietterich, 1998; Mehta et al., 2008] in that it uses a temporal hierarchy to reduce the size of the state space. However, unlike the original MAXQ algorithm, we are able to construct the hierarchy automatically. Although the work of Mehta et al. proposes a method for automatically creating MAXQ hierarchies, the hierarchy creation is costly and only worthwhile if it transfers to many domains.

Smart Sampling for Lightweight Verification of Markov Decision Processes

In [6] the authors present learning algorithms to bound the maximum probability of reachability properties of MDPs. The algorithms work by refining upper and lower bounds associated to individual state-actions, which are initially all set to the most conservative values. Like the approaches of [15], [26], the algorithms are limited to memoryless schedulers of tractable size. Unlike the approach of [15], however, the algorithms do not learn by counting the occurrence of state-actions. When a state that satisfies the property is reached during simulation, the bounds of all the state-actions along the path that reached it are updated according to the true (or estimated) probabilities along the path. This ensures that the bounds remain correct with respect to the true optima, although convergence is very slow. Actions are initially chosen uniformly at random (as in [15]), such that the initial successful simulations will favour the "most popular" state-actions, rather than those that maximise the probability. Since the algorithms resolve nondeterminism by choosing uniformly at random an action that maximises the probability according […]
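A minimal sketch of this kind of bound refinement, assuming known (or previously estimated) transition probabilities: after a simulated path reaches a satisfying state, the upper and lower bounds of the state-actions along that path are tightened with a Bellman-style backup. The data structures and update order are illustrative, not the exact algorithm of [6].

```python
# Illustrative bound refinement along a simulated path (a sketch, not the
# exact algorithm of [6]). U and L map (state, action) pairs to upper/lower
# bounds on the maximum reachability probability, initialised to 1 and 0.

def refine_along_path(path, P, goal, U, L):
    """path: (state, action) pairs of the simulation, in order.
    P[(s, a)]: dict of successor state -> probability (true or estimated)."""
    def best(bounds, s):
        if s in goal:
            return 1.0
        acts = [a for (t, a) in bounds if t == s]
        return max(bounds[(s, a)] for a in acts) if acts else 0.0
    for s, a in reversed(path):                # Bellman-style backup, end first
        U[(s, a)] = min(U[(s, a)], sum(p * best(U, t) for t, p in P[(s, a)].items()))
        L[(s, a)] = max(L[(s, a)], sum(p * best(L, t) for t, p in P[(s, a)].items()))
```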

Strong Uniform Value in Gambling Houses and Partially Observable Markov Decision Processes

When the state space and action sets are finite, Blackwell [6] has proved the existence of a pure strategy that is optimal for every discount factor close to 0, and one can deduce that the strong uniform value exists. In many situations, the decision-maker may not be perfectly informed of the current state variable. For instance, if the state variable represents a resource stock (like the amount of oil in an oil field), the quantity left, which represents the state, can be evaluated, but is not exactly known. This motivates the introduction of the more general model of Partially Observable Markov Decision Process (POMDP). In this model, at each stage, the decision-maker does not observe the current state, but instead receives a signal which is correlated to it.

Approximate solution methods for partially observable Markov and semi-Markov decision processes

1.1.3 Reinforcement Learning Algorithms
In the last part of this thesis (Chapters 10 and 11) we consider approximation algorithms for finite space POMDPs and MDPs under the reinforcement learning setting. This setting differs from the previous problem setting, often referred to as the planning setting, in that a model of the problem is no longer required, and neither is an exact inference mechanism. The motivation and inspiration behind these methods come from animal learning and artificial intelligence: the goal is to build an autonomous agent capable of learning good policies through trial and error while it is operating in the environment (thus the name "reinforce"). The reinforcement learning setting does require, however, that the per-stage costs or rewards are physically present in the environment, which is not true for many planning problems. The reinforcement learning and planning settings are therefore complementary to each other. Via simulations of the model, reinforcement learning algorithms can be applied in the planning setting to obtain approximate solutions. Such methods are especially useful for problems whose models are too large to handle exactly and explicitly.
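As a concrete illustration of the trial-and-error setting described above (and of applying such methods in the planning setting via a simulator), here is a generic tabular Q-learning sketch; the environment interface and hyperparameters are assumptions, not taken from the thesis.

```python
# Minimal tabular Q-learning sketch: the agent learns from per-stage rewards
# observed while interacting with (or simulating) the environment; no explicit
# transition model is required. The env.reset()/env.step() interface is assumed.
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)                       # Q[(state, action)], starts at 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration: the trial-and-error part
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s2, r, done = env.step(a)            # per-stage reward observed online
            target = r + (0.0 if done else gamma * max(Q[(s2, b)] for b in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```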

Non-Stationary Markov Decision Processes: a Worst-Case Approach using Model-Based Reinforcement Learning

always moves to the right since MDP 0 does not capture this risk. As a result, the  = 0 case reflects a favorable evolution for DP-snapshot and a bad one for RATS. The opposite occurs with  = 1, where the cautious behavior dominates over the risky one, and the in-between cases mitigate this effect. In Figure 2a, we display the achieved expected return for each algorithm as a function of , i.e. as a function of the possible evolutions of the NSMDP. As expected, the performance of DP-snapshot strongly depends on this evolution. It achieves a high return for  = 0 and a low return for  = 1. Conversely, the performance of RATS varies less across the different values of . The effect illustrated here is that RATS maximizes the minimal possible return given any evolution of the NSMDP. It provides the guarantee of achieving the best return in the worst case. This behaviour is highly desirable when one requires robust performance guarantees as, for instance, in critical certification processes. Figure 2b displays the return distributions of the three algorithms for  ∈ {0, 0.5, 1}. The effect seen here is the tendency of RATS to diminish the left tail of the distribution, corresponding to low returns, for each evolution. This matches the optimized criterion, i.e. robustly maximizing the worst-case value. A common risk measure is the Conditional Value at Risk (CVaR), defined as the expected return in the worst q% of cases. We report the CVaR at 5% achieved by each algorithm in Table 1b. Notice that RATS always achieves the highest CVaR, compared to both DP-snapshot and DP-NSMDP. Indeed, even though the latter uses the true model, the criterion optimized in DP is the expected return.
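For reference, the CVaR at level q used above is simply the mean of the worst q% of returns. A small sketch follows; the return samples would come from repeated runs of each algorithm, and the sample-based estimator is an illustration, not the paper's evaluation code.

```python
# Conditional Value at Risk (CVaR): expected return over the worst q% of cases.
import numpy as np

def cvar(returns, q=5.0):
    """Mean of the worst q% of returns (q in percent)."""
    r = np.sort(np.asarray(returns, dtype=float))   # ascending: worst cases first
    k = max(1, int(np.ceil(len(r) * q / 100.0)))
    return r[:k].mean()

# Example: CVaR at 5% of 1000 simulated returns (made-up data)
# cvar(np.random.normal(0.0, 1.0, size=1000), q=5.0)
```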

Aggregating Optimistic Planning Trees for Solving Markov Decision Processes

Figure 1: Comparison of ASOP to OP-MDP, UCT, and FSSS on the inverted pendulum benchmark problem, showing the sum of discounted rewards for simulations of 50 time steps. The algorithms are compared for several budgets. In the cases of ASOP, UCT, and FSSS, the budget is in terms of calls to the simulator. OP-MDP does not use a simulator. Instead, every possible successor state is incorporated into the planning tree, together with its precise probability mass, and each of these states is counted against the budget. As the benchmark problem is stochastic, and internal randomization (for the simulator) is used in all algorithms except OP-MDP, the performance is averaged over 50 repetitions. The algorithm parameters have been selected manually to achieve good performance. For ASOP, we show results for forest sizes of two and three. For UCT, the Chernoff-Hoeffding term multiplier is set to 0.2 (the results are not very sensitive to this value, so only one result is shown). For FSSS, we use one to three samples per state-action. For both UCT and FSSS, a rollout depth of seven is used. OP-MDP does not have any parameters. The results are shown in Figure 1. We observe that on this problem, ASOP performs much better than OP-MDP for every value of the budget, and also performs well in comparison to the other sampling-based methods, UCT and FSSS.

Efficient Policies for Stationary Possibilistic Markov Decision Processes

Fig. 2. Bounded lexicographic value iteration vs. unbounded lexicographic value iteration.
…and the possibility degree of the other one is uniformly drawn in L. For each experiment, 100 P-MDPs are generated. The two algorithms are compared w.r.t. two measures: (i) CPU time and (ii) pairwise success rate (Success), the percentage of optimal solutions provided by bounded value iteration with fixed (l, c) w.r.t. the lmax(lmin) criterion in its full generality. The higher Success, the more effective the cutting of matrices with BL-VI; the lower this rate, the more important the drowning effect.

Markov Decision Petri Net and Markov Decision Well-Formed Net Formalisms

Distributed systems often present symmetries, i.e., in our framework, many components may have a similar behavior. Thus, both from a modeling and an analysis point of view, it is interesting to look for a formalism expressing and exploiting behavioral symmetries. So we also define Markov Decision Well-formed Nets (MDWN) similarly to what we do for MDPNs. The semantics of a model is then easily obtained by translating an MDWN into an MDPN. Furthermore, we develop an alternative approach: we transform the MDWN into a WN, then we build the symbolic reachability graph of this net [5], and finally we transform this graph into an MDP that is reduced with respect to the original one. We argue that we can compute, on this reduced MDP, the results that we are looking for in the original MDP. The different relations between the formalisms are shown in the figure depicted below. Finally, we have implemented our analysis method within the GreatSPN tool [6] and performed some experiments.

Large Markov Decision Processes based management strategy of inland waterways in uncertain context

1. INTRODUCTION
It is now well recognized that human activities have a significant impact on climate change, mainly through emissions of greenhouse gases (GHG). The latest IPCC report [1] indicates that anthropogenic GHG emissions "came by 11% from transport" from 2000 to 2010, and recommends technical and behavioral mitigation measures in the transport sector. One solution is to shift truck traffic to the inland waterway network, which would provide both economic and environmental benefits [2] [3]. These mitigation measures are also advocated by the historic COP21 agreement in Paris, which aims at limiting the temperature increase to 1.5°C by 2100. Focusing on inland navigation, an increase in traffic is thus expected [4], with an estimated growth of 35% [5], together with an increase in the frequency and intensity of flood and drought periods in the near future. The management of inland waterways must deal with this new challenge.

On the fastest finite Markov processes

The plan of the paper is as follows. The above results (A) and (B) are proved in the next section via a dynamic programming approach, which also provides an alternative proof of the discrete time result of Litvak and Ejov [10]. In Section 3, we decompose the generators leaving π invariant into convex sums of generators associated to (not necessarily Hamiltonian) cycles and we differentiate the expectations of hitting times with respect to the generators. This is the basic tool for the proof of (C) (see Theorem 5 in Section 3) in Section 4, through small perturbations of the uniform probability measure. At the other extreme, large perturbations lead to the proof of (D) (cf. Theorem 6 in Section 3) at the end of the same section. Section 5 contains some observations about the links between continuous time and discrete time. In the appendix, we compute the fastest normalized birth and death generators leaving invariant any fixed positive probability measure π on {0, 1, 2}. The underlying graph is the segment graph of length 2, i.e. the simplest example not containing a Hamiltonian cycle.

Markov concurrent processes

The approach we consider in this paper is based on a treatment of concurrency in a more structural way. A system is distributed over several sites. On each site a local process evolves in the usual way, and is thus rendered as a sequence of random variables taking values in the local state space. The interaction of sites is taken into account by considering that local state spaces share some common values, so that local processes are forced to synchronize on these values. Local processes are otherwise totally asynchronous. Although local trajectories can be seen as sequences of local states, and thus unfolded as infinite paths in a regular tree, the global trajectories are not paths anymore. Each global trajectory is instead correctly represented as a partial order of events, resulting from the gluing of different local trajectories along synchronizing states. This paper presents some results for such models in the case where concurrency is non-trivial but still kept to its simplest expression, namely the case of a system with only two sites. It will be demonstrated that this case already contains interesting mathematical issues and original features that are completely absent from the traditional theory of probabilistic processes.

Approximate Policy Iteration for Generalized Semi-Markov Decision Processes: an Improved Algorithm

To summarize, the improvement of the current policy is performed online: for each visited state starting in s0 we perform one Bellman backup using the value function evaluation from the […]
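A minimal sketch of a single Bellman backup of the kind described above, assuming a finite model with known transition probabilities and an existing value-function estimate V; the data structures and names are illustrative, not those of the paper.

```python
# One Bellman backup at a visited state s: re-evaluate each action against the
# current value-function estimate V and improve the policy greedily at s.
# Illustrative sketch; P, R, V and the discount gamma are assumed inputs.

def bellman_backup(s, actions, P, R, V, gamma=0.95):
    """P[(s, a)]: dict successor -> probability, R[(s, a)]: expected reward."""
    q = {a: R[(s, a)] + gamma * sum(p * V[t] for t, p in P[(s, a)].items())
         for a in actions}
    best_action = max(q, key=q.get)
    V[s] = q[best_action]          # updated value estimate at s
    return best_action             # improved (greedy) action at s
```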

Pilot Allocation and Receive Antenna Selection: A Markov Decision Theoretic Approach

The system model in this work consists of a transmitter with a single antenna and a receiver with N antenna elements. The receiver has a single RF chain, so it needs to decide on the antenna with which it should receive data from the transmitter. The transmitter sends data in frames, with each frame having L pilot or training symbols, followed by a data packet. The receiver then faces the following trade-off. On the one hand, it could allot many of the available L pilots to one particular antenna, getting an accurate estimate of the channel on that antenna; however, this would mean losing track of possibly better channels on other antennas. Alternatively, fewer pilots can be allotted to each of the antennas, tracking all of their channels, but the receiver will then have poorer-quality estimates of the channels on a larger number of antennas, leading to errors in subsequent antenna selection (AS) decisions and to packet loss. As the receiver can vary the accuracy with which to estimate the channels at the antennas, and select the one to be used for packet reception, it can control the (partial) observability of the system. These controls must be applied so as to maximize some notion of long-term reward. Hence, the joint problem of pilot allotment and antenna selection at the receiver in each frame is modeled in this work as a Partially Observable Markov Decision Process (POMDP) [8]–[10] with the objective of maximizing the long-term packet success rate. The contributions of this work are as follows.
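A minimal sketch of the discrete Bayesian belief update that underlies such a POMDP formulation; the channel-state discretization, the observation model, and the example sizes are assumptions for illustration, not the paper's model.

```python
# Generic discrete POMDP belief update: after taking control a (e.g. a pilot
# allocation) and receiving observation o (e.g. a quantized channel estimate),
# the belief over the hidden state (e.g. per-antenna channel quality) is
# updated by Bayes' rule. Array shapes and example sizes are illustrative.
import numpy as np

def belief_update(b, a, o, T, O):
    """b: belief over states; T[a]: |S| x |S| transition matrix;
    O[a]: |S| x |Obs| observation matrix (rows indexed by next state)."""
    predicted = b @ T[a]                    # predict: sum_s b(s) T[a][s, s']
    unnormalized = predicted * O[a][:, o]   # correct: weight by P(o | s', a)
    return unnormalized / unnormalized.sum()

# Example with 3 hidden channel states, 2 controls, 4 observations (made up):
# T = np.random.dirichlet(np.ones(3), size=(2, 3))   # shape (2, 3, 3)
# O = np.random.dirichlet(np.ones(4), size=(2, 3))   # shape (2, 3, 4)
# b = belief_update(np.ones(3) / 3, a=0, o=2, T=T, O=O)
```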

Decentralized Control of Partially Observable Markov Decision Processes Using Belief Space Macro-Actions

Shayegan Omidshafiei, Ali-akbar Agha-mohammadi, Christopher Amato, Jonathan P. How. Abstract: The focus of this paper is on solving multi-robot planning problems in continuous spaces with partial observability. Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) are general models for multi-robot coordination problems, but representing and solving Dec-POMDPs is often intractable for large problems. To allow for a high-level representation that is natural for multi-robot problems and scalable to large discrete and continuous problems, this paper extends the Dec-POMDP model to the Decentralized Partially Observable Semi-Markov Decision Process (Dec-POSMDP). The Dec-POSMDP formulation allows asynchronous decision-making by the robots, which is crucial in multi-robot domains. We also present an algorithm for solving this Dec-POSMDP which is much more scalable than previous methods since it can incorporate closed-loop belief space macro-actions in planning. These macro-actions are automatically constructed to produce robust solutions. The proposed method's performance is evaluated on a complex multi-robot package delivery problem under uncertainty, showing that our approach can naturally represent multi-robot problems and provide high-quality solutions for large-scale problems.

Pathwise uniform value in gambling houses and Partially Observable Markov Decision Processes

In many situations, the decision-maker may not be perfectly informed of the current state variable. For instance, if the state variable represents a resource stock (like the amount of oil in an oil field), the quantity left, which represents the state, can be evaluated, but is not exactly known. This motivates the introduction of the more general model of Partially Observable Markov Decision Process (POMDP). In this model, at each stage, the decision-maker does not observe the current state, but instead receives a signal which is correlated to it. Rosenberg, Solan and Vieille [18] have proved that any POMDP has a uniform value in behavior strategies, when the state space, the action set and the signal set are finite. In the proof, the authors highlight the necessity that the decision-maker resort to behavior strategies, and ask whether the uniform value exists in pure strategies. They also raise the question of the behavior of the time averages of the payoffs, which is linked to the AP criterion. Renault [15] and Renault and Venel [16] have provided two alternative proofs of the existence of the uniform value in behavior strategies in POMDPs, and also ask whether the uniform value exists in pure strategies.
