
[Figure: surveillance sensor types — radar, EO/IR, ADS-B]
Such a system would take as input models of the flight dynamics, intruder behavior, and sensor characteristics, and attempt to optimize the avoidance strategy so that a predefined cost function is minimized. The cost function could take into account competing objectives, such as flight plan adherence and collision avoidance. One way to formulate a problem involving the optimal control of a stochastic system is as a Markov Decision Process (MDP), or more generally as a Partially Observable Markov Decision Process (POMDP) to also account for observation uncertainty. POMDPs have been studied in the operations research and artificial intelligence communities, but only in the past few years have generic POMDP solution methods been developed that can approximately solve problems with moderate to large state spaces in reasonable time. In this work we investigate the feasibility of applying state-of-the-art MDP and POMDP solution methods to the problem of collision avoidance.
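The MDP formulation above can be sketched concretely: given transition and cost models, value iteration finds the avoidance strategy that minimizes expected discounted cost. Everything below (a three-state relative-position model, two actions, and the particular cost numbers trading off collision risk against flight-plan deviation) is invented purely for illustration.

```python
import numpy as np

# Hypothetical toy collision-avoidance MDP: 3 relative-position states
# (clear, close, conflict) and 2 actions (maintain course, avoid).
# T[a][s][s'] and all costs are invented for illustration only.
T = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],  # maintain
    [[0.9, 0.1, 0.0], [0.3, 0.6, 0.1], [0.1, 0.3, 0.6]],  # avoid
])
collision_cost = np.array([0.0, 1.0, 10.0])  # per-state collision risk cost
deviation_cost = np.array([0.0, 0.5])        # per-action flight-plan deviation cost
gamma = 0.95

V = np.zeros(3)
for _ in range(500):  # value iteration to (numerical) convergence
    # Q[a, s] = current-state cost + action cost + discounted expected cost-to-go
    Q = collision_cost + deviation_cost[:, None] + gamma * T @ V
    V = Q.min(axis=0)
policy = Q.argmin(axis=0)  # optimal action in each state
```

The competing objectives appear as two additive cost terms; shifting their relative weights changes where the policy switches from "maintain" to "avoid".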

4. MAIN RESULTS
Below we identify a linear program that allows us to compute the optimal value and an optimal stationary policy for the CMDP. A similar result was already available in [1] but required the strong assumption that s(β, u) is finite for any u. This excludes the shortest path problem, in which policies that include cycles may have infinite cost. In order to handle such situations we note that one may assume throughout that c and d are uniformly bounded below by some positive constant c. The following lemma shows that although we do not assume that all policies have finite occupation measures, those having infinite occupation measures are not optimal. It further shows that one may restrict the search for solutions to the CMDP to stationary policies.
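The occupation-measure linear program alluded to above can be sketched as follows, here for a discounted CMDP variant rather than the shortest-path setting of the text, and with all model numbers invented. The variables are the occupation measures ρ(s, a); the equality constraints are the flow equations, the inequality is the constraint-cost budget, and a stationary optimal policy is read off by normalizing ρ per state.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical discounted CMDP: 2 states, 2 actions; all numbers invented.
nS, nA, gamma = 2, 2, 0.9
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])  # T[s][a][s']
c = np.array([[1.0, 2.0], [4.0, 0.5]])    # main cost c(s, a), to be minimized
d = np.array([[0.0, 1.0], [1.0, 0.0]])    # constraint cost d(s, a)
D = 2.0                                   # budget on the discounted d-cost
mu = np.array([1.0, 0.0])                 # initial state distribution

# Flow constraints on the occupation measure rho(s, a), flattened to nS * nA:
# for every s':  sum_a rho(s', a) - gamma * sum_{s,a} rho(s, a) T[s][a][s'] = mu(s')
A_eq = np.zeros((nS, nS * nA))
for sp in range(nS):
    for s in range(nS):
        for a in range(nA):
            A_eq[sp, s * nA + a] = float(sp == s) - gamma * T[s, a, sp]

res = linprog(c.ravel(), A_ub=[d.ravel()], b_ub=[D],
              A_eq=A_eq, b_eq=mu, bounds=(0, None))
rho = res.x.reshape(nS, nA)
# Stationary policy: pi(a|s) proportional to rho(s, a).
pi = rho / rho.sum(axis=1, keepdims=True)
```

This mirrors the lemma's message: the LP searches only over (finite) occupation measures, and the optimizer it returns induces a stationary policy.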

There are two important related lines of work, focusing either on factoring or on decomposition of large MDPs. Improvements on factored MDP solvers continue to treat the MDP as a single problem, but find more compact [Sanner and McAllester, 2005; Sanner et al., 2010] or approximate [St-Aubin et al., 2000] representations. Recent work on topological and focused topological value iteration [Dai and Goldsmith, 2007; Dai et al., 2009] is similar to ours in that it decomposes a large MDP based on the connectivity of the states. However, TVI and FTVI cannot exploit a factored representation and, in a well-connected domain, they are not guaranteed to find any hierarchy at all. FTVI has been run successfully on some smaller factored problems, but requires knowledge of an initial state to reduce the state space size. Moreover, the size of the value function output by TVI or FTVI is necessarily the size of the optimal value function and, therefore, these cannot be run in domains where the representation of the optimal value function is prohibitively large. By relaxing the requirement that the algorithm output the optimal policy, we can find a good approximate policy even in some large problems where the optimal value function is not compactly representable. Our work is also similar to MAXQ [Dietterich, 1998; Mehta et al., 2008] in that it uses a temporal hierarchy to reduce the size of the state space. However, unlike the original MAXQ algorithm, we are able to construct the hierarchy automatically. Although the work of Mehta et al. proposes a method for automatically creating MAXQ hierarchies, the hierarchy creation is costly and only worthwhile if it transfers to many domains.

When the state space and action sets are finite, Blackwell [6] has proved the existence of a pure strategy that is optimal for every discount factor close to 0, and one can deduce that the strong uniform value exists.
In many situations, the decision-maker may not be perfectly informed of the current state variable. For instance, if the state variable represents a resource stock (like the amount of oil in an oil field), the quantity left, which represents the state, can be evaluated, but is not exactly known. This motivates the introduction of the more general model of Partially Observable Markov Decision Process (POMDP). In this model, at each stage, the decision-maker does not observe the current state, but instead receives a signal which is correlated with it.
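In a POMDP the decision-maker can still maintain a probability distribution over states (a belief) and update it with each signal by Bayes' rule. A minimal sketch, using an invented two-state "resource stock" model (stock high or low) with a noisy measurement correlated with the true state:

```python
import numpy as np

# Invented 2-state model: state 0 = stock high, state 1 = stock low.
T = np.array([[0.95, 0.05],   # T[s][s']: the stock tends to deplete over time
              [0.00, 1.00]])
O = np.array([[0.8, 0.2],     # O[s'][z]: P(signal z | next state s')
              [0.3, 0.7]])    # signal 0 suggests "high", signal 1 "low"

def belief_update(b, z):
    """Bayes filter: predict through T, then reweight by the signal likelihood."""
    predicted = b @ T                   # prior over the next state
    posterior = predicted * O[:, z]     # weight by P(observed signal | state)
    return posterior / posterior.sum()  # renormalize to a distribution

b = np.array([0.5, 0.5])
b = belief_update(b, z=0)  # a "high stock" signal shifts belief toward state 0
```

The belief is a sufficient statistic for the observed history, which is why POMDPs are often recast as MDPs over the belief space.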

1.1.3 Reinforcement Learning Algorithms
In the last part of this thesis (Chapters 10 and 11) we consider approximation algorithms for finite-space POMDPs and MDPs under the reinforcement learning setting. This setting differs from the previous problem setting, often referred to as the planning setting, in that a model of the problem is no longer required, and neither is an exact inference mechanism. The motivation as well as inspiration behind these methods comes from animal learning and artificial intelligence, and is to build an autonomous agent capable of learning good policies through trial and error while it is operating in the environment (thus the name “reinforce”). The reinforcement learning setting does require, however, that the per-stage costs or rewards are physically present in the environment, which is not true for many planning problems. So the reinforcement learning and planning settings are complementary to each other. Via simulations of the model, reinforcement learning algorithms can be applied in the planning setting to obtain approximate solutions. Such methods are especially useful for problems whose models are too large to handle exactly and explicitly.
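The trial-and-error setting can be illustrated with tabular Q-learning, the canonical model-free algorithm: the agent never sees transition probabilities, only sampled transitions and rewards. The chain environment below (four states, reward only at the right end) is invented for illustration.

```python
import random

# Invented chain environment: states 0..3, actions step left (-1) or right (+1),
# reward 1 only on reaching the right end. No model is given to the learner.
N_STATES, ACTIONS = 4, (-1, +1)
alpha, gamma, eps = 0.1, 0.9, 0.2
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

random.seed(0)
for _ in range(2000):                      # episodes of trial and error
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy action selection balances exploration and exploitation
        a = random.choice(ACTIONS) if random.random() < eps \
            else max(ACTIONS, key=lambda b: Q[(s, b)])
        s2 = min(max(s + a, 0), N_STATES - 1)   # environment transition
        r = 1.0 if s2 == N_STATES - 1 else 0.0  # reward observed, not modeled
        # one-sample temporal-difference update toward r + gamma * max_b Q(s2, b)
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS)
                              - Q[(s, a)])
        s = s2
```

As the text notes, the same algorithm can serve in the planning setting by running it against a simulator of the model instead of the physical environment.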


always moves to the right, since MDP 0 does not capture this risk. As a result, the case where the evolution parameter equals 0 reflects a favorable evolution for DP-snapshot and a bad one for RATS. The opposite occurs when the parameter equals 1, where the cautious behavior dominates over the risky one, and the in-between cases mitigate this effect. In Figure 2a, we display the expected return achieved by each algorithm as a function of this parameter, i.e. as a function of the possible evolutions of the NSMDP. As expected, the performance of DP-snapshot strongly depends on this evolution: it achieves a high return at 0 and a low return at 1. Conversely, the performance of RATS varies less across the different parameter values. The effect illustrated here is that RATS maximizes the minimal possible return given any evolution of the NSMDP. It provides the guarantee of achieving the best return in the worst case. This behaviour is highly desirable when one requires robust performance guarantees, for instance in critical certification processes. Figure 2b displays the return distributions of the three algorithms for parameter values in {0, 0.5, 1}. The effect seen here is the tendency of RATS to diminish the left tail of the distribution, corresponding to low returns, for each evolution. This corresponds to the optimized criterion, i.e. robustly maximizing the worst-case value. A common risk measure is the Conditional Value at Risk (CVaR), defined as the expected return in the worst q% of cases. We report the CVaR at 5% achieved by each algorithm in Table 1b. Notice that RATS always maximizes the CVaR compared to both DP-snapshot and DP-NSMDP. Indeed, even though the latter uses the true model, the criterion optimized in DP is the expected return.
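The CVaR definition used above (expected return over the worst q% of cases) has a direct empirical estimator on a sample of episode returns. The return values below are invented for illustration.

```python
import numpy as np

def cvar(returns, q=0.05):
    """Empirical CVaR at level q: mean return over the worst q% of episodes."""
    returns = np.sort(np.asarray(returns, dtype=float))  # worst returns first
    k = max(1, int(np.ceil(q * len(returns))))           # number of worst cases
    return returns[:k].mean()

# Invented episode returns; one catastrophic outcome in the left tail.
returns = [10, 12, -5, 11, 9, 13, -20, 10, 12, 11]
worst_case_mean = cvar(returns, q=0.10)  # worst 10% of 10 episodes = 1 episode
```

An algorithm that thins the left tail of the return distribution, as RATS does, raises exactly this quantity even when its mean return is unchanged.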

Figure 1: Comparison of ASOP to OP-MDP, UCT, and FSSS on the inverted pendulum benchmark problem, showing the sum of discounted rewards for simulations of 50 time steps.
The algorithms are compared for several budgets. In the cases of ASOP, UCT, and FSSS, the budget is in terms of calls to the simulator. OP-MDP does not use a simulator. Instead, every possible successor state is incorporated into the planning tree, together with its precise probability mass, and each of these states is counted against the budget. As the benchmark problem is stochastic, and internal randomization (for the simulator) is used in all algorithms except OP-MDP, the performance is averaged over 50 repetitions. The algorithm parameters have been selected manually to achieve good performance. For ASOP, we show results for forest sizes of two and three. For UCT, the Chernoff-Hoeffding term multiplier is set to 0.2 (the results are not very sensitive to this value, therefore only one result is shown). For FSSS, we use one to three samples per state-action pair. For both UCT and FSSS, a rollout depth of seven is used. OP-MDP does not have any parameters. The results are shown in Figure 1. We observe that on this problem, ASOP performs much better than OP-MDP for every value of the budget, and also performs well in comparison to the other sampling-based methods, UCT and FSSS.

Fig. 2. Bounded lexicographic value iteration vs. unbounded lexicographic value iteration
and the possibility degree of the other one is uniformly drawn in L. For each experiment, 100 P-MDPs are generated. The two algorithms are compared w.r.t. two measures: (i) CPU time and (ii) pairwise success rate (Success), the percentage of optimal solutions provided by bounded value iteration with fixed (l, c) w.r.t. the lmax(lmin) criterion in its full generality. The higher Success, the more important the effectiveness of cutting matrices with BL-VI; the lower this rate, the more important the drowning effect.

Distributed systems often present symmetries, i.e., in our framework, many components may have similar behavior. Thus, both from a modeling and an analysis point of view, it is interesting to look for a formalism expressing and exploiting behavioral symmetries. So we also define Markov Decision Well-formed Nets (MDWN) similarly to MDPNs. The semantics of a model is then easily obtained by translating an MDWN into an MDPN. Furthermore, we develop an alternative approach: we transform the MDWN into a WN, then we build the symbolic reachability graph of this net [5], and finally we transform this graph into an MDP reduced w.r.t. the original one. We argue that we can compute on this reduced MDP the results that we are looking for in the original MDP. The different relations between the formalisms are shown in the figure depicted below. Finally, we have implemented our analysis method within the GreatSPN tool [6] and performed some experiments.

1. INTRODUCTION
It is now well recognized that human activities have a large impact on climate change, mainly due to the emission of greenhouse gases (GHG). The last IPCC report [1] indicates that 11% of anthropogenic GHG emissions came from transport between 2000 and 2010, and it recommends technical and behavioral mitigation measures in the transport sector. One solution would be a shift of truck traffic to the inland waterway network, which would provide both economic and environmental benefits [2] [3]. These mitigation measures are also advocated by the recent historic COP21 agreement in Paris, which aims at limiting the temperature increase to 1.5°C by 2100. Focusing on inland navigation, an increase in traffic is thus expected [4], with an estimated growth of 35% [5], together with an increase in the frequency and intensity of flood and drought periods in the near future. Management of inland waterways must deal with this new challenge.

To summarize, the improvement of the current policy is performed online: for each visited state, starting in s0, we perform one Bellman backup using the value function evaluation from the […]
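The online scheme just described, one Bellman backup per visited state along a trajectory from s0, can be sketched as below. The model arrays, the random trajectory, and all dimensions are invented for illustration; only states actually visited get their value updated.

```python
import numpy as np

def bellman_backup(s, V, T, R, gamma=0.95):
    """Greedy action and backed-up value at state s under the current estimate V."""
    q = R[s] + gamma * T[:, s, :] @ V  # Q(s, a) for every action a
    a = int(q.argmax())
    return a, float(q[a])

# Invented model: 5 states, 2 actions, random transitions and rewards.
nS, nA = 5, 2
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(nS), size=(nA, nS))  # T[a][s][s'], rows sum to 1
R = rng.random((nS, nA))                       # R[s][a]
V = np.zeros(nS)

s = 0                                          # s0
for _ in range(20):                            # online: back up visited states only
    a, V[s] = bellman_backup(s, V, T, R)       # one Bellman backup at s
    s = int(rng.choice(nS, p=T[a, s]))         # follow the greedy action
```

In contrast to full value iteration, which sweeps the whole state space per iteration, this concentrates computation on the trajectory actually experienced.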

Decentralized Control of Partially Observable Markov Decision Processes using Belief Space Macro-actions
Shayegan Omidshafiei, Ali-akbar Agha-mohammadi, Christopher Amato, Jonathan P. How
Abstract— The focus of this paper is on solving multi-robot planning problems in continuous spaces with partial observability. Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) are general models for multi-robot coordination problems, but representing and solving Dec-POMDPs is often intractable for large problems. To allow for a high-level representation that is natural for multi-robot problems and scalable to large discrete and continuous problems, this paper extends the Dec-POMDP model to the Decentralized Partially Observable Semi-Markov Decision Process (Dec-POSMDP). The Dec-POSMDP formulation allows asynchronous decision-making by the robots, which is crucial in multi-robot domains. We also present an algorithm for solving this Dec-POSMDP which is much more scalable than previous methods, since it can incorporate closed-loop belief space macro-actions in planning. These macro-actions are automatically constructed to produce robust solutions. The proposed method’s performance is evaluated on a complex multi-robot package delivery problem under uncertainty, showing that our approach can naturally represent multi-robot problems and provide high-quality solutions for large-scale problems.
