Decentralized control of multi-robot systems using partially observable Markov Decision Processes and belief space macro-actions

Decentralized Control of Multi-Robot Systems

using Partially Observable Markov Decision

Processes and Belief Space Macro-actions

by

Shayegan Omidshafiei

B.A.Sc. in Engineering Science (Major in Aerospace Engineering)

University of Toronto (2012)

Submitted to the Department of Aeronautics and Astronautics

in partial fulfillment of the requirements for the degree of

Master of Science in Aeronautics and Astronautics

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2015

© Massachusetts Institute of Technology 2015. All rights reserved.

Author . . . .

Department of Aeronautics and Astronautics

August 20, 2015

Certified by . . . .

Jonathan P. How

Richard C. Maclaurin Professor of Aeronautics and Astronautics

Thesis Supervisor

Accepted by . . . .

Paulo C. Lozano

Associate Professor of Aeronautics and Astronautics

Chair, Graduate Program Committee

Decentralized Control of Multi-Robot Systems using

Partially Observable Markov Decision Processes and Belief

Space Macro-actions

by

Shayegan Omidshafiei

Submitted to the Department of Aeronautics and Astronautics on August 20, 2015, in partial fulfillment of the

requirements for the degree of

Master of Science in Aeronautics and Astronautics

Abstract

Planning, control, perception, and learning for multi-robot systems present significant challenges. Transition dynamics of the robots may be stochastic, making it difficult to select the best action each robot should take at a given time. The observation model, a function of the robots’ sensors, may be noisy or partial, meaning that deterministic knowledge of the team’s state is often impossible to attain. Robots designed for real-world applications require careful consideration of such sources of uncertainty.

This thesis contributes a framework for multi-robot planning in continuous spaces with partial observability. Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) are general models for multi-robot coordination problems. However, representing and solving Dec-POMDPs is often intractable for large problems. This thesis extends the Dec-POMDP framework to the Decentralized Partially Observable Semi-Markov Decision Process (Dec-POSMDP), taking advantage of high-level representations that are natural for multi-robot problems. Dec-POSMDPs allow asynchronous decision-making, which is crucial in multi-robot domains. This thesis also presents algorithms for solving Dec-POSMDPs, which are more scalable than previous methods due to the use of closed-loop macro-actions in planning. The proposed framework’s performance is evaluated in a constrained multi-robot package delivery domain, showing its ability to provide high-quality solutions for large problems.

Due to the probabilistic nature of state transitions and observations, robots operate in belief space, the space of probability distributions over all of their possible states. This thesis also contributes a hardware platform called Measurable Augmented Reality for Prototyping Cyber-Physical Systems (MAR-CPS). MAR-CPS allows real-time visualization of the belief space in laboratory settings.

Thesis Supervisor: Jonathan P. How

Acknowledgments

This thesis would not have been possible without invaluable contributions from numerous people. First, I would like to thank my advisor Professor Jonathan How, who has been my mentor and guide to the world of academic research. Jon, it is a privilege to work with you. Thank you for taking the time so often to provide valuable feedback and insight into my research. It truly means a lot and I look forward to many more years of discussions and collaboration with you.

I owe a great deal of gratitude to my frequent co-authors and friends, Ali and Chris. You have each taught me so much, and have been extremely generous with both your time and expertise. As I take the next step in my academic journey, I look forward to the many more interesting research problems we aim to solve together.

I am lucky to have such a strong and collaborative team of peers at Aerospace Controls Lab, all of whom have been good friends to me over the last two years. Specifically, I would like to thank Mark for his patience and for taking the time to help me and others when we’ve faced implementation or other technical issues in the lab. Shih-Yuan, Steven, Kemal, Brett, Luke, Justin, Jack, and Ali, thank you for your hard work during the various hardware/lab demos we’ve done together over the years – though they may have been stressful at the time, looking back they have formed some of my fondest memories of ACL. Luke, Trevor, Beipeng, it was awesome working with you on the IAP challenge. Your feedback is invaluable and office conversations highly enjoyable. Special thanks to Aaron and Duncan for being great friends and study-mates through all my semesters at MIT, I am excited to hear about your inevitable successes in industry over the coming years.

A special thanks to Shih-Yuan and Brett for helping print and submit this thesis while I was off-campus. No one would be reading this had it not been for you guys!

I would like to thank my parents, Saied and Sara, for encouraging my passion for knowledge since I was young, even going through an immigration to provide me and my sister with a better life. On that note, a special thanks to my sister, Goli, for the cheerful banter and highly interesting conversations she’s provided me over the years. You are both hilarious and intelligent, and I am excited for all the great things you have coming in life.

Finally, I would like to give a very special thanks to my wonderful girlfriend Leanne, for her love and support and constant encouragement. Thank you for always being there for me, for being someone I look forward to seeing each day after work, and for being the one who made me feel at home in Boston.

This work was funded by ONR MURI Grant N000141110688, The Boeing Company, and a Natural Sciences and Engineering Research Council of Canada PGS-M grant.

All life is problem solving.

List of Figures

3-1 Hierarchy of the proposed planner.

3-2 Decision epochs t_k.

3-3 Dec-POSMDP sequential decision-making overview.

4-1 Example FSA for a single robot.

4-2 Finite-state automata for two robots.

5-1 Components of Measurable Augmented Reality for Prototyping Cyber-Physical Systems.

5-2 Architecture overview for MAR-CPS.

5-3 Hardware overview for MAR-CPS.

5-4 Usage of physical props in MAR-CPS.

5-5 Multi-projection system coordinate transform calibration.

5-6 Closed-loop vision feedback in MAR-CPS.

5-7 Vehicle health monitoring messages in MAR-CPS.

5-8 Human-robot interaction in MAR-CPS.

5-9 Quadrotor fire fighting demonstration in MAR-CPS [1].

5-10 Real-time vision-based planning under uncertainty in MAR-CPS.

5-11 Noisy hue-saturation-value image dataset captured in MAR-CPS for time closed-loop perception planning domain.

5-12 Multi-robot intruder monitoring mission in MAR-CPS.

5-13 Visualization of trajectory-planning in multi-robot systems.

5-14 Demonstration of motion planning under uncertainty in MAR-CPS [2].

5-15 Multi-robot path planning implemented in MAR-CPS.

5-16 Human-robot interactivity using motion capture props in MAR-CPS.

5-17 Visualization of communication networks in multi-robot systems.

5-18 Task allocation using the Hybrid Information and Plan Consensus algorithm visualized in MAR-CPS [3].

6-1 Constrained multi-robot package delivery domain.

6-2 TMA policies demonstrated in lab space.

6-3 A graph of LMAs over 30 nodes (red). Connecting edges are indicated in blue, and obstacles in brown.

6-4 TMA policy for varying goal nodes, starting at any given node.

6-5 Partial view of a single robot’s policy for the package delivery domain.

6-6 Policy differences between MC and MMCS. Moving averages of policy values (over 100 samples) are also indicated.

6-7 G-DICE maximum policy value vs. learning rate α.

6-8 G-DICE maximum policy value vs. controller size N_n.

6-9 Success probability of delivering the specified number of packages or more within a fixed time horizon.

6-10 Overview of constrained package delivery hardware domain.

6-11 Visualization of virtual package deliveries in MAR-CPS.

6-12 Health belief visualization in MAR-CPS using a “belief ring”.

6-13 Visualization of quadrotor package pickup in MAR-CPS.

List of Tables

4.1 CheckDominated linear program for FSA parameter generation.

6.1 Comparison of the maximum values obtained by the search algorithms.

6.2 Health observation model, denoting probabilities of receiving a certain

List of Algorithms

1 TMA Construction (Offline)

2 Policy iteration

3 Pruning

4 Exhaustive Backup

5 MMCS

Chapter 1 Introduction

1.1 Overview

Given the low cost of hardware, it is becoming more cost-effective to deploy multiple robots to complete a single task or a set of tasks. However, control and coordination of robot systems in real-world settings is difficult. For instance, real-world multi-robot coordination problems are in continuous spaces and the robots often possess only partial and noisy sensory measurements. The transition dynamics of the robots may be unknown or stochastic, making it difficult to select the best action each robot should take at a given time. The observation model, a function of the robots’ sensor systems, may be noisy or partial, meaning that deterministic knowledge of the team’s state is often impossible to attain. Moreover, the actions each robot can take may have an associated success rate and a probabilistic completion time. Regardless of the control scheme, planning, or learning algorithms used for specific problems, robots designed for real-world applications require careful consideration of such sources of uncertainty. In addition, asynchronous decision-making is often needed due to stochastic action effects and the lack of perfect communication. Preferably, high-quality controllers for each robot would be automatically generated based on a high-level domain specification while considering uncertainty in the domain. Methods following this principle (such as those based on Markov decision processes [4] and partially observable Markov decision processes [5]) have proven to be effective in single-robot domains, but pose significant scalability challenges when applied to multi-robot applications [6].

A general representation of the multi-robot coordination problem is the Decentralized Partially Observable Markov Decision Process (Dec-POMDP) [6]. Dec-POMDPs have a broad set of applications including networking problems, multi-robot exploration, and surveillance [7–11]. Unfortunately, current Dec-POMDP solution methods are limited to small discrete domains and require synchronized decision-making.

This thesis extends recent work on incorporating macro-actions (temporally extended actions) [12–14] to solve continuous, large-scale problems that were infeasible for previous methods. A framework for formally representing asynchronous multi-robot coordination problems is developed, and a method for automatic generation of local planners within the framework is presented. While these local planners can be a set of hand-coded macro-actions, this thesis also presents an algorithm for automatically generating macro-actions that can then be sequenced to solve the multi-agent planning under uncertainty problem. The result is a principled method for solving large-scale coordination problems in probabilistic, multi-agent domains.

Macro-actions (MAs) have provided increased scalability in single-agent MDPs [15] and POMDPs [16–18], but are nontrivial to extend to multi-robot settings. Some of the challenges in extending MAs to decentralized settings are:

• In the decentralized setting, synchronized decision-making with variable-time macro-actions is either impossible or problematic, as some robots must remain idle while others finish their actions. The resulting solution would be difficult to implement and would yield poor-quality results. Macro-actions that can be chosen asynchronously by the robots are proposed herein to overcome these limitations (an issue that has not been considered in the single-robot literature).

• Incorporating principled asynchronous MA selection is a challenge, because it is not clear how to choose optimal MAs for one robot while other robots are still executing. Hence, a novel formal framework is needed to represent Dec-POMDPs with asynchronous decision-making and sets of MAs that may have differing completion times.

• Designing these variable-time MAs also requires characterizing the stopping time and probability of terminating at every goal state of the MAs. Novel methods are needed that can provide this characterization.

MA-based Dec-POMDPs address these challenges by no longer attempting to solve for a policy at the primitive action level, but instead considering temporally-extended (macro) actions. The use of MAs also addresses scalability issues, as the size of the action space is considerably reduced.

In partially-observable planning under uncertainty domains, whether MAs are used or not, robots do not have a deterministic notion of their state. Thus, solution methods are based in belief space, the space of probability distributions over all of the robots’ possible states. Until now, the belief space has been an intangible concept in hardware laboratory demonstrations of probabilistic planners such as POMDPs and Dec-POMDPs. However, the ability to conduct hardware experiments in an augmented laboratory supporting belief-space visualization would be extremely useful. In addition to an MA-based planning under uncertainty framework, this thesis introduces Measurable Augmented Reality for Prototyping Cyber-Physical Systems (MAR-CPS), a robotics prototyping platform which directly addresses the belief space visualization problem. MAR-CPS allows real-time visualization of belief information to aid hardware prototyping and performance testing of solution algorithms for probabilistic domains.

1.2 Literature Review

Decision-Making in Related Fields The focus of this thesis is on the multi-agent asynchronous planning under uncertainty problem. Although a decision-theoretic planning approach is used in this thesis, prior works have also considered this problem from a control theoretic, task allocation, and game theoretic perspective.

developed a principled mathematical framework targeted at human team organization problems. This work on the theory of teams was expanded in [20], which considered existence and uniqueness of optimal team solutions. This work was expanded in the scope of control theory, where a controller is considered a continuous decision-maker which minimizes some objective or loss function (e.g., time cost or distance from a nominal trajectory). Specifically, it has been shown that use of a linear quadratic loss function for teams operating under Gaussian observation noise guarantees an optimal, linear control law [21], and extensions have been made to systems with non-Gaussian noise [22].

Decision-making in a team setting can be performed in a centralized or a decentralized manner. Centralized planners have been utilized in a variety of multi-agent domains [23–27], but rely on full sharing of all information through an assumed low-cost, low-latency, and high-reliability network. These frameworks, however, are difficult to make robust against communication infrastructure failures. Distributed planning algorithms [28–31], where local communication is used for consensus on agent policies, can be considered a middle ground between centralized and decentralized decision-making and do not assume perfect communication with a central planning agent. However, if perfect communication between agents is assumed and consensus on joint agent state is desired, such algorithms can suffer from long convergence times which can prevent them from being implemented in real-time planning scenarios. In such scenarios, auction-based algorithms [32] are promising and more robust to communication failures, but result in sub-optimal solutions. Extensions of auctioning algorithms allow asynchronous task allocation [33], which is better suited for real-time decision-making and removes reliance on the artificial delays needed to synchronize decision-making in teams.

Markovian Decision-Theoretic Frameworks Single-agent decision-making under uncertainty using a Markovian framework was explored in [34], which offered one of the first thorough characterizations of MDPs. These processes assume the dynamics of the system as well as its state are known. For infinite-horizon problems where agent rewards are discounted, existence of an optimal policy is guaranteed. Solution methods relying on dynamic programming [35] can be used to solve MDPs. The POMDP is a framework allowing single-agent decision-making in domains where agents receive partial observations of their states [36]. The optimal value function for finite-horizon POMDPs was shown to be piecewise-linear and convex in [37], which also provided an algorithm for finding the optimal policy. The discounted infinite-horizon POMDP was later shown to be solvable using policy iteration in [38]. Various exact and approximate POMDP algorithms are outlined in [39]. More recent algorithms can be found in [16, 40–47]. The state-of-the-art POMDP solvers are the point-based PBVI [40], SARSOP [46], HSVI2 [45], and GapMin [47], which rely on probabilistic sampling to solve high-dimensional POMDPs (with 10,000 to 100,000 states). The belief-space FIRM framework [16] is highly promising for tractable POMDP solving due to its ability to funnel a robot’s belief to a set of reachable nodes in belief space, averting the curse of history.

Multi-agent Markov Decision Processes (MMDPs) were introduced in [48], and rely on a centralized planner for a team of agents. As discussed earlier, centralization is a strong assumption for multi-agent teams operating in real-world domains. Decentralized POMDPs (Dec-POMDPs) were introduced in [6], where they were shown to be NEXP-complete. Dec-POMDPs were shown to have a close relationship to game theoretic decision-making frameworks, and can be represented as games with imperfect information [49]. Though game theory approaches have typically focused on analysis of equilibrium conditions for multi-agent domains, work on extending game theory for control of multi-agent teams has also been conducted [50].

Dec-POMDP Solution Methods Due to the decentralized nature of Dec-POMDPs, POMDP algorithms cannot be easily extended to solve them. In addition, finding the optimal solution of a finite-horizon Dec-POMDP is NEXP-complete, and the infinite-horizon problem is undecidable [6]. Numerous algorithms have emerged for obtaining exact solutions for finite-horizon Dec-POMDPs or approximate solutions for the infinite-horizon variant [51–65]. However, due to the complexity of Dec-POMDPs and high memory requirements for very large domains [6, 54, 62], solution methods are limited to small problems. For short, finite-horizon problems, point-based methods [40, 41, 44, 45, 66] perform well, similar to their POMDP counterparts. Policy representation as finite-state controllers (either Moore or Mealy machines) is especially promising for the infinite-horizon case due to their periodic nature [67–69]. In [68], policy search is formulated as a non-linear program, which can then be solved approximately for a locally optimal policy. Similarly, the Dec-POMDP can be cast and solved as a mixed integer linear program [50]. Recent research efforts have also focused on addressing the scalability problem of Dec-POMDPs by compressing policies and agent observation histories [59, 61, 63].

Dec-POMDPs remain computationally difficult to solve as they suffer from the curse of dimensionality and the curse of history. Recent effort has been placed on recasting Dec-POMDPs into higher-level planning frameworks which use macro-actions (rather than primitive actions). Macro-actions, or options, are temporally-extended actions which have been successful in aiding representation and solutions in single-robot MDP and POMDP domains [15, 18, 70]. Recent efforts have focused on integrating macro-actions into Dec-POMDPs [12], although design and selection of macro-actions relies on a human domain expert. The principled integration of automatically-generated macro-actions into the belief-space Dec-POMDP framework has remained a research gap, and is one of the core contributions of this thesis.

Belief Space Visualization in Hardware Experiments Due to the belief-space nature of Dec-POMDPs, conducting related hardware experiments in an augmented laboratory supporting belief-space visualization would be extremely useful. Various prototyping environments for CPS have been developed in the past [71–77], but the addition of augmented visualization capabilities to indoor platforms is of ongoing interest. For example, the display of dynamically changing events using projectors for hardware experiments involving quadrotors has been investigated [78]. Specifically, reward and damage information for quadrotors involved in an aerial surveillance mission was displayed in real time, although simulation of complex mission scenarios, measurement of the augmented environment using onboard sensors, and display of state transition and observation probability distributions were not demonstrated. Also, [76] utilized small blocks as surrogates of building components, enabling the construction of small-scale structures.

An investigation of augmented reality for multi-robot mission scenarios has also been conducted [79], including applications in pedestrian perception and tracking of swarm robotics. However, their applications are limited to display of this information in software only, and integration of the data into a physical laboratory space has not been conducted yet.

Onboard projection systems have also been investigated for human-robot interaction situations [80], with applications in robot training demonstrated, though the projection footprints are limited. Due to increasing affordability of virtual reality headsets, such as the Oculus VR [81] and SteamVR [82], their usage in a CPS-prototyping setting may be possible. Usage of virtual reality head-mounted displays to superimpose mission data over a live camera feed has previously been investigated [83], with applications to intruder monitoring in swarm robotics. However, for demonstrations involving many spectators, virtual reality headsets are infeasible due to the large amount of hardware and supporting infrastructure required. Additionally, information displayed in a virtual reality headset is not measurable using onboard sensors on a vehicle, whereas projected images are physically present in a lab and can be directly measured.

MAR-CPS leverages the emergence of motion capture technology as well as multi-projection systems to change the state-of-the-art for CPS prototyping. MAR-CPS not only allows display of belief space information, but also enables hardware-level interaction of vehicles’ sensor systems with augmented, customizable mission domains of arbitrary complexity with little hardware overhead. Previous work on virtual and augmented reality interfaces is additionally extended by demonstrating that measurement of projected environments using noisy hardware sensors is a useful validation tool in situations where outdoor testing is infeasible.

1.3 Thesis Contributions and Structure

In this thesis, the Dec-POMDP is extended to the Decentralized Partially Observable Semi-Markov Decision Process (Dec-POSMDP), which formalizes the use of closed-loop MAs and allows asynchronous decision-making. The Dec-POSMDP represents the theoretical basis for asynchronous decision-making in Dec-POMDPs. Automatic MA-generation using a graph-based planning technique is also presented. The resulting MAs are closed-loop and the completion time and success probability can be characterized analytically, allowing them to be directly integrated into the Dec-POSMDP framework. This framework can, therefore, generate efficient decentralized plans which take advantage of estimated completion times to permit asynchronous decision-making.

The proposed Dec-POSMDP framework enables solutions for large domains (in terms of state/action/observation space) with long horizons, which are otherwise computationally intractable to solve. The Dec-POSMDP framework is leveraged and a greedy Monte Carlo-based discrete search algorithm and an entropy-based probability distribution optimization algorithm are introduced for solving it. The performance of the methods is demonstrated for a constrained variant of the multi-robot package delivery under uncertainty problem (Fig. 6-1).

The second major contribution of the thesis is MAR-CPS, an indoor, experimental architecture developed to enable visualization of the belief space for planning under uncertainty domains. MAR-CPS also allows controlled testing of planning and learning algorithms in an indoor setting which closely emulates outdoor conditions. The presented architecture leverages motion capture technology with edge-blended multi-projection displays to significantly improve state-of-the-art indoor testing facilities by augmenting them with interactive, dynamic, partially unknown simulated environments. Several demonstrations of this capability are presented in MAR-CPS, with a focus on planning, perception, and learning algorithms for autonomous single-robot and multi-robot systems which actively sense and interact with the augmented laboratory space.

The structure of this thesis is as follows:

• Chapter 2 presents preliminaries for solving planning under uncertainty problems in a Markovian setting. The Markov Decision Process is presented, as well as extensions which allow partial observability and decentralization.

• Chapter 3 motivates the need for macro-actions in decentralized decision-making, and introduces them in a principled manner. Automatic generation of macro-actions is described, and several derivations and definitions are presented to allow for utilization of macro-actions in asynchronous decision-making. Finally, the Dec-POSMDP framework is introduced.

• Chapter 4 focuses on solution methods for Dec-POSMDPs. First, policy representation in the form of finite state automata is introduced, and an algorithm for solving for policies through evaluation and pruning is presented. Next, the heuristics-based Masked Monte Carlo Search (MMCS) algorithm is presented for fast approximate solving of the Dec-POSMDP. Finally, the Graph-based Direct Cross-Entropy algorithm (G-DICE), a method with probabilistic optimality guarantees, is presented.

• Chapter 5 considers the problem of visualizing the belief space while conducting hardware experiments, and presents a novel hardware prototyping architecture called Measurable Augmented Reality for Prototyping Cyber-Physical Systems (MAR-CPS). Technical details regarding the system are presented, as well as hardware experiments for several learning and planning domains.

• Chapter 6 presents both simulated and hardware experiment results for multi-robot planning using Dec-POSMDPs. The constrained package delivery domain is introduced, where a heterogeneous team of robots is tasked with delivering packages from bases to delivery destinations. In the simulated experiments, macro-actions allowing joint robot actions (e.g., two robots picking up a single package) are also considered. In the hardware experiments, the robots consider a belief over their health state (e.g., fuel or damage), and a macro-action allowing robot repair is included. The experiments include a comparison of policy quality resulting from the algorithms introduced in Chapter 4.

• Chapter 7 concludes the thesis and presents a summary of future work.

The content of this thesis is based on work that has been published in the following conferences:

• “Decentralized control of Partially Observable Markov Decision Processes using Belief Space Macro-actions”, Shayegan Omidshafiei, Ali-akbar Agha-mohammadi, Christopher Amato, Jonathan P. How, in the 2015 IEEE International Conference on Robotics and Automation (ICRA), 2015 [84]

• “MAR-CPS: Measurable Augmented Reality for Prototyping Cyber-Physical Systems”, Shayegan Omidshafiei, Ali-akbar Agha-mohammadi, Yu Fan Chen, N. Kemal Ure, Rajeev Surati, John Vian, Jonathan P. How, in AIAA Infotech @ Aerospace, 2015 [85]

Chapter 2 Background

2.1 Overview

This chapter provides a review of Markovian planning frameworks and belief space planning. Starting with the single-robot fully-observable case, the problem is incrementally generalized to the multi-robot decentralized partially-observable domain.

2.2 Markov Decision Processes

In the sequential planning under uncertainty problem, a robot takes a series of actions (with stochastic outcome) in order to achieve a set of tasks. Taking an action in a given state results in a reward (or cost) for the robot. The robot’s goal is then to maximize the cumulative reward obtained over a series of actions. This problem can be formulated as a Markov Decision Process (MDP), which is parameterized as follows:

• $S$ is a finite set of states for the robot. The state $s \in S$ is observable.

• $U$ is a finite set of actions the robot can take.

• $T : S \times U \to \Pi(S)$ is the state transition function, where $\Pi(S)$ is the space of probability distributions over states. If the robot is in state $s \in S$ and takes action $u \in U$, the probability of it transitioning to new state $s' \in S$ is $P(s' \mid s, u) = T(s, u, s')$. Thus, $T$ is a function which produces a probability distribution over all next-timestep states, allowing capture of model uncertainty.

• $R : S \times U \to \mathbb{R}$ is a reward function, where if a robot in state $s \in S$ takes action $u \in U$, it receives reward $R(s, u)$. For instance, for a path planning problem, the MDP can be defined such that the robot receives a positive reward at the goal state and a small negative reward (a cost) at all other states.

• $\gamma \in [0, 1]$ is a discount factor on rewards, used to prioritize actions which yield earlier rewards.

2.2.1 Optimal Policy

If the robot visits a sequence of states $s_0, s_1, s_2, \cdots$ and takes actions $u_0, u_1, u_2, \cdots$, its payoff is defined as

$$R(s_0, u_0) + \gamma R(s_1, u_1) + \gamma^2 R(s_2, u_2) + \cdots \tag{2.1}$$
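As a small illustration of Eq. (2.1), the payoff of a finite prefix of a trajectory can be computed directly. This is a minimal sketch; the function name `discounted_payoff` and the example reward sequence are assumptions made for illustration and do not appear in the thesis.

```python
def discounted_payoff(rewards, gamma):
    """Sum of gamma**t * rewards[t] over a finite reward sequence (Eq. 2.1, truncated)."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# Hypothetical rewards R(s_0, u_0), R(s_1, u_1), R(s_2, u_2) along one trajectory.
payoff = discounted_payoff([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```

With a smaller discount factor, later rewards contribute less to the payoff, which is how $\gamma$ prioritizes earlier rewards.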

The robot’s policy $\pi : S \to U$ dictates the action it should take given its current state. Since state transitions are stochastic, the value function given a fixed robot policy $\pi$ is

$$V^\pi(s) = E_{s_i}\left[R(s_0, u_0) + \gamma R(s_1, u_1) + \gamma^2 R(s_2, u_2) + \cdots \mid s_0 = s, u_0 = \pi(s_0), \pi\right] \tag{2.2}$$

$$= R(s, \pi(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s)) V^\pi(s') \tag{2.3}$$

$$= R(s, \pi(s)) + \gamma E_{s' \sim P(s' \mid s, \pi(s))}\left[V^\pi(s')\right] \tag{2.4}$$

Note that in Eq. (2.4), the expectation is taken over the next state $s' \sim P(s' \mid s, \pi(s))$. The expectation’s subscript is omitted henceforth for readability.
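The recursion in Eq. (2.3) can be applied repeatedly to evaluate a fixed policy. The following is a minimal sketch of such iterative policy evaluation for a discounted problem ($\gamma < 1$). The nested-list layout `T[s][u][s2]` and `R[s][u]`, the function name, and the two-state example MDP are all assumptions made for illustration, not part of the thesis.

```python
def evaluate_policy(T, R, policy, gamma, tol=1e-10, max_iters=10_000):
    """Repeatedly apply V(s) <- R[s][pi(s)] + gamma * sum_s2 T[s][pi(s)][s2] * V(s2)."""
    n_states = len(T)
    V = [0.0] * n_states
    for _ in range(max_iters):
        V_new = []
        for s in range(n_states):
            u = policy[s]
            expected_next = sum(T[s][u][s2] * V[s2] for s2 in range(n_states))
            V_new.append(R[s][u] + gamma * expected_next)
        # Stop once successive value estimates agree to within tol.
        if max(abs(a - b) for a, b in zip(V, V_new)) < tol:
            return V_new
        V = V_new
    return V


# Hypothetical two-state MDP. State 1 is an absorbing goal with reward 1.
# Action 0 = "stay"; action 1 = "move" (reaches the goal from state 0 w.p. 0.8).
T = [
    [[1.0, 0.0], [0.2, 0.8]],  # transitions from state 0 under actions 0 and 1
    [[0.0, 1.0], [0.0, 1.0]],  # transitions from state 1 under actions 0 and 1
]
R = [
    [0.0, -0.1],  # small cost for moving in state 0
    [1.0, 1.0],   # reward at the goal
]
V_move = evaluate_policy(T, R, policy=[1, 1], gamma=0.9)
```

For this example the fixed point can be checked by hand: $V^\pi(1) = 1 / (1 - 0.9) = 10$, and $V^\pi(0)$ satisfies $V^\pi(0) = -0.1 + 0.9\,(0.2\,V^\pi(0) + 0.8 \cdot 10)$.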

Solving the MDP results in the optimal policy which maximizes the value function,

$$\pi^*(s) = \operatorname*{argmax}_{\pi} V^\pi(s). \tag{2.5}$$

The optimal policy induces the best sequence of actions the robot can take to obtain the highest payoff.
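One standard dynamic-programming way to compute such a policy for a small discounted MDP is value iteration. The sketch below uses that technique as an illustration only; it is not a method specified in this section, and the data layout, function name, and example MDP are assumptions.

```python
def value_iteration(T, R, gamma, tol=1e-10, max_iters=10_000):
    """Value iteration for a finite MDP with T[s][u][s2] and R[s][u] (assumed layout)."""
    n_states, n_actions = len(T), len(T[0])
    V = [0.0] * n_states
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(max_iters):
        # Q(s, u) = R(s, u) + gamma * E[V(s')] under the current value estimate.
        Q = [
            [
                R[s][u] + gamma * sum(T[s][u][s2] * V[s2] for s2 in range(n_states))
                for u in range(n_actions)
            ]
            for s in range(n_states)
        ]
        V_new = [max(q_s) for q_s in Q]
        converged = max(abs(a - b) for a, b in zip(V, V_new)) < tol
        V = V_new
        if converged:
            break
    # Greedy policy with respect to the final Q estimate (ties go to the lowest index).
    policy = [max(range(n_actions), key=lambda u: Q[s][u]) for s in range(n_states)]
    return V, policy


# Hypothetical two-state MDP: action 1 in state 0 reaches the goal (state 1) w.p. 0.8.
T = [
    [[1.0, 0.0], [0.2, 0.8]],
    [[0.0, 1.0], [0.0, 1.0]],
]
R = [
    [0.0, -0.1],
    [1.0, 1.0],
]
V_opt, pi_opt = value_iteration(T, R, gamma=0.9)
```

In this example the greedy policy chooses the "move" action in state 0, since paying the small movement cost leads to the rewarding goal state.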

2.3 Partially Observable Markov Decision Processes

If the robot state is not explicitly known, but is rather observed through a noisy sensor, the problem can be formulated as a Partially Observable Markov Decision Process (POMDP) [5], which is parameterized by the following:

• S is a finite set of states for the robot. The state is not directly observable, and the robot typically has uncertainty regarding which state it is currently in.

• U is a finite set of actions the robot can take.

• Ω is a finite set of observations the robot can receive.

• T : S × U → Π(S) is the state transition function (same as MDP case).

• O : S × U → Π(Ω) is the observation function, where Π(Ω) is a probability distribution over observations. If the robot takes action u ∈ U and transitions to new state s′ ∈ S, the probability of it receiving observation o ∈ Ω is P(o|s′, u) = O(s′, u, o).

• R : S × U → ℝ is the reward function.

• γ ∈ [0, 1] is the discount factor on rewards.

2.3.1 Belief Space

In the partially-observable domain, since the state is not explicitly known, decision-making is performed in belief space. Belief b(s) is a probability distribution over state s of the robot [5]; in other words, b(s) = P(s) and $\sum_{s \in S} b(s) = 1$. Given current belief b(s), action u, and observation o, the robot transitions to a new belief b′(s′), which is derived as

$$\begin{aligned} b'(s' \mid b, u, o) &= \frac{P(s', b, u, o)}{P(b, u, o)} = \frac{P(o \mid s', b, u)\, P(s' \mid b, u)\, P(b, u)}{P(o \mid b, u)\, P(b, u)} \\ &\propto P(o \mid s', u) \sum_{s \in S} P(s' \mid b, s, u)\, P(s \mid u, b) \\ &\propto P(o \mid s', u) \sum_{s \in S} P(s' \mid s, u)\, b(s), \end{aligned} \tag{2.6}$$

where P (o|b, u) is a normalizing constant [5]. The belief of a robot is a sufficient statistic for the history of actions taken and observations received [86]. In other words, the robot need only use its current belief state (not its full action-observation history) to choose the optimal action to take at the next timestep.
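The belief update in Eq. (2.6) can be sketched for a discrete POMDP as follows. The dictionary-based representation of T and O, the two-state example, and the 0.8 sensor accuracy are assumptions made purely for illustration.

```python
def belief_update(b, u, o, T, O):
    """Discrete Bayes filter implementing Eq. (2.6).

    b: {s: probability}              current belief
    T: {(s, u): {s_next: prob}}      transition model P(s'|s, u)
    O: {(s_next, u): {o: prob}}      observation model P(o|s', u)
    """
    states = list(b)
    unnormalized = {}
    for s_next in states:
        # Prediction step: sum_s P(s'|s, u) b(s)
        predicted = sum(T[(s, u)].get(s_next, 0.0) * b[s] for s in states)
        # Correction step: weight by the observation likelihood P(o|s', u)
        unnormalized[s_next] = O[(s_next, u)].get(o, 0.0) * predicted
    norm = sum(unnormalized.values())  # this is the normalizer P(o|b, u)
    if norm == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s: p / norm for s, p in unnormalized.items()}


# Hypothetical two-state example: the robot stays in place and reads a
# sensor that reports the correct state with probability 0.8.
T = {("left", "stay"): {"left": 1.0}, ("right", "stay"): {"right": 1.0}}
O = {
    ("left", "stay"): {"left": 0.8, "right": 0.2},
    ("right", "stay"): {"left": 0.2, "right": 0.8},
}
b0 = {"left": 0.5, "right": 0.5}
b1 = belief_update(b0, "stay", "right", T, O)
```

Starting from a uniform belief, a single "right" reading moves the belief to 0.8 on the right state, and repeated calls with the previous output as input track the belief over time.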

2.3.2 Optimal Policy

Policy π is defined as a mapping from the robot’s current belief to its next action, u = π(b). Given a policy, the value $V^{\pi}(b)$ for an infinite-horizon POMDP problem with initial belief b is defined as

$$V^{\pi}(b) = E\left[\sum_{t=0}^{\infty} \gamma^t R(b(s_t), u_t) \,\middle|\, \pi, b(s_0) = b\right] = E\left[\sum_{t=0}^{\infty} \gamma^t \sum_{s_t \in S} R(s_t, u_t)\, b(s_t) \,\middle|\, \pi, b(s_0) = b\right]. \tag{2.7}$$

The POMDP problem is then to find the optimal policy, π∗, where

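$$\pi^*(b) = \operatorname{argmax}_{\pi} \left\{ E\left[\sum_{t=0}^{\infty} \gamma^t \sum_{s_t \in S} R(s_t, u_t)\, b(s_t) \,\middle|\, \pi, b(s_0) = b\right] \right\}. \tag{2.8}$$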

To summarize, the planning under uncertainty problem as posed in the POMDP framework involves the robot updating and keeping track of its belief state using Eq. (2.6), and applying the optimal policy found by solving Eq. (2.8).

2.4 Decentralized Partially Observable Markov Decision Processes

A Decentralized Partially Observable Markov Decision Process (Dec-POMDP) [6] is a sequential decision-making problem where multiple robots operate under uncertainty based on different streams of observations. At each step, every robot chooses an action (in parallel) based purely on its local observations, resulting in an immediate reward and an observation for each individual robot based on stochastic (Markovian) models over continuous state, action, and observation spaces. The robots share a single reward function based on the actions of all robots, making the problem cooperative, but their local views mean that execution is decentralized.

A notation aimed at reducing ambiguities when discussing multi-robot teams must first be introduced. A generic parameter p related to the i-th robot is denoted as $p^{(i)}$, whereas a joint parameter for a team of n robots is denoted as $\bar{p} = \{p^{(1)}, p^{(2)}, \cdots, p^{(n)}\}$. Environment parameters or those referring to graphs are indicated without parentheses; for instance, $v^i$ may refer to a parameter of a graph node, and $w^{ij}$ to a parameter of a graph edge starting at node i and ending at node j.

Formally, the Dec-POMDP problem considered in this thesis¹ is described by the following elements:

• I = {1, 2, · · · , n} is a finite set of robots’ indices.

• $\bar{S}$ is a continuous set of joint states. The joint state space can be factored as $\bar{S} = \bar{X} \times X^e$, where $X^e$ denotes the environmental state and $\bar{X} = \times_i X^{(i)}$ is the joint state space of robots, with $X^{(i)}$ being the state space of the i-th robot. $\bar{S}$ is a super-state which contains both the robots’ states as well as the environment state, where $X^{(i)}$ is a continuous space and $X^e$ is assumed to be a finite set.

¹ The standard (and more general) Dec-POMDP definition does not assume factored state spaces

• $\bar{U}$ is a continuous set of joint actions, which can be factored as $\bar{U} = \times_i U^{(i)}$, where $U^{(i)}$ is the set of actions for the i-th robot.

• The state transition probability density function is denoted as $P(\bar{s}' \mid \bar{s}, \bar{u})$, specifying the probability density of transitioning from state $\bar{s} \in \bar{S}$ to $\bar{s}' \in \bar{S}$ when joint action $\bar{u} \in \bar{U}$ is taken by the robots.

• $\bar{R}$ is a joint reward function, $\bar{R} : \bar{S} \times \bar{U} \to \mathbb{R}$, giving the immediate reward for being in joint state $\bar{s} \in \bar{S}$ and taking the joint action $\bar{u} \in \bar{U}$.

• $\bar{\Omega}$ is a continuous set of observations obtained by all robots. It is factored as $\bar{\Omega} = \times_i \Omega^{(i)} = \bar{Z} \times \bar{Z}^e$, where $\bar{Z} = \times_i Z^{(i)}$ and $\bar{Z}^e = \times_i Z^{e(i)}$. The set $Z^{(i)} \times Z^{e(i)}$ is the set of observations obtained by the i-th robot. Environmental observation $o^{e(i)} \in Z^{e(i)}$ is the observation that is a function of the environmental state $x^e \in X^e$. It is assumed that the set of environmental observations $Z^{e(i)}$ is finite for any robot i.

• The observation probability density function $h(\bar{o} \mid \bar{s}', \bar{u})$ encodes the probability of seeing observations $\bar{o} \in \bar{\Omega}$ given joint action $\bar{u} \in \bar{U}$ and the resulting joint state $\bar{s}' \in \bar{S}$.

2.4.1 Action-observation History

The full action-observation history (obtained observations and taken actions) for the i-th robot is defined as

$$\breve{H}_t^{(i)} = \{\breve{o}_0^{(i)}, u_0^{(i)}, \breve{o}_1^{(i)}, u_1^{(i)}, \cdots, \breve{o}_{t-1}^{(i)}, u_{t-1}^{(i)}, \breve{o}_t^{(i)}\}, \tag{2.9}$$

where $\breve{o}^{(i)} \in \Omega^{(i)}$. For applications where robots operate in bursts of time during which the environment state $x^e$ is not modified, the above can be represented more compactly as the action-observation history,

$$H_t^{(i)} = \{o_0^{(i)}, u_0^{(i)}, o_1^{(i)}, u_1^{(i)}, \cdots, o_{t-1}^{(i)}, u_{t-1}^{(i)}, o_t^{(i)}\}, \tag{2.10}$$

where $o^{(i)} \in Z^{(i)}$.

Note that the final observation $o_t^{(i)}$ after action $u_{t-1}^{(i)}$ is included in the action-observation history. Due to interactivity between multiple robots and the environment, environmental observations $o^{e(i)}$ are of particular importance in the decentralized setting and are further detailed in Section 3.3.

2.4.2 Optimal Joint Policy

The solution of a Dec-POMDP is a collection of decentralized policies $\bar{\eta} = \{\eta^{(1)}, \eta^{(2)}, \cdots, \eta^{(n)}\}$. In general, each robot does not have access to the observations of other robots, so each policy depends only on local information. Also, it is beneficial for robots to remember history (or equivalently, their belief state), since the full state is not directly observed. As a result, $\eta^{(i)}$ maps the individual full action-observation history of the i-th robot to its next action: $u_t^{(i)} = \eta^{(i)}(\breve{H}_t^{(i)})$.

The value associated with a given policy ¯η starting from an initial joint belief (or state distribution) ¯b = P (¯s) can be defined as

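$$\bar{V}^{\bar{\eta}}(\bar{b}) = E\left[\sum_{t=0}^{\infty} \gamma^t \bar{R}(\bar{s}_t, \bar{u}_t) \,\middle|\, \bar{\eta}, P(\bar{s}_0) = \bar{b}\right]. \tag{2.11}$$

Then, a solution to a Dec-POMDP can be formally defined as the optimal joint policy

$$\bar{\eta}^* = \operatorname{argmax}_{\bar{\eta}} \bar{V}^{\bar{\eta}}. \tag{2.12}$$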

2.5 Summary

This chapter presented preliminaries for solving planning under uncertainty problems in Markovian settings. The MDP was presented as a framework for single-robot decision-making when the state is known. The MDP can be extended to the POMDP, which allows decision-making in partially observable settings where sensory measurements are noisy. Finally, the Dec-POMDP was introduced in order to allow decentralized decision-making for a team of robots. Although the Dec-POMDP allows a principled representation of multi-robot problems, it is NEXP-complete [6] and difficult to scale up to large problems. The next chapter introduces a new framework called the Decentralized Partially-Observable Semi-Markov Decision Process (Dec-POSMDP), which leverages high-level macro-actions to allow tractable solving of large multi-robot problems.


Chapter 3 Decentralized Planning using Macro-actions

3.1 Overview

The Dec-POMDP problem stated in Eq. (2.12) is undecidable (as is the infinite-horizon POMDP problem even in discrete settings [87]). In fact, no Dec-POMDP solution methods currently exist for problems with continuous state spaces.

In discrete settings, recent work has extended the Dec-POMDP model to incorporate macro-actions (MAs) which can be executed in an asynchronous manner [12]. In planning with MAs, decision-making occurs in a two-layer manner (see Fig. 3-1). A higher-level policy returns an MA for each robot, and the selected MA returns a primitive action to be executed. This approach is an extension of the options framework [15] to multi-robot domains while dealing with the lack of synchronization between robots. The options framework is a formal model of MAs [15] that has been very successful in aiding representation and solutions in single-robot domains [18, 70]. Unfortunately, this method requires a full, discrete model of the system (including macro-action policies of all robots, sensors, and dynamics at a low level).

As an alternative, this thesis proposes the Decentralized Partially Observable Semi-Markov Decision Process (Dec-POSMDP) framework, which only requires a high-level model of the problem. The Dec-POSMDP provides a high-level discrete planning formalism which can be defined on top of continuous spaces. As such, continuous multi-robot coordination problems can be approximated with a tractable Dec-POSMDP formulation.

Figure 3-1: Hierarchy of the proposed planner. In the highest level, a decentralized planner assigns a Task Macro-action (TMA) to each robot. Each TMA encompasses a specific task (e.g. picking up a package). Each TMA in turn is constructed as a set of Local Macro-actions (LMAs). Each LMA (the lower layer) is a feedback controller that acts as a funnel. LMAs funnel a large set of beliefs to a small set of beliefs (termination belief of the LMA).

3.2 Hierarchical Graph-based Macro-actions

Before discussing the more general Dec-POSMDP model, the MAs that allow for efficient planning within the proposed framework must be formally defined. This section introduces a mechanism to generate complex MAs based on a graph of lower-level simpler MAs, which is a key point in solving Dec-POSMDPs without explicitly computing success probabilities, times, and rewards of MAs in the decentralized planning level. The generated complex MAs are called Task MAs (TMAs). When utilizing TMAs, the continuous planning problem is considered within them and the high-level decentralized framework is left with a finite set of TMAs and environmental observations. This allows usage of discrete-space search algorithms for solving Dec-POSMDPs.

3.2.1 Open-loop versus Closed-loop

To clarify the concepts, before describing TMAs, open-loop and closed-loop MAs must be distinguished. In general, MAs refer to temporally-extended actions [88]. An l-step long open-loop MA is a sequence of pre-defined actions such as $u_{0:l} = \{u_0, u_1, \cdots, u_l\}$. In contrast, a closed-loop MA is a policy π(·) which maps histories to actions, $u_t = \pi(H_t)$.

As discussed in the next section, for a seamless incorporation of MAs in Dec-POMDP planning, closed-loop MAs must be automatically generated. This is a challenge in partially-observable settings.

3.2.2 Task Macro-action

Generating a closed-loop MA that accomplishes a single-robot task such as pick-up-a-package, deliver-a-package, open-a-door, and so on, itself requires solving a POMDP problem. In this thesis, information roadmaps [16] are used as a substrate to generate such TMAs. First, the structure of feedback controllers in partially-observable domains is discussed.

3.2.3 Belief within Macro-actions

A macro-action $\pi^{(i)}$ for the i-th robot maps the histories $H^{\pi^{(i)}}$ of actions and observations that have occurred to the next action the robot should take, $u = \pi^{(i)}(H^{\pi^{(i)}})$. Note that the environmental observations are only obtained when the MA terminates. This history is compressed into a belief $b^{(i)} = P(x^{(i)} \mid H^{\pi^{(i)}})$, with joint belief for the team denoted by $\bar{b} = \{b^{(1)}, b^{(2)}, \cdots, b^{(n)}\}$. It is well-known [89] that making decisions based on belief $b^{(i)}$ is equivalent to making decisions based on the history $H^{\pi^{(i)}}$ in a POMDP.

3.2.4 Feedback in Belief Space

For any given robot, a feedback controller in the partially-observable environment comprises a Bayesian filter $b_{k+1} = \tau(b_k, u_k, z_{k+1})$ that evolves the belief and a separated controller $u_{k+1} = \mu(b_{k+1})$ that generates control signals based on the current belief. Therefore, a feedback controller L in belief space can be viewed as a function that maps the current belief $b_k$, control $u_k$, and observation $z_{k+1}$ to the pair of next belief $b_{k+1}$ and control $u_{k+1}$; i.e., $(b_{k+1}, u_{k+1}) = L(b_k, u_k, z_{k+1}) = (\tau(b_k, u_k, z_{k+1}), \mu(\tau(b_k, u_k, z_{k+1})))$.

3.2.5 Local Macro-action

A Local Macro-action (LMA) considered herein is a feedback controller that is effective (has a basin of attraction) locally in a region of the state/belief space. Many controllers that rely on linearization fall into this category, as the linearization assumption is valid locally around the linearization point. The goal of an LMA in the partially-observable setting is to drive the system’s belief to a particular belief. In [16] it has been shown that in Gaussian belief space, utilizing a combination of Kalman filter and linear controllers, the system’s belief can be steered toward certain probability distributions. In other words, LMAs act like a funnel in belief space (see Fig. 3-1). Starting from a belief in the mouth of the funnel, the LMA drives the belief toward a target belief that is referred to as a milestone herein.

3.2.6 Linear LMAs

Since LMAs act locally on belief space, the system can be locally linearized and corresponding simple LMAs can be designed. In this thesis, the belief space is assumed to be Gaussian and thus mean vector $\hat{x}^+$ and covariance matrix $P^+$ characterize the belief, which is denoted as $b \equiv (\hat{x}^+, P^+)$. For a given mean value v, the nonlinear process and measurement equations are linearized, resulting in a stationary linear system with Gaussian noises. Associated with this linear system, a stationary Kalman filter and a linear separated controller are designed, $\mu(b) = -L(\hat{x}^+ - v)$. Thus, the utilized LMA is parametrized by feedback gain matrix L and point v; i.e., $\mu(b; \theta)$, where $\theta = (L, v)$. It can be shown that under the appropriate choice of L and mild observability conditions, this linear LMA acts as a funnel in belief space that drives the belief toward the milestone $\check{b} \equiv (v, \check{P})$, where $\check{P}$ is the solution of the Riccati equation corresponding to the Kalman filter [16].

Algorithm 1: TMA Construction (Offline)
1 Procedure: ConstructTMA(b, $v^{goal}$, M)
2 input: Initial belief b, mean of goal belief $v^{goal}$, task POMDP M;
3 output: TMA policy $\pi^*$, success probability of TMA $P(B^{goal} \mid b; \pi^*)$, value of taking TMA $V(b; \pi^*)$;
4 Sample a set of LMA parameters $\{\theta^j\}_{j=1}^{n-2}$ from the state space of M, where $\theta^{n-2}$ includes $v^{goal}$;
5 Corresponding to each $\theta^j$, construct a milestone $B^j$ in belief space;
6 Add to them the (n − 1)-th node as the singleton initial milestone $B^{n-1} = \{b\}$, and the n-th node as the constraint milestone $B^0$;
7 Connect milestones using LMAs $L^{ij}$;
8 Compute the LMA rewards, execution times, and transition probabilities by simulating LMAs offline;
9 Solve the LMA graph DP in Eq. (3.1) to construct TMA $\pi^*$;
10 Compute the associated success probability $P(B^{goal} \mid b; \pi^*)$, completion time $T^g(B^{goal} \mid b; \pi^*)$, and value $V(b; \pi^*)$;
11 return TMA policy $\pi^*$, success probability $P(B^{goal} \mid b; \pi^*)$, completion time $T^g(B^{goal} \mid b; \pi^*)$, and value $V(b; \pi^*)$
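The funnel behavior of the linear LMA described in Section 3.2.6 can be illustrated with a scalar sketch. Everything below is an assumption made for illustration: a single-integrator model with hypothetical noise variances, a time-varying (rather than stationary) Kalman filter whose covariance converges to the Riccati solution, and an arbitrary gain and target. It is not the controller design used in [16].

```python
import math
import random

# Hypothetical scalar model: x_{k+1} = A x_k + B u_k + w,  z_k = x_k + n
A, B, Q, R_NOISE = 1.0, 1.0, 0.01, 0.04  # Q, R_NOISE are noise variances
GAIN_L, V_TARGET = 0.5, 2.0              # LMA parameters theta = (L, v)


def kalman_step(mean, cov, u, z):
    """Bayesian filter tau: (b_k, u_k, z_{k+1}) -> b_{k+1} for a Gaussian belief."""
    mean_pred = A * mean + B * u
    cov_pred = A * cov * A + Q
    k_gain = cov_pred / (cov_pred + R_NOISE)
    return mean_pred + k_gain * (z - mean_pred), (1.0 - k_gain) * cov_pred


def controller(mean):
    """Separated controller mu(b) = -L (x_hat - v)."""
    return -GAIN_L * (mean - V_TARGET)


def run_lma(x0, mean0, cov0, steps=200, seed=0):
    """Run the closed-loop LMA and return the final belief (mean, covariance)."""
    rng = random.Random(seed)
    x, mean, cov = x0, mean0, cov0
    u = controller(mean)
    for _ in range(steps):
        x = A * x + B * u + rng.gauss(0.0, math.sqrt(Q))  # true state
        z = x + rng.gauss(0.0, math.sqrt(R_NOISE))        # noisy measurement
        mean, cov = kalman_step(mean, cov, u, z)
        u = controller(mean)
    return mean, cov


# Stationary posterior covariance for A = 1: positive root of P^2 + Q P - Q R = 0.
P_CHECK = (-Q + math.sqrt(Q * Q + 4.0 * Q * R_NOISE)) / 2.0

# Three very different initial beliefs (true state, mean, covariance).
starts = [(-3.0, -2.5, 1.0), (0.0, 0.5, 0.2), (6.0, 5.0, 2.0)]
final_beliefs = [run_lma(x0, m0, c0, seed=i) for i, (x0, m0, c0) in enumerate(starts)]
```

All three runs end with the same covariance `P_CHECK` and a mean close to `V_TARGET`, which is the sense in which the controller funnels a large set of initial beliefs toward one milestone.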

3.2.7 Constructing TMAs by Graphing Linear LMAs

A chain of funnels is a sequence of funnels where the target belief of each funnel falls into the mouth (or pre-image) of the next funnel in the chain. A richer way of combining funnels is to form a graph of funnels. An information roadmap [16] is defined as a graph of funnels, where each node of this graph is a milestone and each edge is an LMA funnel (see Fig. 3-1).

Alg. 1 recaps the construction of a TMA using a graph of linear LMAs. To construct a graph of LMAs, a set of parameters $\{\theta^j = (L^j, v^j)\}$ is sampled (Alg. 1, Line 4) and the corresponding LMAs $\{L^j\}$ are generated as described above. Associated with the j-th LMA, the j-th milestone $\check{b}^j$ is computed. The j-th node of the LMA graph is defined as an ε-neighborhood around the milestone; i.e., $B^j = \{b : \|b - \check{b}^j\| \le \epsilon\}$ (Alg. 1, Line 5).

The set of all nodes is denoted $\mathcal{V} = \{B^j\}$. Node $B^j$ is connected to its k-nearest neighbors via their corresponding LMAs. If neighboring nodes i and j are so far from each other that the j-th LMA $L^j$ cannot take the belief from i to j (since the linearization used to construct $L^j$ is not valid around $B^i$), an edge controller is utilized (as detailed in [16]). An edge controller is a finite-time controller whose role is to take the mean of the distribution close enough to the node $B^j$ by tracking a trajectory that connects $v^i$ to $v^j$ in the state space. Once the distribution mean gets close enough to the target node, the system’s control is handed over to the LMA associated with the target node, $L^j$. The concatenation of the edge controller and the funnel utilized to take the belief from $B^i$ to $B^j$ is denoted by $L^{ij}$, which is defined as the (i, j)-th graph edge. The set of all edges is denoted by $\mathcal{L} = \{L^{ij}\}$. The set of available LMAs at $B^i$ is denoted by $\mathcal{L}(i)$. One can view a TMA as a graph whose nodes are $\mathcal{V} = \{B^j\}$ and whose edges are LMAs $\mathcal{L} = \{L^{ij}\}$ (Fig. 3-1).

To incorporate the lower-level state constraints (e.g., avoiding collisions with obstacles) and control constraints, $B^0$ is considered as a hypothetical node; reaching it represents a violation of constraints. Node $B^0$ is added to the set of nodes $\mathcal{V}$. Therefore, when taking any $L^{ij}$, there is a chance that the system ends up in $B^0$. For more details on this procedure, refer to [16, 90].

3.2.8 Edge Rewards and Probabilities

The behavior of LMA $L^{ij}$ at $B^i$ can be simulated offline (Alg. 1, Line 8) and the probability of landing in any given node $B^r$, which is denoted by $P(B^r \mid B^i, L^{ij})$, can be computed. Similarly, the reward of taking LMA $L^{ij}$ at $B^i$ can be computed offline. This reward is denoted by $R(B^i, L^{ij})$ and defined as the sum of one-step rewards under this LMA. Finally, $T^{ij} = T(B^i, L^{ij})$ denotes the time it takes for LMA $L^{ij}$ to complete its execution starting from $B^i$.
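One way to obtain these edge quantities is by Monte Carlo rollouts, sketched below. The thesis only states that LMAs are simulated offline, so the averaging scheme here is an assumption, and `simulate_lma_once` is a hypothetical stand-in for a real closed-loop simulation of $L^{ij}$ started in $B^i$ (its 10% collision rate and step counts are invented).

```python
import random
from collections import Counter


def simulate_lma_once(rng):
    """Hypothetical stand-in for one closed-loop LMA rollout from B^i.

    Returns (landing node, sum of one-step rewards, execution time).
    """
    steps = rng.randint(8, 12)        # number of primitive steps taken
    collided = rng.random() < 0.1     # invented chance of violating constraints
    node = "B0" if collided else "Bj"
    return node, -0.5 * steps, float(steps)


def estimate_edge(num_rollouts=2000, seed=0):
    """Estimate P(B^r | B^i, L^ij), R(B^i, L^ij), and T^ij by averaging rollouts."""
    rng = random.Random(seed)
    landings, total_reward, total_time = Counter(), 0.0, 0.0
    for _ in range(num_rollouts):
        node, r, t = simulate_lma_once(rng)
        landings[node] += 1
        total_reward += r
        total_time += t
    probs = {node: count / num_rollouts for node, count in landings.items()}
    return probs, total_reward / num_rollouts, total_time / num_rollouts


probs, mean_reward, mean_time = estimate_edge()
```

The resulting `probs`, `mean_reward`, and `mean_time` would be stored on the graph edge and reused by the planner without further simulation.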


3.2.9 Utilizing TMAs in the Decentralized Setting

In utilizing TMAs in the decentralized setting, the following properties of the TMAs need to be available to the high-level decentralized planner: (i) TMA policy and value from any given initial belief, (ii) TMA completion time from any given initial belief, and (iii) TMA success probability from any given initial belief. What makes computing these properties challenging is the requirement that they be calculated for every possible initial belief. Every belief is needed because when one robot’s TMA terminates, the other robots might be in any belief while still continuing to execute their own TMAs. This information about the progress of robots’ TMAs is needed for nontrivial asynchronous TMA selection.

The following sections discuss how the graph-based structure of the proposed TMAs allows computation of a closed-form equation for the success probability, value, and time. As a result, when evaluating the high-level decentralized policy, these values can be efficiently retrieved for any given start and goal states. This is particularly important in decentralized multi-robot planning since the state/belief of the j-th robot is not known a priori when the TMA of the i-th robot terminates.

3.2.10 TMA Value and Policy

A TMA policy $\pi \in \mathbb{T}$ is defined as a policy that is found by performing dynamic programming (DP) on the graph of LMAs. Consider a graph of LMAs that is constructed to perform a simple task such as open-the-door, pick-up-a-package, move-a-package, and so on. An important feature of this graph is that it is multi-query, meaning that it may be valid for any starting and goal belief. Depending on the goal belief of the task, the dynamic programming problem on the LMA graph that leads to a policy can be solved to achieve the goal while trying to maximize the accumulated reward and take into account the probability of hitting failure set $B^0$. Formally, the following DP needs to be solved,

$$V^{\pi^*}(B^i) = \max_{L \in \mathcal{L}(i)} \Big\{ R(B^i, L) + \sum_j P(B^j \mid B^i, L)\, V^{\pi^*}(B^j) \Big\}, \quad \forall i \tag{3.1}$$

$$\pi^*(B^i) = \operatorname{argmax}_{L \in \mathcal{L}(i)} \Big\{ R(B^i, L) + \sum_j P(B^j \mid B^i, L)\, V^{\pi^*}(B^j) \Big\}, \quad \forall i \tag{3.2}$$

where $V^{\pi^*}(\cdot)$ is the optimal value defined over the graph nodes, with $V(B^{goal})$ set to zero and $V(B^0)$ set to a suitable negative reward for violating constraints. The resulting optimal TMA is $\pi^*(\cdot)$.

Primitive actions can be retrieved from a TMA via a two-stage computation: the TMA first picks the best LMA at each milestone and the LMA generates the next primitive action based on the perceived observations until the belief reaches the next milestone; i.e., $u_{k+1} = L(b_k, u_k, z_{k+1}) = \pi^*(B)(b_k, u_k, z_{k+1})$, where B is the last visited milestone and $L = \pi^*(B)$ is the best LMA chosen by the TMA at milestone B.
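The DP in Eqs. (3.1) and (3.2) can be sketched as value iteration over a small LMA graph. The graph, edge rewards, transition probabilities, node names, and the failure penalty below are all invented for illustration; only the backup itself follows Eq. (3.1).

```python
FAILURE_VALUE = -100.0  # hypothetical penalty assigned to constraint node B^0

# edges[node][lma] = (reward R(B^i, L), {next node: P(B^j | B^i, L)})
edges = {
    "B1": {"L12": (-1.0, {"B2": 0.9, "B0": 0.1}),
           "L13": (-3.0, {"B3": 1.0})},
    "B2": {"L2g": (-1.0, {"Bgoal": 0.95, "B0": 0.05})},
    "B3": {"L3g": (-2.0, {"Bgoal": 1.0})},
}


def solve_lma_graph(edges, tol=1e-9):
    """Value iteration for Eq. (3.1); returns node values and the TMA policy."""
    V = {node: 0.0 for node in edges}
    V["Bgoal"] = 0.0            # V(B^goal) is fixed at zero
    V["B0"] = FAILURE_VALUE     # V(B^0) is fixed at the failure penalty

    def backup(node, lma):
        r, probs = edges[node][lma]
        return r + sum(p * V[nxt] for nxt, p in probs.items())

    while True:
        delta = 0.0
        for node in edges:      # terminal nodes are never updated
            best = max(backup(node, lma) for lma in edges[node])
            delta = max(delta, abs(best - V[node]))
            V[node] = best
        if delta < tol:
            break
    # Eq. (3.2): pick the maximizing LMA at every milestone.
    policy = {node: max(edges[node], key=lambda lma: backup(node, lma))
              for node in edges}
    return V, policy


values, tma_policy = solve_lma_graph(edges)
```

In this example the TMA prefers the slower but safer edge `L13` at `B1`, because the shorter route through `B2` carries a larger chance of ending in the failure node.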

3.2.11 Success Probability of TMA

For a given optimal TMA $\pi^*$, the associated optimal value $V^{\pi^*}(B^i)$ from any node $B^i$ is computed via solving Eq. (3.1). Also, using Markov chain theory, the probability $P(B^{goal} \mid B^i; \pi^*)$ of reaching the goal node $B^{goal}$ under the optimal TMA $\pi^*$ starting from any node $B^i$ can be analytically computed in the offline phase [16].
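A small sketch of this computation treats the LMA graph under a fixed TMA as an absorbing Markov chain with absorbing nodes $B^{goal}$ and $B^0$. For brevity it iterates the absorption equations to a fixed point rather than solving them in closed form as in [16]; the node names and transition probabilities are made up.

```python
# P[node] = {next node: P(B^j | B^i; pi)} under a fixed TMA (invented numbers).
P = {
    "B1": {"B2": 0.9, "B0": 0.1},
    "B2": {"Bgoal": 0.8, "B1": 0.15, "B0": 0.05},
}


def success_probability(P, goal="Bgoal", failure="B0", tol=1e-12):
    """Probability of eventually reaching the goal node from every node.

    Solves p_i = sum_j P(B^j | B^i) p_j with p_goal = 1 and p_failure = 0.
    """
    p = {node: 0.0 for node in P}
    p[goal], p[failure] = 1.0, 0.0
    while True:
        delta = 0.0
        for node in P:
            new = sum(prob * p[nxt] for nxt, prob in P[node].items())
            delta = max(delta, abs(new - p[node]))
            p[node] = new
        if delta < tol:
            return p


p_success = success_probability(P)
```

For these numbers the equations give a success probability of 0.8/0.865 from `B2` and 0.9 times that value from `B1`.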

3.2.12 Completion Time of TMA

Similarly, the time it takes for the TMA to go from $B^i$ to $B^{goal}$ under any TMA π can be computed as follows,

$$T^g(B^i; \pi) = T(B^i; \pi) + \sum_j P(B^j \mid B^i; \pi)\, T^g(B^j; \pi), \quad \forall i \tag{3.3}$$

where $T^g(B; \pi)$ denotes the time it takes for TMA π to take the system from B to the TMA’s goal, whereas $T(B^i; \pi)$ is the one-step time associated with taking the LMA $\pi(B^i)$ at $B^i$. Defining $T^i = T(B^i; \pi)$ and $\bar{T} = (T^1, T^2, \cdots, T^n)^T$, Eq. (3.3) can be written in its matrix form as,

$$\bar{T}^g = \bar{T} + \bar{P}\,\bar{T}^g \;\Rightarrow\; \bar{T}^g = (I - \bar{P})^{-1}\bar{T} \tag{3.4}$$

where $\bar{T}^g$ is a column vector with its i-th element equal to $T^g(B^i; \pi)$ and $\bar{P}$ is a matrix with its (i, j)-th entry equal to $P(B^j \mid B^i; \pi)$.

Therefore, using the formulation presented in this section, a TMA can be used in a higher-level planning algorithm as a MA whose success probability, execution time, and reward can be computed offline.
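The matrix form in Eq. (3.4) can be evaluated with a few lines of code. The sketch below assumes that $\bar{P}$ is restricted to non-terminal nodes (so that $I - \bar{P}$ is invertible and the goal contributes zero remaining time) and uses a small hand-written Gaussian elimination instead of a numerical library; the two-node example is invented.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def completion_time(P_bar, T_bar):
    """Evaluate Eq. (3.4): solve (I - P_bar) T_g = T_bar for T_g."""
    n = len(T_bar)
    A = [[(1.0 if i == j else 0.0) - P_bar[i][j] for j in range(n)]
         for i in range(n)]
    return solve_linear(A, T_bar)


# Invented example with two non-terminal milestones B1 and B2.
# From B1 the TMA always reaches B2; from B2 it falls back to B1 with
# probability 0.2 and otherwise reaches the goal (not listed in P_bar).
P_bar = [[0.0, 1.0],
         [0.2, 0.0]]
T_bar = [3.0, 2.0]   # one-step times T(B^i; pi)
T_g = completion_time(P_bar, T_bar)
```

Here the expected completion times are 6.25 from B1 and 3.25 from B2, which can be checked by substituting back into Eq. (3.3).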

3.3 Environmental State and Updated Model

TMAs can be extended to the multi-robot setting where there is an environmental state that is locally observable by robots and can be affected by other robots.

3.3.1 Decision Epochs

Robots can choose TMAs at decision epochs. Since TMA selection is asynchronous, a decision epoch is considered to be a timestep in which any robot finishes its current TMA and chooses a subsequent TMA. The set of the first k decision epochs is defined as an ordered set $t_{0:k} = (t_0, \cdots, t_k)$, which consists of all the timesteps at which any one (or multiple) of the robots completed a TMA. Refer to Fig. 3-2 for an illustrated example of epochs. Defining $\bar{\pi} = \{\pi^{(1)}, \cdots, \pi^{(n)}\}$ as the collection of TMAs currently assigned to the robots, $\bar{\pi}$ only changes at epochs (since, per definition, an epoch occurs when a robot completes its current TMA and chooses a new one). The timestep at the k-th epoch is denoted by $t_k$ and defined as

$$t_k = \min_i \min_t \{t > t_{k-1} : b_t^{(i)} \in B^{goal,(i)}\},$$

with $t_0 = 0$, where $b_t^{(i)}$ is the belief of robot i at timestep t.
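The definition of decision epochs can be made concrete with a short sketch. It assumes that the TMA completion times of each robot are already known (in practice they are revealed during execution), and the robot indices and times used here are purely illustrative.

```python
def decision_epochs(completion_times):
    """Return the ordered epochs (t_0, t_1, ...) for a team of robots.

    completion_times: {robot index: [times at which that robot finishes a TMA]}
    An epoch occurs whenever any robot completes a TMA; t_0 = 0.
    """
    epochs = [0.0]
    all_times = sorted({t for times in completion_times.values() for t in times})
    for t in all_times:
        if t > epochs[-1]:          # t_k is the earliest completion after t_{k-1}
            epochs.append(t)
    return epochs


def robots_finishing(completion_times, t):
    """Robots whose TMA terminates exactly at time t."""
    return sorted(i for i, times in completion_times.items() if t in times)


# Illustrative completion times for four robots.
completion_times = {1: [11.3], 2: [4.6, 21.4], 3: [15.8], 4: [11.3]}
epochs = decision_epochs(completion_times)
finishing_at_second_epoch = robots_finishing(completion_times, epochs[2])
```

With these inputs the epochs are 0, 4.6, 11.3, 15.8, and 21.4, and two robots share the epoch at 11.3, which corresponds to the rare case of simultaneous TMA completion.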

3.3.2 Time Notation

Define k as the epoch number and $t_k$ as the time associated with it. Below, we assume $b_k$ and $b_{t_k}$ both refer to the same belief (at epoch k or time $t_k$).

[Figure 3-2 shows a timeline for Agents 1 to 4 with epochs at t_0 = 0, t_1 = 4.6, t_2 = 11.3, t_3 = 15.8, and t_4 = 21.4.]

Figure 3-2: Decision epochs $t_k$ are defined as timesteps when any robot finishes its current TMA. At a decision epoch, each robot receives a TMA-observation $\breve{o}_k$, consisting of its belief and latest environment observation. In some scenarios, two robots may finish their TMAs at the same decision epoch, though this is a rare event due to TMA completion times being continuous.

3.3.3 Environment State

The environment state (e-state) at the k-th epoch is defined as $x_k^e \in X^e$. The e-state encodes the information in the environment that can be manipulated and observed by different robots. It can be interpreted as a state which is shared amongst the robots, and can be manipulated for allowing implicit communication between robots. It is assumed $x_k^e$ is only locally (partially) observable. An example for $x_k^e$ in the package delivery application (presented in Chapter 6) is “there is a package in this base”. A robot can only get this example measurement if the robot is in the base (hence the entire process is partially observable). There is a limited set of TMAs available at each $x^e$, which is denoted by $\mathbb{T}(x^e)$.

3.3.4 Environmental Observation

Each time a robot i completes a TMA, it receives a partial observation of the environmental state $x^e$, denoted by $o^{e(i)}$. Though the e-state is shared, noisy observations of it are made independently by each robot, thus allowing each robot to have a different observation model for it.

The e-state observation likelihood function for robot i is denoted as $P(o_k^{e(i)} \mid x_k^e, b_k^{(i)})$. Note that at every epoch k, only robots whose TMA has completed at that specific epoch, i.e., $b_k^{(i)} \in B^{goal,(i)}$, will receive an environmental observation. The remaining robots (which continue to execute their TMAs) do not receive an environmental observation, i.e., $o_k^{e(i)} = \text{null}$ for all i where $b_k^{(i)} \notin B^{goal,(i)}$.

3.3.5 Belief at Epochs

To incorporate environmental variables into the formulation, all robots need to be considered simultaneously since the e-state $x^e$ is not local to a given robot and can be manipulated by other robots. Thus, $\bar{b}_k := (b_k^{(1)}, \cdots, b_k^{(n)})$ is defined as the collection of beliefs of all robots at the k-th epoch, where $b_t^{(i)}$ is the belief of the i-th robot at time step t. Similarly, $\bar{B}_k^{goal} = (B_k^{goal,(1)}, \cdots, B_k^{goal,(n)})$ is defined as the collection of goal regions in belief space at the k-th epoch for all robots.

3.3.6 Extended Transition Probabilities

Accordingly, for $\pi \in \mathbb{T}(x^e)$, the single robot transition probabilities $P(B^{goal} \mid b; \pi)$ are extended to take the e-state into account, and the extended transition probability is defined as $P(\bar{B}_{k+1}^{goal}, x_{k+1}^{e'} \mid \bar{b}_k, x_k^e; \bar{\pi}_k)$. This extended transition probability denotes the probability of getting to the joint goal region $\bar{B}_{k+1}^{goal}$ and e-state $x_{k+1}^{e'}$ starting from joint belief $\bar{b}_k$ and e-state $x_k^e$ under the joint TMA policy $\bar{\pi}_k$ at the k-th epoch.

3.3.7 Extended One-step Reward

The extended (joint) reward $\bar{R}(\bar{b}, x^e, \bar{u})$ encodes the reward obtained by the entire team, where $\bar{b} = \{b^{(1)}, \cdots, b^{(n)}\}$ is the joint belief and $\bar{u}$ is the joint action defined previously.

It is assumed that the extended (joint) reward is a multi-linear function of a set of reward functions $\{R^{(1)}, \cdots, R^{(n)}\}$ and $R^e$, where $R^{(i)}$ only depends on the i-th robot’s state and $R^e$ depends on all the robots. In other words,

$$\bar{R}(\bar{x}, x^e, \bar{u}) = g\big(R^{(1)}(x^{(1)}, x^e, u^{(1)}), R^{(2)}(x^{(2)}, x^e, u^{(2)}), \cdots, R^{(n)}(x^{(n)}, x^e, u^{(n)}), R^e(\bar{x}, x^e, \bar{u})\big). \tag{3.5}$$

In multi-robot planning domains, computing $R^e$ is often computationally less expensive than computing $\bar{R}$, which is the property exploited in designing the higher-level decentralized algorithm. This is due to $R^e$ essentially being an environment reward, generated for a certain e-state and configuration of robots. On the other hand, $\bar{R}$ is related to the robots’ sequences of joint actions and their impact on the environment.

3.3.8 Joint Policy

Similarly, the joint policy $\bar{\phi} = \{\phi^{(1)}, \cdots, \phi^{(n)}\}$ is the set of all decentralized policies, where $\phi^{(i)}$ is the decentralized policy associated with the i-th robot. The next section discusses how these decentralized policies can be computed based on the Dec-POSMDP formulation.

3.3.9 Joint Value

Finally, joint value $\bar{V}^{\bar{\phi}}(\bar{b}, x_0^e) = \bar{V}(\bar{b}, x_0^e; \bar{\phi})$ encodes the value of executing the collection $\bar{\phi}$ of decentralized policies starting from e-state $x_0^e$ and initial joint belief $\bar{b}$. The optimal joint policy $\bar{\phi}^*$ is defined as the policy which results in the maximum joint value $\bar{V}^*(\bar{b}, x_0^e; \bar{\phi}^*)$.

3.4 Dec-POSMDP Framework

This section formally introduces the Dec-POSMDP framework. It discusses how to transform a continuous Dec-POMDP to a Dec-POSMDP using a finite number of TMAs, allowing discrete domain algorithms to generate a decentralized solution for general continuous problems.


3.4.1 TMA History

In the decentralized setting, each robot has to make a decision based on its individual action-observation history. In utilizing TMAs, this history includes the chosen TMAs $\{\pi_k^{(i)}\}$, the action-observation histories $\{H_k^{(i)}\}$ under chosen TMAs, and the environmental observations $\{o_k^{e(i)}\}$ received at the termination of TMAs for all epochs. In other words, the TMA history for the i-th robot at the beginning of the k-th epoch is defined as

$$\xi_k^{(i)} = (o_0^{e(i)}, \pi_0^{(i)}, H_1^{(i)}, o_1^{e(i)}, \pi_1^{(i)}, H_2^{(i)}, o_2^{e(i)}, \pi_2^{(i)}, \ldots, \pi_{k-1}^{(i)}, H_k^{(i)}, o_k^{e(i)}), \tag{3.6}$$

which includes the chosen TMAs $\pi_{0:k-1}^{(i)}$, the action-observation histories under chosen TMAs $H_{1:k}^{(i)}$, and the environmental observations $o_{0:k}^{e(i)}$ received at the beginning of the mission and at the termination of TMAs.

Figure 3-3: The i-th robot moves from state $x_0$ through a sequence of TMAs, $\pi_0, \pi_1, \cdots, \pi_k$. Under each TMA it receives primitive actions $u_0, u_1, \cdots, u_t$ and perceives primitive observations $o_0, o_1, \cdots, o_t$. At the end of each TMA (as well as the beginning of the mission), it also obtains environmental observations $o_0^e, \cdots, o_k^e$.

3.4.2 Compressed TMA History

The TMA history, as introduced above, is a full representation of a robot’s decisions and observations, both high-level and low-level. However, in large-scale problems,
