Airborne Collision Avoidance with Three-Dimensional Policy

by

Osmany L. Corteguera

B.S. Computer Science

Massachusetts Institute of Technology, 2018

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of

Master of Engineering in Electrical Engineering and Computer Science

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

May 2020

©2020 Massachusetts Institute of Technology. All rights reserved.

Signature of Author:

Department of Electrical Engineering and Computer Science

May 18, 2020

Certified by:

Michael P. Owen

Technical Staff, MIT Lincoln Laboratory

Thesis Supervisor

Accepted by:

Katrina LaCurts

Chair, Master of Engineering Thesis Committee


by

Osmany L. Corteguera

Submitted to the Department of Electrical Engineering and Computer Science

on May 18, 2020

in Partial Fulfillment of the Requirements for the Degree of

Master of Engineering in Electrical Engineering and Computer Science

ABSTRACT

Current airborne collision avoidance systems are limited in the number of variables they can use to describe an encounter between aircraft. The solution method currently used, dynamic programming (DP), requires the discretization of continuous variables and a split between the set of horizontal and vertical variables. These approximations limit how extensible the system is to new environments, and can negatively impact its performance.

In this work, we present a method to build a collision avoidance logic with DP that uses all the variables in a three-dimensional encounter for the purpose of evaluating its performance. We do this by limiting the values that some variables can take, and building small tables that can be evaluated on partitions of a large encounter set. The 3D logic is able to achieve similar performance to ACAS X while using a simpler model. To exploit the benefits of the fully 3D logic approach without the computational limitations, we build a simulation environment and implement a deep reinforcement learning algorithm that can learn successful collision avoidance logics consistently. Our experiments show that this approach can learn logics using models of the problem that would be impractical for a DP solution to solve.

Thesis Supervisor: Michael P. Owen

Title: Technical Staff, MIT Lincoln Laboratory


Acknowledgements

I would like to first thank my supervisor Michael Owen for his supportive guidance throughout my time at Lincoln. The work for this thesis would not have been possible without his endless help and advice at every point along the way. I greatly benefited from his knowledge, his ability to thoughtfully answer questions, and his willingness to regularly meet and discuss our work.

I am indebted to Wes Olson, who brought me into Group 42 at Lincoln. I am grateful to Wes, Lynn Graham, and Lynne Adamian for welcoming me to the group and always making sure I had the resources to conduct this research.

I am very grateful to Mykel Kochenderfer, Blake Wulfe, Sydney Katz, and Kyle Julian for their advice and feedback. Their expertise and technical knowledge were invaluable to the work done in this thesis.

I am also very grateful to Luis Alvarez, Emilie Cowen, Robert Moss, Chris Edwards, Sam Wu, Christian Madali, Adam Panken, Adam Gjersvik, and the rest of the members of Group 42. Their support and interactions greatly enriched my time at Lincoln.

I would also like to thank the MIT Sport Taekwondo team for being a caring and welcoming group of people, and for making my time as a graduate student meaningful and enjoyable. I learned many valuable lessons from the examples of selfless leadership of my instructors Master Chuang, Isaac, Nick, Renee, Sam, Tahin, Ashley, Machine, Yang, and club president Iris. I am eternally grateful for all the friends I made and all the memories we share.

Finally, I thank my parents and my sister for their unconditional love, and the sacrifices they made to give me this opportunity.


List of Figures

3.1 Distribution of encounters

3.2 Policy plots for 3D state space logic

3.3 3D logic, Xu 4.1, and Xu 5 comparison

4.1 DRL horizontal policies

4.2 Performance metrics (P(NMAC), P(Alert), P(Reversal)) on the control policies

4.3 Example encounter of the learned policy with another aircraft

5.1 Policy plots for three deep RL 3D state space logic

5.2 Example encounter of the 3D DRL policy with another aircraft

6.1 Policy plot of policy trained with a dropout rate of 1%


List of Tables

3.1 Three-dimensional state space variables

3.2 Aircraft maneuver rates

3.3 Reward function for tabular 3D logic

3.4 Number of encounters with low speed variance for each table cutpoint (4.8M total)

4.1 Aircraft state variables

4.2 Observation space variables

4.3 Reward space variables

4.4 Performance of the best model for each seed with alert rate ≤ 0.7, reversal rate ≤ 0.1

4.5 Model performance with dropout regularization

4.6 Model performance with L2 regularization

4.7 Model performance with different minibatch sizes

4.8 Model performance with n-step returns

5.1 Aircraft state variables

5.2 Aircraft maneuver rates

5.3 Observation space variables

5.4 Reward space variables

5.5 Test-set performance of each trained model

5.6 Performance of comparison over 10K encounters

6.1 Horizontal Deep RL model default set of hyperparameters

6.2 Model performance with reversal reward

6.3 Model performance with different architectures

6.4 Model performance with fixed-length encounters with length-based terminal states

6.5 Model performance with fixed-length encounters with range-based terminal states


Chapter 1

Introduction

The first efforts to develop collision avoidance systems were spurred by a collision between two airliners over the Grand Canyon in 1956. Ground-based air traffic control procedures meant to prevent collisions already existed, but they failed to prevent this one. The major loss of life and the publicity surrounding the event made it evident to airlines and aviation authorities that another layer of safety was needed, one that could act as a last-resort system when other methods failed. Prompted by this disaster, the U.S. government established the Federal Aviation Administration (FAA) and tasked it with improving the safety of the airspace to prevent further catastrophic midair collisions [Kochenderfer et al., 2012].

Several efforts were made from the late 1950s to the 1970s to create collision avoidance systems that could operate from each aircraft and act independently of the existing air traffic control systems. From these efforts, the Beacon Collision Avoidance System (BCAS) emerged, which used data from the air traffic radar beacon systems to sense nearby aircraft in the airspace [tca, 2011]. Despite these efforts, in 1978 there was a collision between a light aircraft and a commercial airliner over San Diego which resulted in 144 fatalities, and another collision in 1986 over Cerritos, California resulting in 82 fatalities. In response to these collisions, a Congressional mandate was passed that required U.S. commercial aircraft to be equipped with the Traffic Alert and Collision Avoidance System (TCAS), which had the capability not only to sense aircraft in the nearby airspace, but also to issue Resolution Advisories (RAs) that would recommend maneuvers to take to avoid collisions.

TCAS is a highly sophisticated system that has significantly improved aircraft safety. However, the evolution of the airspace over the following decades revealed limitations in the TCAS approach to collision avoidance. In order to address these limitations and improve performance, the Airborne Collision Avoidance System X (ACAS X) was developed at MIT Lincoln Laboratory [Kochenderfer et al., 2012].


ACAS X uses a dynamic programming (DP) [Bertsekas, 2005] approach to compute a decision-making solution that balances safety with a sensible rate of alerts and a sensible sequence of alerts. The solution is encoded in a numerical table that acts as a function: it takes in the variable values that specify the state of an encounter between aircraft, such as range, altitude, and speed for each aircraft, and estimates the future cost of taking each action. At each point in time, the system takes an action with the minimum expected future cost. However, the space of variables to consider in an aircraft encounter is large and many variables have a continuous domain.

To make the tabular solution work, ACAS X adopts a discretization scheme for these variables. It also has to split the variables into a set of horizontal variables and a set of vertical variables, and find a solution for each set separately. The work in this thesis evaluates the performance of a system where the horizontal and vertical sets of variables are combined, and a solution is computed for the combined set. While a solution in this combined state space would usually be very costly to compute, we describe a procedure to make it significantly less costly while remaining accurate. We then go on to describe an alternative approach that, instead of using dynamic programming, uses deep reinforcement learning (DRL) to create a logic. The DRL solution has the advantage of being able to use the combined state space without needing to discretize the variables, but the computed solution can be less stable and the method requires the careful adjustment of a large set of hyperparameters.

1.1 Thesis Outline and Contributions

The rest of the thesis is organized as follows:

In Chapter 2, we give a formal description of the collision avoidance problem for aircraft. We describe the ACAS X system in more detail, which is composed of several different versions designed to accommodate a variety of aircraft designs and capabilities. One version is targeted towards large commercial aircraft (ACAS Xa), while the one we focus on in this thesis targets unmanned aircraft (ACAS Xu). We go over the details of the algorithms and techniques used by ACAS X to develop solutions, and why finding optimal solutions to the problem poses difficulties. We conclude by going over the efforts that have been made to address these difficulties, and by introducing the frameworks used in this thesis to evaluate and attempt to overcome these difficulties.

In Chapter 3 we evaluate the performance of a modified version of ACAS Xu that uses a different problem formulation. This formulation uses a 3D state space for which a solution is usually impractical to compute. We modify the variables in the collision avoidance problem to compute a solution in a reasonable amount of time. We compare the policies we train with this 3D state space to the different versions of the ACAS Xu implementation. We show that these 3D policies can achieve similar performance to the ACAS Xu policies, and that the 3D state space may give better performing logics for collision avoidance.

In Chapter 4 we find a collision avoidance policy using deep Q-learning. We find the solution to a collision avoidance problem restricted to a horizontal plane, where both aircraft are always at the same altitude. This restriction allows us to experiment and tune the algorithm to the requirements of our problem with short training times, while allowing us to later extend it to the 3D collision avoidance problem.

In Chapter 5 we build a 3D deep Q-learning solution. We describe how the problem is modified and evaluate the performance of the policy. We compare the performance of the 3D deep RL logic to ACAS Xu. We discuss the possible advantages and disadvantages of this approach, and propose recommendations for further improvements that could be made.

Finally, in Chapter 6 we summarize our findings and discuss remaining open problems and opportunities for further research.


Chapter 2

Background

In this chapter, we formally define the collision avoidance problem. A system solving this problem is composed of many parts, including detection, communication, and decision-making. In this work, we focus on the decision-making part of the problem. We discuss the previous work done in this area and the issues that motivate the work done in this thesis.

2.1 The Collision Avoidance Problem

In this work, we explore collision avoidance systems that issue Resolution Advisories (RAs), also referred to as alerts, recommending maneuvers for the purpose of avoiding collisions with nearby aircraft. The main objective is to create a collision avoidance system that reliably prevents collision without excessive alerting. If it alerts too much, the system disrupts the normal operation of the aircraft and has a higher chance of being ignored by the pilot. If it does not alert enough, we suffer a higher risk of getting into a collision. An ideal system would issue alerts if and only if there would be a collision if they were not issued.

The collision avoidance problem is made difficult by the sources of uncertainty that accompany the problem. First, there is uncertainty in the state of an encounter between aircraft. Sensor noise and limited, noisy measurements make it difficult to estimate the variables relevant for collision avoidance, like the position and velocity of the aircraft. There is also uncertainty in the dynamics of the encounter. The evolution of the state of an encounter depends on the decisions made by the pilots of each aircraft, which cannot be known in advance and so can be difficult to model. It also depends on the specifications and capabilities of each of the aircraft involved, as well as other natural sources of error and noise in the physical environment.

A collision avoidance system is composed of several modules. There is the sensing module, which detects other aircraft in the vicinity and estimates their position, velocity, and direction. The system also has communication systems and protocols that make it possible to coordinate maneuvers between aircraft and communicate between them and air traffic control. The part of the system we explore in this thesis is the decision-making module, whose purpose is to come up with recommendations for the best maneuvers to take to avoid a collision if there is a risk of a collision occurring. We assume that we get observations of the nearby environment from the sensing module. These observations provide the data that can be used to decide whether to alert or not, and what maneuver to issue if we send out an alert.

This work focuses on the problem where we assume we are in control of a single aircraft (the ownship), and another aircraft (the intruder) enters the nearby airspace of the ownship. The job of the collision avoidance logic is to minimize the probability of getting into a near mid-air collision (NMAC), while keeping the alert rate below some threshold. An NMAC occurs when two aircraft come within $\rho_h$ horizontally and $\rho_v$ vertically of each other. Typical values for these variables are $\rho_h = 500$ ft and $\rho_v = 100$ ft. Essentially, the NMAC zone is a cylinder of radius $\rho_h$ and height $2\rho_v$ around the ownship, and the goal is to keep other aircraft from entering that cylinder. Using NMACs instead of collisions is more convenient, as it allows the same logic to be applied to aircraft of different shapes and sizes, whereas measuring actual collisions would require taking these properties into account.
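The NMAC cylinder test can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: the function name `is_nmac`, the argument layout, and the choice of strict inequalities at the cylinder boundary are assumptions made here. Only the thresholds (500 ft and 100 ft) come from the text.

```python
import math

# Typical NMAC thresholds quoted in the text (feet).
RHO_H_FT = 500.0
RHO_V_FT = 100.0


def is_nmac(dx_ft, dy_ft, dz_ft, rho_h=RHO_H_FT, rho_v=RHO_V_FT):
    """Return True if the intruder lies inside the NMAC cylinder.

    dx_ft, dy_ft, dz_ft are the intruder's position relative to the
    ownship. Using strict inequalities at the boundary is an assumption
    of this sketch.
    """
    horizontal_range = math.hypot(dx_ft, dy_ft)
    return horizontal_range < rho_h and abs(dz_ft) < rho_v


if __name__ == "__main__":
    print(is_nmac(300.0, 300.0, 50.0))   # inside both thresholds
    print(is_nmac(400.0, 400.0, 0.0))    # horizontal range exceeds 500 ft
    print(is_nmac(0.0, 0.0, 150.0))      # vertical separation exceeds 100 ft
```

Because the test only depends on relative position, the same check applies regardless of aircraft size, which is the convenience the paragraph above describes.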

2.2 TCAS

One example of a collision avoidance system that has been in use for decades is TCAS. The TCAS system uses data from radar and beacon surveillance to monitor and track other nearby aircraft, and to issue advisories to the pilot if needed. The advisories can take the form of traffic alerts, which announce to the pilot that aircraft are nearby and highlight the aircraft on a traffic display. If the logic determines that the risk of collision is high, then it issues resolution advisories (RAs), which instruct the pilot to either climb or descend in order to maintain a safe distance.

While TCAS has been a major contributor to airspace safety around the world [Kuchar and Drumm, 2007], it does have limitations that are accentuated as the airspace changes. To adequately address the wide variety of situations that an aircraft may encounter, TCAS uses a set of rules and heuristics to estimate the closest point of approach for each aircraft, and to issue RAs if necessary [tca, 2011]. The model of the airspace used by TCAS is deterministic, so these rules must be modified to compensate for the state and dynamic uncertainty inherent to the problem. This makes the set of rules that govern decision-making in the system complex and interrelated, which limits the capacity of the system to be extended and modified.

Changes in the types of aircraft present in the airspace have also accentuated some of the limitations of TCAS. It lacks support for a wider range of aircraft that inhabit the airspace, such as unmanned and autonomous aircraft. As the density of traffic in the airspace increases and close encounters become more frequent, it is difficult to adjust TCAS to maintain an operationally acceptable RA rate while protecting against collisions.

2.3 Modeling the Problem

More recent solutions like ACAS X model the problem explicitly, taking the uncertainty into account in the model framework. In this section, we describe the framework used by ACAS X.

We can describe the collision avoidance problem as a Partially Observable Markov Decision Process (POMDP) [Kaelbling et al., 1998]. The POMDP framework allows us to model the problem as consisting of an agent (the ownship) interacting in an environment where the ground truth cannot be fully observed. The agent can take actions in the environment, while the environment provides the agent with observations about the state of the environment and rewards. The rewards given by the environment are used to guide the agent toward some goal (e.g. our goal is to avoid collisions with other aircraft).

More formally, a POMDP is defined by the tuple $(S, A, T, R, \Omega, O)$, where

$S$ is the set of states $s$. Each member of $S$ describes a state of the encounter, which includes information like the vertical separation, ownship vertical rate, etc. Essentially, each $s \in S$ is a snapshot of the encounter at one moment in time.

$A(s) \mapsto A_s$ is the set of advisories $a$ that can be issued at every state. These can include climb, descend, turn right, etc.

$T(s, a) \mapsto p(s' \mid s, a)$ is the stochastic transition function, which outputs a distribution over next states $s'$, given that we are at state $s$ and take action $a$. This function is stochastic to represent the dynamic uncertainty of both aircraft involved in the encounter.

$R(s, a) \mapsto r$ is the reward function for being at state $s$ and taking action $a$. ACAS X uses negative rewards to discourage the aircraft from engaging in NMACs, issuing too many alerts, etc. For example, having an NMAC can result in a reward of -1000, while issuing an alert can result in a reward of -1.

$\Omega$ is the set of observations $o$. This includes the input from the aircraft sensors. Since we cannot observe the entire state of the encounter due to measurement error, noise, and sensor inaccuracies, $\Omega$ represents the set of possible observed measurements.

$O(s, a) \mapsto p(o \mid s, a)$ gives the distribution over possible observations given a state and an action taken. The shape of this distribution is determined by the amount of noise from each of the observation variables.

The POMDP framework is used in the rest of the thesis as the base upon which the DP and DRL models and solutions are built.

A solution to a POMDP takes the form of a policy, which is a function $\pi(b) \mapsto a$ that maps the agent’s belief of what the current state is to an action. The optimal policy $\pi^*$ is the one that maximizes the expected returns (i.e. the expected sum of future rewards).

2.4 ACAS X and Dynamic Programming

ACAS X is the solution developed at Lincoln Laboratory to overcome the limitations of TCAS [Kochenderfer et al., 2012]. It is designed to handle many combinations of aircraft types, like manned and unmanned aircraft, with varying sizes and maneuvering capabilities. In addition, it can handle many different sensor types and many combinations. The ACAS X solution explicitly models the uncertainty of the problem and uses optimization to solve it.

To provide a framework in which to make decisions in the face of both dynamic and state uncertainty, ACAS X formulates the collision avoidance problem as a POMDP, and uses Dynamic Programming (DP) [Bertsekas, 2005] to solve it. The ACAS X logic is built and tested with models of the airspace built from collected encounter data that approximate the behavior of real aircraft in the airspace. The problem can be formulated in different ways to account for different sensor modalities and different aircraft capabilities while still retaining the same solution framework. This increases the flexibility of what logics can be learned by providing a principled approach for constructing a solution.

While the ACAS X approach works well and overcomes many of the limitations of TCAS, it also has several limitations. An airborne encounter in ACAS X is described by several variables at each timestep. These variables include: distance from ownship to intruder, velocity of ownship, velocity of intruder, etc. Some of the variables used to represent the state of an encounter are continuous variables. The DP solution used by ACAS X needs to compute a cost at each state of the encounter for it to be able to choose the best RA to issue. Since there are infinitely many points in the state space of the problem, DP cannot be applied directly to solve it. Instead, the continuous variables are discretized by selecting a set of points across their range where the cost is calculated. While the DP solution guarantees an optimal solution to the problem it is applied to, the discretized problem is an approximation of the real problem to be solved.

The DP algorithm used to solve the problem is value iteration. In value iteration, we store a table that associates a cost with every state $s$ in the state space $S$. Let the cost of $s$ be represented by $v(s)$. The table is initialized to arbitrary values, and then the following update is performed for all $s \in S$:

$$v'(s) = \min_a \mathbb{E}\left[ r_{t+1} + v(s_{t+1}) \mid s_t = s, a_t = a \right]$$

In our case, we sweep over the discretized state space performing the value iteration updates. The cost $v(s)$ is guaranteed to converge to the optimal cost $v^*(s)$ [Bellman, 1957]. From the cost table we calculate, it is then possible to retrieve the optimal policy, which decides which action to take at every state. Let $q(s, a)$ represent the cost of being at state $s$ and taking action $a$, then

$$q(s, a) = \mathbb{E}\left[ r_{t+1} \mid s_t = s, a_t = a \right] + \sum_{s'} p(s' \mid s, a)\, v(s')$$

We refer to $q(\cdot, \cdot)$ as the Q-function. We can get the optimal policy $\pi^*$ from the optimal Q-function by simply choosing the action that minimizes the cost:

$$\pi^*(s) = \arg\min_a q^*(s, a)$$
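To make the update, the Q-function, and the policy extraction concrete, the following is a minimal sketch of cost-minimizing value iteration on a toy four-state problem. Everything in it is hypothetical and chosen for illustration: the states `far`, `near`, `nmac`, and `clear`, the transition probabilities, and the cost magnitudes (1000 for an NMAC and 1 for an alert, mirroring the example reward values above with the sign flipped to costs). It is not the ACAS X model, and it updates the table in place during each sweep, which is one common variant.

```python
# Toy, hypothetical encounter model used only to illustrate value iteration.
STATES = ["far", "near", "nmac", "clear"]
TERMINAL = {"nmac", "clear"}
ACTIONS = ["none", "alert"]

# P[(s, a)] is a list of (probability, next_state) pairs.
P = {
    ("far", "none"): [(0.01, "near"), (0.99, "clear")],
    ("far", "alert"): [(0.001, "near"), (0.999, "clear")],
    ("near", "none"): [(0.5, "nmac"), (0.5, "clear")],
    ("near", "alert"): [(0.1, "nmac"), (0.9, "clear")],
}


def cost(action, next_state):
    """Immediate cost r_{t+1}: 1 per alert, 1000 on entering an NMAC."""
    alert_cost = 1.0 if action == "alert" else 0.0
    nmac_cost = 1000.0 if next_state == "nmac" else 0.0
    return alert_cost + nmac_cost


def q_value(v, s, a):
    """q(s, a) = E[r_{t+1} + v(s_{t+1}) | s_t = s, a_t = a]."""
    return sum(p * (cost(a, s2) + v[s2]) for p, s2 in P[(s, a)])


def value_iteration(tol=1e-9, max_sweeps=1000):
    """Sweep the state space until the largest change is below tol."""
    v = {s: 0.0 for s in STATES}  # terminal states keep cost 0
    for _ in range(max_sweeps):
        delta = 0.0
        for s in STATES:
            if s in TERMINAL:
                continue
            new_v = min(q_value(v, s, a) for a in ACTIONS)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v  # in-place update
        if delta < tol:
            break
    return v


def greedy_policy(v):
    """pi(s) = argmin_a q(s, a) for every non-terminal state."""
    return {
        s: min(ACTIONS, key=lambda a: q_value(v, s, a))
        for s in STATES
        if s not in TERMINAL
    }


if __name__ == "__main__":
    v = value_iteration()
    print(v)
    print(greedy_policy(v))
```

In this toy model the resulting policy alerts only in the `near` state: at `far` the expected cost of waiting is lower than the cost of an alert, which is the trade-off between safety and alert rate that the cost function is meant to encode.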

One issue with a discrete representation of a continuous state space is that the cost of the DP solution scales linearly with the number of points in the state space. This means that, for example, doubling the number of points for a discretized variable also doubles the time and storage needed to find a solution. It also implies that adding a new variable with $k$ possible values to the state increases the solution time and storage space by a factor of $k$. Thus, for a problem with $n$ variables with $k$ possible values each, the solution cost is $O(k^n)$, so it scales exponentially with the number of variables in the state space. This is a significant limitation because an airborne encounter has many relevant variables, and this property of a discrete POMDP formulation forces the ACAS X solution to approximate the state space by limiting the number of variables in it.

The size of the state space for a three-dimensional aircraft encounter makes a solution impractical to find due to high storage and computational complexity. To get around this, the logic that issues vertical maneuvers is separated from the logic that issues horizontal maneuvers. The horizontal logic defines a state space with variables in the horizontal XY plane, and summarizes the vertical components with a variable $\tau$ representing the time until loss of vertical separation, which is calculated by using a probabilistic model of the aircraft dynamics [Kochenderfer and Chryssanthacopoulos, 2011]. The analogous process is followed to get a vertical logic, which is combined with the horizontal logic to get a logic that works in three spatial dimensions.


2.4.1 ACAS Xa, Xu, and sXu

ACAS Xa is a collision avoidance system designed for large, manned aircraft [Kochenderfer et al., 2012]. Its collision avoidance logic consists of maneuvers solely in the vertical direction. The ACAS Xu system is designed for unmanned systems. It is different in that it has the ability to maneuver not only in the vertical direction, but also in the horizontal direction. There is also the ACAS sXu system for small unmanned aircraft, which adapts the Xu solution approach to the requirements and regulations that apply to such aircraft [Alvarez et al., 2019]. For the purposes of this work, we explore collision avoidance logics similar to ACAS Xu, which can issue alerts both for the horizontal and the vertical dimensions.

In this work, we use two different versions of ACAS Xu as baseline comparisons to evaluate the performance of our 3D logics. The Xu versions are 4.1 and 5. Version 4.1 carries both horizontal and vertical logics for maneuver selection. It uses a logic selection function that uses the state of the encounter to determine whether to use the vertical logic or the horizontal logic, and then continues using the selected logic until the conflict is resolved [Owen and Kochenderfer, 2016]. Version 5 also has horizontal and vertical logics; however, it uses both logics at the same time for every decision step of an encounter, allowing it to issue horizontal and vertical advisories simultaneously if necessary [Owen et al., 2019].

2.5 Neural Networks Logic Representation

In this thesis, we use deep reinforcement learning algorithms as a different approximation scheme to the discretized tabular solution. DRL solutions reduce the number of parameters needed to encode a policy by using a neural network instead of a table. The solution is then found by updating the neural network’s parameters. One method of doing this, and the one used here, is deep Q-learning, which extends the Q-learning algorithm to learn a Q-function represented by a neural network.

Previous work on ACAS X has shown that deep neural networks (DNNs) can be successful in the task of approximating the Q-function for the collision avoidance problem. In [Julian et al., 2019], the authors used a DNN to replace a numerical table as the Q-function. They used a table that had been pre-trained with DP as the target for the DNN, and trained the DNN to approximate the function encoded by the table. This approach was successful in approximating the Q-function and was able to outperform the table policy in simulated encounters. However, this approach does not solve the curse of dimensionality problem because it uses supervised learning to learn the DNN Q-function, which requires a table to first be trained with DP to serve as the regression target for the DNN.

The logic was learned directly onto the DNN, which was trained using experience from simulation. The author showed that this approach could outperform the horizontal part of ACAS Xu on a set of validation encounters. We take a similar approach to solving the problem in Chapter 4, this time exploring a set of modifications that can stabilize the policy learned, and we extend it to three dimensions in Chapter 5.

2.6 Framework for Performance Evaluation

We use a set of metrics throughout this thesis to evaluate the performance of the logics we train. The metrics we consider are P(NMAC), P(Alert), and P(Reversal). Our goal is to minimize P(NMAC), while keeping P(Alert) and P(Reversal) under acceptable thresholds. P(NMAC) refers to the proportion of encounters where an NMAC occurred, P(Alert) the proportion of encounters where at least one RA was issued, and P(Reversal) the proportion of encounters where there was a reversal. Reversals are defined as switching the direction of the RA from one timestep to the next (e.g. going from issuing a climb maneuver to a descend maneuver). There are many more metrics that are important to measure the performance of a collision avoidance system and to meet the specifications, but these give us a high-level view of how the logics are performing.
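The three metrics can be computed from simulation logs along the following lines. This is a sketch under stated assumptions: the record layout (a dict with an `nmac` flag and a per-timestep `advisories` list), the advisory names, and the function names are hypothetical and introduced only for illustration. A reversal is counted when an advisory is followed at the next timestep by the opposite direction, matching the definition above.

```python
# Hypothetical advisory names; opposite directions define a reversal.
OPPOSITE = {"UP": "DOWN", "DOWN": "UP", "LEFT": "RIGHT", "RIGHT": "LEFT"}


def has_reversal(advisories):
    """True if any advisory is immediately followed by its opposite."""
    return any(
        OPPOSITE.get(prev) == curr
        for prev, curr in zip(advisories, advisories[1:])
    )


def encounter_metrics(encounters):
    """Return P(NMAC), P(Alert), P(Reversal) as proportions of encounters.

    Each encounter is assumed to be a dict with:
      "nmac":       bool, whether an NMAC occurred
      "advisories": list of advisory names, one per timestep
    """
    n = len(encounters)
    if n == 0:
        raise ValueError("need at least one encounter")
    nmacs = sum(1 for e in encounters if e["nmac"])
    alerts = sum(
        1 for e in encounters if any(a != "NONE" for a in e["advisories"])
    )
    reversals = sum(1 for e in encounters if has_reversal(e["advisories"]))
    return {
        "P(NMAC)": nmacs / n,
        "P(Alert)": alerts / n,
        "P(Reversal)": reversals / n,
    }


if __name__ == "__main__":
    demo = [
        {"nmac": False, "advisories": ["NONE", "UP", "DOWN"]},
        {"nmac": True, "advisories": ["NONE", "NONE"]},
        {"nmac": False, "advisories": ["LEFT", "LEFT", "NONE"]},
        {"nmac": False, "advisories": ["NONE"]},
    ]
    print(encounter_metrics(demo))
```

Note that all three quantities are per-encounter proportions, so an encounter with several alerts or several reversals still counts once.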

These performance metrics are collected by running the logics in simulated aircraft encounters. The logics are tested on fast-time Monte Carlo simulation environments that incorporate models of the aircraft behaviors, surveillance sources, and aircraft dynamics [Owen et al., 2019]. We use a set of encounters that was generated from a probabilistic model derived from radar observations of real-world encounters, the Lincoln Laboratory Correlated Encounter Model (LLCEM) [Kochenderfer et al., 2008]. This set contains a mix of encounters that have a high risk of resulting in a collision, as well as encounters with low risk, to test the logics in a wide range of scenarios.


Chapter 3

Tabular 3D Logic

In this chapter, we evaluate the performance that we would achieve if ACAS X used a full 3D representation of the state space. We first describe how the POMDP is set up with a 3D state space. We show that solving this problem directly is not feasible in practice due to the computational and storage requirements. Next, we describe how ACAS Xu (Xu) modifies the problem to make it tractable. We present two versions of ACAS Xu and describe how they work. Version 4 chooses either the horizontal or the vertical logic for each encounter, and then proceeds through the encounter with the sense chosen [Owen and Kochenderfer, 2016]. Version 5 uses alerts for horizontal and vertical maneuvers simultaneously. It evaluates a horizontal logic and a vertical logic, and outputs the maneuvers issued by both of them [Owen et al., 2019].

Then, we describe how the 3D logic is trained and evaluated. While the problem is not solvable in practice, we can divide the state space of the problem into smaller sets that can be trained and evaluated in a reasonable amount of time. We evaluate the logic in a simulation environment, and compare its performance with different parameters to the performance of Xu Version 4 (Xu 4) and Xu Version 5 (Xu 5). We discuss how the trade-offs made by the solution approach taken by Xu affect its performance, and the potential benefits that a 3D solution could offer.

3.1 Three-Dimensional Tabular Solution

We first define the POMDP of the 3D collision avoidance problem in more detail. We show that using the 3D state space would make the problem intractable for the operational restrictions of the collision avoidance problem. We discuss how to modify the problem so that it becomes tractable for the purpose of performance evaluation. In section 3.2, we cover the POMDP used by ACAS X and compare it to the 3D POMDP.


3.1.1 POMDP Definition

The state variables describe a snapshot of an encounter between two aircraft at a point in time. They include the position variables required to identify each aircraft, $(x_0, y_0, z_0)$ and $(x_1, y_1, z_1)$. However, because every maneuver is taken relative to the ownship aircraft, the state can be equivalently described only by the relative position of the intruder aircraft, $(x, y, z)$. The $(x, y, z)$ representation is still not the best, because NMACs are defined in terms of horizontal and vertical range, and because policies exhibit bilateral symmetry. For this reason, we use a polar representation of position for the horizontal plane along with the relative vertical position, $(r, \theta, z)$, where $r = \sqrt{x^2 + y^2}$ and $\theta = \tan^{-1}(y/x)$.

Another variable we care about is the heading of the ownship and intruder. The heading is the direction of displacement in the horizontal plane. Again, we can center our coordinate system on the ownship, so that its heading is always 0 radians, and let the intruder heading relative to the ownship be ψ.

We keep track of the horizontal speed of each aircraft, denoted by s0 and s1 for the ownship and intruder, respectively. We also keep track of the vertical speed of each aircraft, denoted by dz0 and dz1. The last variable we keep track of is the previous maneuver issued by the ownship, ra. This is done so that we can add rewards to the POMDP for starting new alerts, ceasing to alert too early, reversing a previous advisory, etc. The set of state space variables is shown in Table 3.1.

Variable # Edges Units Description

r 43 ft Horizontal distance to intruder.

θ 89 rad Intruder position angle relative to ownship heading.

ψ 89 rad Intruder heading relative to ownship heading.

z 17 ft Intruder altitude relative to ownship altitude.

s0 7 ft/s Horizontal speed of ownship.

s1 7 ft/s Horizontal speed of intruder.

dz0 3 ft/s Vertical speed of ownship.

dz1 3 ft/s Vertical speed of intruder.

ra 9 N/A Previous RA issued by logic.

Table 3.1: Three dimensional state space variables.

Next, we describe the set of maneuvers available to the ownship at each timestep. We trained two different 3D logics and tested their effectiveness. The first one, which we will call single-RA logic, can issue RAs from the set {NONE, LEFT, RIGHT, UP, DOWN}, so at each timestep it decides either to not alert, or to give a vertical or horizontal alert. The second one, called combined-RA logic, can issue a combination of vertical and horizontal maneuvers at each timestep, effectively choosing an alert from the set {NONE, LEFT, RIGHT, UP, DOWN, LEFT-UP, LEFT-DOWN, RIGHT-UP, RIGHT-DOWN}. We provide the details of the turn and climb rates of each maneuver in Table 3.2.


Maneuver Name Maneuver Type Maneuver Strength

COC (Clear of Conflict) N/A N/A

LEFT Horizontal 3 deg/s

RIGHT Horizontal -3 deg/s

UP Vertical 16.67 ft/s

DOWN Vertical -16.67 ft/s

Table 3.2: Aircraft maneuver rates.

To find a solution to the problem, we also have to define the reward function that assigns rewards to each state-action pair. The reward function for this problem issues penalties to different undesirable scenarios in an encounter. The magnitude of the penalty roughly corresponds to how undesirable the scenario is. For example, a small penalty is given if the logic issues an alert because not issuing any alerts is preferable to issuing one if there is no risk of collision. However, there is a large penalty if there is an NMAC, so if the state of an encounter is such that the risk of a collision is high, the penalty will be lower if alerts are issued and an NMAC is avoided than if we issued no alerts and got into an NMAC.

In Table 3.3 we present all the different penalties used in the 3D logic. The choice of when to give a reward and the magnitude of the reward varies depending on the problem. It is up to the designer of the system to choose an appropriate reward function that will effectively solve the problem being approached. The reward function chosen determines what kind of behavior is learned by the agent, so the tuning of the reward parameters and the choice of events that are rewarded can have a significant influence on the performance of the logic. Table 3.3 shows a typical reward function for the aircraft collision avoidance problem, which is very similar to the reward function used by ACAS X (Section 3.2). We discuss changes to the reward function in Section 3.3.3.

Reward Name Value Description of Event

NMAC -100 r ≤ 500 and |z| ≤ 100

Alert -0.02 Maneuver other than COC is issued.

Reversal -0.3 Current maneuver has opposite sense of previous maneuver. (e.g. LEFT → RIGHT)

Single Switch -0.21 Current maneuver is single-action, while previous is combined-action. (e.g. LEFT-DOWN → LEFT)

Combined Switch -0.15 Current maneuver is combined-action, while previous is single-action. (e.g. LEFT → LEFT-DOWN)

Cease Alerting -0.3 Current maneuver is COC and previous maneuver is not COC.

Table 3.3: Reward function for tabular 3D logic.
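As an illustration of how the penalties in Table 3.3 could combine into a per-step reward, the following is a minimal sketch rather than the implementation used here. The penalty values and the NMAC thresholds come from Table 3.3. The assumption that penalties add when several events occur at once, the function and variable names, and the string encoding of maneuvers are all hypothetical.

```python
# Penalty values copied from Table 3.3; everything else is an illustrative assumption.
PENALTY = {
    "nmac": -100.0,
    "alert": -0.02,
    "reversal": -0.3,
    "single_switch": -0.21,
    "combined_switch": -0.15,
    "cease_alerting": -0.3,
}
OPPOSITE = {"LEFT": "RIGHT", "RIGHT": "LEFT", "UP": "DOWN", "DOWN": "UP"}


def parts(maneuver):
    """Split a maneuver name such as 'LEFT-DOWN' into its component senses."""
    return set() if maneuver == "COC" else set(maneuver.split("-"))


def reward(r_ft, z_ft, prev, curr):
    """Sum every Table 3.3 penalty whose event occurs (additivity is assumed)."""
    p, c = parts(prev), parts(curr)
    total = 0.0
    if r_ft <= 500 and abs(z_ft) <= 100:
        total += PENALTY["nmac"]
    if c:
        total += PENALTY["alert"]
    if any(OPPOSITE[m] in p for m in c):
        total += PENALTY["reversal"]
    if len(c) == 1 and len(p) == 2:
        total += PENALTY["single_switch"]
    if len(c) == 2 and len(p) == 1:
        total += PENALTY["combined_switch"]
    if not c and p:
        total += PENALTY["cease_alerting"]
    return total
```

Under these assumptions, `reward(10000, 1000, "LEFT", "RIGHT")` accumulates both the alert and the reversal penalty, while `reward(10000, 1000, "COC", "COC")` is zero.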

Let us now briefly discuss why a practical solution cannot be achieved with the 3D state space definition of the POMDP, and why ACAS X takes its approach of dividing the state space into two parts. The size of the state-action space for the 3D POMDP is $|S| \times |A| = \prod_{s \in S_v} |s| \times |A|$, where $S_v$ is the set of variables in the state space, and $|s|$ is the number of values that variable $s$ can take. The value |S| × |A| represents the number of floating-point numbers needed to store the tabular logic. With our current definition of the state space, and assuming we store the values as 32-bit floating-point numbers, the logic would need a table of size > 1TB. The size of this table makes it impractical, since most aircraft that would benefit from this logic have a memory capacity on the order of a couple of GB. The time complexity of solving the problem is also a limiting factor in learning a logic. Using the Value Iteration algorithm to learn the logic, and training on a 64-core Intel Xeon processor, it would take more than a month to train a single logic table. Because new logics need to be trained frequently to adjust for new requirements and to test new changes, the long training times make a 3D tabular logic impractical to work with.

3.2 ACAS X POMDP

While the problem described in the previous section cannot be solved by naively applying a DP solution to it, it can be transformed into a similar problem with a tractable solution in practice. In this section, we describe the modifications made by the ACAS X approach to find a solution.

The vertical logic is responsible for issuing maneuvers that affect an aircraft’s altitude. However, the vertical logic assumes that it cannot control the horizontal variables. Thus, any state variables related to the horizontal plane (e.g. the horizontal speed, the relative bearing θ of the intruder, etc.) are not affected by the maneuvers issued by the vertical logic, so the solution to the vertical control problem can be found separately from the horizontal problem. The problem is thus divided into a controlled problem (vertical states and actions), and an uncontrolled problem (horizontal states). The vertical controlled problem is then solved under the assumption that the time until the aircraft lose horizontal separation is τ. The uncontrolled problem can then be solved by using a model for the aircraft’s horizontal dynamics and estimating the probability distribution of τ for each state in the uncontrolled problem [Kochenderfer and Chryssanthacopoulos, 2011].

By splitting the problem into a controlled and uncontrolled problem, the size of the logic tables is significantly reduced. In the vertical logic, we can view the variable τ as a summary of the uncontrolled horizontal problem, so horizontal state space variables like range, speed, etc. are replaced with τ . This reduces the training time of a vertical logic to just a couple of hours, down from the month-long training that would be required by the naive 3D state space approach.


3.3 Three-Dimensional Logic – Training and Evaluation

While the 3D logic is too computationally costly to train in practice, we can use the properties of the aircraft encounter POMDP to find a solution for the purpose of performance evaluation. We wish to compare the performance of the 3D logic and the ACAS Xu logic on a set of aircraft encounters that are similar to those found in the real airspace.

One property that we can take advantage of when evaluating the collision avoidance logic is that, in our model, the speeds of the aircraft do not change significantly during an encounter. Over the course of most of the encounters in the LLCEM evaluation set, the speeds of the ownship and the intruder stay within a relatively small range. This means that if we had a logic trained on the 3D state space, for most of the encounters in the evaluation set, the same variable edge value in the Q-table would be used for the ownship speed and the intruder speed throughout the whole encounter.

This property of aircraft encounters gives us a simple way to estimate the performance of a 3D logic table, while significantly reducing the training time and storage requirements. Our approach is to train a separate 3D logic table for each possible pair of values (s0, s1) of ownship speed and intruder speed. Then, we run the tables on simulated encounters with an intruder aircraft where the speed of each aircraft is close to the (s0, s1) values throughout the whole encounter. The safety and performance metrics are then calculated for each subset of the encounter space, and the results combined to get an estimate of the performance in the entire validation encounter set. These metrics can then be compared to the metrics collected from simulating the ACAS Xu logics on the same validation encounter set.

3.3.1 State Space Restriction for Training

The horizontal aircraft speed variables were chosen to remain constant for each table because they are the only variables that remain fairly constant during the course of an encounter. Position variables like the distance and angle between aircraft can go through a large number of values for each encounter. The same is true for the vertical speed variables, because they are directly controlled by the logic. Since the horizontal speed is assumed to be out of the logic’s control, it is a good candidate to be removed from the state space for the purpose of speeding up training.

One problem that arises when restricting the speeds to a single point is that the performance and safety metrics collected during evaluation do not accurately reflect the performance of a 3D logic. We want to estimate the performance metrics for a true 3D logic, but a 3D logic would be able to use speed as an input, and adjust the RAs issued depending on the speed of the aircraft. The LL validation set of encounters contains encounters where the speeds of the aircraft vary throughout the encounter. It also contains a wide range of encounters with different speeds.

Figure 3.1: Distribution of 5.5M encounters with low variance in aircraft speed (out of 10M).

The first problem can be addressed by filtering the encounters in the validation set to those encounters where the range of speeds that each aircraft goes through during the encounter is small. In Figure 3.1 we show the distribution of ownship speeds for the encounters whose speed varies by less than 10 ft/s during the entire encounter. Out of a total of 10M encounters, around 5.5M have this property, so there is still a significant number of encounters we can use to evaluate our logic. We can also see that the majority of the encounters occur between the speeds of 170 and 500 ft/s. If, for example, we used 6 different values for the speed of each of the ownship and the intruder, we could partition encounters into sets depending on which speed pair they are closest to. These partitions are shown in Table 3.4.

Intruder Speed (ft/s)

Ownship Speed (ft/s) 197.5 252.5 307.5 362.5 417.5 472.5

197.5 342,232 340,969 342,209 93,363 80,988 81,460

252.5 340,969 342,172 342,513 93,738 81,401 81,525

307.5 342,209 342,513 342,276 93,914 81,615 81,787

362.5 93,363 93,738 93,914 27,878 25,075 24,564

417.5 80,988 81,401 81,615 25,075 22,004 21,769

472.5 81,460 81,525 81,787 24,564 21,769 22,094

Table 3.4: Number of encounters with low speed variance for each table cutpoint (4.8M total).
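The filtering and partitioning step can be sketched as follows. The cutpoints are those of Table 3.4 and the 10 ft/s threshold is the one stated in the text. Representing an encounter as two lists of per-timestep speeds, and using the mean speed to pick the nearest cutpoint, are assumptions made for illustration; the names are hypothetical.

```python
# Speed cutpoints from Table 3.4; the 10 ft/s threshold is from the text.
CUTPOINTS = [197.5, 252.5, 307.5, 362.5, 417.5, 472.5]
MAX_SPEED_RANGE = 10.0  # ft/s


def nearest_cutpoint(speed):
    """Return the table cutpoint closest to the given speed."""
    return min(CUTPOINTS, key=lambda c: abs(c - speed))


def mean(values):
    return sum(values) / len(values)


def assign_partition(own_speeds, intr_speeds):
    """Return the (s0, s1) cutpoint pair for an encounter, or None if filtered out.

    own_speeds and intr_speeds are per-timestep horizontal speeds in ft/s
    (a hypothetical encounter representation).
    """
    for speeds in (own_speeds, intr_speeds):
        if max(speeds) - min(speeds) >= MAX_SPEED_RANGE:
            return None  # speed varies too much during the encounter
    return nearest_cutpoint(mean(own_speeds)), nearest_cutpoint(mean(intr_speeds))
```

For example, an encounter with ownship speeds near 203 ft/s and intruder speeds near 400 ft/s would be assigned to the (197.5, 417.5) table under this rule.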

We filter the encounters to those where the speeds of both aircraft change little throughout the encounter (i.e. by less than 10 ft/s). Because the goal is to evaluate tables where there is only one ownship speed and one intruder speed, we partition the encounter set into groups where the ownship-intruder speed pair is close to a vertex in the logic. For example, say that one of the tables has speed values of (s0, s1) = (200, 400). In order to evaluate this table, we want to run it only in encounters where the ownship speed is near 200 and the intruder speed is near 400, and where the speed remains approximately constant throughout each encounter. This is repeated for the rest of the possible values of (s0, s1).

Using the partitions we created from the set of validation encounters, we can then train the appropriate 3D table for each encounter set, and use them to estimate the performance of a 3D logic. The details of the training are covered in the next section.

3.3.2 Training Details

The discretized values for each of the variables were chosen to match those of the horizontal ACAS Xu logic where possible. The number of values of the angle variables θ and ψ was reduced slightly, and the number of values for the vertical variables z, dz0, and dz1 was significantly reduced compared to their ACAS Xu counterparts.

The tables for each speed pair were trained using the Value Iteration algorithm, along with the QMDP heuristic to account for uncertainty in the state due to noisy observations. The details for training the logic are the same as those for ACAS Xu [Owen et al., 2019], except that the whole problem is now assumed to be controlled so there is no τ variable.
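To make the two ingredients concrete, the sketch below shows tabular Value Iteration followed by QMDP action selection on a generic discrete MDP. It is a toy illustration under stated assumptions, not the training code: the nested-list layout of `P` and `R`, the default discount of 1.0, and the function names are all hypothetical. The only facts taken from the text are the use of Value Iteration, the QMDP heuristic, and the observation that 50 iterations were sufficient.

```python
def value_iteration(P, R, gamma=1.0, iters=50):
    """Tabular Value Iteration.

    P[a][s][s2] is the probability of moving from s to s2 under action a,
    R[s][a] is the immediate reward. Returns the table Q[s][a].
    """
    n_s, n_a = len(R), len(R[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        V = [max(q) for q in Q]  # greedy state values from the previous sweep
        Q = [[R[s][a] + gamma * sum(P[a][s][s2] * V[s2] for s2 in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
    return Q


def qmdp_action(Q, belief):
    """QMDP: weight each state's Q-values by the belief, then act greedily."""
    n_a = len(Q[0])
    scores = [sum(b * Q[s][a] for s, b in enumerate(belief)) for a in range(n_a)]
    return max(range(n_a), key=lambda a: scores[a])
```

QMDP treats the state uncertainty as if it disappeared after the current step, which is why a table solved for the fully observable problem can be reused directly on a belief over states.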

Because the tables are independent of one another, they can be created in parallel. The training for each table was conducted on the Lincoln Laboratory Supercomputing Center on 64-core Intel Xeon machines [Reuther et al., 2018]. The Value Iteration algorithm was found to converge well after 50 iterations for a range of reward functions used. The training of all the tables took a total of approximately 8 hours. The state value table for each speed pair had a size of around 2GB.
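The reported per-table size can be checked with a short calculation. This is a sketch that assumes the per-pair tables use exactly the edge counts of Table 3.1 with both speed variables fixed, one 32-bit value per state, and decimal gigabytes; none of these details is stated explicitly for the per-pair tables.

```python
from math import prod

# Edge counts per variable from Table 3.1.
EDGES = {"r": 43, "theta": 89, "psi": 89, "z": 17,
         "s0": 7, "s1": 7, "dz0": 3, "dz1": 3, "ra": 9}
BYTES_PER_VALUE = 4  # 32-bit floats, as assumed earlier in the chapter

# Fixing (s0, s1) removes both speed dimensions from each per-pair table.
per_pair_states = prod(n for name, n in EDGES.items() if name not in ("s0", "s1"))
per_pair_gb = per_pair_states * BYTES_PER_VALUE / 1e9

print(per_pair_states)        # 469010331
print(round(per_pair_gb, 2))  # 1.88
```

Under these assumptions each state value table holds about 469 million entries, which is consistent with the size of around 2GB quoted above.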

In Figure 3.2 we include an example of a trained 3D policy. In the left column, we show a horizontal slice of the policy for an intruder slightly below the ownship, and in the right column, we show a vertical slice of the policy for an intruder approaching head-on and slightly from the right. Note how in all cases the logic issues combined alerts to increase separation in both the horizontal and vertical directions, despite combined alerts having a higher cost. The policies are evaluated using nearest-neighbor interpolation, which accounts for the sharp decision boundaries. We don’t show plots where the intruder is above or to the left of the ownship because the policy issues maneuvers symmetric to the ones shown in the figure.


Figure 3.2: Policy plots for 3D state space logic. The left column shows horizontal slices of the policy for an intruder 30ft below the ownship, approaching (1) codirectionally, (2) head-on, and (3) crossing. The right column shows a vertical slice where the intruder is approaching head-on and 6 degrees to the right, and is (1) descending, (2) level, and (3) climbing.


Figure 3.3: Performance of 54 different trained models of 3D logic, compared with ACAS Xu versions 4.1 (green) and 5 (red).

3.3.3 Performance Comparison

We selected a set of reward parameters and performed a random hyperparameter search over a defined set of possible rewards. Each reward variable was chosen from a range of values, and we sampled values for each variable from their respective ranges. The result of this hyperparameter search is shown in Figure 3.3. The 54 different models trained are used to construct the estimated safety curve of the 3D model, which represents the trade-off between performance and safety of the 3D logic. We can see that the trained models perform better than Xu 4.1, and have approximately the same performance as Xu 5.

Each trained model shown in Figure 3.3 consists of a set of 3D tables, each evaluated on the validation encounters in the partition for its horizontal speed pair. The Xu 4.1 and Xu 5 models are evaluated on the same set of encounters.

3.4 Discussion

In this chapter, we showed a technique for quickly evaluating the performance of a 3D state space table without incurring the computational and storage costs of training one. We showed that by restricting the values of the horizontal speed variables, and training smaller tables for each speed value pair, we could use a simulation environment to measure how a 3D logic would perform.

We found that, despite being limited in the resolution of the state space variables and having fewer maneuvers available, the 3D logic could perform as well as the ACAS Xu 5 logic. Given the limitations our 3D logic had, such as the low resolution of the state variables, a limited set of maneuvers, and no online costs, these results provide evidence for the possibility of improving the performance of a collision avoidance system that uses a 3D state space. In the next chapter, we explore one possible approach that can do this.


Chapter 4

Deep RL Logic – Horizontal

4.1 Introduction

This chapter demonstrates an alternative approach to solving the collision avoidance problem. We use algorithms from the deep reinforcement learning (DRL) literature to overcome the limitations of the tabular approach. The family of DRL algorithms we use is DQN [Mnih et al., 2015], which replaces the Q-table used in DP with a neural network, and learns from experience in a simulation environment. The time to get a solution with the DQN algorithm does not scale exponentially with the number of variables in the state, so we can use the full set of variables of the 3D state space to train a logic. Because neural networks can work naturally with real-valued inputs, we also avoid having to discretize the state space to learn a solution, which leads to smoother decision boundaries for the policy.

We begin by defining the problem’s POMDP in detail, covering the relevant parameters for the state, observation, action, and reward spaces. We implicitly define the state-action transition distribution p(s′|s, a) by creating a simulation environment, defining a stochastic model for the behavior of the intruder aircraft, and defining how aircraft states are propagated at every timestep. Having defined the problem, we then proceed to describe how it is solved by learning a policy. We cover the relevant parts of the modified DQN algorithm used, and why each is included. We then analyze the policies learned using a base set of parameters. Finally, we experiment with different modifications of the learning algorithm and critical hyperparameters to choose the settings that learn the best policies.


4.2 POMDP Specifications

The following sections fully define the POMDP model under which our ownship aircraft operates. We define the states, observations, rewards, and actions in the environment.

4.2.1 State Space

The state space defines the ground truth of the encounter. It consists of the position, velocities, and acceleration variables for each aircraft, along with a variable representing the previous maneuver taken by the ownship. The full set of aircraft variables is shown in Table 4.1.

Variable Range Units Description

x [−30,000, 30,000] ft X-coordinate position of aircraft.

y [−30,000, 30,000] ft Y-coordinate position of aircraft.

v [200, 225] ft/s Speed of aircraft with respect to the ground.

φ [−π, π] rad Angle of aircraft heading.

φ′ [−3, 3] deg/s Turn rate of aircraft.

Table 4.1: Aircraft state variables.

The collection of variables in Table 4.1 is enough to fully identify the state of each aircraft in our simulation environment. Letting the state at time $t$ of the ownship aircraft be denoted by $s_0^{(t)} = (x_0^{(t)}, y_0^{(t)}, v_0^{(t)}, \phi_0^{(t)}, \phi_0^{\prime (t)})$, the state of the intruder by $s_1^{(t)} = (x_1^{(t)}, y_1^{(t)}, v_1^{(t)}, \phi_1^{(t)}, \phi_1^{\prime (t)})$, and the previous maneuver taken by the ownship as $a_0^{(t-1)}$, then the state of the POMDP at time $t$ is $s^{(t)} = (s_0^{(t)}, s_1^{(t)}, a_0^{(t-1)})$.

4.2.2 Action Space

The action space describes the set of maneuvers that the ownship aircraft can take. It consists of three different actions: RIGHT, LEFT, and NONE, corresponding to right turn, left turn, and no maneuver. The rate at which the aircraft turns is determined by the environment.

4.2.3 Observation Space

The set of variables in the observation space is what the collision avoidance agent has available for learning a policy, and for evaluating the policy to issue maneuvers.

There is no set of prescribed variables that we are forced to use, so the observation variables are defined according to how we want to model the environment. We choose a set of observation variables that is consistent with the amount of information an aircraft would have in an encounter. Another factor we consider is how well suited they are for evaluation by a neural network. The observation variables are scaled down to comparable ranges, and broken into parts whenever that makes it easier to optimize the parameters of the neural network. The variables are normalized to be in the range [−1, 1].

In Table 4.2 we present a base set of variables that we use in our environment.

Variable Description

ρ Distance between aircraft.

θx Cosine of angle of intruder position w.r.t. ownship heading.

θy Sine of angle of intruder position w.r.t. ownship heading.

ψx Cosine of angle of intruder heading w.r.t. ownship heading.

ψy Sine of angle of intruder heading w.r.t. ownship heading.

v0 Speed of ownship aircraft.

v1 Speed of intruder aircraft.

φ′₀ Turn rate of ownship aircraft.

φ′₁ Turn rate of intruder aircraft.

Table 4.2: Observation space variables.
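The mapping from the aircraft states of Table 4.1 to the observation of Table 4.2 could look like the sketch below. The normalization constants are assumptions derived from the ranges in Table 4.1 and the 30,000 ft episode threshold, the dictionary layout of a state is hypothetical, and clipping the range at 1 is an illustrative choice.

```python
import math

# Normalization constants are assumptions chosen from the ranges in Table 4.1
# and the 30,000 ft episode threshold; the source does not list them.
MAX_RANGE_FT = 30000.0
MAX_SPEED_FTPS = 225.0
MAX_TURN_RATE_DEGPS = 3.0


def observe(own, intr):
    """Build the Table 4.2 observation from two aircraft states.

    Each state is a dict with keys x, y (ft), v (ft/s), phi (rad),
    and dphi (deg/s); this layout is hypothetical.
    """
    dx, dy = intr["x"] - own["x"], intr["y"] - own["y"]
    rho = math.hypot(dx, dy)
    theta = math.atan2(dy, dx) - own["phi"]  # intruder bearing w.r.t. ownship heading
    psi = intr["phi"] - own["phi"]           # intruder heading w.r.t. ownship heading
    return [
        min(rho / MAX_RANGE_FT, 1.0),
        math.cos(theta), math.sin(theta),
        math.cos(psi), math.sin(psi),
        own["v"] / MAX_SPEED_FTPS,
        intr["v"] / MAX_SPEED_FTPS,
        own["dphi"] / MAX_TURN_RATE_DEGPS,
        intr["dphi"] / MAX_TURN_RATE_DEGPS,
    ]
```

Splitting each angle into its cosine and sine keeps the inputs in [−1, 1] and avoids the jump that a raw angle has at ±π, which is one way to read "broken into parts" above.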

4.2.4 Reward Space

We use two negative rewards to penalize the ownship for undesirable behavior. Our objective is to minimize the number of NMACs in the airspace, while at the same time keeping the rate of alerts issued at a reasonable level. The rewards used are shown in Table 4.3.

Reward Name Value Description

NMAC -1 Distance between aircraft goes below the NMAC threshold, ρ ≤ ρNMAC.

ALERT -0.01 Ownship aircraft issues a maneuver other than NONE.

Table 4.3: Reward space variables.

The choice of when to issue a reward and the value of the reward are choices made when modeling the problem.

4.3 Simulation Environment

Part of the definition of a POMDP is a transition function, which assigns a probability distribution over next states, given a current state and an action, i.e. p(s′|s, a). This function is difficult to define analytically for our problem of collision avoidance, and we will soon see that our solution method does not require it. Instead, we implicitly define a transition function by creating a simulation environment where aircraft encounters evolve, approximating the evolution of aircraft encounters in the real world.


4.3.1 Encounter Generation

Our model learns a policy by collecting experience from the environment and trying to maximize the rewards it collects in an encounter. An encounter consists of a starting state, and a series of intruder states that describe how the intruder aircraft will behave. The maneuvers issued by the ownship aircraft determine how its own state evolves. During the training phase, the aircraft collects sets of state transitions from the environment and learns what the best actions are at each state from the reward signal it receives.

The behavior of the intruder aircraft is generated as follows. Each environment has two variables which determine the behavior of the intruder: a, the average length of NONE maneuvers, and b, the average length of maneuvers other than NONE. We model the behavior of the intruder as a Markov chain where the next intruder action depends only on the current intruder action. Letting $p = 1/a$ and $q = 1/b$, the transition probabilities for the intruder’s actions are as follows:

$$P = \begin{pmatrix} 1 - p & p/2 & p/2 \\ q/2 & 1 - q & q/2 \\ q/2 & q/2 & 1 - q \end{pmatrix}$$

where $P_{ij}$ represents the probability of transitioning to action j given that we are at action i, and the actions are mapped to integers as follows: NONE → 0, LEFT → 1, RIGHT → 2.
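A minimal sketch of sampling an intruder action sequence from this Markov chain is given below. The matrix entries follow the definition of P above; the function names, the choice to start every sequence at NONE, and the use of Python's `random.Random` are assumptions made for illustration.

```python
import random

NONE, LEFT, RIGHT = 0, 1, 2


def transition_matrix(a, b):
    """Rows are the current action, columns the next action (NONE, LEFT, RIGHT)."""
    p, q = 1.0 / a, 1.0 / b
    return [
        [1 - p, p / 2, p / 2],
        [q / 2, 1 - q, q / 2],
        [q / 2, q / 2, 1 - q],
    ]


def sample_intruder_actions(a, b, n_steps, rng, start=NONE):
    """Sample an intruder action sequence; the NONE start is an assumption."""
    P = transition_matrix(a, b)
    actions = [start]
    for _ in range(n_steps - 1):
        row = P[actions[-1]]  # distribution over the next action
        actions.append(rng.choices([NONE, LEFT, RIGHT], weights=row)[0])
    return actions
```

Because the chain leaves NONE with probability p at each step, the expected run length of NONE is 1/p = a steps, and likewise 1/q = b steps for a turn, which matches the interpretation of a and b as average maneuver lengths.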

The starting state for each encounter is sampled so that, for both the ownship and the intruder, the starting heading and speed are sampled from a uniform distribution. The sequence of intruder actions is then sampled according to the transition matrix P . The ownship is then placed at a starting state so that around 40% of the time there will be an NMAC in the encounter if the ownship takes no action.

This method of encounter generation is a way to sample the data such that we get informative training samples. NMACs are very rare in randomly-generated samples, so this way of creating training encounters increases the likelihood of having NMACs in the training set, and thus increases the strength of the reward signal in states whose value estimates are critical to avoiding NMACs.

Most states in an encounter should have values that are very close to 0, so we want to avoid spending too much time training on those states. We would rather learn from states that will give us a reward signal on NMACs. States generated with this strategy also have the benefit that they might have good information on alerts and reversals, since the logic has to learn to take some action sequence to escape an NMAC.

The way that training encounters are generated is very important for what type of policy we learn. Other ways of generating training data are explored in the experiments run in section 4.6.


4.3.2 Simulation Dynamics

At each step of the simulation, the aircraft are propagated forward according to their velocity vector. The environment’s position propagation dynamics are simplified, so changes in direction happen instantaneously and positions are propagated afterward. We also add some Gaussian noise to the state variables at each timestep to prevent the agent from overfitting to the environment dynamics. The aircraft’s speed, position, turn rate, and direction are all modified with small amounts of noise.

Each episode terminates when the distance between aircraft is greater than some threshold ρr. In our control parameters, this threshold is ρr = 30,000 ft.
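The propagation and termination rules described in this section can be sketched as a single step function. The ordering (heading change first, then position update), the Gaussian noise on every state variable, and the 30,000 ft threshold follow the text. The timestep, the noise scales, the 3 deg/s turn rate with the sign convention of Table 3.2, and all names are assumptions; the heading is also left unwrapped for brevity.

```python
import math
import random

TURN_RATE_DEGPS = {"NONE": 0.0, "LEFT": 3.0, "RIGHT": -3.0}  # signs as in Table 3.2
RHO_R_FT = 30000.0  # episode termination threshold from the text
# Hypothetical noise standard deviations; the source does not give them.
DEFAULT_NOISE = {"x": 1.0, "y": 1.0, "v": 0.1, "phi": 0.001, "dphi": 0.01}


def step(state, action, rng, dt=1.0, noise=None):
    """Advance one aircraft: apply the heading change first, then move."""
    if noise is None:
        noise = DEFAULT_NOISE
    dphi = TURN_RATE_DEGPS[action]
    phi = state["phi"] + math.radians(dphi) * dt  # instantaneous direction change
    x = state["x"] + state["v"] * math.cos(phi) * dt
    y = state["y"] + state["v"] * math.sin(phi) * dt
    new = {"x": x, "y": y, "v": state["v"], "phi": phi, "dphi": dphi}
    # Perturb every state variable with zero-mean Gaussian noise.
    return {k: val + rng.gauss(0.0, noise[k]) for k, val in new.items()}


def episode_done(own, intr):
    """True once the aircraft are farther apart than the threshold."""
    return math.hypot(intr["x"] - own["x"], intr["y"] - own["y"]) > RHO_R_FT
```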

4.4 Horizontal Logic with Deep RL

4.4.1 DQN Description

The model we use to learn the collision avoidance policy is DQN [Mnih et al., 2015]. The basic idea is that the state-action value function Q(s, a) is represented by a deep neural network. This deep neural net takes in the observations from the environment simulation, and outputs a real number for every action, representing the value of taking that action at a given step.

The neural net is trained by optimizing the following loss function:

$$L(x, x', a, r \mid \theta, \theta') = \left( \max_{a'} Q_{\theta'}(x', a') + r - Q_{\theta}(x, a) \right)^2$$

where $Q_{\theta'}$ is the static target network. During training, the back-propagation gradients are only taken with respect to the θ parameters of the learning network. The θ′ parameters of the target network are updated with the θ parameters at periodic intervals. The freezing of the target network is done to increase stability during training.

The training data D is a set consisting of state transitions of the form (x, x′, a, r), where x is the current state, a the action taken at x, x′ the next state, and r the reward collected from the transition. The training data is collected from the simulation environment and stored in a replay buffer that stores the most recently seen transitions. A mini-batch of transitions is sampled from the replay buffer at each step during training, and the loss above is minimized through gradient descent.

The Q-learning algorithm is known to overestimate state-action values. It has this bias because the same Q-function is used to choose the best-next action and to calculate the value of the next state [Van Hasselt et al., 2016].


To decrease this bias, we use the double Q-learning modification to the loss function:

$$L(x, x', a, r \mid \theta, \theta') = \left( Q_{\theta'}\big(x', \arg\max_{a'} Q_{\theta}(x', a')\big) + r - Q_{\theta}(x, a) \right)^2$$

The modification uses the training network to choose the best next action and the target network to evaluate the value of that action, rather than using the same network to both choose an action and estimate its value.
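The difference between the two losses can be seen in a few lines of code. This sketch treats each network as a plain function that returns a list of Q-values, one per action, which is an assumption made to keep the example free of any deep learning library. Like the losses as written above, it has no discount factor and omits terminal-state handling.

```python
def dqn_loss(q_online, q_target, x, x_next, a, r):
    """Squared TD error with the max taken over the target network."""
    target = r + max(q_target(x_next))
    return (target - q_online(x)[a]) ** 2


def double_dqn_loss(q_online, q_target, x, x_next, a, r):
    """Online network selects the next action; target network evaluates it."""
    online_next = q_online(x_next)
    best = max(range(len(online_next)), key=lambda i: online_next[i])
    target = r + q_target(x_next)[best]
    return (target - q_online(x)[a]) ** 2
```

When the target network happens to assign a high value to an action that the online network does not prefer, the first loss bootstraps from that high value while the second does not, which is the mechanism by which the overestimation bias is reduced.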

4.4.2 Training Strategy

We use the DQN learning algorithm implemented in OpenAI Baselines [Dhariwal et al., 2017], modified as described in this section and in section 4.6. This implementation of DQN uses TensorFlow version 1.14 as a back-end for building the Q-function neural network. The simulation environment used to generate the training data was built as a custom OpenAI Gym environment [Brockman et al., 2016]. The Gym environment provides a standard interface that is used by the Baselines model during training and evaluation of the logic.

The observations fed into the neural net are normalized to be in the range [-1,1], and rewards are normalized to be in [-1,0], following the common practice of normalizing inputs and outputs to improve learning speed. The network architecture used has 7 fully-connected layers of 256 ReLU units each. At every timestep, the environment is advanced forward by one step, and a batch of 32 transitions is sampled from the replay buffer to perform a gradient step. In order to minimize our loss function, we use the Adam optimizer [Kingma and Ba, 2014]. We anneal the learning rate over the course of training.

The following are additional modifications made to the learning algorithm which we found were helpful in learning successful policies. These training details are incorporated into all the policies we present in the sections that follow:

1. Learning starts + replay buffer size: The environment is allowed to run for a number of steps equal to the size of the replay buffer. This allows the replay buffer to fill up with a varied distribution of samples by the time we start learning. The size of the buffer should be large enough that it contains a variety of transitions representative of the state space. The distribution of significant transitions in the buffer (e.g. transitions with NMACs, reversals, alerts, etc.) influences the safety and coverage of the policy learned.

2. Learning rate annealing: The learning rate is decreased throughout the training. This means that the gradient steps are larger at the beginning of training, and get smaller as the training progresses. The annealing is done on an exponential schedule.


Figure 4.1: DRL horizontal policies learned with 3 different seeds with the same parameter set.

3. Target network update frequency: The frequency with which the target network is updated is changed throughout the training. During the early stages of training, when the reward signal is noisy and the learning rate is large, the network is updated more frequently. As the agent learns, the network is updated less frequently to improve the stability of the policy.

4. Exploration fraction annealing: The probability ε of choosing an action at random is decreased over the first portion of training. In the early stages of training it starts at ε = 1, meaning that the actions taken by the agent are completely random. The value of ε is then decreased over the course of training until it reaches a chosen value ε_0. It is important that ε_0 not be too small, so that the agent continues to explore. Randomness in actions also adds a degree of regularization to the learned logic; because the agent cannot reliably take the desired action at each timestep, a successful policy must be strategic and issue preventive alerts to avoid NMACs.
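The two annealing schedules in items 2 and 4 can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the exponential form of the learning rate schedule and the 1.5M-timestep training length come from the text, while the linear shape of the exploration decay, the function names, and all numeric rates are assumptions chosen for the example.

```python
def exponential_lr(step, total_steps, lr_start, lr_end):
    """Learning rate decayed exponentially from lr_start (step 0) to lr_end (last step)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lr_start * (lr_end / lr_start) ** frac


def linear_epsilon(step, total_steps, exploration_fraction, eps_final, eps_start=1.0):
    """Exploration probability decayed from eps_start to eps_final over the first
    `exploration_fraction` of training, then held constant.
    The linear shape is an assumption; the text only says the value decreases."""
    anneal_steps = exploration_fraction * total_steps
    frac = min(max(step / anneal_steps, 0.0), 1.0)
    return eps_start + frac * (eps_final - eps_start)


TOTAL_STEPS = 1_500_000  # training length reported in Section 4.5
for step in (0, 150_000, 750_000, 1_500_000):
    # lr_start, lr_end, exploration_fraction and eps_final are hypothetical values.
    lr = exponential_lr(step, TOTAL_STEPS, lr_start=1e-3, lr_end=1e-5)
    eps = linear_epsilon(step, TOTAL_STEPS, exploration_fraction=0.1, eps_final=0.05)
    print(f"step={step:>9,d}  lr={lr:.2e}  epsilon={eps:.3f}")
```

Both functions depend only on the current step, so the same schedule can be recomputed at any checkpoint without storing extra state.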

4.5 Baseline Results

Our testing of the learning algorithm begins with a set of parameters that were found heuristically to be effective in learning a good policy (the full set of parameters is included in the Appendix). The validity of these parameters is explored in the experiments in the sections that follow. Here we present the results from training a policy with this base set of parameters, and analyze its performance.

We trained 3 Q-networks, each with a different random seed. Plots of the policies learned can be seen in Figure 4.1 for each of the trained models.

The policies were trained for 1.5M timesteps. Figure 4.2 shows the performance metrics for each of the policies, obtained by evaluating each policy at checkpoints every 100K timesteps on a set of 2K encounters. The figure shows that the policies usually start with high alert rates, which then trend downward throughout training as the logic learns to better identify false alarms. From these checkpoint policies, we choose the best policy, where best is defined as having the lowest P(NMAC) with P(Alert) ≤ 0.7 and P(Reversal) ≤ 0.1 on a holdout set of simulated encounters. The results of simulation on the holdout validation set are shown in Table 4.4.

Figure 4.2: Performance metrics (P(NMAC), P(Alert), P(Reversal)) on the control policies.

Model ID   P(NMAC)   P(Alert)   P(Reversal)
1          0.0241    0.6286     0.0587
2          0.0331    0.6001     0.0328
3          0.0201    0.6437     0.0330

Table 4.4: Performance of the best model for each seed with alert rate ≤ 0.7, reversal rate ≤ 0.1. The base NMAC rate on this validation set for an aircraft with no logic is 0.4228.
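The checkpoint selection rule (lowest P(NMAC) subject to limits on the alert and reversal rates) can be sketched as below. The function name, the dictionary keys, and the checkpoint metrics are hypothetical and made up for illustration; only the two limits come from the text.

```python
def select_best_checkpoint(checkpoints, max_alert=0.7, max_reversal=0.1):
    """Return the checkpoint with the lowest P(NMAC) among those that satisfy
    the alert and reversal limits, or None if no checkpoint satisfies them."""
    feasible = [
        c for c in checkpoints
        if c["p_alert"] <= max_alert and c["p_reversal"] <= max_reversal
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c["p_nmac"])


# Made-up checkpoint metrics for illustration only (not results from this work).
checkpoints = [
    {"step": 100_000, "p_nmac": 0.010, "p_alert": 0.95, "p_reversal": 0.20},  # over both limits
    {"step": 200_000, "p_nmac": 0.040, "p_alert": 0.68, "p_reversal": 0.06},
    {"step": 300_000, "p_nmac": 0.025, "p_alert": 0.62, "p_reversal": 0.04},
]
best = select_best_checkpoint(checkpoints)
print(best["step"])  # 300000
```

Treating the alert and reversal rates as hard limits, rather than folding them into a single weighted score, prevents an early checkpoint that alerts almost constantly from being selected only because its NMAC rate is low.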

The performance on the holdout set of encounters suggests that the policy is able to find a balance between safety and operational performance: it reduces the NMAC rate from 0.4228 with no logic to between 0.0201 and 0.0331, while keeping the alert and reversal rates within the chosen limits. Figure 4.3 shows an example of an encounter from the validation set where the logic successfully maneuvers to avoid an intruder aircraft. In this encounter, the aircraft are approaching head-on, and the ownship adjusts its trajectory slightly to the left to avoid what would otherwise have been an NMAC. This encounter was not seen during training, and the parameters used to generate it are different from those used to generate the training encounters.
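To make the size of the safety improvement concrete, the short calculation below divides each model's P(NMAC) from Table 4.4 by the base NMAC rate for an aircraft with no logic. This quotient is often called a risk ratio in the collision avoidance literature; the term and the function name are not taken from the text above and are used here only for illustration.

```python
BASE_P_NMAC = 0.4228  # NMAC rate with no logic, from the caption of Table 4.4
P_NMAC_BY_MODEL = {1: 0.0241, 2: 0.0331, 3: 0.0201}  # P(NMAC) column of Table 4.4


def risk_ratio(p_nmac_with_logic, p_nmac_without_logic):
    """Ratio of the NMAC rate with the logic to the rate without it."""
    return p_nmac_with_logic / p_nmac_without_logic


for model_id, p_nmac in P_NMAC_BY_MODEL.items():
    ratio = risk_ratio(p_nmac, BASE_P_NMAC)
    print(f"model {model_id}: risk ratio = {ratio:.3f}, reduction = {1 - ratio:.1%}")
```

For all three seeds the ratio is below 0.1, meaning that each learned policy removes more than 90% of the NMACs that occur on this validation set without any logic.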

4.6 Experiments

In this section, we add features to the learning algorithm and measure their effect on the policies learned. We judge our policies qualitatively, by looking at plots of the policy on a slice of the state space and plots of the trajectory taken at each encounter. We also judge them quantitatively, by comparing their performance on holdout sets of encounters.

Figure 4.3: Example encounter of the learned policy with another aircraft.

We ran the experiments using the following procedure:

1. Using the set of hyperparameters from Section 4.5 as a starting point, choose one hyperparameter and modify it. This allows us to measure the effect of that hyperparameter on performance, all else being equal.

2. Train a policy for each value of the hyperparameter in the set of possible values we want to test.

3. Evaluate the performance of the policy at regular intervals during training. We do this by running the policy on a set of simulated encounters and measuring the rate of NMACs, alerts, and reversals on that set.

4. Choose the model checkpoint with the best performance on the validation set, and evaluate it on a different, larger test set of encounters to get an estimate for the true performance of the policy.
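The four steps above can be sketched as a one-at-a-time hyperparameter sweep. This is an outline under stated assumptions, not the code used for the experiments: `run_sweep`, `train_fn`, and `evaluate_fn` are hypothetical names, the toy stand-ins return arbitrary numbers so that the example runs, and "best" is taken to mean the lowest P(NMAC) within the alert and reversal limits of Section 4.5.

```python
def run_sweep(base_params, name, values, train_fn, evaluate_fn,
              validation_set, test_set, max_alert=0.7, max_reversal=0.1):
    """Vary one hyperparameter, holding all others at their base values."""
    results = {}
    for value in values:
        # Step 1: copy the base parameters and change a single hyperparameter.
        params = dict(base_params)
        params[name] = value
        # Steps 2-3: train, keeping a policy snapshot at each checkpoint,
        # then score every snapshot on the validation encounters.
        snapshots = train_fn(params)
        scored = [(evaluate_fn(policy, validation_set), policy) for policy in snapshots]
        # Step 4: keep the snapshot with the lowest P(NMAC) among those within
        # the limits, then score it once on the separate test set.
        feasible = [s for s in scored
                    if s[0]["p_alert"] <= max_alert and s[0]["p_reversal"] <= max_reversal]
        if not feasible:
            results[value] = None  # no snapshot met the limits
            continue
        _, best_policy = min(feasible, key=lambda s: s[0]["p_nmac"])
        results[value] = evaluate_fn(best_policy, test_set)
    return results


def toy_train(params):
    # Hypothetical stand-in: two fake snapshots; the first violates the alert limit.
    return [
        {"params": params, "p_nmac": 0.02, "p_alert": 0.90, "p_reversal": 0.05},
        {"params": params, "p_nmac": 0.03, "p_alert": 0.60, "p_reversal": 0.04},
    ]


def toy_evaluate(policy, encounters):
    # Hypothetical stand-in: report the metrics stored with the fake snapshot.
    return {k: policy[k] for k in ("p_nmac", "p_alert", "p_reversal")}


results = run_sweep({"batch_size": 32}, "batch_size", [32, 64],
                    toy_train, toy_evaluate, validation_set=[], test_set=[])
print(results)
```

Selecting on the validation set and reporting on a separate test set keeps the reported numbers from being biased by the checkpoint selection itself.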

In the sections that follow, we present the modifications we tested and an analysis of the results.

4.6.1 Experimental Variables

The algorithm parameters we evaluate are described below, along with an explanation of why that parameter is being tested:
