On Learning and Generalization in Unstructured Task Spaces


Université de Montréal

On Learning and Generalization in Unstructured Task Spaces

by

Bhairav Mehta

Department of Computer Science and Operations Research, Faculty of Arts and Sciences

Thesis presented in view of obtaining the degree of Master of Science (M.Sc.) in Computer Science

August 21, 2020


Université de Montréal

Faculty of Arts and Sciences

This thesis, entitled

On Learning and Generalization in Unstructured Task Spaces

presented by

Bhairav Mehta

was evaluated by a jury composed of the following members:

Pierre-Luc Bacon (president and rapporteur)

Liam Paull (research supervisor)

Christopher J. Pal (co-supervisor)

Hugo Larochelle (jury member)


Summary

Robotic learning holds incredible promise for embodied artificial intelligence, with reinforcement learning seemingly a perfect fit for the robots of the future: learning from experience, adapting on the fly, and generalizing to unseen scenarios.

However, our current reality requires large amounts of data to train even the simplest robotic reinforcement learning policies, which has renewed interest in training entirely in efficient physics simulators. Since the goal is embodied intelligence, policies trained in simulation are transferred to real hardware for evaluation; however, because no simulation is a perfect model of the real world, transferred policies run into the sim2real transfer gap: the errors incurred when moving policies from simulators to the real world, caused by unmodeled effects in inaccurate, approximate physics models.

Domain randomization - the idea of randomizing all physical parameters in a simulator, forcing a policy to be robust to distributional shifts - has proven useful for transferring reinforcement learning policies to real robots. In practice, however, the method involves a difficult trial-and-error process and shows high variance in both convergence and performance. We introduce Active Domain Randomization, an algorithm that involves curriculum learning in unstructured task spaces (task spaces where a notion of difficulty - intuitively easy or hard tasks - is not readily available). Active Domain Randomization shows strong performance on zero-shot transfer to real robots. The thesis also introduces other variants of the algorithm, including one that allows a safety prior to be incorporated and one that applies to the field of meta-reinforcement learning. We also analyze curriculum learning from an optimization perspective and attempt to justify the benefits of the algorithm by studying gradient interference.

Keywords: domain randomization, robotics, transfer learning, curriculum learning


Abstract

Robotic learning holds incredible promise for embodied artificial intelligence, with reinforcement learning seemingly a strong candidate to be the software of robots of the future: learning from experience, adapting on the fly, and generalizing to unseen scenarios.

However, our current reality requires vast amounts of data to train the simplest of robotic reinforcement learning policies, leading to a surge of interest in training entirely in efficient physics simulators. As the goal is embodied intelligence, policies trained in simulation are transferred onto real hardware for evaluation; yet, as no simulation is a perfect model of the real world, transferred policies run into the sim2real transfer gap: the errors accrued when shifting policies from simulators to the real world due to unmodeled effects in inaccurate, approximate physics models.

Domain randomization - the idea of randomizing all physical parameters in a simulator, forcing a policy to be robust to distributional shifts - has proven useful in transferring reinforcement learning policies onto real robots. In practice, however, the method involves a difficult, trial-and-error process, showing high variance in both convergence and performance. We introduce Active Domain Randomization, an algorithm that involves curriculum learning in unstructured task spaces (task spaces where a notion of difficulty - intuitively easy or hard tasks - is not readily available). Active Domain Randomization shows strong performance on zero-shot transfer on real robots. The thesis also introduces other variants of the algorithm, including one that allows for the incorporation of a safety prior and one that is applicable to the field of Meta-Reinforcement Learning. We also analyze curriculum learning from an optimization perspective and attempt to justify the benefit of the algorithm by studying gradient interference.

Keywords: domain randomization, robotics, transfer learning, curriculum learning


List of tables

2.1 Both algorithms, when given trajectories representative of the target environment, perform strongly.

2.2 When given noisy or non-representative trajectories, GAIL fails to recover on the harder environment, whereas ADR+ is able to stay robust via the prior term.

2.3 ADR seems slightly more robust in data-limited scenarios, whereas GAIL fails to converge in limited data settings on the harder environment.

4.1 We compare agents trained with random curricula on different but symmetric task distributions p(τ). Changing the distribution leads to counter-intuitive drops in performance on tasks both in- and out-of-distribution.

4.2 Evaluating tasks that are qualitatively similar, for example running at a heading offset from the starting heading by 30 degrees to the left or right, leads to different performances from the same algorithm.

A.1 We summarize the environments used, as well as characteristics about the randomizations performed in each environment.


List of figures

1.1 A diagram of the data generation setup. We define regions of the training set by zoning off rectangles of (µ, σ) pairs, and then use standard library tools to generate samples from that distribution.

1.2 Two starkly different stories of generalization, differing only in the data distribution shown to the neural network. Figure (b) adds a small training area (upper-right red box) that, when included, leads to almost perfect generalization across the domain. Darker is higher error.

2.1 ADR proposes randomized environments (c) or simulation instances from a simulator (b) and rolls out an agent policy (d) in those instances. The discriminator (e) learns a reward (f) as a proxy for environment difficulty by distinguishing between rollouts in the reference environment (a) and randomized instances, which is used to train SVPG particles (g). The particles propose a diverse set of environments, trying to find the environment parameters (h) that are currently causing the agent the most difficulty.

2.2 Along with simulated environments, we display ADR on zero-shot transfer tasks onto real robots.

2.3 Agent generalization, expressed as performance across different engine strength settings in LunarLander. We compare the following approaches: Baseline (default environment dynamics); Uniform Domain Randomization (UDR); Active Domain Randomization (ADR, our approach); and Oracle (a handpicked randomization range of MES [8, 11]). ADR achieves near-expert levels of generalization, while both Baseline and UDR fail to solve lower MES tasks.

2.4 Learning curves over time in LunarLander. Higher is better. (a) Performance on particularly difficult settings - our approach outperforms both the policy trained on a single simulator instance ("baseline") and the UDR approach. (b) Agent generalization in LunarLander over time during training when using ADR. (c) Adaptive sampling visualized. ADR, seen in (b) and (c), adapts to where the agent is struggling the most, improving generalization performance by end of training.

2.5 In Pusher-3Dof, the environment dynamics are characterized by friction and damping of the sliding puck, where sliding correlates with the difficulty of the task (as highlighted by cyan, purple, and pink - from easy to hard). (a) During training, the algorithm only had access to a limited, easier range of dynamics (black outlined box in the upper right). (b) Performance measured by distance to target, lower is better.

2.6 Learning curves over time in (a) Pusher3Dof-v0 and (b) ErgoReacher on held-out, difficult environment settings. Our approach outperforms the policy trained with the UDR approach in terms of both performance and variance.

2.7 Zero-shot transfer onto real robots (a) ErgoReacher and (b) ErgoPusher. In both environments, we assess generalization by manually changing torque strength and puck friction respectively.

3.1 Effects of curriculum order on agent performance. A wrong curriculum, shown in blue, can induce high variance and overall poor performance.

3.2 Generalization for various agents who saw different MES randomization ranges. Crossing the Solved line "earlier" on the x-axis means more environments solved (better generalization).

3.3 Each axis represents a discretized main engine strength, from which gradients are sampled and averaged across 25 episodes. The heatmap shows the cosine similarity between the corresponding tasks on the X and Y axes. The different panels show gradient similarity (higher cosine similarity shown as darker blues) after 25%, 50%, 75%, and 100% training respectively. A bad curriculum, UberLow then UberHigh, as shown in the blue curve in Figure 3.1, seems to generate patches of incompatible tasks earlier in training, as shown by growing amounts of yellow.

3.4 A good curriculum, UberHigh then UberLow, as shown in the orange curve in Figure 3.1, seems to maintain better transfer through training, as shown by fewer patches of yellow in later stages of optimization (see Panel 3).

3.5 Policies trained with traditional domain randomization show large patches of incompatible tasks that show up early in training (interfering gradients shown as yellow), potentially leading to high variance in convergence, as noted in Chapter 2.

3.6 Policies trained with focused domain randomization seem to exhibit high transfer between tasks, which seems to enable better overall generalization by the end of training.

3.7 Active Domain Randomization shows its ability to adapt and bring the policy back to areas with less inter-task interference, as shown between Panel 2 and Panel 3.

3.8 In (a), final policies trained with UberHigh, UberLow (the orange curve in Figure 3.1) seem to be local minima. In (b), final policies trained with UberLow, UberHigh (the blue curve in Figure 3.1) seem to be saddles. In (c), final policies trained with MES ∼ U(8, 11) (Oracle in Figure 3.2) seem to be local minima, while in (d), final policies trained with MES ∼ U(8, 20) (UDR in Figure 3.2) seem to be saddles.

3.9 Curvature analysis through training for various curricula. In (a), the worse performing agent (blue curve, UberLow, UberHigh) seems to have a smoother optimization path than its better-performing, flipped-curriculum counterpart. In (b), policies trained with full domain randomization (blue curve), also empirically shown to be less stable in practice, show smoother optimization paths.

4.1 Various agents' final adaptation to a range of target tasks. The agents vary only in training task distributions, shown as red overlaid boxes. Redder is higher reward.

4.2 Meta-ADR proposes tasks to a meta-RL agent, helping learn a curriculum of tasks rather than uniformly sampling them from a set distribution. A discriminator learns a reward as a proxy for task difficulty, using pre- and post-adaptation rollouts as input. The reward is used to train SVPG particles, which find the tasks causing the meta-learner the most difficulty after adaptation. The particles propose a diverse set of tasks, trying to find the tasks that are currently causing the agent the most difficulty.

4.3 When a curriculum of tasks is learned with Meta-ADR, we see the stability of MAML improve. Redder is higher reward.

4.4 In the Ant-Navigation task, both uniformly sampled goals (top) and a learned curriculum of goals with Meta-ADR (bottom) are stable in performance. We attribute this to the extra components in the reward function. Redder is higher reward.

4.5 Ant-Velocity sees less of a benefit from curriculum, but performance is greatly affected by a correctly-calibrated task distribution (left). In a miscalibrated one (right), we see that performance from a learned curriculum is slightly more stable.

4.6 In the high-dimensional Humanoid Directional task, we evaluate many different training distributions to understand the effect of p(τ) on generalization in difficult continuous control environments. In particular, we focus on symmetric variants of tasks - task distributions that mirror each other, such as 0−π and π−2π in the right panel. Intuitively, when averaged over many trials, such mirrored distributions should produce similar trends of in- and out-of-distribution generalization.

4.7 In complex, high-dimensional environments, training task distributions can cause performance to vary wildly. Even in the Humanoid Directional task, Meta-ADR allows MAML to generalize across the range, although it too is affected in terms of total return when compared to the same algorithm trained with "good" task distributions.

4.8 Uniform sampling causes MAML to show bias towards certain tasks, with the effect being compounded with instability when using "bad" task distributions, here shown as ±0.3, ±1.0, ±2.0 in the 2D-Navigation-Dense environment.

4.9 When we correlate the final performance (x-axis) with the quality of adaptation (denoted as post-pre on the y-axis), we see a troubling trend. MAML seems to overfit to certain tasks, with many tasks that were already neglected during training showing worse post-adaptation returns.

A.1 Learning curves over time in reference environments. (a) LunarLander (b) Pusher-3Dof (c) ErgoReacher.

A.2 Sampling frequency across engine strengths when varying the randomization ranges. The updated, red distribution shows a much milder unevenness in the distribution, while still learning to focus on the harder instances.

A.3 Generalization and default environment learning progression on LunarLander-v2 when using ADR to bootstrap a new policy. Higher is better.

A.4 An example progression (left to right) of an agent moving to a catastrophic failure state (Panel 4) in the hard ErgoReacher-v0 environment.

A.5 Generalization on LunarLander-v2 for an expert interval selection, Active Domain Randomization (ADR), and Uniform Domain Randomization (UDR). Higher is better.


List of acronyms and abbreviations

RL Reinforcement Learning

DR Domain Randomization

VDR Vanilla Domain Randomization

DRL Deep Reinforcement Learning

ADR Active Domain Randomization

MDP Markov Decision Process

SVGD Stein Variational Gradient Descent

SVPG Stein Variational Policy Gradient

MES main engine strength

DoF degrees of freedom

RBF Radial Basis Function


A2C Advantage Actor-Critic

UDR Uniform Domain Randomization

BO Bayesian Optimization

RARL Robust Adversarial Reinforcement Learning

MaxEnt RL Maximum Entropy Reinforcement Learning

GAIL Generative Adversarial Imitation Learning

GP Gaussian Process

CLRL Curriculum Learning for Reinforcement Learning


Acknowledgements

Most of this work is the product of countless conversations I have been lucky to have had with dozens of my closest collaborators, friends, and mentors. Thank you to:

• The Mila Slack, and all of the brilliant people that come along with it!

• IVADO, Jane Street, Depth First Learning, UdeM, Mila, and Duckietown for providing me with financial support to pursue research and teaching without stress.

• IFT-6760A and Guillaume Rabusseau who showed me that hard things can be taught incredibly well.

• Jacob Miller for answering every question I have ever had about teaching, physics, and grad school.

• Kris Sankaran for our discussions on how to use my work to better our community and society.

• My REAL labmates Sai, Gunshi, Nithin, Vincent, Dhaivat, Krishna, Breandan, Mark, and Dishank, for stimulating discussions, (not-so) "awkward" group meetings, and board game sessions during the Montreal winters.

• My Duckietown family: what an incredible experience to continue to be a part of!

• My mentors Florian and Maxime, who helped me get things moving on countless projects, and showed me the ropes on everything from simulators to Duckietown.

• My closest collaborators: Sharath - thank you for your incredible hard work; Julian - thank you for being the most persevering and optimistic researcher I have come across (and for the book recommendations and countless enjoyable conversations); Manfred - thank you for introducing me to graphs and eight-dollar meals at Polytechnique... and for asking the toughest questions about my worst ideas.

• My advisers Liam Paull, Christopher Pal, and Andrea Censi, who, during my time at Mila, helped me grow into a (sort-of) productive researcher, but more importantly, a curious mind.


Chapter 1 Introduction

Machine learning often concerns itself with learning information from data. In regression settings, we often want to predict quantities like housing prices; in classification, we want to decide, for example, whether a particular image shows a dog or a cat. However, implicit in all main branches of machine learning - supervised, unsupervised, and reinforcement - is the notion of a data distribution. Even though we may never see it, there is a data distribution - some distribution over real estate pricing, animal images, etc. - from which the data we observe is sampled. The machine learning paradigm comprises two coarsely-defined stages: training and testing. During training, algorithms aim to extract patterns in data: mapping, for example, from images to labels or from states to actions. The data in this stage comes with some supervisory signal, enabling an algorithm to learn. Learning consists of updating parameters by incorporating the signal into an algorithm's decision-making process.

During testing, the algorithm, using its parameters acquired from the training stage, generates open-loop predictions given data. Often, at this stage, the supervisory signal is known only to the human experimenter, enabling fair comparison across algorithms and approaches.

This thesis generally concerns itself with problems under the jurisdiction of transfer learning. The previous paragraphs, which describe the theoretical setup of most machine learning problems, require the existence of a data distribution p(x). p(x) can be the distribution of handwritten digits, natural images, or states seen by a policy in a reinforcement learning environment. The true p(x) is often high-dimensional, vaguely-defined, and most importantly, unattainable. However, its existence is guaranteed in most machine learning problem settings and is often held constant across training and testing. Transfer learning comes into play when it doesn't.

What does it mean to change a distribution? Mathematically, it is simple: given some random variable X (representing our data), we train an algorithm on data distribution p(x) which has some support. Then, we test our output on a different distribution q(x), one with a different support than p.

While transfer learning can involve simple shifts in data distribution (e.g., the label distribution of cats vs. dogs being 60%-40% at testing, when equal at training), many real-world problems can see dramatic changes in distribution. Transfer learning is precisely the issue that makes things like deep learning-based autonomous driving [31] and in-the-wild chatbots [88] so difficult to deploy.

Throughout much of this thesis, the transfer learning problem studied is sim2real transfer: a robotics-centric problem that occurs when training policies fully in simulation while testing them on real robotic hardware. This problem is introduced in its entirety in Chapter 2.

The rest of this thesis is organized as follows:

• Section 1.1 covers mathematical preliminaries common to the work covered in this thesis.

• Section 1.2 covers related work in the space of reinforcement learning, robot learning, multi-task learning, and curriculum learning.

• Section 1.3 covers a phenomenon of deep learning and generalization in supervised learning, one which is exploited by the rest of the work covered in the thesis.

• Chapter 2 covers Active Domain Randomization [46], a novel algorithm which aims to address the sim2real transfer problem.

• Chapter 3 covers optimization characteristics of curriculum learning for neural networks.

• Chapter 4 covers an application of Active Domain Randomization within the meta-reinforcement learning domain.

1.1. Preliminaries

In this section, we lay out the necessary theoretical and experimental tools to understand the remainder of the thesis. For brevity, we provide external references, and summarize only the required information below.

1.1.1. Reinforcement Learning

We consider a Reinforcement Learning (RL) framework [76] where some task $\mathcal{T}$ is defined by a Markov Decision Process (MDP) consisting of a state space $\mathcal{S}$, action space $\mathcal{A}$, state transition function $P : \mathcal{S} \times \mathcal{A} \mapsto \mathcal{S}$, reward function $r : \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}$, and discount factor $\gamma \in (0, 1)$. The goal for an agent trying to solve $\mathcal{T}$ is to learn a policy $\pi$ with parameters $\theta$ that maximizes the expected total discounted reward. We define a rollout $\tau = (s_0, a_0, \ldots, s_T, a_T)$ to be the sequence of states $s_t$ and actions $a_t \sim \pi(a_t \mid s_t)$ executed by a policy $\pi$ in the environment.


To maximize the reward, we maximize a utility function, $J(\pi_\theta)$, with policy gradient methods [77]:

$$J(\pi_\theta) = \mathbb{E}_{s, a \sim \pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right] \tag{1.1.1}$$

In our work, we use several variants of policy gradient estimators (REINFORCE [89], REINFORCE with baseline, Actor-Critic [37, 40]), but in general, any method can be used to compute the policy gradient $\nabla_\theta J(\pi_\theta)$, the direction of steepest ascent.
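As a minimal sketch of how the utility in Equation 1.1.1 can be estimated in practice, the snippet below averages discounted returns over a handful of finite rollouts. The function names and the toy reward sequences are assumptions made for illustration; they are not taken from the thesis code.

```python
from typing import List, Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum_t gamma^t * r_t for one finite rollout (a truncation of Eq. 1.1.1)."""
    total = 0.0
    # Accumulate backwards so each step needs a single multiply-add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def estimate_utility(rollouts: List[Sequence[float]], gamma: float) -> float:
    """Monte Carlo estimate of J(pi_theta): mean discounted return over rollouts."""
    return sum(discounted_return(r, gamma) for r in rollouts) / len(rollouts)


# Hypothetical per-step rewards from two rollouts of the same policy.
toy_rollouts = [[1.0, 1.0, 1.0], [0.0, 0.0, 2.0]]
utility = estimate_utility(toy_rollouts, gamma=0.5)
```

A policy gradient method would then adjust θ to increase this estimate; the gradient estimator itself is left out of the sketch.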

1.1.2. Maximum Entropy Reinforcement Learning

Oftentimes, policies in reinforcement learning are defined as distributions over the action space: p(a|s). Such a decision allows us to cleanly incorporate randomness into the learning process, which is crucial for agents to explore the state spaces of their environments.

Traditionally, these action distributions are characterized by simple distributions: a categorical distribution for discrete action spaces, or a Gaussian in the continuous action case. Using distributions allows for the characterization of entropy, which can be roughly thought of as the randomness of the distribution.

$$H(X) = \mathbb{E}_X[I(X)] = -\sum_x p(x) \log p(x) \tag{1.1.2}$$

where $I(X)$ is the self-information of the random variable $X$.
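For a categorical action distribution, Equation 1.1.2 can be evaluated directly. The sketch below assumes the distribution is given as a list of probabilities and measures entropy in nats; the function name is illustrative.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats of a categorical distribution (Eq. 1.1.2)."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with p(x) = 0 contribute nothing, by the convention 0 * log 0 = 0.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform_entropy = entropy([0.25, 0.25, 0.25, 0.25])  # maximal for four actions
greedy_entropy = entropy([1.0, 0.0, 0.0, 0.0])       # a deterministic choice
```

The uniform distribution attains the maximum value log 4, while a deterministic policy has zero entropy, which matches the reading of entropy as randomness.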

Maximum Entropy RL (MaxEnt RL) deals with maximizing the long-term entropy jointly with the expected return. MaxEnt RL maximizes entropy inside of the expectation operator, rather than only the single-step entropy (which is known as an entropy bonus). This can be written as:

$$\pi^* = \arg\max_\pi \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t (r_t + \alpha H)\right] \tag{1.1.3}$$

with entropy $H$ being controlled by temperature parameter $\alpha$. The entropy $H$ is shorthand for the entropy of the policy distribution at each timestep.

While MaxEnt RL is commonly thought of as maximizing entropy, the entropy term itself is derived from a KL-divergence term:

$$\pi^* = \arg\max_\pi \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t (r_t - \alpha D_{KL}[\pi \,\|\, \bar{\pi}])\right] \tag{1.1.4}$$

where the KL-divergence of $\pi$ is calculated with respect to a reference policy, $\bar{\pi}$. Following [71], when we take $\bar{\pi}$ to be a uniform policy (for example, over $|\mathcal{A}|$ discrete actions, where $D_{KL}[\pi \,\|\, \bar{\pi}] = \log |\mathcal{A}| - H(\pi)$), we recover the MaxEnt RL formulation up to an additive constant, which can be dropped when taking the maximization.


Taking a uniform prior, we encourage our learned policy to maximize reward while acting as randomly as possible. But, if given a prior policy to follow, MaxEnt RL provides a natural way to incorporate it into the decision making and learning process.
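The link between the uniform prior and entropy can be checked numerically: for a uniform reference policy over |A| discrete actions, the KL-divergence from π equals log |A| minus the entropy of π, so penalizing that divergence is the same as rewarding entropy up to a constant. The sketch below assumes a discrete action space and an arbitrary example policy; the names are illustrative.

```python
import math
from typing import Sequence


def entropy(pi: Sequence[float]) -> float:
    """Shannon entropy of a categorical policy, in nats."""
    return -sum(p * math.log(p) for p in pi if p > 0.0)


def kl_divergence(pi: Sequence[float], reference: Sequence[float]) -> float:
    """D_KL[pi || reference] for categorical distributions on the same actions."""
    return sum(p * math.log(p / q) for p, q in zip(pi, reference) if p > 0.0)


policy = [0.7, 0.2, 0.1]                      # hypothetical action probabilities
uniform = [1.0 / len(policy)] * len(policy)   # the uniform reference policy

kl_to_uniform = kl_divergence(policy, uniform)
constant_minus_entropy = math.log(len(policy)) - entropy(policy)
```

Replacing `uniform` with any other reference distribution gives the more general case in which the learned policy is pulled towards a prior policy instead of towards maximal randomness.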

1.1.3. Domain Randomization

Domain Randomization (DR) is a technique to increase the generalization capability of policies trained in simulation. DR requires a prescribed set of $N_{rand}$ simulation parameters to randomize, as well as corresponding ranges to sample them from. A set of parameters is sampled from a randomization space $\Xi \subset \mathbb{R}^{N_{rand}}$, where each randomization parameter $\xi^{(i)}$ is bounded on a closed interval $\{[\xi^{(i)}_{low}, \xi^{(i)}_{high}]\}_{i=1}^{N_{rand}}$.

When a configuration $\xi \in \Xi$ is passed to a non-differentiable simulator $S$¹, it generates an environment $E$. At the start of each episode, the parameters are uniformly sampled from the ranges, and the environment generated from those values is used to train the agent policy $\pi$. DR may perturb any or all elements of the task $\mathcal{T}$'s underlying MDP², with the exception of the reward function $r$ and discount factor $\gamma$, which are kept constant. DR therefore generates a set of MDPs that are superficially similar, but can vary greatly in difficulty depending on the character of the randomization. Upon transfer to the target domain, the hope is that the agent policy has learned to generalize across MDPs, and sees the final domain as just another variation of parameters.

The most common instantiation of DR, Uniform Domain Randomization (UDR), is summarized in Algorithm 1.

Algorithm 1 Uniform-Sampling Domain Randomization

Input: Ξ: randomization space, S: simulator
Initialize π_θ: agent policy, T_rand = ∅
for each episode t do
    for i = 1 to N_rand do                  ▷ Uniformly sample parameters
        ξ_t^(i) ∼ U[ξ_low^(i), ξ_high^(i)]
    E_t ← S(ξ_t)                            ▷ Generate randomized environment
    rollout τ_t ∼ π_θ(·; E_t)               ▷ Roll out policy in randomized environment
    T_rand ← T_rand ∪ τ_t
    for each policy update step do          ▷ Agent policy update
        with experience buffer T_rand update:
            θ ← θ + ν ∇_θ J(π_θ)            ▷ Policy gradient update

UDR generates randomized environment instances $E_t$ by uniformly sampling $\Xi$. The agent policy $\pi$ is then trained on rollouts $\tau_t$ produced in randomized environments $E_t$.
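The sampling half of UDR can be sketched in a few lines. In the snippet below, `simulate` is a hypothetical callable standing in for both the simulator S and the policy rollout, the parameter name `main_engine_strength` and its range are chosen only as an example, and the policy gradient update is deliberately left out.

```python
import random
from typing import Callable, Dict, List, Tuple

Ranges = Dict[str, Tuple[float, float]]   # parameter name -> (low, high)
Config = Dict[str, float]                 # one sampled configuration xi


def sample_configuration(ranges: Ranges, rng: random.Random) -> Config:
    """Draw every randomization parameter uniformly from its interval."""
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}


def collect_udr_rollouts(
    ranges: Ranges,
    simulate: Callable[[Config], List[float]],
    episodes: int,
    seed: int = 0,
) -> List[Tuple[Config, List[float]]]:
    """Fill an experience buffer with rollouts from uniformly sampled environments."""
    rng = random.Random(seed)
    buffer = []
    for _ in range(episodes):
        xi = sample_configuration(ranges, rng)   # xi_t sampled uniformly from Xi
        buffer.append((xi, simulate(xi)))        # rollout in environment S(xi_t)
    return buffer


# Hypothetical one-parameter randomization space and a stand-in rollout function.
toy_ranges = {"main_engine_strength": (8.0, 20.0)}
toy_buffer = collect_udr_rollouts(
    toy_ranges, lambda xi: [xi["main_engine_strength"]], episodes=5
)
```

Every configuration is equally likely under this scheme, regardless of how informative the resulting environment is for the agent.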

¹ We assume a non-differentiable simulator because (A) most physics simulators are non-differentiable, and (B) differentiable simulators admit more efficient methods of propagating reward signals than REINFORCE-like estimates.


1.1.4. Curriculum Learning

Curriculum learning focuses on a meta-optimization task, where a black-box learner is required to learn a goal task $\xi_g$ or a distribution of tasks, $p(\xi)$. By abstracting away the inner-loop learner, curriculum learning focuses on the sequence of tasks to show the agent, rather than focusing on learning how to solve any particular task. Oftentimes, curriculum learning focuses on showing the learner easier versions of the goal task - for example, in a navigation setting, starting states closer to the goal may be shown to the agent first [21, 3], progressively getting harder until the goal task $\xi_g$ is achieved. Curriculum learning works well with structured task spaces, or when there is a clear notion of difficulty between tasks. Learning a curriculum in an unstructured task space, where a notion of easier tasks cannot be discerned, is still an open research question.
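In a structured task space, a hand-designed curriculum can be as simple as sorting tasks by a known difficulty measure. The sketch below assumes a 2D navigation setting in which Euclidean distance from start state to goal serves as that measure; this proxy is an assumption for illustration, and it is exactly what an unstructured task space lacks.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def distance_curriculum(starts: List[Point], goal: Point) -> List[Point]:
    """Order start states from nearest to farthest from the goal."""
    return sorted(starts, key=lambda start: math.dist(start, goal))


# Hypothetical start states for a navigation task with the goal at the origin.
ordered_starts = distance_curriculum([(3.0, 4.0), (0.0, 1.0), (1.0, 1.0)], (0.0, 0.0))
```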

1.1.5. Transfer Learning

When training machine learning models, we often make an assumption that the underlying data-generating distribution p(x) is constant for both training and testing [61], even if the training and testing datasets are disjoint. However, when the data-generating distribution is not held constant between training and testing, the problem setting becomes an instantiation of transfer learning.

When training machine learning models solely on distribution p(x) and testing on a different distribution q(x), zero-shot transfer deals with problem settings where no further optimization occurs at test time using distribution q(x). The model trained in the data regime defined by p(x), say a simulator, is used without tuning in the new distribution q(x), for example, a real-world robot.

While zero-shot transfer is often motivated through the lens of expense (i.e., it is cheaper to train a robotic agent in a simulation than on the physical robot), many real-world applications of machine learning allow for fine-tuning in the test domain. When only a few data samples can be collected and used for optimization, the problem lies in the few-shot transfer regime: for example, a robotic agent, trained in simulation, may get the opportunity to execute a small number of real-world rollouts, ideally used to quickly fine-tune to the test distribution. However, in the context of RL, fine-tuning at test time requires additional policy gradient steps. Often, evaluating the reward for RL algorithms at test time can be problematic, leading to a surge of interest in zero-shot transfer learning.

1.1.6. Bayesian Optimization

In the Bayesian Optimization (BO) framework [6], we are concerned with the global optimization of some stationary function $f : \Phi \to \mathbb{R}$:

$$\phi^* = \arg\max_{\phi \in \Phi} f(\phi) \tag{1.1.5}$$

If given knowledge about $f$ (e.g., $f$ is convex, piece-wise constant, etc.), we can use specialized techniques, but in general, the structure of the underlying function is not known. Moreover, $f$ is often considered to be expensive to evaluate, which rules out the use of exhaustive search or gradient-based methods as optimization tools.

Since $f$ is often expensive to evaluate, we need to carefully pick where to evaluate our function. Often given a limited budget, or number of evaluations, we transform the original optimization problem into one that optimizes where to next evaluate the function $f$. The function which handles this decision is called the acquisition function. The acquisition function is a much cheaper function to optimize, and is often chosen or designed using prior knowledge about the types of solutions that an experimenter expects. The maximization of the acquisition function $a(\phi)$ is sequential, meaning that past choices of iterates and their function evaluations inform future evaluations. The acquisition maximization can be written as:

$$\phi_{t+1} = \arg\max_{\phi} a(\phi \mid D_t) \tag{1.1.6}$$

where dataset $D_t$ is the set of past iterates paired with their evaluations up until time $t$, $D_t = \{(\phi_{t'}, f(\phi_{t'}))\}_{t'=1}^{t}$.
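The sequential structure of Equation 1.1.6 can be sketched without a full probabilistic model. The snippet below swaps the usual Gaussian process surrogate for a deliberately crude stand-in: the predicted value at a candidate is the value of its nearest evaluated point, and the exploration bonus grows with the distance to that point. This acquisition rule, the `kappa` weight, and the candidate grid are all assumptions made for illustration, not the method used in the thesis.

```python
from typing import Callable, List, Sequence, Tuple

Dataset = List[Tuple[float, float]]  # past iterates paired with their evaluations


def acquisition(phi: float, data: Dataset, kappa: float) -> float:
    """Cheap score: nearest observed value plus a distance-based exploration bonus."""
    nearest_phi, nearest_value = min(data, key=lambda pair: abs(pair[0] - phi))
    return nearest_value + kappa * abs(nearest_phi - phi)


def sequential_optimize(
    f: Callable[[float], float],
    candidates: Sequence[float],
    budget: int,
    kappa: float = 1.0,
) -> Tuple[float, float]:
    """Spend `budget` evaluations of f and return the best (phi, f(phi)) seen."""
    if budget < 1 or not candidates:
        raise ValueError("need at least one evaluation and one candidate")
    data: Dataset = [(candidates[0], f(candidates[0]))]  # initial design point
    while len(data) < budget:
        # Eq. 1.1.6: pick the next iterate by maximizing the acquisition given D_t.
        phi_next = max(candidates, key=lambda phi: acquisition(phi, data, kappa))
        data.append((phi_next, f(phi_next)))  # the only expensive call per step
    return max(data, key=lambda pair: pair[1])
```

Only `f` is treated as expensive here: each iteration scans the whole candidate set with the cheap acquisition and then pays for a single evaluation of `f`.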

1.1.7. Meta-Learning

Most deep learning models are built to solve only one task and often lack the ability to generalize and quickly adapt to solve a new set of tasks. Meta-learning involves learning a learning algorithm which can adapt quickly rather than learning from scratch. Several methods have been proposed, treating the learning algorithm as a recurrent model capable of remembering past experience [70, 51, 48], as a non-parametric model [36, 84, 73], or as an optimization problem [64, 19]. In this thesis, we focus on a popular gradient-based meta-learning algorithm called Model Agnostic Meta-Learning (MAML; [19]).

1.1.8. Gradient-based Meta-Learning

The main idea in MAML is to find a good parameter initialization such that the model can adapt to a new task, τ , quickly. Formally, given a distribution of tasks p(τ ) and a loss function Lτ corresponding to each task, the aim is to find parameters θ such that the model

fθ can adapt to new tasks with one or few gradient steps. For example, in the case of a single gradient step, the parameters θ0_τ adapted to the task τ are

θ0_τ = θ − α∇θLτ(Dtrain, fθ), (1.1.7)

(30)

with step size α, where the loss is evaluated on a (typically small) dataset $D_{\text{train}}$ of training examples from task τ. In order to find a good initial value of the parameters θ, the objective function being optimized in MAML is written as

$$\min_{\theta} \sum_{\tau_i} \mathcal{L}_{\tau_i}(D_{\text{test}}, f_{\theta'_{\tau_i}}), \qquad (1.1.8)$$

which evaluates generalization performance on test examples $D_{\text{test}}$ for each task $\tau_i$. The meta-objective is optimized by gradient descent, where the parameters are updated according to

$$\theta \leftarrow \theta - \beta \nabla_\theta \sum_{\tau_i} \mathcal{L}_{\tau_i}(D_{\text{test}}, f_{\theta'_{\tau_i}}), \qquad (1.1.9)$$

where β is the outer step size.
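To make the two nested updates concrete, the following sketch runs Equations 1.1.7 and 1.1.9 on a toy problem. The setup is an assumption made purely for illustration: each task has the quadratic loss 0.5 * ||θ − c_τ||² with a hypothetical task center c_τ, the same loss stands in for both the training and test datasets, and the gradient through the inner step is written out by hand instead of using automatic differentiation.

```python
import numpy as np

def maml_toy(centers, alpha=0.1, beta=0.05, steps=500):
    """MAML on toy tasks with loss L_tau(theta) = 0.5 * ||theta - c_tau||^2."""
    theta = np.zeros(centers.shape[1])
    for _ in range(steps):
        meta_grad = np.zeros_like(theta)
        for c in centers:
            inner_grad = theta - c                       # gradient of L_tau at theta
            theta_adapted = theta - alpha * inner_grad   # inner step, Eq. 1.1.7
            outer_grad = theta_adapted - c               # gradient of L_tau at theta'
            # Chain rule through the inner step: d theta' / d theta = (1 - alpha) * I
            meta_grad += (1.0 - alpha) * outer_grad
        theta = theta - beta * meta_grad                 # outer step, Eq. 1.1.9
    return theta

# Four hypothetical tasks, each defined by the center of its quadratic loss
centers = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0], [2.0, 2.0]])
theta_star = maml_toy(centers)
```

For this particular family of losses the meta-objective is minimized at the mean of the task centers, so `theta_star` ends up equidistant (in the squared-error sense) from all tasks, which is the initialization from which one gradient step helps every task the most.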

1.1.9. Meta-Reinforcement Learning

In addition to few-shot supervised learning problems, where the number of training examples is small, meta-learning has also been successfully applied to reinforcement learning problems. In meta-reinforcement learning, the goal is to find a policy that can quickly adapt to new environments, generally from only a few trajectories. [63] treat this problem by conditioning the policy on a latent representation of the task, and [16, 85] represent the reinforcement learning algorithm as a recurrent network, inspired by the “black-box” meta-learning methods mentioned above. Some meta-learning algorithms can even be adapted to reinforcement learning with minimal changes [48]. In particular, MAML has also shown some success in robotics applications [20]. In the context of reinforcement learning, $D_{\text{train}}$ and $D_{\text{test}}$ are datasets of trajectories sampled by the policies before and after adaptation (i.e., rollouts in $D_{\text{train}}$ are sampled before the gradient step in Equation 1.1.7, whereas those in $D_{\text{test}}$ are sampled after). The loss function used for the adaptation is REINFORCE [90], and the outer meta-objective in Equation 1.1.9 is optimized using TRPO [72].

1.1.10. Stein Variational Policy Gradient

For interested readers, we provide a brief overview of Stein’s Method, Stein Variational Gradient Descent, and Stein Variational Policy Gradient. A more thorough overview can be found in [44].

Stein’s Method [75] is an approach for obtaining bounds on distances between distributions, but has generally stayed in the realm of theoretical statistics. Recently, [23] and [41] applied Stein’s Method to machine learning, showing that it can be used efficiently for goodness-of-fit tests. As a follow-up, [42] derived Stein Variational Gradient Descent (SVGD), a gradient-based variational inference algorithm that iteratively transforms a set of particles into a target distribution, using an underlying connection between Stein’s Method and KL-Divergence minimization.


However, in a reinforcement learning context, the target distribution is not known, and therefore needs to be sampled from via interactions with an environment. [43] extended SVGD to Stein Variational Policy Gradient (SVPG), which learns an ensemble of policies $\mu_\phi$ in a maximum-entropy RL framework [94]:

$$\max_{\mu}\; \mathbb{E}_{\mu}[J(\mu)] + \alpha \mathcal{H}(\mu) \qquad (1.1.10)$$

SVPG uses SVGD to iteratively update an ensemble of N policies, or particles, $\mu_\phi = \{\mu_{\phi_i}\}_{i=1}^{N}$ using

$$\mu_{\phi_i} \leftarrow \mu_{\phi_i} + \frac{\epsilon}{N} \sum_{j=1}^{N} \left[ \nabla_{\mu_{\phi_j}} J(\mu_{\phi_j})\, k(\mu_{\phi_j}, \mu_{\phi_i}) + \alpha \nabla_{\mu_{\phi_j}} k(\mu_{\phi_j}, \mu_{\phi_i}) \right] \qquad (1.1.11)$$

with step size ε and positive definite kernel k. This update rule balances exploitation (the first term moves particles towards high-reward regions) and exploration (the second term repulses similar policies). As the original authors use an improper prior on the distribution of particles, we simplify the notation by dropping the prior term from Equation 1.1.11.
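The particle update in Equation 1.1.11 can be written out directly. The sketch below is a minimal version under stated assumptions: each particle is represented by a flat parameter vector, the kernel is an RBF kernel with a fixed bandwidth, and `grad_J` is a user-supplied function returning the policy gradient for one particle. The function names and default values are hypothetical.

```python
import numpy as np

def rbf(x_j, x_i, bandwidth):
    """RBF kernel k(x_j, x_i) and its gradient with respect to x_j."""
    diff = x_j - x_i
    k = np.exp(-np.dot(diff, diff) / bandwidth)
    grad_k = -2.0 * diff / bandwidth * k
    return k, grad_k

def svpg_step(particles, grad_J, alpha=1.0, eps=0.1, bandwidth=1.0):
    """Apply Eq. 1.1.11 once to an (N, d) float array of particle parameters."""
    n = len(particles)
    grads = np.array([grad_J(p) for p in particles])
    updated = np.empty_like(particles)
    for i in range(n):
        direction = np.zeros(particles.shape[1])
        for j in range(n):
            k, grad_k = rbf(particles[j], particles[i], bandwidth)
            # Exploitation: kernel-weighted policy gradient of particle j
            # Exploration: kernel gradient, which pushes particle i away from j
            direction += grads[j] * k + alpha * grad_k
        updated[i] = particles[i] + eps / n * direction
    return updated

# With zero policy gradient, only the repulsive term acts on two nearby particles
pair = np.array([[0.0], [0.1]])
spread = svpg_step(pair, lambda p: np.zeros_like(p))
```

With a single particle the kernel term vanishes and the update reduces to plain gradient ascent; with several particles, a larger α spreads them out more aggressively.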

1.2. Related Work

In this section, we aggregate related work that helps understand the landscape of task distribution learning, adaptive simulators, and curriculum learning within various contexts.

1.2.1. Dynamic and Adversarial Simulators

Simulators have played a crucial role in transferring learned policies onto real robots, and many different strategies have been proposed. Randomizing simulation parameters for better generalization or transfer performance is a well-established idea in evolutionary robotics [92, 5], but recently has emerged as an effective way to perform zero-shot transfer of deep reinforcement learning policies in difficult tasks [2, 79, 57, 69].

Learnable simulations are also an effective way to adapt a simulation to a particular target environment. [10] and [68] use RL for effective transfer by learning parameters of a simulation that accurately describes the target domain, but require the target domain for reward calculation, which can lead to overfitting. In contrast, Active Domain Randomization (ADR), presented in Chapter 2, requires no target domain, but rather only a reference domain (the default simulation parameters) and a general range for each parameter. ADR encourages diversity, and as a result, gives the agent a wider variety of experience. In addition, unlike [10], our method does not require carefully-tuned (co-)variances or task-specific cost functions. Concurrently, [32] also showed the advantages of learning adversarial simulations and the disadvantages of purely uniform randomization distributions in object detection tasks.


To improve policy robustness, Robust Adversarial Reinforcement Learning (RARL) [58] jointly trains both an agent and an adversary who applies environment forces that disrupt the agent’s task progress. ADR removes the zero-sum game dynamics, which have been known to decrease training stability [47]. More importantly, our method’s final outputs - the SVPG-based sampling strategy and discriminator - are reusable and can be used to train new agents as shown in Appendix A.3, whereas a trained RARL adversary would overpower any new agent and impede learning progress.

1.2.2. Active Learning and Informative Samples

Active learning methods in supervised learning try to construct a representative, sometimes time-variant, dataset from a large pool of unlabelled data by proposing elements to be labeled. The chosen samples are labelled by an oracle and sent back to the model for use. Similarly, ADR searches for the environments that may be most useful to the agent at any given time. Active learners, like the BO methods discussed in Section 1.1.6, often require an acquisition function (derived from a notion of model uncertainty) to choose the next sample. Since ADR handles this decision through the explore-exploit framework of RL and the α in SVPG, ADR sidesteps the well-known scalability issues of both active learning and BO [82]. Recently, [81] showed that certain examples in popular computer vision datasets are harder to learn and that some examples are forgotten much more quickly than others. We explore the same phenomenon in the space of MDPs defined by our randomization ranges and try to find the “examples” that cause our agent the most trouble. Unlike in active learning or [81], we have no oracle or supervisory loss signal in RL, and instead attempt to learn a proxy signal for ADR via a discriminator.

1.2.3. Generalization in Reinforcement Learning

Generalization in RL has long been one of the holy grails of the field, and recent works such as [55], [11], and [18] highlight the tendency of deep RL policies to overfit to details of the training environment. Our experiments exhibit the same phenomenon, but our method improves upon the state of the art by explicitly searching for and varying the environment aspects that our agent policy may have overfit to. We find that our agents, when trained more frequently on these problematic samples, show better generalization over all environments tested.

1.2.4. Interference and Transfer in Multi-Task Learning

Multi-task learning deals with training a single agent to do multiple things, such as a robotic arm trained to pick up objects and open doors. Often in multi-task learning, we aim to train a single policy that can accomplish everything, rather than training multiple, separate policies for each task. When sharing parameters, or more commonly, using the same network across tasks, gradients from various sources can interfere or transfer, hindering or enabling learning across the curriculum. Recently, [65] have shown that when distinct tasks with minimal gradient interference are trained on, the resulting agent transfers better in a multi-task learning scenario.

When two separate tasks have gradients pointing in opposite directions, they can cancel out, or combine (average) into a third direction that is not optimal for either task. [9] showed empirically that, when taking the average gradient step, agents improve their multi-task performance if the gradients do not interfere. Methods have been proposed to make these gradients align better [56], but in our setting, we focus on the sampled gradients from the tasks without any transformation.

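We can quantify transfer and interference of gradients using the cosine similarity of gradients from two tasks, $T_a$ and $T_b$, as follows:

$$\rho_{ab}(\theta) = \frac{\langle \nabla J_{T_a}(\theta), \nabla J_{T_b}(\theta) \rangle}{\lVert \nabla J_{T_a}(\theta) \rVert \, \lVert \nabla J_{T_b}(\theta) \rVert} \qquad (1.2.1)$$

where the cosine similarity $\rho_{ab}(\theta)$ is bounded between +1 (full transfer) and −1 (full interference).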
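Equation 1.2.1 is straightforward to compute once the two task gradients are available as arrays. The helper below is a small sketch; the function name and the `tiny` guard against zero-norm gradients are our own additions.

```python
import numpy as np

def gradient_cosine(grad_a, grad_b, tiny=1e-12):
    """Cosine similarity between two task gradients (Eq. 1.2.1)."""
    grad_a = np.ravel(grad_a).astype(float)   # flatten per-layer gradients if needed
    grad_b = np.ravel(grad_b).astype(float)
    denom = np.linalg.norm(grad_a) * np.linalg.norm(grad_b)
    return float(np.dot(grad_a, grad_b) / max(denom, tiny))

transfer = gradient_cosine([1.0, 0.0], [2.0, 0.0])       # same direction
interference = gradient_cosine([1.0, 0.0], [-3.0, 0.0])  # opposite directions
unrelated = gradient_cosine([1.0, 0.0], [0.0, 5.0])      # orthogonal
```

The three calls cover the two extremes of the bound and the neutral case in which the tasks neither help nor hurt each other.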

1.2.5. Optimization Analysis in Reinforcement Learning

In policy optimization, we traditionally consider an RL framework [76] where some task T is defined by an MDP consisting of a state space S, action space A, state transition function P : S × A → S, reward function R : S × A → ℝ, and discount factor γ ∈ (0, 1). The goal for an agent trying to solve T is to learn a policy π with parameters θ that maximizes the expected total discounted reward.
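As a small worked example of the quantity being maximized, the function below computes the total discounted reward of a single trajectory; averaging it over many rollouts of π gives a Monte Carlo estimate of the objective. The function name and the reward values are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one trajectory, accumulated from the last step."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```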

After training the policy in this maximization setting, ideally, any solution found by policy optimization would be some kind of local maximum. However, under a non-stationary MDP, standard policy optimization algorithms are no longer guaranteed to converge to any type of optimum. In addition, when working with function approximators in high dimensions, optimization analysis of solutions is very difficult.

Recently, [1] showed that with linear policies, we can analyze policy optimization solutions using perturbative methods. Briefly, they approximate the solution neighborhood as a local quadratic and estimate the curvature of the neighborhood by perturbing the solution with a minor amount of noise. This allows for estimation of the policy quality from the perturbations’ effects. While the method produces a high variance estimate (we characterize the perturbation using accumulated reward as a metric), with enough noise samples, meaningful conclusions regarding solutions found with policy optimization can still be drawn.
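The flavor of such a perturbative analysis can be conveyed with a toy sketch. This is not the estimator of [1]: it simply perturbs a parameter vector with Gaussian noise and records how a deterministic toy objective changes, whereas in RL the objective would be a noisy estimate of accumulated reward. All names and values are hypothetical.

```python
import numpy as np

def perturbation_profile(objective, theta, sigma=0.05, n_samples=256, seed=0):
    """Summarize the change in objective under small Gaussian perturbations of theta."""
    rng = np.random.default_rng(seed)
    base = objective(theta)
    noise = rng.normal(scale=sigma, size=(n_samples, theta.size))
    deltas = np.array([objective(theta + eps) - base for eps in noise])
    return {
        "mean_change": float(deltas.mean()),
        "frac_improved": float((deltas > 0).mean()),
    }

# Toy concave objective with its maximum at the origin
objective = lambda th: -float(np.sum(th ** 2))
at_maximum = perturbation_profile(objective, np.zeros(3))
off_maximum = perturbation_profile(objective, np.array([1.0, 0.0, 0.0]))
```

At a local maximum no perturbation improves the objective, while away from it a sizeable fraction does; the spread of the changes gives a rough picture of the local curvature.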


1.2.6. Task Distributions in Meta-Reinforcement Learning

When discussing meta-reinforcement learning, to the best of our knowledge, the task distribution p(τ) has never been studied or ablated. As most benchmark environments and tasks in meta-RL stem from two papers ([19, 66], with the task distributions being prescribed with the environments), the discussion in meta-RL papers has almost always centered around the computation of the updates [66], practical improvements and approximations made to improve efficiency, or learning exploration policies with meta-learning [74, 26, 25]. In this section, we briefly discuss prior work in curriculum learning that bears the most similarity to the analyses we conduct here.

Starting with the seminal curriculum learning paper [4], many different proposals to learn an optimal ordering over tasks have been studied. Curriculum learning has been tackled with Bayesian Optimization [83], multi-armed bandits [24], and evolutionary strategies [86] in supervised learning and reinforcement learning settings, but here, we focus on the latter. However, in most work, the task space is almost always discrete, with a teacher agent looking to choose the best next task over a set of N pre-made tasks. The notion of best has also been explored in-depth, with metrics being based on a variety of things from ground-truth accuracy or reward to adversarial gains between a teacher and student agent [59].

However, up until recently, the notion of continuously-parameterized curriculum learning has been studied less often. Often, continuous-task curriculum learning exploits a notion of difficulty in the task itself. In order to get agents to hop over large gaps, it’s been empirically easier to get them to jump over smaller ones first [28]; likewise, in navigation domains, it is easier to start with simple goals and grow the goal space [60], or even to work backward towards the start state in a reverse curriculum manner [21].

While deep reinforcement learning, particularly in robotics, has seen a large number of curriculum learning papers in recent times [46, 53], curriculum learning has not been extensively researched in meta-RL. This may be partly due to the nascence of the field; only recently was a large-scale, multi-task benchmark for meta-RL released [91]. As we hope to show in this thesis, the notions of tasks, task distributions, and curricula in meta-learning are fruitful avenues of study and can make (or break) many of the meta-learning algorithms in use today.

1.3. Preliminary Results Regarding Neural Networks and Generalization

Before continuing onto the sim2real problem, we use this section to highlight a curious phenomenon: the generalization of a neural network and its non-intuitive dependence on the data distribution.


Fig. 1.1. A diagram of the data generation setup. We define regions of the training set by zoning off rectangles of (µ, σ) pairs, and then use standard library tools to generate samples from that distribution.

1.3.1. Informative Samples

With the exception of Chapter 5, much of this thesis examines work in which neural networks are used to learn policies or distributions. Neural networks, from a theoretical perspective, are poorly understood. One of the foremost theoretical challenges in the field today is the generalization properties of neural networks.

Neural networks have been known to have strangely strong out-of-distribution performance, and while hypotheses circulate regarding optimum geometry [15, 30], information content [78], and the characteristics of stochastic gradient descent [93], the problem is, as of writing, still very much open. Here, we make no attempt to address the problem of generalization in neural networks, but rather point out a curious phenomenon which we exploit throughout the rest of this thesis.

We train a small neural network to infer the mean of a normal distribution, given samples from that distribution. Illustrated in Figure 1.1, we generate a training set as follows: we block off regions of mean and standard deviation pairs (the red squares) and generate data samples from that distribution. This vector of samples is sent into the neural network and is used to predict the mean of the normal distribution it is sampled from.

We then finely discretize a grid, and at each point (a (µ, σ) pair) generate data samples to feed through the network to predict means. We then colorize the error, shown in the plots of Figure 1.2, where a darker color is a higher prediction error.
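The data pipeline for this experiment can be sketched as follows. The region boundaries, the number of draws per example, and the function names are all assumptions for illustration; the thesis does not list the values used.

```python
import numpy as np

def sample_training_set(regions, n_examples, n_draws, seed=0):
    """Draw (samples, mean) pairs from rectangles of (mu, sigma) values."""
    rng = np.random.default_rng(seed)
    inputs = np.empty((n_examples, n_draws))
    targets = np.empty(n_examples)
    for row in range(n_examples):
        # Pick one rectangular training region, then a (mu, sigma) pair inside it
        (mu_lo, mu_hi), (sd_lo, sd_hi) = regions[rng.integers(len(regions))]
        mu = rng.uniform(mu_lo, mu_hi)
        sd = rng.uniform(sd_lo, sd_hi)
        inputs[row] = rng.normal(mu, sd, size=n_draws)   # network input
        targets[row] = mu                                # regression target
    return inputs, targets

def evaluation_grid(mu_range, sd_range, resolution):
    """All (mu, sigma) pairs of a regular grid used to probe generalization."""
    mus = np.linspace(*mu_range, resolution)
    sds = np.linspace(*sd_range, resolution)
    return [(mu, sd) for mu in mus for sd in sds]

# Hypothetical regions, each given as ((mu_low, mu_high), (sigma_low, sigma_high))
regions = [((0.0, 1.0), (0.5, 1.0)), ((3.0, 4.0), (0.5, 1.0))]
inputs, targets = sample_training_set(regions, n_examples=1000, n_draws=32)
grid = evaluation_grid((0.0, 5.0), (0.1, 2.0), resolution=50)
```

Adding or removing one entry of `regions` is the only change needed to move between two training distributions of the kind compared in this section.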


Fig. 1.2. Two starkly different stories of generalization, differing only in the data distribution shown to the neural network. Figure (b) adds a small training area (upper-right red box) that, when included, leads to almost perfect generalization across the domain. Darker is higher error.

In Figure 1.2(a), we see an unsurprising pattern. The network does fairly well with interpolation: predicting values of µ correctly around or between things it saw during training. Naturally, it does poorly in the top-right region, when tested on (µ, σ) pairs well outside of the training distribution.

However, when we add a region to the training distribution (the upper-right red box in Figure 1.2(b)), we see that the network learns to generalize, even for points far outside of the training distribution³. Despite the additional training region being placed in an already high-performing region, the addition here allows the network to generalize across the test distribution.

This simple experiment allows us to conclude that certain regions of the data distribution may be more informative, helping guide the neural network to an optimum that allows for strong generalization. Yet, even in this example, such regions of data distribution are often unknown and can be counter-intuitive when found. In traditional machine learning scenarios, the problem of finding such regions can become ever more difficult.

Curriculum learning aims to introduce these data regions iteratively (here, we mixed all the data from regions into one large dataset), introducing complex learning dynamics that depend on a moving data distribution. In addition, human experimenters often use an easy-to-hard heuristic to guide the scheduling of region introduction, but many problems may not have this intuitive difficulty readily available. The next chapter introduces solutions to the curriculum learning problem in unstructured data distributions, using this phenomenon - the phenomenon of representative examples - as an underlying motivation.

³ We hold constant the architecture, random seeds, training epochs, and the total number of samples seen by the network. We show averaged results across three runs.


Chapter 2 Active Domain Randomization

Recent trends in Deep Reinforcement Learning (DRL) exhibit a growing interest in zero-shot domain transfer, i.e., when a policy is learned in a source domain and is then tested without finetuning in a previously unseen target domain. Zero-shot transfer is particularly useful when the task in the target domain is inaccessible, complex, or expensive, such as gathering rollouts from a real-world robot. An ideal agent would learn to generalize across domains; it would accomplish the task without exploiting irrelevant features or deficiencies in the source domain (e.g., approximate physics in simulators), which may vary dramatically after transfer.

One promising approach for zero-shot transfer has been Domain Randomization (DR) [79]. DR uniformly randomizes environment parameters (e.g., friction, motor torque) in heuristically defined ranges after every training episode. By randomizing everything that might vary in the target environment, the hope is that the agent will view the target domain as just another variation. However, recent works suggest that the sample complexity grows exponentially with the number of randomization parameters, even when dealing only with transfer between simulations (e.g., Figure 8 of [2]). In addition, when DR is unsuccessful, policy transfer fails with no clear way to understand the underlying cause. After a failed transfer, randomization ranges are tweaked heuristically via human trial-and-error. Repeating this process iteratively leads to arbitrary ranges that do (or do not) lead to policy convergence without any insight into how those settings may affect the learned behavior.

In this chapter, we demonstrate that the strategy of uniformly sampling environment parameters from predefined ranges is suboptimal and propose an alternative sampling method, Active Domain Randomization (ADR). ADR, shown conceptually in Figure 2.1, formulates DR as a search for randomized environments that maximize utility for the agent policy. Concretely, we aim to find environments that currently cause difficulties for the agent policy, dedicating more training time to these troublesome parameter settings. We cast this active search as an RL problem where the ADR sampling policy of the environment is parameterized with SVPG [43]. ADR focuses on problematic regions of the randomization space by learning a discriminative reward computed from discrepancies in policy rollouts generated in randomized and reference environments.

Fig. 2.1. ADR proposes randomized environments (c) or simulation instances from a simulator (b) and rolls out an agent policy (d) in those instances. The discriminator (e) learns a reward (f) as a proxy for environment difficulty by distinguishing between rollouts in the reference environment (a) and randomized instances, which is used to train SVPG particles (g). The particles propose a diverse set of environments, trying to find the environment parameters (h) that are currently causing the agent the most difficulty.

We first showcase ADR in a simple environment where the benefits of training on more challenging variations are apparent and interpretable (Section 2.4.3). In this case, we demonstrate that ADR learns to preferentially select parameters from these more challenging parameter regions while still adapting to the policy’s current deficiencies. We then apply ADR to more complex environments and real robot settings (Section 2.4.5) and show that even with high-dimensional search spaces and unmodeled dynamics, policies trained with ADR exhibit superior generalization and lower overall variance than their UDR¹ counterparts.

Finally, we show the safety-critical capabilities of our formulation; by ingesting arbitrarily off-policy data, we can fit the parameters of the target environment while using a natural, learned prior on the simulation parameters as a regularizer. This allows the few-shot learning variant of ADR, ADR+, to maximize performance on the target task, robot, or environment, all while staying robust across the entire generalization range. ADR+ trains policies that are robust to non-representative datasets, while enabling the use of robotic trajectories collected safely both offline and off-policy.

¹ We refer to the original version of Domain Randomization as Uniform-(Sampling) Domain Randomization (UDR), since the entire parameter space is sampled uniformly at random.


2.1. Problem Formulation

In this section, we focus on learning a curriculum of environments to show the agent in the domain randomization setting. In contrast to UDR (Algorithm 1), we aim to show that a learned curriculum is beneficial in this setting, as it has proven to be useful in many other machine learning applications [67, 28]. Given a predefined randomization space, we aim to learn a curriculum that maximizes generalization on all target tasks - whether that be unseen variations of the same simulated task, or even a completely different task altogether (e.g., transfer onto a real robot).

After [45], subsequent works [54, 86] have explored the idea of progressively growing curricula in such spaces, using metrics such as return and complex scheduling schemes to heuristically tune when to grow the space of possible environments. In contrast, we focus on a fixed, but unstructured, space of environments, looking for beneficial curricula in those spaces.

2.1.1. Domain Randomization, Curriculum Learning, and Bayesian Optimization

As shown in Section 1.1.3, domain randomization generates new environments by randomly perturbing the underlying MDP. This generates a space of related tasks, all of which the agent is expected to be able to solve at the end of training. Traditionally, we define a randomization space, uniformly sample environments from this space, train agents, and evaluate generalization.
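The uniform sampling step of this traditional pipeline is simple enough to state in code. The sketch below assumes the randomization space is given as a dictionary of per-parameter ranges; the parameter names and bounds are hypothetical, and the training and evaluation stages are omitted.

```python
import numpy as np

def sample_environment_params(randomization_space, rng):
    """Draw one environment configuration uniformly from per-parameter ranges."""
    return {
        name: rng.uniform(low, high)
        for name, (low, high) in randomization_space.items()
    }

# Hypothetical randomization space: parameter name -> (low, high)
randomization_space = {"friction": (0.5, 1.5), "motor_torque": (0.8, 1.2)}
rng = np.random.default_rng(0)

# One fresh configuration per training episode, as in uniform domain randomization
episodes = [sample_environment_params(randomization_space, rng) for _ in range(100)]
```

Every configuration is drawn independently of the agent's current performance, which is exactly the property a learned curriculum replaces.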

In contrast, a curriculum learning setup makes an assumption of optimality - that there exists an optimal sequencing of tasks that, when used to train an agent, generates the best possible agent. Our problem focuses on learning a curriculum of randomized environments in order to train a black-box RL policy πθ to generalize across a wide distribution of environments.

On the surface, the setup is similar to the BO problem formulation, with one small caveat: only the final policy πθ is evaluated for generalization. Thus, the maximization quantity is now the sequence of iterates, or environments, rather than just a single value. As the goal of curriculum learning is to find the sequence of tasks that generate the best agent after training, in a BO setting, each sequence of training environments would comprise a single datapoint.

If we take function f from the BO setting to be the deterministic evaluation of a parameterized policy πθ, and the series of task iterates {ξi} to be the environments which the policy πθ uses to train, we obtain the objective for what we refer to as Curriculum Learning for Reinforcement Learning (CLRL): find the tasks that, when used for training, maximize the return of the final policy.


2.1.2. Challenges of Bayesian Optimization for Curriculum Learning

Such a setup fits nicely within the BO application area of experiment design [49]. However, because curriculum learning is not permutation invariant with respect to training tasks, the BO framework would require the entire sequence of tasks to be optimized over. To get a single signal for the optimization step, such a setup requires the complete training of an RL agent per proposed curriculum. Since we can only evaluate final generalization to perform the BO update, every unique sequence of tasks leads to a different function evaluation. In other words, the experiment space scales combinatorially as a function of the BO budget (which, in the DR setting, can be thought of as the number of episodes).
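A rough count illustrates the scaling. Suppose, purely as an assumption for this example, that the randomization space is discretized into K candidate environments and that the budget is T training episodes. Because the order of tasks matters, the number of distinct curricula is

```latex
\[
  \bigl|\{(\xi_1, \dots, \xi_T) : \xi_i \in \{1, \dots, K\}\}\bigr| = K^{T},
  \qquad\text{e.g. } K = 10,\ T = 100 \ \Rightarrow\ K^{T} = 10^{100}.
\]
```

Each of these sequences would need its own complete training run to produce a single function evaluation, and the continuous randomization ranges used in practice only make the space larger.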

As BO in CLRL would require an inordinate amount of training cycles, we aim to relax the usual definitions of the objective function f and acquisition function a. Rather than selecting the entire curriculum for the agent at once, a more reasonable setup would be to pick each environment configuration ξi one at a time. Such a setup now allows for iterative adjustments to the curriculum, using feedback from each timestep to pick the next best environment.

With the iterative approach, we are left to find a suitable definition for functions f and a. A reasonable choice would be to take f as a function evaluation, except now a function evaluation of the current policy after training on the current environment ξi.

While practical, the switch comes at a large cost. By switching to a sequential optimization in the CLRL setting, we lose any analog of an acquisition function. In addition, due to the definition of f and the embedded optimization step of the agent πθ, the following properties no longer hold:

(1) Stationarity - Each particular policy gradient update changes the underlying function that is being maximized; the task that would have provided the most benefit to an agent at time t is not the same as the maximum-benefit task after a policy gradient update at time t + 1. In BO, the function is assumed to be stationary, and while methods have been proposed to combat noise and small evaluation errors of f [39], in general, the framework breaks down quickly without this assumption.

(2) Stochasticity - Up to experimental errors, evaluation of f in BO is generally considered to be deterministic. In the CLRL setting, evaluating f involves stochastically sampling a policy and estimating returns. Stochasticity in a BO setting slows learning, as more function evaluations are needed in order to reduce the uncertainty at any given point.

(3) Irreversibility - A bad function evaluation in the BO setting, while wasteful, does not affect future evaluations and can be safely disregarded. In the CLRL setting, a poor policy gradient step can introduce optimization difficulties that affect future steps; therefore, each step in a CLRL setting is “worth more”, and greater care must be taken.

Even though we no longer have one, an acquisition function would still be helpful in the CLRL setting. However, the function needs to be inherently robust to the issues above, while remaining a strong indicator for which environment to sample next. In the following sections, we show why the reinforcement learning paradigm is a more suitable approach to deal with the non-stationarity inherent in the CLRL problem, and introduce our own method, Active Domain Randomization.

2.2. Active Domain Randomization

The notion of optimal task sequencing in curriculum learning, discussed in Section 2.1.1, implies that at each step, there is a next-best environment for the agent to train on. However, outside of contrived or specific problem settings, finding these optimal environments at each step is difficult because (1) An intuitive task ordering of MDP instances or parameter ranges is rarely known beforehand and (2) Domain Randomization (DR) is used mostly when the space of randomized parameters is high-dimensional or noninterpretable.

Drawing analogies with the BO literature, one can consider the randomization space as a search space. Traditionally, in BO, the search for where to evaluate an objective is informed by acquisition functions, which trade off exploitation of the objective with exploration in the uncertain regions of the space [7]. However, as we saw in Section 2.1.2, unlike the stationary objectives seen in BO, training the agent policy renders our optimization non-stationary: the environment with the highest utility at time t is likely not the same as the maximum utility environment at time t + 1.

This requires us to redefine the notion of an acquisition function while simultaneously dealing with BO’s deficiencies with higher-dimensional inputs [87].


Algorithm 2 Active Domain Randomization

Input: Ξ: randomization space, S: simulator, ξref: reference parameters
Initialize πθ: agent policy, µφ: SVPG particles, Dψ: discriminator, Eref ← S(ξref): reference environment
while not max_timesteps do
    for each particle µφi do
        rollout ξi ∼ µφi(·)
    for each ξi do
        Ei ← S(ξi)    ▷ Generate randomized environment
        rollout τi ∼ πθ(·; Ei)    ▷ Rollout in randomized environment
        Trand ← Trand ∪ τi
        rollout τref ∼ πθ(·; Eref)
        Tref ← Tref ∪ τref
    for each τi ∈ Trand do    ▷ Calculate discriminative reward for env.
        Calculate ri = log Dψ(y | τi ∼ π(·; Ei))
    for each particle µφi with discriminative reward ri do
        Update acquisition function analog µφ using learned rewards ri    ▷ Eq. 1.1.11
    with Trand update:
        θ ← θ + ν∇θJ(πθ)    ▷ Agent policy gradient updates
    Update Dψ with τi and τref using SGD.

To this end, we propose ADR, summarized in Algorithm 2 and Figure 2.1. ADR provides a framework for manipulating a more general analog of an acquisition function, selecting the most informative MDPs for the agent within the randomization space.
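The discriminative reward used in the algorithm, the log-probability that the discriminator assigns to a trajectory, can be sketched as follows. Several things here are assumptions rather than details from the thesis: we take y to be the label "generated in a randomized environment", we summarize each trajectory by a fixed-length feature vector, and we use a linear logistic discriminator in place of a neural network.

```python
import numpy as np

def log_sigmoid(logit):
    """Numerically stable log(sigmoid(logit))."""
    return -np.logaddexp(0.0, -logit)

def discriminative_rewards(trajectory_features, weights, bias):
    """r_i = log D(y | tau_i) for a (hypothetical) linear logistic discriminator."""
    logits = trajectory_features @ weights + bias
    return log_sigmoid(logits)

# Two hypothetical trajectory summaries: the second looks less like the reference
features = np.array([[0.0, 0.0], [1.0, 1.0]])
rewards = discriminative_rewards(features, weights=np.array([1.0, 1.0]), bias=0.0)
```

Under these assumptions, environments whose rollouts are easy to tell apart from reference rollouts receive rewards closer to zero, so the particles are pushed towards parameter settings in which the agent behaves differently from how it behaves in the reference environment.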

By formulating the search as an RL problem, ADR learns a policy µφ where states are proposed randomization configurations ξ ∈ Ξ and actions are continuous changes to those parameters. While RL convergence guarantees do not hold in non-stationary MDPs, empirically, deep RL algorithms are suitable candidates to deal with non-stationary environments. In addition, framing the optimization as an RL problem allows us to more easily deal with the stochasticity of function evaluations, folding the stochasticity from the environment into the policy gradient update. RL allows us to remove the computational explosion described in Section 2.1.2, allowing us to select (and incorporate feedback from) a single training environment at each timestep, rather than optimizing for the entire curriculum all at once.

Our method takes as input an untrained agent policy π, a simulator S (which is a map from task parameters ξ to MDP environments E), a randomization space Ξ (see Section 1.1.3), and a reference environment, Eref. The reference environment is the default environment, often shipping with the original task definition.

We parameterize the policy µφ with SVPG, a particle-based policy gradient method that builds on Stein’s method. SVPG, as covered in Section 1.1.10, optimizes for diversity while also maximizing reward. Using SVPG for µφ allows us to both find high-value environments