A Coupling Approach to Rare Event Simulation via

Dynamic Importance Sampling

by

Benjamin Jiahong Zhang

B.S., Engineering Physics, B.A., Applied Mathematics,

University of California, Berkeley (2015)

Submitted to the Department of Aeronautics and Astronautics

in partial fulfillment of the requirements for the degree of

Master of Science in Aeronautics and Astronautics

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2017

© Massachusetts Institute of Technology 2017. All rights reserved.

Author . . . .

Department of Aeronautics and Astronautics

25 May 2017

Certified by . . . .

Youssef M. Marzouk

Associate Professor of Aeronautics and Astronautics

Thesis Supervisor

Accepted by . . . .

Youssef M. Marzouk

Associate Professor of Aeronautics and Astronautics

Chair, Graduate Program Committee


A Coupling Approach to Rare Event Simulation via Dynamic

Importance Sampling

by

Benjamin Jiahong Zhang

Submitted to the Department of Aeronautics and Astronautics on 25 May 2017, in partial fulfillment of the

requirements for the degree of

Master of Science in Aeronautics and Astronautics

Abstract

Rare event simulation involves using Monte Carlo methods to estimate probabilities of unlikely events and to understand the dynamics of a system conditioned on a rare event. An established class of algorithms based on large deviations theory and control theory constructs provably asymptotically efficient importance sampling estimators. Dynamic importance sampling is one of these algorithms, in which the choice of biasing distribution adapts over the course of a simulation according to the solution of an Isaacs partial differential equation or of a sequence of variational problems. However, obtaining the solution of either problem may be expensive; indeed, solving these problems may cost more than performing simple Monte Carlo exhaustively.

Deterministic couplings induced by transport maps allow one to relate a complex probability distribution of interest to a simple reference distribution (e.g., a standard Gaussian) through a monotone, invertible function. This diverts the complexity of the distribution of interest into the transport map. We extend the notion of transport maps between probability distributions on Euclidean space to probability distributions on path space, following a procedure similar to Itô's coupling.

The contraction principle is a key concept from large deviations theory that relates the large deviations principles of different systems through deterministic couplings. We show that, with the ability to computationally construct transport maps, we can leverage the contraction principle to reformulate the sequence of variational problems required to implement dynamic importance sampling and make the computation more tractable. We apply this approach to simple rotorcraft models.

We conclude by outlining future directions of research, such as using the coupling interpretation to accelerate rare event simulation via particle splitting, using transport maps to learn large deviations principles, and accelerating inference of rare events.

Thesis Supervisor: Youssef M. Marzouk


Acknowledgments

No graduate student is an island, and therefore I give thanks to those who have supported me in the course of my studies in the past two years. Thanks go first to my parents Yubao Zhang and Jie Wang, and my brother Eric Zhu for constant emotional and moral support.

I thank my advisor, Professor Youssef Marzouk, for taking a chance on me two years ago and for allowing me to be part of his research group. He has been extremely patient in allowing me to find my own way and make mistakes in research while providing invaluable guidance. Here is to hopefully more fruitful research years ahead!

I thank my lab mates in the Uncertainty Quantification group and in the ACDL for providing moral support, good humor, technical support, and interesting research conversations, among other things. Special thanks go out to Ricardo Baptista, Alex Feldstein, Chi Feng, Rémi Lam, Elizabeth Qian, Alessio Spantini, Zheng Wang, and Shun Zhang.

I thank my friends from outside the lab, Dan Fortunato, Yuan Kang Chen, and Stephanie Zhu, for providing good conversation, graduate school insights, and more good humor.

I thank our industry collaborators from the United Technologies Research Center, Dr. Tuhin Sahai and Dr. Byung-Young Min for providing technical help on the rotorcraft models, and for generally interesting conversations about the research side of industry. I also want to thank Professor Paul Dupuis from Brown University for allowing me to sit in on his rare events summer school in 2016, which greatly helped get my research going.

The author acknowledges financial support from the DARPA EQUiPS project. Thank you!


Contents

1 Introduction 9

2 Monte Carlo simulation and importance sampling 14

2.1 Simple Monte Carlo. . . 14

2.2 Importance sampling . . . 17

2.3 Notions of efficiency. . . 19

3 Large deviations theory 21

3.1 Motivation and laws of large numbers . . . 21

3.2 Cramér’s theorem and Varadhan’s lemma. . . 24

3.3 The contraction principle. . . 30

3.4 Sample path large deviations and Freidlin-Wentzell theory . . . 33

4 Dynamic importance sampling 37

4.1 Exponential tilting . . . 37

4.2 The optimal path approach: Open loop control . . . 39

4.3 The Isaacs’ equation: Closed loop control . . . 41

4.3.1 Control-theoretic formulation: the Isaacs equation . . . 43

4.3.2 Variational formulation: The Hamilton-Jacobi-Bellman equation . . . 45

4.4 Importance sampling of diffusions . . . 47

5 Path space couplings for rare event simulation 49

5.1 Couplings and transport maps . . . 49


5.2.1 Mappings between processes with i.i.d. increments. . . 53

5.2.2 Mappings to Markov processes. . . 55

5.3 A digression on Itô’s map . . . 59

5.4 Using path space couplings for dynamic importance sampling. . . 59

6 Numerical examples 62

6.1 A brief note on a special case: linear-quadratic control . . . 62

6.2 Simple example: Autoregressive process. . . 64

6.3 Simple rotorcraft system . . . 72

6.3.1 The model . . . 72

6.3.2 Dynamic importance sampling setup . . . 74

6.3.3 Numerical results . . . 76

6.4 The UTRC rotorcraft model . . . 81

6.4.1 The model . . . 81

6.4.2 Exploring rare events . . . 85

6.4.3 Dynamic importance sampling setup and results . . . 89

7 Conclusions and future work 97

7.1 Subsolutions via approximate transport maps for particle splitting . . . 98

7.2 Propagating noise in multilevel systems . . . 100

7.3 Learning large deviations principles . . . 103

7.4 Bringing computation to Itô’s coupling . . . 104


List of Figures

1-1 Concept map . . . 13

5-1 Illustrated graphic of maps between paths . . . 58

6-1 Optimal paths to the rare event for the AR(1) process . . . 68

6-2 AR(1) samples for n = 16 . . . 70

6-3 AR(1) samples for n = 32 . . . 70

6-4 AR(1) samples for n = 64 . . . 71

6-5 AR(1) samples for n = 128 . . . 71

6-6 Weighted samples for n = 16 (top left), n = 32 (top right), n = 64 (bottom left), n = 128 (bottom right) . . . 71

6-7 Simple helicopter model from [17] . . . 73

6-8 Deterministic behavior of the model. . . 74

6-9 The optimal path and control at t = 0 of the simple model. . . 77

6-10 A biased sample of simple model reaches rare event by t = 20.7 . . . 78

6-11 Unbiased samples of the simple model. . . 79

6-12 Biased samples of simple model . . . 79

6-13 Biased samples of simple model . . . 80


6-15 Simplified diagram of the variables considered in the UTRC rotorcraft model. Blue boxes represent physically computable quantities in flight. Pink boxes correspond to variables of interest in the fuselage model. Yellow ovals correspond to dynamic user inputs into the system, and orange ovals correspond to design variables that are fixed for a given rotorcraft. . . . 84

6-16 Deterministic evolution of θ_vehicle(t). . . . 86

6-17 Evolution of θ over time with noise in both the horizontal and vertical velocities. . . 87

6-18 Under noisier, larger variance wind conditions, the helicopter has larger oscillations. . . 88

6-19 The helicopter is less stable with a shorter tail and so has larger oscillations. . . . 88

6-20 A biased sample of the UTRC model reaches the rare event by time t = 8.6s. . . 92

6-21 Simple Monte Carlo samples of UTRC model. . . 93

6-22 Biased samples of the UTRC model. . . 94

6-23 Weighted biased samples . . . 95

7-1 Schematics of the UTRC rotorcraft model . . . 101


List of Tables

6.1 Performance of simple Monte Carlo for the autoregressive process . . 69

6.2 Performance of optimal path importance sampling for the autoregressive process . . . 69

6.3 Performance of dynamic importance sampling for the autoregressive process . . . 69

6.4 Simple rotorcraft example results . . . 81

6.5 Table describing all the UTRC model parameters . . . 85


Chapter 1

Introduction

Rare events matter. In designing engineering systems, rare events matter because catastrophic failures may lead to large economic losses or loss of life. In contemporary engineering design, one studies a system, its failure rates, and expectations sensitive to rare events via computational models. These quantities of interest are relevant in a variety of models in which the risk of failure may be tiny, but the cost of failure may be catastrophic. Common examples include the probability of insolvency of an insurance firm [2], the stalling and crashing of an airplane or rotorcraft in flight [37], the loss of information packets in communications engineering [4], and the mismatch between supply and demand of wind energy due to large variability in environmental conditions and energy usage [29]. Good engineering design makes failures rare, and studying them properly in computational models requires efficient rare event simulation. Rare event simulation involves estimating probabilities of unlikely events (usually smaller than $10^{-3}$, often much smaller [4]), estimating conditional expectations given that a rare event has occurred, and characterizing distributions and dynamics conditioned on a rare event. To accurately study rare events of an engineering system, one needs two ingredients. The first is a sufficiently faithful probabilistic model that can accurately exhibit the rare events of interest with the correct probabilities. The probabilistic model induces a probability space and distributions. The resulting probability distributions are often intractable to work with analytically, either because they cannot be easily expressed with canonical probability distributions, or because the distributions are high dimensional. For practical computations, the usual class of methods used to estimate expectations of interest is Monte Carlo simulation. However, for estimating expectations sensitive to rare events, Monte Carlo estimators require many samples to achieve a high-quality (low-variance) estimate. By definition, rare events do not occur in simulation very often. For example, when simulating an event whose probability is on the order of $10^{-6}$, one will on average need to generate one million samples to obtain a single realization of the rare event. With a high-fidelity model, in which the computational cost per sample is high, we may be hard pressed to retrieve even a few hundred samples.

The second key ingredient is methodology that retrieves samples from the rare event of interest more efficiently. State-of-the-art rare event simulation methodologies rely on variance reduction [2]. Importance sampling and particle splitting are among the most common variance reduction techniques for rare event simulation. Importance sampling produces higher quality samples by generating them from a different distribution under which the rare event of interest occurs more frequently [2, 31]. Each sample is then re-weighted according to its relative importance with respect to the original distribution. Particle splitting constructs a branching process to drive more samples to the rare event of interest [5, 35, 49]. One constructs a nested sequence of sets such that each set in the sequence is rarer than the previous one, and the smallest set is the rare event of interest. As samples reach the boundary of a set rarer than the one they currently occupy, the particles are allowed to split. Each particle is weighted according to how many times it splits. In this thesis, we will focus on importance sampling.

While there is a broad collection of methods for rare event simulation, e.g., conditional Monte Carlo [1, 2], the cross-entropy method [36], information-theoretic bounds [27], etc., one of the broadest and most unifying approaches leverages the theory of large deviations [6, 7, 8, 9, 47]. This theory provides a set of tools for a computer-less mathematician to study rare events, and provides alternative ways of comparing and quantifying the rarity of events without the exact probability values.


Large deviations theory has also provided a computational paradigm for constructing good importance sampling estimators for rare event simulation. For example, one of the ways the exponentially tilted biasing distribution was first introduced was through the proofs of Cramér's theorem. Later, it was found to also be an efficient importance sampling distribution for certain problems [39].

A weak convergence approach to the classical theory of large deviations provided a unifying framework for state-of-the-art rare event simulation methods [9]. The new methods for proving large deviations results again became useful in computational methods for rare event simulation. One of the key insights that allowed large deviations to provide a unifying framework for rare event simulation is a control-theoretic approach to the classical theory. The proofs provided computational methods for constructing good biasing distributions for importance sampling. Recent work in dynamic importance sampling [12] showed a way of finding good importance sampling distributions through a differential game-theoretic interpretation, which amounts to finding the solution of an Isaacs partial differential equation. For rare event simulation, this equation was shown to be equivalent to solving a Hamilton-Jacobi PDE or a sequence of variational problems. While many past methods were restricted to certain classes of problems, this control-theoretic approach to large deviations and rare event simulation admits a general framework that adapts easily to problems as broad as queues [10], small noise diffusions [46], and uniformly recurrent Markov chains [14]. At the same time, [5] showed that this approach also yields provably good level sets for particle splitting algorithms, thus unifying importance sampling and particle splitting methods. The theme here is clear: groundbreaking rare event simulation techniques leverage tools usually used to prove results in large deviations theory to create provably good Monte Carlo estimators. We will emulate this approach in this thesis.

Meanwhile, real-world engineering problems that require high-fidelity probabilistic computational models often require a large number of parameters to describe them. This necessitates exploring and sampling high dimensional probability distributions. One recent way of taming the complexity of high dimensional probability distributions is through the notion of couplings. While couplings have been studied thoroughly in mathematical theory [48], recent work [28, 40] has developed computational means for determining couplings between random variables induced by transport maps. That work showed that, with the ability to compute transport maps, one can retrieve independent samples from high dimensional, complex probability distributions. Given a reference distribution we can easily sample from (e.g., a standard normal) and a target distribution we want to sample from, we can construct a deterministic map such that sampling from the target distribution only requires pushing samples of the reference distribution forward through the transport map. Our goal is to determine ways in which transport maps can accelerate the construction of rare event simulation schemes such as dynamic importance sampling.

Thesis contributions: We attempt to elucidate the theory of large deviations and how its theoretical tools lead to efficient sampling of rare events via dynamic importance sampling. We bring two more key theoretical tools to the computational forefront. From large deviations theory, the contraction principle is commonly used to derive large deviations results for complex systems through couplings with simpler systems whose large deviations results are known. At the same time, the notion of deterministic couplings on path space, attributed to Itô [43], has been used as a pedagogical tool for understanding Kolmogorov's equations. We use these tools in a computational sense and propose ways of accelerating rare event simulation.

This thesis is organized as follows. Chapter 2 reviews basic notions and analyses of Monte Carlo simulation and importance sampling. Here, we also discuss the different notions of efficiency of importance sampling estimators. Chapter 3 provides an overview of results from large deviations theory, including the contraction principle, which serves as the theoretical basis for dynamic importance sampling. In chapter 4, we present the concept of exponential tilting before formulating dynamic importance sampling for discrete-time Markov chains as presented by [12]. We also discuss its associated Isaacs and Hamilton-Jacobi equations. We then present rare event simulation for small noise diffusions as presented by [46]. In chapter 5, we present notions of deterministic couplings manifested in the form of computable transport maps before extending them to transport maps on path space. We then present a reformulation of the dynamic importance sampling variational problem by leveraging the transport map framework via the contraction principle. Chapter 6 presents a few simple examples to show the validity of this approach, and then applies it to a real-world rotorcraft problem. We conclude the thesis in chapter 7 by outlining new horizons for investigation with these paradigms. We provide the following concept map to guide the reader on how these disparate topics are related to each other.

Figure 1-1: Concept map relating the topics of this thesis: rare event simulation, dynamic importance sampling, asymptotic efficiency, large deviations, exponential tilting, the Hamilton-Jacobi formulation (HJB equations and optimal paths), the new optimal control problem, the contraction principle, transport maps, and maps between paths.


Chapter 2

Monte Carlo simulation and importance sampling

We review some basic notions of Monte Carlo simulation and importance sampling to point out their uses and difficulties for rare event simulation. The discussion will motivate the development of later chapters. Throughout the rest of the thesis, we fix a probability space (Ω, F, P).

2.1 Simple Monte Carlo

Monte Carlo methods are a broad class of algorithms that estimate expectations of functionals via sampling and by leveraging laws of large numbers. To illustrate key ideas and difficulties of simple Monte Carlo and importance sampling, we consider the simple example of deviations of the empirical mean.

Let $\{X_i\}$ be a sequence of independent, identically distributed $\mathbb{R}^d$-valued random variables distributed according to a probability distribution $\eta$. Without loss of generality, assume $\mathbb{E}[X_i] = 0$. Consider the problem of estimating the probability that the empirical average of $n$ samples lies in some subset of $\mathbb{R}^d$ not containing the expected value of $X_i$. Let $E \subset \mathbb{R}^d$ with $\mathbb{E}[X_i] \notin E$, and define $S_n = X_1 + \cdots + X_n$. We wish to estimate
$$\rho_n = P\!\left( \frac{S_n}{n} \in E \right). \tag{2.1}$$

Note that this defines a sequence of estimation problems indexed by $n$. Since the set $E$ does not contain the expected value of $X_i$, the law of large numbers states that the empirical mean $\frac{S_n}{n}$ tends to a value in $E$ with probability zero as $n \to \infty$. The task of estimating $\rho_n$ thus becomes a rare event simulation problem as $n$ grows large.

To estimate this sequence of probabilities via simple Monte Carlo simulation, we take $k$ batches of $n$ samples, compute the mean of each batch, count the number of batches whose empirical mean lies in the set of interest, and divide by the total number of batches:
$$\rho_n \approx \hat{\rho}_n^{\text{simple}} = \frac{1}{k} \sum_{i=1}^{k} \mathbf{1}_E\!\left( \frac{S_n^i}{n} \right), \qquad S_n^i = X_1^i + \cdots + X_n^i, \qquad X_j^i \sim \eta. \tag{2.2}$$
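As a quick numerical illustration (not part of the original text), the following sketch implements the batch estimator (2.2) for standard normal increments and the set $E = [0.5, \infty)$, which is away from $\mathbb{E}[X_i] = 0$; the threshold and sample counts are chosen only for this example.

```python
import random

# Minimal sketch of estimator (2.2): k batches of n standard normal samples,
# counting batches whose empirical mean lands in E = [0.5, infinity).
random.seed(3)

n, k = 25, 100_000
threshold = 0.5  # true rho_n = P(N(0, 1/n) >= 0.5) = P(N(0,1) >= 2.5), about 6.2e-3

hits = 0
for _ in range(k):
    batch_mean = sum(random.gauss(0.0, 1.0) for _ in range(n)) / n
    if batch_mean >= threshold:
        hits += 1

rho_hat = hits / k  # simple Monte Carlo estimate of rho_n
print(rho_hat)
```

Even for this only mildly rare event, roughly one batch in 160 lands in $E$; the estimator spends almost all of its samples away from the event of interest.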

To evaluate the quality of the estimator, we compute its variance. Let $x = (x_1, \ldots, x_n)$, and define
$$\mathbf{1}_E(x) = \begin{cases} 1 & \text{if } \tfrac{1}{n} \sum_{i=1}^{n} x_i \in E \\ 0 & \text{otherwise.} \end{cases}$$
We have
$$\begin{aligned} k \,\mathrm{Var}[\hat{\rho}_n^{\text{simple}}] &= \int [\mathbf{1}_E(x) - \rho_n]^2 \prod_{i=1}^{n} \eta(x_i) \, dx \\ &= \int [\mathbf{1}_E(x) - 2 \rho_n \mathbf{1}_E(x) + \rho_n^2] \prod_{i=1}^{n} \eta(x_i) \, dx \\ &= \int \mathbf{1}_E(x) \prod_{i=1}^{n} \eta(x_i) \, dx - \rho_n^2 = \rho_n - \rho_n^2. \end{aligned}$$


Therefore,
$$\mathrm{Var}[\hat{\rho}_n^{\text{simple}}] = \frac{\rho_n - \rho_n^2}{k}. \tag{2.3}$$

Clearly, having an estimator with a small variance is ideal. However, having an estimator with a small absolute variance is often meaningless for rare event simulation, as the desired probability can be much smaller than the variance. Instead, the relative error is often a better metric for comparing estimators [4]. The relative error is defined as the ratio between the standard error and the true value of the probability of interest. In rare event simulation, $\rho_n \ll 1$, so $\rho_n - \rho_n^2 \approx \rho_n$, and we have
$$e = \frac{\sqrt{\mathrm{Var}[\hat{\rho}_n^{\text{simple}}]}}{\rho_n} \approx \sqrt{\frac{1}{\rho_n k}}. \tag{2.4}$$

Therefore, to keep the relative error even below one, we would need $O(1/\rho_n)$ samples, which is problematic when $\rho_n \ll 1$.
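The $1/\sqrt{\rho_n k}$ scaling in (2.4) can be checked empirically. The sketch below (an illustration, not from the original text) estimates a Bernoulli event with known probability $\rho = 0.01$ repeatedly and measures the empirical relative error for two sample sizes; quadrupling $k$ should roughly halve it.

```python
import math
import random

random.seed(0)

rho = 0.01  # a mildly rare event with known probability

def relative_error(k: int, trials: int = 200) -> float:
    """Empirical relative standard error of the simple MC estimator over repeated runs."""
    estimates = []
    for _ in range(trials):
        hits = sum(1 for _ in range(k) if random.random() < rho)
        estimates.append(hits / k)
    mean = sum(estimates) / trials
    var = sum((e - mean) ** 2 for e in estimates) / (trials - 1)
    return math.sqrt(var) / rho

e1, e2 = relative_error(1_000), relative_error(4_000)
# Predictions from (2.4): 1/sqrt(0.01 * 1000) ~ 0.316 and 1/sqrt(0.01 * 4000) ~ 0.158
print(e1, e2)
```

The measured errors track the predicted values closely, confirming that halving the relative error costs four times as many samples.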

Furthermore, to keep the relative error of the estimator constant as $n \to \infty$ (as the event becomes rarer), the number of samples we need grows on the order of $k \sim 1/\rho_n$. As we will see in Chapter 3, events that obey what is called a large deviations principle asymptotically require exponential growth in the number of samples for a fixed relative error. For problems where obtaining samples is expensive, Monte Carlo simulation becomes computationally intractable. In fact, if a random variable $X$ has a light-tailed distribution (which for now we take to mean that it has finite exponential moments), then we will have

$$\lim_{n \to \infty} \frac{1}{n} \log \rho_n = -\gamma$$
for some $\gamma > 0$. This implies that $\rho_n \approx \exp(-n\gamma)$, so the number of samples needed for a fixed relative error grows exponentially with $n$ as $k \sim \exp(n\gamma)$. This growth worsens as $\gamma$ grows, and we will see that larger values of $\gamma$ roughly correspond to rarer events.


2.2 Importance sampling

Importance sampling is one of many methods used for variance reduction in Monte Carlo estimation [2]. For our example problem, the key issue that leads to the high variance of a simple Monte Carlo estimator is the lack of samples that reach the event of interest. Importance sampling confronts this problem by generating more samples from the region of interest. Rather than sampling from the original nominal distribution $\eta$, we sample from an importance sampling or biasing distribution, which is chosen such that more samples are generated from the event of interest. Each sample is then weighted according to its likelihood with respect to the nominal distribution. We construct an importance sampling estimator to compute $\rho_n$ as in problem (2.1).

We choose a biasing distribution $\pi$ such that, over the region of interest $E$, the support of the nominal product density is contained in the support of the biased one, that is, $\mathrm{supp}\left( \mathbf{1}_E(x) \prod_{i=1}^{n} \eta(x_i) \right) \subset \mathrm{supp}\left( \mathbf{1}_E(x) \prod_{i=1}^{n} \pi(x_i) \right)$. We can then construct the estimator
$$\hat{\rho}_n^{\mathrm{IS}} = \frac{1}{k} \sum_{i=1}^{k} \mathbf{1}_E\!\left( \frac{\tilde{S}_n^i}{n} \right) \prod_{j=1}^{n} \frac{\eta(\tilde{X}_j^i)}{\pi(\tilde{X}_j^i)}, \qquad \tilde{S}_n^i = \tilde{X}_1^i + \cdots + \tilde{X}_n^i, \qquad \tilde{X}_j^i \sim \pi. \tag{2.5}$$

Notice that instead of sampling from the nominal distribution $\eta$, we now sample from $\pi$ and then re-weight each sample according to its likelihood with respect to the nominal distribution. This amounts to taking a weighted average of the generated samples. This estimator is unbiased, since
$$\mathbb{E}_\pi[\hat{\rho}_n^{\mathrm{IS}}] = \frac{1}{k} \sum_{i=1}^{k} \int \mathbf{1}_E(x) \prod_{j=1}^{n} \frac{\eta(x_j)}{\pi(x_j)} \prod_{j=1}^{n} \pi(x_j) \, dx = \mathbb{E}_\eta[\mathbf{1}_E(x)] = \rho_n. \tag{2.6}$$


We see that
$$\begin{aligned} \mathrm{Var}[\hat{\rho}_n^{\mathrm{IS}}] &= \int \left( \mathbf{1}_E(x) \prod_{i=1}^{n} \frac{\eta(x_i)}{\pi(x_i)} - \rho_n \right)^2 \prod_{i=1}^{n} \pi(x_i) \, dx \\ &= \int \mathbf{1}_E(x) \left( \prod_{i=1}^{n} \frac{\eta(x_i)}{\pi(x_i)} \right)^2 \prod_{i=1}^{n} \pi(x_i) \, dx - \rho_n^2. \tag{2.7} \end{aligned}$$
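As a concrete illustration of how the biasing choice drives the variance (a hypothetical sketch, not an example from the original text), the snippet below estimates the Gaussian tail probability $P(X \geq a)$ for $X \sim N(0,1)$ using a mean-shifted Gaussian $\pi = N(a, 1)$ as the biasing distribution; the likelihood ratio $\eta/\pi$ then has the closed form $\exp(-ax + a^2/2)$.

```python
import math
import random

random.seed(1)

a = 4.0
k = 20_000
rho_true = 0.5 * math.erfc(a / math.sqrt(2))  # exact P(N(0,1) >= 4), about 3.17e-5

def weight(x: float) -> float:
    # Likelihood ratio eta(x)/pi(x) for eta = N(0,1), pi = N(a,1)
    return math.exp(-a * x + 0.5 * a * a)

samples = [random.gauss(a, 1.0) for _ in range(k)]        # draw from the biasing distribution
vals = [weight(x) if x >= a else 0.0 for x in samples]    # weighted indicator per sample

rho_is = sum(vals) / k
# Per-sample variance of the weighted indicator; compare to rho(1-rho) for simple MC.
var_is = sum((v - rho_is) ** 2 for v in vals) / (k - 1)
print(rho_is, rho_true, var_is)
```

With 20,000 biased samples the estimate lands within a few percent of the true value, whereas a simple Monte Carlo run of the same size would on average see fewer than one hit.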

We can see that the variance of an importance sampling estimator is contingent on the choice of biasing distribution $\pi$. This choice is crucial, as not all valid importance sampling distributions lead to a smaller variance than simple Monte Carlo; some may cause importance sampling to perform worse [31]. Since the value of $\rho_n$ is independent of the choice of biasing distribution, we can formulate an optimization problem in which we search the space of all probability distributions on $\mathbb{R}^d$ for a biasing distribution that minimizes the second moment of the estimator:
$$\pi^*(x) = \operatorname*{arg\,min}_{\pi \in \mathcal{P}(\mathbb{R}^d)} \mathbb{E}_\pi\!\left[ \left( \mathbf{1}_E(x) \prod_{i=1}^{n} \frac{\eta(x_i)}{\pi(x_i)} \right)^2 \right]. \tag{2.8}$$

In fact, this problem has a closed-form solution, called the optimal biasing distribution, for which the variance of the importance sampling estimator is identically zero. Rather than taking the biasing distribution to be a product of distributions on $\mathbb{R}^d$, consider a distribution on $\mathbb{R}^{dn}$ for all $n$ samples at once. Let $\pi^*(x) = \frac{1}{\rho_n} \mathbf{1}_E(x) \prod_{i=1}^{n} \eta(x_i)$; then
$$\mathrm{Var}[\hat{\rho}_n^{\mathrm{IS}}] = \int \left( \rho_n \frac{\mathbf{1}_E(x) \prod_{i=1}^{n} \eta(x_i)}{\mathbf{1}_E(x) \prod_{i=1}^{n} \eta(x_i)} \right)^2 \pi^*(x) \, dx - \rho_n^2 = 0.$$
What the optimal biasing distribution is doing is placing samples exactly where they need to be.

From a computational standpoint, this solution is not as useful as it seems, since sampling from this distribution requires knowledge of the probability of interest! If we were able to sample from this distribution, a single sample would suffice. Some approaches to importance sampling, such as the cross-entropy method [36], aim to approximate this distribution directly. Given a parametrized family of probability distributions, the cross-entropy method finds the distribution in the family closest to the optimal biasing distribution according to the Kullback-Leibler divergence.

Definition 2.2.1. Let η and π be two equivalent probability measures. Then the Kullback-Leibler divergence or relative entropy between the distributions is

$$D_{\mathrm{KL}}(\pi \,\|\, \eta) = \int \log\!\left( \frac{d\pi}{d\eta} \right) \pi(dx).$$
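The divergence in Definition 2.2.1 can be estimated by Monte Carlo, since it is an expectation of $\log(d\pi/d\eta)$ under $\pi$. The sketch below (an illustrative check, not from the original text) does this for $\pi = N(m, 1)$ and $\eta = N(0, 1)$, where the closed form is $D_{\mathrm{KL}}(\pi \| \eta) = m^2/2$.

```python
import math
import random

random.seed(2)

m = 1.5
k = 200_000

def log_density_ratio(x: float) -> float:
    # log(dpi/deta)(x) for pi = N(m,1), eta = N(0,1):
    # -(x - m)^2/2 + x^2/2 = m*x - m^2/2
    return m * x - 0.5 * m * m

# Average log(dpi/deta) over samples drawn from pi
dkl_mc = sum(log_density_ratio(random.gauss(m, 1.0)) for _ in range(k)) / k
dkl_exact = 0.5 * m * m
print(dkl_mc, dkl_exact)
```

The Monte Carlo estimate agrees with the closed form to within sampling error, which illustrates that the divergence is computable whenever the density ratio is.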

While the optimal biasing distribution may not be computationally useful, we can consider what it tells us: knowledge of the rarity of the event of interest helps construct biasing distributions for importance sampling. In one extreme, if we know the probability exactly, then we can construct a perfect biasing distribution. In the other extreme, simple Monte Carlo uses no such knowledge and performs terribly. Something in the middle, where we can obtain some other information about $\rho_n$ while still being able to sample from something computationally feasible, is ideal. The cross-entropy method is one attempt to find this middle ground. Another approach [38] estimates the sample sizes needed to make importance sampling successful through various divergences between the optimal biasing distribution and a proposed importance sampling distribution. The answer we will pursue is large deviations theory, which provides an alternative way of comparing the rarity of two rare events without directly computing their probabilities.

2.3 Notions of efficiency

We will define some notions from [2] for describing the quality of an estimator. Given some estimator $Z_n$ for computing $\rho_n$ (such as the importance sampling estimator (2.5)), the variance of $Z_n$ is
$$\mathrm{Var}[Z_n] = \mathbb{E}[Z_n^2] - \rho_n^2.$$

Roughly, an estimator is better if its second moment tracks the behavior of the square of the probability of interest. Of course, as discussed earlier, when the second moment is exactly equal to the square of the probability of interest, we have the optimal biasing distribution. However, there are other ways for the second moment and $\rho_n^2$ to be related to each other. Recall that $n$ actually indexes a sequence of estimation problems, so we can examine the asymptotic behavior of the sequence of estimators $Z_n$ as it relates to $\rho_n$.

Definition 2.3.1. We say an estimator $Z_n$ is strongly efficient if
$$\lim_{n \to \infty} \frac{\mathbb{E}[Z_n^2]}{\rho_n^2} = 1.$$
An estimator that is strongly efficient will have its relative variance tend to zero as $n$ goes to infinity.

Definition 2.3.2. We say an estimator $Z_n$ is asymptotically, weakly, or logarithmically efficient if
$$\lim_{n \to \infty} \frac{\log \mathbb{E}[Z_n^2]}{2 \log \rho_n} = 1.$$
This definition is motivated by large deviations theory. If an estimator is asymptotically efficient, then the rate of decay of its second moment equals twice the rate of decay of $\rho_n$. For a weakly efficient estimator, the number of samples needed to keep the relative error constant grows sub-exponentially as $n \to \infty$. We will judge estimators by this criterion, which will allow us to present a broad class of efficient importance sampling estimators under a single framework.


Chapter 3

Large deviations theory

In the previous chapter, we found two reasons for studying the theory of large deviations. First, we found that using simple Monte Carlo to estimate probabilities of events that obey a large deviations principle, where the probability decays exponentially fast to zero, asymptotically requires exponential growth in the number of samples to keep the relative error of the estimator constant. Second, we concluded that having knowledge about the "rarity" of the event of interest can aid in choosing a biasing distribution for importance sampling. Large deviations theory allows one to compare the rarity of different events without knowing the exact probabilities; it instead compares the rates at which probabilities decay to zero under the law of large numbers. We will see that knowledge of these decay rates helps find good biasing distributions. We first formally define what it means for the distribution of an $\mathbb{R}^d$-valued random variable to have a light tail. Let $\mathcal{P}(\mathbb{R}^d)$ be the space of probability distributions on $\mathbb{R}^d$.

Definition 3.0.1. Let $X \sim \mu$. The distribution $\mu \in \mathcal{P}(\mathbb{R}^d)$ is light-tailed if its moment generating function satisfies $M(\alpha) = \mathbb{E}_\mu[\exp(\langle \alpha, X \rangle)] < \infty$ for all $\alpha \in \mathbb{R}^d$.

3.1 Motivation and laws of large numbers

The classical way of motivating large deviations theory is by examining the questions that the core theorems of probability theory aim to answer. Given a collection of independent, identically distributed (i.i.d.) samples from some common probability distribution, what can we say about their aggregate behavior? The laws of large numbers describe how the empirical mean of a collection of samples behaves as the size of the collection grows. Central limit theorems describe how the probability that the empirical mean deviates around the expected value behaves. What is missing is a theory describing how the probability that the empirical mean deviates far away from the expected value behaves.

One can also view large deviations theory as a further refinement of the laws of large numbers. From the law of large numbers, we know that, given i.i.d. samples from some probability distribution and under mild conditions, the empirical mean converges to the expected value almost surely [15].

Theorem 3.1.1 (Strong law of large numbers). Let $X$ be an $\mathbb{R}^d$-valued random variable such that $\mu = \mathbb{E}[X] < \infty$. Then
$$P\!\left( \lim_{n \to \infty} \frac{1}{n}(X_1 + \cdots + X_n) = \mu \right) = 1.$$

Alternatively, we may take a different but equivalent version of the law of large numbers. Namely that the empirical mean will converge to anything but the expected value with probability zero.

Theorem 3.1.2 (Strong law of large numbers). If $\mathbb{E}[X] < \infty$ and $\mathbb{E}[X] \notin A \subset \mathbb{R}^d$, then
$$P\!\left( \lim_{n \to \infty} \frac{1}{n}(X_1 + \cdots + X_n) \in A \right) = 0.$$

However, the question we ask now is: how fast does the probability approach zero? For a certain class of distributions, the answer is given by large deviations theory. A key concept is the large deviations principle (LDP) and its corresponding rate function, which describes the exponential decay rate of probabilities of regions far away from the expected value.

Before diving into the technical details of large deviations theory, let us see if we can deduce the behavior of the tail probabilities of empirical means using only analytical tools. The following simple example is adapted from [6]. Let $\{X_i\}$ be a sequence of standard normal random variables and define $S_n = X_1 + \cdots + X_n$. Consider the tail probabilities $\mathbb{P}(S_n/n \geq a)$ for some $a > 0$. Since $S_n/\sqrt{n}$ is standard normal, we can obtain an upper bound for the tail probabilities:
\[ \mathbb{P}\left(\frac{S_n}{n} \geq a\right) = \int_{a\sqrt{n}}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right) dx \leq \int_{a\sqrt{n}}^{\infty} \frac{1}{\sqrt{2\pi}}\, \frac{x}{a\sqrt{n}} \exp\left(-\frac{x^2}{2}\right) dx = \frac{\exp\left(-\frac{a^2 n}{2}\right)}{\sqrt{2\pi}\, a\sqrt{n}}. \]

We can also find a lower bound by applying integration by parts twice:
\[
\begin{aligned}
\mathbb{P}\left(\frac{S_n}{n} \geq a\right)
&= \int_{a\sqrt{n}}^{\infty} \frac{x}{x}\, \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right) dx \\
&= \left[-\frac{1}{x\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)\right]_{a\sqrt{n}}^{\infty} - \frac{1}{\sqrt{2\pi}} \int_{a\sqrt{n}}^{\infty} \frac{\exp\left(-\frac{x^2}{2}\right)}{x^2}\, dx \\
&= \frac{\exp\left(-\frac{a^2 n}{2}\right)}{\sqrt{2\pi}\, a\sqrt{n}} - \frac{1}{\sqrt{2\pi}} \int_{a\sqrt{n}}^{\infty} \frac{x \exp\left(-\frac{x^2}{2}\right)}{x^3}\, dx \\
&= \frac{\exp\left(-\frac{a^2 n}{2}\right)}{\sqrt{2\pi}\, a\sqrt{n}} + \left[\frac{\exp\left(-\frac{x^2}{2}\right)}{\sqrt{2\pi}\, x^3}\right]_{a\sqrt{n}}^{\infty} + \frac{3}{\sqrt{2\pi}} \int_{a\sqrt{n}}^{\infty} \frac{\exp\left(-\frac{x^2}{2}\right)}{x^4}\, dx \\
&= \frac{\exp\left(-\frac{a^2 n}{2}\right)}{\sqrt{2\pi}\, a\sqrt{n}} - \frac{\exp\left(-\frac{a^2 n}{2}\right)}{\sqrt{2\pi}\, a^3 n^{3/2}} + \frac{3}{\sqrt{2\pi}} \int_{a\sqrt{n}}^{\infty} \frac{\exp\left(-\frac{x^2}{2}\right)}{x^4}\, dx \\
&\geq \frac{\exp\left(-\frac{a^2 n}{2}\right)}{\sqrt{2\pi}} \left(\frac{1}{a\sqrt{n}} - \frac{1}{a^3 n^{3/2}}\right).
\end{aligned}
\]

From these estimates, we can see that the tail probability goes to zero at least exponentially fast in $n$. To find the asymptotic rate, we take the logarithm of the bounds, divide by $n$, and take the limit. Since $f(x) = \log(x)$ is a monotone function, we have
\[ -\frac{1}{n}\log\sqrt{2\pi} + \frac{1}{n}\log\left(\frac{1}{a\sqrt{n}} - \frac{1}{a^3 n^{3/2}}\right) - \frac{a^2}{2} \leq \frac{1}{n}\log \mathbb{P}\left(\frac{S_n}{n} \geq a\right) \leq -\frac{1}{n}\log\sqrt{2\pi} + \frac{1}{n}\log\left(\frac{1}{a\sqrt{n}}\right) - \frac{a^2}{2}. \]
Taking the limit on both sides and invoking the sandwich theorem, we arrive at
\[ \lim_{n\to\infty} \frac{1}{n}\log \mathbb{P}\left(\frac{S_n}{n} \geq a\right) = -\frac{a^2}{2}. \]

Estimates similar to the one above are standard in large deviations theory. Clearly, this is a rather cumbersome way of characterizing the rate of decay of the tail probabilities, and it is restricted to empirical averages of normal random variables. The question now is whether this type of decay rate characterization exists for other distributions, and whether there is a general way of finding the rate. In the next section, we discuss Cramér's theorem, which answers both of these questions.
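Since the Gaussian tail probability has a closed form, the limit above is easy to check numerically. The short sketch below (plain Python; the function name is ours) evaluates $\frac{1}{n}\log\mathbb{P}(S_n/n \geq a)$ via `math.erfc`, which remains accurate far in the tail, and watches it approach $-a^2/2 = -0.5$ for $a = 1$:

```python
import math

a = 1.0

def log_tail_rate(n):
    # P(S_n/n >= a) = P(N(0,1) >= a*sqrt(n)); erfc is numerically stable in the tail
    p = 0.5 * math.erfc(a * math.sqrt(n) / math.sqrt(2.0))
    return math.log(p) / n

rates = [log_tail_rate(n) for n in (10, 100, 1000)]
# rates increases monotonically towards the large deviations limit -a^2/2 = -0.5
```

The slow, $O(\log n / n)$ approach to the limit visible here is exactly the polynomial prefactor $1/(a\sqrt{2\pi n})$ in the bounds above.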

In the following sections, the setting in which we describe large deviations theory will change depending on the concept: in certain sections we work on general complete separable metric spaces, while in others we restrict ourselves to $\mathbb{R}$ or $\mathbb{R}^d$.

3.2 Cramér's theorem and Varadhan's lemma

We define some terms before describing some of the major results of large deviations theory. Many of these results are adapted from [6, 7]. Let $\mathcal{X}$ be a complete separable metric space.

Definition 3.2.1. A rate function is a mapping $I : \mathcal{X} \to [0,\infty]$ such that for any $a < \infty$, $I^{-1}([0,a]) = \{x \in \mathcal{X} : I(x) \leq a\}$ is a compact subset of $\mathcal{X}$.

Definition 3.2.2. A sequence of $\mathcal{X}$--valued random variables $\{A_n\}$ has a large deviations principle with rate function $I$ if for any open set $O \subset \mathcal{X}$,
\[ \liminf_{n\to\infty} \frac{1}{n}\log \mathbb{P}(A_n \in O) \geq -\inf_{x\in O} I(x) \]
and for any closed set $C \subset \mathcal{X}$,
\[ \limsup_{n\to\infty} \frac{1}{n}\log \mathbb{P}(A_n \in C) \leq -\inf_{x\in C} I(x). \]

If we restrict ourselves to $\mathbb{R}$--valued random variables and sets in the form of intervals, we can write large deviations principles in the following form, as the infima over an open set and its closure are equal:
\[ \lim_{n\to\infty} \frac{1}{n}\log \mathbb{P}(A_n \in [a,b]) = -\inf_{x\in[a,b]} I(x). \]

A large deviations rate function is convex and equals zero only at $x = \mathbb{E}[A_n]$ [6]. Intuitively, a large deviations principle states that the probability that the sequence of random variables lies in some set approaches zero exponentially fast, with a rate that depends on the rate function and the set of interest. If a sequence of $\mathbb{R}^d$--valued random variables $\{X_n\}$ obeys a large deviations principle, then we can make the following approximation for simple subsets $A \subset \mathbb{R}^d$:
\[ \mathbb{P}(X_n \in A) \approx \exp\left(-n \inf_{x\in A} I(x)\right). \]
While this approximation is not accurate in absolute terms, it is useful for relative comparisons of the probabilities of different events. Since $I(x)$ is convex and zero only at the expected value, $I(x)$ is larger for regions far away from the expected value. Given two events, we can compare their "rarities" by comparing the infima of $I(x)$ over each region.

In the simple example of section 3.1, we derived the rate at which the tail probabilities of the empirical mean of standard normal random variables decay as $n \to \infty$. Cramér's theorem generalizes this to other random variables, including giving conditions under which a rate function exists. We state and prove a special case of Cramér's theorem for large deviations of empirical averages, to highlight the parts of the proof that will be useful for importance sampling; we then state the general result in $\mathbb{R}^d$. The following theorem is from [7].

Definition 3.2.3. Let $X$ be a random variable with distribution $\mu$. The cumulant generating function is the logarithm of the moment generating function,
\[ H(\alpha) = \log \mathbb{E}_\mu[\exp\langle \alpha, X\rangle]. \]

Definition 3.2.4. Let $f : \mathbb{R}^d \to \mathbb{R}$. The Legendre--Fenchel transform of $f$ is
\[ f^*(p) = \sup_{x\in\mathbb{R}^d} [\langle x, p\rangle - f(x)]. \]

We state without proof that if $f(x)$ is a convex function, then the Legendre--Fenchel transform of $f^*(p)$ recovers $f(x)$.
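The Legendre--Fenchel transform is easy to approximate by taking the supremum over a grid, which also lets one verify the duality numerically. A minimal sketch (our own toy check) for the standard Gaussian cumulant generating function $H(\alpha) = \alpha^2/2$, whose transform is $L(\beta) = \beta^2/2$:

```python
def legendre(f, p, xs):
    """Grid approximation of the Legendre-Fenchel transform f*(p) = sup_x [p*x - f(x)]."""
    return max(p * x - f(x) for x in xs)

xs = [-10.0 + 0.05 * i for i in range(401)]   # grid on [-10, 10]
H = lambda alpha: 0.5 * alpha * alpha         # Gaussian cumulant generating function
L = lambda beta: legendre(H, beta, xs)        # rate function; should equal beta^2/2
H2 = lambda alpha: legendre(L, alpha, xs)     # double transform recovers H (convexity)
```

Here `L(1.5)` returns `1.125` and `H2(2.0)` returns `H(2.0) = 2.0`, illustrating the duality stated above.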

Theorem 3.2.1 (Cramér's theorem for large deviations from the mean). Let $\{X_i\}$ be a sequence of i.i.d. $\mathbb{R}$--valued random variables with distribution $\mu$ and cumulant generating function $H(\alpha) = \log \mathbb{E}_\mu[e^{\alpha X}]$. Define $S_n = \sum_{i=1}^n X_i$ and $I(\beta) = \sup_{\alpha\in\mathbb{R}}[\alpha\beta - H(\alpha)]$. Then for all $a > \mathbb{E}[X_i]$,
\[ \lim_{n\to\infty} \frac{1}{n}\log \mathbb{P}\left(\frac{S_n}{n} \geq a\right) = -I(a). \]

Proof. (Adapted from [7].) Without loss of generality, assume $\mathbb{E}[X_i] = 0$. Define $M(\alpha) = \mathbb{E}_\mu[e^{\alpha X}]$. Observe that $H(0) = 0$ and $H'(\alpha) = M'(\alpha)/M(\alpha)$, so $H'(0) = M'(0) = \mathbb{E}[X] = 0$. To maximize the quantity $a\alpha - H(\alpha)$, we must have $H'(\alpha) = a$. Since $H(\alpha)$ is convex, $H''(\alpha) > 0$ for all $\alpha\in\mathbb{R}$, so if there is an $\alpha^*$ such that $H'(\alpha^*) = a > 0$, then since $H'(0) = 0$ we must have $\alpha^* > 0$. Observe that by Markov's inequality [15], for $\alpha > 0$,
\[ \mathbb{P}(S_n \geq na) = \mathbb{P}(e^{\alpha S_n} \geq e^{\alpha n a}) \leq e^{-\alpha n a}\,\mathbb{E}[e^{\alpha S_n}] = e^{-\alpha n a}\left(\mathbb{E}[e^{\alpha X_1}]\right)^n. \]
Taking the logarithm and dividing by $n$, we have

\[ \frac{1}{n}\log \mathbb{P}(S_n \geq na) \leq -\alpha a + \log \mathbb{E}[e^{\alpha X_1}] = -(\alpha a - H(\alpha)). \]
Since the bound above holds for any $\alpha > 0$ and any $n > 0$, and the right-hand side is minimized at $\alpha^* > 0$,
\[ \limsup_{n\to\infty} \frac{1}{n}\log \mathbb{P}(S_n \geq na) \leq -\sup_{\alpha>0}[\alpha a - H(\alpha)] = -\sup_{\alpha\in\mathbb{R}}[\alpha a - H(\alpha)] = -I(a). \tag{3.1} \]

For the lower bound, we use a technique called exponential tilting. Choose $\alpha^*$ as in the discussion above, and define a sequence of i.i.d. random variables $\widetilde{X}_i$ with distribution
\[ \mu_{\alpha^*}(dx) = \exp(\alpha^* x - H(\alpha^*))\,\mu(dx). \]
We claim that $\mathbb{E}[\widetilde{X}_i] = a$. The moment generating function of $\widetilde{X}$ is
\[ M_{\alpha^*}(\alpha) = \mathbb{E}_{\mu_{\alpha^*}}[\exp(\alpha \widetilde{X}_i)] = \int_{\mathbb{R}} \exp(\alpha^* x - H(\alpha^*)) \exp(\alpha x)\,\mu(dx) = \frac{1}{M(\alpha^*)}\,\mathbb{E}_\mu[\exp((\alpha+\alpha^*)X_i)]. \]
Taking the derivative with respect to $\alpha$ and evaluating it at zero, we have
\[ \mathbb{E}_{\mu_{\alpha^*}}[\widetilde{X}_i] = M'_{\alpha^*}(0) = \frac{1}{M(\alpha^*)}\,\mathbb{E}_\mu[X_i \exp(\alpha^* X_i)] = \frac{M'(\alpha^*)}{M(\alpha^*)} = H'(\alpha^*) = a. \]

Similarly, the variance of $\widetilde{X}_i$ can be obtained from the second derivative of the moment generating function $M_{\alpha^*}(\alpha)$ evaluated at zero,
\[ M''_{\alpha^*}(0) = \frac{1}{M(\alpha^*)}\,\mathbb{E}_\mu[X_i^2 \exp(\alpha^* X_i)] = \frac{M''(\alpha^*)}{M(\alpha^*)}. \]
Since the moment generating function is smooth and finite for all $\alpha$, $M''(\alpha^*) < \infty$, so $\widetilde{\sigma}^2 = \mathrm{Var}[\widetilde{X}_i] < \infty$. With this, we can apply the central limit theorem to the sequence $\{\widetilde{X}_i\}$. Before we do that, observe that by defining $\widetilde{S}_n = \sum_{i=1}^n \widetilde{X}_i$, we have
\[
\begin{aligned}
\mathbb{P}(S_n \geq na) &= \int_{\{x_1+\cdots+x_n \geq na\}} \mu(dx_1)\cdots\mu(dx_n) \\
&= \int_{\{x_1+\cdots+x_n \geq na\}} \left[e^{-\alpha^* x_1 + H(\alpha^*)}\right]\mu_{\alpha^*}(dx_1)\cdots\left[e^{-\alpha^* x_n + H(\alpha^*)}\right]\mu_{\alpha^*}(dx_n) \\
&= [M(\alpha^*)]^n\, \mathbb{E}_{\mu_{\alpha^*}}\left[e^{-\alpha^* \widetilde{S}_n}\, \mathbf{1}_{\{\widetilde{S}_n \geq na\}}\right]. \tag{3.2}
\end{aligned}
\]

Now we apply the central limit theorem. Choose $C > 0$ such that
\[ \frac{1}{\sqrt{2\pi}}\int_0^C \exp\left(-\frac{x^2}{2}\right) dx > \frac{1}{4} \]
and observe that
\[ \mathbb{E}\left[e^{-\alpha^* \widetilde{S}_n}\,\mathbf{1}_{\{\widetilde{S}_n \geq na\}}\right] \geq e^{-\alpha^*(\widetilde{\sigma}\sqrt{n}\,C + na)}\, \mathbb{P}\left[\frac{\widetilde{S}_n - na}{\widetilde{\sigma}\sqrt{n}} \in [0,C)\right]. \]
Using the result from (3.2), taking the logarithm, and dividing by $n$, we get
\[ \frac{1}{n}\log \mathbb{P}(S_n \geq na) = \frac{1}{n}\log \mathbb{E}\left[e^{-\alpha^*\widetilde{S}_n}\,\mathbf{1}_{\{\widetilde{S}_n \geq na\}}\right] + \log M(\alpha^*) \geq -\frac{\alpha^* C\widetilde{\sigma}}{\sqrt{n}} - a\alpha^* + \log M(\alpha^*) + \frac{1}{n}\log \mathbb{P}\left[\frac{\widetilde{S}_n - na}{\widetilde{\sigma}\sqrt{n}} \in [0,C)\right]. \]

By the central limit theorem, for sufficiently large $n$ the probability on the right-hand side is at least $1/4$, so the last term vanishes in the limit and
\[ \liminf_{n\to\infty} \frac{1}{n}\log \mathbb{P}(S_n \geq na) \geq -(a\alpha^* - H(\alpha^*)). \]
Recalling how $\alpha^*$ was chosen at the beginning of the proof, we see that
\[ \liminf_{n\to\infty} \frac{1}{n}\log \mathbb{P}(S_n \geq na) \geq -\sup_{\alpha\in\mathbb{R}}[a\alpha - H(\alpha)] = -I(a). \tag{3.3} \]
Combining (3.1) and (3.3), we get the desired result.

Notice that the difficult part was proving the lower bound of the large deviations principle, which involved shifting probability mass to the set of interest via an exponentially tilted distribution. The interesting takeaway from this proof is that the change of measure used in the proof turns out to be a good importance sampling distribution; we will discuss this more in the next chapter. For completeness, we state the full Cramér theorem.

Theorem 3.2.2 (Cramér). Let $\{X_i\}$ be a sequence of i.i.d. $\mathbb{R}^d$--valued random variables with cumulant generating function $H(\alpha) = \log\mathbb{E}_\mu[\exp\langle\alpha, X\rangle]$. Define $S_n = \sum_{i=1}^n X_i$ and $I(\beta) = \sup_{\alpha\in\mathbb{R}^d}[\langle\alpha,\beta\rangle - H(\alpha)]$. Then $\frac{S_n}{n}$ satisfies an LDP with rate function $I(\beta)$. That is, for any open set $O \subset \mathbb{R}^d$,
\[ \liminf_{n\to\infty}\frac{1}{n}\log\mathbb{P}\left(\frac{S_n}{n}\in O\right) \geq -\inf_{x\in O} I(x) \]
and for any closed set $C \subset \mathbb{R}^d$,
\[ \limsup_{n\to\infty}\frac{1}{n}\log\mathbb{P}\left(\frac{S_n}{n}\in C\right) \leq -\inf_{x\in C} I(x). \]

With Cramér's theorem, one can roughly recover the law of large numbers. Observe that if the expected value $\mathbb{E}[X] \in E \subset \mathbb{R}^d$, then we have
\[ -\inf_{x\in E^\circ} I(x) \leq \liminf_{n\to\infty}\frac{1}{n}\log\mathbb{P}\left(\frac{S_n}{n}\in E^\circ\right) \leq \limsup_{n\to\infty}\frac{1}{n}\log\mathbb{P}\left(\frac{S_n}{n}\in \bar{E}\right) \leq -\inf_{x\in\bar{E}} I(x). \]
Since $\mathbb{E}[X]\in E$ implies $\inf_{x\in E^\circ} I(x) = \inf_{x\in\bar{E}} I(x) = 0$, we have
\[ \lim_{n\to\infty}\frac{1}{n}\log\mathbb{P}\left(\frac{S_n}{n}\in E\right) = 0; \]
that is, $\mathbb{P}(S_n/n \in E)$ does not decay exponentially, consistent with the law of large numbers. The takeaway is that Cramér's theorem states that large deviations estimates for empirical means of i.i.d. random variables exist for light-tailed distributions, and that the rate function can be computed as the Legendre transform of the cumulant generating function. Extensions of Cramér's theorem to sequences of random variables that are not i.i.d., a result called the Gärtner--Ellis theorem, can be found in most standard large deviations texts [6, 7, 8, 9, 47].

Varadhan's lemma generalizes Cramér's theorem to exponential expectations; under certain conditions, one can recover Cramér's theorem from it. The theorem is important for importance sampling of exponential expectations, where larger values of the integrand lie in regions of smaller probability.

Theorem 3.2.3 (Varadhan's Lemma). Let $\{X_n\}$ be a sequence of $\mathcal{X}$--valued random variables satisfying an LDP with rate function $I$. Let $h : \mathcal{X} \to \mathbb{R}$ be a bounded continuous function. Then
\[ \lim_{n\to\infty}\frac{1}{n}\log \mathbb{E}[\exp(-n\, h(X_n))] = -\inf_{x\in\mathcal{X}}[I(x) + h(x)]. \]

Notice that if $X_n$ were a sequence of empirical means and $h(x)$ were defined as
\[ h(x) = \begin{cases} 0 & \text{if } x \in E \\ \infty & \text{if } x \notin E, \end{cases} \]
then we would formally recover Cramér's theorem (such an $h$ is of course not bounded and continuous).

3.3 The contraction principle

One of the most important results of large deviations theory is the contraction principle, which is often used as a proof technique for finding LDPs of complicated sequences of random variables by coupling them to simpler sequences with known LDPs.

Given a sequence of random variables obeying an LDP, the contraction principle shows that another sequence of random variables obeys an LDP if there exists a continuous mapping between the two sequences. Furthermore, it gives a formula for the large deviations rate function of the second sequence in terms of the original rate function.

Theorem 3.3.1 (Contraction Principle). Let $\mathcal{X}, \mathcal{Y}$ be complete separable metric spaces. Let $\{A_n\}$ be a sequence of $\mathcal{X}$--valued random variables satisfying an LDP with rate function $I$. Let $G : \mathcal{X} \to \mathcal{Y}$ be a continuous mapping between the metric spaces and let $B_n = G(A_n)$ be the resulting sequence of $\mathcal{Y}$--valued random variables. Define
\[ \tilde{I}(y) = \inf_{x\in\mathcal{X}}\{I(x) : x \in G^{-1}(\{y\})\}. \]
Then $\tilde{I}$ is a rate function on $\mathcal{Y}$ and $\{B_n\}$ has an LDP with rate function $\tilde{I}$.

What the theorem intuitively says is that large deviations principles are preserved under continuous mappings. We will use this theorem in a way that is amenable to computation in later chapters. The proof is routine, but we include it for completeness.

Proof. (Adapted from [6, 7].) To show that $\tilde{I}$ is a rate function we must show that for all $a < \infty$, $\tilde{I}^{-1}([0,a])$ is compact. Observe that
\[ \tilde{I}^{-1}([0,a]) = \{y\in\mathcal{Y} : \tilde{I}(y) \leq a\} = \left\{y\in\mathcal{Y} : \inf_{x\in G^{-1}(\{y\})} I(x) \leq a\right\}. \]
Note that $G$ is continuous, $\{y\}$ is a closed set, and since $I$ is a rate function, $I^{-1}([0,a])$ is compact. Then $G^{-1}(\{y\})$ is closed since it is the preimage of a closed set under a continuous map. The intersection of a closed set and a compact set is compact, so the set over which we are searching for the infimum of $I(x)$ is compact whenever it is nonempty. Then by the extreme value theorem, there exists an $x \in G^{-1}(\{y\})\cap I^{-1}([0,a])$ at which the infimum is attained. This implies that
\[ \tilde{I}^{-1}([0,a]) = G(I^{-1}([0,a])). \]
Since $\tilde{I}^{-1}([0,a])$ is the continuous image of a compact set, it is also compact. Therefore, $\tilde{I}$ is a rate function.

Next we show that the image of the sequence under the map $G$ satisfies a large deviations principle with rate function $\tilde{I}$. For the lower bound, we see that for any open set $O\subset\mathcal{Y}$ (so that $G^{-1}(O)$ is open),
\[ \liminf_{n\to\infty}\frac{1}{n}\log\mathbb{P}[B_n\in O] = \liminf_{n\to\infty}\frac{1}{n}\log\mathbb{P}[A_n\in G^{-1}(O)] \geq -\inf_{x\in G^{-1}(O)} I(x) = -\inf_{y\in O}\tilde{I}(y). \]
Likewise, for the upper bound, we see that for any closed set $C\subset\mathcal{Y}$ (so that $G^{-1}(C)$ is closed),
\[ \limsup_{n\to\infty}\frac{1}{n}\log\mathbb{P}[B_n\in C] = \limsup_{n\to\infty}\frac{1}{n}\log\mathbb{P}[A_n\in G^{-1}(C)] \leq -\inf_{x\in G^{-1}(C)} I(x) = -\inf_{y\in C}\tilde{I}(y), \]
and we have the desired result.

This theorem will be crucial to our approach to rare event simulation, in which we will explicitly construct the continuous maps appearing in its hypothesis.


3.4 Sample path large deviations and Freidlin-Wentzell theory

Cramér's and Varadhan's theorems describe large deviations of sequences of probability measures on finite-dimensional spaces. Here we discuss the extension to probability measures on infinite-dimensional spaces, namely path space. Mogulskii's theorem provides an alternative view of Cramér's theorem: Cramér's theorem describes large deviations of empirical means without looking at how each sample contributes to the empirical mean, while Mogulskii's theorem takes the individual increments into account as well, roughly saying that the path an empirical mean takes is also important.

Definition 3.4.1. A function $f$ is absolutely continuous on $[0,t_f]$ if there exists a Lebesgue integrable function $g$ such that for all $t\in[0,t_f]$,
\[ f(t) = f(0) + \int_0^t g(s)\,ds. \]
We write $f\in AC[0,t_f]$.

Theorem 3.4.1 (Mogulskii). Let $X_i \sim \eta$ be i.i.d. $\mathbb{R}^d$--valued random variables with a finite moment generating function. Let $H(\alpha) = \log\mathbb{E}_\eta[\exp(\langle\alpha,X\rangle)]$ be the cumulant generating function and let $L(\beta) = \sup_{\alpha\in\mathbb{R}^d}[\langle\alpha,\beta\rangle - H(\alpha)]$ be the local large deviations rate function. Define $S_k = X_1+\cdots+X_k$ and construct the interpolated process
\[ Y_n(t) = \frac{S_{\lfloor nt\rfloor}}{n} + \left(t - \frac{\lfloor nt\rfloor}{n}\right) X_{\lfloor nt\rfloor + 1}. \]
Then for any closed set $F\subset AC[0,t_f]$,
\[ \limsup_{n\to\infty}\frac{1}{n}\log\mathbb{P}(Y_n(\cdot)\in F) \leq -\inf\{I(\varphi(\cdot)) : \varphi(\cdot)\in F,\ \varphi(0)=0\} \]
and for any open set $G\subset AC[0,t_f]$,
\[ \liminf_{n\to\infty}\frac{1}{n}\log\mathbb{P}(Y_n(\cdot)\in G) \geq -\inf\{I(\varphi(\cdot)) : \varphi(\cdot)\in G,\ \varphi(0)=0\}, \]
where
\[ I(\varphi(\cdot)) = \int_0^{t_f} L(\dot{\varphi}(s))\,ds. \]

Mogulskii's theorem states that the beginning and endpoint of a sample path are not the only parts that contribute to the rarity of a path; what the path does in between matters as well. As an example, let us look at sample path large deviations of random walks with standard normal increments $X_i\sim N(0,1)$. Applying Mogulskii's theorem, we have
\[ H(\alpha) = \frac{1}{2}\alpha^2, \qquad L(\beta) = \sup_{\alpha\in\mathbb{R}}\left[\alpha\beta - \frac{1}{2}\alpha^2\right] = \frac{1}{2}\beta^2, \qquad I(x(\cdot)) = \int_0^1 \frac{1}{2}\dot{x}(s)^2\,ds. \]

Intuitively, this says that a sample path that moves faster is rarer. We now discuss large deviations of stochastic differential equations; the following results are from [6, 18]. Freidlin--Wentzell theory describes large deviations for Itô diffusions, that is, large deviations for stochastic differential equations as the magnitude of the diffusion term goes to zero. We first describe large deviations of Brownian motion.

Theorem 3.4.2 (Schilder). Let $W^\epsilon(t) = \sqrt{\epsilon}\,W(t)$, where $W(t)$ is standard Brownian motion on $\mathbb{R}^d$ with $W(0) = 0$. Let $\mathbb{P}^\epsilon$ be the probability measure of $W^\epsilon(t)$, so that $\mathbb{P}^1$ is the classical Wiener measure [15]. Let $C_0([0,t_f])$ be the space of continuous functions on $[0,t_f]$ starting at zero, and define the rate function $I : C_0([0,t_f])\to[0,\infty]$ by
\[ I(\varphi(\cdot)) = \int_0^{t_f} \frac{1}{2}\|\dot{\varphi}(s)\|^2\,ds. \]
Then the family $\{W^\epsilon\}$ satisfies an LDP with rate function $I$. That is, for any open set $O\subset C_0([0,t_f])$,
\[ \liminf_{\epsilon\to 0}\ \epsilon\log\mathbb{P}(W^\epsilon\in O) \geq -\inf_{\varphi\in O} I(\varphi(\cdot)) \]
and for any closed set $C\subset C_0([0,t_f])$,
\[ \limsup_{\epsilon\to 0}\ \epsilon\log\mathbb{P}(W^\epsilon\in C) \leq -\inf_{\varphi\in C} I(\varphi(\cdot)). \]

The path of a Brownian motion with no noise is the constant path that stays flat at zero. What Schilder's theorem says is that the probability that the scaled Brownian motion lies in a particular set of continuous paths decays exponentially as $\epsilon\to 0$, with the rate being the integrated squared derivative of the least rare path in the set.
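For the one-dimensional event $\{\sup_{t\leq 1} W^\epsilon_t \geq 1\}$ this can be checked in closed form: the reflection principle gives $\mathbb{P}(\sup_{t\leq 1}\sqrt{\epsilon}\,W_t \geq 1) = 2\,\mathbb{P}(N(0,1)\geq 1/\sqrt{\epsilon})$, while the cheapest path from $0$ to $1$ in unit time has action $\int_0^1 \frac{1}{2}\dot{\varphi}^2\,ds = \frac{1}{2}$. The sketch below (our own check, not from the references) confirms that $\epsilon\log\mathbb{P}$ approaches $-1/2$:

```python
import math

def eps_log_p(eps):
    # Reflection principle: P(sup_{t<=1} sqrt(eps) W_t >= 1) = 2 P(N(0,1) >= 1/sqrt(eps)),
    # and 2 * 0.5 * erfc(x / sqrt(2)) = erfc(x / sqrt(2)) with x = 1/sqrt(eps)
    p = math.erfc(1.0 / math.sqrt(eps) / math.sqrt(2.0))
    return eps * math.log(p)

vals = [eps_log_p(e) for e in (0.1, 0.01, 0.001)]
# vals increases monotonically towards the Schilder rate -1/2
```

The discrepancy at moderate $\epsilon$ again comes from the polynomial prefactor in the Gaussian tail, which the LDP scaling ignores.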

Let $X_t^\epsilon$ be a diffusion process on $\mathbb{R}^d$ with uniformly Lipschitz drift term $b:\mathbb{R}^d\to\mathbb{R}^d$, diffusion term $\sigma:\mathbb{R}^d\to\mathbb{R}^{d\times d}$, and deterministic initial condition $X_0^\epsilon = x_0$. Let $\epsilon$ be a noise parameter and let $W_t$ be standard Brownian motion on $\mathbb{R}^d$. We quantify the large deviations of the following system:
\[ dX_t^\epsilon = b(X_t^\epsilon)\,dt + \sqrt{\epsilon}\,\sigma(X_t^\epsilon)\,dW_t, \qquad X_0^\epsilon = x_0. \]
Large deviations of Itô diffusions are obtained via the contraction principle: we can view realizations of a stochastic differential equation as the image of a sample path of Brownian motion under the dynamics. If the dynamics vary continuously with the state, then the contraction principle applies, and one obtains large deviations of the diffusion process. The result is the Freidlin--Wentzell theorem.

Theorem 3.4.3 (Freidlin--Wentzell). Define a diffusion process as above and let $a(x) = \sigma(x)\sigma(x)^T$. Then the diffusion satisfies a large deviations principle on $C_0([0,t_f])$ with rate function
\[ I(\varphi(\cdot)) = \frac{1}{2}\int_0^{t_f} (\dot{\varphi}(s) - b(\varphi(s)))^T a^{-1}(\varphi(s))(\dot{\varphi}(s) - b(\varphi(s)))\,ds. \]

That is, for any open set $O\subset C_0([0,t_f])$,
\[ \liminf_{\epsilon\to 0}\ \epsilon\log\mathbb{P}(X^\epsilon\in O) \geq -\inf_{\varphi\in O} I(\varphi(\cdot)) \]
and for any closed set $C\subset C_0([0,t_f])$,
\[ \limsup_{\epsilon\to 0}\ \epsilon\log\mathbb{P}(X^\epsilon\in C) \leq -\inf_{\varphi\in C} I(\varphi(\cdot)). \]

Chapter 4

Dynamic importance sampling

We synthesize the ideas from the theory of large deviations to construct efficient and computationally amenable importance sampling estimators. We first discuss the optimal biasing distribution through the lenses of large deviations theory and information theory, which lets us arrive at the exponentially tilted distribution in an alternative way. We then discuss dynamic importance sampling for discrete-time Markov chains. Afterwards, we present its extension to importance sampling of diffusions, including a theorem from stochastic calculus called Girsanov's theorem, an important tool for generalizing importance sampling to stochastic differential equations.

4.1 Exponential tilting

Suppose we have an $\mathbb{R}^d$--valued random variable $X$ distributed according to a nominal distribution $\eta$ with $\mathbb{E}[X] = 0$. We wish to find a biasing distribution $\pi$, corresponding to a random variable $\widetilde{X}$, with expected value $\mathbb{E}_\pi[\widetilde{X}] = a$. Assuming that the moment generating function of $X$ is finite, the exponentially tilted biasing distribution achieves this goal in an optimal way.

We have already seen one way of arriving at the exponentially tilted distribution: in the proof of Cramér's theorem, the exponentially tilted distribution was used to prove the large deviations lower bound by biasing the nominal distribution towards the event of interest. It was later found to yield a good importance sampling estimator as well, producing asymptotically efficient estimators for certain simple rare event simulation problems [39].

Here we consider another way of arriving at the exponentially tilted biasing distribution, via information-theoretic concepts. Recall the relative entropy, or Kullback--Leibler (KL) divergence, between two probability distributions $\nu$ and $\mu$:
\[ D_{KL}(\nu\|\mu) = \mathbb{E}_\nu\left[\log\frac{d\nu}{d\mu}\right] = \int \log\left(\frac{d\nu}{d\mu}\right) d\nu. \]

One can arrive at the exponentially tilted biasing distribution from the solution of the following variational problem [44].

Theorem 4.1.1. Suppose there exists αa ∈ Rd such that ∇H(αa) = a. Then the

optimization problem inf ν∈P(Rd) n DKL(νkµ) : Eν( eX) = a o

has unique solution given by ν∗(dx) = µ

αa(dx) = exp[hαa, Xi − H(αa)]µ(dx).

In other words, given a random variable, and the desire to bias the random variable towards a prescribed expected value, the distribution that minimizes the relative entropy while achieving those constraints will uniquely be an exponentially tilted distribution of µ.

For practical computations, one finds the exponentially tilted distribution through the following recipe. Let $X$ be a random variable with density $\eta(x)$ and finite cumulant generating function $H(\alpha)$. Suppose we wish to produce an exponentially tilted biasing distribution with expected value $a$. We first solve the equation $\nabla H(\alpha) = a$ for $\alpha$. The exponentially tilted density is then $\eta_\alpha(x) = \eta(x)\exp[\langle\alpha, x\rangle - H(\alpha)]$.
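A minimal sketch of this recipe, assuming a standard normal nominal density (for which $H(\alpha) = \alpha^2/2$, $H'(\alpha) = \alpha$, and the tilted density is $N(a,1)$). Since $H$ is convex, $H'$ is increasing and $H'(\alpha) = a$ can be solved by bisection; a simple Riemann sum then confirms that the tilted mean equals $a$:

```python
import math

def solve_tilt(h_prime, a, lo=-50.0, hi=50.0):
    # H is convex, so H' is increasing: bisection solves H'(alpha) = a
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if h_prime(mid) < a:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

a = 2.0
alpha = solve_tilt(lambda t: t, a)   # N(0,1): H(alpha) = alpha^2/2, so H'(alpha) = alpha

def eta_alpha(x):
    eta = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)   # nominal density
    return eta * math.exp(alpha * x - 0.5 * alpha * alpha)    # tilted density

# Riemann-sum check that the tilted mean equals a
xs = [-10.0 + 0.001 * i for i in range(20001)]
mean = sum(x * eta_alpha(x) for x in xs) * 0.001
```

For non-Gaussian nominals, only `h_prime` and `eta_alpha` change; the bisection step carries over unchanged as long as the moment generating function is finite.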

Now that we know how to compare the rarity of sample paths with large deviations, and how to bias the expectation of a random variable through exponential tilting, we can put these together to construct importance sampling estimators. Exponential tilting makes it easy to bias a nominal distribution towards some desired expected value, but it does not tell us which expected value to bias towards. To answer this question, we must see how exponential tilting and large deviations are related. Consider the following theorem, which describes the duality between the large deviations rate function and the KL divergence.

Theorem 4.1.2. Suppose that $H(\alpha) < \infty$ for all $\alpha\in\mathbb{R}^d$. Then for any $\beta\in\mathbb{R}^d$,
\[ L(\beta) = \inf_{\nu\in\mathcal{P}(\mathbb{R}^d)}\left\{D_{KL}(\nu\|\mu) : \int_{\mathbb{R}^d} y\,\nu(dy) = \beta\right\}. \]

Let $X_i\sim\mu$ be i.i.d., and let $\mu$ have a finite cumulant generating function. We wish to compute the probability that the sample average of $n$ samples from $\mu$ lies in a set $E\subset\mathbb{R}^d$,
\[ \rho_n = \mathbb{P}\left(\frac{S_n}{n}\in E\right), \tag{4.1} \]
via importance sampling with an exponentially tilted biasing distribution. We want to find a biasing distribution $\mu_\alpha$ that minimizes the KL divergence to the nominal distribution while having its expected value in the rare event of interest. That is, we bias towards $\beta^*$ such that
\[ \beta^* = \arg\inf_{\beta\in E}\ \inf_{\nu\in\mathcal{P}(\mathbb{R}^d)}\left[D_{KL}(\nu\|\mu) : \mathbb{E}_\nu[\widetilde{X}] = \beta\right] = \arg\inf_{\beta\in E} L(\beta). \]
By Theorem 4.1.2, this is equivalent to finding the exponentially tilted biasing distribution whose expected value is the minimizer of the large deviations rate function over the set of interest. In other words, we bias towards the least rare point of the set $E$ according to its large deviations rate function.
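A sketch of the resulting estimator on a toy problem of our own choosing: $X_i\sim N(0,1)$ and $E = [a,\infty)$, so $L(\beta) = \beta^2/2$ is minimized over $E$ at $\beta^* = a$ and the tilted distribution is $N(a,1)$. With $n = 30$ and $a = 1$, the target probability is of order $10^{-8}$, far beyond plain Monte Carlo at this sample budget, yet the tilted estimator matches the exact value closely:

```python
import math
import random

random.seed(1)
n, a, N = 30, 1.0, 20000
alpha = a                       # for N(0,1), L is minimized over [a, inf) at beta* = a
H = 0.5 * alpha * alpha         # cumulant generating function at alpha

total = 0.0
for _ in range(N):
    s = sum(random.gauss(a, 1.0) for _ in range(n))   # sample from the tilted law N(a, 1)
    if s / n >= a:                                    # indicator of the rare event
        total += math.exp(-alpha * s + n * H)         # likelihood ratio d(mu^n)/d(mu_alpha^n)
est = total / N

exact = 0.5 * math.erfc(a * math.sqrt(n) / math.sqrt(2.0))  # P(N(0,1) >= a*sqrt(n))
```

Note that the likelihood ratio depends only on the sum $\widetilde{S}_n$, a consequence of tilting every increment by the same $\alpha$.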

4.2 The optimal path approach: Open loop control

The optimal path approach to rare event simulation is a generalization of finding a good importance sampling estimator for large deviations of the empirical mean. Rather than looking only at the large deviations principle that comes from Cramér's theorem, we consider the sample mean as a path, and look at the large deviations principle that comes from Mogulskii's theorem.

Consider the random walk generated by the partial sums of $S_n$, made continuous through linear interpolation as in the hypothesis of Mogulskii's theorem. The general heuristic for deciding where to bias the walk is the principle that "rare events occur in the most likely way"; this heuristic is based on comments in [7]. It leads to an optimization problem similar to that of the previous section: we bias the simulation so that the path taken is, on average, the most likely path into the region of interest. We solve a variational problem to find the optimal path, and the importance sampling scheme biases each step towards that path. To elucidate the reasoning, let $X_i$ be i.i.d. standard normal random variables, and recall that the local rate function of the standard normal is $L(\beta) = \frac{1}{2}\beta^2$. To find the optimal path to bias towards, we solve
\[ \min_\varphi\left\{\int_0^1 \frac{1}{2}\dot{\varphi}(s)^2\,ds : \varphi(0) = 0,\ \varphi(1) = a\right\}. \]
Using the Euler--Lagrange equations, one obtains the boundary value problem
\[ \ddot{\varphi}(s) = 0, \qquad \varphi(0) = 0,\ \varphi(1) = a, \]
which has solution $\varphi(t) = at$ for $t\in[0,1]$. Therefore, the optimal path to bias towards is a straight line whose endpoint is in the region of interest. To obtain where to bias the discrete increments of the random walk, we take the derivative of the optimal path and evaluate it at integer multiples of $\frac{1}{n}$. In this example, we bias each increment so that $\mathbb{E}_{\mu_\alpha}[\widetilde{X}_i] = a$. Intuitively, this makes sense: suppose one were given $n$ steps to walk from $0$ to $na$. Since each step is taken i.i.d., it makes no sense to bias any one step more than another; one would bias each step equally, which is exactly what large deviations tells us to do.

To generalize this example, suppose instead that $X_i\sim\eta$ i.i.d., $i = 1,\ldots,n$, with $\mathbb{E}[X_i] = 0$ and a finite cumulant generating function $H(\alpha)$. Define $L(\beta)$ as before (the Legendre--Fenchel transform of $H(\alpha)$). Using Mogulskii's theorem, we obtain the large deviations rate function
\[ I(\varphi(\cdot)) = \int_0^{t_f} L(\dot{\varphi}(s))\,ds. \]
The resulting variational problem is
\[ \min_\varphi\left\{\int_0^1 L(\dot{\varphi}(s))\,ds : \varphi(0) = 0,\ \varphi(1) = a\right\}. \]
Once we find the optimal path $\varphi^*(t)$, we construct biased random variables $\widetilde{X}_k$ such that
\[ \mathbb{E}_{\mu_{\alpha_k}}[\widetilde{X}_k] = \dot{\varphi}^*\!\left(\frac{k}{n}\right). \]
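For the Gaussian case $L(\beta) = \beta^2/2$, the discretized action $\sum_j \frac{1}{n} L\big(n(x_{j+1}-x_j)\big)$ is a quadratic in the interior path nodes, and simple coordinate descent (each node moves to the average of its neighbors, the exact one-dimensional minimizer) recovers the straight-line optimal path. A small sketch, with a deliberately curved initial guess:

```python
n, a = 20, 2.0
x = [a * (i / n) ** 2 for i in range(n + 1)]   # curved initial guess; endpoints fixed
for _ in range(5000):
    for i in range(1, n):
        # exact minimizer of (x_i - x_{i-1})^2 + (x_{i+1} - x_i)^2 over x_i
        x[i] = 0.5 * (x[i - 1] + x[i + 1])
# x now approximates the optimal path phi(t) = a*t sampled at t = i/n
```

For non-quadratic $L$ the same coordinate-descent structure applies with a one-dimensional solve per node, though for state-independent $L$ convexity already implies the straight line is optimal.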

In some cases the optimal path approach leads to an asymptotically efficient importance sampling scheme. However, it is only a heuristic, and there exist simple examples for which it performs poorly [19]. One example is when the rare event of interest is the union of two disjoint sets: biasing towards a single optimal path will likely ignore samples that would reach the other part of the set. Because this is a state-independent approach to importance sampling, it does not yield generally provably efficient importance sampling estimators.

4.3 The Isaacs equation: Closed loop control

The problem with the naïve exponential tilting algorithm [39] and the optimal path approach to rare event simulation is that they fail to perform well in scenarios where the rare event of interest is separated into multiple regions.

A natural response to this problem is to ask: what if we were to change the path we bias towards in a dynamic fashion, that is, find the optimal path at every step of the simulation? What [12] found is that by writing down a dynamic programming problem, minimizing the second moment of an importance sampling estimator whose biasing distributions are exponentially tilted, one arrives at exactly this conclusion.

The approach in the remainder of this section is based on [12]. Suppose we are (still) estimating, via importance sampling, the probability that the empirical mean of $n$ samples of some random variable reaches the rare event of interest. When studying importance sampling, we found that to minimize the variance, one has to choose a biasing distribution that makes the second moment as small as possible. The requirement that the second moment be minimal, known as strong efficiency, is hard to attain, so we opt for something weaker: we want the second moment to have an exponential decay rate as close as possible to twice the decay rate of the quantity of interest (the probability of the rare event, or the exponential expectation). We state a theorem from [12] saying that a sequence of biasing distributions, or controls, minimizing the second moment has decay rate equal to twice the rate of the quantity of interest, which is exactly what we want.

Let $X$ be an $\mathbb{R}^d$--valued random variable distributed according to a probability distribution $\mu$. Assume, without loss of generality, that $\mathbb{E}[X] = 0$ and that $H(\alpha) = \log\mathbb{E}[\exp(\langle\alpha,X\rangle)]$ is finite for all $\alpha\in\mathbb{R}^d$. We estimate, via Monte Carlo, exponential expectations
\[ q_n = \mathbb{E}\left[\exp\left(-nF\left(\frac{S_n}{n}\right)\right)\right] \]
for bounded and continuous $F:\mathbb{R}^d\to\mathbb{R}$, where $S_n = X_1+\cdots+X_n$. A special case is the computation of the probability that the empirical average is in some set $A \subset \mathbb{R}^d$ that does not contain the expectation. Observe that if we let
\[ F(x) = \begin{cases} 0 & \text{if } x\in A \\ \infty & \text{if } x\in A^c, \end{cases} \]
then $\exp[-nF(x)] = \mathbf{1}_{x\in A}$. The reason we work with exponential integrals is that the control-theoretic machinery requires $F(x)$ to be bounded and continuous; however, this does not prevent us from applying the method to computing rare event probabilities. By Varadhan's lemma, the quantity $q_n$ has decay rate $\inf_{x\in\mathbb{R}^d}[I(x)+F(x)]$.

We estimate these quantities via importance sampling using exponential tilting; that is, we sample from distributions of the form $\mu_\alpha(dx) = \exp[\langle\alpha,x\rangle - H(\alpha)]\,\mu(dx)$. We consider a dynamic change of measure, parametrized as a function $\alpha(x,t):\mathbb{R}^d\times[0,1]\to\mathbb{R}^d$. The task at hand is to find $\alpha(x,t)$ such that the corresponding importance sampling estimator is asymptotically efficient. The estimator can be written
\[ Z_n = \exp\left[-nF\left(\frac{\widetilde{S}_n}{n}\right)\right]\exp\left[-\sum_{j=0}^{n-1}\left(\left\langle\alpha\left(\frac{\widetilde{S}_j}{n},\frac{j}{n}\right), \widetilde{X}_{j+1}\right\rangle - H\left(\alpha\left(\frac{\widetilde{S}_j}{n},\frac{j}{n}\right)\right)\right)\right], \]
where
\[ \alpha_j^n = \alpha\left(\frac{\widetilde{S}_j}{n},\frac{j}{n}\right), \qquad \widetilde{S}_j = \widetilde{X}_1+\cdots+\widetilde{X}_j, \qquad \widetilde{X}_{j+1}\sim\mu_{\alpha_j^n}. \]

Define $V^n$ to be the smallest second moment over sequences of controls $\{\alpha_j^n\}_{j=0}^{n-1}$,
\[ V^n = \inf_{\{\alpha_j^n\}} \mathbb{E}\left[e^{-2nF\left(\frac{\widetilde{S}_n}{n}\right)} \prod_{j=0}^{n-1} e^{-\left\langle\alpha\left(\frac{\widetilde{S}_j}{n},\frac{j}{n}\right),\,\widetilde{X}_{j+1}\right\rangle + H\left(\alpha\left(\frac{\widetilde{S}_j}{n},\frac{j}{n}\right)\right)}\right], \]
and define the corresponding decay rate $W^n = -\frac{1}{n}\log V^n$.

4.3.1 Control-theoretic formulation: the Isaacs equation

We want the second moment of the importance sampling estimator to have exponential decay rate as close as possible to twice the decay rate of the probability of interest. From [12], we state a theorem that gives exactly this under mild assumptions.

Theorem 4.3.1. If $H(\alpha)<\infty$ for all $\alpha\in\mathbb{R}^d$, $\inf_{\beta\in A^\circ} L(\beta) = \inf_{\beta\in\bar{A}} L(\beta)$, and $\{\alpha_j^n\}_{j=0}^{n-1}$ is a sequence of controls that minimizes the second moment of the estimator, then
\[ \lim_{n\to\infty} W^n = 2\inf_{\beta\in A} L(\beta). \]

This means that the number of samples needed to estimate $q_n$ will not grow exponentially as $n\to\infty$ if we want the relative error to stay constant. The question now is how the choice of the sequence of biasing distributions is related to an Isaacs partial differential equation.

We first state the following theorem connecting entropy and exponential integrals.

Theorem 4.3.2 (Donsker--Varadhan variational formula). For any bounded, continuous function $f:\mathbb{R}^d\to\mathbb{R}$,
\[ -\log\int_{\mathbb{R}^d} e^{-f(y)}\,\mu(dy) = \inf_{\nu\in\mathcal{P}(\mathbb{R}^d)}\left\{D_{KL}(\nu\|\mu) + \int_{\mathbb{R}^d} f(y)\,\nu(dy)\right\}. \]

Define the value function started at $(x,i)$,
\[ V^n(x,i) = \inf_{\{\alpha_j^n\}} \mathbb{E}\left[e^{-2nF\left(\frac{\widetilde{S}_n}{n}\right)} \prod_{j=i}^{n-1} e^{-\left\langle\alpha\left(\frac{\widetilde{S}_j}{n},\frac{j}{n}\right),\,\widetilde{X}_{j+1}\right\rangle + H\left(\alpha\left(\frac{\widetilde{S}_j}{n},\frac{j}{n}\right)\right)}\right]. \]

The second moment $V^n(x,i)$ of the estimator starting at $(x,i)$ is the value function of a stochastic dynamic programming problem, and satisfies
\[ V^n(x,i) = \inf_\alpha \int_{\mathbb{R}^d} e^{H(\alpha)-\langle\alpha,y\rangle}\, V^n\left(x+\frac{y}{n},\, i+1\right)\mu(dy). \]
Then observe that $W^n(x,i) = -\frac{1}{n}\log V^n(x,i)$ also satisfies a dynamic programming equation. Using Theorems 4.1.2 and 4.3.2, we have
\[
\begin{aligned}
W^n(x,i) &= -\frac{1}{n}\log V^n(x,i) = -\frac{1}{n}\log\inf_\alpha\int_{\mathbb{R}^d} e^{H(\alpha)-\langle\alpha,y\rangle}\, V^n\left(x+\frac{y}{n},\, i+1\right)\mu(dy) \\
&= \frac{1}{n}\sup_\alpha\ -\log\int_{\mathbb{R}^d} e^{H(\alpha)-\langle\alpha,y\rangle}\, e^{-n\left(-\frac{1}{n}\log V^n\left(x+\frac{y}{n},\, i+1\right)\right)}\mu(dy) \\
&= \frac{1}{n}\sup_{\alpha\in\mathbb{R}^d}\ \inf_{\gamma\in\mathcal{P}(\mathbb{R}^d)}\left\{D_{KL}(\gamma\|\mu) + \int_{\mathbb{R}^d}\left[-H(\alpha)+\langle\alpha,y\rangle + nW^n\left(x+\frac{y}{n},\, i+1\right)\right]\gamma(dy)\right\} \\
&= \sup_{\alpha\in\mathbb{R}^d}\ \inf_{\gamma\in\mathcal{P}(\mathbb{R}^d)}\left\{\int_{\mathbb{R}^d} W^n\left(x+\frac{y}{n},\, i+1\right)\gamma(dy) + \frac{1}{n}\left[D_{KL}(\gamma\|\mu) - H(\alpha) + \int_{\mathbb{R}^d}\langle\alpha,y\rangle\,\gamma(dy)\right]\right\}.
\end{aligned}
\]

We assume that $W^n$ converges to a continuously differentiable function $W(x,t)$, so that $W^n$ is an approximation of $W$ with $W^n(x,i) \approx W(x, i/n)$. Observe that through Taylor expansion,
\[ W^n\left(x+\frac{y}{n},\, i+1\right) \approx W\left(x+\frac{y}{n},\, \frac{i+1}{n}\right) \approx W\left(x,\frac{i}{n}\right) + \frac{\partial W}{\partial t}\frac{1}{n} + \left\langle\nabla W,\, \frac{y}{n}\right\rangle. \]
Substituting into the dynamic programming equation, the zeroth-order terms cancel and we are left with
\[ \sup_{\alpha\in\mathbb{R}^d}\ \inf_{\nu\in\mathcal{P}(\mathbb{R}^d)}\left\{\frac{\partial W}{\partial t}\frac{1}{n} + \frac{1}{n}\int_{\mathbb{R}^d}\langle\nabla W, y\rangle\,\nu(dy) + \frac{1}{n}\left[D_{KL}(\nu\|\mu) - H(\alpha) + \int_{\mathbb{R}^d}\langle\alpha, y\rangle\,\nu(dy)\right]\right\} = 0. \]
Using the duality relation between the large deviations rate function and relative entropy, and rearranging terms, we have
\[ \frac{\partial W}{\partial t} + \sup_{\alpha\in\mathbb{R}^d}\ \inf_{\beta\in\mathbb{R}^d}\left[\langle\nabla W,\beta\rangle + L(\beta) - H(\alpha) + \langle\alpha,\beta\rangle\right] = 0. \tag{4.2} \]

This partial differential equation is called an Isaacs equation, which arises in differential game theory [23]. Note that this is a PDE for the decay rate $W$: it tells us the optimal decay rate an estimator can achieve given that the simulation is at $(x, t)$. Finding the maximizer $\alpha$ defines the biasing of the simulation that minimizes the variance, while finding the minimizer $\beta$ defines the dynamics of the most likely path of the simulations.

4.3.2 Variational formulation: The Hamilton-Jacobi-Bellman equation

Recall that $L(\beta)$ is the Legendre transform of $H(\alpha)$; since both are convex, taking the Legendre transform of $L(\beta)$ gives back $H(\alpha)$. Therefore, we can rewrite the above equation in terms of the cumulant generating function alone. Since $H(\alpha) = \sup_{\beta \in \mathbb{R}^d} \left[ \langle \alpha, \beta \rangle - L(\beta) \right]$, we have
\[
\inf_{\beta \in \mathbb{R}^d} \left[ \langle \nabla W + \alpha, \beta \rangle + L(\beta) \right] = -\sup_{\beta \in \mathbb{R}^d} \left[ \langle -\nabla W - \alpha, \beta \rangle - L(\beta) \right] = -H(-\nabla W - \alpha).
\]

So the Isaacs equation can be rewritten as
\[
\frac{\partial W}{\partial t} + \sup_{\alpha \in \mathbb{R}^d} \left[ -H(-\nabla W - \alpha) - H(\alpha) \right] = 0.
\]

One can verify by convexity of $H$ that the $\alpha$ achieving the supremum is $\alpha^* = -\frac{1}{2} \nabla W$. Therefore, the equation reduces to a Hamilton-Jacobi equation
\[
\frac{\partial W}{\partial t} - 2 H\!\left( -\frac{1}{2} \nabla W \right) = 0.
\]
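This reduction can be checked numerically in a simple setting. The sketch below is an illustration with assumed choices (one dimension, Gaussian increments so that $H(\alpha) = \alpha^2/2$, and an arbitrary test value for $\nabla W$); a grid search confirms that $\alpha \mapsto -H(-\nabla W - \alpha) - H(\alpha)$ is maximized at $\alpha^* = -\frac{1}{2}\nabla W$ with optimal value $-2H(-\frac{1}{2}\nabla W)$.

```python
# Numerical check in d = 1 with the (assumed) Gaussian Hamiltonian
# H(alpha) = alpha^2 / 2 and a hypothetical test value of grad W.

def H(a):
    return 0.5 * a * a

gradW = 1.7  # hypothetical value of the gradient of W at some (x, t)

def objective(a):
    # the quantity inside the supremum of the reduced Isaacs equation
    return -H(-gradW - a) - H(a)

# coarse grid search over alpha in [-5, 5] with step 0.001
grid = [-5.0 + 0.001 * i for i in range(10001)]
best_alpha = max(grid, key=objective)

analytic_alpha = -0.5 * gradW               # claimed maximizer: -grad W / 2
analytic_value = -2.0 * H(analytic_alpha)   # claimed optimum: -2 H(-grad W / 2)
```

The grid maximizer and the analytic formula coincide, as the convexity argument predicts.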

Defining $W(x, t) = 2 U(x, t)$, in summary, we must solve the following Hamilton-Jacobi equation with terminal condition
\[
\begin{cases}
\dfrac{\partial U}{\partial t} - H(x, -\nabla U) = 0, \\[4pt]
U(x, 1) = F(x).
\end{cases}
\]

Using the theory of viscosity solutions and the variational representation of solutions to Hamilton-Jacobi equations [16], we may write the solution $U(x, t)$ as the cost function of a family of variational problems parametrized by $(x, t)$:
\[
U(x, t) = \inf_{\varphi \in AC[t, 1],\; \varphi(t) = x} \left\{ \int_t^1 L(\varphi(s), \dot{\varphi}(s)) \, ds + F(\varphi(1)) \right\}.
\]
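When $L$ does not depend on the state, the minimizing paths are straight lines and the variational problem collapses to a Hopf-Lax formula, $U(x, t) = \inf_y \{ (1 - t)\, L((y - x)/(1 - t)) + F(y) \}$. The sketch below is an illustrative example under assumed choices, not from the thesis: Gaussian increments, so $L(\beta) = \beta^2/2$, and terminal cost $F(y) = (y - 1)^2$, for which $U(x, t) = (x - 1)^2/(3 - 2t)$ in closed form. It checks the closed form against a grid minimization and verifies the Hamilton-Jacobi equation by finite differences.

```python
# Illustrative check under assumed choices (not from the thesis):
# Gaussian increments give L(beta) = beta^2 / 2, and we take the
# terminal cost F(y) = (y - 1)^2. Straight-line paths are then optimal,
# so the variational problem reduces to the Hopf-Lax form
#     U(x, t) = min_y [ (y - x)^2 / (2 (1 - t)) + F(y) ],
# with closed form U(x, t) = (x - 1)^2 / (3 - 2 t).

def F(y):
    return (y - 1.0) ** 2

def U_closed(x, t):
    return (x - 1.0) ** 2 / (3.0 - 2.0 * t)

def U_hopf_lax(x, t, half_width=10.0, steps=200001):
    """Minimize the straight-line path cost over endpoints y on a grid."""
    best = float("inf")
    for i in range(steps):
        y = x - half_width + 2.0 * half_width * i / (steps - 1)
        best = min(best, (y - x) ** 2 / (2.0 * (1.0 - t)) + F(y))
    return best

x, t = 0.3, 0.4
u_grid = U_hopf_lax(x, t)    # numerical minimization over endpoints
u_exact = U_closed(x, t)     # closed form, here 0.49 / 2.2

# Finite-difference check of dU/dt - H(-dU/dx) = 0 with H(p) = p^2 / 2
h = 1e-5
Ut = (U_closed(x, t + h) - U_closed(x, t - h)) / (2.0 * h)
Ux = (U_closed(x + h, t) - U_closed(x - h, t)) / (2.0 * h)
residual = Ut - 0.5 * Ux ** 2
```

Both checks pass: the grid minimization reproduces the closed form, and the PDE residual vanishes to numerical precision.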

After all of this analysis, we determine that the parametrized biasing function in the algorithm is $\alpha^*(x, t) = -\frac{1}{2} \nabla W(x, t) = -\nabla U(x, t)$, and we can take the biasing at each step to be
\[
\alpha_j^n = \alpha^*\!\left( \frac{S_j^n}{n}, \frac{j}{n} \right).
\]
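To see the resulting algorithm end to end, the sketch below is an illustrative instance under assumed choices, not the thesis's experiments: i.i.d. $N(0, 1)$ increments (so $H(\alpha) = \alpha^2/2$) and the level-crossing event $\{S_n/n \geq a\}$, which corresponds to taking $F = 0$ on $[a, \infty)$ and $+\infty$ elsewhere. For this event $U(x, t) = (a - x)^2 / (2(1 - t))$ for $x < a$, so $\alpha^*(x, t) = -\nabla U(x, t) = (a - x)/(1 - t)$ below the level and $0$ above it. Each increment is sampled from the tilted distribution $N(\alpha_j^n, 1)$, and the likelihood ratio $e^{H(\alpha_j^n) - \langle \alpha_j^n, y \rangle}$ is accumulated along the path.

```python
import math
import random

def dynamic_is_estimate(n=20, a=0.5, num_samples=20000, seed=0):
    """Dynamic importance sampling estimate of P(S_n / n >= a) for
    i.i.d. N(0, 1) increments, using the state-dependent tilt
    alpha*(x, t) = (a - x) / (1 - t) below the level and 0 above it."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x = 0.0        # scaled position S_j / n
        log_lr = 0.0   # accumulated log-likelihood ratio
        for j in range(n):
            t = j / n
            alpha = (a - x) / (1.0 - t) if x < a else 0.0
            y = rng.gauss(alpha, 1.0)                  # tilted increment
            log_lr += 0.5 * alpha * alpha - alpha * y  # H(alpha) - <alpha, y>
            x += y / n
        if x >= a:
            total += math.exp(log_lr)
    return total / num_samples

est = dynamic_is_estimate()
```

For $n = 20$ and $a = 0.5$ the exact probability is $\bar{\Phi}(a \sqrt{n}) \approx 0.0127$, and the estimator recovers a value close to it with far fewer samples than plain Monte Carlo would require for comparable relative error.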

