Composable Probabilistic Inference with Blaise


Computer Science and Artificial Intelligence Laboratory

Technical Report

MIT-CSAIL-TR-2008-044

July 23, 2008


Composable Probabilistic Inference with Blaise

by

Keith Allen Bonawitz

Submitted to the

Department of Electrical Engineering and Computer Science

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Computer Science and Engineering

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2008

© Massachusetts Institute of Technology 2008. All rights reserved.

Author . . . .

Department of Electrical Engineering and Computer Science

April 30, 2008

Certified by . . . .

Patrick H. Winston

Ford Professor of Artificial Intelligence and Computer Science

Thesis Supervisor

Certified by . . . .

Joshua B. Tenenbaum

Paul E. Newton Career Development Professor

Thesis Supervisor

Accepted by . . . .

Arthur C. Smith

Chairman, Department Committee on Graduate Students


Composable Probabilistic Inference with Blaise

by

Keith Allen Bonawitz

Submitted to the Department of Electrical Engineering and Computer Science on April 30, 2008, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science and Engineering

Abstract

If we are to understand human-level cognition, we must understand how the mind finds the patterns that underlie the incomplete, noisy, and ambiguous data from our senses and that allow us to generalize our experiences to new situations. A wide variety of commercial applications face similar issues: industries from health services to business intelligence to oil field exploration critically depend on their ability to find patterns in vast amounts of data and use those patterns to make accurate predictions. Probabilistic inference provides a unified, systematic framework for specifying and solving these problems. Recent work has demonstrated the great value of probabilistic models defined over complex, structured domains. However, our ability to imagine probabilistic models has far outstripped our ability to programmatically manipulate them and to effectively implement inference, limiting the complexity of the problems that we can solve in practice.

This thesis presents Blaise, a novel framework for composable probabilistic modeling and inference, designed to address these limitations. Blaise has three components:

• The Blaise State-Density-Kernel (SDK) graphical modeling language that generalizes factor graphs by: (1) explicitly representing inference algorithms (and their locality) using a new type of graph node, (2) representing hierarchical composition and repeated substructures in the state space, the interest distribution, and the inference procedure, and (3) permitting the structure of the model to change during algorithm execution.

• A suite of SDK graph transformations that may be used to extend a model (e.g. to construct a mixture model from a model of a mixture component), or to make inference more effective (e.g. by automatically constructing a parallel tempered version of an algorithm or by exploiting conjugacy in a model).

• The Blaise Virtual Machine, a runtime environment that can efficiently execute the stochastic automata described by Blaise SDK graphs.

Blaise encourages the construction of sophisticated models by composing simpler models, allowing the designer to implement and verify small portions of the model and inference method, and to reuse model components from one task to another. Blaise decouples the implementation of the inference algorithm from the specification of the interest distribution, even in cases (such as Gibbs sampling) where the shape of the interest distribution guides the inference. This gives modelers the freedom to explore alternate models without slow, error-prone reimplementation. The compositional nature of Blaise enables novel reinterpretations of advanced Monte Carlo inference techniques (such as parallel tempering) as simple transformations of Blaise SDK graphs.

In this thesis, I describe each of the components of the Blaise modeling framework and validate the framework by highlighting a variety of contemporary sophisticated models that have been developed by the Blaise user community. I also present several surprising findings stemming from the Blaise modeling framework, including that an Infinite Relational Model can be built using exactly the same inference methods as a simple mixture model, that constructing a parallel tempered inference algorithm should be a point-and-click/one-line-of-code operation, and that Markov chain Monte Carlo for probabilistic models with complicated long-distance dependencies, such as a stochastic version of Scheme, can be managed using standard Blaise mechanisms.

Thesis Supervisor: Patrick H. Winston

Title: Ford Professor of Artificial Intelligence and Computer Science

Thesis Supervisor: Joshua B. Tenenbaum

Title: Paul E. Newton Career Development Professor

Acknowledgments

This thesis would not have been possible without the help and support of innumerable friends, family and colleagues. Let me start by thanking my thesis committee: Patrick Winston, Josh Tenenbaum, and Antonio Torralba. Patrick has advised me since I was an undergraduate, guiding me through the deep questions of artificial intelligence from the perspective of understanding human-level cognition. I am also deeply indebted to Patrick for teaching me both how to frame my ideas and how to present them to others – without a doubt, Patrick is a master of this art, and I can only hope I have absorbed some of his skills. I thank Josh for teaching me the power of combining probabilistic modeling and symbolic reasoning. Moreover, I thank Josh for fostering a vibrant and social research environment, both inside the lab and out – there is definitely something wonderful to be said for insightful conversation while hiking in New Hampshire.

I cannot praise my community at MIT highly enough. My friends and colleagues in Patrick’s lab, in Josh’s lab, and throughout the Institute have been the most wonderful support system I could ask for, not to mention their influence on my thoughts about this work and science in general. Special thanks to Dan Roy for bravely using a very early version of this work for his own research, despite its rough edges and lack of documentation, and to Vikash Mansinghka, Beau Cronin, and Eric Jonas for embracing the ideas in this thesis and providing crucial input and endless encouragement.

I thank my parents, Lynn and John Bonawitz, for teaching me to follow my dreams with passion and integrity, and to build the life that works for me. They and my brother, Michael Bonawitz, are remarkable in their ability to always let me know how much I’m loved and missed, while still encouraging my pursuits – even when those pursuits take me far away from them.

But most of all, this thesis would never have happened if not for my wife Liz Bonawitz. Liz’s boundless love and support have been a constant source of strength, encouraging me to be the best I can be, both personally and professionally, and challenging me to forge my own path and chase both my ideas and my ideals. From graduate school applications through my defense, Liz has been by my side every step of the way – introducing me to many of the people and ideas that have become the core of my work, but also making sure I didn’t miss out on the fun and laughter life has to offer.

This material is based upon work supported by the National Science Foundation under Grant No. 0534978. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

List of Figures

3-1 Preview of the Blaise modeling language compared to other standard probabilistic modeling languages
3-2 A beta-binomial model in factor graph notation
3-3 The state space for a beta-binomial model in factor graph notation and Blaise SDK notation
3-4 Simple plated factor graphs and the equivalent Blaise State structures
3-5 Beta-binomial model for multiple data points, in plated factor graph notation and Blaise SDK notation
3-6 Mixture models in Bayes net notation and Blaise notation
3-7 The state space and interest distribution for a single-datapoint beta-binomial model in factor graph notation and Blaise SDK notation
3-8 The state space and interest distribution for a multiple-datapoint beta-binomial model in factor graph notation and Blaise SDK notation
3-9 A beta-binomial Blaise model with Kernels
3-10 Kernels for mixture models
3-11 Operations supported by Kernels
3-12 Operations supported by Moves
3-13 A Gaussian (normal) Blaise model with Kernels
3-14 Metropolis-Hastings Kernels
3-15 Pseudocode for a Metropolis-Hastings Kernel’s Sample-Next-State
3-16 Kernels for mixture models
3-18 Blocking in Metropolis-Hastings Kernels
3-19 Blocking in Gibbs Kernels
3-20 A mixture model, demonstrating Associated Collection Densities
3-21 A mixture model, demonstrating Virtual Hybrid Kernels
3-22 Virtual Hybrid Kernels can be analyzed as a Conditional Hybrid of Concrete Cycles
3-23 Constraint States in “bridged form” mixture models
3-24 Bridged form mixture models with Kernels
3-25 Initialization Kernels
3-26 Initialization Kernels for Bridged Mixtures
3-27 Non-parametric multi-feature beta-binomial mixture model
3-28 Model-View-Controller design pattern and Blaise SDK analogy
4-1 The temper transform
4-2 The simulated annealing transform
4-3 The parallel tempering transform
4-4 The particle filtering transform
4-5 Parallel tempered reversible jump Markov chain Monte Carlo
4-6 Particle filtering mixed with MCMC
4-7 Particle filtering with simulated annealing
4-8 Parallel tempered particle filter
4-9 Parametric mixture model transform (fixed size, fixed weights)
4-10 Parametric mixture model transform (fixed size, unknown weights)
4-11 Parametric mixture model transform (variable size)
4-12 Non-parametric mixture model transform
4-13 Beta-binomial conjugacy transform
5-1 Timing comparison for State mutation vs. Copy-On-Write
5-2 Timing comparison for transactions and memoization
5-3 Timing comparison for transactions and memoization, experiment 2
5-4 Blaise uses a tree-structured transactional caching system to enhance performance
5-5 “Let” Kernels in the Virtual Machine
6-1 Generative Vision input
6-2 Generative Vision SDK
6-3 Generative Vision requires parallel tempering
6-4 Functional model for a V1 neuron
6-5 LNP model of a V1 neuron, as a Bayes net
6-6 LNP model of a V1 neuron, as a Blaise model
6-7 Generalized LNP model of a V1 neuron, as a Blaise model
6-8 Neurophysiological experimental data and model predictions generated using Blaise (from Schummers et al. [57])
6-9 Input and output for the Infinite Relational Model
6-10 Multi-feature Mixture Model SDK (reprise)
6-11 Infinite Relational Model SDK
6-12 Infinite Relational Model applied to a synthetic dataset
6-13 Infinite Relational Model applied to Movielens
6-14 Blaise IRM scales near linearly in number of datapoints
6-15 Relational data modeled with Annotated Hierarchies (from Roy et al. [54])
6-16 Latent Dirichlet Allocation Bayes Net
6-17 Latent Dirichlet Allocation Model SDK
6-18 LDA Wikipedia input
6-19 LDA Wikipedia learned topics
6-20 The Ising model as a factor graph and as a Blaise model
6-21 Kernels for factor graphs
6-22 Bayes nets can be reduced to factor graphs
6-23 Systematic Stochastic Search: Sequentialize transform
6-25 A standard example of a BUGS model
6-26 A non-parametric mixture model in a hypothetical BUGS extension
6-27 The Infinite Relational Model as a Church program
6-28 The Infinite Hidden Markov Model as a Church program
7-1 Software package comparison matrix

Chapter 1 Introduction

My thesis is that a framework for probabilistic inference can be designed that enables efficient composition of both models and inference procedures, that is suited to the representational needs of emerging classes of probabilistic models, and that supports recent advances in inference.

Probabilistic inference is emerging as a common language for computational studies of the mind. Cognitive scientists and artificial intelligence researchers are appealing more frequently to probabilistic inference in both normative and descriptive accounts of how humans can draw confident conclusions from incomplete or ambiguous evidence [9, 60]. Explorations of human category learning [61], property induction [29], and causal reasoning [23] have all found remarkable accord between human performance and the predictions of Bayesian probabilistic models. Probabilistic models of the human visual system help us understand how top-down and bottom-up reasoning can be integrated [67]. Computational neuroscientists are even finding evidence that probabilistic inference may help explain the behavior of individual neurons and neuronal networks [37, 52]. Beyond studying the mind, much practical use is found for probabilistic models in fields as diverse as business intelligence [48], bioinformatics, and medical informatics [27].

Probabilistic models are not the only way to approach problems of reasoning under uncertainty, but they have recently exploded in popularity for a number of reasons.


First, Bayesian probabilistic modeling encourages practitioners to be forthright with their assumptions. As phrased by MacKay [38] (p. 26), “you cannot do inference without making assumptions.” All inference techniques make assumptions about how unobserved (or future) values are related to observable (or past) values; however, non-Bayesian models typically leave these assumptions implicit. As a result, it is difficult both to evaluate the justification of the assumptions and to change those assumptions. In contrast, Bayesian models are explicit about their prior assumptions. Another virtue of probabilistic models is that, given a few intuitive desiderata for reasoning under uncertainty, probability theory is the unique calculus for such reasoning (this result is known as Cox’s Theorem [28]). This provides the modeler with assurance that the mathematical framework in which his models are embedded is capable of correctly handling future model extensions and provides a common language for the interchange of modeling results. The use of the term “language” here is non-accidental: probability theory provides all the elements of a powerful programming or engineering language: primitives (e.g., random variables and simple conditional distributions), means of combination (e.g., joint distributions composed from simple conditional distributions that may share variables), and means of abstraction (e.g., marginalizing out variables to produce new conditional distributions) [2].

In addition to these general features of probability theory, a confluence of factors is contributing to a renaissance of probabilistic modeling in cognitive science. One of these is the détente between practitioners of structured symbolic reasoning and statistical inference. This has resulted in the investigation of “sophisticated” (cf. [9]) probabilistic models, in which the random variables have structured representations for their domains, such as trees, graphs, grammars, or logics. For example, [23] treat learning a causal structure as a probabilistic inference problem including a random variable on the domain of causal networks. This variable is conditioned on a more abstract structure encoding a simple probabilistic grammar for possible causal networks. Probabilistic inference on this hierarchical layering of structured representations, applied to (for example) data about behaviors, diseases, and symptoms, allows one to learn not only a reasonable causal network for a particular set of observations, but also more abstract knowledge such as “diseases cause symptoms, but symptoms never cause behaviors.” As another example of sophisticated models, [54] learn a classification hierarchy from a set of features and relations among objects while simultaneously learning what level of the hierarchy is most appropriate for predicting each feature and relation in the dataset; this work can be viewed as a probabilistic extension of hierarchical semantic networks [10]. These examples illustrate how incorporating the representational power of symbolic systems has enabled the inferential power of probability theory to be brought to bear on problems that not long ago were thought to be outside the realm of statistical models, resulting in robust symbolic inference with principled means for balancing new experience with prior beliefs.

Contemporary probabilistic inference problems are also sophisticated in their use of advanced mathematics throughout modeling and inference. Nonparametric models such as the Chinese Restaurant Process (or Dirichlet Process) [6, 30] and the Indian Buffet Process [24] allow the dimensionality of a probabilistic model to be determined by the data being explained; in effect, the model grows with the data. Nonparametrics are the basis of many recent cognitive models such as the Infinite Relational Model [30], in which relational data is explained by partitioning the objects in each domain into a set of classes, each of which behaves homogeneously under the relation. The number of classes in each domain is not known a priori, and instead a Chinese Restaurant Process is used to allow the Infinite Relational Model to use just as many classes as the data justifies. Along with advanced mathematics for modeling come advanced techniques for performing inference on these models. For example, when performing inference in a nonparametric model, special care must be taken to account for the variable dimensionality of the model. For approximate inference techniques based on Markov chain Monte Carlo (MCMC), this special care takes the form of Reversible Jump MCMC [21] and involves the computation of a Jacobian factor relating parameter spaces of different dimension. Reversible Jump MCMC ensures only the correctness of inference; other advanced techniques are focused on making inference tractable in models of increasing complexity, whether that complexity is due to complexly structured random variables (e.g., with domains such as the space of all graphs), hierarchically layered models (e.g., causal structures and causal grammars as in [23]), or models with unknown dimensionality (e.g. resulting from the use of non-parametrics). Examples of advanced techniques for improving inference performance include parallel tempering [15], in which probabilistic inference on easier versions of the probabilistic model is used to guide inference in the desired full-difficulty model, and sequential methods such as particle filtering [14], an online, population-based Monte Carlo method that only uses each datapoint once, at the time when it arrives online.
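To make the idea of a model that "grows with the data" concrete, the following is a minimal sketch of sampling class assignments from a Chinese Restaurant Process. It uses the standard sequential seating rule; the function name `sample_crp` and the concentration parameter `alpha` are illustrative assumptions and are not taken from Blaise.

```python
import random


def sample_crp(n_customers, alpha, rng):
    """Sample table (class) assignments from a Chinese Restaurant Process.

    With i customers already seated, the next customer joins existing
    table k with probability counts[k] / (i + alpha) and opens a new
    table with probability alpha / (i + alpha).
    """
    assignments = []  # table index chosen by each customer
    counts = []       # number of customers at each table
    for i in range(n_customers):
        r = rng.random() * (i + alpha)
        cumulative = 0.0
        table = len(counts)  # default: open a new table
        for k, c in enumerate(counts):
            cumulative += c
            if r < cumulative:
                table = k
                break
        if table == len(counts):
            counts.append(0)
        counts[table] += 1
        assignments.append(table)
    return assignments, counts


# Hypothetical usage: 100 datapoints, concentration alpha = 1.0.
assignments, counts = sample_crp(100, alpha=1.0, rng=random.Random(0))
```

The number of occupied tables is not fixed in advance; it is determined by the draws themselves, which is the property the Infinite Relational Model relies on.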

Unfortunately, existing tools are not designed to handle the kind of sophisticated models and inference techniques that are required today. As a result, most modelers currently construct their own special purpose implementations of these algorithms for every model they create — an inefficient and error-prone process which frequently leads the practitioner to forgo many advanced techniques due to the difficulty of implementing them in a system that does not offer use of the proper abstractions.

1.1 Thesis statement and organization

My thesis is that a framework for probabilistic inference can be designed that enables efficient composition of both models and inference procedures, that is suited to the representational needs of emerging classes of probabilistic models, and that supports recent advances in inference.

Chapter 2 reviews the mathematical underpinnings of probabilistic inference. Chapters 3–6 directly address the claims in my thesis statement. Chapter 3 introduces the Blaise State–Density–Kernel (SDK) graphical modeling language and shows how this language supports composition of models and inference procedures. Chapter 4 highlights how several recent advances in inference are supported by interpreting them as graph transformations in the SDK language. Chapter 5 describes the Blaise virtual machine, which can efficiently execute the stochastic automata described by Blaise SDK graphs. Chapter 6 describes several applications involving emerging classes of probabilistic models, each of which has been built using the Blaise framework.

With the thesis supported, chapter 7 compares Blaise to existing probabilistic inference frameworks, and chapter 8 reviews the contributions this thesis makes to the field.


Chapter 2 Background: Monte Carlo Methods for Probabilistic Inference

This thesis focuses on Monte Carlo methods for probabilistic inference, a class of algorithms for drawing conclusions from a probabilistic model. Anthropomorphizing for a moment, Monte Carlo methods can be interpreted as hallucinating possible worlds and evaluating those worlds according to how well they fit the model and how well they explain observations about the real world. From this perspective, the rejection sampling Monte Carlo method hallucinates complete random worlds, drawing conclusions only from those hallucinations that match the observed evidence. Likelihood weighting, another Monte Carlo method, hallucinates random worlds up to (but not including) the gathering of evidence, then weights any conclusions drawn from one of these hallucinated worlds by how well the world fits with the observations of the real world. Markov chain Monte Carlo also hallucinates possible worlds, but tries to be more systematic about it by continually adjusting its hallucination to try to better account for real world observations. There are a wide variety of abstract ideas from artificial intelligence that can be concretized as Monte Carlo inference. For example, streams and counterstreams [65] is an interesting proposal for modeling the interaction of top-down and bottom-up effects on visual perception. Unfortunately, the proposal is framed in terms of priming cognitive states, with no guidance provided on how such priming might be realized in a computational model. In contrast, work on data-driven Markov chain Monte Carlo [64] has approached the same problem from a Monte Carlo inference perspective, resulting in a concrete model that seems to provide the most promising computational interpretation available of streams and counterstreams. Monte Carlo-based probabilistic inference holds the potential to systematize a wide range of artificial intelligence theories, offering to add both algorithmic alternatives and rational analysis to the existing intuitions.

This chapter surveys the mathematics of probabilistic inference, focusing on those aspects that will provide the foundation for the remainder of this thesis.

2.1 Probabilistic Models and Inference

In probabilistic models on discrete variables, $P(x = x_i)$ denotes the probability that the random variable $x$ takes on the value $x_i$. This is often simply written as $P(x)$ or $P(x_i)$. Likewise, the joint probability of two variables is written $P(x = x_i, y = y_j)$ and indicates the probability that random variable $x$ takes the value $x_i$ and variable $y$ takes the value $y_j$. Conditional probabilities are denoted $P(x = x_i \mid y = y_j)$ and indicate the probability that $x = x_i$, given that $y = y_j$. If $y$ is the empty set, then $P(x \mid y) = P(x)$. Probability distributions on discrete variables satisfy several properties: $0 \le P(x \mid y) \le 1$, and if the set $X$ contains all possible values for the variable $x$, then $\sum_{x_i \in X} P(x = x_i \mid y) = 1$.

For continuous variables, the terminology is slightly different. $P(x \in X)$ denotes the probability that the random variable $x$ takes on a value in the set $X$. A probability density function $p(x = x_i)$ is then derived from this by the relation $\int_{x_i \in X} p(x = x_i)\, dx_i = P(x \in X)$. The terminology for joint and conditional distributions changes analogously. Probability distributions on continuous variables satisfy several properties: $0 \le P(x \in X \mid y) \le 1$, $p(x = x_i \mid y) \ge 0$¹, and if the set $X$ contains all possible values for the variable $x$, then $\int_{x_i \in X} p(x = x_i \mid y)\, dx_i = 1$.

¹ Note that for continuous variables, $p(\cdot)$ denotes the density of the probability distribution and is not restricted to be less than 1. For example, a uniform distribution on the interval $[0, \tfrac{1}{2}]$ has density 2 on that interval.

For the remainder of this thesis, references to distributions will be written as if the distribution is over continuous variables. However, it should be understood that all methods are equally applicable to discrete variables unless otherwise noted.

There are a few useful rules for computing desired distributions from other distributions. First, marginal probabilities can be computed by integrating out a variable: $p(x \mid a) = \int_y p(x, y \mid a)\, dy$. Bayes’ theorem declares that $p(x, y \mid a) = p(y \mid x, a)\, p(x \mid a)$, or equivalently $p(y \mid x, a) = p(x, y \mid a) / p(x \mid a)$.
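The two rules can be checked on a small discrete example, where the integral becomes a sum. The sketch below is illustrative only: the joint table and the helper names `marginal_x` and `conditional_y_given_x` are assumptions made for this example.

```python
from fractions import Fraction

# Hypothetical joint distribution P(x, y) over two binary variables,
# stored as {(x, y): probability}. The numbers are made up.
joint = {
    (0, 0): Fraction(1, 10),
    (0, 1): Fraction(3, 10),
    (1, 0): Fraction(2, 10),
    (1, 1): Fraction(4, 10),
}


def marginal_x(joint):
    """Sum out y: P(x) = sum over y of P(x, y)."""
    result = {}
    for (x, _y), p in joint.items():
        result[x] = result.get(x, Fraction(0)) + p
    return result


def conditional_y_given_x(joint):
    """Rearranged product rule: P(y | x) = P(x, y) / P(x).

    Returns a dict keyed by (y, x).
    """
    p_x = marginal_x(joint)
    return {(y, x): p / p_x[x] for (x, y), p in joint.items()}


p_x = marginal_x(joint)
p_y_given_x = conditional_y_given_x(joint)
```

Exact rational arithmetic is used so that the identity $P(x, y) = P(y \mid x) P(x)$ holds without rounding error.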

In problems of probabilistic inference, the following are specified: a set of variables $\vec{x}$; a partition of the variables $\vec{x}$ into three groups: $\vec{e}$ (the evidence variables with observed values), $\vec{q}$ (the query variables), and $\vec{u}$ (the uninteresting variables); and a joint density $p(\vec{x}) = p(\vec{e}, \vec{q}, \vec{u})$ over those variables. The goal of inference, then, is to compute the distribution of the query variables given the observed evidence:

$$p(\vec{q} \mid \vec{e}) = \frac{p(\vec{q}, \vec{e})}{p(\vec{e})} = \frac{\int_{\vec{u}} p(\vec{e}, \vec{q}, \vec{u})\, d\vec{u}}{p(\vec{e})} = \frac{\int_{\vec{u}} p(\vec{e}, \vec{q}, \vec{u})\, d\vec{u}}{\int_{\vec{q}, \vec{u}} p(\vec{e}, \vec{q}, \vec{u})\, d\vec{q}\, d\vec{u}}$$

Once the conditional distribution $p(\vec{q} \mid \vec{e})$ is in hand, it can be used to answer queries² such as the expected value of some function $f(\vec{q})$:

$$E_{p(\vec{q} \mid \vec{e})}[f(\vec{q})] = \int_{\vec{q}} f(\vec{q})\, p(\vec{q} \mid \vec{e})\, d\vec{q}$$

For example, in a classification task, $\vec{e}$ might be a set of observed object properties, $\vec{q}$ might be the assignments of objects to classes, and $\vec{u}$ might be the parameters governing the distribution of properties in each class. Suppose you were interested in whether two particular objects belonged to the same class. Letting $f(\vec{q})$ be an indicator function

$$f(\vec{q}) = \begin{cases} 1 & \text{if } \vec{q} \text{ assigns the two objects to the same class;} \\ 0 & \text{otherwise,} \end{cases}$$

the expectation $E_{p(\vec{q} \mid \vec{e})}[f(\vec{q})]$ is the probability that the two objects belong to the same class.
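The indicator-function query can be illustrated on a toy discrete posterior, where the expectation integral reduces to a weighted sum. Everything in this sketch is hypothetical: the posterior table, the three objects, and the helper names are chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical posterior p(q | e) over class assignments for three objects.
# Each key gives the class label of objects 0, 1 and 2; the probabilities
# are made up and sum to 1.
posterior = {
    ("A", "A", "A"): Fraction(1, 2),
    ("A", "A", "B"): Fraction(1, 4),
    ("A", "B", "A"): Fraction(1, 8),
    ("A", "B", "B"): Fraction(1, 8),
}


def same_class(assignment, i, j):
    """Indicator f(q): 1 if objects i and j share a class, else 0."""
    return 1 if assignment[i] == assignment[j] else 0


def expectation(posterior, f):
    """E[f(q)] = sum over q of f(q) * p(q | e)."""
    return sum(f(q) * p for q, p in posterior.items())


# Probability that objects 0 and 1 belong to the same class.
prob_0_1_together = expectation(posterior, lambda q: same_class(q, 0, 1))
```

In realistic models the set of assignments is far too large to enumerate, which is why the remainder of the chapter turns to approximate methods.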

This is an elegant expression of the goals of inference, but unfortunately it is rarely possible to directly apply the inference formulae because the required integrals (or the analogous summations in the discrete case) are intractable for most probabilistic models. As a result, even after specifying the model and the inference task to be performed, it is still necessary to derive a method for performing that inference that does not require the evaluation of intractable integrals.

2.2 Approximate Inference

For this thesis, I focus on approximate probabilistic inference methods. While exact inference methods exist and are useful for certain classes of problems, exact inference in sophisticated models is generally intractable, because these methods typically depend on integrals, summations, or intermediate representations that grow unmanageably large as the state space grows large or even infinite.

There are two main classes of approximate inference: variational methods and Monte Carlo methods. Variational methods operate by first approximating the full model with a simpler model in which the inference questions are tractable. Next, the parameters of this simpler model are adjusted to minimize a measure of the dissimilarity between the original model and the simplified version; this adjustment is usually performed deterministically. Finally, the query is executed in the adjusted, simplified model.

In contrast, Monte Carlo methods draw a set of samples from the target distribution; inference questions are then answered by using this set of samples as an approximation of the target distribution itself.

² Another popular inference goal is to find the maximum a posteriori (MAP) value of the query variables: $\vec{q}_{MAP} = \arg\max_{\vec{q}} p(\vec{q} \mid \vec{e})$. MAP values are typically used to find the “best explanation” of a set of data. Unfortunately, they do not satisfy intuitive consistency properties. In particular, a change-of-variables transformation of the target distribution is likely to change the MAP value. More concretely: let $f : q \to q^*$ be some invertible function, and let $F : Q \to Q^*$ be the analogous set-valued invertible function $F(Q) = \{f(q) \mid q \in Q\}$. Consider the change-of-variables transformed distribution $P^*(Q^* \mid e) = P(F^{-1}(Q^*) \mid e)$ with density $p^*(q^* \mid e) = \frac{d}{dq^*} P^*(Q^* \mid e)$. Intuitive consistency is violated because, in general, $\arg\max_{q} p(q \mid e) \neq f^{-1}\left(\arg\max_{q^*} p^*(q^* \mid e)\right)$, implying that the choice of representation can change the “best explanation.” This and other shortcomings notwithstanding, MAP values can also be estimated using Monte Carlo methods.
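The footnote's point about MAP values can be seen in a one-dimensional worked example. The density and the transformation below are chosen purely for illustration and do not come from the thesis: a Beta(3,3)-shaped posterior on $(0,1)$ and the invertible map $f(q) = q^2$.

```latex
\begin{align*}
p(q \mid e) &= 30\,q^2(1-q)^2, \quad q \in (0,1)
  &&\Rightarrow\quad \arg\max_q p(q \mid e) = \tfrac{1}{2} \\
q^* &= f(q) = q^2, \quad f^{-1}(q^*) = \sqrt{q^*} \\
p^*(q^* \mid e) &= p\bigl(\sqrt{q^*} \mid e\bigr)\,\frac{d\sqrt{q^*}}{dq^*}
  = \frac{30\,q^*\,(1-\sqrt{q^*})^2}{2\sqrt{q^*}}
  = 15\,\sqrt{q^*}\,(1-\sqrt{q^*})^2 \\
\frac{d}{ds}\,s(1-s)^2 &= (1-s)(1-3s) = 0 \ \text{ at } s = \sqrt{q^*} = \tfrac{1}{3}
  &&\Rightarrow\quad \arg\max_{q^*} p^*(q^* \mid e) = \tfrac{1}{9} \\
f^{-1}\bigl(\tfrac{1}{9}\bigr) &= \tfrac{1}{3} \neq \tfrac{1}{2}
\end{align*}
```

Under these assumptions the same posterior yields two different “best explanations,” $\tfrac{1}{2}$ and $\tfrac{1}{3}$, depending only on the parameterization.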

Variational methods have the advantage of being deterministic; the corresponding results, however, are in the form of a lower bound on the actual desired quantity, and the tightness of this bound depends on the degree to which the simplified distribution can model the target distribution. Furthermore, standard approaches to variational inference such as variational message passing [66] restrict the class of models to graphical models in which adjacent distributions are conjugate³. For example, when conjugacy assumptions are not satisfied, [66] recommends reverting to a Monte Carlo approach. In contrast, Monte Carlo techniques are applicable to all classes of probabilistic models. They are also guaranteed to converge – if you want a more accurate answer, you just need to run the inference for longer; in the limit of running the Monte Carlo algorithm forever, the sampled approximation converges to the target distribution. Furthermore, it is possible to construct hybrid inference algorithms in which variational inference is used for some parts of the model, while Monte Carlo methods are used for the rest. These mixed approaches, however, are outside the scope of this work. For this thesis, I concentrate on Monte Carlo methods because the mathematics for this class of inference supports inference composition in a way that parallels model composition (for example, see the description of hybrid kernels in section 2.4). Notwithstanding the particular focus of this thesis, the stochastic automata developed here (Blaise SDK graphs, chapter 3) could be used to model processes other than Monte Carlo inference, including other inference techniques (e.g. belief propagation [49, 31]) and even non-inferential processes.

³ A prior distribution $p(\theta)$ is said to be conjugate to a likelihood $p(x \mid \theta)$ if the posterior distribution $p(\theta \mid x)$ is of the same functional form as the prior. Conjugacy is generally important because it allows key integrals to be computed analytically, and because it allows certain inference results to be represented compactly (as the parameters of the posterior distribution).
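As an illustration of conjugacy, the sketch below uses the standard beta-binomial pair: a Beta prior on a coin's bias combined with observed successes and failures gives a Beta posterior whose parameters are obtained by simple addition. The function names are hypothetical and the code is not part of Blaise.

```python
from fractions import Fraction


def beta_posterior(alpha, beta, successes, failures):
    """Beta(alpha, beta) prior plus Bernoulli observations gives a
    Beta(alpha + successes, beta + failures) posterior."""
    return alpha + successes, beta + failures


def unnormalized_beta(theta, alpha, beta):
    """Beta density up to its normalizing constant."""
    return theta ** (alpha - 1) * (1 - theta) ** (beta - 1)


def bernoulli_likelihood(theta, successes, failures):
    """Likelihood of a particular sequence of successes and failures."""
    return theta ** successes * (1 - theta) ** failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return Fraction(alpha, alpha + beta)


# Hypothetical data: uniform Beta(1, 1) prior, 7 successes and 3 failures.
post_alpha, post_beta = beta_posterior(1, 1, successes=7, failures=3)
posterior_mean = beta_mean(post_alpha, post_beta)
```

The posterior is summarized by two numbers, which is the compact representation the footnote refers to; no integral has to be evaluated numerically.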


2.3 Monte Carlo Methods

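There is a wide variety of Monte Carlo methods, but they all share a common recipe. First, draw a number of samples $\langle \vec{q}_1, \vec{u}_1 \rangle, \cdots, \langle \vec{q}_N, \vec{u}_N \rangle$ from the distribution $p(\vec{q}, \vec{u} \mid \vec{e})$, and then approximate the interest distribution using

$$p(\vec{q} \mid \vec{e}) \approx \tilde{p}_N(\vec{q} \mid \vec{e}) = \frac{1}{N} \sum_{i=1}^{N} \delta_{\vec{q}_i}(\vec{q})$$

where $\delta_{\vec{q}_i}$ is the Dirac delta function⁴. As the number of samples increases, the approximation (almost surely) converges to the true distribution: $\tilde{p}_N(\vec{q} \mid \vec{e}) \xrightarrow[N \to \infty]{\text{a.s.}} p(\vec{q} \mid \vec{e})$. Expectations can similarly be approximated from the Monte Carlo samples:

$$E_{p(\vec{q} \mid \vec{e})}[f(\vec{q})] = \int_{\vec{q}} f(\vec{q})\, p(\vec{q} \mid \vec{e})\, d\vec{q} \approx \int_{\vec{q}} f(\vec{q})\, \frac{1}{N} \sum_{i=1}^{N} \delta_{\vec{q}_i}(\vec{q})\, d\vec{q} = \frac{1}{N} \sum_{i=1}^{N} f(\vec{q}_i)$$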
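The averaging recipe can be illustrated with a minimal Python sketch. It assumes a hypothetical target that is trivial to sample directly (a fair six-sided die standing in for the interest distribution), so that the estimate can be compared with the exact expectation. All names here are illustrative and are not part of Blaise.

```python
import random


def monte_carlo_expectation(f, draw_sample, num_samples, rng):
    """Approximate E[f(q)] by averaging f over num_samples independent draws."""
    total = 0.0
    for _ in range(num_samples):
        total += f(draw_sample(rng))
    return total / num_samples


# Hypothetical easy-to-sample target: a fair six-sided die.
rng = random.Random(0)
estimate = monte_carlo_expectation(
    f=lambda q: q * q,
    draw_sample=lambda r: r.randint(1, 6),
    num_samples=100_000,
    rng=rng,
)
exact = sum(q * q for q in range(1, 7)) / 6  # 91/6

print(f"estimate={estimate:.3f} exact={exact:.3f}")
```

Increasing `num_samples` tightens the estimate, which is the practical meaning of the convergence statement above.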

If it were generally easy to draw samples directly from $p(\vec{q}, \vec{u} \mid \vec{e})$, the Monte Carlo story would end here. Unfortunately, this is typically intractable due to the same integrals that made it intractable to compute $p(\vec{q} \mid \vec{e})$ exactly. Fortunately, a range of techniques have been developed for producing samples from $p(\vec{q}, \vec{u} \mid \vec{e})$ indirectly.

One of the simplest Monte Carlo techniques is rejection sampling. In rejection sampling, samples from the conditional distribution $p(\vec{q}, \vec{u} \mid \vec{e})$ are produced by generating samples from the joint distribution $p(\vec{q}, \vec{u}, \vec{e})$ and discarding any samples that disagree with the observed evidence values. Rejection sampling is extremely inefficient if the observed evidence is unlikely under the joint distribution, because almost all of the samples will disagree with the observed evidence and be discarded. Furthermore, as the amount of observed data increases, any particular set of observations gets increasingly unlikely.
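As a rough illustration of rejection sampling, the following Python sketch uses a hypothetical two-variable model invented for this example: a binary query variable q with prior probability 0.3 of being 1, and a binary evidence variable e that equals 1 with probability 0.9 when q = 1 and 0.2 otherwise. There are no nuisance variables in this toy model, and none of the names come from Blaise.

```python
import random


def sample_joint(rng):
    """Draw (q, e) from the hypothetical joint distribution p(q, e)."""
    q = 1 if rng.random() < 0.3 else 0
    prob_e_is_1 = 0.9 if q == 1 else 0.2
    e = 1 if rng.random() < prob_e_is_1 else 0
    return q, e


def rejection_sample(observed_e, num_proposals, rng):
    """Keep only the joint samples whose evidence matches the observation."""
    accepted = []
    for _ in range(num_proposals):
        q, e = sample_joint(rng)
        if e == observed_e:
            accepted.append(q)
    return accepted


num_proposals = 50_000
accepted = rejection_sample(observed_e=1, num_proposals=num_proposals,
                            rng=random.Random(1))
posterior_q1 = sum(accepted) / len(accepted)   # estimate of p(q=1 | e=1)
acceptance_rate = len(accepted) / num_proposals  # estimate of p(e=1)
```

In this toy model the exact posterior is 0.27 / 0.41 and the acceptance rate is p(e = 1) = 0.41; with rarer evidence the acceptance rate shrinks accordingly, which is the inefficiency described above.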

⁴ The Dirac delta function $\delta_{x_0}(x)$ has the properties that it is non-zero only at $x = x_0$, $\int_X \delta_{x_0}(x)\, dx = 1$, and $\int_X f(x)\, \delta_{x_0}(x)\, dx = f(x_0)$. The Dirac delta can be thought of as the derivative of the Heaviside step function $H_{x_0}(x) = \begin{cases} 0 & \text{for } x < x_0 \\ 1 & \text{for } x \geq x_0 \end{cases}$.

Importance sampling is a Monte Carlo technique that avoids discarding samples by only sampling values for $\vec{q}$ and $\vec{u}$; these samples are then weighted by how well they conform to the evidence. More specifically, samples $\langle \vec{q}_i, \vec{u}_i \rangle$ are drawn from a proposal distribution $q(\vec{q}, \vec{u})$ and assigned weights $w(\vec{q}_i, \vec{u}_i) = \frac{p(\vec{q}_i, \vec{u}_i, \vec{e})}{q(\vec{q}_i, \vec{u}_i)}$. Observing that $p(\vec{q}_i, \vec{u}_i, \vec{e}) = q(\vec{q}_i, \vec{u}_i)\, w(\vec{q}_i, \vec{u}_i)$, the Monte Carlo approximation of the conditional distribution becomes

$$p(\vec{q} \mid \vec{e}) \approx \frac{\sum_{i=1}^{N} \delta_{\vec{q}_i}(\vec{q})\, w(\vec{q}_i, \vec{u}_i)}{\sum_{i=1}^{N} w(\vec{q}_i, \vec{u}_i)}$$

Efficient importance sampling requires that the proposal distribution $q(\vec{q}, \vec{u})$ produces samples from the high-probability regions of $p(\vec{q}, \vec{u} \mid \vec{e})$. Choosing a good proposal distribution can be very difficult, becoming nearly impossible as the dimensionality of the search space increases.
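The weighted estimate can be sketched in Python as follows. The model is a hypothetical toy (binary q with prior 0.3, binary evidence e with p(e=1|q=1) = 0.9 and p(e=1|q=0) = 0.2, no nuisance variables), and the proposal is assumed to be a fair coin over q. These choices are for illustration only.

```python
import random


def joint_probability(q, e):
    """Hypothetical joint p(q, e) = p(q) p(e | q)."""
    prior = 0.3 if q == 1 else 0.7
    prob_e_is_1 = 0.9 if q == 1 else 0.2
    return prior * (prob_e_is_1 if e == 1 else 1.0 - prob_e_is_1)


def importance_estimate(observed_e, num_samples, rng):
    """Self-normalized importance sampling estimate of p(q=1 | e)."""
    proposal_prob = 0.5  # the proposal assigns 0.5 to q=0 and to q=1
    weighted_hits = 0.0
    total_weight = 0.0
    for _ in range(num_samples):
        q = 1 if rng.random() < proposal_prob else 0
        weight = joint_probability(q, observed_e) / proposal_prob
        total_weight += weight
        if q == 1:
            weighted_hits += weight
    return weighted_hits / total_weight


posterior_q1 = importance_estimate(observed_e=1, num_samples=50_000,
                                   rng=random.Random(2))
```

No sample is discarded; samples that explain the evidence poorly simply contribute less weight.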

2.4 Markov Chain Monte Carlo

Markov chain Monte Carlo (MCMC) is an inference method designed to spend most of the computational effort producing samples from the high probability regions of $p(\vec{q}, \vec{u} \mid \vec{e})$. In this method, a stochastic walk is taken through the state space $\vec{q} \times \vec{u}$ such that the probability of being in a particular state $\langle \vec{q}_i, \vec{u}_i \rangle$ at any point in the walk is $p(\vec{q}_i, \vec{u}_i \mid \vec{e})$. Therefore, samples from $p(\vec{q}, \vec{u} \mid \vec{e})$ can be produced by recording the states visited by the stochastic walk. The stochastic walk is a Markov chain: the choice of state at time $t + 1$ depends only on the state at time $t$. Formally, if $s_t \in \vec{q} \times \vec{u}$ is the state of the chain at time $t$, then $p(s_{t+1} \mid s_1, \cdots, s_t) = p(s_{t+1} \mid s_t)$. Because Markov chains are history-free, they can be run for an unlimited number of iterations without consuming additional memory space; contrast this with classic backtracking search strategies which maintain a complete history of visited states and a schedule of states to be visited. The history-free property also means that the MCMC stochastic walk can be completely characterized by $p(s_{t+1} \mid s_t)$, known as the transition kernel. I will use the notation $K(s_t \to s_{t+1}) = p(s_{t+1} \mid s_t)$ for transition kernels to emphasize the directionality of the movement through the state space. The transition kernel $K$ is a linear transform, such that if $p_t = p_t(s)$ is the row vector encoding the probability of the walk being in state $s$ at time $t$, then $p_{t+1} = p_t K$.

If the stochastic walk starts from state $s_0$, such that the distribution over this initial state is the delta distribution $p_0 = \delta_{s_0}(s)$, then the state distribution for the chain after step $t$ is $p_t = p_0 K^t$. The key to Markov chain Monte Carlo is to choose $K$ such that $\lim_{t \to \infty} p_t = p(\vec{q}, \vec{u} \mid \vec{e})$, regardless of choice of $s_0$; kernels with this property are said to converge to an equilibrium distribution $p_{eq} = p(\vec{q}, \vec{u} \mid \vec{e})$. Convergence is guaranteed if both:

• $p_{eq}$ is an invariant (or stationary) distribution for $K$. A distribution $p_{inv}$ is an invariant distribution for $K$ if $p_{inv} = p_{inv} K$.

• $K$ is ergodic. A kernel is ergodic if it is irreducible (any state can be reached from any other state) and aperiodic (the stochastic walk never gets stuck in cycles).

Markov chain Monte Carlo can be viewed as fixed point iteration on the domain of probability distributions, where $K$ is the iterated function and $p_{eq}$ is the unique fixed point, even though in practice $K$ is iteratively applied to a sample from $p_{eq}$, rather than $p_{eq}$ itself, side-stepping the issue of explicitly representing distributions.
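The fixed-point view can be made concrete on a small discrete state space, where the kernel is a row-stochastic matrix and a distribution is a row vector. The sketch below uses a hypothetical three-state kernel chosen for this example (it is irreducible and aperiodic, with invariant distribution [0.25, 0.5, 0.25]) and repeatedly applies it to each possible delta distribution.

```python
def apply_kernel(distribution, kernel):
    """One step of the chain on distributions: returns the row vector p K."""
    n = len(kernel)
    return [sum(distribution[i] * kernel[i][j] for i in range(n))
            for j in range(n)]


def distribution_after(p0, kernel, num_steps):
    """Compute p_t = p_0 K^t by repeated application of the kernel."""
    p = list(p0)
    for _ in range(num_steps):
        p = apply_kernel(p, kernel)
    return p


# Hypothetical ergodic kernel; each row sums to 1.
kernel = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
p_eq = [0.25, 0.5, 0.25]

# Start from each delta distribution in turn; all converge to the same limit.
limits = []
for start in range(3):
    p0 = [1.0 if s == start else 0.0 for s in range(3)]
    limits.append(distribution_after(p0, kernel, num_steps=60))
```

Every entry of `limits` agrees with `p_eq` to numerical precision, regardless of the starting state.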

Transition kernels compose well. Let $K_1$ and $K_2$ be two transition kernels with invariant distribution $p_{inv}$. The cycle hybrid kernel $K_{cycle} = K_1 K_2$ is the result of first taking a step with $K_1$, then taking a step with $K_2$. $K_{cycle}$ has the same invariant distribution $p_{inv}$. Kernels can also be composed using a mixture hybrid kernel $K_{mixture} = \alpha K_1 + (1 - \alpha) K_2$ for $0 \leq \alpha \leq 1$, which is the result of stochastically choosing to apply either $K_1$ or $K_2$, with $\alpha$ being the probability of choosing $K_1$. Mixture hybrid kernels also maintain the invariant distribution $p_{inv}$. These hybrid kernels do not guarantee ergodicity, but it is generally very easy to show that the composite kernel is ergodic. Hybrid kernels are the key that will enable MCMC-based inference to compose in the same way that probabilistic models compose.
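The following sketch checks both composition rules numerically on a hypothetical three-state example. The two base kernels were constructed by hand for this illustration: each leaves [0.25, 0.5, 0.25] invariant but moves between only two of the three states, so neither is ergodic on its own, while their cycle and mixture hybrids are.

```python
def apply_kernel(distribution, kernel):
    """Row vector times row-stochastic matrix: p K."""
    n = len(kernel)
    return [sum(distribution[i] * kernel[i][j] for i in range(n))
            for j in range(n)]


def cycle(k1, k2):
    """Cycle hybrid: a step with k1 followed by a step with k2 (matrix product)."""
    n = len(k1)
    return [[sum(k1[i][m] * k2[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


def mixture(alpha, k1, k2):
    """Mixture hybrid: apply k1 with probability alpha, otherwise k2."""
    n = len(k1)
    return [[alpha * k1[i][j] + (1 - alpha) * k2[i][j] for j in range(n)]
            for i in range(n)]


p_inv = [0.25, 0.5, 0.25]
k1 = [[0.2, 0.8, 0.0], [0.4, 0.6, 0.0], [0.0, 0.0, 1.0]]  # never leaves state 2
k2 = [[1.0, 0.0, 0.0], [0.0, 0.6, 0.4], [0.0, 0.8, 0.2]]  # never leaves state 0

k_cycle = cycle(k1, k2)
k_mixture = mixture(0.5, k1, k2)

# Starting in state 2, the cycle hybrid still reaches the invariant distribution.
p_after = [0.0, 0.0, 1.0]
for _ in range(100):
    p_after = apply_kernel(p_after, k_cycle)
```

Here ergodicity of the composite follows because every state can reach every other state once both base kernels are available, even though each base kernel alone is reducible.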

Kernel composition is only useful to the extent that effective base kernels can be generated. The most common recipe for constructing an MCMC transition kernel with a specific equilibrium distribution is the Metropolis-Hastings method [43, 26], which converts an arbitrary proposal kernel $q(s_t \to s^*)$ into a transition kernel with the desired invariant distribution $p_{eq}(s)$. In order to produce a sample from a Metropolis-Hastings transition kernel, one first draws a proposal $s^* \sim q(s_t \to s^*)$, then evaluates the Metropolis-Hastings acceptance probability

$$A(s_t \to s^*) = \min\left(1, \frac{p(s^*)\, q(s^* \to s_t)}{p(s_t)\, q(s_t \to s^*)}\right).$$

With probability $A(s_t \to s^*)$ the proposal is accepted and $s_{t+1} = s^*$; otherwise the proposal is rejected and $s_{t+1} = s_t$. Intuitively, Metropolis-Hastings kernels tend to accept moves that lead to higher probability parts of the state space due to the $\frac{p(s^*)}{p(s_t)}$ term, while also tending to accept moves that are easy to undo due to the $\frac{q(s^* \to s_t)}{q(s_t \to s^*)}$ term. Because Metropolis-Hastings kernels only evaluate $p(s)$ as part of the ratio $\frac{p(s^*)}{p(s_t)}$, one may include in every evaluation of $p(s)$ an unknown normalizing constant without altering the kernel’s transition probabilities. For inference, this means that instead of computing the generally intractable integral involved in evaluating $p(s_t) = p(\vec{q}_t, \vec{u}_t \mid \vec{e}) = \frac{p(\vec{q}_t, \vec{u}_t, \vec{e})}{p(\vec{e})} = \frac{p(\vec{q}_t, \vec{u}_t, \vec{e})}{\int_{\vec{q}, \vec{u}} p(\vec{q}, \vec{u}, \vec{e})}$, we can let $c = \frac{1}{p(\vec{e})}$ be an unknown normalizing constant and evaluate $p(s_t) \propto p(\vec{q}_t, \vec{u}_t, \vec{e})$.
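A minimal Metropolis-Hastings sampler can be sketched in Python as below. The target is a hypothetical unnormalized distribution over five states arranged on a ring, and the proposal steps one position left or right with equal probability; both are assumptions made for this example. The normalizing constant (1/9 here) is never computed.

```python
import random

WEIGHTS = [1.0, 2.0, 3.0, 2.0, 1.0]  # unnormalized target over states 0..4


def unnormalized_p(state):
    return WEIGHTS[state]


def proposal_prob(source, target):
    """q(source -> target): one step left or right on a ring of 5 states."""
    return 0.5 if (target - source) % 5 in (1, 4) else 0.0


def propose(state, rng):
    return (state + rng.choice((-1, 1))) % 5


def mh_step(state, rng):
    """One Metropolis-Hastings transition from the given state."""
    candidate = propose(state, rng)
    ratio = (unnormalized_p(candidate) * proposal_prob(candidate, state)) / (
        unnormalized_p(state) * proposal_prob(state, candidate))
    acceptance = min(1.0, ratio)
    return candidate if rng.random() < acceptance else state


rng = random.Random(0)
state = 0
num_steps = 50_000
counts = [0] * 5
for _ in range(num_steps):
    state = mh_step(state, rng)
    counts[state] += 1
frequencies = [c / num_steps for c in counts]
```

The visit frequencies approach [1/9, 2/9, 3/9, 2/9, 1/9]. Because this proposal is symmetric, the proposal ratio equals 1 and could be dropped; it is kept so the code mirrors the general acceptance formula.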

2.5 Transdimensional MCMC

Sophisticated models often have an unknown number of variables. For example, a mixture model typically has parameters associated with each mixture component; if the number of components is itself to be inferred, the model has an unknown number of variables. MCMC kernels that change the parameterization of the model, such as those that change the dimensionality of the parameter space, must ensure that the reparameterization is accounted for⁵.

⁵ For example, consider a Kernel that can reparameterize a model from a variable $x$ distributed uniformly on the range $[0, 1]$ to a variable $x'$ distributed uniformly on $[0, \frac{1}{2}]$. Even though there is a one-to-one correspondence between $x$ and $x'$ values, e.g. $x = 2x'$, the two parameterizations have different densities. Specifically, for any value of $x \in [0, 1]$, the probability density is 1, whereas for any value of $x' \in [0, \frac{1}{2}]$ the probability density is 2. Any “compressing” or “stretching” of the state space must be accounted for. As described in the context of reversible jump MCMC in this section, the Jacobian of the transformation is the mathematical tool for measuring this state space distortion.

Reversible jump MCMC [21], also known as the Metropolis-Hastings-Green method, is an extension of the Metropolis-Hastings method for transdimensional inference. In the reversible jump framework, sampling from the proposal distribution $q(s_t \to s^*)$ is broken into two phases. First, a vector of random variables $v$ is sampled from a distribution $q(v)$; note that this distribution is not conditioned on $s_t$. Then $s^*$ is computed from $s_t$ and $v$ using an invertible deterministic function $g$; that is, $\langle s^*, v' \rangle = g(\langle s_t, v \rangle)$ and $\langle s_t, v \rangle = g^{-1}(\langle s^*, v' \rangle)$, where the dimensionality of $\langle s_t, v \rangle$ matches the dimensionality of $\langle s^*, v' \rangle$. Finally, the Metropolis-Hastings acceptance ratio is adjusted to reflect any changes in the parameter space caused by this move using a Jacobian factor:

$$A(s_t \to s^*) = \min\left(1, \frac{p(s^*)\, q(v')}{p(s_t)\, q(v)} \left| \frac{\partial \langle s^*, v' \rangle}{\partial \langle s_t, v \rangle} \right| \right)$$

where $\left| \frac{\partial \langle s^*, v' \rangle}{\partial \langle s_t, v \rangle} \right|$ is the Jacobian factor: the absolute value of the determinant of the matrix of first-order partial derivatives of the function $g$.
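As a worked check of the Jacobian factor, consider the uniform reparameterization from footnote 5, treated here as a deterministic move with $g(x) = x/2$ and, as a simplifying assumption for this example, no auxiliary variables (so the $q(v)$ and $q(v')$ terms are both 1):

```latex
% x ~ Uniform[0, 1] with density p(x) = 1; x' = g(x) = x / 2 with density p(x') = 2.
\[
  \left| \frac{\partial x'}{\partial x} \right| = \frac{1}{2},
  \qquad
  A(x \to x')
    = \min\left(1,\; \frac{p(x')}{p(x)} \left| \frac{\partial x'}{\partial x} \right| \right)
    = \min\left(1,\; \frac{2}{1} \cdot \frac{1}{2} \right)
    = 1 .
\]
```

The factor of 2 gained from "compressing" the density is exactly cancelled by the Jacobian, so the move is always accepted, as one would expect when both parameterizations describe the same distribution. Without the Jacobian factor the ratio would be 2, incorrectly favoring the compressed parameterization.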

2.6 Tempered Inference

A number of Monte Carlo inference variants, including simulated annealing and parallel tempering, operate by changing the “temperature” τ of the interest distribution:

$$p_{\text{tempered}}(s) \propto p(s)^{1/\tau}$$

where $p_{\text{tempered}}(s)$ reduces to $p(s)$ when τ = 1. As τ goes to 0, $p_{\text{tempered}}(s)$ concentrates all of its mass on its modes; therefore, sampling from $p_{\text{tempered}}(s)$ for very small τ is much more likely to produce the maximum a posteriori value than sampling from $p(s)$. However, such “peaky” interest distributions, whether they arise naturally or through tempering, are generally more difficult for Monte Carlo methods to handle effectively. For example, Metropolis-Hastings kernels operating on a “peaky” distribution are much more likely to have their proposals rejected.

In contrast, as τ goes to ∞, $p_{\text{tempered}}(s)$ gets increasingly flat, and Monte Carlo methods can produce samples very easily. The disadvantage of these high-τ samples is that they are less likely to come from high-probability regions of the original interest distribution $p(s)$. Simulated annealing and parallel tempering both use a sequence of tempered distributions from τ = 1 to a τ large enough to make inference easy; the intuition is to leverage results from the easy-inference values of τ to perform better on the harder τ values.
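The effect of the temperature can be seen directly on a small discrete distribution. This sketch assumes a hypothetical three-state distribution and normalizes the tempered values explicitly, which is only feasible because the state space is tiny.

```python
def temper(probabilities, tau):
    """Return the normalized distribution proportional to p(s) ** (1 / tau)."""
    powered = [p ** (1.0 / tau) for p in probabilities]
    total = sum(powered)
    return [value / total for value in powered]


p = [0.1, 0.6, 0.3]          # hypothetical interest distribution
cold = temper(p, tau=0.1)    # nearly all mass on the mode (state 1)
same = temper(p, tau=1.0)    # unchanged
hot = temper(p, tau=100.0)   # nearly uniform
```

At τ = 0.1 the mode holds more than 99% of the mass, while at τ = 100 the three probabilities differ by less than 0.01.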

In simulated annealing [17, 32], τ is initialized to a large value. Then, as MCMC proceeds, τ is gradually decreased. The hope is that the stochastic walk will find the mode of the distribution while τ is large, and will settle in that mode as τ is decreased. Once τ is small, the stochastic walk is unlikely to leave that mode (assuming a “peaky” distribution); therefore, simulated annealing is most useful for locating the maximum a posteriori value.
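A bare-bones simulated annealing loop might look like the following Python sketch. The target (an unnormalized log density on the integers 0 to 20 with a single mode at 13), the random-walk proposal, and the geometric cooling schedule are all assumptions chosen for this illustration; the text above does not prescribe a particular schedule.

```python
import math
import random


def log_density(state):
    """Hypothetical unnormalized log p(s) on states 0..20, with its mode at 13."""
    return -0.5 * (state - 13) ** 2


def annealed_step(state, tau, rng):
    """One Metropolis step targeting p(s) ** (1 / tau)."""
    candidate = state + rng.choice((-1, 1))
    if candidate < 0 or candidate > 20:
        return state  # proposals outside the state space are rejected
    log_ratio = (log_density(candidate) - log_density(state)) / tau
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return candidate
    return state


def simulated_annealing(initial_state, num_steps, tau_start, tau_end, rng):
    state = initial_state
    for step in range(num_steps):
        fraction = step / (num_steps - 1)
        # Geometric interpolation from tau_start down to tau_end.
        tau = tau_start * (tau_end / tau_start) ** fraction
        state = annealed_step(state, tau, rng)
    return state


final_state = simulated_annealing(initial_state=0, num_steps=5000,
                                  tau_start=50.0, tau_end=0.05,
                                  rng=random.Random(4))
```

While τ is large the walk wanders freely; as τ shrinks, downhill moves become extremely unlikely and the walk settles at the mode.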

In parallel tempering [18, 15], multiple MCMC chains are run in parallel, each at a different fixed value of τ. Occasionally, swaps of adjacent chains are proposed and evaluated according to the Metropolis-Hastings acceptance ratio. Samples are only collected from the lowest τ chain. One interpretation of parallel tempering is that the high τ chains act as proposal distributions for the lower τ chains. The samples gathered from parallel tempering should be representative of the whole interest distribution (as opposed to simulated annealing, which produces samples only from one mode). Parallel tempering provides a generic means for the MCMC inference to move efficiently between modes of the interest distribution, even when those modes are widely separated by low-probability regions. Without parallel tempering, the probability that plain MCMC will make these moves becomes vanishingly small unless a great deal of problem-specific knowledge is used to construct clever proposal distributions.
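The swap evaluation can be sketched as follows, assuming the joint target over all chains is the product of the per-chain tempered distributions and that the swap proposal is symmetric. Under those assumptions the acceptance ratio for exchanging the states of chains i and j depends only on the two log densities and the two temperatures. The function name is illustrative.

```python
import math


def swap_acceptance(log_p_i, log_p_j, tau_i, tau_j):
    """Probability of accepting a swap between two tempered chains.

    log_p_i is log p(s_i) for the state currently held by the chain at
    temperature tau_i; log_p_j is the same for the chain at tau_j.
    The ratio is
        [p(s_j)^(1/tau_i) p(s_i)^(1/tau_j)] / [p(s_i)^(1/tau_i) p(s_j)^(1/tau_j)].
    """
    log_ratio = (log_p_j - log_p_i) * (1.0 / tau_i - 1.0 / tau_j)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


# Cold chain (tau=1) holds a worse state than the hot chain (tau=10): always swap.
always = swap_acceptance(log_p_i=-3.0, log_p_j=-1.0, tau_i=1.0, tau_j=10.0)

# Cold chain holds the better state: the swap is accepted only occasionally.
sometimes = swap_acceptance(log_p_i=-1.0, log_p_j=-2.0, tau_i=1.0, tau_j=10.0)
```

Good states discovered by a hot chain therefore migrate toward the cold chain, which matches the interpretation of hot chains as proposal mechanisms.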

2.7 Particle Filtering

Particle filtering, also known as Sequential Monte Carlo, is a population-based Monte Carlo method similar to importance sampling. It is typically applied to dynamic models with unobserved variables $x_i$ forming a Markov chain such that $p(x_i \mid x_0, \ldots, x_{i-1}) = p(x_i \mid x_{i-1})$, and with observed variables $y_i$ that each depend only on the corresponding $x_i$, so that the joint distribution is $p(x_0, \ldots, x_n, y_1, \ldots, y_n) = p(x_0) \prod_{i=1}^{n} p(x_i \mid x_{i-1})\, p(y_i \mid x_i)$. Inference by particle filtering produces an approximation to $p(x_n \mid y_1, \ldots, y_n)$.

This technique unrolls inference over the same timeline used to index the dynamic model, such that the inference results for $p(x_n \mid y_1, \ldots, y_n)$ together with the observation $y_{n+1}$ are all that is needed to infer $p(x_{n+1} \mid y_1, \ldots, y_{n+1})$. Inference is achieved using a population of “particles”: weighted samples $\langle x_i^j, w_i^j \rangle$ which together form a Monte Carlo estimate $p(x_i \mid y_1, \ldots, y_i) \approx \sum_j \delta_{x_i^j}(x_i)\, w_i^j$ (the weights are normalized such that $\sum_j w_i^j = 1$).

Inference is initialized by drawing a number of particles from the prior distribution on states and assigning each particle an equal weight:

$$x_0^j \sim p(x_0); \qquad w_0^j = \frac{1}{\#\text{particles}}$$

Inference is then advanced to the next time step by stochastically advancing each particle according to an importance distribution $q$, such that $x_i^j \sim q(x_i^j \mid x_{i-1}^j, y_i)$. The simplest cases are those in which it is tractable to sample from $q(x_i^j \mid x_{i-1}^j, y_i) \propto p(x_i^j \mid x_{i-1}^j)\, p(y_i \mid x_i^j)$. Otherwise, a common choice for $q$ is $q(x_i^j \mid x_{i-1}^j, y_i) = p(x_i^j \mid x_{i-1}^j)$, though any approximation can be used. Weights are then updated as in importance sampling:

$$w_i^j = \frac{p(x_0^j, \ldots, x_i^j, y_1, \ldots, y_i)}{\prod_{k=1}^{i} q(x_k^j \mid x_{k-1}^j, y_k)} = w_{i-1}^j\, \frac{p(x_i^j \mid x_{i-1}^j)\, p(y_i \mid x_i^j)}{q(x_i^j \mid x_{i-1}^j, y_i)}.$$

Next, the weights are renormalized to sum to 1:

$$w_i^j \leftarrow \frac{w_i^j}{\sum_{k=1}^{\#\text{particles}} w_i^k}.$$

Finally, the particles may be resampled and the weights set to equal values:

$${x'}_i^j \sim p({x'}_i^j = x_i^k) = w_i^k; \qquad {w'}_i^j = \frac{1}{\#\text{particles}}.$$

Without resampling, most of the particle weights would drift towards 0; resampling effectively kills off particles with small weights while duplicating particles with large weights. Resampling is often only performed when certain criteria are met, such as when an estimate of the number of effective particles (i.e., particles with relatively large weight) falls below a predetermined threshold.
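The whole procedure can be sketched for a hypothetical two-state hidden Markov model, invented for this example: the hidden state persists with probability 0.9, each observation matches the hidden state with probability 0.8, and the prior on the initial state is uniform. The sketch proposes from the transition prior, so the weight update reduces to multiplying by the observation likelihood. It uses the reciprocal of the sum of squared weights as the effective particle count, which is one common estimate and an assumption here.

```python
import random

TRANSITION_STAY = 0.9    # p(x_i = x_{i-1})
EMISSION_CORRECT = 0.8   # p(y_i = x_i)


def transition_sample(x_prev, rng):
    return x_prev if rng.random() < TRANSITION_STAY else 1 - x_prev


def likelihood(y, x):
    return EMISSION_CORRECT if y == x else 1.0 - EMISSION_CORRECT


def effective_particle_count(weights):
    return 1.0 / sum(w * w for w in weights)


def particle_filter(observations, num_particles, rng, resample_threshold=0.5):
    # Initialize from the (uniform) prior with equal weights.
    particles = [rng.randint(0, 1) for _ in range(num_particles)]
    weights = [1.0 / num_particles] * num_particles
    for y in observations:
        # Advance each particle using q = p(x_i | x_{i-1}).
        particles = [transition_sample(x, rng) for x in particles]
        # With this q the weight update is w * p(y_i | x_i), then renormalize.
        weights = [w * likelihood(y, x) for w, x in zip(weights, particles)]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Resample only when too few particles carry meaningful weight.
        if effective_particle_count(weights) < resample_threshold * num_particles:
            particles = rng.choices(particles, weights=weights, k=num_particles)
            weights = [1.0 / num_particles] * num_particles
    return particles, weights


particles, weights = particle_filter([1, 1, 1], num_particles=5000,
                                     rng=random.Random(3))
estimate_x_is_1 = sum(w for x, w in zip(particles, weights) if x == 1)
```

For this toy model the exact filtering probability of state 1 after observing [1, 1, 1] is approximately 0.953 (computed by hand with the forward recursion), which the particle estimate should approximate.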


Chapter 3 The Blaise State–Density–Kernel Graphical Modeling Language

My thesis is that a framework for probabilistic inference can be designed that enables efficient composition of both models and inference procedures, that is suited to the representational needs of emerging classes of probabilistic models, and that supports recent advances in inference.

In this chapter, I support this thesis by introducing the Blaise State–Density–Kernel graphical modeling language and showing how this language supports composition of models and inference procedures.

By the end of this chapter, you will understand all the elements of the Blaise SDK graphical modeling language. You will be able to draw complete graphical models, including graphical representations of inference, for sophisticated models such as multi-feature non-parametric mixture models, and you will understand how models can be built up iteratively by composing existing probabilistic models and inference methods with minimal effort. This chapter also provides the foundation for chapters 4–6, which will discuss transformations of SDK models, a virtual machine that can execute SDK models, and applications built using Blaise.

This chapter introduces a graphical modeling language, including several symbols. Each symbol is described as it is introduced. For reference, appendix B also supplies a complete legend of symbols, including the page on which the symbol was introduced.

3.1 An overview of the Blaise modeling language

In order to fully specify a probabilistic modeling application, three things must be described. First, the modeler must describe the state space. This is a description of the domain of the problem: what are the variables we might be interested in? What values can those variables take on? Could there be an unknown number of variables (for example, could there be an unknown number of objects in the world we are trying to describe, such as an unknown number of airplanes in an aircraft tracking problem)? Are there structural constraints amongst the variables (for example, is every variable of type A associated with a variable of type B)? One of the Blaise modeling language’s three central abstractions, State, is devoted to expressing these aspects of the model.

The state space typically describes a vast number of possible variable instantiations, most of which the modeler is not very interested in. The second central abstraction, Density, allows the modeler to describe how interesting a particular state configuration is. For discrete probabilistic models, this is typically the joint probability mass function. If continuous variables are used, then Density would represent the joint probability density function (from which the abstraction derives its name). When describing how to score a State, the modeler will be expressing things such as: how does the joint score decompose into common pieces, such as standard probability distributions? How does the score accommodate state spaces with unknown numbers of objects – does it have patterns that repeat for each one of these objects? The Density abstraction is designed to represent the modeler’s answers to these questions.

With State and Density in hand, the modeler can now express models, but cannot yet say how to extract information from these models. As described in chapter 2, there are a wide variety of inference techniques that can be applied. Blaise focuses on those inference techniques that can be described as history-free stochastic walks through a State space, guided by the Density¹. All such walks can be completely described by a transition kernel: an expression of the probability that the stochastic walk will make a particular step in the state space, given the state the walk is currently at. To describe a transition kernel, a modeler will have to make choices such as: which variables in the state space will change on this step? How exactly will these variables be updated – are there common update procedures that can be used? How will these update rules be coordinated so that the whole state space is explored efficiently – that is, how are fragments of an inference algorithm composed? How does the inference method accommodate state spaces with unknown numbers of objects? Often the modeler will want to maintain a certain relationship between the Density and the exploration of the state space; for example, a modeler designing a Markov chain Monte Carlo-based inference method will want to ensure that the transition kernel converges to the Density as an invariant distribution. How will the modeler meet this goal? These considerations are the focus of the Kernel abstraction in Blaise.

[Figure 3-1 panels: (a) Joint Density Equation, p(x|p, α, β) = Beta(p|α, β) · Binomial(x|p); (b) Bayes Net; (c) Factor Graph; (d) Blaise, without inference; (e) Blaise, with inference]

Figure 3-1: A preview of the Blaise modeling language, showing the same simple Beta-Binomial model as (a) a joint probability density equation, (b) a Bayes net, (c) a factor graph, (d) a Blaise probabilistic model (no inference), and (e) a Blaise probabilistic model with inference.

A common design tenet runs throughout the entire modeling language: support composability. That is, it should be easy for the modeler to reuse existing models in the creation of new models. For example, if the modeler has constructed a State–Density–Kernel representation of a Chinese Restaurant Process, it should be easy for the modeler to reuse this representation to create a CRP-based mixture model. In most cases, in fact, the SDK for the original model should not need to be modified at all – even the same inference procedure should continue to work in the new model, despite the fact that there are now other States in the state space and other Densities affecting the joint probability density. Realizing this design philosophy will mean that if a modeler extends an existing model or composes several existing models, development resources can be reserved for the truly novel parts of the new model. It is my hypothesis that such an approach will provide the leverage required to effectively engineer sophisticated models of increasing complexity, such as are becoming ever more important in artificial intelligence, cognitive science, and commercial applications.

¹ The history-free limitation is not restrictive, because history-dependent stochastic walks can also be modeled by augmenting the State space with an explicit representation of the history.


This chapter will compare and contrast the Blaise modeling language with classical graphical modeling languages such as Bayes nets and factor graphs. It should be noted that Blaise models are strictly more expressive than factor graphs; see section 6.5 for a simple demonstration of how any factor graph can be translated to a Blaise model.

Although Blaise SDK graphs are presented here in the specific context of Monte Carlo methods for probabilistic inference, the SDK foundation (consisting of a domain described using States, functions over the domain described using Densities, and a stochastic process for domain exploration described using Kernels, together with simple composition and locality rules for each of these representations) can also serve as a general framework for expressing and manipulating any stochastic (or deterministic) automaton.

3.2 Blaise States

The state space describes the domain of the inference problem; that is, the variables and their valid settings. All probabilistic modeling languages have some representation of the state space: graphical modeling languages, such as Bayes nets and factor graphs, use nodes to represent variables (as in figure 3-2), whereas programmatic modeling languages, such as BLOG [45], allow the user to declare variables. Blaise follows in the graphical modeling tradition by representing variables as graph nodes called States. State nodes are also typed, carrying information about what values the represented variable can take on. For example, a State node might be typed as a continuous variable, indicating that it will take real numbers as values.

Unlike classical graphical modeling languages, however, Blaise requires that its State nodes be organized into a single rooted tree via containment (has-a) links in the graph (see figure 3-3). This organization is the foundation of State composition in Blaise – it allows the modeler to take several States and bundle them together as children of some parent State. Note that the space of States is closed under this composition structure: composing several States produces another State.


Figure 3-2: A simple graphical model for a single draw x from a beta-binomial model, drawn as a factor graph. Several of the examples in this chapter build on this familiar model, though most will ignore the conjugacy properties of the model. Exploiting conjugacy in Blaise will be discussed in section 4.6.

[Figure 3-3 panels: (a) Factor Graph (without factors); (b) Blaise State Space]

Figure 3-3: Omitting the factors from the beta-binomial model factor graph in figure 3-2 leaves just the variables, representing the state space of the model, as in (a). Figure (b) shows the same state space as it might be implemented in Blaise. States in Blaise models form trees. The State in the Blaise model that does not have an analog in the factor graph (i.e., the root of the tree) is used to compose diverse States into a single state space. The root State is highlighted with a gray annulus.

The tree-structured organization of States is a critical enabler in modeling repeated structure in the state space. Information about repeated structure is commonly represented in a graphical modeling language using “plate notation” – drawing a box (a “plate”) containing the variables that will be repeated, and writing a number in the corner of the box to indicate how many times that structure will be repeated. Plate notation has several limitations. Most significantly, state spaces represented using plate notation are not closed under composition: composing several variable nodes produces a new class of object (a plate) rather than a variable. This in turn means that the number of copies of a plate is not part of the state space. This information is not available as a variable, so, for example, one cannot express a prior over the number of copies of a plate nor perform inference to determine how many copies of the plate should be used. This prevents an intuitive expression of even simple models such as mixture models, if the number of components is not known a priori. Expressing non-parametric mixture models is even more complicated.

[Figure 3-4 panels: (a) Factor Graph; (b) Blaise; (c) Factor Graph; (d) Blaise]

Figure 3-4: Figures (a) and (c) show the state spaces for simple factor graphs, using plate notation to represent repetition. Figures (b) and (d) show the corresponding Blaise state spaces. States marked with a star are Blaise Collection States. The unmarked State in (d) is a generic composite State containing x and y.

There are a number of other important shortcomings of plate notation. Plate notation is most often used in the context of Bayes nets, where there is the additional limitation that the notation does not express how the model’s joint density should factor across the plate boundary. Inference procedures also need to account for repeated structures in the state space, particularly when the number of repetitions is not fixed a priori. Finally, plate notation only allows plates to interact by having one plate embedded in another; it does not permit plates to intersect, nor interact in other more complex relationships, without making the meaning ambiguous. This makes it challenging to express many interesting models. Each of these limitations will be addressed in this chapter (specifically in sections 3.5, 3.6 and 3.7).

Instead of plates, Blaise uses State composition to capture repeated structure. Blaise allows States to have arbitrarily-sized collections of children. Such Collection States are used to capture the idea of repetition. For example, a model that would be denoted in plate notation as a single variable x inside a plate would be captured in Blaise as a Collection State with a collection of x States as children (see Figure 3-4 (a) and (b)). Composition allows the same containment mechanism to be used for repeated state structure rather than just single states. For example, a model that would be denoted in plate notation as two variables x and y inside a plate would be captured in Blaise as a Collection State with a collection of composite States, where each composite has an x and a y (see Figure 3-4 (c) and (d)). For easy interpretation, Blaise will also include plate-like boxes surrounding the repeated structure. However, it must be emphasized that these ornamentations carry no new information – they simply highlight the children of a Collection State, allowing the grouping to be seen at a glance, much as a syntax highlighting text editor might highlight balanced pairs of parentheses without providing any additional information.

[Figure 3-5 panels: (a) Factor Graph; (b) Blaise State Space]

Figure 3-5: The beta-binomial models from Figure 3-3 can be extended to model multiple datapoints drawn from the same binomial distribution. This figure shows the state space of this extended model, in plated factor graph notation and as a Blaise State structure.

Reifying the repetition of State structure using Collection States remedies the weakness of plate notation wherein the number of copies of a repeated structure is not available as a variable. Because the Collection State is a State like any other, it serves as a variable in the State space. Thus the computation of the joint density can naturally reference the size of the Collection State (the representation of the joint density will be described shortly in section 3.3).

In order to perform Monte Carlo inference in state spaces with repeated structure where the repetition count is not known a priori, it will be necessary to consider states with different repetition counts. That is, it will be necessary, at inference time, to allow instances of the repeated structure to be added to and removed from the state space. Thus, the topology of Blaise States is considered to be mutable at inference time, so that children may be added and removed from Collection States. It also follows that the State topology carries information. Consider, as an example, the information contained in the size of a Collection State, which might be used to compute the joint density (as above), or might itself be the target of inference (e.g. for a query such as “how many mixture components are required to explain this data?”).

An interesting effect of allowing the State topology to bear information is that many models that would normally require the use of integer indices no longer require such indices. For example, consider a mixture model, where the number of mixture components is fixed a priori, but where the assignment of data to components is to be inferred. A Bayes net for such a model would assign a unique integer index from the range [0, number of components − 1] to each component (a name, in essence), and each data point would have associated with it a component index from the same range (see figures 3-6a and 3-6b). Inference is then a matter of choosing appropriate values for the integer indices associated with the datapoints. In a Blaise model, it would be more natural to use a Collection State to represent each component, with the data points currently assigned to each component being the children of that component’s Collection State (see figures 3-6c and 3-6d). Inference is then a matter of moving data points from one component to another (figure 3-6e). This formulation has several advantages. First, it is more parsimonious insofar as the components of a mixture model usually do not actually have an order; the component indices in the Bayes net formulation are an artifact of the formalism that must be explicitly worked around when it comes time to compute the joint density or to evaluate a state in order to answer a query. Second, integer indices are often assumed to be contiguous, which imposes several inefficiencies in the implementation of the system. For example, deleting a component with a mid-valued component index will require changing the component index of at least one other component, otherwise the existing components will not have contiguous indices. Changing the value of a component index is inefficient because it requires finding all the data points associated with the component and updating their component index as well; thus, deleting a component in an integer-indexed model is usually implemented as an operation with time cost linear in the number of datapoints rather than the constant-time operation it is in a Blaise model in which data point assignment is represented directly by the State topology.
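The topology-based representation can be sketched with a few lines of Python. The classes below are hypothetical stand-ins written for this illustration; they are not the Blaise API, and they ignore Densities and Kernels entirely. The point is only that reassigning a datapoint or deleting an emptied component is a local edit to the tree, with no integer indices to keep contiguous.

```python
class State:
    """Hypothetical stand-in for a State node: a named variable with a parent."""

    def __init__(self, name, value=None):
        self.name = name
        self.value = value
        self.parent = None


class CollectionState(State):
    """Hypothetical Collection State: an unordered, resizable set of children."""

    def __init__(self, name):
        super().__init__(name)
        self.children = set()

    def add(self, child):
        # Moving a child detaches it from its previous parent first.
        if child.parent is not None:
            child.parent.children.discard(child)
        child.parent = self
        self.children.add(child)

    def remove(self, child):
        self.children.discard(child)
        child.parent = None

    def size(self):
        return len(self.children)


# Two components and three datapoints, as in an expanded mixture model.
components = CollectionState("components")
c1, c2 = CollectionState("component-1"), CollectionState("component-2")
components.add(c1)
components.add(c2)
x1, x2, x3 = State("x1", 0.2), State("x2", 0.9), State("x3", 1.1)
c1.add(x1)
c1.add(x2)
c2.add(x3)

# Reassign x2 by moving it in the tree; no component index is rewritten.
c2.add(x2)
sizes_after_move = (c1.size(), c2.size())

# Empty component-1 and delete it; the remaining component is untouched.
c2.add(x1)
components.remove(c1)
remaining_components = components.size()
```

The number of components is simply `components.size()`, so it is available as part of the state rather than as an annotation outside it.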


[Figure 3-6 panels: (a) Bayes Net; (b) Bayes Net state space; (c) Blaise; (d) Expanded Blaise; (e) Expanded Blaise, after reassignment]

Figure 3-6: Mixture models are among the simplest models with interesting repeated structure. (a) shows a simple mixture model, represented as a Bayes net. There are m components, with Θ representing the parameters for a component. All the Θ variables are governed by a common hyperparameter Ω. There are also n datapoints, where the value of the datapoint is x, and a ∈ [0, m − 1] encodes the component to which the associated datapoint is assigned. (b) shows just the state space for this Bayes net. (c) shows the state space for a mixture model in Blaise notation. Rather than using integer-valued component assignment variables (a in the Bayes nets), the Blaise model uses Collection States for each component, where each Collection State contains just those datapoints assigned to the component. (d) shows an expanded version of this Blaise model with two components and three datapoints, and (e) demonstrates how the State structure would change when datapoint x2 is moved

[Figure 3-7 panels: (a) Factor Graph; (b) Blaise]

Figure 3-7: A model for a single draw x from a beta-binomial model, drawn as (a) a factor graph, and (b) a Blaise model. In Blaise models, Densities form trees. Densities also have States as children, encoding the portions of the State hierarchy that will be used when evaluating the Density. Note that the Density→State connections reflect the factor→node connections in the graphical model. The Density without a graphical model analog represents the (multiplicative) composition of individual Densities into a Density over the whole state space. The gray annulus around this Density highlights it as the root Density.
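The Density tree described in the caption of figure 3-7 can be sketched in Python as follows. This is an illustrative sketch, not the Blaise API: the class names (`State`, `BetaDensity`, `BinomialDensity`, `ProductDensity`) and the `log_density` method are hypothetical. It also assumes that the beta-binomial model consists of a Beta(alpha, beta) prior on a success probability p and a Binomial(n, p) likelihood for x with a fixed number of trials n, and it works in log space so that the multiplicative root becomes a sum.

```python
import math


class State:
    """A leaf of the state space holding a single value."""

    def __init__(self, value):
        self.value = value


class BetaDensity:
    """Density over p given hyperparameter States alpha and beta."""

    def __init__(self, p, alpha, beta):
        self.p, self.alpha, self.beta = p, alpha, beta  # State children

    def log_density(self):
        a, b, p = self.alpha.value, self.beta.value, self.p.value
        log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
        return log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p)


class BinomialDensity:
    """Density over a count x given p, with a fixed number of trials n."""

    def __init__(self, x, p, n):
        self.x, self.p, self.n = x, p, n  # x and p are States; n is an int

    def log_density(self):
        x, n, p = self.x.value, self.n, self.p.value
        log_choose = (math.lgamma(n + 1) - math.lgamma(x + 1)
                      - math.lgamma(n - x + 1))
        return log_choose + x * math.log(p) + (n - x) * math.log(1 - p)


class ProductDensity:
    """Root Density: multiplies child Densities (a sum in log space)."""

    def __init__(self, children):
        self.children = list(children)

    def log_density(self):
        return sum(child.log_density() for child in self.children)


# Wire up the tree: each Density points at the States it reads,
# mirroring the factor-to-node edges of the factor graph.
alpha, beta = State(1.0), State(1.0)
p, x = State(0.5), State(1)
root = ProductDensity([BetaDensity(p, alpha, beta), BinomialDensity(x, p, n=1)])
```

Modeling several draws that share p would, under the same assumptions, amount to giving the product node one `BinomialDensity` child per draw.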

may be cases in which a reference to more than one State is required. For example, in an admixture model, a State may belong to more than one mixture component simultaneously. To capture this type of pattern, Blaise States support state-to-state dependency links in addition to has-a links. These links are permitted to connect the States in non-tree-structured ways. State-to-state dependency links are also used to model constraints in the State space, as described in section 3.7.
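The distinction between has-a links and state-to-state dependency links can be sketched in Python as follows. The `State` class, its fields, and its methods are hypothetical and do not reproduce the Blaise implementation; the sketch assumes only what the text states, namely that has-a links form a tree while dependency links may connect States in non-tree-structured ways.

```python
class State:
    """A State with one has-a parent and any number of dependency links."""

    def __init__(self, name):
        self.name = name
        self.parent = None      # has-a link (tree-structured)
        self.children = []
        self.depends_on = []    # state-to-state dependency links

    def add_child(self, child):
        """Attach `child` by a has-a link; a State has at most one parent."""
        if child.parent is not None:
            raise ValueError(f"{child.name} already has a has-a parent")
        child.parent = self
        self.children.append(child)

    def link_to(self, other):
        """Add a dependency link; these are not restricted to a tree."""
        self.depends_on.append(other)


# Admixture-style example: x1 lives in exactly one collection (has-a),
# but refers to two mixture components through dependency links.
data = State("data")
c1, c2 = State("component1"), State("component2")
x1 = State("x1")
data.add_child(x1)
x1.link_to(c1)
x1.link_to(c2)
```

Keeping the two kinds of link separate preserves a single tree for traversal while still allowing a State to participate in more than one component.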

3.3 Blaise Densities

Whereas States are used to model the domain of the state space for a probabilistic model, Densities are used to describe the joint probability density function over that state space. It is often advantageous to decompose the joint probability density

[Figure 3-8 panels: (a) Graphical Model; (b) Blaise Model]

Figure 3-8: A model for multiple draws from a beta-binomial model, drawn as (a) a factor graph, and (b) a Blaise model. The density labeled π is a Multiplicative Collection Density; it composes any number of child Densities by computing the product of their values.

function into a number of simpler Densities that only depend on a subset of the state space variables (i.e., a projection of the state space onto a lower-dimensional subspace). For example, Bayes nets decompose the joint Density into a product of conditional probabilities and factor graphs decompose the joint Density into a product of factors. Decomposing the density is beneficial for several reasons:

• Pragmatic: the modeler can often express the joint density as a composition of common Densities which are built into the modeling language and which are easy for another human to interpret

• Learnability: decomposing the joint density often reduces the number of degrees of freedom. For example, expressing the joint density over two boolean variables x and y as a single conditional probability table would require 3 parameters, e.g. p(x ∧ y), p(x ∧ ¬y), and p(¬x ∧ y), with p(¬x ∧ ¬y) = 1 − p(x ∧ y) − p(x ∧ ¬y) − p(¬x ∧ y); in contrast, if the joint probability can be
