HAL Id: hal-03033672
https://hal.archives-ouvertes.fr/hal-03033672
Preprint submitted on 1 Dec 2020
Fredrik Ronquist, Jan Kudlicka, Viktor Senderov, Johannes Borgström, Nicolas Lartillot, Daniel Lundén, Lawrence Murray, Thomas Schön, David Broman
To cite this version:
Fredrik Ronquist, Jan Kudlicka, Viktor Senderov, Johannes Borgström, Nicolas Lartillot, et al. Probabilistic programming: a powerful new approach to statistical phylogenetics. 2020. ⟨hal-03033672⟩
Probabilistic programming: a powerful new approach to statistical phylogenetics
Fredrik Ronquist¹†∗, Jan Kudlicka²†, Viktor Senderov¹†, Johannes Borgström², Nicolas Lartillot³, Daniel Lundén⁴, Lawrence Murray⁵, Thomas B. Schön², David Broman⁴
¹ Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Box 50007, SE-104 05 Stockholm, Sweden
² Department of Information Technology, Uppsala University, Box 337, SE-751 05 Uppsala, Sweden
³ Laboratoire de Biométrie et Biologie Evolutive, UMR CNRS 5558, Université Claude Bernard Lyon 1, FR-69622 Villeurbanne Cedex, France
⁴ Department of Computer Science, KTH Royal Institute of Technology, SE-100 44 Stockholm, Sweden
⁵ Uber AI, San Francisco CA 94105, United States
Statistical phylogenetic analysis currently relies on complex, dedicated software packages, making it difficult for evolutionary biologists to explore new models and inference strategies. Recent years have seen more generic solutions based on probabilistic graphical models, but this formalism can only partly express phylogenetic problems. Here we show that universal probabilistic programming languages (PPLs) solve the model expression problem, while still supporting automated generation of efficient inference algorithms. To illustrate the power of the approach, we use it to generate sequential Monte Carlo (SMC) algorithms for recent biological diversification models that have been difficult to tackle using traditional approaches. This is the first time that SMC algorithms have been available for these models, and the first time it has been possible to compare them using model testing. Leveraging these advances, we re-examine previous claims about the performance of the models. Our work opens up several related problem domains to PPL approaches, and shows that few hurdles remain before PPLs can be effectively applied to the full range of phylogenetic models.
In statistical phylogenetics, we are interested in learning the parameters of models where evolutionary trees (phylogenies) play an important part. Such analyses have a surprisingly wide range of applications across the life sciences [1,2,3]. In fact, the research front in many disciplines is partly defined today by our ability to learn the parameters of realistic phylogenetic models.
Statistical problems are often analyzed using generic modeling and inference tools. Not so in phylogenetics, where empiricists are largely dependent on dedicated software developed by small teams of computational biologists [3]. Even though these software packages have become increasingly flexible in recent years, empiricists are still limited to a large extent by predefined model spaces and inference strategies. Venturing outside these boundaries typically requires the help of skilled programmers and inference experts.
If it were possible to specify arbitrary phylogenetic models in an easy and intuitive way, and then automatically learn the latent variables (the unknown parameters) in them, the full creativity of the research community could be unleashed, significantly accelerating progress. There are two major hurdles standing in the way of such a vision. First, we must find a formalism (a language) that can express phylogenetic models in all their complexity, while still being easy to learn for empiricists (the model expression problem). Second, we need to be able to generate computationally efficient inference algorithms from such model descriptions, drawing from the full range of techniques available today (the automated inference problem).
In recent years, there has been significant progress towards solving the model expression problem by adopting the framework of probabilistic graphical models (PGMs) [4,5]. PGMs can express many components of phylogenetic models in a structured way, so that efficient Markov chain Monte Carlo (MCMC) samplers, the current workhorse of Bayesian statistical phylogenetics, can be automatically generated for them. Other inference strategies are also readily applied to PGM components [6,7].
∗E-mail: [email protected]
†F.R., J.K., and V.S. contributed equally to this work.
Unfortunately, PGMs cannot express the core of phylogenetic models: the stochastic processes that generate the tree, and anything dependent on those processes. This is because the resulting evolutionary tree has variable topology, while a PGM expresses a fixed topology. It is possible to express the tree as a single stochastic variable within the PGM, but then the structure of this critical component of the model is opaque to the inference machinery. Hiding the tree inside a stochastic variable also means that it becomes impossible to describe relations between tree-generating processes and other model components, such as the rate of evolution, organism traits or biogeography.
Here, we show that the model expression problem can be solved using universal probabilistic programming languages (PPLs). PPLs have a long history in computer science [8], but until recently they have been largely of academic interest because of the difficulty of generating efficient inference machinery when using such expressive languages. This is now changing rapidly thanks to improved methods of automated inference for PPLs [9,10,11,12,13,14], and the increased interest in more flexible approaches to statistical modeling and analysis.
To demonstrate the potential of PPLs in statistical phylogenetics, we tackle a tough problem domain: models that accommodate variation across lineages in diversification rate. These include the recent ClaDS [15], LSBDS [16] and BAMM [17] models, which have attracted considerable attention among evolutionary biologists despite the difficulties in developing good inference algorithms for them [18].
Using WebPPL, an easy-to-learn PPL [9], and Birch, a language with more efficient inference machinery [14], we develop an effective encoding approach, and then automatically generate sequential Monte Carlo (SMC) algorithms based on short model descriptions (∼100 lines of code each). This is the first time that powerful and flexible SMC algorithms have been available for these models, and the first asymptotically exact inference machinery for BAMM. It is also the first time that it has been possible to compare the models directly using Bayes factors. We end the paper by discussing a few problems, all seemingly tractable, which remain to be solved before PPLs can be used to address the full range of phylogenetic models. Solving them would facilitate the adoption of a wide range of novel inference strategies that have seen little or no use in phylogenetics before.
Results
Probabilistic programming. Consider one of the simplest of all diversification models, constant rate birth-death (CRBD), in which lineages arise at a rate λ and die out at a rate µ, giving rise to a phylogenetic tree τ. Assume that we want to infer the values of λ and µ given some phylogenetic tree τ_obs of extant (now living) species that we have observed (or inferred from other data). In a Bayesian analysis, we would associate λ and µ with prior distributions, and then learn their joint posterior probability distribution given the observed value of τ.
Let us examine a PGM description of this model, say in RevBayes [5] (Listing 1). The first statement associates an observed tree with the variable myTree. The priors on lambda and mu are then specified, and it is stated that the tree variable tau is drawn from a birth-death process with parameters lambda and mu and generating a tree with leaves matching the taxa in myTree. Finally, tau is associated with (‘clamped to’) the observed value myTree.
Listing 1: PGM description of the CRBD model
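# Read the observed tree of extant species
myTree = readTrees( "treefile.nex" )

# Gamma priors on the speciation and extinction rates
lambda ~ dnGamma( 1, 1 )
mu ~ dnGamma( 1, 1 )

# The tree is drawn from a birth-death process and clamped to the observation
tau ~ dnBirthDeath( lambda, mu, myTree.taxa )
tau.clamp( myTree )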
There is a one-to-one correspondence between these statements and elements in the PGM graph describing the conditional dependencies between the random variables in the model (Fig. 1). Given that the conditional densities dnGamma and dnBirthDeath are known analytically, along with good samplers, it is now straightforward to automatically generate standard inference algorithms for this problem, such as MCMC.
Unfortunately, a PGM cannot describe from first principles (elementary probability distributions) how the birth-death process produces a tree of extant species. The PGM has a fixed graph structure, while the probability of a surviving tree is an integral over many outcomes with varying topology. Specifically, the computation of dnBirthDeath requires integration over all possible ways in which the process could have generated side branches that eventually go extinct, each of these with a unique configuration of speciation and extinction events (Fig. 2). The integral must be computed by special-purpose code based on analytical or numerical solutions specific to the model. For the CRBD model, the integral is known analytically, but as soon as we start experimenting with more sophisticated diversification scenarios, as evolutionary biologists would want to do, computing the integral is likely to require dedicated numerical solvers, if it can be computed at all.
Figure 1: A probabilistic graphical model describing constant rate birth-death (CRBD). The square boxes are fixed nodes (parameters of the gamma distributions) and the circles are random variables. The shaded variable (τ) is observed, and (λ, µ) are latent variables to be inferred.
Figure 2: Two trees with extinct side branches (thin lines), each corresponding to the same observed phylogeny of extant species (thick lines). The trees illustrate just two examples of an infinite number of possible PGM expansions of the τ node in Fig. 1.
Universal PPLs solve the model expression problem by providing additional expressivity over PGMs. A PPL model description is essentially a simulation program (or generative model). Each time the program runs, it generates a different outcome. If it is executed an infinite number of times, we obtain a probability distribution over outcomes. The trick is to write a PPL program so that the distribution over outcomes corresponds to the posterior probability distribution of interest. This is straightforward if we understand how to simulate from the model, and how to insert the constraints given by the observed data.
Assume, for instance, that we are interested in computing the probability of survival and extinction under CRBD for specific values of λ and µ, given that the process started at some time t in the past. We will pretend that we do not know the analytical solution to this problem; instead we will use a PPL to solve it. WebPPL [9] is an easy-to-learn PPL based on JavaScript, and we will use it here for illustrating PPL concepts. WebPPL can be run in a web browser at http://webppl.org or installed locally (Supplementary Section 2). Like many PPLs, WebPPL has two special constructs that we will see in the following: (1) a sample statement, which specifies the prior distributions from which random variables are drawn; and (2) a condition statement, conditioning a random variable on an observation.
In WebPPL, we define a function goesExtinct, which takes the values of time, lambda and mu (Listing 2). It returns true if the process does not survive until the present (that is, goes extinct) and false otherwise (survives to the present).
Listing 2: Basic birth-death model simulation in WebPPL
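var goesExtinct = function(time, lambda, mu) {
  // Waiting time until the next event (speciation or extinction)
  var waitingTime = sample(
    Exponential({a: lambda + mu})
  )

  // No event before the present: the lineage survives
  if (waitingTime > time) { return false }

  // Decide whether the event is a speciation or an extinction
  var isSpeciation = sample(
    Bernoulli({p: lambda / (lambda + mu)})
  )

  // Extinction event: the lineage dies out
  if (isSpeciation == false) { return true }

  // Speciation: extinct only if both daughter lineages go extinct
  return goesExtinct(time - waitingTime, lambda, mu)
    && goesExtinct(time - waitingTime, lambda, mu)
}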
The function starts at some time > 0 in the past. The waitingTime until the next event is drawn from an exponential distribution with rate lambda + mu and compared with time. If waitingTime > time, the function returns false (the process survived). Otherwise, we flip a coin (the Bernoulli distribution) to determine whether the next event is a speciation or an extinction event. If it is a speciation, the process continues by calling the same function recursively for each of the daughter lineages with the updated time, time - waitingTime. Otherwise the function returns true (the lineage went extinct).
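The recursion just described can be checked numerically. The sketch below is a plain Python port of goesExtinct (not WebPPL, and not the authors' code) that estimates the extinction probability by repeated simulation and compares it with the standard closed-form result for a linear birth-death process started from one lineage. The closed form, the function names and the parameter values (λ = 0.2, µ = 0.1, t = 5) are assumptions chosen for illustration; they do not come from the paper.

```python
import math
import random


def goes_extinct(time, lam, mu, rng):
    """Return True if a lineage that starts `time` ago leaves no survivors today."""
    waiting_time = rng.expovariate(lam + mu)  # time until the next event
    if waiting_time > time:
        return False  # no event before the present: the lineage survives
    is_speciation = rng.random() < lam / (lam + mu)
    if not is_speciation:
        return True  # extinction event
    remaining = time - waiting_time
    # Both daughters must go extinct; `and` short-circuits like `&&` in Listing 2.
    return (goes_extinct(remaining, lam, mu, rng)
            and goes_extinct(remaining, lam, mu, rng))


def extinction_probability(time, lam, mu):
    """Closed-form extinction probability for one lineage (requires lam != mu)."""
    decay = math.exp(-(lam - mu) * time)
    return mu * (1.0 - decay) / (lam - mu * decay)


def estimate_extinction(time, lam, mu, n_runs, seed=1):
    """Monte Carlo estimate: fraction of simulations that go extinct."""
    rng = random.Random(seed)
    hits = sum(goes_extinct(time, lam, mu, rng) for _ in range(n_runs))
    return hits / n_runs


if __name__ == "__main__":
    print(estimate_extinction(5.0, 0.2, 0.1, 20000))
    print(extinction_probability(5.0, 0.2, 0.1))
```

With enough runs the two printed numbers should agree to within Monte Carlo error, which is the sense in which the simulation program defines a distribution over {true, false}.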
If executed many times, the goesExtinct function defines a probability distribution on the outcome space {true, false} for specific values of t, λ and µ. To turn this into a Bayesian inference problem, let us associate λ and µ with gamma priors, and then infer the posterior distribution of these parameters assuming that we have observed a group originating at time t = 10 and surviving to the present.
To do this, we combine the prior specifications and the conditioning on survival to the present with the goesExtinct function into a program that defines the distribution of interest (Listing 3).
Listing 3: CRBD model description in WebPPL
var model = function() {
  // Gamma priors on the speciation and extinction rates
  var lambda = sample(
    Gamma({shape: 1, scale: 1})
  )
  var mu = sample(
    Gamma({shape: 1, scale: 1})
  )
  // Age of the group
  var t = 10

  // Condition on survival to the present
  condition(goesExtinct(t, lambda, mu) == false)

  return [lambda, mu]
}
Universal PPLs are by definition Turing-complete, that is, they have the same expressive power as most sophisticated programming languages used today. PGM-based systems lack expressions for stochastic branching (conditional if-then-else statements involving random variables) and unbounded recursion, such as the one used in the goesExtinct function above (Listing 2). If such constructs are provided by PGM-based software, they are only executed when the model is initiated; they are not part of the model description itself. Because of the popularity of PPLs in recent years, the term ‘probabilistic programming’ is now often used also for PGM-based languages, but here we reserve ‘probabilistic programming’ and ‘PPL’ for Turing-complete languages.
Inference in PPLs is typically supported by constructs that take a model description as input. Returning to the previous example, the joint posterior distribution is inferred by calling the built-in Infer function with the model, the desired inference algorithm, and the inference parameters as arguments (Listing 4).
Listing 4: Specifying inference strategy in WebPPL
Infer({model: model, method: 'SMC', particles: 10000})
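To make concrete what such an inference call computes, the following Python sketch approximates the same posterior by rejection sampling: draw λ and µ from their Gamma(1, 1) priors, simulate, and keep only the draws in which the group survives. Rejection sampling is swapped in here purely because it fits in a few lines; it is not the SMC algorithm that WebPPL runs, and the function names are hypothetical.

```python
import random
import sys

sys.setrecursionlimit(10000)  # the recursion can get deep for large sampled rates


def goes_extinct(time, lam, mu, rng):
    """Python port of Listing 2: True if no lineage survives to the present."""
    waiting_time = rng.expovariate(lam + mu)
    if waiting_time > time:
        return False
    if rng.random() >= lam / (lam + mu):
        return True  # extinction event
    remaining = time - waiting_time
    return (goes_extinct(remaining, lam, mu, rng)
            and goes_extinct(remaining, lam, mu, rng))


def posterior_samples(n_draws, t=10.0, seed=1):
    """Rejection sampler: keep prior draws whose simulation survives to the present."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_draws):
        lam = rng.gammavariate(1.0, 1.0)  # shape 1, scale 1, as in Listing 3
        mu = rng.gammavariate(1.0, 1.0)
        if not goes_extinct(t, lam, mu, rng):  # plays the role of condition(...)
            kept.append((lam, mu))
    return kept


if __name__ == "__main__":
    samples = posterior_samples(2000)
    print(len(samples), "accepted draws")
    print("posterior mean lambda:", sum(l for l, _ in samples) / len(samples))
    print("posterior mean mu:", sum(m for _, m in samples) / len(samples))
```

Rejection sampling discards every draw that violates the condition, so it becomes wasteful as soon as the observations are more informative than simple survival; this is one motivation for the SMC methods used in the rest of the paper.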
To develop this example into a probabilistic program equivalent to the RevBayes model discussed previously (Listing 1), we need to describe the CRBD process along the observed tree, conditioning on all unobserved side branches going extinct (Supplementary Listings 2 and 3). The PPL specification of the CRBD inference problem is longer than the PGM specification because it does not use the analytical expression for the CRBD density. However, it exposes all the details of the diversification process, so it can be used as a template for exploring a wide variety of diversification models, while relying on the same inference machinery throughout. We will take advantage of this in the following.
Diversification models. The simplest model describing biological diversification is the Yule (pure birth) process [19,20], in which lineages speciate at rate λ but never go extinct. For consistency, we will refer to it as constant rate birth (CRB). The CRBD model [21] discussed in the examples above adds extinction to the process, at a per-lineage rate of µ.
An obvious extension of the CRBD model is to let the speciation and/or extinction rate vary over time instead of being constant [22], referred to as the generalized birth-death process. Here, we will consider variation in birth rate over time, keeping turnover (µ/λ) constant, and we will refer to this as the time-dependent birth-death (TDBD) model, or the time-dependent birth (TDB) model when there is no extinction. Specifically, we will consider the function
\[ \lambda(t) = \lambda_0 e^{z (t_0 - t)}, \]
where λ_0 is the initial speciation rate at time t_0, t is current time, and z determines the nature of the dependency. When z > 0, the birth rate grows exponentially and the number of lineages explodes. The case z < 0 is more interesting biologically; it corresponds to a niche-filling scenario. This is the idea that an increasing number of lineages leads to competition for resources and, all other things being equal, to a decrease in speciation rate. Other potential causes for slowing speciation rates over time have also been considered [23].
[Figure 3 diagram: nodes CRB, CRBD, TDB, TDBD, LSBDS, BAMM, ClaDS0, ClaDS1, ClaDS2; arrows labeled with the limiting conditions ε = 0, µ = 0, z = 0, α = 1, σ → 0, η → 0.]
Figure 3: Relations between the diversification models considered in this paper.
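The exponential rate function of the TDBD model is simple enough to evaluate directly. The Python sketch below assumes that time is measured as age before the present, so that t runs from t_0 at the origin of the group down to 0 today; that convention, the function name and the numerical values are illustrative assumptions.

```python
import math


def speciation_rate(t, lambda0, z, t0):
    """Speciation rate lambda(t) = lambda0 * exp(z * (t0 - t))."""
    return lambda0 * math.exp(z * (t0 - t))


if __name__ == "__main__":
    # Slowing diversification (z < 0): the rate declines from the origin to the present.
    for age in (10.0, 5.0, 0.0):
        print(age, speciation_rate(age, lambda0=0.5, z=-0.1, t0=10.0))
```

Setting z = 0 makes the exponent vanish, so the rate stays at λ_0 throughout, which is how the time-dependent models reduce to their constant-rate counterparts.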
The four basic diversification models (CRB, CRBD, TDB and TDBD) are tightly linked (Fig. 3). When z = 0, TDBD collapses to CRBD, and TDB to CRB. Similarly, when µ = 0, CRBD becomes equivalent to CRB, and TDBD to TDB.
In recent years, there has been a spate of work on models that allow diversification rates to vary across lineages. Such models can accommodate diversification processes that gradually change over time. They can also explain sudden shifts in speciation or extinction rates, perhaps due to the origin of new traits or other factors that are specific to a lineage.
One of the first models of this kind to be proposed was Bayesian analysis of macroevolutionary mixtures (BAMM) [17]. The model is a lineage-specific, episodic TDBD model. A group starts out evolving under some TDBD process, with extinction (µ) rather than turnover (ε) being constant over time. A stochastic process running along the tree then changes the parameters of the TDBD process at specific points in time. Specifically, λ_0, µ and z are all redrawn from the priors at these switch points. In the original description, the switching process was defined in a statistically incoherent way [18]; here, we assume that the switches occur according to a Poisson process with rate η.
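The Poisson switching process assumed here can be sketched in a few lines of Python: waiting times between switch points are exponential with rate η, and a fresh parameter set is drawn at each switch. The sketch handles a single branch only, and the function names and the `draw_params` callback (which stands in for whatever priors the model uses) are hypothetical.

```python
import random


def shift_times(duration, eta, rng):
    """Switch-point times on a branch of length `duration` (Poisson process, rate eta)."""
    times = []
    if eta <= 0.0:
        return times  # no switches in the limit eta -> 0
    t = rng.expovariate(eta)
    while t < duration:
        times.append(t)
        t += rng.expovariate(eta)
    return times


def simulate_regimes(duration, eta, draw_params, rng):
    """Return (start_time, parameters) pairs, one per regime along the branch."""
    starts = [0.0] + shift_times(duration, eta, rng)
    return [(start, draw_params(rng)) for start in starts]


if __name__ == "__main__":
    rng = random.Random(3)
    # Hypothetical prior on lambda0 only, for illustration.
    prior = lambda r: {"lambda0": r.gammavariate(1.0, 1.0)}
    for start, params in simulate_regimes(10.0, 0.5, prior, rng):
        print(round(start, 3), params)
```

The expected number of switches on a branch is η times its length, so small values of η yield a single regime for most branches.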
The BAMM model has been implemented in dedicated software using a combination of MCMC sampling and other numerical approximation methods [17,24]. The implementation has been criticized because it results in severely biased inference [18]. To date, it has not been possible to provide asymptotically exact inference machinery for BAMM.
In a recent contribution, a simplified version of BAMM was introduced: the lineage-specific birth-death-shift (LSBDS) model [16]. LSBDS is an episodic CRBD model, that is, it is equivalent to BAMM when z = 0. Inference machinery for the LSBDS model has been implemented in RevBayes [5] based on numerical integration over discretized prior distributions for λ and µ, combined with MCMC. The computational complexity of this solution depends strongly on the number of discrete categories used. If k categories are used for both λ and µ, computational complexity is multiplied by a factor of k². Therefore, it is tempting to simplify the model. We note that, in the empirical LSBDS examples given so far, µ is kept constant and only λ is allowed to change at switch points [16]. When z = 0, BAMM collapses to LSBDS, and when η → 0 it collapses to TDBD (Fig. 3). When η → 0, LSBDS collapses to CRBD.
A different perspective is represented by the cladogenetic diversification rate shift (ClaDS) models [15]. They map diversification rate changes to speciation events, assuming that diversification rates change in small steps over the entire tree. After speciation, each descendant lineage inherits its initial speciation rate λ_i from the ending speciation rate λ_a of its ancestor through a mechanism that includes both a deterministic long-term trend and a stochastic effect. Specifically,
\[ \log \lambda_i \sim \mathcal{N}\left( \log(\alpha \lambda_a),\ \sigma^2 \right). \]
The α parameter determines the long-term trend, and its effects are similar to the z parameter of TDBD and BAMM. When α < 1, that is, log α < 0, the speciation rate decreases over time, corresponding to z < 0. The standard deviation σ determines the noise component. The larger the value, the more stochastic fluctuation there will be in speciation rates.
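The ClaDS inheritance rule amounts to a log-normal draw centered on log(α λ_a). A minimal Python sketch follows; the function name and the example values are assumptions made for illustration.

```python
import math
import random


def daughter_rate(lambda_a, alpha, sigma, rng):
    """Draw lambda_i such that log(lambda_i) ~ Normal(log(alpha * lambda_a), sigma^2)."""
    return math.exp(rng.gauss(math.log(alpha * lambda_a), sigma))


if __name__ == "__main__":
    rng = random.Random(7)
    # alpha < 1 gives a downward trend; sigma controls the scatter around it.
    print([round(daughter_rate(0.4, 0.9, 0.2, rng), 3) for _ in range(5)])
```

With sigma set to 0 the draw is deterministic and equals α λ_a, and with α = 1 as well the rate is inherited unchanged, which is the constant-rate limit.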
There are three different versions of ClaDS, characterized by how they model µ. In ClaDS0, there is no extinction, that is, µ = 0. In ClaDS1, there is a constant extinction rate µ throughout the tree. Finally, in ClaDS2, it is the turnover rate ε = µ/λ that is kept constant over the tree. All ClaDS models collapse to CRB or CRBD models when α = 1 and σ → 0 (Fig. 3). The ClaDS models are implemented in the R package RPANDA [25], using a combination of advanced numerical solvers and MCMC simulation [15].
In contrast to previous work, where these models are implemented independently in complex software packages, we used PPL model descriptions (∼100 lines of code each) to generate efficient and asymptotically correct inference machinery for all diversification models described above. This machinery relies on sophisticated Monte Carlo algorithms which, unlike classical MCMC, can also estimate the marginal likelihood (the normalization constant of Bayes' theorem). We then compared the performance of the different diversification models on empirical data by inferring the posterior distribution over the parameters of interest and by conducting model comparison based on the marginal likelihood (Bayes factors). Specifically, we implemented the CRB, CRBD, TDB, TDBD, BAMM, LSBDS, ClaDS0, ClaDS1 and ClaDS2 models in WebPPL and Birch. The model descriptions are provided at https://github.com/phyppl/probabilistic-programming. They are similar in structure to the CRBD program presented above.
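For reference, the quantities used in this comparison can be written out as follows. The notation (θ for the model parameters, D for the observed tree, M for a model, Z for its marginal likelihood) is introduced here for convenience and is an assumption rather than the notation of the paper.

```latex
p(\theta \mid D, M) = \frac{p(D \mid \theta, M)\, p(\theta \mid M)}{Z_M},
\qquad
Z_M = \int p(D \mid \theta, M)\, p(\theta \mid M)\, d\theta,
\qquad
\log \mathrm{BF}_{12} = \log Z_{M_1} - \log Z_{M_2}.
```

A positive log Bayes factor favors model M_1 over M_2, so comparing two models on a log scale reduces to subtracting their estimated log marginal likelihoods.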
Inference strategies. We used inference algorithms in the SMC family, an option available in both WebPPL and Birch. An SMC algorithm runs many simulations (called particles) in parallel, and stops them when some new information, like the time of a speciation event or extinction of a side lineage, becomes available. At such points, the particles are subjected to resampling, that is, sampling (with replacement) based on their likelihoods. SMC algorithms work particularly well when the model can be written such that the information derived from observed data can successively be brought to bear on the likelihood of a particle during the simulation. This is the case when simulating a diversification process along a tree of extant taxa, because we know that each ‘hidden’ speciation event must eventually result in extinction of the unobserved side lineage. That is, we can condition the simulation on extinction of the side branches that arise (Supplementary Listing 3). Similarly, we can condition the simulation on the times of the speciation events leading to extant taxa.
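The resampling step described above can be sketched in Python as follows. This shows plain multinomial resampling together with the usual running estimate of the log marginal likelihood for a filter that resamples at every step; it is a generic illustration with hypothetical function names, not the implementation used in WebPPL or Birch.

```python
import math
import random


def resample(particles, weights, rng):
    """Draw len(particles) particles with replacement, proportionally to weight."""
    return rng.choices(particles, weights=weights, k=len(particles))


def log_mean_weight(weights):
    """Log of the average weight at one resampling point.

    Summing this quantity over all resampling points gives the SMC estimate
    of the log marginal likelihood when every step is followed by resampling.
    """
    return math.log(sum(weights) / len(weights))


if __name__ == "__main__":
    rng = random.Random(0)
    particles = ["a", "b", "c", "d"]
    weights = [0.1, 0.6, 0.2, 0.1]  # likelihood of the newest observation per particle
    print(resample(particles, weights, rng))
    print(log_mean_weight(weights))
```

After resampling, all particles carry equal weight again, so simulations that fit the data poorly are dropped and computation is concentrated on the promising ones.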
Despite this, standard SMC (the bootstrap particle filter) remains relatively inefficient for these models. Therefore, we employed three new PPL inference techniques that we developed or extended as part of this study: alignment [26], delayed sampling [13] and the alive particle filter [27] (see Methods).
Empirical results. To demonstrate the power of the approach, we applied PPLs to compare the performance of the nine diversification models discussed above for 40 bird clades (see Methods and Supplementary Table 5). The results (Supplementary Figs. 12–21) are well summarized by the four cases represented in Fig. 4. Focusing on marginal likelihoods (top row), we observe that the simplest models (CRB, CRBD), without any variation through time or between lineages, provide an adequate description of the diversification process for around 40% of the trees (Fig. 4a). In the remaining clades, there is almost universal support for slowing diversification rates over time. Occasionally, this is not accompanied by strong evidence for lineage-specific effects (Fig. 4b) but usually it is (Figs. 4c and d). In the latter case, the ClaDS models always show higher marginal likelihoods than BAMM and LSBDS, even for trees on which the latter do detect rate shifts (Fig. 4d). Interestingly, ClaDS2 rarely outperforms ClaDS0, which assumes no extinction. More generally, models assuming no extinction often have a higher marginal likelihood than their counterparts allowing for it.
The parameter estimates (Fig. 4, rows 2–6) show the conservative nature of the Bayes factor tests, driven by the relatively vague priors we chose on the additional parameters of the more complex models (Supplementary Fig. 2). However, even when complex models are marginally worse than simple or no-extinction models, there is evidence of the kind of variation they allow. For instance, the posterior distributions on z and log α suggest that negative time-dependence is quite generally present. Similarly, more sophisticated models usually detect low levels of extinction when they are outperformed by extinction-free counterparts. For a more extensive discussion of these and other results, see Supplementary Section 9.
Discussion
Universal PPLs provide Turing-complete languages for model descriptions, which guarantees that virtually all interesting phylogenetic models can be expressed. The expressiveness of PPLs is liberating for empiricists but it forces statisticians and computer scientists to approach the inference problem from a more abstract perspective. This can be challenging but also rewarding, as inference techniques for PPLs are so broadly applicable. Importantly, expressing phylogenetic models as PPLs opens up the possibility to apply a wide range of inference strategies developed for scientific problems with no direct relation to phylogenetics. Another benefit is that PPLs reduce the amount of manually written code for a particular inference problem, facilitating the task and minimizing the risk of inadvertently introducing errors, biases or inaccuracies. Our verification experiments (Supplementary Section 7) suggest that the light-weight PPL implementations of ClaDS1 and ClaDS2 provide more accurate computation of likelihoods than the thousands of lines of code developed originally for these models.
Previous discussions on the relative merits of diversification models have centered around the results of simulations and arguments over biological realism [17,18,29,15,16], and they have been complicated by the lack of asymptotically correct inference machinery for BAMM [18,29]. Our most important contribution in this context is the refinement of PPL techniques so that it is now possible to implement correct and efficient parameter inference under a wide range of diversification models, and to compare their performance on real data using rigorous model testing procedures. The PPL analyses of bird clades confirm previous claims that the ClaDS models provide a better description of lineage-specific diversification than BAMM [15]. Even when simpler models have higher likelihoods, the ClaDS models seem to pick up a consistent signal across clades of small, gradual changes in diversification rates. Like many previous studies [30], our analyses provide little or no support for extinction rates above zero. This appears to be due in part to systematic biases in the sampling of the leaves in the observed trees [31,32], a problem that could be addressed by extending our PPL model scripts (Supplementary Section 9.6). Such sampling biases may also partly explain the strong support for slowing diversification rates [23]. A fascinating question that is now open to investigation is whether there remains evidence of occasional major shifts in diversification rates once the small gradual changes have been accounted for, something that could be addressed by a model that combines ClaDS- and BAMM-like features.
Our results show that PPLs can already now compete successfully with dedicated special-purpose software in several phylogenetic problem domains. Separately, we show how PPLs can be applied to models where diversification rates are dependent on observable traits of organisms (so-called state-dependent speciation and extinction models) [27]. Other problem domains that may benefit from the PPL approach already at this point include epidemiology [33], host-parasite co-evolution [34], and biogeography [35,36,37,38].
[Figure 4 panels: (a) Alcedinidae, (b) Meliphagidae-+, (c) Accipitridae, (d) Lari. Rows, top to bottom: log Ẑ (estimated marginal likelihood); λ and µ; z; log α; σ²; η. Legend: CRB, CRBD, TDB, TDBD, ClaDS0, ClaDS1, ClaDS2, LSBDS, BAMM.]
Figure 4: Comparison of diversification models for four bird clades exemplifying different patterns. Alcedinidae: simple models are adequate; Meliphagidae-+: slowing diversification but no lineage-specific effects; Accipitridae: gradual (ClaDS) lineage-specific changes in diversification; and Lari: evidence for both gradual (ClaDS) and for punctuated (BAMM and LSBDS) lineage-specific changes in diversification. The upper plots show the marginal likelihoods (log scale); a difference of 5 units (scale bar) is considered strong evidence in favor of the better model [28]. The remaining plots show estimated posterior distributions of model parameters. The µ distributions are shown with dashed lines.
What is missing before it becomes possible to generate efficient inference machinery for the full range of phylogenetic models from PPL descriptions? Assume, for instance, that we would like to do joint inference of phylogeny (from DNA sequence data) and diversification processes, instead of assuming that the extant tree is observed; this would seem to touch on all the major obstacles that remain. We then need to extend our current PPL models so that they also describe the nucleotide substitution process along the tree, and condition the simulation on the observed sequences. To generate the standard MCMC machinery for sampling across trees from such descriptions, delayed sampling needs to be extended to summarize over ancestral sequences (Felsenstein’s pruning algorithm) [39], and it should be applied statically through analysis of the script before the MCMC starts rather than dynamically. State-of-the-art MCMC algorithms for PPLs [12] must then be extended to generate computationally efficient tree samplers, such as stochastic nearest neighbour interchange [40]. To facilitate use of PPLs, we think it will also be important to provide a domain-specific PPL that is easy to use, while supporting both automatic state-of-the-art inference algorithms for phylogenetic problems and manual composition of novel inference strategies suited for this application domain. These all seem to be tractable problems, which we aim to address within the TreePPL project (treeppl.org). We hope this paper will inspire readers to explore PPLs, and we invite computational biologists to join us in developing languages and inference strategies supporting this powerful new approach to statistical phylogenetics.
Methods
PPL software and model scripts. All PPL analyses described here used WebPPL version 0.9.15, Node version 12.13.1 (ref. 9) and the most recent development version of Birch (as of June 12, 2020)14. We implemented all models (CRB, CRBD, TDB, TDBD, ClaDS0, ClaDS1, ClaDS2, LSBDS and BAMM) as explicit simulation scripts that follow the structure of the CRBD example discussed in the main text (Supplementary Section 5). We also implemented compact simulations for the four simplest models (CRB, CRBD, TDB and TDBD) using the analytical equations for specific values of λ, µ and z to compute the probability of the observed trees.
In the PPL model descriptions, we account for incomplete sampling of the tips in the phylogeny based on the ρ-sampling model41. That is, each tip is assumed to be sampled with a probability ρ, which is specified a priori. To simplify the presentation in this paper, we always set ρ = 1. Arguably, this is the relevant setting for the empirical analyses, as the selected trees comprise all or nearly all extant species.
We standardized prior distributions across models to facilitate model comparisons (Supplementary Section 4, Fig. 2). To simplify the scripts, we simulated outcomes on ordered but unlabeled trees, and reweighted the particles so that the generated density was correct for labelled and unordered trees (Supplementary Section 3.2). We also developed an efficient simulation procedure to correct for survivorship bias, that is, the fact that we can only observe trees that survive until the present (Supplementary Section 5.3).
Inference strategies. To make SMC algorithms more efficient on diversification model scripts, we applied three new PPL inference techniques: alignment, delayed sampling, and the alive particle filter. Alignment26,42 refers to the synchronization of resampling points across simulations (particles) in the SMC algorithm. The SMC algorithms previously used for PPLs automatically resample particles when they reach observe or condition statements. Diversification simulation scripts will have different numbers and placements of hidden speciation events on the surviving tree (Fig. 2), each associated with a condition statement in a naive script. Therefore, when particles are compared at resampling points, some may have processed a much larger part of the observed tree than others. Intuitively, one would expect the algorithm to perform better if the resampling points were aligned, such that the particles have processed the same portion of the tree when they are compared. This is indeed the case; alignment is particularly important for efficient inference on large trees (Supplementary Fig. 3).
Alignment at code branching points (corresponding to observed speciation events in the diversification model scripts) can be generated automatically through static analysis of model scripts26. Here, we manually aligned the scripts by replacing the statements that normally trigger resampling with code that accumulates probabilities when they did not occur at the desired locations in the simulation (Supplementary Section 6.1).
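The idea of accumulating probabilities between aligned resampling points can be illustrated with a toy example. The Python sketch below is not one of the WebPPL or Birch scripts used here, and every name in it is hypothetical. It assumes that each particle carries a single hidden-event rate drawn from a gamma prior, that hidden events along an observed branch arrive as a Poisson process, and that each hidden event contributes a fixed factor to the particle weight. Particles draw different numbers of hidden events, but resampling happens only once per observed branch.

```python
import math
import random


def aligned_smc(branch_lengths, n_particles, log_factor, rng):
    """Toy aligned SMC: estimate log Z while resampling once per observed branch.

    Assumptions (illustrative only): each particle's state is one hidden-event
    rate with a Gamma(shape=2, scale=0.5) prior, and every hidden event
    multiplies the particle weight by exp(log_factor).
    """
    rates = [rng.gammavariate(2.0, 0.5) for _ in range(n_particles)]
    log_z = 0.0
    for t in branch_lengths:
        log_w = []
        for rate in rates:
            lw = 0.0
            elapsed = rng.expovariate(rate)
            while elapsed < t:
                # A naive script would trigger resampling at every hidden
                # event; here the factor is only accumulated in the weight.
                lw += log_factor
                elapsed += rng.expovariate(rate)
            log_w.append(lw)
        # Aligned resampling point: all particles have processed the same branch.
        m = max(log_w)
        w = [math.exp(x - m) for x in log_w]
        log_z += m + math.log(sum(w) / n_particles)
        rates = rng.choices(rates, weights=w, k=n_particles)
    return log_z


if __name__ == "__main__":
    estimate = aligned_smc([1.0, 2.0], 20000, math.log(0.5), random.Random(1))
    print("estimated log Z:", estimate)
```

Under these assumptions the normalization constant has the closed form (2 / (2 + 0.5 T))^2 when the per-event factor is 0.5, where T is the total branch length, which gives a simple check on the estimate.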
Delayed sampling13 is a technique that uses conjugacy to avoid sampling parameter values. For instance, the gamma distribution we used for λ and µ is a conjugate prior to the Poisson distribution, describing the number of births or deaths expected to occur in a given time period. This means that we can marginalize out the rate, and simulate the number of events directly from its marginal (gamma-Poisson) distribution, without having to first draw a specific value of λ or µ. In this way, a single particle can cover a portion of parameter space, rather than just single values of λ and µ. Delayed sampling is only available in Birch; we extended it to cover all conjugacy relations relevant for the diversification models examined here.
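The gamma-Poisson conjugacy relation can be written out in a few lines. The following Python sketch is only an illustration of the underlying arithmetic, not of how Birch implements delayed sampling. It assumes a shape-rate parameterization of the gamma prior (shape k, rate beta) and a time period of length t, so that the number of events given the rate is Poisson with mean rate times t; the function names are hypothetical.

```python
import math


def gamma_poisson_logpmf(n, k, beta, t):
    """log P(N = n) after marginalizing out the rate.

    Assumes rate ~ Gamma(shape=k, rate=beta) and N | rate ~ Poisson(rate * t).
    The marginal is negative binomial with success probability beta / (beta + t).
    """
    return (
        math.lgamma(k + n) - math.lgamma(k) - math.lgamma(n + 1)
        + k * math.log(beta / (beta + t))
        + n * math.log(t / (beta + t))
    )


def posterior_update(n, k, beta, t):
    """Conjugate update: after observing n events in time t, the rate is
    Gamma(shape=k + n, rate=beta + t)."""
    return k + n, beta + t


if __name__ == "__main__":
    k, beta, t = 2.0, 1.0, 0.5
    print("P(N = 0):", math.exp(gamma_poisson_logpmf(0, k, beta, t)))
    print("updated (shape, rate) after 3 events:", posterior_update(3, k, beta, t))
```

Because a particle only stores the updated shape and rate, it represents a whole gamma distribution over the rate rather than a single sampled value.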
The alive particle filter27 is a technique for improving SMC algorithms when some particles can ‘die’ because their likelihood becomes zero. This happens when SMC is applied to diversification models because simulations that generate hidden side branches surviving to the present need to be discarded. The alive particle filter is a generic improvement on SMC, and it collapses to standard SMC with negligible overhead when no particles die. This improved version of SMC, inspired by state-dependent speciation-extinction models27, is only available in Birch.
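A single generation of an alive particle filter can be sketched as follows. This Python example is a simplified illustration under stated assumptions, not the Birch implementation: it assumes the common formulation in which proposals are repeated until N + 1 of them have non-zero weight, the last one is discarded, and the contribution to the normalization constant is the sum of the N retained weights divided by the number of proposals minus one. The function and argument names are hypothetical.

```python
import random


def alive_generation(particles, weights, propagate, weigh, rng):
    """One generation of an alive particle filter (illustrative sketch).

    Ancestors are drawn in proportion to `weights`, propagated and weighed.
    Proposals with zero weight are discarded, and proposing continues until
    N + 1 proposals are alive; the last of these is then dropped.
    Returns the new particles, their weights, and this generation's factor
    of the normalization constant estimate.
    """
    n = len(particles)
    alive, alive_w, tries = [], [], 0
    while len(alive) < n + 1:
        tries += 1
        parent = rng.choices(particles, weights=weights, k=1)[0]
        child = propagate(parent, rng)
        w = weigh(child)
        if w > 0.0:
            alive.append(child)
            alive_w.append(w)
    alive.pop()
    alive_w.pop()
    z_factor = sum(alive_w) / (tries - 1)
    return alive, alive_w, z_factor


if __name__ == "__main__":
    rng = random.Random(0)
    start = [0.0] * 100
    # Hypothetical model: a particle dies (weight 0) with probability 0.5.
    _, _, z = alive_generation(
        start, [1.0] * 100,
        lambda x, r: r.random(),
        lambda x: 1.0 if x < 0.5 else 0.0,
        rng,
    )
    print("estimated survival probability:", z)
```

When no particle dies, exactly N + 1 proposals are made and the factor reduces to the ordinary mean weight, which matches the statement that the method collapses to standard SMC.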
Verification. To verify that the model scripts and the automatically generated inference algorithms are correct, we performed a series of tests focusing on the normalization constant (Supplementary Section 7). First, we checked that the model scripts for simple models (CRB(D) and TDB(D)) generated normalization constant estimates that were consistent with analytically computed likelihoods for specific model parameter values (Supplementary Fig. 4). Second, we used the fact that all advanced diversification models (ClaDS0-2, LSBDS, BAMM) collapse to the CRBD model under specific conditions, and verified that we obtained the correct likelihoods for a range of parameter values (Supplementary Fig. 5). Third, we verified for the advanced models that the independently implemented model scripts and the inference algorithms generated for them by WebPPL and Birch, respectively, estimated the same normalization constant for a range of model parameter values (Supplementary Fig. 6). Fourth, we checked that our normalization constant estimates were consistent with the RPANDA package25,15 for ClaDS0, ClaDS1, and ClaDS2, and with RevBayes for LSBDS5,16. For these tests, we had to develop specialized PPL scripts emulating the likelihood computations of RPANDA and RevBayes. The normalization constant estimates matched for LSBDS (Supplementary Fig. 8) and for ClaDS0 (Supplementary Fig. 7) but not for ClaDS1 and ClaDS2. Our best-effort interpretation at this point is that the PPL estimates for ClaDS1 and ClaDS2 are more accurate than those obtained from RPANDA (Supplementary Section 7.4). Finally, as there is no independent software that computes BAMM likelihoods correctly yet, we checked that our BAMM scripts gave the same normalization constant estimates as LSBDS under settings where the former collapses to the latter (Supplementary Fig. 9).
Data. We applied our PPL scripts to 40 bird clades derived from a previous analysis of divergence times and relationships among all bird species43. The selected clades are those with more than 50 species (range 54–316) after outgroups had been excluded (Supplementary Table 5). We followed the previous ClaDS2 analysis of these clades15 in converting the time scale of the source trees to absolute time units. The clade ages range from 12.5 Ma to 66.6 Ma.
Bayesian inference. Based on JavaScript, WebPPL is comparatively slow, making it less useful for high-precision computation of normalization constants or estimation of posterior probability distributions using many particles.
WebPPL is also less efficient than Birch because it does not yet support delayed sampling and the alive particle filter. Delayed sampling, in particular, substantially improves the quality of the posterior estimates obtained with a given number of particles. Therefore, we focused on Birch in computing normalization constants and posterior estimates for the bird clades.
For each tree, we ran the programs implementing the ClaDS, BAMM and LSBDS models using SMC with delayed sampling and the alive particle filter as the inference method. We used 5000 particles for all models except BAMM, for which we increased the number of particles to 20000. We ran each program 500 times and collected the estimates of log Ẑ from each run together with the information needed to estimate the posterior distributions.
For CRB, CRBD, TDB and TDBD we exploited the closed form for the likelihood in the programs. We used sequential importance sampling with 10,000 particles as the inference method, and ran each program 50 times.
Visualization. Visualizations were prepared with Matplotlib44. We used the collected data from all runs to draw violin plots for log Ẑ as well as the posterior distributions for λ, µ (for all models), z (for TDB, TDBD and BAMM), log α and σ² (for the ClaDS models), and η (for LSBDS and BAMM). By virtue of delayed sampling, the posterior distributions for λ and µ for all ClaDS models as well as BAMM and LSBDS were calculated as mixtures of gamma distributions, the posterior distribution for log α and σ² for all ClaDS models as mixtures of normal inverse gamma and inverse gamma distributions, and the posterior distribution for η for BAMM and LSBDS as a mixture of gamma distributions. For the remaining model parameters, we used the kernel density estimation (KDE) method. Exact plot settings are provided in the code repository accompanying the paper.
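A posterior density expressed as a mixture of gamma distributions can be evaluated directly from per-particle parameters. The Python sketch below illustrates this under assumptions that are not taken from the plotting code in the repository: each particle is summarized by a hypothetical (weight, shape, rate) triple, the gamma densities use a shape-rate parameterization, and the weights are normalized inside the function.

```python
import math


def gamma_pdf(x, shape, rate):
    """Density of Gamma(shape, rate) at x (shape-rate parameterization)."""
    if x <= 0.0:
        return 0.0
    log_pdf = (
        shape * math.log(rate)
        + (shape - 1.0) * math.log(x)
        - rate * x
        - math.lgamma(shape)
    )
    return math.exp(log_pdf)


def gamma_mixture_pdf(x, components):
    """Weighted mixture of gamma densities.

    `components` is a list of (weight, shape, rate) triples, one per particle;
    the weights need not sum to one.
    """
    total = sum(w for w, _, _ in components)
    return sum(w * gamma_pdf(x, k, b) for w, k, b in components) / total


if __name__ == "__main__":
    # Hypothetical particle summaries, for illustration only.
    particles = [(1.0, 2.0, 3.0), (3.0, 5.0, 2.0)]
    for x in (0.5, 1.0, 2.0):
        print(x, gamma_mixture_pdf(x, particles))
```

Evaluating the mixture on a grid gives a smooth curve without the bandwidth choice required by kernel density estimation.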
Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
The data used to compare the diversification models, together with full literature references, can be found at https://github.com/phyppl/probabilistic-programming, under the directory data.
Code availability
The WebPPL and Birch models can be found in the same repository, https://github.com/phyppl/probabilistic-programming, under the directories webppl and birch.
References
1. Felsenstein, J. Inferring Phylogenies (Sinauer Associates, Sunderland, Massachusetts, 2003).
2. Yang, Z. Molecular Evolution: A Statistical Approach (Oxford University Press, Oxford, United Kingdom; New York, NY, United States of America, 2014), 1st edition.
3. Nascimento, F. F., dos Reis, M. & Yang, Z. A biologist’s guide to Bayesian phylogenetic analysis. Nature Ecology & Evolution 1, 1446–1454 (2017).
4. Höhna, S. et al. Probabilistic graphical model representation in phylogenetics. Systematic Biology 63, 753–771 (2014).
5. Höhna, S. et al. RevBayes: Bayesian phylogenetic inference using graphical models and an interactive model-specification language. Systematic Biology 65, 726–736 (2016).
6. Fourment, M. & Darling, A. E. Evaluating probabilistic programming and fast variational Bayesian inference in phylogenetics. PeerJ 7, e8272 (2019).
7. Bouchard-Côté, A. et al. Blang: Bayesian declarative modelling of arbitrary data structures. Preprint at https://arxiv.org/abs/1912.10396 (2019).
8. Kozen, D. Semantics of probabilistic programs. In 20th Annual Symposium on Foundations of Computer Science, pages 101–114 (San Juan, Puerto Rico, USA, 1979).
9. Goodman, N. D. & Stuhlmüller, A. The design and implementation of probabilistic programming languages. http://dippl.org (2014). Accessed: 2020-5-12.
10. Wood, F., Meent, J. W. & Mansinghka, V. A new approach to probabilistic programming inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pages 1024–1032 (Reykjavik, Iceland, 2014).
11. Mansinghka, V., Selsam, D. & Perov, Y. Venture: a higher-order probabilistic programming platform with programmable inference. Preprint at https://arxiv.org/abs/1404.0099 (2014).
12. Ritchie, D., Stuhlmüller, A. & Goodman, N. C3: Lightweight incrementalized MCMC for probabilistic programs using continuations and callsite caching. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 28–37 (Cadiz, Spain, 2016).
13. Murray, L. M., Lundén, D., Kudlicka, J., Broman, D. & Schön, T. B. Delayed sampling and automatic Rao–Blackwellization of probabilistic programs. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics, volume 21, page 10 (Lanzarote, 2018).
14. Murray, L. M. & Schön, T. B. Automated learning with a probabilistic programming language: Birch. Annual Reviews in Control 46, 29–43 (2018).
15. Maliet, O., Hartig, F. & Morlon, H. A model with many small shifts for estimating species-specific diversification rates. Nature Ecology & Evolution 3, 1086–1092 (2019).
16. Höhna, S. et al. A Bayesian approach for estimating branch-specific speciation and extinction rates. Preprint at https://www.biorxiv.org/content/10.1101/555805v1 (2019).
17. Rabosky, D. L. Automatic detection of key innovations, rate shifts, and diversity-dependence on phylogenetic trees. PLoS ONE 9, e89543 (2014).
18. Moore, B. R., Höhna, S., May, M. R., Rannala, B. & Huelsenbeck, J. P. Critically evaluating the theory and performance of Bayesian analysis of macroevolutionary mixtures. Proceedings of the National Academy of Sciences of the United States of America 113, 9569–9574 (2016).
19. Yule, G. U. A mathematical theory of evolution, based on the conclusions of Dr. JC Willis, FRS. Philosophical Transactions of the Royal Society of London. Series B, Containing Papers of a Biological Character 213, 21–87 (1924).
20. Nee, S. Birth-death models in macroevolution. Annual Review of Ecology, Evolution and Systematics 37, 1–17 (2006).
21. Feller, W. Die Grundlagen der Volterraschen Theorie des Kampfes ums Dasein in wahrscheinlichkeitstheoretischer Behandlung. Acta Biotheoretica 5, 11–40 (1939).
22. Kendall, D. G. On the generalized "birth-and-death" process. The Annals of Mathematical Statistics 19, 1–15 (1948).
23. Moen, D. & Morlon, H. Why does diversification slow down? Trends in Ecology & Evolution 29, 190–197 (2014).
24. Rabosky, D. L. et al. BAMMtools: an R package for the analysis of evolutionary dynamics on phylogenetic trees. Methods in Ecology and Evolution 5, 701–707 (2014).
25. Morlon, H. et al. RPANDA: an R package for macroevolutionary analyses on phylogenetic trees. Methods in Ecology and Evolution 7, 589–597 (2016).
26. Lundén, D., Broman, D., Ronquist, F. & Murray, L. M. Automatic alignment of sequential Monte Carlo inference in higher-order probabilistic programs. Preprint at https://arxiv.org/abs/1812.07439 (2018).
27. Kudlicka, J., Murray, L. M., Ronquist, F. & Schön, T. B. Probabilistic programming for birth-death models of evolution using an alive particle filter with delayed sampling. In Proceedings of the Conference on Uncertainty in Artificial Intelligence 2019, volume 2019, page 11 (Tel Aviv, Israel, 2019).
28. Jeffreys, H. The Theory of Probability (Oxford University Press, Oxford, 1961).
29. Rabosky, D. L., Mitchell, J. S. & Chang, J. Is BAMM flawed? Theoretical and practical concerns in the analysis of multi-rate diversification models. Systematic Biology 66, 477–498 (2017).
30. Pyron, R. A. & Burbrink, F. T. Phylogenetic estimates of speciation and extinction rates for testing ecological and evolutionary hypotheses. Trends in Ecology & Evolution 28, 729–736 (2013).
31. Höhna, S., Stadler, T., Ronquist, F. & Britton, T. Inferring speciation and extinction rates under different sampling schemes. Molecular Biology and Evolution 28, 2577–2589 (2011).
32. Rosindell, J., Cornell, S. J., Hubbell, S. P. & Etienne, R. S. Protracted speciation revitalizes the neutral theory of biodiversity: Protracted speciation and neutral theory. Ecology Letters 13, 716–727 (2010).