HAL Id: hal-02941245
https://hal.archives-ouvertes.fr/hal-02941245
Submitted on 19 Nov 2020
Additional Analytical Support for a New Method to Compute the Likelihood of Diversification Models
Giovanni Laudanno, Bart Haegeman, Rampal Etienne
To cite this version:
Giovanni Laudanno, Bart Haegeman, Rampal Etienne. Additional Analytical Support for a New
Method to Compute the Likelihood of Diversification Models. Bulletin of Mathematical Biology,
Springer Verlag, 2020, 82 (22), 22 p. 10.1007/s11538-020-00698-y. hal-02941245
Additional analytical support for a new method to compute the likelihood of diversification models
Giovanni Laudanno · Bart Haegeman · Rampal S. Etienne
Abstract
Molecular phylogenies have been increasingly recognized as an important source of information on species diversification. For many models of macro-evolution, analytical likelihood formulas have been derived to infer macro-evolutionary parameters from phylogenies. A few years ago, a general framework to numerically compute such likelihood formulas was proposed, which accommodates models that allow speciation and/or extinction rates to depend on diversity. This framework calculates the likelihood as the probability of the diversification process being consistent with the phylogeny from the root to the tips. However, while some readers found the framework presented in Etienne et al. (2012) convincing, others still questioned it (personal communication), despite numerical evidence that for special cases the framework yields the same numerical value for the likelihood (i.e. within double precision) as analytical formulas that were independently derived for these special cases. Here we prove analytically that the likelihoods calculated in the new framework are correct for all special cases with a known analytical likelihood formula. Our results thus add substantial mathematical support for the overall coherence of the general framework.
Keywords Diversification · Birth-death process
Giovanni Laudanno
Groningen Institute for Evolutionary Life Sciences, Box 11103, 9700 CC, Groningen, The Netherlands E-mail: glaudanno@gmail.com
Bart Haegeman
Centre for Biodiversity Theory and Modelling, Theoretical and Experimental Ecology Station, CNRS and Paul Sabatier University, Moulis, France
Rampal S. Etienne
Groningen Institute for Evolutionary Life Sciences, Box 11103, 9700 CC, Groningen, The Netherlands
Mathematics Subject Classification (2000): Primary: 60J80; Secondary: 92D15, 92B10, 60J85, 92D40
1 Introduction
One of the major challenges in the field of macro-evolution is understanding the mechanisms underlying patterns of diversity and diversification. A very fruitful approach has been to model macro-evolution as a birth-death process, which reduces the problem to the specification of macroevolutionary events (i.e. speciation and extinction). However, providing likelihood expressions for these models given empirical data on speciation and extinction events is quite challenging, for the following reason. While such a likelihood is very easy to derive when full information is available for all events, the data typically consist of phylogenetic trees constructed from molecular data collected from extant species alone. Hence, neither extinction events nor speciation events leading to extinct species are recorded in such phylogenetic trees. For a variety of models this problem can be overcome by considering a reconstructed process, whereby the phylogeny of extant species can be regarded as a pure-birth process with time-dependent speciation rate [9]. But this approach is not generally valid.
Thus, the methods employed to derive likelihood expressions are usually applicable to a limited set of models. They do not apply to models that assume that speciation and extinction rates depend on the number of species in the system. Hence, potential feedback of diversity itself on diversification rates, due to interspecific competition or niche filling, is completely ignored. The first to incorporate such feedbacks were Rabosky & Lovette [10], who made rates dependent on the number of species present at every given moment in time, analogously to logistic growth models used in population biology. However, for mathematical tractability their model had to assume that there is no extinction, which stands in stark contrast to the empirical data: the fossil record provides us with many examples of extinct species.
Etienne et al. [3] presented a framework to compute the likelihood of phylogenetic branching times under a diversity-dependent diversification process that explicitly accounts for the influence of species that are not in the phylogeny because they have become extinct. Some of our colleagues doubt that this framework contains a formal argument that the solution of the set of ordinary differential equations that together constitute the framework gives the likelihood of the model for a given phylogenetic tree. Instead, only numerical evidence for a small set of parameter combinations has been provided that the method yields, in the appropriate limit, the known likelihood for the standard diversity-independent (i.e. constant-rates) birth-death model. This likelihood was first provided by Nee et al. [9], using a breaking-the-tree approach. Later, Lambert & Stadler [8] used coalescent point process theory to provide an approach to obtain likelihood formulas for a wider set of models, none of which included diversity-dependence. For example, Lambert et al. [4, 7] applied their framework to the protracted birth-death model, which is a generalization of the diversity-independent model where speciation is no longer an instantaneous event [5]. For this model they provided an explicit likelihood expression.
Here we provide an analytical proof that the likelihood of Etienne et al. [3] reduces to the likelihood of Lambert et al. [7], and hence to that of Nee et al. [9], for the case of diversity-independent diversification.
The extant species belonging to a clade are often not all available for sequencing, because it is difficult to obtain tissue from some species (either because they are difficult to find or catch, because they are endangered, or because they have recently become extinct due to anthropogenic rather than natural causes) or because it is difficult to extract their DNA. This means that our data consist of a phylogenetic tree of an incomplete sample of species, and thus of an incomplete set of speciation events, even for those that lead to the species that we observe today. Models have been developed to deal with this incomplete sampling. One model assumes that we know exactly how many extant species are missing from the phylogeny. This is the case for well-described taxonomic groups, such as birds, where we have a good idea of the species that are evolutionarily related, but we are simply missing some data points for the reasons mentioned above. This sampling model is called $n$-sampling [7]. Another model assumes that we do not know exactly how many species are missing from the phylogeny, but rather that we have an idea of the probability $\rho$ of a species being sampled. This sampling scheme is called $\rho$-sampling [7], but is also referred to as $f$-sampling [9]. The framework of Etienne et al. [3] assumes $n$-sampling, but in this paper we show that it can also be extended to incorporate $\rho$-sampling.
In the next section we summarize the framework of Etienne et al. [3] and we provide the likelihood formula analytically derived by Lambert et al. [7] for the special case of diversity-independent but time-dependent diversification with $n$-sampling. Then we proceed by showing that the probability generating functions of these two likelihoods are identical. We end with a discussion where we point out how the framework of Etienne et al. [3] can be extended to include $\rho$-sampling and how it relates to the likelihood formula of Rabosky & Lovette [10] for the diversity-dependent birth-death model without extinction.
2 The diversity-dependent diversification model
Diversification models are birth-death processes in which “birth” and “death” correspond to speciation and extinction events, respectively. In the simplest case, the per-species speciation rate $\lambda$ and the per-species extinction rate $\mu$ are constants. Here we consider diversification models in which the speciation and extinction rates depend on the number of species $n$ present at time $t$, i.e., are diversity-dependent, which we denote by $\lambda_n$ and $\mu_n$. We also allow the speciation and extinction rates to depend on time $t$, i.e., $\lambda_n(t)$ and $\mu_n(t)$, although the latter dependence is often not explicit in our notation.
We assume that the diversification process starts at time $t_c$ from a crown, i.e., from two ancestor species. Assuming that at a later time $t > t_c$ the process has $n$ species, the transitions in the infinitesimal time interval $[t, t+dt]$ are
– from $n$ to $n+1$ species with probability $n\lambda_n(t)\,dt$,
– from $n$ to $n-1$ species with probability $n\mu_n(t)\,dt$,
– number of species $n$ unchanged with probability $1 - n\lambda_n(t)\,dt - n\mu_n(t)\,dt$,
where the factor $n$ arises because $\lambda_n$ and $\mu_n$ are per-species rates, consistent with Eq. (2.1) below.
The diversification process runs until the present time $t_p$.
We denote by $P_n(t)$ the probability that the process has $n$ species at time $t$. This probability satisfies the following ordinary differential equation (ODE), called the master equation or forward Kolmogorov equation [1],
\[ \frac{dP_n(t)}{dt} = \mu_{n+1}(n+1)P_{n+1}(t) + \lambda_{n-1}(n-1)P_{n-1}(t) - (\lambda_n+\mu_n)\,n\,P_n(t), \tag{2.1} \]
where we omit in the notation the time dependence of the speciation and extinction rates.
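As a rough numerical illustration of Eq. (2.1), the sketch below integrates the master equation on a truncated state space $n = 0, \ldots, N$ with a hand-written Runge-Kutta step. The function names, the truncation level $N = 150$, the constant rates and the neglect of time dependence are all illustrative assumptions, not part of the framework described here.

```python
import math

def master_rhs(P, lam, mu):
    """Right-hand side of Eq. (2.1), truncated at N = len(P) - 1 species.

    lam(n) and mu(n) are per-species rates; time dependence is ignored here.
    """
    N = len(P) - 1
    dP = []
    for n in range(N + 1):
        from_above = mu(n + 1) * (n + 1) * P[n + 1] if n < N else 0.0
        from_below = lam(n - 1) * (n - 1) * P[n - 1] if n > 0 else 0.0
        dP.append(from_above + from_below - (lam(n) + mu(n)) * n * P[n])
    return dP

def integrate(P, lam, mu, duration, steps):
    """Classical fourth-order Runge-Kutta integration over `duration`."""
    h = duration / steps
    for _ in range(steps):
        k1 = master_rhs(P, lam, mu)
        k2 = master_rhs([p + 0.5 * h * d for p, d in zip(P, k1)], lam, mu)
        k3 = master_rhs([p + 0.5 * h * d for p, d in zip(P, k2)], lam, mu)
        k4 = master_rhs([p + h * d for p, d in zip(P, k3)], lam, mu)
        P = [p + h / 6.0 * (a + 2 * b + 2 * c + d)
             for p, a, b, c, d in zip(P, k1, k2, k3, k4)]
    return P

# Hypothetical example: constant rates, crown start with two species.
LAM, MU, T = 0.5, 0.2, 2.0
P0 = [0.0] * 151
P0[2] = 1.0
P_T = integrate(P0, lambda n: LAM, lambda n: MU, T, steps=400)
mean_T = sum(n * p for n, p in enumerate(P_T))
print(sum(P_T), mean_T)
```

For constant rates the expected number of species grows as $2e^{(\lambda-\mu)t}$, which gives a simple consistency check on the integration.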
Sampling models
At the present time $t_p$ a subset of the $n$ extant species is observed and sampled. This sampling process can be modelled in two different ways (see introduction). The first model assumes that a fixed number of species is unsampled, which corresponds to the $n$-sampling scheme of Ref. [7]. That is, the number of extant species at $t_p$ that are not sampled, a number we denote by $m_p$, is part of the data. The second model assumes that each extant species at the present time is sampled with a given probability, which has been called $f$-sampling [9] or $\rho$-sampling [7]. In this case the number of unsampled species $m_p$ is a random variable, and the probability with which extant species are sampled is a parameter to estimate, which we denote by $f_p$.
[Figure 1: panels a) Complete tree, b) Reconstructed tree, c) number of lineages versus time $t$, with the times $t_c, t_2, t_3, t_4, t_5, t_p$ marked on the time axis.]
Fig. 1 a) Full tree where missing species are plotted as red dashed lines: the ones ending in a cross become extinct before the present, whereas the ones ending with a red dot are unsampled species at the present; b) Corresponding reconstructed tree in which only extant species are present. This is the type of tree we usually work with because actual phylogenetic trees are usually obtained from molecular data taken from extant species; c) Lineages-through-time plot: the green line represents the number of lineages leading to extant species ($k$), the red line represents lineages leading to extinct or unsampled species ($m$), and the blue line represents the total number of lineages ($n = k + m$).
Reconstructed tree
A realization of the diversification process from $t_c$ to $t_p$ can be represented graphically as a tree, see Figure 1. The complete tree shows all the species that have originated in the process (Fig. 1a). However, in practice we only have access to the reconstructed tree, i.e., the complete tree from which we remove all the species that became extinct before the present or that were not sampled (Fig. 1b). While it would be straightforward to infer information about the diversification process based on the complete tree, this task is much more challenging when only the reconstructed tree is available.
This paper deals with likelihood formulas for a reconstructed phylogenetic tree. The number of tips equals the number of sampled extant species $k_p$. We also assume that the number of unsampled extant species, which we denote by $m_p$, is known. The information contained in a phylogenetic tree consists of a topology and a set of branching times. For a large set of diversification models, including the diversity-dependent one, all trees having the same branching times but different topologies are equally probable [8]. Hence, we can discard the topology for the likelihood computation. We denote the vector of branching times by $\mathbf{t} = (t_2, t_3, \ldots, t_{k_p-1})$, where $t_k$ is the branching time at which the phylogenetic tree changes from $k$ to $k+1$ branches. It will be convenient to set $t_0 = t_1 = t_c$ and $t_{k_p} = t_p$.
Likelihood conditioning
It is common practice to condition the likelihood on the survival of both ancestor lineages to the present time [9]. Indeed, we would only do an analysis on trees that have actually survived to the present. In particular, we impose the crown age of the reconstructed tree. This corresponds to conditioning on the event that both ancestor species have extant descendant species. We do not require that these descendant species have been sampled.
3 The framework of Etienne et al.
Etienne et al. [3] presented an approach to compute the likelihood of a phylogeny for the diversity-dependent model. It is based on a new variable, $Q^k_m(t)$, which they described as “the probability that a realization of the diversification process is consistent with the phylogeny up to time $t$, and has $n = m + k$ species at time $t$” (Ref. [3], Box 1), where $k$ lineages are represented in the phylogenetic tree (because they are ancestral to one of the $k_p$ species extant and sampled at present) and $m$ additional species are present but unobserved (Fig. 1c). These species might not be in the phylogenetic tree because they became extinct before the present or because they are either not discovered or not sampled yet (see introduction). From here on we will refer to these $m$ species as missing species. We cannot ignore these missing species, because in a diversity-dependent speciation process they can influence the speciation and extinction rates.
We start by describing the computation of the variable $Q^{k_p}_{m_p}(t_p)$, which proceeds from the crown age $t_c$ to the present time $t_p$. It is convenient to arrange the values $Q^k_m(t)$, with $m = 0, 1, 2, \ldots$, into the vector $Q^k(t)$. The initial vector $Q^{k=2}(t_c)$ is transformed into the vector $Q^k(t)$ at a later time $t$ as follows (Ref. [3], Appendix S1, Eq. (S1)):
\[ Q^k(t) = A_k(t_{k-1},t)\,B_{k-1}(t_{k-1})\,A_{k-1}(t_{k-2},t_{k-1}) \cdots A_3(t_2,t_3)\,B_2(t_2)\,A_2(t_c,t_2)\,Q^{k=2}(t_c), \]
with $t_{k-1} \le t \le t_k$. The operators $A_k$ and $B_k$ are infinite-dimensional matrices that operate along the tree, on branches and nodes, respectively (Fig. 2). Continuing this computation until the present time $t_p$, we get
\[ Q^{k_p}(t_p) = A_{k_p}(t_{k_p-1},t_p)\,B_{k_p-1}(t_{k_p-1})\,A_{k_p-1}(t_{k_p-2},t_{k_p-1}) \cdots A_3(t_2,t_3)\,B_2(t_2)\,A_2(t_c,t_2)\,Q^{k=2}(t_c). \tag{3.1} \]
Note that Eq. (3.1) generalizes Eq. (S1) of Ref. [3] to the case in which the rates are time-dependent.
We specify the different terms appearing in the right-hand side of Eq. (3.1):
– For the initial vector $Q^{k=2}(t_c)$ we assume that there are no missing species at crown age, that is, $Q^{k=2}_m(t_c) = \delta_{m,0}$.
– The matrix $A_k$ corresponds to the dynamics of $Q^k_m(t)$ in the time interval $[t_{k-1}, t_k]$, during which the phylogenetic tree has $k$ branches. Etienne et al. [3] argued that these dynamics are given by the following ODE system (Ref. [3], Box 1, Eq. (B2)):
\[ \begin{aligned} \frac{dQ^k_m(t)}{dt} &= \mu_{k+m+1}(m+1)\,Q^k_{m+1}(t) + \lambda_{k+m-1}(m-1+2k)\,Q^k_{m-1}(t) - (\lambda_{k+m}+\mu_{k+m})(m+k)\,Q^k_m(t), && \forall\, m > 0, \\ \frac{dQ^k_0(t)}{dt} &= \mu_{k+1}\,Q^k_1(t) - (\lambda_k+\mu_k)\,k\,Q^k_0(t), && \text{if } m = 0. \end{aligned} \tag{3.2} \]
The quantity $Q^{k_p}_0(t_p)$ is the likelihood for the tree at the present time, assuming that all the species that survived to the present have been sampled. We can collect the coefficients of $Q^k_m(t)$ on the right-hand side of the ODE system in a matrix $V_k(t)$. If we do so, the system can be rewritten as
\[ \frac{dQ^k(t)}{dt} = V_k(t)\,Q^k(t), \]
which has solution
\[ Q^k(t) = \exp\!\left( \int_{t_{k-1}}^{t} V_k(s)\,ds \right) Q^k(t_{k-1}), \]
so that
\[ A_k(t_{k-1},t) = \exp\!\left( \int_{t_{k-1}}^{t} V_k(s)\,ds \right). \tag{3.3} \]
– The matrix $B_k$ transforms the solution of the ODE system ending at $t_k$ into the initial condition of the ODE system starting at $t_k$. It is a diagonal matrix with components $k\lambda_{k+m}\,dt$, so that
\[ Q^{k+1}_m(t_k) = \big(B_k(t_k)\big)_{m,m}\,Q^k_m(t_k) = k\lambda_{k+m}\,dt\;Q^k_m(t_k). \tag{3.4} \]
The multiplication by $k\lambda_{k+m}\,dt$ corresponds to the probability that one of the $k$ lineages in the tree speciates in the time interval $[t_k, t_k+dt]$. In the likelihood expressions we will omit the differential (a choice adopted for most models of this kind in the literature), as it is not essential in parameter estimation. Therefore, we will work with a likelihood density, but for simplicity we will refer to it as a likelihood.
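The three ingredients listed above can be sketched in a few lines. The code below is a minimal illustration, assuming rates that depend on diversity but not on time, a truncation of the number of missing species at a hypothetical bound $M$, and a simple Runge-Kutta integrator in place of the matrix exponential of Eq. (3.3); the names `apply_A`, `apply_B` and `q_at_present` are invented for this example.

```python
import math

def q_rhs(Q, k, lam, mu):
    """Right-hand side of Eq. (3.2) for m = 0..M, with M = len(Q) - 1."""
    M = len(Q) - 1
    dQ = []
    for m in range(M + 1):
        n = k + m
        ext = mu(n + 1) * (m + 1) * Q[m + 1] if m < M else 0.0
        spec = lam(n - 1) * (m - 1 + 2 * k) * Q[m - 1] if m > 0 else 0.0
        dQ.append(ext + spec - (lam(n) + mu(n)) * n * Q[m])
    return dQ

def apply_A(Q, k, duration, lam, mu, h_max=0.005):
    """Operator A_k: integrate Eq. (3.2) over a stretch with k lineages (RK4)."""
    steps = max(1, math.ceil(duration / h_max))
    h = duration / steps
    for _ in range(steps):
        k1 = q_rhs(Q, k, lam, mu)
        k2 = q_rhs([q + 0.5 * h * d for q, d in zip(Q, k1)], k, lam, mu)
        k3 = q_rhs([q + 0.5 * h * d for q, d in zip(Q, k2)], k, lam, mu)
        k4 = q_rhs([q + h * d for q, d in zip(Q, k3)], k, lam, mu)
        Q = [q + h / 6.0 * (a + 2 * b + 2 * c + d)
             for q, a, b, c, d in zip(Q, k1, k2, k3, k4)]
    return Q

def apply_B(Q, k, lam):
    """Operator B_k of Eq. (3.4), with the differential dt dropped."""
    return [k * lam(k + m) * q for m, q in enumerate(Q)]

def q_at_present(branch_times, t_c, t_p, lam, mu, M=60):
    """Eq. (3.1); branch_times = [t_2, ..., t_{k_p-1}] in increasing order."""
    Q = [1.0] + [0.0] * M              # Q^{k=2}_m(t_c) = delta_{m,0}
    k, t = 2, t_c
    for t_next in branch_times:
        Q = apply_B(apply_A(Q, k, t_next - t, lam, mu), k, lam)
        k, t = k + 1, t_next
    return apply_A(Q, k, t_p - t, lam, mu)

# Hypothetical tree with k_p = 4 tips: crown at 0, branchings at 1.0 and 1.5.
Q_yule = q_at_present([1.0, 1.5], 0.0, 2.0, lambda n: 0.4, lambda n: 0.0)
Q_dd = q_at_present([1.0, 1.5], 0.0, 2.0,
                    lambda n: max(0.0, 0.8 * (1.0 - n / 40.0)), lambda n: 0.1)
print(Q_yule[0], Q_dd[0])
```

Without extinction the component $Q^{k}_0$ decouples from the others, so `Q_yule[0]` can be compared with the elementary product of exponential waiting-time factors and speciation rates.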
We are then ready to formulate the claim made by Etienne et al. [3] (in particular, see their Eqs. (S2) and (S6) in Appendix S1).
Claim 3.1 Consider the diversity-dependent diversification model, given by speciation rates $\lambda_n(t)$ and extinction rates $\mu_n(t)$. The diversification process starts at crown age $t_c$ with two ancestor species, and ends at the present time $t_p$, at which a fixed number of species $m_p$ are not sampled. A phylogenetic tree is constructed for the sampled species. Then, the likelihood that the phylogenetic tree has $k_p$ tips and vector of branching times $\mathbf{t} = (t_1, t_2, \ldots, t_{k_p-1})$, conditional on the event that both crown lineages survive until the present, is equal to
\[ L_{k_p,\mathbf{t},m_p} = \frac{Q^{k_p}_{m_p}(t_p)}{\binom{k_p+m_p}{m_p}\,P_c(t_c,t_p)}. \tag{3.5} \]
The term $Q^{k_p}_{m_p}(t_p)$ in the numerator of this expression is obtained from Eq. (3.1). The term $P_c(t_c,t_p)$ in the denominator, where the subscript $c$ stands for conditioning, is equal to
\[ P_c(t_c,t_p) = \sum_{m=0}^{\infty} \frac{6}{(m+2)(m+3)}\,Q^{k=2}_m(t_p), \tag{3.6} \]
where $Q^{k=2}_m(t_p)$ is again obtained from Eq. (3.1).
The structure of the likelihood expression (3.5) can be understood intuitively. It is proportional to $Q^{k_p}_{m_p}(t_p)$, which in Etienne et al.’s interpretation is the probability that the diversification process generates the phylogenetic tree with $k_p$ tips and $m_p$ missing species at present time $t_p$. The combinatorial factor $\binom{k_p+m_p}{m_p}$ accounts for the number of ways to select $m_p$ missing species out of a total pool of $k_p + m_p$ species. The factor $P_c(t_c,t_p)$ is the probability that both ancestor species at crown age $t_c$ have descendant species at the present time $t_p$. Hence, this factor applies the likelihood conditioning.
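To make the assembly of Eq. (3.5) concrete, the following sketch combines two already computed vectors into the conditioned likelihood. The inputs `Q_kp` and `Q2` stand for truncated versions of $Q^{k_p}(t_p)$ and $Q^{k=2}(t_p)$; the function names and the toy numbers are assumptions made for illustration only.

```python
from math import comb

def conditioning_probability(Q2):
    """P_c of Eq. (3.6), from a truncated vector Q^{k=2}_m(t_p), m = 0..M."""
    return sum(6.0 / ((m + 2) * (m + 3)) * q for m, q in enumerate(Q2))

def likelihood_n_sampling(Q_kp, Q2, k_p, m_p):
    """Likelihood of Eq. (3.5) for k_p tips and m_p unsampled extant species."""
    return Q_kp[m_p] / (comb(k_p + m_p, m_p) * conditioning_probability(Q2))

# Toy vectors (not derived from any real tree), chosen so that P_c = 1.
example = likelihood_n_sampling([0.2, 0.1], [1.0, 0.0, 0.0], k_p=3, m_p=1)
print(example)
```

With these toy inputs the binomial factor is $\binom{4}{1} = 4$, so the function returns $0.1/4$.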
Etienne et al. [3] provided numerical evidence that Claim 3.1 is in agreement with the likelihood provided by Nee et al. [9] under the hypothesis of diversity-independent speciation and extinction rates and no missing species at the present. However, a rigorous analytical proof, even for this specific case, has not yet been given. In this paper we show that Claim 3.1 holds (1) for the diversity-independent (but possibly time-dependent) case and (2) for the diversity-dependent case without extinction (i.e., extinction rate $\mu = 0$).
[Figure 2: reconstructed tree with stretches $k=2$, $k=3$, $k=4$ along the time axis $t$ (marked at $t_c, t_2, t_3, t_p$), annotated with $Q^2(t_c)$, $A_2(t_2,t_c)$, $B_2$, $A_3(t_3,t_2)$, $B_3$, $A_4(t_p,t_3)$ and $Q^4(t_p)$.]
Fig. 2 An example of how to build a likelihood for a tree with $k_p = 4$ tips. We start with a vector $Q^2(t_c)$ at the crown age. We use $A_k(t_k,t_{k-1})$ and $B_k(t_k)$ to evolve the vector across the entire tree (on branches and nodes, respectively) up to the present time $t_p$ according to $Q^4(t_4) = A_4(t_4,t_3)\,B_3(t_3)\,A_3(t_3,t_2)\,B_2(t_2)\,A_2(t_2,t_c)\,Q^2(t_c)$ (the figure writes the time arguments of $A_k$ in the reverse order of Eq. (3.1)). At the present time the likelihood accounting for $m_p$ missing species will be proportional to the $m_p$-th component of the vector, $L_{4,m_p} \propto Q^4_{m_p}(t_p)$.
4 The likelihood for the diversity-independent case
Claim 3.1 proposes a likelihood expression for the case with a known number of unsampled species at the present, i.e., it accounts for $n$-sampling. For the diversity-independent case, i.e., $\lambda_n(t) = \lambda(t)$ and $\mu_n(t) = \mu(t)$, the likelihood is contained in a more general result established by Lambert et al. [7]. In the following proposition we derive an explicit likelihood expression by restricting the result of Lambert et al. to the diversity-independent case.
Proposition 4.1 Consider the diversity-independent diversification model, given by speciation rate $\lambda(t)$ and extinction rate $\mu(t)$. The diversification process starts at crown age $t_c$ with two ancestor species, and ends at the present time $t_p$, at which a fixed number of species $m_p$ are not sampled. A phylogenetic tree is constructed for the sampled species. Then, the likelihood that the phylogenetic tree has $k_p$ tips and vector of branching times $\mathbf{t} = (t_1, t_2, \ldots, t_{k_p-1})$, conditional on the event that both crown lineages survive until the present, is equal to
\[ \begin{aligned} L^{(\text{div-indep})}_{k_p,\mathbf{t},m_p} ={}& \frac{(k_p-1)!}{\binom{k_p+m_p}{m_p}}\,\big(1-\eta(t_c,t_p)\big)^2 \prod_{i=2}^{k_p-1} \lambda(t_i)\,\big(1-\xi(t_i,t_p)\big)\big(1-\eta(t_i,t_p)\big) \\ &\times \sum_{\mathbf{m}|m_p}\; \prod_{j=0}^{k_p-1} (m_j+1)\,\big(\eta(t_j,t_p)\big)^{m_j}, \end{aligned} \tag{4.1} \]
where we used the convention $t_0 = t_1 = t_c$. The components $m_j$ (with $j = 0, 1, \ldots, k_p-1$) of the vectors $\mathbf{m}$, in the sum on the second line, are non-negative integers satisfying $\sum_{j=0}^{k_p-1} m_j = m_p$. The functions $\xi(t,t_p)$ and $\eta(t,t_p)$ are given by
\[ \xi(t,t_p) = 1 - \frac{1}{\alpha(t,t_p) + \int_t^{t_p} \lambda(s)\,\alpha(t,s)\,ds} = 1 - \frac{1}{1 + \int_t^{t_p} \mu(s)\,\alpha(t,s)\,ds}, \tag{4.2} \]
\[ \eta(t,t_p) = 1 - \frac{\alpha(t,t_p)}{\alpha(t,t_p) + \int_t^{t_p} \lambda(s)\,\alpha(t,s)\,ds} = 1 - \frac{\alpha(t,t_p)}{1 + \int_t^{t_p} \mu(s)\,\alpha(t,s)\,ds}, \tag{4.3} \]
with
\[ \alpha(t,s) = \exp\!\left( \int_t^{s} \big(\mu(s') - \lambda(s')\big)\,ds' \right). \]
The functions $\xi(t,t_p)$ and $\eta(t,t_p)$ are those appearing in Kendall’s solution of the birth-death model (see Ref. [6], Eqs. (10–12)), and are useful to describe the process when time-dependent rates are involved. Given the probability $P_n(t,t_p)$ of realizing a process starting with 1 species at time $t$ and ending with $n$ species at time $t_p$, we have that $\xi(t,t_p) = P_0(t,t_p)$ and $\eta(t,t_p) = P_{n^\star+1}(t,t_p)/P_{n^\star}(t,t_p)$ for any $n^\star > 0$.
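For time-dependent rates the integrals in Eqs. (4.2) and (4.3) generally have to be evaluated numerically. The sketch below does this with a plain trapezoid rule and compares the result, for constant rates, with the closed forms that follow from carrying out the integrals by hand. The function names, the grid size and the example rates are assumptions chosen for illustration.

```python
import math

def xi_eta(t, t_p, lam, mu, n=2000):
    """Evaluate xi(t, t_p) and eta(t, t_p) of Eqs. (4.2)-(4.3) by the trapezoid rule.

    lam(s) and mu(s) are the time-dependent rates; n is the number of grid cells.
    """
    h = (t_p - t) / n
    log_alpha, alpha_prev, s_prev = 0.0, 1.0, t
    integral = 0.0                      # int_t^{t_p} mu(s) alpha(t, s) ds
    for i in range(1, n + 1):
        s = t + i * h
        log_alpha += 0.5 * h * ((mu(s_prev) - lam(s_prev)) + (mu(s) - lam(s)))
        alpha = math.exp(log_alpha)
        integral += 0.5 * h * (mu(s_prev) * alpha_prev + mu(s) * alpha)
        alpha_prev, s_prev = alpha, s
    denom = 1.0 + integral
    # After the loop, alpha_prev holds alpha(t, t_p).
    return 1.0 - 1.0 / denom, 1.0 - alpha_prev / denom

def xi_eta_constant(t, t_p, lam, mu):
    """Closed forms for constant rates (requires lam != mu)."""
    e = math.exp(-(lam - mu) * (t_p - t))
    d = lam - mu * e
    return mu * (1.0 - e) / d, lam * (1.0 - e) / d

xi_num, eta_num = xi_eta(0.0, 2.0, lambda s: 0.5, lambda s: 0.2)
xi_ref, eta_ref = xi_eta_constant(0.0, 2.0, 0.5, 0.2)
# Hypothetical time-dependent example: exponentially declining speciation rate.
xi_td, eta_td = xi_eta(0.0, 2.0, lambda s: 0.5 * math.exp(-0.1 * s), lambda s: 0.2)
print(xi_num, xi_ref, eta_num, eta_ref, xi_td, eta_td)
```

The second form of each equation (with the integral over $\mu$) is used here because it needs only one quadrature in addition to $\alpha$.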
Proof The likelihood for $n$-sampling was originally provided by Ref. [7], Eq. (7), but we start from the explicit version provided in Ref. [4], Eq. (1), see corrigendum in Ref. [2],
\[ L^{(\text{div-indep})}_{k_p,\mathbf{t},m_p} = \frac{(k_p-1)!}{\binom{k_p+m_p}{m_p}}\,\big(g(t_c,t_p)\big)^2 \prod_{i=2}^{k_p-1} f(t_i,t_p) \sum_{\mathbf{m}|m_p}\; \prod_{j=0}^{k_p-1} (m_j+1)\,\big(1-g(t_j,t_p)\big)^{m_j}. \tag{4.4} \]
Lambert et al. [4, 7] specify the functions $f(t,t_p)$ and $g(t,t_p)$ as the solution of a system of ODEs for the case of protracted speciation, a model where speciation does not take place instantaneously but is initiated and needs time to complete. The standard diversification model is then obtained by taking the limit in which the speciation-completion rate tends to infinity. In this limit the four-dimensional system of Lambert et al. [4], Eq. (2), reduces to a two-dimensional system of ODEs,
\[ \begin{aligned} f(t,t_p) = \frac{dg(t,t_p)}{dt} &= \lambda(t)\,\big(1-q(t,t_p)\big)\,g(t,t_p), \\ \frac{dq(t,t_p)}{dt} &= -\mu(t) + \big(\mu(t)+\lambda(t)\big)\,q(t,t_p) - \lambda(t)\,q^2(t,t_p). \end{aligned} \]
Note that in this paper time $t$ runs from past to present rather than from present to past as in Lambert et al. [4]. The conditions at the present time $t_p$ are given by $g(t_p,t_p) = 1$ and $q(t_p,t_p) = 0$.
The solution of this system of ODEs can be expressed in terms of $\eta(t,t_p)$ and $\xi(t,t_p)$,
\[ \begin{aligned} f(t,t_p) &= \lambda(t)\,\big(1-\xi(t,t_p)\big)\big(1-\eta(t,t_p)\big), \\ g(t,t_p) &= 1-\eta(t,t_p), \\ q(t,t_p) &= \xi(t,t_p), \end{aligned} \]
which can be checked using the derivatives of the expressions (4.2) and (4.3),
\[ \begin{aligned} \frac{\partial \eta(t,t_p)}{\partial t} &= -\lambda(t)\,\big(1-\xi(t,t_p)\big)\big(1-\eta(t,t_p)\big), \\ \frac{\partial \xi(t,t_p)}{\partial t} &= -\big(\mu(t)-\lambda(t)\,\xi(t,t_p)\big)\big(1-\xi(t,t_p)\big). \end{aligned} \]
Substituting the functions $f(t,t_p)$ and $g(t,t_p)$ into the likelihood expression (4.4) concludes the proof.
The functions $\xi(t,t_p)$ and $\eta(t,t_p)$ are directly related to the functions used by Nee et al. [9]. In particular, the functions they denoted by $P(t,t_p)$ and $u_t$ correspond in our notation to $1-\xi(t,t_p)$ and $\eta(t,t_p)$, respectively.
This correspondence allows us to get an intuitive understanding of the likelihood expression (4.1). First consider the case without missing species. Setting $m_p = 0$, we get
\[ L^{(\text{div-indep})}_{k_p,\mathbf{t},0} = (k_p-1)!\,\big(1-\eta(t_c,t_p)\big)^2 \prod_{i=2}^{k_p-1} \lambda(t_i)\,\big(1-\xi(t_i,t_p)\big)\big(1-\eta(t_i,t_p)\big), \]
which is identical to the breaking-the-tree likelihood of Nee et al. (Ref. [9], Eq. (20)). In the latter approach the phylogenetic tree is broken into single branches: two for the interval $[t_c,t_p]$ and one for each interval $[t_i,t_p]$ with $i = 2, 3, \ldots, k_p-1$. Each branch contributes a factor $(1-\xi(t_i,t_p))(1-\eta(t_i,t_p))$, equal to the probability that the branch starting at $t_i$ has a single descendant species at $t_p$. For the two branches originating at $t_c$, the factor $(1-\xi(t_c,t_p))$, equal to the probability of having (one or more) descendant species, drops out due to the conditioning. For the other branches, there is an additional factor $\lambda(t_i)$ for the speciation events.
Next consider the case with missing species. Each of the branches resulting from breaking the tree can contribute species to the pool of $m_p$ missing species. For the branch over the interval $[t_j,t_p]$, there are $m_j$ such species, each contributing a factor $\eta(t_j,t_p)$ to the likelihood. Indeed, $(1-\xi(t_j,t_p))(1-\eta(t_j,t_p))(\eta(t_j,t_p))^{m_j}$ is equal to the probability of having exactly $m_j+1$ descendant species at the present time. One of these species is represented in the phylogenetic tree, justifying the combinatorial factor $(m_j+1)$ in the second line of Eq. (4.1).
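The sum over the vectors $\mathbf{m}$ in Eq. (4.1) can be evaluated by direct enumeration for small $m_p$. The following sketch does so for constant rates, using the closed forms of $\xi$ and $\eta$ for that case; the function names and the example tree are assumptions for illustration, and the enumeration is not meant to be efficient.

```python
import math
from math import comb, factorial

def xi_eta_const(t, t_p, lam, mu):
    """xi and eta for constant rates (requires lam != mu)."""
    e = math.exp(-(lam - mu) * (t_p - t))
    d = lam - mu * e
    return mu * (1.0 - e) / d, lam * (1.0 - e) / d

def compositions(total, parts):
    """Yield all tuples of `parts` non-negative integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

def likelihood_div_indep(branch_times, t_c, t_p, lam, mu, m_p):
    """Eq. (4.1) with constant rates; branch_times = [t_2, ..., t_{k_p-1}]."""
    times = [t_c, t_c] + list(branch_times)       # convention t_0 = t_1 = t_c
    k_p = len(times)
    xe = [xi_eta_const(t, t_p, lam, mu) for t in times]
    value = factorial(k_p - 1) / comb(k_p + m_p, m_p) * (1.0 - xe[0][1]) ** 2
    for xi, eta in xe[2:]:                        # product over i = 2..k_p-1
        value *= lam * (1.0 - xi) * (1.0 - eta)
    total = 0.0
    for m in compositions(m_p, k_p):              # sum over m | m_p
        term = 1.0
        for m_j, (_, eta) in zip(m, xe):
            term *= (m_j + 1) * eta ** m_j
        total += term
    return value * total

L_yule = likelihood_div_indep([1.0, 1.5], 0.0, 2.0, 0.4, 0.0, m_p=0)
L_missing = likelihood_div_indep([1.0, 1.5], 0.0, 2.0, 0.5, 0.2, m_p=2)
print(L_yule, L_missing)
```

The number of terms in the sum is $\binom{m_p+k_p-1}{k_p-1}$, so for large trees or many missing species a generating-function or convolution approach would be preferable.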
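Finally, we recall the expressions for the functions $\xi(t,t_p)$ and $\eta(t,t_p)$ in the case of constant rates, $\lambda(t) = \lambda$ and $\mu(t) = \mu$,
\[ \xi(t,t_p) = \frac{\mu\left(1-e^{-(\lambda-\mu)(t_p-t)}\right)}{\lambda-\mu\,e^{-(\lambda-\mu)(t_p-t)}}, \qquad \eta(t,t_p) = \frac{\lambda\left(1-e^{-(\lambda-\mu)(t_p-t)}\right)}{\lambda-\mu\,e^{-(\lambda-\mu)(t_p-t)}}. \]
5 Equivalence for the diversity-independent case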
Likelihood formula (4.1) allows speciation and extinction rates to be arbitrary functions of time, $\lambda(t)$ and $\mu(t)$. Here we show that, for the diversity-independent case, we find the same likelihood formula with the approach of Etienne et al. [3].
Theorem 5.1 Claim 3.1 holds for the diversity-independent case.
Proof The proof relies heavily on generating functions. First, we introduce the generating function for the variables $Q^k_m(t)$,
\[ F_k(z,t) = \sum_{m=0}^{\infty} z^m\,Q^k_m(t). \tag{5.1} \]
The set of ODEs satisfied by $Q^k_m(t)$, Eq. (3.2), transforms into a partial differential equation (PDE) for the generating function $F_k(z,t)$,
\[ \begin{aligned} \frac{\partial F_k(z,t)}{\partial t} &= \big(\mu(t)-z\lambda(t)\big)(1-z)\,\frac{\partial F_k(z,t)}{\partial z} + k\,\big(2z\lambda(t)-\lambda(t)-\mu(t)\big)\,F_k(z,t) \\ &= c(z,t)\,\frac{\partial F_k(z,t)}{\partial z} + k\,\frac{\partial c(z,t)}{\partial z}\,F_k(z,t), \end{aligned} \tag{5.2} \]
with
\[ c(z,t) = \big(\mu(t)-z\lambda(t)\big)(1-z). \]
Note that the number of branches $k$ changes at each branching time, so that the PDE for $F_k(z,t)$ is valid only for $t_{k-1} \le t \le t_k$ (corresponding to the operator $A_k$). At branching time $t_k$, the solution $F_k(z,t_k)$ has to be transformed to provide the initial condition for the PDE for $F_{k+1}(z,t)$ at time $t_k$ (corresponding to the operator $B_k$). Using Eq. (3.4) and dropping the differential, we get
\[ F_{k+1}(z,t_k) = k\,\lambda(t_k)\,F_k(z,t_k). \tag{5.3} \]
The initial condition at crown age is $F_2(z,t_c) = 1$ because $Q^{k=2}_m(t_c) = \delta_{m,0}$.
Next, we define $P_n(s,t)$ as the probability that the birth-death process that started with one species at time $s$ has $n$ species at time $t$. The corresponding generating function is defined as
\[ G(z,s,t) = \sum_{n=0}^{\infty} z^n\,P_n(s,t). \tag{5.4} \]
The set of ODEs satisfied by $P_n(s,t)$, Eq. (2.1), transforms into a PDE,
\[ \frac{\partial G(z,s,t)}{\partial t} = c(z,t)\,\frac{\partial G(z,s,t)}{\partial z}. \tag{5.5} \]
Its solution was given by Kendall (Ref. [6], Eq. (9)),
\[ G(z,s,t) = \frac{\xi(s,t) + \big(1-\xi(s,t)-\eta(s,t)\big)\,z}{1 - z\,\eta(s,t)}, \tag{5.6} \]
where $\xi(s,t)$ and $\eta(s,t)$ are given in Eqs. (4.2) and (4.3).
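Expanding Eq. (5.6) in powers of $z$ gives $P_0 = \xi$ and $P_n = (1-\xi)(1-\eta)\eta^{n-1}$ for $n \ge 1$, and differentiating it with respect to $z$ gives $(1-\xi)(1-\eta)/(1-z\eta)^2$. The short check below confirms both statements numerically for arbitrary test values of $\xi$, $\eta$ and $z$; the function names and the chosen values are illustrative assumptions.

```python
def kendall_G(z, xi, eta):
    """Generating function of Eq. (5.6)."""
    return (xi + (1.0 - xi - eta) * z) / (1.0 - z * eta)

def kendall_P(n, xi, eta):
    """Coefficient of z^n in the expansion of Eq. (5.6)."""
    return xi if n == 0 else (1.0 - xi) * (1.0 - eta) * eta ** (n - 1)

def dG_dz(z, xi, eta):
    """Closed-form z-derivative of Eq. (5.6)."""
    return (1.0 - xi) * (1.0 - eta) / (1.0 - z * eta) ** 2

# Arbitrary test values (not taken from any data set).
XI, ETA, Z = 0.23, 0.58, 0.7
series = sum(Z ** n * kendall_P(n, XI, ETA) for n in range(400))
step = 1e-6
finite_diff = (kendall_G(Z + step, XI, ETA) - kendall_G(Z - step, XI, ETA)) / (2 * step)
print(series, kendall_G(Z, XI, ETA), finite_diff, dG_dz(Z, XI, ETA))
```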
The generating function $F_k(z,t)$ can be expressed in terms of the generating function $G(z,s,t)$, as shown in the following lemma.
Lemma 5.1 The generating function $F_k(z,t)$ of the variables $Q^k_m(t)$ is given by
\[ F_k(z,t) = H^2(z,t_c,t) \prod_{j=2}^{k-1} j\,\lambda(t_j)\,H(z,t_j,t) \tag{5.7} \]
with
\[ H(z,s,t) = \frac{\partial G(z,s,t)}{\partial z} = \frac{\big(1-\xi(s,t)\big)\big(1-\eta(s,t)\big)}{\big(1-z\,\eta(s,t)\big)^2}. \tag{5.8} \]
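To prove the lemma, let us suppose that the solution of Eq. (5.2) is of the form
\[ F_k(z,t) = C_k(\mathbf{t}) \prod_{j=0}^{k-1} \partial_z G(z,t_j,t), \tag{5.9} \]
where $C_k(\mathbf{t})$ is a constant depending on the branching times. We used the convention $t_0 = t_1 = t_c$ and the short-hand notation $\partial_z$ for the partial derivative with respect to $z$. This expression can be rewritten as
\[ F_k(z,t) = C_k(\mathbf{t})\,\frac{1}{k} \sum_{i=0}^{k-1} \partial_z G(z,t_i,t) \prod_{\substack{j=0 \\ j\neq i}}^{k-1} \partial_z G(z,t_j,t). \]
The partial derivatives of $F_k$ can now be computed,
\[ \begin{aligned} \partial_z F_k &= C_k(\mathbf{t}) \sum_{i=0}^{k-1} \partial_z^2 G(z,t_i,t) \prod_{\substack{j=0 \\ j\neq i}}^{k-1} \partial_z G(z,t_j,t), \\ \partial_t F_k &= C_k(\mathbf{t}) \sum_{i=0}^{k-1} \partial_t \partial_z G(z,t_i,t) \prod_{\substack{j=0 \\ j\neq i}}^{k-1} \partial_z G(z,t_j,t). \end{aligned} \]
We substitute these expressions into the PDE, Eq. (5.2),
\[ \begin{aligned} \sum_{i=0}^{k-1} \partial_t \partial_z G(z,t_i,t) \prod_{\substack{j=0 \\ j\neq i}}^{k-1} \partial_z G(z,t_j,t) ={}& c(z,t) \sum_{i=0}^{k-1} \partial_z^2 G(z,t_i,t) \prod_{\substack{j=0 \\ j\neq i}}^{k-1} \partial_z G(z,t_j,t) \\ &+ k\,\partial_z c(z,t)\,\frac{1}{k} \sum_{i=0}^{k-1} \partial_z G(z,t_i,t) \prod_{\substack{j=0 \\ j\neq i}}^{k-1} \partial_z G(z,t_j,t). \end{aligned} \]
This equation is satisfied if the following equation is satisfied for every $i = 0, 1, \ldots, k-1$,
\[ \begin{aligned} \partial_t \partial_z G(z,t_i,t) \prod_{\substack{j=0 \\ j\neq i}}^{k-1} \partial_z G(z,t_j,t) ={}& c(z,t)\,\partial_z^2 G(z,t_i,t) \prod_{\substack{j=0 \\ j\neq i}}^{k-1} \partial_z G(z,t_j,t) \\ &+ \partial_z c(z,t)\,\partial_z G(z,t_i,t) \prod_{\substack{j=0 \\ j\neq i}}^{k-1} \partial_z G(z,t_j,t). \end{aligned} \]
This is the case if
\[ \partial_t \partial_z G(z,t_i,t) = c(z,t)\,\partial_z^2 G(z,t_i,t) + \partial_z c(z,t)\,\partial_z G(z,t_i,t), \]
or, equivalently, if
\[ \partial_z \big[ \partial_t G(z,t_i,t) \big] = \partial_z \big[ c(z,t)\,\partial_z G(z,t_i,t) \big]. \]
This is an identity because $G(z,t_i,t)$ satisfies Eq. (5.5).
Next, we verify that the constants $C_k(\mathbf{t})$ can be determined such that initial conditions (5.3) are satisfied. This is indeed the case if we take
\[ C_k(\mathbf{t}) = \prod_{j=2}^{k-1} j\,\lambda(t_j), \]
because then $C_{k+1}(\mathbf{t}) = k\,\lambda(t_k)\,C_k(\mathbf{t})$, and the factor added to the product at $t = t_k$ is trivial: $\partial_z G(z,t_k,t_k) = 1$, since $\xi(t_k,t_k) = \eta(t_k,t_k) = 0$. Introducing the function $H(z,s,t)$ and using $t_0 = t_1 = t_c$ complete the proof of the lemma.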
Next, we use Eq. (5.7) to derive an explicit expression for the likelihood (3.5) of Claim 3.1. It will be useful to have explicit expressions for derivatives of the function $H(z,s,t)$. It follows from Eq. (5.8) that
\[ \frac{1}{a!}\,\partial_z^a \Big[ H^b(z,t_j,t) \Big] = \binom{a+2b-1}{a}\,H^b(z,t_j,t) \left( \frac{\eta(t_j,t)}{1-z\,\eta(t_j,t)} \right)^{a}, \tag{5.10} \]
where $a$ is a non-negative integer and $b$ is a positive integer.
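To evaluate the numerator of Eq. (3.5), we have to extract $Q^{k_p}_{m_p}(t_p)$ from the generating function $F_{k_p}(z,t_p)$. Using Leibniz’ rule,
\[ \begin{aligned} Q^{k_p}_{m_p}(t_p) &= \frac{1}{m_p!}\,\partial_z^{m_p} \Big[ F_{k_p}(z,t_p) \Big]_{z=0} \\ &= \frac{C_{k_p}(\mathbf{t})}{m_p!}\,\partial_z^{m_p} \Bigg[ \prod_{j=0}^{k_p-1} H(z,t_j,t_p) \Bigg]_{z=0} \\ &= \frac{C_{k_p}(\mathbf{t})}{m_p!} \sum_{\mathbf{m}|m_p} \binom{m_p}{m_0, m_1, \ldots, m_{k_p-1}} \prod_{j=0}^{k_p-1} \partial_z^{m_j} \big[ H(z,t_j,t_p) \big]_{z=0} \\ &= \frac{C_{k_p}(\mathbf{t})}{m_p!} \sum_{\mathbf{m}|m_p} \frac{m_p!}{\prod_i m_i!} \prod_{j=0}^{k_p-1} (m_j+1)!\, \Bigg[ H(z,t_j,t_p) \left( \frac{\eta(t_j,t_p)}{1-z\,\eta(t_j,t_p)} \right)^{m_j} \Bigg]_{z=0} \\ &= C_{k_p}(\mathbf{t}) \prod_{j=0}^{k_p-1} H(0,t_j,t_p) \sum_{\mathbf{m}|m_p} \frac{1}{\prod_{i=0}^{k_p-1} m_i!} \prod_{j=0}^{k_p-1} (m_j+1)!\,\eta^{m_j}(t_j,t_p) \\ &= \prod_{j=2}^{k_p-1} j\,\lambda(t_j) \prod_{j=0}^{k_p-1} \big(1-\xi(t_j,t_p)\big)\big(1-\eta(t_j,t_p)\big) \sum_{\mathbf{m}|m_p}\; \prod_{j=0}^{k_p-1} (m_j+1)\,\eta^{m_j}(t_j,t_p) \\ &= (k_p-1)!\,\big(1-\xi(t_c,t_p)\big)^2 \big(1-\eta(t_c,t_p)\big)^2 \prod_{j=2}^{k_p-1} \lambda(t_j)\,\big(1-\xi(t_j,t_p)\big)\big(1-\eta(t_j,t_p)\big) \\ &\quad \times \sum_{\mathbf{m}|m_p}\; \prod_{j=0}^{k_p-1} (m_j+1)\,\eta^{m_j}(t_j,t_p). \end{aligned} \tag{5.11} \]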
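To evaluate the denominator of Eq. (3.5), we have to extract $Q^{k=2}_m(t_p)$ from the generating function,
\[ Q^{k=2}_m(t_p) = \frac{1}{m!}\,\partial_z^m \big[ F_2(z,t_p) \big]_{z=0} = \frac{1}{m!}\,\partial_z^m \big[ H^2(z,t_c,t_p) \big]_{z=0}. \]
Substituting into Eq. (3.6) and using Eq. (5.10), we get
\[ \begin{aligned} P_c(t_c,t_p) &= \sum_{m=0}^{\infty} \frac{6}{(m+2)(m+3)}\,\frac{1}{m!}\,\partial_z^m \big[ H^2(z,t_c,t_p) \big]_{z=0} \\ &= H^2(0,t_c,t_p) \sum_{m=0}^{\infty} (m+1)\,\eta^m(t_c,t_p) \\ &= \big(1-\xi(t_c,t_p)\big)^2. \end{aligned} \tag{5.12} \]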
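Finally, substituting Eqs. (5.11) and (5.12) into the likelihood formula (3.5) of Claim 3.1, we obtain
\[ \begin{aligned} L_{k_p,\mathbf{t},m_p} ={}& \frac{(k_p-1)!}{\binom{k_p+m_p}{m_p}}\,\big(1-\eta(t_c,t_p)\big)^2 \prod_{j=2}^{k_p-1} \lambda(t_j)\,\big(1-\xi(t_j,t_p)\big)\big(1-\eta(t_j,t_p)\big) \\ &\times \sum_{\mathbf{m}|m_p}\; \prod_{j=0}^{k_p-1} (m_j+1)\,\eta^{m_j}(t_j,t_p), \end{aligned} \tag{5.13} \]
which is identical to likelihood formula (4.1). This concludes the proof of the theorem.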
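The statement of Theorem 5.1 can also be checked numerically on a small example. The sketch below evaluates the likelihood of Eq. (3.5) by integrating the ODE system (3.2) and compares it with the closed form (4.1) for a hypothetical tree with four tips and constant rates. The constants, the truncation bound `M_MAX`, the step size and all function names are assumptions for this illustration; agreement is expected only up to integration and truncation error.

```python
import math
from math import comb, factorial
from itertools import product

LAM, MU = 0.5, 0.2                 # hypothetical constant rates
T_C, T_P = 0.0, 2.0                # crown age and present
BRANCH_TIMES = [0.7, 1.3]          # t_2, t_3, so k_p = 4 tips
M_MAX = 60                         # truncation of the number of missing species

def rhs(Q, k):
    """Eq. (3.2) with diversity-independent rates."""
    out = []
    for m in range(len(Q)):
        up = MU * (m + 1) * Q[m + 1] if m + 1 < len(Q) else 0.0
        down = LAM * (m - 1 + 2 * k) * Q[m - 1] if m > 0 else 0.0
        out.append(up + down - (LAM + MU) * (m + k) * Q[m])
    return out

def propagate(Q, k, duration, h_max=0.005):
    """Operator A_k, approximated by RK4 steps."""
    steps = max(1, math.ceil(duration / h_max))
    h = duration / steps
    for _ in range(steps):
        k1 = rhs(Q, k)
        k2 = rhs([q + 0.5 * h * d for q, d in zip(Q, k1)], k)
        k3 = rhs([q + 0.5 * h * d for q, d in zip(Q, k2)], k)
        k4 = rhs([q + h * d for q, d in zip(Q, k3)], k)
        Q = [q + h / 6.0 * (a + 2 * b + 2 * c + d)
             for q, a, b, c, d in zip(Q, k1, k2, k3, k4)]
    return Q

def framework_likelihoods(max_mp):
    """Eq. (3.5) for m_p = 0..max_mp, via Eqs. (3.1) and (3.6)."""
    Q, k, t = [1.0] + [0.0] * M_MAX, 2, T_C
    for t_next in BRANCH_TIMES:
        Q = [k * LAM * q for q in propagate(Q, k, t_next - t)]   # B_k after A_k
        k, t = k + 1, t_next
    Q = propagate(Q, k, T_P - t)
    Q2 = propagate([1.0] + [0.0] * M_MAX, 2, T_P - T_C)
    p_c = sum(6.0 / ((m + 2) * (m + 3)) * q for m, q in enumerate(Q2))
    return [Q[m] / (comb(k + m, m) * p_c) for m in range(max_mp + 1)]

def xi_eta(t):
    e = math.exp(-(LAM - MU) * (T_P - t))
    d = LAM - MU * e
    return MU * (1.0 - e) / d, LAM * (1.0 - e) / d

def closed_form_likelihood(m_p):
    """Eq. (4.1) with constant rates."""
    times = [T_C, T_C] + BRANCH_TIMES
    k_p = len(times)
    xe = [xi_eta(t) for t in times]
    value = factorial(k_p - 1) / comb(k_p + m_p, m_p) * (1.0 - xe[0][1]) ** 2
    for xi, eta in xe[2:]:
        value *= LAM * (1.0 - xi) * (1.0 - eta)
    total = 0.0
    for m in product(range(m_p + 1), repeat=k_p):
        if sum(m) == m_p:
            term = 1.0
            for m_j, (_, eta) in zip(m, xe):
                term *= (m_j + 1) * eta ** m_j
            total += term
    return value * total

numeric = framework_likelihoods(2)
analytic = [closed_form_likelihood(m_p) for m_p in range(3)]
print(numeric, analytic)
```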
6 A note on sampling a fraction of extant species
Nee et al. [9] noted that one way to model the sampling of extant species is equivalent to a mass extinction just before the present. This sampling model corresponds to sampling each extant species with a given probability $f_p$, which has also been called $\rho$-sampling [7]. We use the link with mass extinction to extend the previous formula for $n$-sampling to the case of $\rho$-sampling.
First, we formulate the $\rho$-sampling version of Claim 3.1.
Claim 6.1 Consider the diversity-dependent diversification model, given by speciation rates $\lambda_n(t)$ and extinction rates $\mu_n(t)$. The diversification process starts at crown age $t_c$ with two ancestor species, and ends at the present time $t_p$, at which extant species are sampled with probability $f_p$. Then, the likelihood of a phylogenetic tree with $k_p$ tips and branching times $\mathbf{t}$, conditional on the event that both crown lineages survive until the present, is equal to
\[ L_{k_p,\mathbf{t}} = \frac{P_s(t_c,\mathbf{t},t_p,f_p)}{P_c(t_c,t_p)}. \tag{6.1} \]
The term $P_s(t_c,\mathbf{t},t_p,f_p)$ in the numerator, where the subscript $s$ stands for sampling, is equal to
\[ P_s(t_c,\mathbf{t},t_p,f_p) = \sum_{m=0}^{\infty} f_p^{k_p}\,(1-f_p)^m\,Q^{k_p}_m(t_p), \tag{6.2} \]
where $Q^{k_p}_m(t_p)$ is obtained from Eq. (3.1). The term $P_c(t_c,t_p)$ in the denominator, where the subscript $c$ stands for conditioning, is equal to
\[ P_c(t_c,t_p) = \sum_{m=0}^{\infty} \frac{6}{(m+2)(m+3)}\,Q^{k=2}_m(t_p), \tag{6.3} \]
where $Q^{k=2}_m(t_p)$ is again obtained from Eq. (3.1).
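Given truncated vectors for $Q^{k_p}(t_p)$ and $Q^{k=2}(t_p)$, the expressions (6.1) to (6.3) amount to two weighted sums and a ratio. A minimal sketch follows; the function name and the toy input vectors are assumptions made for illustration.

```python
def likelihood_rho_sampling(Q_kp, Q2, k_p, f_p):
    """Eq. (6.1): ratio of P_s (Eq. (6.2)) and P_c (Eq. (6.3)).

    Q_kp and Q2 are truncated vectors Q^{k_p}_m(t_p) and Q^{k=2}_m(t_p), m = 0..M.
    """
    p_s = sum(f_p ** k_p * (1.0 - f_p) ** m * q for m, q in enumerate(Q_kp))
    p_c = sum(6.0 / ((m + 2) * (m + 3)) * q for m, q in enumerate(Q2))
    return p_s / p_c

# Toy vectors (not derived from any real tree), chosen so that P_c = 1.
half_sampled = likelihood_rho_sampling([0.2, 0.1], [1.0, 0.0], k_p=3, f_p=0.5)
fully_sampled = likelihood_rho_sampling([0.2, 0.1], [1.0, 0.0], k_p=3, f_p=1.0)
print(half_sampled, fully_sampled)
```

With $f_p = 1$ only the $m = 0$ term survives, so the expression reduces to $Q^{k_p}_0(t_p)/P_c$.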
Next, we establish as a reference the likelihood formula for $\rho$-sampling in the diversity-independent case.
Proposition 6.1 Consider the diversity-independent diversification model, given by speciation rate $\lambda(t)$ and extinction rate $\mu(t)$. The diversification process starts at crown age $t_c$ with two ancestor species, and ends at the present time $t_p$, at which extant species are sampled with probability $f_p$. Then, the likelihood of a phylogenetic tree with $k_p$ tips and branching times $\mathbf{t}$, conditional on the event that both crown lineages survive until the present, is equal to
\[ L^{(\text{div-indep})}_{k_p,\mathbf{t}} = (k_p-1)!\,\big(1-\tilde\eta(t_c,t_p)\big)^2 \prod_{i=2}^{k_p-1} \lambda(t_i)\,\big(1-\tilde\xi(t_i,t_p)\big)\big(1-\tilde\eta(t_i,t_p)\big). \tag{6.4} \]
The functions $\tilde\xi(t,t_p)$ and $\tilde\eta(t,t_p)$ are given by
\[ \tilde\xi(t,t_p) = 1 - \frac{f_p}{\alpha(t,t_p) + f_p \int_t^{t_p} \lambda(s)\,\alpha(t,s)\,ds} = 1 - \frac{f_p}{f_p + (1-f_p)\,\alpha(t,t_p) + f_p \int_t^{t_p} \mu(s)\,\alpha(t,s)\,ds}, \tag{6.5} \]
\[ \tilde\eta(t,t_p) = 1 - \frac{\alpha(t,t_p)}{\alpha(t,t_p) + f_p \int_t^{t_p} \lambda(s)\,\alpha(t,s)\,ds} = 1 - \frac{\alpha(t,t_p)}{f_p + (1-f_p)\,\alpha(t,t_p) + f_p \int_t^{t_p} \mu(s)\,\alpha(t,s)\,ds}, \tag{6.6} \]
with
\[ \alpha(t,s) = \exp\!\left( \int_t^{s} \big(\mu(s') - \lambda(s')\big)\,ds' \right). \]
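For constant rates the integrals in Eqs. (6.5) and (6.6) can be done by hand, which gives a quick way to check that the two forms of each denominator agree and that $f_p = 1$ recovers the functions $\xi$ and $\eta$ of Eqs. (4.2) and (4.3). The sketch below assumes constant rates with $\lambda \neq \mu$; the function name and the parameter values are illustrative.

```python
import math

def xi_eta_rho(t, t_p, lam, mu, f_p):
    """Modified functions of Eqs. (6.5)-(6.6) for constant rates (lam != mu)."""
    r = lam - mu
    alpha = math.exp(-r * (t_p - t))            # alpha(t, t_p)
    int_lam = lam * (1.0 - alpha) / r           # int_t^{t_p} lam * alpha(t, s) ds
    int_mu = mu * (1.0 - alpha) / r             # int_t^{t_p} mu * alpha(t, s) ds
    denom_lam = alpha + f_p * int_lam                       # first form
    denom_mu = f_p + (1.0 - f_p) * alpha + f_p * int_mu     # second form
    assert math.isclose(denom_lam, denom_mu, rel_tol=1e-10)
    return 1.0 - f_p / denom_lam, 1.0 - alpha / denom_lam

# Hypothetical parameter values.
xi_full, eta_full = xi_eta_rho(0.0, 2.0, 0.5, 0.2, f_p=1.0)
xi_half, eta_half = xi_eta_rho(0.0, 2.0, 0.5, 0.2, f_p=0.5)
print(xi_full, eta_full, xi_half, eta_half)
```

Lowering $f_p$ increases $\tilde\xi$, in line with the interpretation of incomplete sampling as an additional extinction event at the present.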
Proof We use the equivalence between $\rho$-sampling and a mass extinction, see Ref. [9], Eq. (31). We introduce a modified extinction rate $\tilde\mu(t)$ containing a delta function just before the present,
\[ \tilde\mu(t) = \mu(t) - \ln f_p\;\delta(t-t_p). \tag{6.7} \]
The likelihood formula is then obtained by setting $m_p = 0$ in Eq. (4.1), while evaluating the functions $\xi(t,t_p)$ and $\eta(t,t_p)$ with the modified extinction rate $\tilde\mu(t)$. This establishes Eq. (6.4); it remains to be proven that the modified functions $\tilde\xi(t,t_p)$ and