Additional Analytical Support for a New Method to Compute the Likelihood of Diversification Models

(1)

HAL Id: hal-02941245

https://hal.archives-ouvertes.fr/hal-02941245

Submitted on 19 Nov 2020

HAL

is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire

HAL, est

destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Additional Analytical Support for a New Method to Compute the Likelihood of Diversification Models

Giovanni Laudanno, Bart Haegeman, Rampal Etienne

To cite this version:

Giovanni Laudanno, Bart Haegeman, Rampal Etienne. Additional Analytical Support for a New

Method to Compute the Likelihood of Diversification Models. Bulletin of Mathematical Biology,

Springer Verlag, 2020, 82 (22), 22 p. �10.1007/s11538-020-00698-y�. �hal-02941245�

(2)

(will be inserted by the editor)

Additional analytical support for a new method to compute

1

the likelihood of diversification models

2

Giovanni Laudanno · Bart Haegeman ·

3

Rampal S. Etienne

4

5

Received: date / Accepted: date

6

Abstract

7

Molecular phylogenies have been increasingly recognized as an important source of

8

information on species diversification. For many models of macro-evolution, analyt-

9

ical likelihood formulas have been derived to infer macro-evolutionary parameters

10

from phylogenies. A few years ago, a general framework to numerically compute

11

such likelihood formulas was proposed, which accommodates models that allow spe-

12

ciation and/or extinction rates to depend on diversity. This framework calculates the

13

likelihood as the probability of the diversification process being consistent with the

14

phylogeny from the root to the tips. However, while some readers found the frame-

15

work presented in Etienne et al. (2012) convincing, others still questioned it (personal

16

communication), despitenumerical evidence that for special cases the framework

17

yields the same (i.e. within double precision) numerical value for the likelihood as

18

analytical formulas do that were independently derived for these special cases. Here

19

we proveanalyticallythat the likelihoods calculated in the new framework are correct

20

for all special cases with known analytical likelihood formula. Our results thus add

21

substantial mathematical support for the overall coherence of the general framework.

22

Keywords Diversification·Birth-death process

23

Giovanni Laudanno

Groningen Institute for Evolutionary Life Sciences, Box 11103, 9700 CC, Groningen, The Netherlands E-mail: glaudanno@gmail.com

Bart Haegeman

Centre for Biodiversity Theory and Modelling, Theoretical and Experimental Ecology Station, CNRS and Paul Sabatier University, Moulis, France

Rampal S. Etienne

Groningen Institute for Evolutionary Life Sciences, Box 11103, 9700 CC, Groningen, The Netherlands

(3)

Mathematics Subject Classification (2000):Primary: 60J80; Secondary: 92D15,

24

92B10, 60J85, 92D40

25

1 Introduction

26

One of the major challenges in the field of macro-evolution is understanding the

27

mechanisms underlying patterns of diversity and diversification. A very fruitful ap-

28

proach has been to model macro-evolution as a birth-death process which reduces the

29

problem to the specification of macroevolutionary events (i.e. speciation and extinc-

30

tion). However, providing likelihood expressions for these models given empirical

31

data on speciation and extinction events is quite challenging, for the following reason.

32

While such a likelihood is very easy to derive when full information is available for

33

all events, typically the data involves phylogenetic trees constructed with molecular

34

data collected from extant species alone. Hence, no extinction events and speciation

35

events leading to extinct species are recorded in such phylogenetic trees. For a vari-

36

ety of models this problem can be overcome by considering a reconstructed process,

37

whereby the phylogeny of extant species can be regarded as a pure-birth process with

38

time-dependent speciation rate [9]. But this approach is not generally valid.

39

Thus, the methods employed to derive likelihood expressions are usually applica-

40

ble to a limited set of models. They do not apply to models that assume that speciation

41

and extinction rates depend on the number of species in the system. Hence, potential

42

feedback of diversity itself on diversification rates, due to interspecific competition

43

or niche filling, is completely ignored. The first to incorporate such feedbacks were

44

Rabosky & Lovette [10], who made rates dependent on the number of species present

45

at every given moment in time, analogously to logistic growth models used in pop-

46

ulation biology. However, their model had to assume that there is no extinction for

47

mathematical tractability, which stands in stark contrast to the empirical data: the

48

fossil record provides us with many examples of extinct species.

49

Etienne et al. [3] presented a framework to compute the likelihood of phyloge-

50

netic branching times under a diversity-dependent diversification process that explic-

51

itly accounts for the influence of species that are not in the phylogeny, because they

52

have become extinct. Some of our colleagues have doubts that this framework con-

53

tains a formal argument that the solution of the set of ordinary differential equa-

54

tions that together constitute the framework gives the likelihood of the model for

55

a given phylogenetic tree. Instead, only numerical evidence for a small set of pa-

56

rameter combinations has been provided that the method yields, in the appropri-

57

ate limit, the known likelihood for the standard diversity-independent (i.e. using

58

constant-rates) birth-death model. This likelihood was first provided by Nee et al. [9],

59

using a breaking-the-tree approach. Later, Lambert & Stadler [8] used coalescent

60

point process theory to provide an approach to obtain likelihood formulas for a wider

61

(4)

set of models. These models did not include diversity-dependence. For example,

62

Lambert et al. [4, 7] applied their framework to the protracted birth-death model,

63

which is a generalization of the diversity-independent model where speciation is no

64

longer an instantaneous event [5]. For this model they provided an explicit likelihood

65

expression.

66

Here we provide an analytical proof that the likelihood of Etienne et al. [3] re-

67

duces to the likelihood of Lambert et al. [7] – and hence to that of Nee et al. [9] – for

68

the case of diversity-independent diversification.

69

The extant species belonging to a clade are often not all available for sequenc-

70

ing, because some species are difficult to obtain tissue from (either because they are

71

difficult to find/catch, or because they are endangered, or because they have recently

72

become extinct due to anthropogenic rather than natural causes) or because it is diffi-

73

cult to extract their DNA. This means that our data consists of a phylogenetic tree of

74

an incomplete sample of species, and thus of an incomplete set of speciation events,

75

even for those that lead to the species that we observe today. Models have been de-

76

veloped to deal with this incomplete sampling. One model assumes that we know

77

exactly how many extant species are missing from the phylogeny. This is the case for

78

well-described taxonomic groups, such as birds, where we have a good idea of the

79

species that are evolutionarily related, but we are simply missing some data points for

80

the reasons mentioned above. This sampling model is calledn-sampling [7]. Another

81

model assumes that we do not know exactly how many species are missing from the

82

phylogeny, but rather that we have an idea of the probabilityρ of a species being

83

sampled. This sampling scheme is calledρ-sampling [7], but is also referred to as

84

f-sampling [9]. The framework of Etienne et al. [3] assumesn-sampling, but in this

85

paper we show that it can also be extended to incorporateρ-sampling.

86

In the next section we summarize the framework of Etienne et al. [3] and we pro-

87

vide the likelihood formula analytically derived by Lambert et al. [7] for the special

88

case of diversity-independent but time-dependent diversification with n-sampling.

89

Then we proceed by showing that the probability generating functions of these two

90

likelihoods are identical. We end with a discussion where we point out how the frame-

91

work of Etienne et al. [3] can be extended to includeρ-sampling and how it relates to

92

the likelihood formula of Rabosky & Lovette [10] for the diversity-dependent birth-

93

death model without extinction.

94

2 The diversity-dependent diversification model

95

Diversification models are birth-death processes in which “birth” and “death” cor-

96

respond to speciation and extinction events, respectively. In the simplest case, the

97

per-species speciation rateλand the per-species extinction rateµare constants. Here

98

(5)

we consider diversification models in which the speciation and extinction rates de-

99

pend on the number of speciesnpresent at timet, i.e., diversity-dependent, which we

100

denote byλnandµn. We also allow the speciation and extinction rates to depend on

101

timet, i.e.,λn(t)andµn(t), although the latter dependence is often not explicit in our

102

notation.

103

We assume that the diversification process starts at timet_cfrom a crown, i.e., from two ancestor species. Assuming that at a later timet>t_cthe process hasnspecies, the transition probabilities in the infinitesimal time interval[t,t+dt]are

fromnton+1 species with probabilityλn(t)dt fromnton−1 species with probabilityµ_n(t)dt

number of speciesnunchanged with probability 1−λ_n(t)dt−µ_n(t)dt.

The diversification process runs until the present timet_p.

104

We denote byP_n(t)the probability that the process hasnspecies at timet. This

105

probability satisfies the following ordinary differential equation (ODE, called master

106

equation or forward Kolmogorov equation [1]),

107

dP_n(t)

dt =µn+1(n+1)P_n+1(t) +λn−1(n−1)Pn−1(t)−(λ_n+µn)nPn(t), (2.1)

where we omit in the notation the time dependence of the speciation and extinction

108

rates.

109

Sampling models

110

At the present timet_pa subset of thenextant species are observed and sampled. This

111

sampling process can been modelled in two different ways (see introduction). The

112

first model assumes that a fixed number of species is unsampled, which corresponds

113

to then-sampling scheme of Ref. [7]. That is, the number of extant species att_p

114

that are not sampled, a number we denote by m_p, is part of the data. The second

115

model assumes that each extant species at the present time is sampled with a given

116

probability, which has been called f-sampling [9] orρ-sampling [7]. In this case the

117

number of unsampled speciesm_pis a random variable, and the probability with which

118

extant species are sampled a parameter to estimate, which we denote by f_p.

119

(6)

†

† †

02468Complete tree

a)

b)

c)

number of lineages _t

c t₂ t₃ t₄ t₅t_p

Reconstructed tree

Fig. 1 a) Full tree where missing species are plotted as red dashed lines: the ones ending in a cross become extinct before the present, whereas the ones ending with a red dot are unsampled species at the present;

b) Corresponding reconstructed tree in which only extant species are present. This is the type of tree we usually work with because actual phylogenetic trees are usually obtained from molecular data taken from extant species; c) Lineages-through-time plot: the green line represents the number of lineages leading to extant species (k), the red line represents lineages leading to extinct or unsampled species (m), and the blue line represents the total number of lineages (n=k+m).

Reconstructed tree

120

A realization of the diversification process fromt_ctot_pcan be represented graphically

121

as a tree, see Figure 1. The complete tree shows all the species that have originated in

122

(7)

the process (Fig. 1a). However, in practice we have only access to the reconstructed

123

tree, i.e., the complete tree from which we remove all the species that became extinct

124

before the present or that were not sampled (Fig. 1b). While it would be straight-

125

forward to infer information about the diversification process based on the complete

126

tree, this task is much more challenging when only the reconstructed tree is available.

127

This paper deals with likelihood formulas for a reconstructed phylogenetic tree.

128

The number of tips equals the number of sampled extant speciesk_p. We assume that

129

also the number of unsampled extant species is known, a number we denote bym_p.

130

The information contained in a phylogenetic tree consists of a topology and a set

131

of branching times. For a large set of diversification models, including the diversity-

132

dependent one, all trees having the same branching times but different topologies are

133

equally probable [8]. Hence, we can discard the topology for the likelihood compu-

134

tation. We denote the vector of branching times byt= (t₂,t₃, . . . ,t_k_p₋₁), wheret_kis

135

the branching time at which the phylogenetic tree changes fromktok+1 branches.

136

It will be convenient to sett₀=t₁=t_candt_k_p =t_p.

137

Likelihood conditioning

138

It is common practice to condition the likelihood on the survival of both ancestor lin-

139

eages to the present time [9]. Indeed, we would only do an analysis on trees that have

140

actually survived to the present. In particular, we impose the crown age of the recon-

141

structed tree. This corresponds to conditioning on the event that both ancestor species

142

have extant descendant species. We do not require that these descendant species have

143

been sampled.

144

3 The framework of Etienne et al.

145

Etienne et al. [3] presented an approach to compute the likelihood of a phylogeny for

146

the diversity-dependent model. It is based on a new variable,Q^k_m(t), which they de-

147

scribed as “the probability that a realization of the diversification process is consistent

148

with the phylogeny up to timet, and hasn=m+kspecies at timet” (Ref. [3], Box

149

1), whereklineages are represented in the phylogenetic tree (because they are ances-

150

tral to one of thek_pspecies extant and sampled at present) andmadditional species

151

are present but unobserved (Fig. 1c). These species might not be in the phylogenetic

152

tree because they became extinct before the present or because they are either not

153

discovered or not sampled yet (see introduction). From hereon we will refer to these

154

species denoted bymas missing species. We cannot ignore these missing species, be-

155

cause in a diversity-dependent speciation process, they can influence the speciation

156

and extinction rates.

157

We start by describing the computation of the variableQ^k_m^p_p(t_p), which proceeds from the crown aget_cto the present timet_p. It is convenient to arrange the values

(8)

Q^k_m(t), withm=0,1,2, . . ., into the vectorQ^k(t). The initial vectorQ^k=2(t_c)is transformed into the vector Q^k(t) at a later timet as follows (Ref. [3], Appendix S1, Eq. (S1)):

Q^k(t) =A_k(tk−1,t)Bk−1(tk−1)Ak−1(tk−2,tk−1). . . A₃(t₂,t₃)B₂(t₂)A₂(t_c,t₂)Q^k=2(t_c),

withtk−1≤t≤t_k. The operators A_k and B_k are infinite-dimensional matrices that operate along the tree, on branches and nodes, respectively (Fig. 2). Continuing this computation until the present timet_p, we get

Q^k^p(t_p) =A_k_p(t_k_p₋₁,t_p)B_k_p₋₁(t_k_p₋₁)A_k_p₋₁(t_k_p₋₂,t_k_p−1). . .

A₃(t₂,t₃)B₂(t₂)A₂(t_c,t₂)Q^k=2(t_c). (3.1) Note that Eq. (3.1) generalizes Eq. (S1) of Ref. [3] to the case in which the rates are

158

time-dependent.

159

We specify the different terms appearing in the right-hand side of Eq. (3.1):

160161

– For the initial vector Q^k=2(t_c)we assume that there are no missing species at

162

crown age, that is,Q^k=2_m (t_c) =δm,0.

163

– The matrix A_kcorresponds to the dynamics ofQ^k_m(t)in the time interval[tk−1,t_k], during which the phylogenetic tree has k branches. Etienne et al. [3] argued that these dynamics are given by the following ODE system (Ref. [3], Box 1, Eq. (B2)):

dQ^k_m(t)

dt =µ_k+m+1(m+1)Q^k_m+1(t) +λk+m−1(m−1+2k)Q^k_m−1(t)

−(λ_m+k+µm+k)(m+k)Q^k_m(t), ∀m>0, dQ^k₀(t)

dt =µk+1Q^k₁(t)−(λ_k+µ_k)kQ^k₀(t), ifm=0.

(3.2) The quantityQ^k₀^p(t_p)is the likelihood for the tree at the present time, assuming

164

that all the species survived to the present have been sampled. We can collect the

165

coefficients ofQ^k_m(t)on the right-hand side of the ODE system in a matrix V_k(t).

166

If we do so, the system can be rewritten as

167

dQ^k(t)

dt =V_k(t)Q^k(t), which has solution

168

Q^k(t) =exp^Z t tk−1

V_k(s)ds

Q^k(t_k−1), so that

169

A_k(t_k−1,t) =exp ^Z t

tk−1

V_k(s)ds

. (3.3)

(9)

– The matrix B_k transforms the solution of the ODE system ending att_k into the

170

initial condition of the ODE system starting at t_k. It is a diagonal matrix with

171

componentskλk+mdt, so that

172

Q^k+1_m (t_k) = B_k(t_k)

m,mQ^k_m(t_k) =kλ_k+mdt Q^k_m(t_k). (3.4) The multiplication byλk+mdtcorresponds to the probability that a speciation oc-

173

curs in the time interval[t_k,t_k+dt]. In the likelihood expressions we will omit the

174

differential (a choice that is widely adopted across the vast majority of this kind

175

of models in the literature) as it is actually not essential in parameter estimation.

176

Therefore, we will work with a likelihood density, but for simplicity we will refer

177

to it as a likelihood.

178

We are then ready to formulate the claim made by Etienne et al. [3] (in particular, see

179

their Eqs. (S2) and (S6) in Appendix S1).

180

Claim 3.1 Consider the diversity-dependent diversification model, given by specia-

181

tion ratesλ_n(t)and extinction ratesµ_n(t). The diversification process starts at crown

182

age tc with two ancestor species, and ends at the present time tp, at which a fixed

183

number of species m_pare not sampled. A phylogenetic tree is constructed for the

184

sampled species. Then, the likelihood that the phylogenetic tree has k_ptips and vec-

185

tor of branching timest= (t₁,t₂, . . . ,t_k_p₋₁), conditional on the event that both crown

186

lineages survive until the present, is equal to

187

L_k_p,t,m_p= Q^k_m^p_p(t_p)

kp+mp

mp

P_c(t_c,t_p). (3.5)

The term Q^km^pp(t_p)in the numerator of this expression is obtained from Eq. (3.1). The

188

term P_c(t_c,t_p)in the denominator, where the subscript c stands for conditioning, is

189

equal to

190

P_c(t_c,t_p) =

∞

∑

m=0

6

(m+2)(m+3)Q^k=2_m (t_p), (3.6) where Q^k=2_m (t_p)is again obtained from Eq. (3.1).

191

The structure of the likelihood expression (3.5) can be understood intuitively. It

192

is proportional toQ^km^pp(t_p), which in Etienne et al.’s interpretation is the probability

193

that the diversification process generates the phylogenetic tree withk_ptips andm_p

194

missing species at present timet_p. The combinatorial factor ^k^p_m^+m^p

p

accounts for the

195

number of ways to selectm_pmissing species out of a total pool ofk_p+m_pspecies.

196

The factorPc(t_c,t_p)is the probability that both ancestor species at crown agetchave

197

descendant species at the present timet_p. Hence, this factor applies the likelihood

198

conditioning.

199

(10)

Etienne et al. [3] provided numerical evidence that Claim 3.1 is in agreement

200

with the likelihood provided by Nee et al. [9] under the hypothesis of diversity-

201

independent speciation and extinction rates and no missing species at the present.

202

However, a rigorous analytical proof, even for this specific case, has not yet been

203

given. In this paper we show that Claim 3.1 holds (1) for the diversity-independent

204

(but possibly time-dependent) case and (2) for the diversity-dependent case without

205

extinction (i.e., extinction rateµ=0).

206

_p

,t

₃

)

t

_c

t

₂

t

₃

t

_p

Fig. 2 An example of how to build a likelihood for a tree withk_p=4 tips. We start with a vector Q²(tc) at the crown age. We use Ak(tk,tk−1) and Bk(tk) to evolve the vector across the en- tire tree (on branches and nodes, respectively) up to the present time t_p according to Q⁴(t4) = A4(t4,t3)B3(t3)A3(t3,t2)B2(t2)A2(t2,tc)Q²(tc). At the present time the likelihood accounting formpmiss- ing species will be proportional to them_p-th component of the vectorL_4,m_p∝Q⁴_m

p(tp).

(11)

4 The likelihood for the diversity-independent case

207

Claim 3.1 proposes a likelihood expression for the case with a known number of

208

unsampled species at the present, i.e., it accounts forn-sampling. For the diversity-

209

independent case, i.e.,λn(t) =λ(t)andµn(t) =µ(t), the likelihood is contained in

210

a more general result established by Lambert et al. [7]. In the following proposition

211

we derive an explicit likelihood expression by restricting the result of Lambert et al.

212

to the diversity-independent case.

213

Proposition 4.1 Consider the diversity-independent diversification model, given by speciation ratesλ(t)and extinction rate µ(t). The diversification process starts at crown age t_cwith two ancestor species, and ends at the present time t_p, at which a fixed number of species m_pare not sampled. A phylogenetic tree is constructed for the sampled species. Then, the likelihood that the phylogenetic tree has k_ptips and vector of branching times t= (t₁,t₂, . . . ,t_k_p−1), conditional on the event that both crown lineages survive until the present, is equal to

L(div-indep)

kp,t,mp =(k_p−1)!

kp+mp

mp

(1−η(t_c,t_p))²

kp−1

∏

i=2

λ(t_i)(1−ξ(t_i,t_p))(1−η(t_i,t_p))

m|m

∑

_p

kp−1

∏

j=0

(m_j+1)(η(t_j,t_p))^m^j, (4.1) where we used the convention t₀=t₁=t_c. The components m_j(with j=0,1, . . . ,k_p−

214

1) of the vectorsm, in the sum on the second line, are non-negative integers satisfying

215

∑^k_j=0^p⁻¹m_j=m_p. The functionsξ(t,t_p)andη(t,t_p)are given by

216

ξ(t,t_p) =1− 1

α(t,t_p) +^R_t^t^pλ(s)α(t,s)ds

=1− 1

1+^R_t^t^pµ(s)α(t,s)ds (4.2) η(t,t_p) =1− α(t,t_p)

α(t,t_p) +^R_t^t^pλ(s)α(t,s)ds

=1− α(t,t_p)

1+^R_t^t^pµ(s)α(t,s)ds, (4.3) with

217

α(t,s) =exp^Z s t

(µ(s⁰)−λ(s⁰))ds⁰ .

The functionsξ(t,t_p)andη(t,t_p)are those appearing in Kendall’s solution of the

218

birth-death model (see Ref. [6], Eqs. (10–12)), and are useful to describe the process

219

when time-dependent rates are involved. Given the probabilityP_n(t,t_p)of realizing a

220

(12)

process starting with 1 species at timetand ending withnspecies at timet_p, we have

221

thatξ(t,t_p) =P0(t,t_p)andη(t,t_p) =^Pⁿ^?⁺¹^(t,t^p⁾

P_n?(t,t_p) for anyn^?>0.

222

Proof The likelihood forn-sampling was originally provided by Ref. [7], Eq. (7), but we start from the explicit version provided in Ref. [4], Eq. (1), see corrigendum in Ref. [2],

L(div-indep)

kp,t,mp =(k_p−1)!

kp+mp

mp

(g(t_c,tp))²

kp−1

∏

i=2

f(t_i,t_p)

∑

m|mp

kp−1

∏

j=0

(m_j+1)(1−g(t_j,tp))^m^j. (4.4) Lambert et al. [4, 7] specify the functions f(t,t_p)andg(t,t_p)as the solution of a system of ODEs for the case of protracted speciation, a model where speciation does not take place instantaneously but is initiated and needs time to complete. The standard diversification model is then obtained by taking the limit in which the speciation- completion rate tends to infinity. In this limit the four-dimensional system of Lambert et al. [4], Eq. (2), reduces to a two-dimensional system of ODEs,

f(t,t_p) =dg(t,t_p)

dt =λ(t) (1−q(t,tp))g(t,tp) dq(t,t_p)

dt =−µ(t) + (µ(t) +λ(t))q(t,tp)−λ(t)q²(t,t_p).

Note that in this paper timetruns from past to present rather than from present to past

223

as in Lambert et al. [4]. The conditions at the present timetpare given byg(t_p,t_p) =1

224

andq(t_p,t_p) =0.

225

The solution of this system of ODEs can be expressed in terms ofη(t,t_p)and ξ(t,t_p),

f(t,t_p) =λ(t) (1−ξ(t,t_p)) (1−η(t,t_p)) g(t,t_p) =1−η(t,t_p)

q(t,tp) =ξ(t,t_p),

which can be checked using the derivatives of the expressions 4.3 and 4.2

∂ η(t,t_p)

∂t =−λ(t) (1−ξ(t,t_p)) (1−η(t,t_p))

∂ ξ(t,t_p)

∂t =−(µ(t)−λ(t)ξ(t,t_p)) (1−ξ(t,t_p)).

Substituting the functions f(t,t_p)andg(t,t_p)into the likelihood expression (4.4) con-

226

cludes the proof.

227

The functions ξ(t,t_p)andη(t,tp)are directly related to the functions used by

228

Nee et al. [9]. In particular, the functions they denoted byP(t,t_p)andu_t correspond

229

in our notation to 1−ξ(t,t_p)andη(t,t_p), respectively.

230

(13)

This correspondence allows us to get an intuitive understanding of the likelihood

231

expression (4.1). First consider the case without missing species. Settingm_p=0, we

232

get

233

L(div-indep)

kp,t,0 = (k_p−1)!(1−η(t_c,t_p))²

kp−1

∏

i=2

λ(t_i)(1−ξ(t_i,t_p))(1−η(t_i,t_p)), which is identical to the breaking-the-tree likelihood of Nee et al. (Ref. [9], Eq. (20)).

234

In the latter approach the phylogenetic tree is broken into single branches: two for

235

the interval [t_c,t_p] and one for each interval [t_i,t_p] with i=2,3, . . . ,k_p−1. Each

236

branch contributes a factor(1−ξ(t_i,t_p))(1−η(t_i,t_p)), equal to the probability that

237

the branch starting atti has a single descendant species attp. For the two branches

238

originating att_c, the factor(1−ξ(t_i,t_p)), equal to the probability of having (one or

239

more) descendant species, drops due to the conditioning. For the other branches, there

240

is an additional factorλ(t_i)for the speciation events.

241

Next consider the case with missing species. Each of the branches resulting from

242

breaking the tree can contribute species to the pool ofm_pmissing species. For the

243

branch over the interval[t_j,t_p], there arem_j such species, each contributing a factor

244

η(t_j,t_p)to the likelihood. Indeed,(1−ξ(t_j,t_p))(1−η(t_j,t_p))(η(t_j,t_p))^m^j is equal

245

to probability of having exactlymj+1 descendant species at the present time. One

246

of these species is represented in the phylogenetic tree, justifying the combinatorial

247

factor(m_j+1)in the second line of Eq. (4.1).

248

Finally, we recall the expressions for the functionsξ(t,t_p)andη(t,t_p)in the case of constant rates,λ(t) =λ andµ(t) =µ,

ξ(t,tp) =µ 1−e^{−(λ−µ)(t}^p^−t) λ−µe^−(λ−µ^)(t^p^−t) η(t,t_p) =λ 1−e^{−(λ−µ)(t}^p^−t)

λ−µe^{−(λ−µ)(t}^p^−t) . 5 Equivalence for the diversity-independent case

249

Likelihood formula (4.1) allows speciation and extinction rates to be arbitrary func-

250

tions of time,λ(t)andµ(t). Here we show that, for the diversity-independent case,

251

we find the same likelihood formula with the approach of Etienne et al. [3].

252

Theorem 5.1 Claim 3.1 holds for the diversity-independent case.

253

Proof The proof relies heavily on generating functions. First, we introduce the

254

generating function for the variablesQ^k_m(t),

255

F_k(z,t) =

∞ m=0

∑

z^mQ^k_m(t). (5.1)

(14)

The set of ODEs satisfied byQ^k_m(t), Eq. (3.2), transforms into a partial differential equation (PDE) for the generating functionF_k(z,t),

∂F_k(z,t)

∂t = (µ(t)−zλ(t))(1−z)∂F_k(z,t)

∂z +k(2zλ(t)−λ(t)−µ(t))F_k(z,t)

=c(z,t)∂F_k(z,t)

∂z +k∂c(z,t)

∂z F_k(z,t), (5.2)

with

256

c(z,t) = (µ(t)−zλ(t))(1−z).

Note that the number of brancheskchanges at each branching time, so that the PDE

257

forF_k(z,t)is valid only fortk−1≤t≤t_k(corresponding to the operatorA_k). At branch-

258

ing timet_k, the solutionF_k(z,t_k)has to be transformed to provide the initial condition

259

for the PDE forF_k+1(z,t)at timet_k(corresponding to the operatorB_k). Using Eq. (3.4)

260

and dropping the differential, we get

261

F_k+1(z,t_k) =kλ(t_k)F_k(z,t_k). (5.3) The initial condition at crown age isF₂(z,t_c) =1 becauseQ^k=2_m (t_c) =δm,0.

262

Next, we defineP_n(s,t)as the probability that the birth-death process that started

263

with one species at time s has n species at time t. The corresponding generating

264

function is defined as,

265

G(z,s,t) =

∞ n=0

∑

zⁿPn(s,t). (5.4)

The set of ODEs satisfied byP_n(s,t), Eq. (2.1), transforms into a PDE,

266

∂G(z,s,t)

∂t =c(z,t)∂G(z,s,t)

∂z . (5.5)

Its solution was given by Kendall (Ref. [6], Eq. (9)),

267

G(z,s,t) =ξ(s,t) + (1−ξ(s,t)−η(s,t))z

1−zη(s,t) , (5.6) whereξ(s,t)andη(s,t)are given in Eqs. (4.2) and (4.3).

268

The generating functionF_k(z,t)can be expressed in terms of the generating func-

269

tionG(z,s,t), as shown in the following lemma.

270

Lemma 5.1 The generating function F_k(z,t)of the variables Q^k_m(t)is given by

271

F_k(z,t) =H²(z,t_c,t)

k−1

∏

j=2

jλ_j(t_j)H(z,t_j,t) (5.7) with

272

H(z,s,t) =∂G(z,s,t)

∂z =(1−ξ(s,t)) (1−η(s,t))

(1−zη(s,t))² . (5.8)

(15)

To prove the lemma, let us suppose that the solution of Eq. (5.2) is of the form,

273

F_k(z,t) =C_k(t)

k−1

∏

j=0

∂_zG(z,t_j,t) (5.9)

whereC_k(t)is a constant depending on the branching times. We used the convention

274

t₀=t₁=t_cand the short-hand notation∂zfor the partial derivative with respect toz.

∂zG(z,t_j,t).

We substitute these expressions into the PDE, Eq. (5.2),

k−1

∑

i=0

∂t∂zG(z,t_i,t)

k−1 j6=i,

∏

j=0

∂zG(z,t_j,t)

=c(z,t)

k−1 i=0

∑

∂_z²G(z,t_i,t)

k−1 j6=i,

∏

j=0

∂zG(z,t_j,t)

+k∂_zc(z,t)1 k

k−1

∑

i=0

∂_zG(z,t_i,t)

k−1 j6=i,

∏

j=0

∂_zG(z,t_j,t).

This equation is satisfied if the following equation is satisfied for everyi=0,1, . . . ,k,

∂t∂zG(z,ti,t)

k−1 j6=i,

∏

j=0

∂zG(z,t_j,t)

=c(z,t)∂_z²G(z,t_i,t)

k−1 j6=i,

∏

j=0

∂zG(z,t_j,t)

+∂_zc(z,t)∂_zG(z,t_i,t)

k−1 j6=i,

∏

j=0

∂zG(z,t_j,t).

This is the case if

277

∂_t∂_zG(z,t_i,t) =c(z,t)∂_z²G(z,t_i,t) +∂_zc(z,t)∂_zG(z,t_i,t), or, equivalently, if

278

∂z

∂tG(z,t_i,t)

=∂z

c(z,t)∂_zG(z,t_i,t) .

(16)

This is an identity becauseG(z,t_i,t) satisfies Eq. (5.5).

279

Next, we verify that the constantsC_k(t)can be determined such that initial con-

280

ditions (5.3) are satisfied. This is indeed the case if we take

281

C_k(t) =

k−1

∏

j=2

jλ(t_j).

Introducing the functionH(z,s,t)and usingt₀=t₁=t_ccomplete the proof of the

282

lemma.

283

Next, we use Eq. (5.7) to derive an explicit expression for the likelihood (3.5) of

284

Claim 3.1. It will be useful to have explicit expressions for derivatives of the function

285

H(z,s,t). It follows from Eq. (5.8) that

286

1 a!∂_z^a

H^b(z,t_j,t)

=

a+2b−1 a

H^b(z,t_j,t)

η(t_j,t) 1−zη(t_j,t)

a

, (5.10)

whereaandbare positive integers.

287

To evaluate the numerator of Eq. (3.5), we have to extractQ^k_m^p_p(t_p)from the generating functionF_k_p(z,t_p). Using Leibniz’ rule,

Q^k_m^p_p(t_p) = 1 mp!∂z^m^p

F_k_p(z,t_p)

z=0

=C_k_p(t) m_p! ∂z^m^p

"_k

p−1

∏

j=0

H(z,t_j,t_p)

#

z=0

=C_k_p(t) m_p!

∑

m|m_p

m_p m₀,m₁, . . . ,m_k_p−1

k_p−1

∏

j=0

∂_z^m^j[H(z,t_j,t_p)]_z=0

=C_k_p(t) m_p!

∑

m|m_p

m_p!

∏im_i!

kp−1

∏

j=0

(m_j+1)!

H(z,t_j,t_p)

∑

_p

kp−1

∏

j=0

(m_j+1)η^m^j(t_j,t_p). (5.11)

To evaluate the denominator of Eq. (3.5), we have to extractQ^k=2_m (t_p)from the

288

generating function,

289

Q^k=2_m (t_p) = 1 m!∂_z^m

F₂(z,t_p)

z=0= 1 m!∂_z^m

H²(z,t_c,t_p)

z=0.

(17)

Substituting into Eq. (3.6) and using Eq. (5.10), we get P_c(t_c,t_p) =

∞ m=0

∑

6 (m+2)(m+3)

1 m!∂_z^m

H²(z,t_c,t_p)

z=0

=H²(0,t_c,t_p)

∞ m=0

∑

(m+1)η^m(t_c,t_p)

= (1−ξ(t_c,t_p))². (5.12)

Finally, substituting Eqs. (5.11) and (5.12) into the likelihood formula (3.5) of Claim 3.1,

L_k_p_,t,m_p=(k_p−1)!

kp+m_p mp

(1−η(t₁,t_p))²

kp−1

∏

j=2

λ(t_j)(1−ξ(t_j,t_p))(1−η(t_j,t_p))

m|m

∑

_p

kp−1

∏

j=0

(m_j+1)η^m^j(t_j,t_p), (5.13)

which is identical to likelihood formula (4.1). This concludes the proof of the theo-

290

rem.

291

6 A note on sampling a fraction of extant species

292

Nee et al. [9] noted that one way to model the sampling of extant species is equivalent

293

to a mass extinction just before the present. This sampling model corresponds to

294

sampling each extanct species with a given probability f_p, which has also been called

295

ρ-sampling [7]. We use the link with mass extinction to extend the previous formula

296

forn-sampling to the case ofρ-sampling.

297

First, we formulate theρ-sampling version of Claim 3.1.

298

Claim 6.1 Consider the diversity-dependent diversification model, given by specia-

299

tion ratesλ_n(t)and extinction ratesµ_n(t). The diversification process starts at crown

300

age t_c with two ancestor species, and ends at the present time t_p, at which extant

301

species are sampled with probability f_p. Then, the likelihood of a phylogenetic tree

302

with k_ptips and branching timest, conditional on the event that both crown lineages

303

survive until the present, is equal to

304

L_k_p_,t=P_s(t_c,t,t_p,f_p)

Pc(t_c,t_p) . (6.1)

The term P_s(t_c,t,t_p,f_p)in the numerator, where the subscript s stands for sampling,

305

is equal to

306

P_s(t_c,t,t_p,f_p) =

∞ m=0

∑

f_p^k^p(1−f_p)^mQ^k_m^p(t_p), (6.2)

(18)

where Q^k_m^p(t_p)is obtained from Eq. (3.1). The term P_c(t_c,t_p)in the denominator, where

307

the subscript c stands for conditioning, is equal to

308

P_c(t_c,t_p) =

∞ m=0

∑

6

(m+2)(m+3)Q^k=2_m (t_p), (6.3) where Q^k=2_m (t_p)is again obtained from Eq. (3.1).

309

Next, we establish as a reference the likelihood formula forρ-sampling in the

310

diversity-independent case.

311

Proposition 6.1 Consider the diversity-independent diversification model, given by

312

speciation ratesλ(t)and extinction ratesµ(t). The diversification process starts at

313

crown age t_c with two ancestor species, and ends at the present time t_p, at which

314

extant species are sampled with probability f_p. Then, the likelihood of a phylogenetic

315

tree with k_p tips and branching times t, conditional on the event that both crown

316

lineages survive until the present, is equal to

317

L(div-indep)

kp,t = (k_p−1)!(1−ηe(t_c,t_p))²

kp−1

∏

i=2

λ(t_i)(1−ξe(t_i,t_p))(1−η(te _i,t_p)). (6.4) The functionsξe(t,t_p)andη(t,te _p)are given by

ξe(t,t_p) =1− f_p

α(t,t_p) +f_p^R_t^t^pλ(s)α(t,s)ds

=1− f_p

fp+ (1−fp)α(t,t_p) +fpRtp

t µ(s)α(t,s)ds (6.5) η(t,te _p) =1− α(t,t_p)

α(t,t_p) +fpRtp

t λ(s)α(t,s)ds

=1− α(t,t_p)

fp+ (1−fp)α(t,t_p) +fpRtp

t µ(s)α(t,s)ds, (6.6) with

318

α(t,s) =exp^Z ^s

t

(µ(s⁰)−λ(s⁰))ds⁰ .

Proof We use the equivalence betweenρ-sampling and a mass extinction, see

319

Ref. [9], Eq. (31). We introduce a modified extinction rate µ(t)containing a delta

320

function just before the present,

321

eµ(t) =µ(t)−lnf_pδ(t−t_p). (6.7) The likelihood formula is then obtained by settingmp=0 in Eq. (4.1), while evaluat- ing the functionsξ(t,t_p)andη(t,t_p)with the modified extinction rateeµ(t,t_p). This establishes Eq. (6.4); it remains to be proven that the modified functionsξe(t,t_p)and