Fast and accurate branch length estimation for phylogenomic trees: ERaBLE (Evolutionary Rates and Branch Length Estimation)

(1)

HAL Id: lirmm-01237447

https://hal-lirmm.ccsd.cnrs.fr/lirmm-01237447

Submitted on 3 Dec 2015

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of

sci-entific research documents, whether they are

pub-lished or not. The documents may come from

teaching and research institutions in France or

abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est

destinée au dépôt et à la diffusion de documents

scientifiques de niveau recherche, publiés ou non,

émanant des établissements d’enseignement et de

recherche français ou étrangers, des laboratoires

publics ou privés.

Fast and accurate branch length estimation for

phylogenomic trees: ERaBLE (Evolutionary Rates and

Branch Length Estimation)

Manuel Binet, Olivier Gascuel, Celine Scornavacca, Emmanuel Douzery, Fabio

Pardi

To cite this version:

Manuel Binet, Olivier Gascuel, Celine Scornavacca, Emmanuel Douzery, Fabio Pardi. Fast and

ac-curate branch length estimation for phylogenomic trees: ERaBLE (Evolutionary Rates and Branch

Length Estimation). Rencontres ALPHY - Génomique Evolutive, Bioinformatique, Alignement et

Phylogénie, Mar 2015, Montpellier, France. �lirmm-01237447�

(2)

Fast and accurate branch length estimation

for phylogenomic trees: ERaBLE

(Evolutionary Rates and Branch Length Estimation).

Manuel Binet, Olivier Gascuel, Celine Scornavacca, Emmanuel J.P. Douzery and Fabio Pardi

(3)

The Phylogenomics era

Important goal: build the species tree.

SPECIES 1 Gene 1ADN Gene 2ADN Gene 3ADN

SPECIES 2 Gene 2ADN Gene 3ADN

SPECIES 3 Gene 1ADN Gene 3ADN

SPECIES 4 ADNGene 1

SPECIES 5 Gene 1ADN Gene 2ADN Gene 3ADN

New challenges:

Use the maximum of available data. Account for gene rate heterogeneity. Deal with conflicts in gene trees topologies. ...

(4)

Categories of phylogenomics methods

Gene 1 AAGTCA 1 AGGACC 3 GACAGA 4 GAAACC 5 Gene 2 TACCAGC 1 ACTCCCC 2 ACTCTCT 5 Gene 3 ATGAC 1 AGGAG 2 AAGAG 3 AAGAC 5 Concatenate

AAGTCATACCAGCATGAC

1 ??????ACTCCCCAGGAG 2 AGGACC???????AAGAG 3 GACAGA???????????? 4

GAAACCACTCTCTAAGAC

5 δ1 δ13 δ14 δ15 δ34 δ35 δ45                                           δ2 δ12 δ15 δ25                   δ3 δ12 δ13 δ15 δ23 δ25 δ35                                           4 1 3 5 5 1 2 2 1 3 5 1 3 4 5 2 2 1 3 5 4 2 1 4 5 3 Supertree Distance-based Based on concatenated alignment

(5)

Categories of phylogenomics methods

Gene 1 AAGTCA 1 AGGACC 3 GACAGA 4 GAAACC 5 Gene 2 TACCAGC 1 ACTCCCC 2 ACTCTCT 5 Gene 3 ATGAC 1 AGGAG 2 AAGAG 3 AAGAC 5 Concatenate

AAGTCATACCAGCATGAC

1 ??????ACTCCCCAGGAG 2 AGGACC???????AAGAG 3 GACAGA???????????? 4

GAAACCACTCTCTAAGAC

5 δ1 δ13 δ14 δ15 δ34 δ35 δ45                                           δ2 δ12 δ15 δ25                   δ3 δ12 δ13 δ15 δ23 δ25 δ35                                           4 1 3 5 5 1 2 2 1 3 5 1 3 4 5 2 2 1 3 5 4 2 1 4 5 3 Supertree Distance-based Based on concatenated alignment Branch lengths:

(6)

Tree distances

Every tree with branch lengths can be represented with avector of tree distances by computing the distance dT

ij between each pair of taxa in the tree.

4

2 T=(T , b)

3

1 b

1

=1

b

2

=2

b

3

=3

b

5

=1

_b

₄

₌₂

⇔

d

T

₌







d

12

= 3

d

13

= 5

d

14

= 4

d

23

= 6

d

24

= 5

d

34

= 5







Vector of tree distances

d

T

ij

=

X

e∈Pij

(7)

Tree distances

Every vector of tree distances is the product of thetopological matrixof T by the branch lengths vector of T. i.e. dT₌_Ab.

dT₌







d12= 3 d13= 5 d14= 4 d23= 6 d24= 5 d34= 5







Vector of tree distances

b =







b1= 1 b2= 2 b3= 3 b4= 2 b5= 1







Branch lengths vector

A =







1 1 0 0 0 1 0 1 0 1 1 0 0 1 1 0 1 1 0 1 0 1 0 1 1 0 0 1 1 0







Topological matrix 4 2 T=(T , b) 3 1 b1=1 b2=2 b3=3 b5=1 _b₄₌₂ d T₌_Ab

(8)

A distance-based method : the weighted least square method (WLS)

Branch lengths estimation with WLS:

minimize

X

ij

w

ij

(δ

ij

− d

ijT

)

2 input distances (estimated from 1 gene alignment) input topology δ =







δ12= 0.140 δ13= 0.163 δ14= 0.288 δ15= 0.336 δ23= 0.188 δ24= 0.413 δ25= 0.411 δ34= 0.298 δ35= 0.213







3 b3=? 1 b1=? 2 b2=? b6=? 5 b5=? 4 b4=? b7=? dT₌_Ab

(9)

A distance-based method : the weighted least square method (WLS)

Branch lengths estimation with WLS:

minimize

X

ij

w

ij

(δ

ij

− d

ijT

)

2

can be solved analytically in

O(n

3

)

input distances (estimated from 1 gene alignment) input topology δ =







δ12= 0.140 δ13= 0.163 δ14= 0.288 δ15= 0.336 δ23= 0.188 δ24= 0.413 δ25= 0.411 δ34= 0.298 δ35= 0.213







3 b3=? 1 b1=? 2 b2=? b6=? 5 b5=? 4 b4=? b7=? 3 b3=0.04 1 b1=0.02 2 b2=0.08 b₆ =0 .06 5 b5=0.09 4 b4=0.08 b7=0.17 output dT₌_Ab

(10)

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

Hypothesis: Any gene Gk induces approximately the same tree up to ascale

factor αkand the removal of a number of lineages.

E A B 1,5 3 _{1,5 1,5}_E A _C B 3 2 ₁ ₁ _E A _C D 2 0,75 0,75 1,5

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

Hypothesis: The Proportional model. Any gene Gk induces the same tree up to ascale factor αk, representing the inverse of its evolutionary rate rk.

ˆ α1× δ(1)AE≃ ˆα2× δ(2)AE≃ ˆα3× δ(3)AE E A B 2 2 ₁ ₁ _E A _C B 2 2 ₁ ₁ _E A _C D 2 1 1 2 δ_AE(1)=6 r1=1 α1=1 δ(2)_AE=3.75 r2=0.75 α2=1.33 δ_AE(3)=9 r3=1.5 α2=0.67 2 ₁ ₁ _E A _C D 2 E A B 1,5 0,75 0,75 1,5 ₃ 1,5 1,5E A _C B 3

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

Hypothesis:

The Proportional model. Any gene G

k

induces the same tree up to

a

scale factor α

k

, representing the inverse of its evolutionary rate r

k

.

ˆ

α

1

× δ

(1)AE

≃ ˆ

α

2

× δ

(2)AE

≃ ˆ

α

3

× δ

(3)AE E A B 2 2 ₁ ₁

_E

A

_C

B

2

2 ₁ ₁ _E A _C D 2 1 1 2

δ

_AE(1)

=6

r

1

=1 α

1

=1

δ

(2)_AE

=3.75

r

2

=0.75 α

2

=1.33

δ

_AE(3)

=9

r

3

=1.5 α

2

=0.67

2 ₁ ₁ _E A _C D 2 E A B 1,5 0,75 0,75 1,5 ₃ 1,5 1,5E A _C B 3

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

Hypothesis:The Proportional model. Any gene Gkinduces the same tree up to ascale factor αk, representing the inverse of its evolutionary rate rk.

ˆ α1× δ(1)AE≃ ˆα2× δ(2)AE≃ ˆα3× δ(3)AE E A B 2 2 ₁ ₁ E A _C B 2 2 ₁ 1E A _C D 2 1 1 2 δ(1)_AE=6 r1=1 α1=1 δ_AE(2)=3.75 r2=0.75 α2=1.33 δ(3)_AE=9 r3=1.5 α2=0.67 2 ₁ ₁_E A C D 2 E A B 1,5 0,75 0,75 1,5 3 _{1,5 1,5}E A C B 3 After scaling ×α1=1 = ×α2=1.33 = ×α3=0.67 = E A _C D B b3=? b5=? b4=? b6 =? b7=? b1=? b2=?

(11)

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

E A B 1,5 3 _{1,5 1,5}_E A _C B 3 2 ₁ ₁ _E A _C D 2 0,75 0,75 1,5

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

Hypothesis:

The Proportional model. Any gene G

k

induces the same tree up to

a

scale factor α

k

, representing the inverse of its evolutionary rate r

k

.

ˆ

α

1

× δ

(1)AE

≃ ˆ

α

2

× δ

(2)AE

≃ ˆ

α

3

× δ

(3)AE E A B 2 2 ₁ ₁

_E

A

_C

B

2

2 ₁ ₁ _E A _C D 2 1 1 2

δ

_AE(1)

=6

r

1

=1 α

1

=1

δ

(2)_AE

=3.75

r

2

=0.75 α

2

=1.33

δ

_AE(3)

=9

r

3

=1.5 α

2

=0.67

2 ₁ ₁ _E A _C D 2 E A B 1,5 0,75 0,75 1,5 ₃ 1,5 1,5E A _C B 3

(12)

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

E A B 1,5 3 _{1,5 1,5}_E A _C B 3 2 ₁ ₁ _E A _C D 2 0,75 0,75 1,5

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

Hypothesis:

The Proportional model. Any gene G

k

induces the same tree up to

a

scale factor α

k

, representing the inverse of its evolutionary rate r

k

.

ˆ

α

1

× δ

(1)AE

≃ ˆ

α

2

× δ

(2)AE

≃ ˆ

α

3

× δ

(3)AE E A B 2 2 ₁ ₁

_E

A

_C

B

2

2 ₁ ₁ _E A _C D 2 1 1 2

δ

_AE(1)

=6

r

1

=1 α

1

=1

δ

(2)_AE

=3.75

r

2

=0.75 α

2

=1.33

δ

_AE(3)

=9

r

3

=1.5 α

2

=0.67

2 ₁ ₁ _E A _C D 2 E A B 1,5 0,75 0,75 1,5 ₃ 1,5 1,5E A _C B 3

(13)

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

Input data: m distance vectors δ1,. . . , δmwhere δk= (δ(_ijk))is defined on the

taxa set Lk.

A given topology T defined on the taxa set L =

m

[

k=1

Lk.

Objective: find

the branch lengths ˆb1,. . . , ˆb2n−3of T ,

the scale factors ˆα1,. . . , ˆαmof the m genes,

solution of the problem:











minimize m

X

k=1

X

i,j∈Lk w(k) ij ( ˆαkδ(_ijk)− dijT)2 subjet to m

X

k=1 Zkαˆk= 1 Outputs: ˆ be et ˆrk= 1 ˆ αk .

(14)

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

taxa set Lk.

m

[

k=1

Lk.

Objective: find











minimize m

X

k=1

X

(15)

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

taxa set Lk.

m

[

k=1

Lk.

Objective: find











minimize m

X

k=1

X

(16)

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

taxa set Lk.

m

[

k=1

Lk.

Objective: find











minimize m

X

k=1

X

(17)

Resolution of the optimization problem

The minimization problem can be solved with the method of Lagrange multipliers and leads to the linear system inO(n + m)equations and unknowns:

1                        δt 1W1δ1 0 · · · 0 0 δt 2W2δ2 ... ... ... ... 0 · · · · · · δt mWmδm ] ] ] ] − A t 1_W 1_δ 1 − A t 2_W 2_δ 2 .. . − A t m W m δ m [ [ [ [ Z1 Z2 · · · Zm [ −δt 1W1A1 ] [ −δt 2W2A2 ] ... [ −δt mWmAm ] 1 1 ... 1           m X k=1 At kWkAk           0 ... ... ... 0 0 · · · 0 0                                              α1 α2 ... αm b1 b2 ... ... ... b2n−3 λ                       =                        0 ... ... ... ... ... ... ... 0 1                        m=# genes n=# taxa Filling the matrix and standard system solving inO((n + m)3₊_mn4₎

Solving inO(n3₊_mn2₎_{with our algorithms:}

Filling the matrix inO(mn2)instead ofO(mn4)

(18)

Resolution of the optimization problem

The minimization problem can be solved with the method of Lagrange multipliers and leads to the linear system inO(n + m)equations and unknowns:

1                        δt 1W1δ1 0 · · · 0 0 δt 2W2δ2 ... ... ... ... 0 · · · · · · δt mWmδm ] ] ] ] − A t 1_W 1_δ 1 − A t 2_W 2_δ 2 .. . − A t m W m δ m [ [ [ [ Z1 Z2 · · · Zm [ −δt 1W1A1 ] [ −δt 2W2A2 ] ... [ −δt mWmAm ] 1 1 ... 1           m X k=1 At kWkAk           0 ... ... ... 0 0 · · · 0 0                                              α1 α2 ... αm b1 b2 ... ... ... b2n−3 λ                       =                        0 ... ... ... ... ... ... ... 0 1                        m=# genes n=# taxa Filling the matrix and standard system solving inO((n + m)3₊_mn4₎

Solving inO(n3₊_mn2₎_{with our algorithms:}

Filling the matrix inO(mn2)instead ofO(mn4)

Block-solving system inO(n3)instead ofO(n + m)3

Complexity: ERaBLEO(n3₊_mn2₎_{vs. WLS} O(n3₎

(19)

(20)

(21)

Results on the OrthoMaM dataset [Douzery et al. 2014]

input data: m=6953 nucleotide exon alignments over n=4 to 40 mammals.

(22)

(23)

Conclusion

ERaBLE gives branch lengths to phylogenomic trees (e.g. estimated with supertree methods).

ERaBLE is relatively accurate in the estimations of branch lengths and of gene rates.

(24)

Thank you for your attention.

Any questions?