• Aucun résultat trouvé

Fast and accurate branch length estimation for phylogenomic trees: ERaBLE (Evolutionary Rates and Branch Length Estimation)

N/A
N/A
Protected

Academic year: 2021

Partager "Fast and accurate branch length estimation for phylogenomic trees: ERaBLE (Evolutionary Rates and Branch Length Estimation)"

Copied!
24
0
0

Texte intégral

(1)

HAL Id: lirmm-01237447

https://hal-lirmm.ccsd.cnrs.fr/lirmm-01237447

Submitted on 3 Dec 2015

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of

sci-entific research documents, whether they are

pub-lished or not. The documents may come from

teaching and research institutions in France or

abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est

destinée au dépôt et à la diffusion de documents

scientifiques de niveau recherche, publiés ou non,

émanant des établissements d’enseignement et de

recherche français ou étrangers, des laboratoires

publics ou privés.

Fast and accurate branch length estimation for

phylogenomic trees: ERaBLE (Evolutionary Rates and

Branch Length Estimation)

Manuel Binet, Olivier Gascuel, Celine Scornavacca, Emmanuel Douzery, Fabio

Pardi

To cite this version:

Manuel Binet, Olivier Gascuel, Celine Scornavacca, Emmanuel Douzery, Fabio Pardi. Fast and

ac-curate branch length estimation for phylogenomic trees: ERaBLE (Evolutionary Rates and Branch

Length Estimation). Rencontres ALPHY - Génomique Evolutive, Bioinformatique, Alignement et

Phylogénie, Mar 2015, Montpellier, France. �lirmm-01237447�

(2)

Fast and accurate branch length estimation

for phylogenomic trees: ERaBLE

(Evolutionary Rates and Branch Length Estimation).

Manuel Binet, Olivier Gascuel, Celine Scornavacca, Emmanuel J.P. Douzery and Fabio Pardi

(3)

The Phylogenomics era

Important goal: build the species tree.

SPECIES 1 Gene 1ADN Gene 2ADN Gene 3ADN

SPECIES 2 Gene 2ADN Gene 3ADN

SPECIES 3 Gene 1ADN Gene 3ADN

SPECIES 4 ADNGene 1

SPECIES 5 Gene 1ADN Gene 2ADN Gene 3ADN

New challenges:

Use the maximum of available data. Account for gene rate heterogeneity. Deal with conflicts in gene trees topologies. ...

(4)

Categories of phylogenomics methods

Gene 1 AAGTCA 1 AGGACC 3 GACAGA 4 GAAACC 5 Gene 2 TACCAGC 1 ACTCCCC 2 ACTCTCT 5 Gene 3 ATGAC 1 AGGAG 2 AAGAG 3 AAGAC 5 Concatenate

AAGTCATACCAGCATGAC

1 ??????ACTCCCCAGGAG 2 AGGACC???????AAGAG 3 GACAGA???????????? 4

GAAACCACTCTCTAAGAC

5 δ1 δ13 δ14 δ15 δ34 δ35 δ45                                           δ2 δ12 δ15 δ25                   δ3 δ12 δ13 δ15 δ23 δ25 δ35                                           4 1 3 5 5 1 2 2 1 3 5 1 3 4 5 2 2 1 3 5 4 2 1 4 5 3 Supertree Distance-based Based on concatenated alignment

(5)

Categories of phylogenomics methods

Gene 1 AAGTCA 1 AGGACC 3 GACAGA 4 GAAACC 5 Gene 2 TACCAGC 1 ACTCCCC 2 ACTCTCT 5 Gene 3 ATGAC 1 AGGAG 2 AAGAG 3 AAGAC 5 Concatenate

AAGTCATACCAGCATGAC

1 ??????ACTCCCCAGGAG 2 AGGACC???????AAGAG 3 GACAGA???????????? 4

GAAACCACTCTCTAAGAC

5 δ1 δ13 δ14 δ15 δ34 δ35 δ45                                           δ2 δ12 δ15 δ25                   δ3 δ12 δ13 δ15 δ23 δ25 δ35                                           4 1 3 5 5 1 2 2 1 3 5 1 3 4 5 2 2 1 3 5 4 2 1 4 5 3 Supertree Distance-based Based on concatenated alignment Branch lengths:

(6)

Tree distances

Every tree with branch lengths can be represented with avector of tree distances by computing the distance dT

ij between each pair of taxa in the tree.

4

2

T=(T , b)

3

1

b

1

=1

b

2

=2

b

3

=3

b

5

=1

b

4

=2

d

T

=

d

12

= 3

d

13

= 5

d

14

= 4

d

23

= 6

d

24

= 5

d

34

= 5

Vector of tree distances

d

T

ij

=

X

e∈Pij

(7)

Tree distances

Every vector of tree distances is the product of thetopological matrixof T by the branch lengths vector of T. i.e. dT=Ab.

dT=

d12= 3 d13= 5 d14= 4 d23= 6 d24= 5 d34= 5

Vector of tree distances

b =

b1= 1 b2= 2 b3= 3 b4= 2 b5= 1

Branch lengths vector

A =

1 1 0 0 0 1 0 1 0 1 1 0 0 1 1 0 1 1 0 1 0 1 0 1 1 0 0 1 1 0

Topological matrix 4 2 T=(T , b) 3 1 b1=1 b2=2 b3=3 b5=1 b4=2 d T=Ab

(8)

A distance-based method : the weighted least square method (WLS)

Branch lengths estimation with WLS:

minimize

X

ij

w

ij

ij

− d

ijT

)

2 input distances (estimated from 1 gene alignment) input topology δ =

δ12= 0.140 δ13= 0.163 δ14= 0.288 δ15= 0.336 δ23= 0.188 δ24= 0.413 δ25= 0.411 δ34= 0.298 δ35= 0.213

3 b3=? 1 b1=? 2 b2=? b6=? 5 b5=? 4 b4=? b7=? dT=Ab

(9)

A distance-based method : the weighted least square method (WLS)

Branch lengths estimation with WLS:

minimize

X

ij

w

ij

ij

− d

ijT

)

2

can be solved analytically in

O(n

3

)

input distances (estimated from 1 gene alignment) input topology δ =

δ12= 0.140 δ13= 0.163 δ14= 0.288 δ15= 0.336 δ23= 0.188 δ24= 0.413 δ25= 0.411 δ34= 0.298 δ35= 0.213

3 b3=? 1 b1=? 2 b2=? b6=? 5 b5=? 4 b4=? b7=? 3 b3=0.04 1 b1=0.02 2 b2=0.08 b6 =0 .06 5 b5=0.09 4 b4=0.08 b7=0.17 output dT=Ab

(10)

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

Hypothesis: Any gene Gk induces approximately the same tree up to ascale

factor αkand the removal of a number of lineages.

E A B 1,5 3 1,5 1,5E A C B 3 2 1 1 E A C D 2 0,75 0,75 1,5

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

Hypothesis: The Proportional model. Any gene Gk induces the same tree up to ascale factor αk, representing the inverse of its evolutionary rate rk.

ˆ α1× δ(1)AE≃ ˆα2× δ(2)AE≃ ˆα3× δ(3)AE E A B 2 2 1 1 E A C B 2 2 1 1 E A C D 2 1 1 2 δAE(1)=6 r1=1 α1=1 δ(2)AE=3.75 r2=0.75 α2=1.33 δAE(3)=9 r3=1.5 α2=0.67 2 1 1 E A C D 2 E A B 1,5 0,75 0,75 1,5 3 1,5 1,5E A C B 3

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

Hypothesis:

The Proportional model. Any gene G

k

induces the same tree up to

a

scale factor α

k

, representing the inverse of its evolutionary rate r

k

.

ˆ

α

1

× δ

(1)AE

≃ ˆ

α

2

× δ

(2)AE

≃ ˆ

α

3

× δ

(3)AE E A B 2 2 1 1

E

A

C

B

2

2 1 1 E A C D 2 1 1 2

δ

AE(1)

=6

r

1

=1 α

1

=1

δ

(2)AE

=3.75

r

2

=0.75 α

2

=1.33

δ

AE(3)

=9

r

3

=1.5 α

2

=0.67

2 1 1 E A C D 2 E A B 1,5 0,75 0,75 1,5 3 1,5 1,5E A C B 3

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

Hypothesis:The Proportional model. Any gene Gkinduces the same tree up to ascale factor αk, representing the inverse of its evolutionary rate rk.

ˆ α1× δ(1)AE≃ ˆα2× δ(2)AE≃ ˆα3× δ(3)AE E A B 2 2 1 1 E A C B 2 2 1 1E A C D 2 1 1 2 δ(1)AE=6 r1=1 α1=1 δAE(2)=3.75 r2=0.75 α2=1.33 δ(3)AE=9 r3=1.5 α2=0.67 2 1 1E A C D 2 E A B 1,5 0,75 0,75 1,5 3 1,5 1,5E A C B 3 After scaling ×α1=1 = ×α2=1.33 = ×α3=0.67 = E A C D B b3=? b5=? b4=? b6 =? b7=? b1=? b2=?

(11)

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

Hypothesis: Any gene Gk induces approximately the same tree up to ascale

factor αkand the removal of a number of lineages.

E A B 1,5 3 1,5 1,5E A C B 3 2 1 1 E A C D 2 0,75 0,75 1,5

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

Hypothesis: The Proportional model. Any gene Gk induces the same tree up to ascale factor αk, representing the inverse of its evolutionary rate rk.

ˆ α1× δ(1)AE≃ ˆα2× δ(2)AE≃ ˆα3× δ(3)AE E A B 2 2 1 1 E A C B 2 2 1 1 E A C D 2 1 1 2 δAE(1)=6 r1=1 α1=1 δ(2)AE=3.75 r2=0.75 α2=1.33 δAE(3)=9 r3=1.5 α2=0.67 2 1 1 E A C D 2 E A B 1,5 0,75 0,75 1,5 3 1,5 1,5E A C B 3

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

Hypothesis:

The Proportional model. Any gene G

k

induces the same tree up to

a

scale factor α

k

, representing the inverse of its evolutionary rate r

k

.

ˆ

α

1

× δ

(1)AE

≃ ˆ

α

2

× δ

(2)AE

≃ ˆ

α

3

× δ

(3)AE E A B 2 2 1 1

E

A

C

B

2

2 1 1 E A C D 2 1 1 2

δ

AE(1)

=6

r

1

=1 α

1

=1

δ

(2)AE

=3.75

r

2

=0.75 α

2

=1.33

δ

AE(3)

=9

r

3

=1.5 α

2

=0.67

2 1 1 E A C D 2 E A B 1,5 0,75 0,75 1,5 3 1,5 1,5E A C B 3

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

Hypothesis:The Proportional model. Any gene Gkinduces the same tree up to ascale factor αk, representing the inverse of its evolutionary rate rk.

ˆ α1× δ(1)AE≃ ˆα2× δ(2)AE≃ ˆα3× δ(3)AE E A B 2 2 1 1 E A C B 2 2 1 1E A C D 2 1 1 2 δ(1)AE=6 r1=1 α1=1 δAE(2)=3.75 r2=0.75 α2=1.33 δ(3)AE=9 r3=1.5 α2=0.67 2 1 1E A C D 2 E A B 1,5 0,75 0,75 1,5 3 1,5 1,5E A C B 3 After scaling ×α1=1 = ×α2=1.33 = ×α3=0.67 = E A C D B b3=? b5=? b4=? b6 =? b7=? b1=? b2=?

(12)

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

Hypothesis: Any gene Gk induces approximately the same tree up to ascale

factor αkand the removal of a number of lineages.

E A B 1,5 3 1,5 1,5E A C B 3 2 1 1 E A C D 2 0,75 0,75 1,5

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

Hypothesis: The Proportional model. Any gene Gk induces the same tree up to ascale factor αk, representing the inverse of its evolutionary rate rk.

ˆ α1× δ(1)AE≃ ˆα2× δ(2)AE≃ ˆα3× δ(3)AE E A B 2 2 1 1 E A C B 2 2 1 1 E A C D 2 1 1 2 δAE(1)=6 r1=1 α1=1 δ(2)AE=3.75 r2=0.75 α2=1.33 δAE(3)=9 r3=1.5 α2=0.67 2 1 1 E A C D 2 E A B 1,5 0,75 0,75 1,5 3 1,5 1,5E A C B 3

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

Hypothesis:

The Proportional model. Any gene G

k

induces the same tree up to

a

scale factor α

k

, representing the inverse of its evolutionary rate r

k

.

ˆ

α

1

× δ

(1)AE

≃ ˆ

α

2

× δ

(2)AE

≃ ˆ

α

3

× δ

(3)AE E A B 2 2 1 1

E

A

C

B

2

2 1 1 E A C D 2 1 1 2

δ

AE(1)

=6

r

1

=1 α

1

=1

δ

(2)AE

=3.75

r

2

=0.75 α

2

=1.33

δ

AE(3)

=9

r

3

=1.5 α

2

=0.67

2 1 1 E A C D 2 E A B 1,5 0,75 0,75 1,5 3 1,5 1,5E A C B 3

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

Hypothesis:The Proportional model. Any gene Gkinduces the same tree up to ascale factor αk, representing the inverse of its evolutionary rate rk.

ˆ α1× δ(1)AE≃ ˆα2× δ(2)AE≃ ˆα3× δ(3)AE E A B 2 2 1 1 E A C B 2 2 1 1E A C D 2 1 1 2 δ(1)AE=6 r1=1 α1=1 δAE(2)=3.75 r2=0.75 α2=1.33 δ(3)AE=9 r3=1.5 α2=0.67 2 1 1E A C D 2 E A B 1,5 0,75 0,75 1,5 3 1,5 1,5E A C B 3 After scaling ×α1=1 = ×α2=1.33 = ×α3=0.67 = E A C D B b3=? b5=? b4=? b6 =? b7=? b1=? b2=?

(13)

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

Input data: m distance vectors δ1,. . . , δmwhere δk= (δ(ijk))is defined on the

taxa set Lk.

A given topology T defined on the taxa set L =

m

[

k=1

Lk.

Objective: find

the branch lengths ˆb1,. . . , ˆb2n−3of T ,

the scale factors ˆα1,. . . , ˆαmof the m genes,

solution of the problem:

minimize m

X

k=1

X

i,j∈Lk w(k) ij ( ˆαkδ(ijk)− dijT)2 subjet to m

X

k=1 Zkαˆk= 1 Outputs: ˆ be et ˆrk= 1 ˆ αk .

(14)

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

Input data: m distance vectors δ1,. . . , δmwhere δk= (δ(ijk))is defined on the

taxa set Lk.

A given topology T defined on the taxa set L =

m

[

k=1

Lk.

Objective: find

the branch lengths ˆb1,. . . , ˆb2n−3of T ,

the scale factors ˆα1,. . . , ˆαmof the m genes,

solution of the problem:

minimize m

X

k=1

X

i,j∈Lk w(k) ij ( ˆαkδ(ijk)− dijT)2 subjet to m

X

k=1 Zkαˆk= 1 Outputs: ˆ be et ˆrk= 1 ˆ αk .

(15)

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

Input data: m distance vectors δ1,. . . , δmwhere δk= (δ(ijk))is defined on the

taxa set Lk.

A given topology T defined on the taxa set L =

m

[

k=1

Lk.

Objective: find

the branch lengths ˆb1,. . . , ˆb2n−3of T ,

the scale factors ˆα1,. . . , ˆαmof the m genes,

solution of the problem:

minimize m

X

k=1

X

i,j∈Lk w(k) ij ( ˆαkδ(ijk)− dijT)2 subjet to m

X

k=1 Zkαˆk= 1 Outputs: ˆ be et ˆrk= 1 ˆ αk .

(16)

New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation

Input data: m distance vectors δ1,. . . , δmwhere δk= (δ(ijk))is defined on the

taxa set Lk.

A given topology T defined on the taxa set L =

m

[

k=1

Lk.

Objective: find

the branch lengths ˆb1,. . . , ˆb2n−3of T ,

the scale factors ˆα1,. . . , ˆαmof the m genes,

solution of the problem:

minimize m

X

k=1

X

i,j∈Lk w(k) ij ( ˆαkδ(ijk)− dijT)2 subjet to m

X

k=1 Zkαˆk= 1 Outputs: ˆ be et ˆrk= 1 ˆ αk .

(17)

Resolution of the optimization problem

The minimization problem can be solved with the method of Lagrange multipliers and leads to the linear system inO(n + m)equations and unknowns:

1                        δt 1W1δ1 0 · · · 0 0 δt 2W2δ2 ... ... ... ... 0 · · · · · · δt mWmδm ] ] ] ] − A t 1W 1δ 1 − A t 2W 2δ 2 .. . − A t m W m δ m [ [ [ [ Z1 Z2 · · · Zm [ −δt 1W1A1 ] [ −δt 2W2A2 ] ... [ −δt mWmAm ] 1 1 ... 1           m X k=1 At kWkAk           0 ... ... ... 0 0 · · · 0 0                                              α1 α2 ... αm b1 b2 ... ... ... b2n−3 λ                       =                        0 ... ... ... ... ... ... ... 0 1                        m=# genes n=# taxa Filling the matrix and standard system solving inO((n + m)3+mn4)

Solving inO(n3+mn2)with our algorithms:

Filling the matrix inO(mn2)instead ofO(mn4)

(18)

Resolution of the optimization problem

The minimization problem can be solved with the method of Lagrange multipliers and leads to the linear system inO(n + m)equations and unknowns:

1                        δt 1W1δ1 0 · · · 0 0 δt 2W2δ2 ... ... ... ... 0 · · · · · · δt mWmδm ] ] ] ] − A t 1W 1δ 1 − A t 2W 2δ 2 .. . − A t m W m δ m [ [ [ [ Z1 Z2 · · · Zm [ −δt 1W1A1 ] [ −δt 2W2A2 ] ... [ −δt mWmAm ] 1 1 ... 1           m X k=1 At kWkAk           0 ... ... ... 0 0 · · · 0 0                                              α1 α2 ... αm b1 b2 ... ... ... b2n−3 λ                       =                        0 ... ... ... ... ... ... ... 0 1                        m=# genes n=# taxa Filling the matrix and standard system solving inO((n + m)3+mn4)

Solving inO(n3+mn2)with our algorithms:

Filling the matrix inO(mn2)instead ofO(mn4)

Block-solving system inO(n3)instead ofO(n + m)3

Complexity: ERaBLEO(n3+mn2)vs. WLS O(n3)

(19)
(20)
(21)

Results on the OrthoMaM dataset [Douzery et al. 2014]

input data: m=6953 nucleotide exon alignments over n=4 to 40 mammals.

(22)
(23)

Conclusion

ERaBLE gives branch lengths to phylogenomic trees (e.g. estimated with supertree methods).

ERaBLE is relatively accurate in the estimations of branch lengths and of gene rates.

(24)

Thank you for your attention.

Any questions?

Références

Documents relatifs

function for those partitions is equal to the product of several “small” generating functions for the regions of the partitions, as shown in Fig.. It suffices to prove the

Notre approche consiste donc en deux modules, un premier recherchant une solution dans l’arbre de r´esolution et qui retourne une requˆete quand cela est n´ecessaire, et un

Before considering the use of absolute qPCR data in comparative studies, it remains crucial to validate absolute qPCR-based TL estimates by comparing among study, indi- viduals

L'évaluation d'un nœud de l'arbre de recherche a pour but de déterminer l'optimum de l'ensemble des solutions réalisables associé au nœud en question ou, au contraire, de

So, if we want to study the estimated length by local estimators when the res- olution tends to zero, we have at first to study the number of occurrences of a pattern of

The irradiances during the flight for the 312 nm channel as a function of solar zenith angle as measured by the various

The first chapter is a theoretical one which deals with a review literatureon reading in the foreign language classroom, starting with a definition of reading, types of

To this aim, we give asymptotics for the number σ (n) of collisions which occur in the n-coalescent until the end of the chosen external branch, and for the block counting