HAL Id: lirmm-01237447
https://hal-lirmm.ccsd.cnrs.fr/lirmm-01237447
Submitted on 3 Dec 2015
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of
sci-entific research documents, whether they are
pub-lished or not. The documents may come from
teaching and research institutions in France or
abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est
destinée au dépôt et à la diffusion de documents
scientifiques de niveau recherche, publiés ou non,
émanant des établissements d’enseignement et de
recherche français ou étrangers, des laboratoires
publics ou privés.
Fast and accurate branch length estimation for
phylogenomic trees: ERaBLE (Evolutionary Rates and
Branch Length Estimation)
Manuel Binet, Olivier Gascuel, Celine Scornavacca, Emmanuel Douzery, Fabio
Pardi
To cite this version:
Manuel Binet, Olivier Gascuel, Celine Scornavacca, Emmanuel Douzery, Fabio Pardi. Fast and
ac-curate branch length estimation for phylogenomic trees: ERaBLE (Evolutionary Rates and Branch
Length Estimation). Rencontres ALPHY - Génomique Evolutive, Bioinformatique, Alignement et
Phylogénie, Mar 2015, Montpellier, France. �lirmm-01237447�
Fast and accurate branch length estimation
for phylogenomic trees: ERaBLE
(Evolutionary Rates and Branch Length Estimation).
Manuel Binet, Olivier Gascuel, Celine Scornavacca, Emmanuel J.P. Douzery and Fabio Pardi
The Phylogenomics era
Important goal: build the species tree.
SPECIES 1 Gene 1ADN Gene 2ADN Gene 3ADN
SPECIES 2 Gene 2ADN Gene 3ADN
SPECIES 3 Gene 1ADN Gene 3ADN
SPECIES 4 ADNGene 1
SPECIES 5 Gene 1ADN Gene 2ADN Gene 3ADN
New challenges:
Use the maximum of available data. Account for gene rate heterogeneity. Deal with conflicts in gene trees topologies. ...
Categories of phylogenomics methods
Gene 1 AAGTCA 1 AGGACC 3 GACAGA 4 GAAACC 5 Gene 2 TACCAGC 1 ACTCCCC 2 ACTCTCT 5 Gene 3 ATGAC 1 AGGAG 2 AAGAG 3 AAGAC 5 ConcatenateAAGTCATACCAGCATGAC
1 ??????ACTCCCCAGGAG 2 AGGACC???????AAGAG 3 GACAGA???????????? 4
GAAACCACTCTCTAAGAC
5 δ1 δ13 δ14 δ15 δ34 δ35 δ45 δ2 δ12 δ15 δ25 δ3 δ12 δ13 δ15 δ23 δ25 δ35 4 1 3 5 5 1 2 2 1 3 5 1 3 4 5 2 2 1 3 5 4 2 1 4 5 3 Supertree Distance-based Based on concatenated alignment
Categories of phylogenomics methods
Gene 1 AAGTCA 1 AGGACC 3 GACAGA 4 GAAACC 5 Gene 2 TACCAGC 1 ACTCCCC 2 ACTCTCT 5 Gene 3 ATGAC 1 AGGAG 2 AAGAG 3 AAGAC 5 ConcatenateAAGTCATACCAGCATGAC
1 ??????ACTCCCCAGGAG 2 AGGACC???????AAGAG 3 GACAGA???????????? 4
GAAACCACTCTCTAAGAC
5 δ1 δ13 δ14 δ15 δ34 δ35 δ45 δ2 δ12 δ15 δ25 δ3 δ12 δ13 δ15 δ23 δ25 δ35 4 1 3 5 5 1 2 2 1 3 5 1 3 4 5 2 2 1 3 5 4 2 1 4 5 3 Supertree Distance-based Based on concatenated alignment Branch lengths:
Tree distances
Every tree with branch lengths can be represented with avector of tree distances by computing the distance dT
ij between each pair of taxa in the tree.
4
2
T=(T , b)
3
1
b
1=1
b
2=2
b
3=3
b
5=1
b
4=2
⇔
d
T=
d
12= 3
d
13= 5
d
14= 4
d
23= 6
d
24= 5
d
34= 5
Vector of tree distances
d
Tij
=
X
e∈PijTree distances
Every vector of tree distances is the product of thetopological matrixof T by the branch lengths vector of T. i.e. dT=Ab.
dT=
d12= 3 d13= 5 d14= 4 d23= 6 d24= 5 d34= 5
Vector of tree distances
b =
b1= 1 b2= 2 b3= 3 b4= 2 b5= 1
Branch lengths vector
A =
1 1 0 0 0 1 0 1 0 1 1 0 0 1 1 0 1 1 0 1 0 1 0 1 1 0 0 1 1 0
Topological matrix 4 2 T=(T , b) 3 1 b1=1 b2=2 b3=3 b5=1 b4=2 d T=AbA distance-based method : the weighted least square method (WLS)
Branch lengths estimation with WLS:
minimize
X
ijw
ij(δ
ij− d
ijT)
2 input distances (estimated from 1 gene alignment) input topology δ =
δ12= 0.140 δ13= 0.163 δ14= 0.288 δ15= 0.336 δ23= 0.188 δ24= 0.413 δ25= 0.411 δ34= 0.298 δ35= 0.213
3 b3=? 1 b1=? 2 b2=? b6=? 5 b5=? 4 b4=? b7=? dT=AbA distance-based method : the weighted least square method (WLS)
Branch lengths estimation with WLS:
minimize
X
ij
w
ij(δ
ij− d
ijT)
2can be solved analytically in
O(n
3)
input distances (estimated from 1 gene alignment) input topology δ =
δ12= 0.140 δ13= 0.163 δ14= 0.288 δ15= 0.336 δ23= 0.188 δ24= 0.413 δ25= 0.411 δ34= 0.298 δ35= 0.213
3 b3=? 1 b1=? 2 b2=? b6=? 5 b5=? 4 b4=? b7=? 3 b3=0.04 1 b1=0.02 2 b2=0.08 b6 =0 .06 5 b5=0.09 4 b4=0.08 b7=0.17 output dT=AbNew methode ERaBLE: Evolutionnary Rates and Branch Length Estimation
Hypothesis: Any gene Gk induces approximately the same tree up to ascalefactor αkand the removal of a number of lineages.
E A B 1,5 3 1,5 1,5E A C B 3 2 1 1 E A C D 2 0,75 0,75 1,5
New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation
Hypothesis: The Proportional model. Any gene Gk induces the same tree up to ascale factor αk, representing the inverse of its evolutionary rate rk.
ˆ α1× δ(1)AE≃ ˆα2× δ(2)AE≃ ˆα3× δ(3)AE E A B 2 2 1 1 E A C B 2 2 1 1 E A C D 2 1 1 2 δAE(1)=6 r1=1 α1=1 δ(2)AE=3.75 r2=0.75 α2=1.33 δAE(3)=9 r3=1.5 α2=0.67 2 1 1 E A C D 2 E A B 1,5 0,75 0,75 1,5 3 1,5 1,5E A C B 3
New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation
Hypothesis:
The Proportional model. Any gene G
kinduces the same tree up to
a
scale factor α
k, representing the inverse of its evolutionary rate r
k.
ˆ
α
1× δ
(1)AE≃ ˆ
α
2× δ
(2)AE≃ ˆ
α
3× δ
(3)AE E A B 2 2 1 1E
A
C
B
2
2 1 1 E A C D 2 1 1 2δ
AE(1)=6
r
1=1 α
1=1
δ
(2)AE=3.75
r
2=0.75 α
2=1.33
δ
AE(3)=9
r
3=1.5 α
2=0.67
2 1 1 E A C D 2 E A B 1,5 0,75 0,75 1,5 3 1,5 1,5E A C B 3New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation
Hypothesis:The Proportional model. Any gene Gkinduces the same tree up to ascale factor αk, representing the inverse of its evolutionary rate rk.
ˆ α1× δ(1)AE≃ ˆα2× δ(2)AE≃ ˆα3× δ(3)AE E A B 2 2 1 1 E A C B 2 2 1 1E A C D 2 1 1 2 δ(1)AE=6 r1=1 α1=1 δAE(2)=3.75 r2=0.75 α2=1.33 δ(3)AE=9 r3=1.5 α2=0.67 2 1 1E A C D 2 E A B 1,5 0,75 0,75 1,5 3 1,5 1,5E A C B 3 After scaling ×α1=1 = ×α2=1.33 = ×α3=0.67 = E A C D B b3=? b5=? b4=? b6 =? b7=? b1=? b2=?
New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation
Hypothesis: Any gene Gk induces approximately the same tree up to ascalefactor αkand the removal of a number of lineages.
E A B 1,5 3 1,5 1,5E A C B 3 2 1 1 E A C D 2 0,75 0,75 1,5
New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation
Hypothesis: The Proportional model. Any gene Gk induces the same tree up to ascale factor αk, representing the inverse of its evolutionary rate rk.
ˆ α1× δ(1)AE≃ ˆα2× δ(2)AE≃ ˆα3× δ(3)AE E A B 2 2 1 1 E A C B 2 2 1 1 E A C D 2 1 1 2 δAE(1)=6 r1=1 α1=1 δ(2)AE=3.75 r2=0.75 α2=1.33 δAE(3)=9 r3=1.5 α2=0.67 2 1 1 E A C D 2 E A B 1,5 0,75 0,75 1,5 3 1,5 1,5E A C B 3
New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation
Hypothesis:
The Proportional model. Any gene G
kinduces the same tree up to
a
scale factor α
k, representing the inverse of its evolutionary rate r
k.
ˆ
α
1× δ
(1)AE≃ ˆ
α
2× δ
(2)AE≃ ˆ
α
3× δ
(3)AE E A B 2 2 1 1E
A
C
B
2
2 1 1 E A C D 2 1 1 2δ
AE(1)=6
r
1=1 α
1=1
δ
(2)AE=3.75
r
2=0.75 α
2=1.33
δ
AE(3)=9
r
3=1.5 α
2=0.67
2 1 1 E A C D 2 E A B 1,5 0,75 0,75 1,5 3 1,5 1,5E A C B 3New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation
Hypothesis:The Proportional model. Any gene Gkinduces the same tree up to ascale factor αk, representing the inverse of its evolutionary rate rk.
ˆ α1× δ(1)AE≃ ˆα2× δ(2)AE≃ ˆα3× δ(3)AE E A B 2 2 1 1 E A C B 2 2 1 1E A C D 2 1 1 2 δ(1)AE=6 r1=1 α1=1 δAE(2)=3.75 r2=0.75 α2=1.33 δ(3)AE=9 r3=1.5 α2=0.67 2 1 1E A C D 2 E A B 1,5 0,75 0,75 1,5 3 1,5 1,5E A C B 3 After scaling ×α1=1 = ×α2=1.33 = ×α3=0.67 = E A C D B b3=? b5=? b4=? b6 =? b7=? b1=? b2=?
New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation
Hypothesis: Any gene Gk induces approximately the same tree up to ascalefactor αkand the removal of a number of lineages.
E A B 1,5 3 1,5 1,5E A C B 3 2 1 1 E A C D 2 0,75 0,75 1,5
New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation
Hypothesis: The Proportional model. Any gene Gk induces the same tree up to ascale factor αk, representing the inverse of its evolutionary rate rk.
ˆ α1× δ(1)AE≃ ˆα2× δ(2)AE≃ ˆα3× δ(3)AE E A B 2 2 1 1 E A C B 2 2 1 1 E A C D 2 1 1 2 δAE(1)=6 r1=1 α1=1 δ(2)AE=3.75 r2=0.75 α2=1.33 δAE(3)=9 r3=1.5 α2=0.67 2 1 1 E A C D 2 E A B 1,5 0,75 0,75 1,5 3 1,5 1,5E A C B 3
New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation
Hypothesis:
The Proportional model. Any gene G
kinduces the same tree up to
a
scale factor α
k, representing the inverse of its evolutionary rate r
k.
ˆ
α
1× δ
(1)AE≃ ˆ
α
2× δ
(2)AE≃ ˆ
α
3× δ
(3)AE E A B 2 2 1 1E
A
C
B
2
2 1 1 E A C D 2 1 1 2δ
AE(1)=6
r
1=1 α
1=1
δ
(2)AE=3.75
r
2=0.75 α
2=1.33
δ
AE(3)=9
r
3=1.5 α
2=0.67
2 1 1 E A C D 2 E A B 1,5 0,75 0,75 1,5 3 1,5 1,5E A C B 3New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation
Hypothesis:The Proportional model. Any gene Gkinduces the same tree up to ascale factor αk, representing the inverse of its evolutionary rate rk.
ˆ α1× δ(1)AE≃ ˆα2× δ(2)AE≃ ˆα3× δ(3)AE E A B 2 2 1 1 E A C B 2 2 1 1E A C D 2 1 1 2 δ(1)AE=6 r1=1 α1=1 δAE(2)=3.75 r2=0.75 α2=1.33 δ(3)AE=9 r3=1.5 α2=0.67 2 1 1E A C D 2 E A B 1,5 0,75 0,75 1,5 3 1,5 1,5E A C B 3 After scaling ×α1=1 = ×α2=1.33 = ×α3=0.67 = E A C D B b3=? b5=? b4=? b6 =? b7=? b1=? b2=?
New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation
Input data: m distance vectors δ1,. . . , δmwhere δk= (δ(ijk))is defined on thetaxa set Lk.
A given topology T defined on the taxa set L =
m
[
k=1
Lk.
Objective: find
the branch lengths ˆb1,. . . , ˆb2n−3of T ,
the scale factors ˆα1,. . . , ˆαmof the m genes,
solution of the problem:
minimize mX
k=1X
i,j∈Lk w(k) ij ( ˆαkδ(ijk)− dijT)2 subjet to mX
k=1 Zkαˆk= 1 Outputs: ˆ be et ˆrk= 1 ˆ αk .New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation
Input data: m distance vectors δ1,. . . , δmwhere δk= (δ(ijk))is defined on thetaxa set Lk.
A given topology T defined on the taxa set L =
m
[
k=1
Lk.
Objective: find
the branch lengths ˆb1,. . . , ˆb2n−3of T ,
the scale factors ˆα1,. . . , ˆαmof the m genes,
solution of the problem:
minimize mX
k=1X
i,j∈Lk w(k) ij ( ˆαkδ(ijk)− dijT)2 subjet to mX
k=1 Zkαˆk= 1 Outputs: ˆ be et ˆrk= 1 ˆ αk .New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation
Input data: m distance vectors δ1,. . . , δmwhere δk= (δ(ijk))is defined on thetaxa set Lk.
A given topology T defined on the taxa set L =
m
[
k=1
Lk.
Objective: find
the branch lengths ˆb1,. . . , ˆb2n−3of T ,
the scale factors ˆα1,. . . , ˆαmof the m genes,
solution of the problem:
minimize mX
k=1X
i,j∈Lk w(k) ij ( ˆαkδ(ijk)− dijT)2 subjet to mX
k=1 Zkαˆk= 1 Outputs: ˆ be et ˆrk= 1 ˆ αk .New methode ERaBLE: Evolutionnary Rates and Branch Length Estimation
Input data: m distance vectors δ1,. . . , δmwhere δk= (δ(ijk))is defined on thetaxa set Lk.
A given topology T defined on the taxa set L =
m
[
k=1
Lk.
Objective: find
the branch lengths ˆb1,. . . , ˆb2n−3of T ,
the scale factors ˆα1,. . . , ˆαmof the m genes,
solution of the problem:
minimize mX
k=1X
i,j∈Lk w(k) ij ( ˆαkδ(ijk)− dijT)2 subjet to mX
k=1 Zkαˆk= 1 Outputs: ˆ be et ˆrk= 1 ˆ αk .Resolution of the optimization problem
The minimization problem can be solved with the method of Lagrange multipliers and leads to the linear system inO(n + m)equations and unknowns:
1 δt 1W1δ1 0 · · · 0 0 δt 2W2δ2 ... ... ... ... 0 · · · · · · δt mWmδm ] ] ] ] − A t 1W 1δ 1 − A t 2W 2δ 2 .. . − A t m W m δ m [ [ [ [ Z1 Z2 · · · Zm [ −δt 1W1A1 ] [ −δt 2W2A2 ] ... [ −δt mWmAm ] 1 1 ... 1 m X k=1 At kWkAk 0 ... ... ... 0 0 · · · 0 0 α1 α2 ... αm b1 b2 ... ... ... b2n−3 λ = 0 ... ... ... ... ... ... ... 0 1 m=# genes n=# taxa Filling the matrix and standard system solving inO((n + m)3+mn4)
Solving inO(n3+mn2)with our algorithms:
Filling the matrix inO(mn2)instead ofO(mn4)
Resolution of the optimization problem
The minimization problem can be solved with the method of Lagrange multipliers and leads to the linear system inO(n + m)equations and unknowns:
1 δt 1W1δ1 0 · · · 0 0 δt 2W2δ2 ... ... ... ... 0 · · · · · · δt mWmδm ] ] ] ] − A t 1W 1δ 1 − A t 2W 2δ 2 .. . − A t m W m δ m [ [ [ [ Z1 Z2 · · · Zm [ −δt 1W1A1 ] [ −δt 2W2A2 ] ... [ −δt mWmAm ] 1 1 ... 1 m X k=1 At kWkAk 0 ... ... ... 0 0 · · · 0 0 α1 α2 ... αm b1 b2 ... ... ... b2n−3 λ = 0 ... ... ... ... ... ... ... 0 1 m=# genes n=# taxa Filling the matrix and standard system solving inO((n + m)3+mn4)
Solving inO(n3+mn2)with our algorithms:
Filling the matrix inO(mn2)instead ofO(mn4)
Block-solving system inO(n3)instead ofO(n + m)3
Complexity: ERaBLEO(n3+mn2)vs. WLS O(n3)
Results on the OrthoMaM dataset [Douzery et al. 2014]
input data: m=6953 nucleotide exon alignments over n=4 to 40 mammals.
Conclusion
ERaBLE gives branch lengths to phylogenomic trees (e.g. estimated with supertree methods).
ERaBLE is relatively accurate in the estimations of branch lengths and of gene rates.
Thank you for your attention.
Any questions?