Université Libre de Bruxelles, Faculté des Sciences

Models and Methods in Molecular Phylogenetics

Daniele Catanzaro

——–

Service Graphes et Optimisation Mathématique (G.O.M.), Département d’Informatique

Supervisor

Prof. Martine Labbé

Academic year 2007-2008

Thesis presented for the degree of Doctor of Science


This work consists of contributions in molecular phylogenetics. Specifically:

Characterization of evolutionary models: The thesis provides an analysis of the nucleotide and codon evolutionary substitution models from the point of view of systems theory. It investigates the mathematical conditions for their applicability and the corresponding biological interpretation.

Mathematical programming approach to the minimum evolution principle: The thesis provides a number of mixed integer programming formulations for the minimum evolution principle of phylogenetic estimation.

Applications: The thesis proposes a number of approximate algorithms to tackle the problem of phylogenetic estimation under different criteria. Namely, an Ant Colony Optimization (ACO) algorithm for minimum evolution, a non-isomorphic tree enumerator, and a very large-scale neighborhood search algorithm are introduced for the first time in molecular phylogenetics.


I acknowledge financial support from a number of institutions. In particular: from September 2003 to June 2004 my Ph.D. studies were funded by the Metaheuristic Network, a Training and Research Network funded by the Improving Human Potential Programme of the European Community (grant HPRN-CT-1999-00106). The work was carried out at I.R.I.D.I.A. labs, Université Libre de Bruxelles, under the supervision of Prof. Marco Dorigo. From July 2004 to September 2005 my research was supported by the Communauté Française de Belgique (ARC 11649/20022770) and the Région Wallonne. The project was carried out at U.E.G. labs, Université Libre de Bruxelles, under the supervision of Prof. Michel C. Milinkovitch. Since October 2005 I have had the unique opportunity to hold the mandate of Aspirant F.N.R.S. (A 2/5- MCF/DM - A 506) funded by the Belgian National Funds for Scientific Research.

Thanks to Prof. Martine Labbé for her teachings and for continually spurring me to improve my skills through ideas, suggestions, and brainstorming sessions. She gave me the opportunity to meet several people in the Operations Research field, to attend interesting talks and conferences, and helped me to clarify many concepts that were unclear in my background. I am very grateful to her.

Thanks to Prof. Marco Dorigo who supervised me during the year spent at I.R.I.D.I.A. Marco believed in my potential and allowed me to start, with the DEA, the long path that led me to write this thesis. I really appreciated his strict teaching.

Thanks to Prof. Raffaele Pesenti, for the second time in a thesis written by me. He first introduced me to the beautiful world of Operations Research and motivated me to pursue the Ph.D. I want to acknowledge here not only the value that Raffaele kept adding to my research through comments, discussions, and precious hints, but also the personal support he always provided me.

Thanks to Prof. Guy Latouche for the support he gave me in clarifying and writing the introduction and the part relative to the stochastic models. I am grateful to him for the time he dedicated to me.


Thanks to Prof. Michel C. Milinkovitch for introducing me to the fascinating world of molecular biology. He encouraged my initiatives and gave me the chance to assist him in various tasks such as replying to reviewers, writing papers, preparing posters, speaking at conferences.

Thanks to Dr. Patrick Mardulyn who kindly and patiently supported me in many ways, such as answering specific questions, suggesting possible research ideas, and providing counterexamples to wrong hypotheses. His help was truly precious.

Thanks to Prof. Mike Steel, Prof. Joe Felsenstein, Prof. Juan José Salazar-Gonzales, Prof. Yves van de Peer, Prof. Cecilia Lanave, Prof. Allen Rodrigo, Prof. Luc Vanhamme, Dr. Alexandros Stamatakis, Dr. Tal Pupko, Dr. Chantal Korostensky, Dr. Anne-Marie Marini, for the support they gave me in kindly answering my (not always simple) questions.

Thanks to Prof. Hugues Bersini, Prof. Gianluca Bontempi, Prof. Leo Oberdan, Prof. Muriel Moser for the support they gave me in the final (and more difficult) period of my Ph.D. career.

I wish to thank all the people I have worked with at I.R.I.D.I.A. labs, G.O.M. labs, U.E.G. labs and I.B.M.M. in general: Prof. Bernard Fortz, Dr. Thomas Stützle, Dr. Elio Tuci, Dr. Mauro Birattari, Dr. Bruno Marchal, Dr. Vito Trianni, Roderich Groß, Shervin Nouyan, Christos Ampatzis, Christian Blum, Davide Faconti, Géraldine Heilporn, Alessandra Godi, Vincent Ho, Aykut Özsoy, Prasanna Balaprakash, Max Manfrin, Carlotta Piscopo, Rosa Maria Figueiredo, Justine Sciarrabone, Julien Guglielmini, Laurent Gatto, Athanasia Tzika, Raphaël Helaers, Daniel Monteyne, Marie-Anne Vaesen, Insa Cassen, Eva D’amico, Lydia Spedale, Cathy Jean, Mélanie Boeckstaens, Patrice Godard, Luis Antonio Urrestarazu, Stephan Vissers. On a more personal basis, I wish to thank Elena Sanchez, Claudio Procida, Sandro Calandrino, and Enzo Porcasi.

Thanks to Anne-Sophie for sharing with me each single good or bad moment, for her perseverance, for her patience. I really love her.

This work is dedicated to my family, for all the tragic moments we passed through in these years, for the strength and stubbornness with which we reacted. Special thanks to my father and my mother, for all they gave me and for all they taught me. Their strict education allowed me to get through difficult periods of my life. My approach to the issues discussed in this thesis proudly reflects their teachings.

Brussels, June 16th 2007.


Table of contents
List of figures
List of tables
1 Introduction
2 The minimum evolution criterion of phylogenetic reconstruction
2.1 The minimum evolution problem
2.2 A possible taxonomy of the literature
2.3 Versions of the minimum evolution problem
2.3.1 The least-squares models
2.3.2 The linear programming models
2.4 Approaches to solving the minimum evolution problem
2.4.1 Exact algorithms
2.4.2 Non-exact algorithms
2.5 Statistical consistency of the minimum evolution problem
2.5.1 Statistical consistency of the edge weight estimation models
2.5.2 Bounds on the statistical consistency of approaches to solving the minimum evolution problem
3 Mixed integer programming models for MEP under the linear programming edge weight estimation model
3.1 Structure of the EPT matrices
3.2 The EPT Model
3.3 The Spanning Tree Model
3.4 The Fixed Tree Model
3.5 Lower and upper bounds
3.5.1 Bounds by surrogate relaxation of the EPT Problem
3.5.2 Combinatorial bounds
3.5.3 Further inequalities
3.6 Computational results
4 Approximate algorithms for MEP under the least-squares edge weight estimation model
4.1 An ACO algorithm for the minimum evolution principle under OLS edge weight estimation
4.1.1 A first primal bound
4.1.2 The ant colony optimization algorithm
4.1.3 Discussion
4.1.4 Results and comparison
4.2 On the tree isomorphism
4.3 Enumerating non-isomorphic trees: Overview
4.3.1 From tree codes to EPT matrix: How to generate correct topologies
5 The likelihood principle of phylogenetic estimation: A very large-scale neighborhood approach
5.1 Notation and problem formulation
5.2 Algorithm overview
5.2.1 The VLSN techniques
5.2.2 Minimum cost assignment neighborhood (MCAN)
5.2.3 Minimum cost cycle neighborhood (MCCN) and simple cost cycle neighborhood (SCCN)
5.2.4 Topological optimization phase: Iterated Local Search
5.2.5 Likelihood optimization phase: optimization of edge weights and substitution probabilities
5.3 Results and Discussion
5.3.1 Datasets description
5.3.2 Technical remarks
5.3.3 VLSN techniques performances
5.3.4 Neighborhoods efficiency and heuristic dynamics
5.3.5 Proxies efficiency
5.3.6 Numerical comparisons
5.3.7 On the multi-modal nature of the likelihood function
6 Conclusion and Perspectives
References
7 Appendix: On the estimation procedures for the GTR model
7.1 Introduction
7.2 On mathematical assumptions of the GTR model
7.3 On the congruency between P and the GTR model
7.4 Biological interpretation of Sylvester’s conditions
7.5 Sufficient conditions to exclude incongruent estimations of F^#
7.6 An alternative estimation procedure
7.6.1 Consequences on evolutionary distances
7.6.2 Goodness of solutions
7.6.3 A numerical example
7.7 Substitution models for codons evolution: Overview
7.8 A possible estimation method for codon models: Overview

List of figures

1.1 An example of a phylogeny
2.1 Mapping n-ary trees
2.2 An example of phylogeny with five taxa
2.3 Bipartition of a phylogeny
2.4 Clustering heuristics
3.1 An example of phylogeny with five taxa
3.2 Example of two complementary phylogenies
3.3 An example of unlabeled phylogeny
3.4 An example of a comb with 10 leaves
3.5 Topological isomorphism
4.1 High-level C-like pseudo-code for the Sequential Addition heuristic
4.2 High-level pseudo-code for the ACO algorithm
4.3 Principle of the ACO-ME algorithm
4.4 Tuning of the parameters of ACO-ME part I
4.5 Tuning of the parameters of ACO-ME part II
4.6 Tuning of the parameters of ACO-ME part III
4.7 Comparison of the performances between ACO-ME and PAUP part I
4.8 Comparisons of the performances between ACO-ME and PAUP part II
4.9 Isomorphism among phylogenies
4.10 Tree codes
4.11 An EPT matrix non-coding a tree
4.12 Example of possible unrooted binary tree for four leaves
5.1 An example of a phylogenetic graph with four leaves
5.2 An example of phylogeny with five taxa
5.3 An example of a rooted phylogeny
5.4 The minimum assignment neighborhood (MCAN)
5.5 The minimum cost cycle neighborhood (MCCN)
5.6 High-level pseudo-code for the Iterated Local Search
5.7 High-level pseudo-code for the numerical Simulated Annealing (SA)
5.8 Relative efficiencies of different cooling schedules
5.9 An example of NNI swap
5.10 Average negative log-likelihood progression for 64 taxa
5.11 Average negative log-likelihood progression for 300 taxa
5.12 Relative contribution of the various VLSN operators
5.13 Negative log-likelihood progression of VLSN search on different datasets
5.14 VLSN vs. RAxML-VI

List of tables

2.1 An example of EPT matrix
2.2 Articles by subject matter
2.3 Articles by type of edge weight estimation method
2.4 Articles by type of solution method
2.5 Consistency of different versions of the minimum evolution criterion
2.6 Solution approaches versus safety radius
2.7 Articles dealing with the statistical consistency problem
3.1 An example of EPT matrix
3.2 Number of isomorphic and non-isomorphic trees
3.3 Experimental results
3.4 Experimental results
3.5 Experimental results
3.6 Experimental results
5.1 Details of the datasets used for experimental analysis
5.2 Possible Cooling schedules implemented in the Simulated Annealing
5.3 Average number of VLSN calls for 64 taxa
5.4 Average number of VLSN calls for 300 taxa
5.5 Analysis of the proxy efficiency
7.1 The four conditions of Sylvester’s criterion
7.2 Amino acid chemical-physical properties (see [35, 89, 157])


Introduction

Molecular phylogenetics studies the hierarchical evolutionary relationships among organisms by means of molecular data. These relationships are typically described through a weighted tree, called a phylogeny (see Figure 1.1), whose leaves represent taxa, internal vertices the intermediate ancestors, edges the evolutionary relationships between pairs of organisms, and edge weights the evolutionary distances (i.e., measures of the dissimilarity) between pairs of organisms [59].

Phylogenetic estimation has been practiced since Darwin [67]. Initially, physical characters of taxa, such as morphology or physiology, were used to estimate the corresponding phylogeny [134]. Nowadays, phylogenetic estimation can also be carried out through the use of molecular data extracted from taxa, such as protein fragments, DNA and RNA sequences, or (less frequently) the whole genome [134].

Since no one could observe the real evolutionary process over thousands or millions of years, in general there is no way to validate one phylogeny from among plausible alternatives [67]. For this reason, researchers have proposed various criteria of molecular phylogenetic estimation [59]. These criteria can usually be expressed in terms of objective functions, and the phylogenies that optimize them are referred to as optimal [67]. An optimal phylogeny is validated by computing a measure of the dissimilarity from the true phylogeny (i.e., the phylogeny that one would obtain if all the molecular data from the set of taxa were available [59]). If the measure of the dissimilarity asymptotically tends to zero as the amount of molecular data analyzed increases, then the corresponding criterion is said to be statistically consistent [67].

Historically, the first criterion to be proposed was the parsimony criterion [28]. This criterion is based on Ockham’s razor and states that when multiple competing explanations of an observed phenomenon are equal in other respects, the one requiring the fewest assumptions should be preferred [134]. Hence, under the parsimony criterion, a phylogeny is defined to be optimal (or the most parsimonious) if for each path from one leaf to another the minimal number of mutations occurs [75]. Finding the most parsimonious phylogeny for a set of taxa is NP-hard [75]; as a further drawback, in some circumstances (see [59], p. 113), this criterion may be statistically inconsistent [55, 120].

Fig. 1.1. An example of a phylogeny. The leaves are the taxa (observed species: Homo, Pan, Gorilla, Pongo and Macaca), the internal vertices are hypothetical ancestors, and the weighted edges are evolutionary relationships.

The distance-based criterion [59] was then proposed as a possible alternative to the parsimony criterion. Specifically, the distance-based criterion aims to find a phylogeny that best fits a given matrix of evolutionary distances among pairwise molecular data [59]. Different definitions of “fitting” give rise to different distance-based criteria [59]. One of the most important distance-based criteria is that of Minimum Evolution (ME). This criterion states that the optimal phylogeny for a set of taxa is the one whose sum of edge weights, estimated from the corresponding evolutionary distances, is minimal [66, 128]. Phylogenies satisfying the ME criterion are determined by solving a Minimum Evolution Problem (MEP) whose versions depend on how the edge weight estimation is performed. The ME criterion has the benefit of generally being statistically consistent, although finding the phylogeny that satisfies the ME criterion is still NP-hard [40].

More recently, the likelihood criterion [56] was introduced. This criterion states that among many plausible explanations of an observed phenomenon, the one with the highest probability of occurring should be preferred [56]. Hence, under the likelihood criterion, a phylogeny is defined to be optimal (or the most likely) if it has the highest probability of explaining the observed taxa [56]. In contrast to the parsimony criterion, the likelihood criterion has the important benefit of being statistically consistent ([67], p. 50). Again, however, the problem of finding the most likely phylogeny is NP-hard [34, 123]. A current extension of the likelihood criterion, Bayesian inference [50, 86, 100], uses prior and posterior probabilities to measure the quality of the phylogeny provided ([67], p. 67-69). Just as with the likelihood criterion, however, finding the optimal phylogeny using Bayesian inference is NP-hard [59].

In this thesis we provide a critique of the mathematical models used to describe molecular evolution and propose exact and approximate algorithms to estimate phylogenies under the minimum evolution and the likelihood criteria. The first chapter is dedicated to a review of the available literature on MEP. Specifically, we provide a possible taxonomy of the different versions of MEP according to the type of objective function, constraints, and methods used to estimate edge weights. Finally, we present a synopsis of MEP by reviewing the most widely used approaches to solving the various versions of MEP and discussing the statistical consistency of the phylogenies they provide.

The second chapter describes a number of mixed integer programming models to estimate phylogenies under the minimum evolution criterion. Specifically, we study the structures of the Edge Path Tree (EPT) matrices and exploit these properties in order to obtain an ad hoc linear model characterized by a polynomial number of integer and continuous variables. We then describe a second model characterized by a smaller (polynomial) number of integer and continuous variables but containing two exponential sets of constraints. We also show a third model, characterized by a polynomial number of integer and continuous variables, exploiting the isomorphism among phylogenies. Finally, we investigate some combinatorial lower bounds and analyze the results obtained by numerical experiments on some real instances.

The third chapter introduces approximate algorithms to estimate phylogenies under the minimum evolution criterion. Specifically, the Ant Colony Optimization (ACO) metaheuristic is at the core of the first approximate algorithm. ACO is an optimization technique inspired by the foraging behavior of real ant colonies. This behavior is exploited in artificial ant colonies to search for approximate solutions to discrete optimization problems. We adapt ACO to the phylogenetic estimation problem and show that it is potentially competitive in comparison with state-of-the-art algorithms. A second approximate algorithm is based on the generation of non-isomorphic phylogenies. We give an overview of the techniques that can be used to implement this generation and provide useful speed-up strategies to generate correct phylogenies from the EPT matrices.

The fourth chapter introduces Very Large-Scale Neighborhood (VLSN) techniques for phylogenetic estimation under the maximum likelihood criterion. VLSN techniques belong to the class of local search algorithms, and are characterized by a neighborhood size that grows exponentially with the input data. The underlying idea of VLSN techniques is that a larger neighborhood improves the quality of the locally optimal solutions found. These techniques require efficient (typically polynomial) means to search for the best local optimum inside a neighborhood of a given solution. We adapt these techniques to estimate phylogenies of large datasets of nucleotide sequences under the maximum likelihood criterion. We show that the use of the VLSN techniques speeds up convergence to topological local optima, and increases the overall performance of stochastic-based search algorithms.


Finally, in the appendix we formally characterize the mathematical conditions, and discuss their biological interpretation, which lead to the inapplicability of the General Time Reversible (GTR) model of DNA (codon) sequence evolution. We investigate the relations between, on one hand, the occurrence of negative eigenvalues and, on the other hand, both sequence length and sequence divergence. We then propose a possible re-formulation of procedures currently used to estimate distances from nucleotide sequences in terms of a non-linear optimization problem. We analytically investigate the effect of our approach on the estimated evolutionary distances and transition probability matrix. Finally, we provide an analysis on the quality of the solution we propose.

Some chapters of this thesis are partially based on articles that have been published or submitted for publication. Specifically:

Chapter 1 is contained in:

D. Catanzaro. The minimum evolution problem: Overview and classification. Networks. To appear.

Chapter 2 is contained in:

D. Catanzaro, M. Labbé, R. Pesenti, and J. J. Salazar. Mathematical models to reconstruct phylogenetic trees. Networks. To appear.

Chapter 3 is contained in:

D. Catanzaro, R. Pesenti, and M. C. Milinkovitch. An ant colony optimization algorithm to estimate phylogenies under the minimum evolution principle. BMC Evolutionary Biology 7:228, 2007.

Chapter 4 is contained in:

D. Catanzaro, R. Pesenti, and M. C. Milinkovitch. Estimating phylogenies under maximum likelihood: A very large-scale neighborhood approach. Currently submitted for publication.

The appendix is contained in:

D. Catanzaro, R. Pesenti, and M. C. Milinkovitch. A non-linear optimization procedure to estimate distances and instantaneous substitution rate matrices under the GTR model. Bioinformatics 2006, 22(6):708-715.

L. Gatto, D. Catanzaro, and M. C. Milinkovitch. Assessing the applicability of the GTR nucleotide substitution model through simulations. Evolutionary Bioinformatics Online 2006, 2:153-163.


The minimum evolution criterion of phylogenetic reconstruction

Parallel and back mutations are rare at the molecular level in well-conserved molecular data [116], i.e., data whose basic biochemical function has undergone minimal change throughout the evolution of a species. Hence, in the absence of convergent or reverse evolution, a minimum path from one taxon to another (i.e., a process requiring the minimal amount of mutation) can approximate the corresponding real evolutionary process [13]. However, in the long term (periods of environmental change, including the intracellular environment), the union of minimum paths will not form an overall minimum due to the presence of changing selective forces [13]. Therefore, a minimal length phylogeny provides a lower bound on the total number of mutation events that could have occurred along the evolution of the taxa analyzed. These are the fundamental considerations at the core of the Minimum Evolution (ME) criterion of phylogenetic estimation. In this chapter we formally state the minimum evolution criterion as an optimization problem and provide a possible taxonomy of the literature about it.

2.1 The minimum evolution problem

In this section we formally state the minimum evolution criterion as an optimization problem.

To this end, we first introduce some preliminary definitions that will be useful throughout the thesis.

Denote by Γ the set of n organisms (taxa) to be analyzed, and consider an unweighted graph $G = (V, E)$, called a phylogenetic graph, where $V = V_e \cup V_i$ is the set of vertices, $V_e$ is the set of n leaves representing the n taxa in Γ, and $V_i$ is the set of internal vertices representing the common ancestors. By analogy, $E = E_e \cup E_i$ is the set of edges, where $E_e$ is the set of external edges, i.e., the set of edges with one extreme being a leaf, and $E_i$ is the set of internal edges, i.e., the set of edges with both extremes being internal vertices. Then a phylogeny of the set Γ is any spanning tree T of G such that each internal vertex has degree three, and each leaf has degree one. Denote by $E(T) = E_e(T) \cup E_i(T)$ the set of edges of a phylogeny T where, by analogy, $E_e(T)$ and $E_i(T)$ are the sets of external and internal edges of T, respectively. Since a phylogeny T is a tree, the following property holds:

$$|E_i(T)| + |E_e(T)| = |V_i| + |V_e| - 1. \tag{2.1}$$

Moreover, since a phylogeny T is characterized by having internal vertices with degree equal to three, the following property holds:

$$2|E_i(T)| + 2|E_e(T)| = 3|V_i| + |V_e|. \tag{2.2}$$

Combining (2.1) and (2.2) it follows that $|V_i| = n - 2$ and $|E_i(T)| = n - 3$.

Fig. 2.1. The 4-ary tree (on the left) can be transformed into a phylogeny by adding a dummy vertex and edge (dashed, on the right).

It is worth noting that there is no biological reason for imposing the degree constraint on the internal vertices of a phylogeny. Nevertheless, the constraint is usually imposed as it simplifies the formulation of the minimum evolution problem [142]. Moreover, the constraint is not an oversimplification since any m-ary tree (i.e., a tree characterized by having internal vertices of degree m) can be transformed into a phylogeny by adding “dummy” vertices and edges (e.g., see Figure 2.1 and [142]).

We denote by $\mathcal{T}$ the set of all the possible $(2n-5)!!$ phylogenies of Γ (where $n!!$ is the double factorial of n) [59], and we assume that a weight function $f : E(T) \to \mathbb{R}$ is given. We denote by w the $(2n-3)$-vector of edge weights associated with a phylogeny T, by $p_{ij}$ the path from vertex i to vertex j in T, and by $L(T)$ the length of T, i.e., the sum of the associated edge weights. Finally, we define a phylogeny T with weights w as a metric tree if all the entries of w are non-negative [150].

We assume that an $n \times n$ distance matrix $D = \{d_{ij}\}$ of evolutionary distances [27] between each pair of taxa i and j in Γ is given a priori. Such evolutionary distances measure the dissimilarity between pairwise molecular data, and are usually computed on the basis of a given Markov substitution model of molecular evolution (e.g., those described in [27, 59, 79, 88, 93, 99, 124, 149]) or, more rarely, by means of metric models (e.g., those described in [13, 91]). We say that a distance matrix D is a dissimilarity matrix if for each pair of distinct taxa i and j, $d_{ij} > 0$, $d_{ij} = d_{ji}$ and $d_{ii} = 0$ [150]. In addition, we say that a dissimilarity matrix D is metric if the triangle inequality also holds [57]:

$$d_{ij} \le d_{ik} + d_{kj} \qquad \forall\, i, j, k \in \Gamma. \tag{2.3}$$

Metric distance matrices are more likely to be generated when, for example, covarion models [60, 62, 85, 105] of molecular evolution (see [43]), or the models described in [13, 91] are used.
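The two definitions above translate directly into checks on a matrix stored as a list of rows. This is a minimal sketch under assumptions of ours: the function names and the numerical tolerance `tol` are illustrative and are not part of the thesis.

```python
from itertools import permutations


def is_dissimilarity(d):
    """d_ii = 0, and d_ij > 0 with d_ij = d_ji for every pair of distinct taxa."""
    n = len(d)
    return all(
        d[i][i] == 0
        and all(d[i][j] > 0 and d[i][j] == d[j][i] for j in range(n) if j != i)
        for i in range(n)
    )


def is_metric(d, tol=1e-12):
    """A dissimilarity matrix that also satisfies the triangle inequality (2.3)."""
    n = len(d)
    return is_dissimilarity(d) and all(
        d[i][j] <= d[i][k] + d[k][j] + tol
        for i, j, k in permutations(range(n), 3)
    )


# Hypothetical 3-taxon matrices: the first violates (2.3) since 5 > 1 + 1.
not_metric = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]
metric = [[0, 2, 3], [2, 0, 4], [3, 4, 0]]
print(is_dissimilarity(not_metric), is_metric(not_metric), is_metric(metric))
```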

We say that a dissimilarity matrix D is additive if there exists a metric tree phylogeny such that the sum of edge weights along the path between leaves i and j is equal to $d_{ij}$, for all $i, j \in \Gamma$ [150]. Define $\delta_{ij} = \sum_{e \in p_{ij}} w_e$ for any pair of vertices i and j of V. Then the following proposition holds:

Proposition 2.1. If $T = (V, E)$ is a metric tree phylogeny then for any edge e inducing a bipartition of V in two subsets A and B such that vertices $i, j \in A$ and vertices $z, t \in B$, the condition $\delta_{ij} + \delta_{zt} \le \max\{\delta_{iz} + \delta_{jt},\ \delta_{it} + \delta_{jz}\}$ is satisfied.

Proof. Select an edge e of T inducing a bipartition of V into two subsets A and B such that vertices i and j belong to A, and vertices z and t belong to B. Since T is a metric tree phylogeny we have:

$$\delta_{ij} + \delta_{zt} \le \delta_{iz} + \delta_{jt}, \qquad \delta_{ij} + \delta_{zt} \le \delta_{it} + \delta_{jz}.$$

In fact, any path between two vertices not lying in the same subset of the partition crosses the edge e in T, whose weight is non-negative by hypothesis. Hence, the proposition follows. □

It is easily seen that the following corollary holds:

Corollary 2.2. Let D be a distance matrix. If $T = (V, E)$ is a metric tree phylogeny such that $\delta_{ij} = d_{ij}$ for all $i, j \in \Gamma$, then for any four distinct taxa i, j, z and t such that there exists one edge e in T inducing a bipartition of V in two subsets A and B such that $i, j \in A$ and $z, t \in B$, D satisfies the following condition:

$$d_{ij} + d_{zt} \le \max\{d_{iz} + d_{jt},\ d_{it} + d_{jz}\}. \tag{2.4}$$

Condition (2.4) is known as the four-point condition [150]. Buneman [25, 44] showed that the four-point condition is not only necessary but also sufficient for D to be additive. An intuition of the four-point condition can be obtained by considering the phylogeny shown in Figure 2.2. If a phylogeny is a metric tree then the following inequality holds:

$$d_{AB} + d_{CD} = e_A + e_B + e_C + e_D \le d_{AD} + d_{BC} = e_A + e_B + e_C + e_D + 2e_1$$

(since $e_1$ by hypothesis is a non-negative quantity). On the other hand, since $d_{AD} = e_A + e_1 + e_D$, $d_{BC} = e_B + e_1 + e_C$, $d_{AC} = e_A + e_1 + e_C$ and $d_{BD} = e_B + e_1 + e_D$, the relation

$$d_{AD} + d_{BC} = d_{AC} + d_{BD}$$

holds, and the four-point condition follows.
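A direct test of the four-point condition on a given matrix can be sketched as follows. Requiring (2.4) for every labelling of a quadruple is equivalent to requiring that, among the three pairwise sums, the two largest coincide, and that is what the code checks. The function name, the tolerance `tol` and the edge weights in the example are assumptions made here for illustration.

```python
from itertools import combinations


def satisfies_four_point(d, tol=1e-9):
    """Check (2.4) on every quadruple: the two largest of the three sums must be equal."""
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted((d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]))
        if sums[2] - sums[1] > tol:
            return False
    return True


# Distances induced by the tree of Figure 2.2 with hypothetical weights
# e_A = 1, e_B = 2, e_C = 3, e_D = 4, e_1 = 5 (taxa ordered A, B, C, D).
additive = [
    [0, 3, 9, 10],
    [3, 0, 10, 11],
    [9, 10, 0, 7],
    [10, 11, 7, 0],
]
# Perturbing d_AC (symmetrically) breaks the condition.
perturbed = [row[:] for row in additive]
perturbed[0][2] = perturbed[2][0] = 12

print(satisfies_four_point(additive), satisfies_four_point(perturbed))
```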

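Fig. 2.2. An example of a phylogeny with four leaves. Leaves A and B lie on one side of the internal edge $e_1$, which joins the internal vertices P and Q, and leaves C and D lie on the other side; $e_A$, $e_B$, $e_C$ and $e_D$ are the external edges incident to the leaves A, B, C and D, respectively.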

We say that an additive distance matrix D is ultrametric if the following inequality holds for any triplet of taxa i, j, k [52]:

$$d_{ij} \le \max\{d_{ik},\ d_{kj}\}. \tag{2.5}$$

Additive distance matrices are generated when, for example, the Markov substitution model of molecular evolution described in [76] is used; ultrametric distance matrices are generated when, for example, the model described in [52] is used. As observed by Farach et al. [52], ultrametric distance matrices are highly desirable in biology since evolutionary distances, as measured in time, satisfy (2.5).
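Inequality (2.5) can be checked on a matrix in the same way as the triangle inequality. The sketch below tests only (2.5) itself and does not verify additivity; the function name, the tolerance and the two small matrices are illustrative assumptions.

```python
from itertools import permutations


def is_ultrametric(d, tol=1e-9):
    """Check d_ij <= max(d_ik, d_kj) for every triplet of distinct taxa."""
    n = len(d)
    return all(
        d[i][j] <= max(d[i][k], d[k][j]) + tol
        for i, j, k in permutations(range(n), 3)
    )


# In an ultrametric matrix the two largest distances of every triplet coincide.
clock_like = [[0, 2, 6], [2, 0, 6], [6, 6, 0]]
not_clock_like = [[0, 2, 6], [2, 0, 5], [6, 5, 0]]
print(is_ultrametric(clock_like), is_ultrametric(not_clock_like))
```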

In accordance with the literature [59], we represent a phylogeny by means of a particular network matrix called an Edge–Path incidence matrix of a Tree (EPT) ([115], p. 550). An EPT matrix X of a phylogeny T is characterized by having a row for each path between two leaves and a column for each edge. The generic entry $x_{ij,e}$ is then equal to 1 if edge e belongs to the path $p_{ij}$ from leaf i to leaf j and is 0 otherwise. As an example, Table 2.1 shows the EPT matrix corresponding to the phylogeny shown in Figure 2.2.

The correspondence between a phylogeny T and the associated EPT matrix X induces a bijection between the set $\mathcal{T}$ of all the possible phylogenies and the set $\mathcal{X}$ of all the possible associated EPT matrices. Hence, with an abuse of notation, in the following we will denote a phylogeny T by means of its associated EPT matrix X. Moreover, we will denote an edge e of T as the corresponding column of X, and a path on T as the corresponding row of X.

      e_A  e_B  e_C  e_D  e_1
AB     1    1    0    0    0
AC     1    0    1    0    1
AD     1    0    0    1    1
BC     0    1    1    0    1
BD     0    1    0    1    1
CD     0    0    1    1    0

Table 2.1. The EPT matrix corresponding to the phylogeny shown in Figure 2.2.
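The construction of an EPT matrix from a tree can be sketched in a few lines. The code below is an illustration under assumptions of ours: the tree is given as a dictionary mapping edge names to their endpoints, the helper name `ept_matrix` is hypothetical, and the assignment of leaves A, B to internal vertex P and of C, D to Q is one labelling consistent with the distances used in the text.

```python
from itertools import combinations


def ept_matrix(edges, leaves):
    """Map each leaf pair (i, j) to its 0/1 row over the edges, in dict order."""
    adj = {}
    for name, (u, v) in edges.items():
        adj.setdefault(u, []).append((v, name))
        adj.setdefault(v, []).append((u, name))

    def path_edges(src, dst):
        # Depth-first search; in a tree the path between two vertices is unique.
        stack = [(src, None, [])]
        while stack:
            vertex, parent, used = stack.pop()
            if vertex == dst:
                return set(used)
            for nxt, name in adj[vertex]:
                if nxt != parent:
                    stack.append((nxt, vertex, used + [name]))
        raise ValueError("vertices are not connected")

    rows = {}
    for i, j in combinations(leaves, 2):
        on_path = path_edges(i, j)
        rows[(i, j)] = [1 if name in on_path else 0 for name in edges]
    return rows


# The four-leaf phylogeny of Figure 2.2 (columns e_A, e_B, e_C, e_D, e_1).
edges = {
    "e_A": ("A", "P"), "e_B": ("B", "P"),
    "e_C": ("C", "Q"), "e_D": ("D", "Q"),
    "e_1": ("P", "Q"),
}
X = ept_matrix(edges, ["A", "B", "C", "D"])
for pair, row in X.items():
    print("".join(pair), row)
```

The rows produced for this tree coincide with those of Table 2.1.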


Given a distance matrix D, the problem of determining a phylogeny that satisfies the ME criterion can be formalized, in its most general form, as follows:

Minimum Evolution Problem (MEP)

$$\begin{aligned} \min_{(X, w)} \quad & L(X, w) \\ \text{s.t.} \quad & f(D, X, w) = 0 \\ & X \in \mathcal{X},\ w \in \mathbb{R}^{(2n-3)}_{0+} \end{aligned}$$

where L(X, w) indicates the length of a phylogeny X with associated edge weights w, and f(D, X, w) is a function correlating the distance matrix D with the phylogeny X and edge weights w. Thus, any version of MEP is completely characterized by defining the functions L(X, w) and f(D, X, w).

Apart from some polynomial cases [52, 150], each version of MEP has been proved to be $\mathcal{NP}$-hard [40, 52, 95, 106, 108, 152]. In the next three sections, we propose a possible taxonomy of these versions and discuss the main approaches proposed to solve them.

2.2 A possible taxonomy of the literature

We classify the literature on MEP according to three main perspectives: the version of the problem, the approach used to solve it, and the statistical consistency of the phylogeny provided. Here, we briefly introduce each perspective and list the corresponding references in Table 2.2.

From the first perspective, we classify the literature on the basis of the type of function f(D, X, w) and the structure of the objective function L(X, w) used. The function f(D, X, w) imposes conditions on the differences (incongruences) between the evolutionary distances derived using the weights w of phylogeny X and the distances defined by matrix D. Some versions in the literature, the so-called least-squares models, require that the sum of the squares of these differences be minimized. Other versions, typically based on linear programming, require that the plain sum of these differences be minimized and, further, that the entries of w be non-negative and satisfy the triangle inequality (2.3). In turn, the least-squares models can be further differentiated on the basis of the presence (or absence) of the positivity constraint, i.e., the non-negativity of edge weights. The positivity constraint has important biological implications; we discuss these, together with the versions of MEP, in Section 2.3.

Secondly, we classify the literature on the basis of the approach used to solve the various versions of MEP. In general, the approaches used are either exact or non-exact. The former includes algorithms based on exhaustive enumeration and on branch-and-bound. The latter can be further divided into approximation algorithms and heuristics. In turn, the approaches based on heuristics can be subdivided into constructive, clustering and constructive/clustering. We discuss this classification in Section 2.4.

Perspective               References
Version                   [10, 13, 23, 24, 29, 30, 42, 43, 52, 61, 65, 67, 68, 69, 80, 83, 91, 97, 107, 108, 118, 127, 128, 135, 142, 147, 148, 150]
Approach to solution      [3, 22, 31, 32, 36, 37, 38, 39, 43, 51, 52, 54, 64, 65, 66, 67, 71, 87, 98, 106, 119, 127, 128, 130, 131, 136, 138, 141, 143, 150, 152, 156]
Statistical consistency   [5, 41, 43, 68, 70, 82, 84, 91, 103, 112, 128, 151]

Table 2.2. References classified by perspective.

Finally, we classify the literature on the basis of the statistical consistency of the phylogenies provided by different versions of MEP. In Section 2.5 we show that some versions of MEP may lead to statistically inconsistent results; for this reason, they are generally frowned upon by the scientific community [41].

2.3 Versions of the minimum evolution problem

The minimum evolution problem can be divided into two subproblems: (i) determining the structure of the optimal phylogeny (i.e., the entries of matrix X), and (ii) finding the associated optimal edge weights (i.e., the entries of w) that best fit the distance matrix D.

Historically, the latter subproblem was among the first aspects of molecular phylogenetics to be studied, and is at the core of MEP. In fact, the edge weight estimation subproblem determines the choice of the function f(D, X, w) and of the structure of function L(X, w), and hence the version of MEP.

The literature proposes two main families of edge weight estimation models (whose references are listed in Table 2.3): the least-squares models and the linear programming models. The former are discussed in Section 2.3.1 and the latter in Section 2.3.2.

Throughout this section, unless explicitly stated otherwise, we assume that the phylogeny X is given.

2.3.1 The least-squares models

The least-squares models were first introduced by Cavalli-Sforza and Edwards in 1967 [29]. The authors considered the evolutionary distances $d_{ij}$ among pairwise molecular data as uniformly distributed independent random variables satisfying the additive property (2.4). In other words, Cavalli-Sforza and Edwards assumed that each entry $d_{ij}$ could be thought of as the sum of the mutation events $w_e$ accumulated on each edge e belonging to the path $p_{ij}$ linking taxa i and j on X, i.e., in matrix form:

$$Xw = D^{\triangle} \qquad (2.6)$$

where $D^{\triangle}$ is the $n(n-1)/2$ vector whose components are obtained by taking row by row the entries of the strictly upper triangular matrix of D. In general, equation (2.6) may not admit solutions. For this reason, the authors proposed the use of Ordinary Least-Squares (OLS) to find the entries of vector w. Specifically, the authors suggested that the values $\rho_{ij} = \sum_{e \in p_{ij}} x_{ij,e} w_e$, called expected distance estimates [59], should minimize the function:

$$\sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} (d_{ij} - \rho_{ij})^2 = \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} \Big( d_{ij} - \sum_{e \in p_{ij}} x_{ij,e} w_e \Big)^2 .$$
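As a small illustration of the OLS estimate, the following sketch solves the normal equations $(X^t X) w = X^t D^{\triangle}$ for the EPT matrix of Table 2.1, using exact rational arithmetic from the Python standard library. The distance values and the helper names `solve_linear_system` and `ols_edge_weights` are assumptions made for this example. The sum here runs over unordered pairs i < j, which halves the function above and leaves its minimizer unchanged.

```python
from fractions import Fraction

# EPT matrix of Table 2.1 (rows AB, AC, AD, BC, BD, CD; columns e_A, e_B, e_C, e_D, e_1).
X = [
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
]


def solve_linear_system(a, b):
    """Solve the square system a y = b by Gauss-Jordan elimination over Fractions."""
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick a row with a non-zero pivot (fails with StopIteration if a is singular).
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[n] for row in m]


def ols_edge_weights(x, d):
    """Return the w minimizing sum_{i<j} (d_ij - rho_ij)^2 via (X^t X) w = X^t d."""
    cols = range(len(x[0]))
    xtx = [[sum(row[p] * row[q] for row in x) for q in cols] for p in cols]
    xtd = [sum(row[p] * dij for row, dij in zip(x, d)) for p in cols]
    return solve_linear_system(xtx, xtd)


# Hypothetical additive distances generated by the edge weights (1, 2, 3, 4, 5):
# d_AB = 3, d_AC = 9, d_AD = 10, d_BC = 10, d_BD = 11, d_CD = 7.
d = [3, 9, 10, 10, 11, 7]
w = ols_edge_weights(X, d)
print([int(v) for v in w])  # the generating weights are recovered exactly
```

Because the example distances are additive on this topology, equation (2.6) is solvable and the least-squares residual is zero; with noisy distances the same call returns the best fit, possibly with negative entries.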

Some authors disagreed with Cavalli-Sforza and Edwards' model. Specifically, Fitch and Margoliash [61] observed that, due to the common evolutionary history of the taxa analyzed and the presence of sampling errors in molecular data, the assumption that the evolutionary distances $\{d_{ij}\}$ are uniformly distributed independent random variables cannot be considered generally true. Therefore, the authors proposed to modify Cavalli-Sforza and Edwards' model by introducing the quantities $\{\omega_{ij}\}$ representing the variances of $\{d_{ij}\}$. Fitch and Margoliash called the new model Weighted Least-Squares (WLS) and proposed to minimize the function:

$$\sum_{i=1}^{n} \sum_{j=1}^{n} \omega_{ij} \Big( d_{ij} - \sum_{e \in p_{ij}} x_{ij,e} w_e \Big)^2 .$$

Fitch and Margoliash [61] proposed to set $\omega_{ij} = 1/d_{ij}^2$, whereas with analogous arguments, Beyer et al. [13] set $\omega_{ij} = 1/d_{ij}$. Under WLS, the function f(D, X, w) of MEP becomes:

$$(X^t \Omega X) w = X^t \Omega D^{\triangle} \qquad (2.7)$$

where Ω is a diagonal matrix with diagonal entries equal to $\{\omega_{ij}\}$.
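For the reader's convenience, a short derivation sketch of (2.7) follows. It assumes that the diagonal entries of Ω are ordered like the components of $D^{\triangle}$ and that the weighted sum of squares is taken over unordered pairs; neither convention is stated explicitly above.

```latex
\begin{align*}
F(w) &= (D^{\triangle} - Xw)^t \, \Omega \, (D^{\triangle} - Xw), \\
\nabla_w F(w) &= -2\, X^t \Omega \, (D^{\triangle} - Xw) = 0, \\
\Longrightarrow \quad (X^t \Omega X)\, w &= X^t \Omega \, D^{\triangle}.
\end{align*}
```

Setting Ω equal to the identity matrix gives back the normal equations of the OLS model.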

Subsequently, Bulmer [24], Chakraborty [30], and Hasegawa et al. [80] noted that the weights $\{\omega_{ij}\}$ account for the variance of $\{d_{ij}\}$, but not for their dependencies. Consequently, they proposed to substitute the values $\{\omega_{ij}\}$ with the covariances of $\{d_{ij}\}$, and called the new model Generalized Least-Squares (GLS). Specifically, Chakraborty [30] modeled molecular evolution as a Poisson process in which mutations are random events occurring along each path in the phylogeny, and derived the covariances of the evolutionary distances by considering path-per-path the number of mutation events observed between pairs of molecular data. A very similar approach was used by Bulmer [24] and Hasegawa et al. [80]: the former used an approximation of the Poisson process to compute the covariances of the evolutionary distances, whereas the latter used a Markov model [79]. Under GLS, the function f(D, X, w) of MEP becomes:

$$(X^t \Psi^{-1} X) w = X^t \Psi^{-1} D^{\triangle} \qquad (2.8)$$

where Ψ is the covariance matrix of the evolutionary distances.

Fig. 2.3. The edge {a, b} induces a bipartition on the set of leaves {A, B1, B2} of the above phylogeny. The value of $w_{ab}$ under the OLS edge weight estimation model is a function of the average distance between {A} and {B1, B2}.

The computational complexity required to solve the above models by means of matrix formulae ($O(n^4)$ for the OLS and WLS models, and $O(n^6)$ for the GLS model [23]) represented a serious bottleneck for their empirical application in the 1970s and 1980s. For this reason, several authors investigated alternative strategies to reduce the computational effort required to implement them.

Vach [147] noted that any edge of a phylogeny X induces a bipartition (also called split [8], see Figure 2.3) on the set of leaves of X, and that such a bipartition can be used to approximate the OLS model. Specifically, given a phylogeny X, Vach proved that: (i) the value of an edge weight $w_e$ is a function of the average distance between the leaves belonging to the two subsets of the bipartition induced by edge e, and (ii) such a value does not depend on the phylogeny but only on the leaves contained in the bipartition [148].

This result was reached independently by Rzhetsky and Nei [128], who provided an $O(n^3)$ algorithm to solve the OLS model [129]. This algorithm was further improved by Gascuel [65], who decreased its order of complexity to $O(n^2)$. Subsequently, Bryant and Waddell [23] proposed a unified and generalized framework to speed up the solution of the OLS, WLS and GLS models. Specifically, the authors provided an optimal algorithm to solve the OLS model, an $O(n^3)$ algorithm to solve the WLS model, and an $O(n^4)$ algorithm to solve the GLS model.
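The bipartition induced by an edge, on which the faster OLS algorithms rely, can be illustrated as follows. This is a minimal sketch: the tree encoding, the distance values and the function names `split_of_edge` and `average_distance` are hypothetical, and the code only computes the split and the unweighted average distance across it, not the edge weight formula itself.

```python
# Hypothetical encoding of the phylogeny of Table 2.1 (u and v are internal vertices).
EDGES = {
    "e_A": ("A", "u"),
    "e_B": ("B", "u"),
    "e_C": ("C", "v"),
    "e_D": ("D", "v"),
    "e_1": ("u", "v"),
}
LEAVES = {"A", "B", "C", "D"}

# Hypothetical distances, keyed by unordered pairs of leaves.
DIST = {
    frozenset("AB"): 3, frozenset("AC"): 9, frozenset("AD"): 10,
    frozenset("BC"): 10, frozenset("BD"): 11, frozenset("CD"): 7,
}


def split_of_edge(edges, leaves, edge_label):
    """Return the two leaf sets obtained by deleting edge_label from the tree."""
    start, _ = edges[edge_label]
    adjacency = {}
    for label, (a, b) in edges.items():
        if label != edge_label:
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)
    # Collect every vertex still connected to `start` once the edge is removed.
    seen, stack = {start}, [start]
    while stack:
        for nxt in adjacency.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    side = seen & leaves
    return side, leaves - side


def average_distance(dist, left, right):
    """Unweighted mean of d_ij over all i in left and j in right."""
    total = sum(dist[frozenset((i, j))] for i in left for j in right)
    return total / (len(left) * len(right))


left, right = split_of_edge(EDGES, LEAVES, "e_1")
print(sorted(left), sorted(right), average_distance(DIST, left, right))
```

For the internal edge e_1 the split separates {A, B} from {C, D}, and the average is taken over the four cross pairs AC, AD, BC and BD.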

Finally, Makarenkov and Lapointe [107] have recently introduced a particular WLS model usable in all cases in which some evolutionary distances are partially given or uncertain (cases usually met, for example, when dealing with fossil data [107]). The model assumes that properties (2.4)-(2.5) hold for the distance matrix D, and assigns $\{\omega_{ij}\} \in \{0, 1/2, 1\}$ as a function of the degree of uncertainty of the entries $\{d_{ij}\}$. The authors proved that solving this particular version of MEP is $\mathcal{NP}$-hard.


The positivity constraint

The additive property of the distance matrix D in Cavalli-Sforza and Edwards' model guarantees that the phylogeny provided by (2.6), (2.7) and (2.8) is a metric tree [52], i.e., it implicitly imposes the constraint w ≥ 0. Unfortunately, when the distance matrix D is generic (e.g., it is obtained by means of Markov models, see Section 2), all the least-squares models considered so far may lead to the occurrence of negative entries in the vector w, i.e., to a phylogeny that is not a metric tree [67, 97]. Negative edge weights are infeasible both from a conceptual point of view (a distance, being an expected number of mutation events over time, cannot be negative [91]) and from a biological point of view (evolution cannot proceed backwards [116, 142]). For the latter reason at least, non-metric tree phylogenies are generally unacceptable to biologists [68].

In response, some authors investigated the consequences of adding or guaranteeing the positivity constraint in the least-squares models. Gascuel and Levy [69] observed that the presence of the positivity constraint transforms ([17], p. 187) any least-squares model into a non-negative linear regression problem which involves projecting the distance matrix D onto the positive cone defined by the set of metric trees associated with a given phylogeny X [68]. Therefore, the authors proposed an algorithm to generate a sequence of least-squares projections of the distance matrix D onto the convex set of the metric trees until an additive distance matrix (and the corresponding phylogeny) is obtained [69]. A similar approach had previously been provided by Hubert and Arabie [83]. Both algorithms are characterized by a computational complexity of $O(n^4)$.

Farach et al. [52] proposed a number of models to perturb a distance matrix D in order to achieve additive or ultrametric matrices. Specifically, the authors proposed a first model in which, given upper and lower bounds on the evolutionary distances, an additive (ultrametric) distance matrix between these bounds must be found. In a second model, the authors assumed a distance matrix D in which some evolutionary distances are partially given or uncertain, and studied the possibility of assigning these entries in order to obtain a new distance matrix that satisfies the additive (ultrametric) property. They proposed the $L_\infty$-norm and $L_1$-norm to constrain the entries of D to satisfy the additive (ultrametric) property. The authors proved that both models can be solved in $O(n^2 + n \log n)$ time when an ultrametric distance matrix is required under the $L_\infty$-norm. By contrast, the authors proved that both models become hard when an ultrametric or an additive distance matrix is required under the $L_1$-norm. However, to the best of our knowledge, nothing is known about the hardness of finding an additive distance matrix under the $L_\infty$-norm in either model.

Barthélemy and Guénoche [10], and Makarenkov and Leclerc [108] proposed two iterative $O(n^4)$ and $O(n^5)$ algorithms which exploit the Lagrangian relaxations of the OLS and WLS models to find a phylogeny that satisfies the positivity constraint. Specifically, starting from a leaf, the algorithms iteratively generate a phylogeny with a growing number of leaves by solving an optimization problem which finds the best non-negative edge weights that minimize the OLS (respectively WLS) model. A previous and similar algorithm, called FITCH, was also proposed by Felsenstein [58].

A different approach from those described above is followed in the Balanced Least-Squares (BLS) edge weight estimation model [42, 43]. The BLS model is based on a seminal work of Pauplin [118] in which the author modifies equation (2.6) by requiring that all edges of a phylogeny X be weighted in the same way (see Section 2.3.1). As a result, the new model allows the positivity constraint to be satisfied if the triangle inequality holds for the distance matrix D [43]. Desper and Gascuel proved that the BLS model is a special form of the WLS model in which the variances of the evolutionary distances are proportional to their topological distance (i.e., the number of edges belonging to the path between the corresponding endpoint taxa), and inversely proportional to the length of the molecular data [43]. Solving a BLS model requires a computational complexity of $O(n^2 + n \log n)$.

The objective function in the least-squares models

Under the least-squares edge weight estimation model, several objective functions have been proposed in the literature. Below, we just give an overview of them, postponing to Section 2.5 our discussion of the impact that each objective function has on the statistical consistency of MEP.

Rzhetsky and Nei showed that if the entries of the distance matrix D were not subjected to sampling errors and all the molecular data from the (set of) taxa were available, then the total length of the true phylogeny must be the shortest [127]. Therefore the authors suggested the use of the following objective function:

$$L(X, w) = \sum_{e=1}^{2n-3} w_e . \qquad (2.9)$$

Some authors proposed modifications to (2.9) to deal with a non-perfectly additive distance matrix D. Specifically, Kidd and Sgaramella-Zonta [91] suggested the use of the following objective function:

$$L(X, w) = \sum_{e=1}^{2n-3} |w_e| ,$$

whereas Swofford et al. [142] and Gascuel et al. [68] proposed:

$$L(X, w) = \sum_{e=1}^{2n-3} \max\{w_e, 0\} .$$

This last function is used in a well-known phylogenetic software package called PAUP 4.0* [141]. Farach et al. [52] and Agarwala et al. [3] proposed the function:

$$L(X, w) = \|w\|_\infty = \|X^{\dagger} D^{\triangle}\|_\infty = \max_{e=1,\dots,2n-3} |w_e| ,$$

where $\|\cdot\|_\infty$ is the $L_\infty$-vector norm and $X^{\dagger}$ is the Moore-Penrose generalized inverse of X. This function is particularly related to the statistical consistency of MEP and will be discussed in Section 2.5.
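The four length functions listed above differ only in how they treat negative or extreme edge weights. The sketch below, with hypothetical function names and a made-up weight vector containing one negative entry, makes the difference concrete.

```python
def total_length(w):
    """Plain sum of the edge weights, as in (2.9)."""
    return sum(w)


def absolute_length(w):
    """Sum of the absolute values of the edge weights."""
    return sum(abs(v) for v in w)


def positive_length(w):
    """Sum of the edge weights after replacing negative entries by zero."""
    return sum(max(v, 0) for v in w)


def max_norm_length(w):
    """L-infinity norm of w: the largest absolute edge weight."""
    return max(abs(v) for v in w)


# Hypothetical least-squares estimate with one negative edge weight.
w = [1.0, 2.0, -0.5, 4.0, 0.25]
for f in (total_length, absolute_length, positive_length, max_norm_length):
    print(f.__name__, f(w))
```

When all entries of w are non-negative, the first three functions coincide.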

Under the BLS model the objective function is independent of the edge weights. Specifically, the length of a phylogeny becomes [67]:

$$\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} 2^{1-\tau_{ij}} d_{ij} \qquad (2.10)$$

where

$$\tau_{ij} = \sum_{e=1}^{2n-3} x_{ij,e} .$$
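Formula (2.10) can be evaluated directly from the EPT matrix, since $\tau_{ij}$ is just the sum of row ij. The following sketch does so for the matrix of Table 2.1; the distance values and the name `balanced_length` are assumptions made for illustration.

```python
from fractions import Fraction

# Rows of the EPT matrix of Table 2.1, keyed by leaf pair
# (columns e_A, e_B, e_C, e_D, e_1).
X_ROWS = {
    "AB": [1, 1, 0, 0, 0],
    "AC": [1, 0, 1, 0, 1],
    "AD": [1, 0, 0, 1, 1],
    "BC": [0, 1, 1, 0, 1],
    "BD": [0, 1, 0, 1, 1],
    "CD": [0, 0, 1, 1, 0],
}

# Hypothetical additive distances generated by the edge weights (1, 2, 3, 4, 5).
D = {"AB": 3, "AC": 9, "AD": 10, "BC": 10, "BD": 11, "CD": 7}


def balanced_length(x_rows, d):
    """Evaluate (2.10): sum over pairs of 2^(1 - tau_ij) * d_ij."""
    total = Fraction(0)
    for pair, row in x_rows.items():
        tau = sum(row)  # number of edges on the path between the two leaves
        total += Fraction(2) ** (1 - tau) * d[pair]
    return total


print(balanced_length(X_ROWS, D))
```

Here the pairs AB and CD have $\tau_{ij} = 2$ and weight 1/2, while the other four pairs have $\tau_{ij} = 3$ and weight 1/4. For these additive distances the result equals the sum of the generating edge weights, 1 + 2 + 3 + 4 + 5.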

2.3.2 The linear programming models

Linear Programming (LP) models are currently the only alternatives to the least-squares models. The earliest LP model was proposed by Beyer et al. [13]. The authors observed that if the evolutionary distances between pairs of molecular data have to reflect the number of mutation events required over time to convert one molecular sequence into another, then the evolutionary distances must satisfy the triangle inequality (2.3) [54]. Moreover, since any edge weight of a phylogeny is de facto an evolutionary distance, the entries of vector w must also satisfy the triangle inequality [13]. Hence, Beyer et al. proposed an LP model in which, for a given phylogeny, the (2n − 3) non-negative edge weights must satisfy n(n − 1)/2 triangle inequalities.

An analogous model was proposed by Waterman et al. [150], where the authors studied the case in which the additive property (2.4) also holds for the distance matrix D. Waterman et al. combined the additive property of the distance matrix with the triangle inequality and modified Beyer et al.'s model by imposing that:

$$w_e \geq 0, \qquad e = 1, \dots, 2n-3$$

$$\sum_{e \in p_{ij}} w_e \geq d_{ij}, \qquad \forall\, i < j, \; i, j \in \Gamma$$

The authors were able to prove that the assignment of edge weights by linear programming yields at least (n − 1) of the $\binom{n}{2}$ triangle inequalities as equalities, and at most (n − 1) edge weights equal to zero.

Edge weight estimation model                References
Least-Squares
    OLS                                     [23, 29, 65, 68, 91, 127, 128, 147, 148]
    WLS                                     [61, 107]
    GLS                                     [24, 30, 80]
    BLS                                     [42, 43, 67, 118, 135]
Least-Squares with positivity constraint    [10, 52, 58, 69, 83, 108]
Linear programming                          [13, 150]

Table 2.3. References classified by type of edge weight estimation model.

The objective function in the linear programming models

Under the LP edge weight estimation models, Beyer et al. [13] and Waterman et al. [150] suggested the following objective functions:

$$L(X, w) = \sum_{e=1}^{2n-3} w_e , \qquad (2.11)$$

$$L(X, w) = \sum_{i=1}^{n-1} \sum_{j>i}^{n} \frac{\sum_{e \in p_{ij}} w_e - d_{ij}}{d_{ij}} , \qquad (2.12)$$

and

$$L(X, w) = \sum_{i=1}^{n-1} \sum_{j>i}^{n} \frac{\sum_{e \in p_{ij}} w_e - d_{ij}}{d_{ij}^2} . \qquad (2.13)$$

The objective function (2.11) is analogous to Rzhetsky and Nei's (2.9) in the least-squares model [13]. On the other hand, the objective function (2.12) represents the overall sum of the deviations between the length of the path connecting each pair of leaves and the corresponding evolutionary distance, whereas objective function (2.13) differs from (2.12) only in the use of a $\chi^2$-type weighting [150]. Waterman et al. noted that (2.12) and (2.13) can be considered as analogous LP versions of the WLS when $\omega_{ij} = 1/d_{ij}$ or $\omega_{ij} = 1/d_{ij}^2$, respectively [150].
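To make the LP ingredients concrete, the sketch below checks the constraints $w_e \geq 0$ and $\sum_{e \in p_{ij}} w_e \geq d_{ij}$ for a candidate weight vector and evaluates the objectives (2.11)-(2.13). It does not solve the linear program; the data and all function names are hypothetical.

```python
from fractions import Fraction

# Rows of the EPT matrix of Table 2.1 (columns e_A, e_B, e_C, e_D, e_1).
X_ROWS = {
    "AB": [1, 1, 0, 0, 0],
    "AC": [1, 0, 1, 0, 1],
    "AD": [1, 0, 0, 1, 1],
    "BC": [0, 1, 1, 0, 1],
    "BD": [0, 1, 0, 1, 1],
    "CD": [0, 0, 1, 1, 0],
}
# Hypothetical distances.
D = {"AB": 3, "AC": 9, "AD": 10, "BC": 10, "BD": 11, "CD": 7}


def path_lengths(x_rows, w):
    """Length of each leaf-to-leaf path: the product of a row of X with w."""
    return {p: sum(x * we for x, we in zip(row, w)) for p, row in x_rows.items()}


def is_feasible(x_rows, w, d):
    """Check w_e >= 0 for every edge and path length >= d_ij for every pair."""
    rho = path_lengths(x_rows, w)
    return all(we >= 0 for we in w) and all(rho[p] >= d[p] for p in d)


def objective_2_11(w):
    return sum(w)


def objective_2_12(x_rows, w, d):
    rho = path_lengths(x_rows, w)
    return sum(Fraction(rho[p] - d[p], d[p]) for p in d)


def objective_2_13(x_rows, w, d):
    rho = path_lengths(x_rows, w)
    return sum(Fraction(rho[p] - d[p], d[p] ** 2) for p in d)


w = [1, 2, 3, 4, 6]  # candidate weights; the internal edge is one unit too long
print(is_feasible(X_ROWS, w, D), objective_2_11(w),
      objective_2_12(X_ROWS, w, D), objective_2_13(X_ROWS, w, D))
```

With the internal edge set to 5 instead of 6, every path length matches its distance and both (2.12) and (2.13) evaluate to zero; with 4, the constraints for AC, AD, BC and BD are violated.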

2.4 Approaches to solving the minimum evolution problem

Apart from some polynomial cases [52, 150], each version of MEP has been proved to be $\mathcal{NP}$-hard. This result is generally obtained by reduction from the Steiner tree problem [78], the Quartet Consistency problem [40], the Subgraph Clique Partition problem [52], or the k-Colorability problem [152]. The $\mathcal{NP}$-hardness of MEP has led researchers to develop both exact and non-exact solution algorithms. Exact algorithms have been developed much more recently and are far less numerous than non-exact algorithms. Below, we discuss both classes of algorithms and list the respective references in Table 2.4.


2.4.1 Exact algorithms

Waterman et al. [150] proved that MEP becomes easy when the distance matrix is additive and a metric tree phylogeny is required to satisfy (2.6). In this case, the authors showed that the solution is unique and, to determine it, provided an $O(n^2)$ algorithm, called Sequential Algorithm. Culberson [38] improved the algorithm by decreasing its order of complexity to $O(n \log n)$. A slower version of Culberson's algorithm, called ADDTREE, was proposed by Sattath and Tversky [131].

When dealing with generic distance matrices and using the OLS edge weight estimation model, the only program able to provide exact solutions to the instances of MEP is, to the best of our knowledge, PAUP 4.0* [141]. PAUP 4.0* combines an ingenious exhaustive enumeration of the possible phylogenies X with an efficient computation of edge weights w.

Specifically, PAUP 4.0* implements Bryant and Waddell's algorithms (see [23] and Section 2.3.1), which can be used to estimate the edge weight vector w of a given phylogeny with an overall $O(n^2)$ computational complexity. PAUP 4.0* is able to solve instances containing no more than a dozen taxa.

Wu et al. [152] studied two families of instances of MEP, called the Minimum Ultrametric Tree (MUT) problem and the Metric Minimum Ultrametric Tree (MMUT) problem. As their names suggest, the first set of instances is characterized by ultrametric distance matrices, and the second by metric ultrametric distance matrices which differ from the first only in the presence of the distance metric property. The authors proved that both the MUT and the MMUT problems are $\mathcal{NP}$-hard [52, 152], and provided an exact algorithm based on the implicit enumeration of all phylogenies, able to find optimal solutions for datasets containing up to 25 taxa. Wu et al.'s algorithm was then improved by Chen and Chang [31], who provided a tighter lower bound for the MUT problem, and by Yu et al. [156], who provided a parallel version of the algorithm. The latter improvement allowed the authors to find the optimal solutions for datasets containing up to 38 taxa.

Lu et al. [106] introduced a version of MEP in which the input is represented by a metric distance matrix and the solution is required to be a Steiner tree of the phylogenetic graph G.

They proved that such a problem is MAX $\mathcal{SNP}$-hard. Chen et al. [32] formulated an even more specific case, called the Bottleneck Steiner Tree (BST) problem, where the solution of MEP is a Steiner tree in which the greatest edge weight is minimized. The authors provided an exact algorithm of complexity $O(n \log n)$ to solve the BST problem, but unfortunately, neither Lu et al. nor Chen et al. provided any relation to previous analogous versions of MEP, or any biological interpretation of the resulting phylogeny.

Finally, to the best of our knowledge, there are no exact algorithms able to tackle instances of MEP under any other least-squares edge weight estimation model. We will present in the next chapter a number of possible mixed integer programming formulations to tackle instances of MEP under the linear programming edge weight estimation model.


Fig. 2.4. Clustering heuristics: initially a graph-star is considered; subsequently two vertices (circled) are selected, marked (white vertices) and joined by an internal vertex. The algorithm is iterated on the remaining black vertices.

2.4.2 Non-exact algorithms

In contrast to exact algorithms, non-exact algorithms generally date back further and are more numerous. Non-exact algorithms are typically heuristics, i.e., algorithms that produce reasonably good solutions in short computing time ([117], p. 401). Although in general heuristics do not provide any formal guarantee of the quality of the solution found, for some of them (hereafter referred to as approximation algorithms), it is possible to prove that the solution found is optimal up to a small constant factor ([117], p. 409).

Before describing the main heuristics, we need to introduce some definitions. Define a partial phylogeny $X_m$ of a set of taxa Γ as an m-leaf phylogeny whose leaves are taxa of a subset $\Gamma' \subset \Gamma$, with $m = |\Gamma'|$. Given a partial phylogeny $X_m$ of Γ, denote $V_{X_m}$ and $E_{X_m}$ as the corresponding sets of vertices and edges of $X_m$, respectively. We say that we insert a leaf $i \notin \Gamma'$ into an edge e = (r, s) of $X_m$ when we generate a new partial phylogeny $X_{m+1}$ of Γ with vertex-set $V_{X_{m+1}} = V_{X_m} \cup \{i, t\}$ and edge-set $E_{X_{m+1}} = E_{X_m} \cup \{(r, t), (t, s), (t, i)\} \setminus \{(r, s)\}$.

In other words, we insert a leaf i into the partial phylogeny when we divide edge e = (r, s) with the new vertex t and we join the leaf i to t.
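The insertion of a leaf into an edge translates almost literally into set operations. The following sketch assumes that vertices are strings and that edges are stored as ordered tuples (r, s); the function name `insert_leaf` and the example tree are hypothetical.

```python
def insert_leaf(vertices, edges, leaf, edge, new_vertex):
    """Return the vertex set and edge set of X_{m+1} after inserting `leaf` into `edge`.

    `edge` is the tuple (r, s) to be subdivided by the new internal vertex
    `new_vertex` (t in the text), to which `leaf` (i in the text) is attached.
    """
    if edge not in edges:
        raise ValueError("edge does not belong to the partial phylogeny")
    if leaf in vertices or new_vertex in vertices:
        raise ValueError("leaf and new vertex must not already be in the phylogeny")
    r, s = edge
    new_vertices = vertices | {leaf, new_vertex}
    new_edges = (edges - {edge}) | {(r, new_vertex), (new_vertex, s), (new_vertex, leaf)}
    return new_vertices, new_edges


# Partial phylogeny on three taxa: a star with internal vertex u.
V3 = {"A", "B", "C", "u"}
E3 = {("A", "u"), ("B", "u"), ("C", "u")}

# Insert leaf D into edge (C, u) through the new internal vertex v.
V4, E4 = insert_leaf(V3, E3, "D", ("C", "u"), "v")
print(sorted(V4))
print(sorted(E4))
```

Each insertion adds two vertices and a net total of two edges, so a phylogeny with m leaves and 2m − 3 edges becomes one with m + 1 leaves and 2(m + 1) − 3 edges.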

Approximation algorithms

The family of approximation algorithms is relatively recent: to the best of our knowledge, the first works date from 1999. The first approximation algorithm for MEP under a least-squares edge weight estimation model was provided by Agarwala et al. [3]. The authors studied a particular version of MEP which minimizes the $L_\infty$-vector norm of the differences between the entries of vector w and the entries of an ultrametric distance matrix D. Agarwala et al. provided for such a problem an algorithm characterized by an approximation ratio ([63], p. 128; [117], p. 409) equal to 3. Their work is at the core of Atteson's article [5], which has strong implications for the statistical consistency of a number of versions of MEP. We discuss this issue more extensively in Section 2.5.


More recently, Wu et al. [152] provided a $1.5(1 + \lceil \log n \rceil)$-approximation algorithm for the MUT problem (see Section 2.4.1), and Lu et al. [106] studied a particular case of MEP in which edge weights are restricted to the values {1, 2}. Lu et al. proved that this particular case of MEP can be approximated with an error ratio of 8/5, but did not provide any biological interpretation for the resulting phylogeny. Similarly, Chen et al. [32] provided a 2r-approximation algorithm for the BST problem (see Section 2.4.1), where r is the best-known approximation ratio for the Steiner tree problem, but in this case also, the authors failed to provide any biological interpretation of the resulting phylogeny.

Heuristics

Heuristics can be classified according to the edge weight estimation model used and the strategy proposed to build a phylogeny. Specifically we can distinguish between classes of clustering, constructive and constructive/clustering heuristics. The idea at the core of the class of clustering heuristics is the so-called star decomposition algorithm (see Figure 2.4 and ([59], p. 48)). The algorithm starts from a star-graph containing n leaves and one internal vertex and then iteratively joins any pair of vertices with a new internal vertex until a phylogeny is obtained. By contrast, the idea at the core of the class of constructive heuristics is to start from a partial phylogeny of the set Γ and then to insert iteratively a new leaf until a phylogeny of the set Γ is obtained. Finally, the class of constructive/clustering heuristics combines the two previously cited approaches by generating typically a first phylogeny with a clustering and/or constructive heuristic and then applying a local search to it. Below, we discuss the main heuristics available in this area, postponing to Section 2.5 the analysis of their statistical consistency.
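The star decomposition scheme described above can be sketched as follows. The rule used to select the pair of vertices to join is what distinguishes one clustering heuristic from another, so it is left here as a caller-supplied function. The names `star_decomposition` and `first_two`, the centre label `c`, and the choice to keep each new internal vertex among the candidates for later joins are assumptions of this sketch.

```python
def star_decomposition(leaves, choose_pair):
    """Turn a star graph on `leaves` into an unrooted binary tree.

    `choose_pair(active)` must return two distinct vertices currently attached
    to the centre; they are detached and joined by a new internal vertex,
    which is attached to the centre in their place.
    """
    center = "c"
    edges = {(leaf, center) for leaf in leaves}
    active = list(leaves)  # vertices currently adjacent to the centre
    joins = 0
    while len(active) > 3:
        a, b = choose_pair(active)
        joins += 1
        t = f"t{joins}"
        edges -= {(a, center), (b, center)}
        edges |= {(a, t), (b, t), (t, center)}
        active = [v for v in active if v not in (a, b)] + [t]
    return edges


def first_two(active):
    """Placeholder selection rule: join the first two active vertices."""
    return active[0], active[1]


tree_edges = star_decomposition(["A", "B", "C", "D", "E"], first_two)
print(sorted(tree_edges))
```

For n = 5 leaves the loop performs n − 3 = 2 joins and the resulting tree has 2n − 3 = 7 edges. A real heuristic would replace `first_two` with a rule driven by the distance matrix D.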

Heuristics using metric properties

Heuristics using metric properties (triangle inequality, additive and ultrametric properties) are typically constructive in nature. The earliest heuristic exploiting this strategy was provided by Farris in 1972 [54]. At each iteration, this heuristic computes all the possible insertions of a new leaf into a partial phylogeny and chooses the one whose length is minimal.

Farris’s heuristic requires metric distance matrices and exploits the triangle inequalities to compute edge weights of a given phylogeny. Tateno et al. [143] and Faith [51] extended Farris’s heuristic by introducing techniques to handle ambiguities and errors in the input molecular data.

Csűrös et al. [36] improved the speed of Farris's heuristic by replacing its edge weight estimation with a new one exploiting the harmonic mean of a triplet of taxa.

Csűrös [37] also showed that when the additive property holds for distance matrix D, his

heuristic, called HGT/FP, is characterized by an order of complexity of O(n

²