On the challenge of reconstructing level-1 phylogenetic networks from triplets and clusters

(1)

HAL Id: hal-01391430

https://hal-upec-upem.archives-ouvertes.fr/hal-01391430

Submitted on 8 May 2017

HAL

is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire

HAL, est

destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Distributed under a Creative Commons

Attribution| 4.0 International License

On the challenge of reconstructing level-1 phylogenetic networks from triplets and clusters

Philippe Gambette, Katharina Huber, Steven Kelk

To cite this version:

Philippe Gambette, Katharina Huber, Steven Kelk. On the challenge of reconstructing level-1 phy-

logenetic networks from triplets and clusters. Journal of Mathematical Biology, Springer Verlag

(Germany), 2017, 74 (7), pp.1729-1751. �10.1007/s00285-016-1068-3�. �hal-01391430�

(2)

DOI 10.1007/s00285-016-1068-3

Mathematical Biology

On the challenge of reconstructing level-1 phylogenetic networks from triplets and clusters

Philippe Gambette¹ · K. T. Huber² · S. Kelk³

Received: 26 March 2014 / Revised: 5 April 2016 / Published online: 31 October 2016

Abstract Phylogenetic networks have gained prominence over the years due to their ability to represent complex non-treelike evolutionary events such as recombination or hybridization. Popular combinatorial objects used to construct them are triplet systems and cluster systems, the motivation being that any networkNinduces a triplet system R(N)and a softwired cluster systemS(N). Since in real-world studies it cannot be guaranteed that all triplets/softwired clusters induced by a network are available, it is of particular interest to understand whether subsets ofR(N)or S(N)allow one to uniquely reconstruct the underlying network N. Here we show that even within the highly restricted yet biologically interesting space of level-1 phylogenetic networks it is not always possible to uniquely reconstruct a level-1 network N, even when all triplets inR(N)or all clusters inS(N)are available. On the positive side, we introduce a reasonably large subclass of level-1 networks the members of which are uniquely determined by their induced triplet/softwired cluster systems. Along the way, we also establish various enumerative results, both positive and negative, including results which show that certain special subclasses of level-1 networks N can be uniquely

B

Philippe Gambette philippe.gambette@u-pem.fr K. T. Huber

K.Huber@uea.ac.uk S. Kelk

steven.kelk@maastrichtuniversity.nl

1 LIGM (UMR 8049), CNRS, ENPC, ESIEE Paris, Université Paris-Est, Marne-la-Vallée, 77454 Paris, France

2 School of Computing Sciences, University of East Anglia, Norwich Research Park, Norwich NR4 7TJ, UK

3 Department of Knowledge Engineering (DKE), Maastricht University, Maastricht, The Netherlands

(3)

reconstructed from proper subsets ofR(N)andS(N). We anticipate these results to be of use in the design of algorithms for phylogenetic network inference.

Mathematics Subject Classification O5C05·92D15

1 Introduction

Phylogenetic trees are essentially graph-theoretical trees whose set of leaves is labelled by a set of species or organisms (more abstractly,taxa) and which do not have any degree-two vertices, except possibly the root. They have been the model of choice for many years for shedding light on the evolutionary past of a set of taxa. However, in cases where the taxa are suspected to have undergone reticulate evolutionary events such as hybridization or recombination, trees have been found to not always be appro- priate (Sneath 1975). The need for structures capable of appropriately dealing with such data sets, combined with the fact that different evolutionary processes have given rise to them, has resulted in the introduction of a number of more general structures for representing evolutionary relationships. Subsumed under the name “phylogenetic network” these include hybrid phylogenies (Baroni et al. 2006), ancestral recombination graphs (Hein 1990), galled trees (Gusfield et al. 2004;Wang et al. 2001), normal networks (Wilson 2010), regular networks (Baroni et al. 2004), tree-sibling networks (Cardona et al. 2008), level-knetworks (Jansson et al. 2006;Iersel et al. 2009), median networks (Bandelt 1994) and NeighborNets (Bryant and Moulton 2003), to name just a few, which all generalize a phylogenetic tree in one way or another.

Apart from median networks and NeighborNets which are a special type of split- based phylogenetic network, the basic graph-theoretical structure underpinning a phylogenetic network is a rooted directed acyclic graph (DAG) that has a unique root and whose set of sinks is a given set of taxa. One of the combinatorially sim- plest types of phylogenetic network, but still complicated enough to be of interest to Evolutionary Biology, is that of a binary level-1 network (see Fig.1 for an example). Such structures have attracted a considerable amount of interest in the literature (see e. g.Jansson et al. 2006;Gusfield et al. 2004;Rosselló and Valiente 2009;Huber et al. 2011) and can informally be thought of as degree-constrained rooted DAGs with vertex-disjoint undirected cycles. (Formal definitions of all terms will follow in later sections). However, this simplicity has proven to be deceptive, as the combinatorial

Fig. 1 A binary level-1 phylogenetic networkNon X= {1, . . . ,10}that is also 4-outwards and saturated. As in all figures all arcs of the network are directed downwards, so we do not explicitly indicate the direction of arcs

(4)

structure of such networks has turned out to be more complicated than originally thought (see e.g.Gambette and Huber 2012;Huber and Moulton 2013). Limits on our ability to reconstruct level-1 networks constitute lower bounds on how well we can reconstruct phylogenetic networks more generally. On the other hand, positive results for reconstructing level-1 networks can be an important first step towards algorithms for reconstructing more complex phylogenetic networks.

In this paper, we start by establishing a number of enumerative results for binary level-1 networks. These include upper and lower bounds on the number of vertices and arcs in such networks. We gradually shift our focus ontocluster systems, that is, collections of non-empty subsets of the leaves, andtriplet systems, that is, binary phylogenetic trees on just three leaves. Guided by the fact that these systems have been used for reconstructing phylogenetic networks [see e. g.Morrison(2009) and Huson et al.(2010) for recent overviews], we are particularly interested in finding bounds on the minimum size of a triplet system/cluster system required to “uniquely determine” a level-1 network. For trees this question is well understood. Specifically, for a phylogenetic treeT onn ≥3 leaves it is well-known thatT is uniquely determined by its induced triplet systemR(T)(leading to an upper bound of_n

3

for such a minimum-sized set) and that n −2 carefully chosen triplets from R(T) suffice to uniquely reconstructT whenT is binary [see Theorem 3 ofSteel(1992) and its Corollary]. For this case, it is also well-known thatT is uniquely determined by its induced cluster systemC(T)and that for a minimum-sized cluster system to uniquely determineT, it must have|C(N)| =2n−1 elements.

As we shall see, the situation is more complicated for binary level-1 networks.

Every level-1 networkN induces a triplet systemR(N)and a certain cluster system S(N)called thesoftwired cluster system of N[seeHuson and Scornavacca(2011) for background] but their ability to fully capture the topological structure ofN is not as strong as one might hope. Let us say that a binary level-1 networkNisencodedby its induced triplet system if for every binary level-1 networkNsuch thatR(N)=R(N), we have N = N. Continuing, we say that a binary level-1 network is4-outwardsif its underlying graph does not have a cycle of length four or less. It is precisely the 4-outwards binary level-1 networks N that are encoded byR(N)as well asS(N) (Gambette and Huber 2012) (where we define a binary level-1 network to be encoded by its induced softwired cluster system in an analogous way).

Intriguingly, ifR(N) = R(N)is replaced by R(N) ⊆ R(N)(as is the case in our formalization of “uniquely determining”) then the assumption that N is 4- outwards is no longer strong enough to guarantee uniqueness. A similar observation holds forS(N)(see Sects.6and7for examples for both cases). However, the situation changes for both if, in addition to being 4-outwards we require thatN issaturated, that is, none of its vertices is incident with more than one cut arc (Theorem6.3and Theorem7.3). Simple networks onn ≥4 leaves are 4-outwards, saturated networks that have precisely one cycle in their underlying graph. We show that at most 2n−1 carefully chosen triplets suffice to uniquely determine such networks. As the network on four leaves depicted in Fig.6indicates, this bound is however not tight because five triplets suffice in that case (which can be checked by a simple case analysis). Given that any binary level-1 networkNcontains at least one triplet for any three of its leaves and so|R(N)| ≥_n

3

holds, this suggests that at least for simple phylogenetic networks

(5)

there is a considerable amount of redundancy inR(N)with regards to reconstructing N fromR(N). To establish a similar result for general binary level-1 networks N might not be straightforward in view of Proposition4.2, which suggests that|R(N)|

is not easily expressible in terms of a natural parameter associated with a phylogenetic networkN, namely its number of non-trivial cut arcs (see Sect.3). This is somewhat surprising in view of the close relationship between the triplet system induced by a binary level-1 network N and its associated softwired cluster system S(N)[see e. g.Gambette and Huber 2012, Proposition 2 and Theorem 1 for details concerning this relationship] because the size ofS(N)isclosely related to the number of cut arcs ofN(Theorem4.1). As in the case of triplet systems, it is easy to find examples of binary level-1 networks N that indicate that there is redundancy in the softwired cluster system induced byNwith regards to uniquely determiningN. Again focusing on simple networksN, we show that at mostncarefully chosen (softwired) clusters induced byN suffice to uniquely determine N (Corollary7.2). However, we do not know if this bound is sharp.

Given that in phylogenetic analyses one is hardly ever guaranteed to have all triplets/clusters induced by a (as yet unknown) phylogenetic network available, the above observations have profound consequences for phylogenetic network reconstruction. One of the most important ones is that a phylogenetic network reconstructed from a triplet or cluster system need not be the network that gave rise to this system.

The paper is organized as follows. In the next section, we present basic terminology of relevance to this paper, including the definition of a level-knetwork and that of a gall in a level-1 network. In Sect.3, we define cut arcs and present formulas for counting the number of vertices, arcs, and galls in a binary level-1 network. These results improve on the results inChoy et al.(2005) which imply that the number of vertices in a binary level-1 network onn leaves is linear inn and that the number of hybrid vertices is at mostn −1. In Sect.4.1, we formally define the softwired cluster systemS(N) induced by a binary level-1 network N and establish Theorem4.1. In Sect.4.2, we define the triplet systemR(N)induced by a binary level-1 networkN and establish Proposition4.2. In Sect.5, we establish in Proposition5.1a relationship between the triplet system induced by a binary level-1 network N and a certain partition of the leaf set of N that will be crucial for showing Theorem6.3. In Sect.6, we first formalize the notion of “uniquely determining” and then present the aforementioned examples for triplet systems. Starting in that section and continuing in Sect.7, we investigate saturated, 4-outwards, binary level-1 networks and establish Theorem6.3 and Theorem7.3, respectively.

2 Definitions and notation

In this section we present only basic definitions and notation to avoid overloading the reader. Concepts such as triplets and (softwired) clusters are formalized in subsequent sections.

Throughout the paper, let X denote a finite set of size n ≥ 2. Also all graphs G considered have non-empty finite sets of vertices and edges (or arcs in caseGis directed) and have no loops or multiple edges (or arcs in caseGis directed).

(6)

Suppose for the following thatG =(V,A)is a directed acyclic graph (DAG). If v andware vertices ofG such that there exists an arca fromv towinGthen we denote that arc by(v, w)and refer tovas thetail of a, denoted bytail(a), and towas thehead of a, denoted byhead(a). Supposev∈V is a vertex ofG. Then we denote byout deg(v)the out-degree ofv(i.e. the number of arcs whose tail is incident tov) and byi ndeg(v)thein-degreeofv(the number of arcs whose head is incident tov).

The sum of the out-degree and the in-degree ofvis called thedegreeofv, denoted by deg(v). Ifi ndeg(v)=1 andout deg(v)=0 thenvis called aleafofG. The set of leaves ofGis denoted byL(G). Every vertex inV−L(G)is called aninterior vertex ofG. IfGhas a unique vertexρ=ρG ∈V withi ndeg(ρ)=0 andout deg(ρ)≥2 thenρ is called therootofG andG is called arooted DAG. IfGis a rooted DAG with leaf setX andG =(V,A)is a further rooted DAG with leaf setX then we say thatGisequivalenttoGif there exists a graph isomorphism fromGtoGthat is the identity onX.

Aphylogenetic network N on X is a rooted DAG whose set of leaves isX, and every interior vertexvofNexcept the rootρNis either (i) asplit vertexofN, that is, i ndeg(v)=1 andout deg(v)≥2 or (ii) ahybrid vertexof N, that is,i ndeg(v)≥2 andout deg(v)≥1. In case only the size of X is of relevance to the discussion then we will simply callN aphylogenetic network on|X|leavesand if the setX is of no relevance to the discussion then we will simply call a phylogenetic networkNonXa phylogenetic network. We denote the set of hybrid vertices of a phylogenetic network NbyH(N)and say thatNisbinaryif the root ofNas well as every split vertex ofN has out-degree two and every hybrid vertex ofN has out-degree one and in-degree 2.

An undirected graph G is called biconnected if G is connected and G −v is connected for allv ∈ V(G). A maximal biconnected subgraphH ofG is called a biconnected componentof G. (We say that a biconnected component isnon-trivial if it contains more than one edge.) LetU(N)be theunderlying graphof N i.e. the undirected graph obtained from N by ignoring the orientation of its arcs. We say that a binary phylogenetic network N is alevel-k (phylogenetic) network, if every biconnected component ofU(N)contains at mostkhybrid vertices. Reflecting the fact that a cycle of length three in the underlying graph of a phylogenetic network is indistinguishable (from a triplet or cluster perspective) from a split vertex, we follow common practice and will always assume that a cycle in the underlying graph of a level-1 networkN contains at least four vertices.

Note that a phylogenetic networkNfor whichH(N)= ∅holds is simply arooted phylogenetic treeonX [sensuSemple and Steel(2003)]. Thus, level-0 networks are rooted phylogenetic trees. All phylogenetic trees considered in this article are rooted so we henceforth drop the “rooted” prefix.

We denote the class of all binary level-1 networks on n ≥ 2 leaves byL1(n).

Alternatively, we will also useL1(X)to denote that class if we want to emphasize the leaf setXof the networks inL1(n).

Now, suppose thatNis a level-knetwork,k≥1. Then we callN properifNis not also a level-lnetwork for some 0≤l ≤k−1. Note that in casek=1 such a network must have at least three leaves and at least one hybridization vertex. In that case, we call a non-trivial biconnected component ofU(N)with its original directions inN restored agallofN and denote the set of galls of a level-1 networkNbyG(N). IfN

(7)

Fig. 2 Two examples of simple level-1 networks on

X= {x₁, . . . ,x₅}. Note that both networks are 4-outwards and saturated

(i) (ii)

is binary, contains precisely one gallC, and every leaf ofNis adjacent with a vertex ofC thenN is calledsimple. Together with phylogenetic trees, such networks may be viewed as the building blocks of (proper) level-1 networks (Iersel et al. 2009). For the convenience of the reader, we present examples of two simple level-1 networks on X = {x1, . . . ,x5}in Fig.2.

3 Counting arcs, vertices and galls

In this section, we present some enumerative results concerning the number of vertices, arcs, and galls of a level-1 network. We start by introducing some relevant notation.

SupposeNis a phylogenetic network onX. FollowingIersel et al.(2010), we say that a phylogenetic treeT onXisdisplayedbyNif there exists a subgraphNofNthat is asubdivisionofT i.e.T can be obtained fromNby repeatedly suppressing vertices with in-degree and out-degree both equal to 1. ForNa level-1 network, we denote the number of galls ofNbyg(N), that is, we letg(N)= |G(N)|.

3.1 Counting arcs and vertices

In caseNis a binary level-0 network onn≥2 leaves, that is,Nis a binary phylogenetic tree onnleaves, it is easy to see thatN has 2n−1 vertices and 2(n−1)edges (see e. g.Semple and Steel 2003, Proposition 2.1.3 for the corresponding result for unrooted binary phylogenetic trees). For the more general case thatNis a binary, proper, level-k network onn ≥2 (and thus onn≥3) leaves, andk≥1, it was shown in (van Iersel 2009, Lemma 4.5) that any such network can contain at most 2n−1+k(n−1)vertices and at most 2n−2+³₂k(n−1)arcs. Denoting forn≥ 3 the subclass of all proper level-1 networks inL1(n)byL1(n)⁻, the sizes of the vertex and arc sets of a network N = (V,A)in∈ L1(n)⁻can thus be at most 3n−2 and 3.5(n−1), respectively.

Moreover, if follows from (van Iersel 2009, Lemma 4.4) that|V| = 2n+1 = |A|

holds in the special case thatNis simple. The next result indicates that the size of the

(8)

vertex set of a simple level-1 network lends itself to providing lower bounds on the sizes of the vertex set and arc set of a general proper level-1 network, respectively.

Lemma 3.1 Let n ≥3and suppose N =(V,A)∈ L1(n)⁻. Then2n+1 ≤ |V| ≤ 3n−2and2n+1 ≤ |A| ≤ 3.5(n−1). These bounds are tight if n =3, in which case N must be a simple level-1 network.

Proof SupposeXhas sizenand assume thatN =(V,A)is a network inL1(n)⁻with leaf setX. The upper bounds have already been established in the above discussion, so it suffices to prove that 2n+1 ≤ |V|and 2n+1≤ |A|. Letg =g(N)and note thatg≥1 holds becauseN is assumed to be proper. We start by adding a new taxon y∈/ Xjust above the root ofN, in the following way: introduce a new vertexu, add an arc fromuto the root ofN, and add an arc fromutoy. Let N^ybe this new network, whose root isu.N^y has|V| +2 vertices and|A| +2 arcs. LetT = (VT,AT)be any binary phylogenetic tree onX∪ {y}that is displayed byN^y. Then there exists a subgraphTofN^ythat is a subdivision ofT. Observe thatTmust contain 2gvertices whose in-degree and out-degree (inT) are both equal to 1. Specifically,gof them are hybrid vertices ofN^y and the othergare tails of arcs inN^ywhose head is contained in H(N^y). (The correctness of these claims requires the root of T to be the same vertex as the root ofN^y, and this is the reason for the addition ofyin the first place.) Consequently,|VT| =(|V| +2)−2g. Noting thatT has 2(n+1)−1 vertices (i.e.

because it is binary) we have,

|V| = |VT| +2g−2=(2(n+1)−1)+2g−2=2n−1+2g≥2n+1, where the last inequality follows becauseg ≥1. Similarly, to obtainTfromN^y, we need to delete for every hybrid vertexh ∈ H(N^y)precisely one of its incoming arcs (v,h). Hence, bothvandhwill have in-degree and out-degree 1 inT. (Note thatv might be the root of the gall that containshinN^y). Hence,|AT| =(|A|+2)−g−2g.

Noting (again, because it is binary) thatT has 2(n+1)−2 arcs, we have

|A| = |AT| +3g−2=2(n+1)−2+3g−2=2n+3g−2≥2n+1 where, as before, the last inequality follows becauseg ≥1.

It can easily be verified that the bounds are tight in the casen =3. Specifically, all expressions evaluate to 7. Finally, if n = 3 then N must be a simple level-1 network, because otherwise it would either be a tree (and thus not proper) or violate the assumption that every cycle in the underlying graph ofN contains at least four vertices.

3.2 Counting galls

We next establish a formula for counting the number of galls of a level-1 network. To this end, we require further terminology. SupposeG=(V,A)is a rooted DAG. Then an arca ∈ Ais called acut arcofGif the deletion ofa disconnects the underlying graphU(G)ofG. Ifais a cut arc ofGsuch thathead(a)is a leaf ofGthen we calla atrivialcut arc ofGand anon-trivialcut arc ofGotherwise. We denote the number of non-trivial cut arcs of a level-1 networkN bycN.

(9)

SupposeNis a level-1 network onX. For a gallCofN, we call an arc ofNwhose tail but not its head is a vertex ofC anoutgoing arcofC. Note that our assumption that every cycle inU(N)has at least four vertices implies that every gall must have at least three outgoing arcs. Moreover, ifNis binary then we call two distinct leaves xandyof N acherryof N ifx andyhave a common parent. Note that that parent must be a split vertex ofN. In addition, ifNis a binary phylogenetic tree and|X| =3 thenN is called atriplet (on X). Saying that a vertexvof a rooted DAGGisbelowa vertexwofGifwlies on a directed path from the root ofGtovbut is distinct fromv, we denote a triplettonX = {a,b,c}for which the last common ancestor ofaandb is below the root oftbyab|c(or equivalentlyc|ab). Finally, a collectionRof triplets is called atriplet system (on

t∈RL(t)).

Theorem 3.2 Let n≥2and suppose N ∈L1(n). Then g(N)≤n−cN−2and this bound is tight if either N is a phylogenetic tree or every gall of N has exactly three outgoing arcs.

Proof We prove the theorem by induction onn ≥2. Suppose N ∈L1(n). Then the stated inequality clearly holds in the form of an equality forn=2 since in that caseN is a phylogenetic tree. It also holds forn=3 because in that caseNis either a triplet and so has one non-trivial cut arc but no gall, orN is a simple level-1 network and so has precisely one gall but no non-trivial cut arcs.

Suppose thatNhasn≥4 leaves and assume thatg(N)≤n−1−cN −2 holds for all level-1 networksN ∈L1(n−1). Clearly,g(N)=n−cN−2 holds in case Nis a phylogenetic tree as in that caseg(N)=0 and every non-trivial cut arc ofNis an interior edge of N, of which there aren−2. So assume thatN ∈L1(n)⁻. To see the stated bound forg(N), we distinguish between the cases that (i)Ncontains a gall Cwhose outgoing arcs are all trivial cut arcs and (ii) that this is not the case, that is, N contains a cherry.

Assume first that Case (i) holds. We distinguish the cases thatChas three outgoing arcs and thatC has at least four outgoing arcs. Assume first thatC has at least four outgoing arcs. Then there must exist a leafaofNthat is the head of an outgoing arc of Cbut is not adjacent with the unique hybrid vertex ofC. Consider the rooted DAGN obtained fromNby first removinga, its parenta∈V(N), and all arcs adjacent with aand then adding a new arc from the parent ofato the child ofacontained inV(C). Clearly,Nis a binary level-1 network onL(N)\{a}andg(N)=g(N)andcN =cN

both hold. Since|L(N)| = n−1, we haveg(N)= g(N)= n−1−cN −2 = n−3−cN <n−cN−2, by the induction hypothesis. Consequently,g(N) <n−cN−2 holds in this case.

Next, assume that C has exactly three outgoing arcs a1,a2,a3. Let N be the rooted DAG obtained from N by contracting C as well as a1, a2, and a3 into a new leaf x. Clearly, N is a binary level-1 network on L(N) ∪ {x}\{head(a1),head(a2),head(a3)}and g(N) = g(N)−1 and cN = cN −1.

Thus,g(N)≤n−2−cN−2 and, so,g(N)≤n−cN−2 holds in this case too.

Last but not least, assume that Case (ii) holds, that is,Ncontains two leavesxand ythat form a cherry. LetNdenote the rooted DAG obtained fromNby first deleting x, its parent p, and all arcs incident withpand then adding a new arc from the parent of p toy. Clearly N is a binary level-1 network onL(N)\{x}andg(N)= g(N)

(10)

andcN =cN−1 both hold. Consequently,g(N)= g(N)≤ n−1−cN −2 = n−1−(cN−1)−2=n−cN−2 holds by the induction hypothesis. This concludes the proof of this case and thus the proof of the stated bound forg(N).

It remains to establish that the stated bound forg(N)is tight for a level-1 network N ∈L1(n)for which all of its galls have precisely three outgoing arcs. To see this, one can again perform induction onn≥2 but this time assuming thatg(N)=n−cN−2 holds for all level-1 networksN∈L1(n−1)for which every gall has precisely three outgoing arcs. In this context it should be noted that the casesn ∈ {2,3}andN is a phylogenetic tree onnleaves have already been observed above. We leave the details to the interested reader.

4 Counting clusters and triplets

In this section we establish enumerative results for computing the sizes of the so-called hardwired and softwired cluster system, respectively, that have both been introduced in the literature for phylogenetic network reconstruction (Huson et al. 2010). In addition, we establish that the corresponding result for triplets does not hold. We start with clusters.

4.1 Counting clusters

We call a non-empty subset of X aclusterand refer to a set of clusters of X as a cluster system on X, or just a cluster system if the set X is clear from the context.

Suppose for the following thatNis a phylogenetic network onXand thatv∈V(N).

Then we define the clusterCN(v)associated withv to comprise of all leaves ofN that are belowvand letCN(v)= {v}in casevis a leaf ofN. Again, we simplify our notation by writingC(v)rather thanCN(v)if Nis clear from the context. Note that C(ρN)=X. Then thehardwired cluster systemC(N)associated withNis the cluster system{C(v) : v∈ V(N)}. Note that ifN ∈L1(X)⁻, then Lemma3.1implies that 2n+1≤ |C(N)| ≤3n−2 and ifNis a phylogenetic tree, then|C(N)| = |V(N)| = 2n−1, wherendenotes|X|in both cases. Denoting byT(N)the set of phylogenetic trees on X displayed byN, thesoftwired cluster systemS(N)associated withN is defined asS(N)=

T∈T(N)C(T).

To illustrate this definition, consider the level-1 networkN1onX = {x1, . . . ,x5} depicted in Fig.3i. ThenS(N1)comprises the clustersX,{x2,x3,x4,x5},{x3,x4,x5}, {x4,x5},{x2,x3,x4},{x3,x4},{x1},{x2},{x3},{x4}, and{x5}.

It is not too difficult to argue thatS(N)containsO(n)clusters. To see this, letT be a tree onXdisplayed byNand letvbe a vertex ofT. From the definition of display it follows that a subdivision ofT can be topologically embedded withinN. Fix such an embedding, and letTandvbe the images ofT andvinN. Ifvis the head of a cut arc inN, or the root ofN, thenCT(v)will be equal toCN(v), irrespective of the exact embedding. Ifvis not the head of a cut arc, nor the root, then it is a vertex of some gall of N. In that case, there are only (at most) two possibilities forCT(v). Specifically, the choice of cluster is completely determined by which of the two edges incoming to the hybridization vertex in the gall, are inT(irrespective of the exact topology of the embedding). Now, from Lemma3.1NcontainsO(n)vertices. Given that (as argued)

(11)

Fig. 3 Two networksN1and N2onX= {x1, . . . ,x₅}. Note that both networks are 4-outwards, but neither is saturated

(i) (ii)

each vertex can contribute at most two clusters, it follows thatS(N)containsO(n) clusters. The next result improves on this O(n)observation by providing a formula for the size of the closely related cluster systemS(N)⁻ :=S(N)\{X}. (This is also an improvement on the result presented inGambette and Huber 2012, Proposition 3.) To establish it, we require further terminology.

SupposeN∈L1(X)andX⊆X. Then we define therestriction N|X of N to X to be the network inL1(X)obtained fromN by deleting all vertices inX −Xand then applying the following “cleaning up” operations in any order until no more can be applied¹: (1) suppressing vertices with in-degree and out-degree both equal to one;

(2) deleting vertices with out-degree zero that are not an element inX; (3) collapsing multi-arcs into a single arc; (4) if a gall G has been created that has exactly two outgoing cut arcs(u, v),(u, v), then deleting these two cut arcs and all the arcs ofG and adding arcs(r, v)and(r, v), whereris the unique vertex ofGwhose children are uandu; (5) deleting vertices with in-degree zero and out-degree one. (Note that ifN is a phylogenetic tree this definition specializes to the usual definition of “restriction”

used in the tree literature.) We often writeN|X−x as shorthand forN|X−{x}. Theorem 4.1 Let n≥2and suppose N ∈L1(n). Then|S(N)⁻| =3n−4−cN. Proof We prove the theorem by induction onn ≥2. Suppose N ∈L1(n). Then the stated equality holds ifn =2 as thenN is a phylogenetic tree on two leaves and if n =3 because in that caseN is either simple and socN =0 holds orN is a triplet.

In the former case,|T(N)| = 2 and both phylogenetic trees contained inT(N)are triplets. Thus,|S(N)⁻| = 5 = 3n−4−cN holds in this case. In the latter case, cN=1 follows and thus|S(N)⁻| =4=3n−4−cN in this case, too.

Now supposen>3 and assume that the theorem holds for all networksNwith at mostn−1 leaves. LetX =L(N). We distinguish between the cases that every cut arc ofN is trivial and the case thatN contains at least one non-trivial cut arc.

Suppose first that every cut arc of N is trivial. Then cN = 0 and N is simple.

Note that sincen >3, at least one of the two directed paths from the rootρN to the

1 In this paper only a subset of these “cleaning up” operations will be required. However, we list them all to retain consistency with the wider literature.

(12)

hybrid vertexhN of N must contain at least two vertices distinct fromρN andhN. Let P1denote such a path. Moreover, letv ∈ V(P1)denote the vertex on P1 that is adjacent withρN and letx ∈ X denote the leaf of N that is adjacent withv. Let X = X − {x}andN = N|X. Clearly N ∈L1(n−1)andc_N =cN =0. Thus,

|S(N)⁻| =3(n−1)−4, by the induction hypothesis. Observe that the definition of S(N)implies that S(N)⁻ contains exactly three clusters that S(N)⁻ does not.

Indeed, in case the other directed path fromρNtohN also contains vertices distinct fromρN andhN, the three clusters missing from S(N)⁻ are{x},CN(v)\{h}and CN(v), where h is the leaf below hN. Otherwise, the three clusters missing from S(N)⁻are{x},CN(v)\{h}andCN(v)wherevis the child ofvthat is not contained inX. Consequently,|S(N)⁻| = |S(N)⁻| +3=3(n−1)−4−0+3=3n−4−cN

holds in this case.

Now suppose that N has a non-trivial cut arc e = (u, v). LetY1 = {l ∈ X : lis belowv}andY2 = X−Y1. Note that 2≤ |Y1|<n. Hence, 1≤ |Y2| ≤n−2.

Consider the rooted DAGN1with leaf setY1obtained fromNby deleting all vertices (plus their incident arcs) that are not belowv and the rooted DAG N2onY2∪ {v}

obtained from N by deleting all vertices below v (plus their incident arcs). Since

|Y1| ≥ 2 it follows that N1 ∈ L1(Y1)and since|Y2∪ {v}| ≥2, we have thatN2 ∈ L1(Y2∪ {v}). Note that a phylogenetic treeT is displayed byNif and only ifT|Y₁is displayed byN1andT|Y2∪{v}is displayed byN2. Consequently,S(N)⁻=S(N1)^{− ·}∪ {C∈S(N2)⁻:v /∈C}∪ {C^· − {v} ∪Y1:v∈CandC ∈S(N2)⁻}must hold. Thus,

|S(N)⁻| = |S(N1)⁻| + |S(N2)⁻|.

Leti =1,2 and letni = |L(Ni)|andci =cN_i. Then|S(Ni)⁻| =3ni−4−ci, by the induction hypothesis. Consequently,|S(N)⁻| =3n1−4−c1+3n2−4−c2. Since n1+n2=n+1 andc1+c2=cN−1 it follows that|S(N)⁻| =3(n+1)−8−(cN−1)= 3n−4−cN,holds in this case, too.

4.2 Counting triplets

In view of the close relationship between the cluster systemC(T)induced by a phylogenetic treeT on at least three leaves and the triplet systemR(T)induced byT [see e. g. (Dress et al. 2012) or (Iersel and Kelk 2011)] it is reasonable to hope that the companion result to Theorem4.1might hold for the triplet systemR(N)induced by a phylogenetic networkN on at least three leaves. Put differently, it should be possible to express the size ofR(N)in terms of the number of galls and non-trivial cut arcs of N. As the next result shows, this is not the case. We start with defining the triplet systemR(N).

SupposeN ∈ L1(X), where|X| ≥3 anda,b, andcare distinct elements in X.

Then the tripletab|cis said to beconsistentwithN if there exist distinct verticesv andwinNand directed paths inN fromvtocandw, respectively, and fromwtoa andb, respectively, such that any pair of those paths does not have an interior vertex in common. The triplet systemR(N)is then the set of all tripletst withL(t)⊆ X that are consistent withN.

(13)

To illustrate this definition consider the simple level-1 network N2 on X = {x1, . . . ,x5}depicted in Fig.2ii. ThenR(N2)comprises the sixteen tripletsx3|x1x2, x4|x1x2, x5|x1x2, x1|x3x4, x4|x1x3, x1|x3x5, x5|x1x3, x1|x4x5, x2|x3x4, x4|x2x3, x2|x3x5,x5|x2x3,x2|x4x5,x3|x4x5,x5|x3x4,x1|x2x3.

Proposition 4.2 For all n≥6, there exist distinct networks N1,N2∈L1(n)with the same number of galls and non-trivial cut arcs but|R(N1)| = |R(N2)|.

Proof LetXdenote a finite set of size at least two and leta,b,c, andddenote pairwise distinct elements not contained in X. Consider the binary level-1 networksN1 and N2onX :=X∪ {a,b,c,d}depicted in Fig.4, where the triangle markedT denotes some binary phylogenetic tree onX.

As can be easily checked, N1andN2have the same number of leaves and both contain one gall and havecT+3 non-trivial cut arcs. Moreover, for any 3-setY ∈_X

3

, there exists exactly one triplet onYthat is contained inR(N1)except forY = {a,b,c}

for whicha|bc,c|ab ∈ R(N1)holds. Hence,|R(N1)| = _n

3

+1, wheren = |X|. Similarly for every 3-subsetY ∈ _X

3

, there exists exactly one triplet on Y that is contained inR(N2)except forY = {a,c,x}withx∈ X∪{b}for whicha|cx,c|ax ∈ R(N2)holds. Consequently,|R(N2)| =_n

3

+1+ |X|>_n

3

+1= |R(N1)|.

5 Triplet systems and the partitionC ut(N)

In this section, we start turning our attention to the question of how many triplets suffice to uniquely determine a binary level-1 network. Central to our arguments will be a special type of subsets of Xcalled SN-sets which were originally introduced in Jansson et al.(2006) and further studied in, for example,Iersel et al.(2009),Iersel and Kelk(2011).

Suppose|X| ≥3 andRis a triplet system onX. Then a subsetS⊆Xis called an SN-setofRif there is no tripletx y|z∈Rwithx,z∈Sandy∈/S. In addition, such

Fig. 4 Two binary level-1 networksN1andN2on X∪ {a,b,c,d}for which the respective number of leaves, galls, and non-trivial cut arcs are the same yet

|R(N2)| = |R(N1)|– see the proof of Proposition4.2for details

(14)

a setSis callednon-trivialifS =X.²Last but not least, a non-trivial SN-setSforR is calledmaximalif there is no non-trivial SN-set that is a strict superset ofS.

As it turns out, for a binary network N (of any level) the SN-sets of R(N)are closely related to the cut arcs of N in the sense that if(u, v)is a cut arc ofN, then CN(v)is an SN-set ofR(N)because there cannot exist a tripletx y|z ∈R(N)such thatx,z∈CN(v)andy∈/CN(v). We call a cut arc(u, v)ofN highestif there does not exist a cut arc(u, v)of N such that there is a directed path fromv tou. We denote byCut(N)the partition ofXinduced by, for each highest cut arc(u, v)ofN, taking the clusterCN(v)of X. By (Iersel and Kelk 2011, Observation 3)Cut(N)is exactly the set of maximal SN-sets ofR(N).

To illustrate, consider the networkN1onX =X∪ {a,b,c,d}depicted in Fig.4.

ThenCut(N1)is the bipartition{{a,b,c},X∪ {d}}.

We begin with an auxiliary observation which relies on the concept of compatibility of pairs of sets, whereby two non-empty finite sets A and B are calledcompatible if A∩B ∈ {∅,A,B}holds andincompatible otherwise. More generally, a cluster systemCis calledcompatibleif any two clusters inCare compatible andincompatible otherwise (see e. g.Semple and Steel 2003, Section 3.5 andDress et al. 2012for more on such objects).

Observation 1 Suppose that n ≥ 3and that N and N are two networks inL1(n) such thatR(N)⊆R(N). Letv ∈ V(N)andv ∈ V(N)denote two split vertices that are heads of cut arcs of N and N, respectively. Then the induced clusters CN(v) and CN(v)are compatible. In particular, if CN(v)CN(v)then CN(v)is not a maximal SN-set forR(N).

Proof LetC = CN(v)andC = CN(v). Clearly, if C = C thenC andC are compatible. So supposeC = C. Assume for the sake of contradiction thatC and C are not compatible, that is, C ∩C ∈ {∅,/ C,C}. Choose elements x ∈ C\C, y ∈ C ∩C andz ∈ C\C. Then, out of the three possible triplets with leaf set {x,y,z}, only the tripletx y|zcan be contained inR(N). Hence,x y|z∈R(N)and, so,Ccannot be an SN-set ofR(N); a contradiction as the incoming arc ofvis a cut arc ofNand, so,Cmust be an SN-set ofR(N). Thus,CandCmust be compatible.

To see the remainder of the observation, assume thatCC. Then sinceR(N)⊆ R(N)andCis an SN-set ofR(N), it follows thatCis also an SN-set ofR(N).

SinceCis also an SN-set ofR(N), it cannot be a maximal SN-set ofR(N).

The next result will be required by the induction argument that we will use in the proof of Theorem6.3. The proof of the proposition relies on the facts that for any saturated networkN ∈L1(X)(i) the partitionCut(N)contains at least three elements and (ii) there exists a gallBofN such that the root ofNis a vertex ofB.

Proposition 5.1 Suppose that|X| ≥ 3, that N is saturated network inL1(X)and that Nis a network inL1(X)such thatR(N)⊆R(N). Then Cut(N)=Cut(N).

2 There is some confusion in earlier literature whether∅should be considered an SN-set. Here we allow this, as it does not adversely affect our analysis. Although it sounds a little strange,∅is also a non-trivial SN-set. We adopt this convention to ensure consistency with earlier publications:Xis the only trivial SN-set.

(15)

) ii ( )

i (

Fig. 5 The structure of networksiNandiiNconsidered in the proof of Proposition5.1

Proof The proof contains multiple parts so we first describe it at a high level, and then give details. The entire proof is devoted to proving that Cut(N) ⊆ Cut(N), and after thisCut(N) =Cut(N)follows immediately from the fact thatCut(N) andCut(N)are both partitions ofX. The proof thatCut(N)⊆Cut(N)holds is a long proof by contradiction, which starts with the assumption that there exists some C ∈Cut(N)−Cut(N). We then show that|C| ≥2 must hold. Combined with the assumption that N is saturated we then infer that, up to symmetry, the structure of N is as indicated in Fig.5i. Choosing elementsx,p,q ∈ X as described below we obtain thatR(N)must contain two distinct tripletst1andt2with leaf set{x,p,q}. By examining the structure ofCut(N)we identify that at least one of two cases, referred to as Cases (i) and (ii) below, must hold. However, we show that Case (i) cannot hold, and thus conclude that Case (ii) must hold. We then show that, up to symmetry, the structure of Nis as indicated in Fig.5ii. We argue thatx,p,q are below three distinct highest cut arcs of N, and that none of these are the cut arc incident to the hybridization vertex ofB(whereBis the topmost gall ofN, as shown in Fig.5ii).

This implies thatR(N)cannot contain botht1andt2which finally yields the required contradiction.

Let us then start by assuming, for the sake of contradiction, that there exists some C∈Cut(N)−Cut(N).

Proof that|C| ≥2: Since bothCut(N)andCut(N)are partitions ofX, there exists someC∈ Cut(N)distinct fromCsuch thatC∩C= ∅. Since, in view of (Iersel and Kelk 2011, Observation 3) recalled above,Cis a maximal SN-set ofR(N), and, by Observation1,CandCare compatible, it follows thatCC. Thus, there exists a further elementC∈Cut(N)distinct fromCandCsuch thatC∩C= ∅holds, too. Thus,|C| ≥2.

The structure of N: Letr ∈V(N)denote the head of the cut arc(r,r)ofNfor which C=CN(r)holds and letBrdenote the gall ofNthat containsrin its underlying cycle (which exists because|C| ≥2 andNis saturated). In view of the usual assumption that no gall in a phylogenetic network has two or fewer outgoing cut arcs,Br has at least three outgoing cut arcsc1,c2andc3(see Fig.5i). Letc1denote the outgoing cut arc of

(16)

Br whose tail is the hybrid vertexhB_r ofBr. Letz∈CN(hB_r), letx∈CN(head(c2)) and let y ∈ CN(head(c3)). Clearly,R(N)contains two distinct tripletst andton {x,y,z}.

The structure of N: SinceR(N)⊆R(N)we also havet,t∈R(N). SinceCut(N) is the partition ofX induced by the maximal SN-sets ofR(N), it follows that either (i) there exists some element A ∈ Cut(N)such thatx,y,z ∈ A or (ii) there exist distinct elementsCx,Cy,Cz ∈Cut(N)such thata ∈Ca, for alla∈ {x,y,z}.

Assume first that Case (i) holds. We claim thatC ⊆ A. To see this, note that we were free to choose any two cut arcsc2andc3distinct fromc1and subsequently we had a free choice ofz,x,y. For anyZ := {x,y,z}chosen this way—let us call this a valid choice—it is straightforward to see that there exist two triplets onZinR(N) and thus inR(N). SinceAis an SN-set ofR(N)it follows that as soon as two of the three leaves of a triplet onZare contained inA, so too is the third. Now, let{x,y,z}be our initial valid choice, so by assumption{x,y,z} ⊆ A. Simple case analysis shows that for any element p ∈C, at least one of{x,y,p},{x,p,z}or{p,y,z}is a valid choice. Hence, p ∈ Awhich proves the claim. SinceC ∈/ Cut(N)we have in fact C= A. ButC Acannot hold either becauseCis a maximal SN-set forR(N)and Ais a maximal SN-set forR(N), and by Observation1this cannot happen. Thus, Case (ii) must hold (see Fig.5ii).

The triplets t1and t2: Leth∈V(N)denote the hybrid vertex of the topmost gallK of N, that is, the gall ofNthat contains the root ofNin its vertex set (which must exist becauseN is saturated). Also note that becauseC∈Cut(N)it follows that(r,r)is a highest cut arc ofN and thusris a vertex of K. Since|Cut(N)| ≥ 3 there exist distinct elementsC1,C2 ∈ Cut(N)− {C}such thatCN(h)∈ {C,C1,C2}. Choose some p ∈ C1and someq ∈ C2. Combined with the definition ofR(N)it follows thatR(N)must contain two tripletst1andt2on{x,p,q}, two triplets on{y,p,q}, and two triplets on{z,p,q}. Note that sinceR(N)⊆ R(N), those six triplets are also contained inR(N). (In the next part of the proof we assumeCN(h) = Cz, where these terms will be defined in due course, and the critical point here is that CN(h)=Cx.This is genuinely without loss of generality because when selectingt1

andt2in the present part of the proof it does not matter whether they are on{x,p,q}, {y,p,q}or{z,p,q}.)

The taxa x,p,q are all beneath distinct highest cut arcs of N,but none of these are incident to the hybridization vertex:Usingx,y,z,p andq, we next analyze the structure ofCut(N)(see Fig.5ii). Observe first that since|Cut(N)| ≥3, the root ofNmust be contained in a gallBofN. Leth∈V(N)denote the unique hybrid vertex ofB. LetCp,Cq ∈Cut(N)be such that p∈Cpandq ∈Cq.

We claim thatCpandCqare distinct elements inCut(N)− {Cx,Cy,Cz}. To see this, note first that, since the setsCx,CyandCzare pairwise distinct andt,t∈R(N), it follows that one ofx,y, andzmust be contained inCN(h). Without loss of generality, assume thatz ∈ CN(h) = Cz. Note next that Cp = Cq. Indeed, at least two elements of{x,y,z}are not contained inCp, becauseCx,CyandCzare distinct.

Suppose, without loss of generality,x∈/ Cp. IfCp=Cq, then only the triplet pq|x will be contained inR(N), contradicting the fact thatt1andt2are distinct triplets on {x,p,q}contained inR(N). It remains to show thatCp,Cq ∈ {C/ x,Cy,Cz}. Assume

(17)

for the sake of contradiction thatp ∈Cx. Then onlyx p|qis inR(N), becauseq∈/Cx, contradicting the fact that botht1andt2are inR(N). Similarly, if pis inCy, then at most one of the two triplets on{y,p,q}are inR(N), and if pis inCz, at most one of the two triplets on{z,p,q}are inR(N). SoCp∈ {/ Cx,Cy,Cz}. By a symmetrical argument,Cq∈ {/ Cx,Cy,Cz}. This proves the claim.

By the previous claim, neither p norq is in Cz. Since x ∈/ Cz it follows that t1

andt2cannot both be contained inR(N)which gives the final contradiction. Thus, Cut(N)⊆Cut(N), as required.

Since bothCut(N)andCut(N)are partitions of X, it follows thatCut(N)= Cut(N).

6 Triplet systems thatL1(X)-define

As is well-known, every binary phylogenetic treeT onX is defined by the triplet set R(T)induced byT, where a a phylogenetic treeS on X is said to bedefinedby a triplet systemRonX, if, up to equivalence,S is the unique phylogenetic tree onX for whichR⊆R(S)holds [see e. g. (Semple and Steel 2003)]. In this context it is important to note that this uniqueness only holds within the space of phylogenetic trees because all networksN ∈L1(X)that displayThave the property thatR(T)⊆R(N). Combined with the fact that the network N pictured in Fig.6i is, up to equivalence, the only binary level-1 network on X = {x1, . . . ,x4}that is consistent with all five triplets depicted in Fig.6ii—a simple case analysis can be applied to verify this—and

|R(N)| = 7, it is natural to ask how many triplets suffice to “uniquely determine”

a level-1 network. In this section we provide a partial answer to this question. More precisely, saying that a networkN ∈ L1(X)isL1(X)-defined by a triplet systemR (on X) if, up to equivalence, Nis the unique network inL1(X)such thatR⊆R(N) holds, we show that every 4-outwards network N in L1(X)that is also simple is L1(X)-defined by a triplet system of size at most 2|X| −1. In addition, we show that if the requirement thatNis simple is replaced by the requirement thatNis saturated, thenNisL1(X)-defined byR(N).

(i) (ii)

Fig. 6 The binary level-1NonX= {x1, . . . ,x4}depicted in (i) is uniquely determined by the five triplets pictured in (ii) but|R(N)| =7