Complexity Insights of the Minimum Duplication Problem


HAL Id: hal-00948488

https://hal-upec-upem.archives-ouvertes.fr/hal-00948488

Submitted on 18 Feb 2014


Guillaume Blin, Paola Bonizzoni, Riccardo Dondi, Romeo Rizzi, Florian Sikora

To cite this version:

Guillaume Blin, Paola Bonizzoni, Riccardo Dondi, Romeo Rizzi, Florian Sikora. Complexity Insights of the Minimum Duplication Problem. Theoretical Computer Science, Elsevier, 2014, 530, pp.66-79. ⟨hal-00948488⟩


Guillaume Blin^a, Paola Bonizzoni^b, Riccardo Dondi^c, Romeo Rizzi^d, Florian Sikora^e

a Université Paris-Est, LIGM, UMR 8049, France

b DISCo, Università degli Studi di Milano-Bicocca, Milano, Italy

c Dipartimento di Scienze umane e sociali, Università degli Studi di Bergamo, Bergamo, Italy

d Department of Computer Science, University of Verona, Verona, Italy

e PSL, Université Paris-Dauphine, LAMSADE, UMR 7243, Paris, France

Abstract

The Minimum Duplication problem is a well-known problem in phylogenetics and comparative genomics. Given a set of gene trees, the Minimum Duplication problem asks for a species tree that induces the minimum number of gene duplications in the input gene trees. Recently, a variant of the Minimum Duplication problem, called Minimum Duplication Bipartite, has been introduced, where the goal is to find all pre-duplications, that is duplications that in the evolution precede the first speciation with respect to a species tree. In this paper, we investigate the complexity of both Minimum Duplication and Minimum Duplication Bipartite. First of all, we prove that the Minimum Duplication problem is APX-hard, even when the input consists of five uniquely leaf-labelled gene trees (improving upon known results on the complexity of the problem). Then, we show that the Minimum Duplication Bipartite problem can be solved efficiently with a randomized algorithm when the input gene trees have bounded depth. An extended abstract of this paper appeared in SOFSEM 2012 [1].

Keywords: Minimum Duplication Problem, Comparative Genomics,


1. Introduction

The evolutionary history of the genomes of eukaryotes is the result of a series of evolutionary events, called speciations, that produce new species starting from a common ancestor. This evolutionary history has been deeply studied in computational biology, and it is usually represented using a phylogenetic tree called species tree [2]. A species tree is a rooted binary tree whose leaves are uniquely labelled by a set Λ representing the extant species, where the common ancestor of the contemporary species is associated with the root of the tree. The internal nodes represent hypothetical ancestral species (and the associated speciations). Speciations are not the only events that influence the evolution. Indeed, there are other events, such as gene duplications, gene losses and lateral gene transfers that, although not leading to new species, are fundamental in the evolution. In this paper we focus on gene duplications, which are known to be essential for the evolution of many eukaryotic groups, such as vertebrates, insects and plants [3]. A gene duplication can be described as the genomic event that causes a gene inside a genome to be copied, resulting in two copies of the same gene that can evolve independently. Genes of extant species are called homologous if they evolved from a common ancestor through speciation and duplication events [4]. The evolution of homologous genes, with regard to the extant species, is usually represented using another special kind of phylogenetic tree, called gene tree. A gene tree is a rooted binary tree whose leaves are (not necessarily uniquely) labelled by elements of the set Λ. Although, biologically speaking, leaves in the gene tree represent genes, for simplification the gene tree is labelled according to the species from which the corresponding gene was sampled. Therefore, leaves similarly labelled represent duplicated genes that evolved independently and appear in a common extant species. As in the species tree, the root and the internal nodes respectively represent the common ancestor and ancestral genes explaining their evolution.

With regard to the set of labels Λ, gene and species trees are said to be comparable. Nevertheless, due to complex evolutionary processes, such as gene duplications and losses, comparable gene trees and species trees very often present incompatibilities. An interesting problem is then to reconcile the gene trees and species trees with hypothetical gene duplications. For example, in Fig. 1, given a comparable gene tree and species tree inducing incompatibilities, one can infer a reconciled tree based on the a priori duplication of gene $g_1$ into genes $h$ and $g_3$ ($h$ is a hypothetical ancestor of genes $g_2$, $g_4$), which afterwards both speciate according to the topology of the species tree.

Figure 1: (a) a gene tree $T$. (b) a species tree $S$, where $M$ is the lca mapping from $T$ to $S$; each gene in $\{g_2, g_4, g_5\}$ is mapped by function $f$ to the species that gene belongs to. Nodes of $S$ are labelled with $s$'s. (c) a reconciled tree for $T$ and $S$ based on the a priori duplication of gene $g_1$ into genes $h$ and $g_3$.

Reconciliation is a widely investigated problem, and different approaches have been proposed in the past based on the duplication-loss model [5, 6, 7, 8, 9, 10, 11, 12, 13, 14] and also extended to consider lateral gene transfer [15, 16, 17, 18, 19]. Some approaches are based on a probabilistic model that aims to infer how a gene tree evolves within a given species tree [5, 6, 17].

Based on the principle of parsimony, one is interested in finding the minimum number of gene evolutionary events that can explain all the incompatibilities. Notice that, while we focus on minimizing duplications, other possible costs have been considered, for example the minimization of losses or the minimization of duplications and losses [12, 9, 20].

The number of gene duplications can be inferred by the so-called lowest common ancestor mapping (lca mapping), denoted by $M$. $M$ maps each ancestral gene $g$ of the gene tree to the most recent common ancestor of the extant species from which all the descendants of $g$ were sampled. Given $M$, a gene in the gene tree is a gene duplication if it has a descendant with the same $M$ mapping. Then, the reconciliation cost is defined as the number of gene duplications in the gene tree induced by the species tree. Computing a species tree inducing the minimum cost for this distance has been widely investigated under the name of the Minimum Duplication problem [21, 12, 20, 22] (defined formally afterwards).

1.1. Known results

The Minimum Duplication problem is known to be NP-hard [12]. Recently, the Minimum Duplication problem has been related to the Minimum Triplets Consistency problem [22], a problem known to be W[2]-hard [23] and not approximable within factor $O(\log n)$ [23]. These results, coupled with the reduction provided in [22], imply that the Minimum Duplication problem is NP-hard, W[2]-hard (despite [21]) and cannot be approximated within factor $O(2^{\log^{1-\varepsilon} n})$, even in the specific case of a forest composed of uniquely leaf-labelled gene trees with three leaves [22, 24] (notice that if the forest consists of a constant number of uniquely leaf-labelled gene trees with three leaves, then the problem is trivially in P).

Therefore, different heuristics and Integer Linear Programs have been developed [25, 26, 9, 27].

Recently, the Minimum Duplication Bipartite problem has been introduced to tackle the Minimum Duplication problem [28]. The Minimum Duplication Bipartite problem aims to find all the pre-duplications, that is duplications that in the evolution precede the first speciation with respect to a species tree (see Fig. 2 for an example). Roughly, this means that only the first level of the species tree is considered. Indeed, one is interested in knowing if a given species belongs to the subtree of $S$ rooted at the left child of the root or at the right one. Therefore, one can view the species tree as a bipartition $(\Lambda_1, \Lambda_2)$ of the set of species $\Lambda$. Solving the Minimum Duplication Bipartite problem recursively produces a natural greedy heuristic for the Minimum Duplication problem. The Minimum Duplication Bipartite problem was shown to be 2-approximable [28], but its complexity remains open.

In this contribution, we provide results for both the Minimum Duplication problem and the Minimum Duplication Bipartite problem. First of all, we prove that the Minimum Duplication problem is APX-hard, even when the input consists of five uniquely leaf-labelled gene trees (that is for a constant number of gene trees). Then, we show that the Minimum Duplication Bipartite problem can be solved efficiently with a randomized algorithm when the input gene trees have bounded depth.

2. Preliminaries

In this section we introduce some preliminary definitions and properties that will be useful in the rest of the paper. Given a binary tree $U$, we denote by $\Lambda(U)$ the set of its leaves. Given an internal node $x$ of $U$, we denote by $U(x)$ the subtree of $U$ rooted at node $x$, and by $\Lambda(U(x))$ the set of leaf labels of $U(x)$. When there is no ambiguity on the tree considered, we denote $\zeta(x) = \Lambda(U(x))$; $\zeta(x)$ is called the cluster of $x$. Given a tree $T$, we denote by $lca_T(u, v)$ the lowest common ancestor of two nodes $u$ and $v$ in $T$.

Figure 2: A set $F = (T_1, T_2)$ of gene trees and the species tree $S$, which is a bipartition $(\{1, 2, 3, 4, 5\}, \{6, 7, 8, 9\})$. For the sake of clarity, we have labelled the leaves of $T_1$ and $T_2$ with the species of $S$. The only node with a pre-duplication is the root of $T_1$, since this node and its two children are mapped by $M$ (via the dashed lines) to the root of $S$. Still for the sake of clarity, the mapping is not drawn for all nodes of $T_1$ and $T_2$.

Given a gene tree T and a species tree S, leaf-labelled by a set Λ, we define a mapping M (also called least common ancestor mapping) from the nodes of T to the nodes of S, defined as follows. M(x) = y where y is the node of S having minimum cluster such that ζ(x) ⊆ ζ(y).

For example, in Fig. 1, according to $M$, $g_3$ is mapped to $s_1$, since $s_1$ is the most recent common ancestor of $s_4$ and $s_5$, from which the descendants $g_4$ and $g_5$ of $g_3$ were respectively sampled (the sampling is represented as a function $f$).

Observe that, considering $M$, any leaf of the gene tree is mapped to the unique leaf of $S$ similarly labelled (according to $\Lambda$). Given $M$, a node $x$ in a gene tree $T$ is a duplication if $M(x) = M(x')$, where $x'$ is a child of $x$. The duplication cost of $T$ with respect to $S$, denoted by $d(T, S)$, is the number of duplications induced in $T$. Given a forest $F = \{T_1, T_2, \ldots, T_k\}$, the duplication cost of $F$ with respect to $S$, denoted by $d(F, S)$, is defined as $d(F, S) = \sum_{i=1}^{k} d(T_i, S)$.
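The lca mapping and the duplication cost defined above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: trees are encoded as nested pairs with string leaves, every gene-tree label is assumed to occur in the species tree, and the names `cluster`, `lca_image`, `duplication_cost` and `forest_cost`, as well as the toy trees, are hypothetical and not taken from the paper.

```python
from typing import Tuple, Union

# Assumed encoding: a leaf is a species label (str),
# an internal node is a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def cluster(tree: Tree) -> frozenset:
    """Set of leaf labels below a node (the cluster of the node)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return cluster(tree[0]) | cluster(tree[1])


def lca_image(labels: frozenset, species: Tree) -> frozenset:
    """Cluster of the node of S with minimum cluster containing `labels`.

    S is uniquely leaf-labelled, so a node of S is identified by its cluster.
    """
    node = species
    while not isinstance(node, str):
        left, right = node
        if labels <= cluster(left):
            node = left
        elif labels <= cluster(right):
            node = right
        else:
            break
    return cluster(node)


def duplication_cost(gene: Tree, species: Tree) -> int:
    """d(T, S): number of nodes x of T with M(x) = M(x') for a child x'."""
    if isinstance(gene, str):
        return 0
    image = lca_image(cluster(gene), species)
    is_dup = any(lca_image(cluster(child), species) == image for child in gene)
    return int(is_dup) + sum(duplication_cost(child, species) for child in gene)


def forest_cost(forest, species: Tree) -> int:
    """d(F, S): sum of d(T_i, S) over the gene trees of F."""
    return sum(duplication_cost(tree, species) for tree in forest)


# Hypothetical toy instance (not one of the paper's figures).
species_tree = (("a", "b"), "c")
forest = [(("a", "b"), "c"), (("a", "c"), "b"), (("a", "a"), "b")]
print([duplication_cost(t, species_tree) for t in forest])  # [0, 1, 1]
```

The sketch recomputes clusters on every call, which is quadratic in the tree size; it favours readability over the linear-time lca machinery one would use in practice.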

The Minimum Duplication problem [21, 12, 20, 22] is defined as follows:

Minimum Duplication

• Input: A set $F$ of gene trees.

• Output: A species tree $S$, with $\Lambda(S) \supseteq \bigcup_{T \in F} \Lambda(T)$.

• Measure: Duplication cost $d(F, S)$.

A variant of the problem has been introduced in [28], where the goal is to compute the number of duplications induced by nodes of the gene trees mapped in the root of the species tree. Given a gene tree $T$ and a species tree $S$, a node $x$ of $T$ is a pre-duplication if $x$ and one of its children are both mapped in the root of $S$. Formally, the Minimum Duplication Bipartite problem is defined as follows:

Minimum Duplication Bipartite

• Input: A set of labels $\Lambda$, a set $F$ of gene trees on $\Lambda$.

• Output: A bipartition $(\Lambda_1, \Lambda_2)$ of $\Lambda$.

• Measure: The number of pre-duplications.
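As an illustration of the measure just defined, the following Python sketch counts pre-duplications for a given bipartition and, for tiny instances only, searches all bipartitions exhaustively. It relies on the observation that a node is mapped to the root of $S$ exactly when its cluster meets both $\Lambda_1$ and $\Lambda_2$. The exhaustive search is an assumption made for illustration; it is exponential and is not the randomized algorithm studied in the paper. The tree encoding (nested pairs with string leaves) and all function names are hypothetical.

```python
from itertools import combinations


def leaves(tree):
    """Set of leaf labels of a tree encoded as nested pairs."""
    if isinstance(tree, str):
        return frozenset([tree])
    return leaves(tree[0]) | leaves(tree[1])


def spans(labels, part1, part2):
    """True if a node with this cluster is mapped to the root of S."""
    return bool(labels & part1) and bool(labels & part2)


def pre_duplications(tree, part1, part2):
    """Number of nodes mapped to the root together with one of their children."""
    if isinstance(tree, str):
        return 0
    count = sum(pre_duplications(child, part1, part2) for child in tree)
    # If a child spans both parts, then so does its parent.
    if any(spans(leaves(child), part1, part2) for child in tree):
        count += 1
    return count


def best_bipartition(forest, labels):
    """Exhaustive search over bipartitions (illustration only, exponential)."""
    labels = sorted(labels)
    first, rest = labels[0], labels[1:]
    best = None
    for extra_size in range(len(rest)):  # keeps the second part non-empty
        for extra in combinations(rest, extra_size):
            part1 = frozenset((first,) + extra)
            part2 = frozenset(labels) - part1
            cost = sum(pre_duplications(t, part1, part2) for t in forest)
            if best is None or cost < best[0]:
                best = (cost, part1, part2)
    return best


# Hypothetical toy forest on four species (not Fig. 2 of the paper).
forest = [(("1", "2"), ("3", "4")), (("1", "3"), ("2", "4"))]
cost, part1, part2 = best_bipartition(forest, {"1", "2", "3", "4"})
print(cost, sorted(part1), sorted(part2))  # 1 ['1', '2'] ['3', '4']
```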

2.1. Properties of the lca mapping

Let us introduce some fundamental properties that will be used in the rest of this paper. In the following, for ease, given a binary tree $T = (V, E)$ and a vertex $v \in V$, let us denote by $v^L$ (resp. $v^R$) the left (resp. right) child of $v$, and by $\zeta_v$ the cluster of $v$, i.e. the set of all leaves belonging to the subtree rooted in $v$. Moreover, for ease, $\vartheta_T$ will denote the root of the tree $T$.

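Property 1. Let $T$, $T'$ be two gene trees labelled by the same set of leaves $\Lambda$. Consider the bipartitions $b_1 = (\zeta(\vartheta_T^L), \zeta(\vartheta_T^R))$, $b_2 = (\zeta(\vartheta_{T'}^L), \zeta(\vartheta_{T'}^R))$ of $\Lambda$. Then either $b_1$ and $b_2$ are identical or any species tree $S$ induces at least one duplication in the root of $T$ or in the root of $T'$.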

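Proof. The property follows easily from the observation that, when $b_1$ and $b_2$ differ, in any bipartition of the leaves $\Lambda$ one of the sets contains a leaf of $\zeta(\vartheta_T^L)$ and a leaf of $\zeta(\vartheta_T^R)$, or a leaf of $\zeta(\vartheta_{T'}^L)$ and a leaf of $\zeta(\vartheta_{T'}^R)$.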

Property 2. Let $T = (V_T, E_T)$ be a gene tree and $S = (V_S, E_S)$ be a species tree. Let $v$ be a vertex of $V_T$ such that $v$ has at least one child $v'$ which is not a leaf. If there exists a vertex $w$ of $V_S$ such that (a) $\zeta(v) \setminus \zeta(v') \subseteq \zeta(w)$, (b) $\zeta(w) \nsupseteq \zeta(v)$ and (c) $\zeta(w) \cap \zeta(v') \neq \emptyset$, then $v$ is duplicated.

Proof. Let us consider the vertex $w$ in $S$. Notice that, since (a) $\zeta(v) \setminus \zeta(v') \subseteq \zeta(w)$, (b) $\zeta(w) \nsupseteq \zeta(v)$ and (c) $\zeta(w) \cap \zeta(v') \neq \emptyset$, there exists at least a label $l$ such that $l \in \zeta(v') \setminus \zeta(w)$ (otherwise $\zeta(w)$ would contain $\zeta(v)$). Furthermore, as $\zeta(w) \cap \zeta(v') \neq \emptyset$, it follows that $v'$ and $v$ are mapped to vertices of $S$ that are on the path from $w$ to $\vartheta_S$. Let $w'$ be the vertex of $S$ where $v'$ is mapped (i.e. $M(v') = w'$). Note that $w'$ is defined such that $\zeta(v') \subseteq \zeta(w')$ and $\nexists z$ such that $\zeta(v') \subseteq \zeta(z)$ and $|\zeta(z)| < |\zeta(w')|$. Since $w'$ is an ancestor of $w$, it follows that $\zeta(v) \subseteq \zeta(w')$. Hence, $v$ is mapped to $w'$ (i.e. $M(v) = w'$). As a consequence $v$ is duplicated (see Fig. 3).

Figure 3: Illustration of Property 2 where (a) $\zeta(v) \setminus \zeta(v') \subseteq \zeta(w)$, (b) $\zeta(w) \nsupseteq \zeta(v)$ and (c) $\zeta(w) \cap \zeta(v') \neq \emptyset$.

3. On a tight inapproximability

We present a reduction from Minimum Vertex Cover on cubic graphs (MVCC) to the restriction of the Minimum Duplication problem, denoted Min-5-Dup, where $F$ consists of five gene trees, that is $F = \{T_1, T_2, T_3, T_4, T_5\}$. The MVCC problem is defined as follows:

Minimum Vertex Cover on cubic graphs (MVCC)

• Input: A cubic graph $G = (V_G, E_G)$ (i.e. every vertex has degree three).

• Output: A subset $V'_G \subseteq V_G$, such that for each edge $\{v_i, v_j\} \in E_G$, at least one of $v_i$, $v_j$ belongs to $V'_G$.

• Measure: $|V'_G|$.
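To make the source problem of the reduction concrete, here is a small Python sketch that checks the cubic condition, tests whether a vertex set is a cover, and finds a minimum cover by exhaustive search. The exhaustive search is only meant for tiny graphs and is an assumption of this sketch; the function names and the use of the complete graph on four vertices as the example are likewise hypothetical choices.

```python
from itertools import combinations


def is_cubic(vertices, edges):
    """True if every vertex has degree exactly three."""
    degree = {v: 0 for v in vertices}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return all(d == 3 for d in degree.values())


def is_vertex_cover(cover, edges):
    """True if every edge has at least one endpoint in `cover`."""
    return all(u in cover or v in cover for u, v in edges)


def minimum_vertex_cover(vertices, edges):
    """Smallest cover found by trying subsets of increasing size."""
    for size in range(len(vertices) + 1):
        for candidate in combinations(vertices, size):
            if is_vertex_cover(set(candidate), edges):
                return set(candidate)


# The complete graph on four vertices is the smallest cubic graph.
k4_vertices = [1, 2, 3, 4]
k4_edges = list(combinations(k4_vertices, 2))
print(is_cubic(k4_vertices, k4_edges))                   # True
print(len(minimum_vertex_cover(k4_vertices, k4_edges)))  # 3
```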

In a first step (see Section 3.1), starting from a cubic graph $G = (V_G, E_G)$, we construct an associated input $F = \{T_1, \ldots, T_5\}$ of Min-5-Dup. Then in Section 3.2, we prove that (1) any species tree $S$ such that $d(F, S) < q = 6|V_G| + 3|E_G| + 1$ must induce duplications in the spine (the definition of spine is given afterwards) of trees $T_1, \ldots, T_4$, and (2) that our construction is indeed an L-reduction.

3.1. Extra definitions and construction

In order to define the gene trees formally, let us first define the central notion of comb graph. We will consider a specific subclass of comb graphs corresponding to binary trees where all the internal nodes lie on a single simple (i.e. with no repeated vertices) path. We will nevertheless use the term comb graph in the following to denote these trees. Given a sequence $L = \langle l_1, \ldots, l_k \rangle$ of $k$ labels, let $C(L)$ denote the comb graph whose leaves are labelled according to a postorder traversal using $L$ (i.e. $l_x \in L$, $1 \le x \le k-2$, is the label of the unique leaf of depth $x$, and $l_{k-1}$, $l_k$ are both at depth $k-1$). For example, in Fig. 1, the gene tree (a) corresponds to the comb graph $C(\langle g_2, g_4, g_5 \rangle)$.
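The comb graph $C(L)$ described above can be sketched as follows. The encoding of trees as nested pairs with string leaves and the helper names `comb` and `leaf_depths` are assumptions made for this illustration only.

```python
def comb(labels):
    """Build C(<l_1, ..., l_k>): l_x hangs at depth x for x <= k-2,
    and the last two labels are siblings at depth k-1."""
    if len(labels) < 2:
        raise ValueError("a comb graph needs at least two labels")
    tree = (labels[-2], labels[-1])
    for label in reversed(labels[:-2]):
        tree = (label, tree)
    return tree


def leaf_depths(tree, depth=0):
    """Map each leaf label to its depth (the root has depth 0)."""
    if isinstance(tree, str):
        return {tree: depth}
    depths = {}
    for child in tree:
        depths.update(leaf_depths(child, depth + 1))
    return depths


# The comb graph that the text associates with gene tree (a) of Fig. 1.
print(comb(["g2", "g4", "g5"]))                 # ('g2', ('g4', 'g5'))
print(leaf_depths(comb(["a", "b", "c", "d"])))  # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
```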

Let us now define some operations on trees. Let $T_1 \triangle T_2$ be the tree obtained from two trees $T_1$ and $T_2$ by connecting the roots of $T_1$ and $T_2$ to a new vertex $v$, which becomes the root of $T_1 \triangle T_2$. The insertion of $T_2$ in the edge $e$ of $T_1$ denotes the operation that leads to a tree obtained from $T_1$ by replacing the edge $e = \{v, v'\}$ in $T_1$ by two edges $\{v, w\}$ and $\{w, v'\}$ and connecting the root of $T_2$ to the new vertex $w$. Given a binary tree $T = (V, E)$ with leaves labelled by $\Lambda$ and a subset $\Lambda' \subseteq \Lambda$, we define the restriction of $T$ to $\Lambda'$, denoted $T|\Lambda'$, as the subtree obtained from $T$ by retaining only leaves with a label belonging to $\Lambda'$ and by contracting all the internal vertices of degree 2.
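The three tree operations can be sketched in Python as below. The sketch assumes trees encoded as nested pairs with string leaves and uniquely leaf-labelled, so that an edge $\{v, v'\}$ can be identified by the subtree rooted at its lower endpoint $v'$; the function names `join`, `insert_on_edge` and `restrict` are hypothetical.

```python
def join(t1, t2):
    """T1 (triangle) T2: a new root whose children are the roots of T1 and T2."""
    return (t1, t2)


def insert_on_edge(tree, child, inserted):
    """Insert `inserted` in the edge joining the subtree `child` to its parent.

    The edge is subdivided by a new vertex w whose children are `child`
    and the root of `inserted`. `child` is assumed to be a proper subtree.
    """
    if tree == child:
        return (tree, inserted)
    if isinstance(tree, str):
        return tree
    return tuple(insert_on_edge(c, child, inserted) for c in tree)


def restrict(tree, keep):
    """T | keep: retain the leaves in `keep` and contract degree-2 vertices."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [r for r in (restrict(c, keep) for c in tree) if r is not None]
    if not kept:
        return None
    if len(kept) == 1:  # this vertex would have a single child: contract it
        return kept[0]
    return (kept[0], kept[1])


base = ("a", ("b", "c"))
print(join(base, ("x", "y")))                 # (('a', ('b', 'c')), ('x', 'y'))
print(insert_on_edge(base, "b", ("x", "y")))  # ('a', (('b', ('x', 'y')), 'c'))
print(restrict(("a", ("b", ("c", "d"))), {"a", "d"}))  # ('a', 'd')
```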

We are now ready to define the gene trees $T_1, \ldots, T_5$. Roughly, we will associate with each vertex $v \in V_G$ a specific tree $T_v$, and with each edge $e \in E_G$ two trees $T^1_e$, $T^2_e$. These trees will then be combined to build the gene trees $T_1, \ldots, T_5$. For ease, let us consider the following order on the edges of $E_G$, $\langle e_1, e_2, \ldots, e_{|E_G|} \rangle$: for all $e_x = \{v_i, v_j\}$, $e_y = \{v_h, v_k\}$, where $i < j$ and $h < k$, it holds that $x < y$ if either $i < h$, or $i = h$ and $j < k$. Set $q = 6|V_G| + 3|E_G|$.

According to this order, we define the following sequences of labels:

• $M_1 = \langle m^1_1, m^1_2, \ldots, m^1_{|E_G|} \rangle$

• $M_2 = \langle m^2_1, m^2_2, \ldots, m^2_{|E_G|} \rangle$

• $L = \langle l^1_1, l^2_1, l^1_2, l^2_2, \ldots, l^1_{|E_G|}, l^2_{|E_G|} \rangle$

• $L_f = \langle f_{|V_G|+1}, f^1_{|V_G|}, \ldots, f^q_{|V_G|}, f_{|V_G|}, \ldots, f^1_1, \ldots, f^q_1, f_1 \rangle$

The sequences $M_1$, $M_2$, $L$ encode the edges of the cubic graph $G$ and belong to the subtrees (defined later) encoding the vertices and the edges of $G$. The labels are grouped as in the definition given above, since three labels $m^1_x, m^1_y, m^1_z$ belong to the same subtree of $T_{v_i}$, three labels $m^2_x, m^2_y, m^2_z$ belong to the same subtree of $T_{v_i}$ and three labels of $L$ belong to the same subtree of $T_{v_i}$ (see Fig. 4).

The sequence $L_f$ consists of labels associated with leaves connected to the spine of $T_5$ (a similar sequence is used for $T_1, \ldots, T_4$). This set is introduced to separate the subtrees encoding different vertices and edges of $G$ (see Fig. 5). Moreover, for ease of notation we denote the following subsequence of labels of $L_f$:

• $L^i_f = \langle f^1_i, \ldots, f^q_i \rangle$, with $1 \le i \le |V_G|$.

The subsequence $L^i_f$ is used to define those labels of $T_5$ associated with a single index $i$, $1 \le i \le |V_G|$ (see Fig. 5).
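The edge order and the label sequences $M_1$, $M_2$, $L$ and $L_f$ can be generated mechanically, as in the following Python sketch. The string naming scheme (`m1_3` for $m^1_3$, `l2_3` for $l^2_3$, `f7_2` for $f^7_2$, `f_2` for $f_2$) and the function name `label_sequences` are assumptions of this sketch, and $q$ is taken as $6|V_G| + 3|E_G|$, the value set in the construction.

```python
from itertools import combinations


def label_sequences(n_vertices, edges):
    """Return the ordered edges, q, and the sequences M1, M2, L and Lf."""
    # Edges are written {v_i, v_j} with i < j and sorted lexicographically.
    ordered = sorted(tuple(sorted(edge)) for edge in edges)
    n_edges = len(ordered)
    q = 6 * n_vertices + 3 * n_edges
    m1 = [f"m1_{x}" for x in range(1, n_edges + 1)]
    m2 = [f"m2_{x}" for x in range(1, n_edges + 1)]
    l_seq = [f"l{k}_{x}" for x in range(1, n_edges + 1) for k in (1, 2)]
    # Lf starts with f_{|V|+1}; then, for i = |V| down to 1,
    # the block f^1_i, ..., f^q_i followed by f_i.
    lf = [f"f_{n_vertices + 1}"]
    for i in range(n_vertices, 0, -1):
        lf.extend(f"f{z}_{i}" for z in range(1, q + 1))
        lf.append(f"f_{i}")
    return ordered, q, m1, m2, l_seq, lf


# Example on the complete graph on four vertices (a cubic graph).
k4_edges = [(b, a) for a, b in combinations([1, 2, 3, 4], 2)]  # deliberately unsorted
ordered, q, m1, m2, l_seq, lf = label_sequences(4, k4_edges)
print(ordered)                   # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
print(q, len(lf))                # 42 173
print(l_seq[:4])                 # ['l1_1', 'l2_1', 'l1_2', 'l2_2']
print(lf[0], lf[1], lf[-1])      # f_5 f1_4 f_1
```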

Roughly, any edge $e_x$ is represented by the four labels $\{m^1_x, m^2_x, l^1_x, l^2_x\}$. First of all, for any edge $e_x \in E_G$, let us build the two trees $T^1_{e_x} = C(\langle l^1_x, m^2_x, l^2_x \rangle)$ and $T^2_{e_x} = C(\langle l^2_x, m^2_x, l^1_x \rangle)$. Moreover, recall that $G$ is cubic, therefore any vertex has degree three. Then, for any $v_i \in V_G$ such that $e_x = \{v_i, v_j\}$, $e_y = \{v_i, v_{j'}\}$, $e_z = \{v_i, v_{j''}\} \in E_G$, with $j < j' < j''$, we build a tree

$T_{v_i} = C(\langle m^1_x, m^1_y, m^1_z \rangle) \triangle C(\langle l^k_x, l^{k'}_y, l^{k''}_z \rangle) \triangle C(\langle m^2_x, m^2_y, m^2_z \rangle)$

where $k$ (resp. $k'$, $k''$) is set to 1 if $i < j$ (resp. $i < j'$, $i < j''$), and to 2 otherwise (see Fig. 4).

Figure 4: The trees $T_{v_i}$, $T^1_{e_x}$ and $T^2_{e_x}$ for $v_i \in V_G$ such that $e_x = \{v_i, v_j\}$, $e_y = \{v_i, v_{j'}\}$ and $e_z = \{v_i, v_{j''}\}$ and $i < j$, $i > j'$ and $i < j''$.

Now, we build the gene trees $T_1$ to $T_5$. We start from a comb graph into which the subtrees representing vertices and edges will be inserted. Let $T_5$ be obtained from $C(L_f)$ by inserting the subtree $C(M_1) \triangle C(M_2)$ in the edge connecting $f_1$ and its parent (see Fig. 5).

Regarding the construction of $T_1$ to $T_4$, let us assume that we are also provided a 4-coloring $\lambda : V_G \to \{1, 2, 3, 4\}$ of $G$ (for example, by applying the polynomial-time greedy Welsh-Powell algorithm [29]). Let any $T_i$, $1 \le i \le 4$, be first defined as the following comb graph: $T_i = C(\langle f_1, f_2, \ldots, f_{|V_G|+1}, f^1_{|V_G|}, \ldots, f^q_{|V_G|}, f^1_{|V_G|-1}, \ldots, f^q_1 \rangle)$. We then insert, for each $v_i \in V_G$, the tree $T_{v_i}$ in the edge connecting the parents of $f_i$ and $f_{i+1}$ in the gene tree $T_x$ where $x = \lambda(v_i)$ (see Fig. 5). Moreover, for each $e_x = \{v_i, v_j\} \in E_G$ (ordered from $e_1$ to $e_{|E_G|}$), the tree $T^1_{e_x}$ is inserted in the edge connecting the parent of $f_i$ and its other child in the gene tree $T_x$ where $x = \min(\{1, 2, 3, 4\} \setminus \{\lambda(v_i), \lambda(v_j)\})$ (i.e. the gene tree having the smallest index and containing neither $T_{v_i}$ nor $T_{v_j}$). Finally, for each $e_x = \{v_i, v_j\} \in E_G$, the tree $T^2_{e_x}$ is inserted in the edge connecting the parent of $f_i$ and its other child in the gene tree $T_x$ where $x = \max(\{1, 2, 3, 4\} \setminus \{\lambda(v_i), \lambda(v_j)\})$ (i.e. the gene tree having the biggest index and containing neither $T_{v_i}$ nor $T_{v_j}$). A sketch of this construction is given in Fig. 5.
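The placement rule above depends only on the coloring, so it can be illustrated without building the full gene trees. The Python sketch below computes a greedy 4-coloring of a cubic graph and, for each edge, the indices of the gene trees hosting $T^1_{e_x}$ and $T^2_{e_x}$. It is a simplified stand-in: a plain greedy coloring in vertex order is used (in a cubic graph all degrees are equal, so the degree ordering of Welsh-Powell imposes no constraint), and the function names are hypothetical.

```python
def greedy_four_coloring(adjacency):
    """Give each vertex the smallest color of {1, 2, 3, 4} unused by its neighbours.

    With maximum degree three, at most three colors are blocked,
    so a fourth one is always available.
    """
    coloring = {}
    for v in sorted(adjacency):
        used = {coloring[u] for u in adjacency[v] if u in coloring}
        coloring[v] = min(c for c in (1, 2, 3, 4) if c not in used)
    return coloring


def edge_tree_hosts(edge, coloring):
    """Indices of the gene trees receiving T^1_e (smallest free color)
    and T^2_e (biggest free color) for the edge e = {v_i, v_j}."""
    vi, vj = edge
    free = {1, 2, 3, 4} - {coloring[vi], coloring[vj]}
    return min(free), max(free)


# Example: the complete graph on four vertices (cubic).
k4 = {1: [2, 3, 4], 2: [1, 3, 4], 3: [1, 2, 4], 4: [1, 2, 3]}
coloring = greedy_four_coloring(k4)
print(coloring)                           # {1: 1, 2: 2, 3: 3, 4: 4}
print(edge_tree_hosts((1, 2), coloring))  # (3, 4)
print(edge_tree_hosts((1, 4), coloring))  # (2, 3)
```

Because the endpoints of an edge receive different colors, exactly two colors remain free, so the two edge trees always land in two distinct gene trees, neither of which contains the trees of the endpoints.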

Let $P_x$ be the set of internal vertices in $T_x$, $1 \le x \le 4$, belonging to the path from the root of $T_x$ to the parent $p^x_{|V_G|+1}$ of $f_{|V_G|+1}$. We define the spine of any gene tree $T_x$, $1 \le x \le 4$, as $P_x \setminus \{p^x_{|V_G|+1}\}$.

Figure 5: Gene trees $T_1$ to $T_5$ obtained from the cubic graph $G$ where $L_f = \langle f_5, f^1_4, \ldots, f^q_4, \ldots, f^1_1, \ldots, f^q_1, f_1 \rangle$ and $\forall 1 \le i \le 4$, $\lambda(v_i) = i$. Each tree $T_i$, $1 \le i \le 4$, is obtained by inserting trees $T_{v_j}$, $T^x_{j,h}$ in the comb graph $C(\langle f_1, \ldots, f_5 \rangle)$. Notice that $T_5$ contains a comb graph $C(\langle f_5, \ldots, f_1 \rangle)$.

Given a species tree $S$, a vertex $v$ of a gene tree $T$ is duplicated with respect to $S$ if $M(v) = M(v')$ where $v'$ is a descendant of $v$ in $T$ (i.e. $v'$ belongs to $T_v$).

3.2. Correctness of the reduction

Before giving the details of the proofs, we will give an overview of the reduction. First of all, we will prove in Lemma 1 that all the gene trees in $F$ are uniquely leaf-labelled. Then, we will prove (Lemma 2) that we can restrict ourselves to solutions, i.e. species trees, that contain an isomorphic copy of $T_5$, thus inducing duplications in each internal node of the spine of $T_1, \ldots, T_4$. Then, applying the result of Lemma 3, we are able to relate the duplications of subtrees $T_v$, $v \in V_G$, with the corresponding vertices in the vertex cover of the graph $G$.

First of all, let us prove that, by construction, all the gene trees $T_1, \ldots, T_5$ are uniquely leaf-labelled.

Lemma 1. The trees $T_1, \ldots, T_5$ are uniquely leaf-labelled trees.

Proof. It is easy to see that $T_5$ is uniquely leaf-labelled by construction: each of the sequences $L_f$, $M_1$, $M_2$ consists of distinct labels, and the trees $C(L_f)$, $C(M_1)$, $C(M_2)$ have pairwise disjoint sets of leaves.

Now, consider the gene trees $T_1$, $T_2$, $T_3$ and $T_4$. First, remark that each tree $T_v$, with $v \in V_G$, is uniquely leaf-labelled, and so are the trees $T^1_{e_x}$ and $T^2_{e_x}$ (see Fig. 4). Then, consider the relative placement of $T_v$, $T^1_{e_x}$ and $T^2_{e_x}$ in the gene trees. More precisely, by construction, we have to ensure that a tree $T_v$, where $v \in V_G$ is incident to the three edges $e_x = \{v, v'\}$, $e_y = \{v, v''\}$ and $e_z = \{v, v'''\}$, does not belong to the same gene tree as any tree of $\{T_{v'}, T_{v''}, T_{v'''}\}$ or of $\{T^1_{e_x}, T^2_{e_x}, T^1_{e_y}, T^2_{e_y}, T^1_{e_z}, T^2_{e_z}\}$. This is indeed true since all those trees are associated with the gene trees considering their corresponding colors in the 4-coloring of $G$, which ensures that (a) $T_v$, $T_{v'}$, $T_{v''}$ and $T_{v'''}$ belong to four different trees, (b) $\{T^1_{e_x}, T^2_{e_x}\}$ are inserted in the trees where $\{T_v, T_{v'}\}$ is not present (a similar property holds for $\{T^1_{e_y}, T^2_{e_y}\}$ and $\{T^1_{e_z}, T^2_{e_z}\}$). Moreover, the trees $T^1_{e_x}$ and $T^2_{e_x}$ do not belong to the same gene tree (which is the case by construction), and this concludes the proof.

Let us now prove that we are interested only in solutions that induce a duplication in each node on the spine of $T_x$, $x \in \{1, \ldots, 4\}$.

First, let us consider the following order induced by the lca mapping $M$. Consider three vertices $v$, $v'$, $v''$ of a tree $T$ and the following ordering of their lowest common ancestors: we write $lca_T(v, v') > lca_T(v', v'')$ ($lca_T(v, v') \ge lca_T(v', v'')$ respectively) when the depth of the lca of $v'$, $v''$ is greater (greater or equal respectively) than the one of $v$, $v'$.

Lemma 2. Let $S$ be a solution of Min-5-Dup for the instance $F = \{T_1, \ldots, T_5\}$. Then, either $d(F, S) \ge 6|V_G| + 3|E_G| + 1$ or all the vertices on the spines of the gene trees $T_1$, $T_2$, $T_3$, $T_4$ are duplicated.

Proof. Consider any species tree $S$ and two leaves $f_i$, $f_{i+1}$ of $T_5$, for a given $1 \le i \le n$. Let $w^j_i$ (resp. $w^j_{i+1}$) be the parent of $f_i$ (resp. $f_{i+1}$) in the gene tree $T_j$, with $j \in \{1, 2, 3, 4\}$. Let $x_i$ (resp. $x_{i+1}$) denote the parent in $T_5$ of $f_i$ (resp. $f_{i+1}$). Similarly, let $x^z_i$ denote the parent in $T_5$ of $f^z_i$.

In what follows, we consider a label $f^z_i$, with $1 \le z \le q$, and we prove that, considering the previously mentioned mapping, either all the internal vertices on the path from $x_i$ to $x_{i+1}$ in $T_5$ are duplicated (hence $d(F, S) \ge q = 6|V_G| + 3|E_G| + 1$) or all the internal vertices on the path from $w^j_i$ (included) to $w^j_{i+1}$ (not included) are duplicated in $T_j$, $1 \le j \le 4$. To do so, we will consider a case by case analysis based on the possible mappings of $f_i$, $f_{i+1}$.

More precisely, we have the following possible cases: either for each $f^z_i \in L^i_f$, $lca_S(f_i, f^z_i) \ge lca_S(f_{i+1}, f^z_i)$ (Case 1), or there exists an $f^z_i \in L^i_f$ such that $lca_S(f_i, f^z_i) < lca_S(f_{i+1}, f^z_i)$ (Case 2). For Case 1, we have two possible subcases: for each $f^z_i \in L^i_f$, $lca_S(f_i, f^z_i) > lca_S(f_{i+1}, f^z_i)$ (Case 1.a), or there exists an $f^z_i$ such that $lca_S(f_i, f^z_i) \ge lca_S(f_{i+1}, f^z_i)$ (Case 1.b).

Consider the subtree $S' = S|(\{f_i\} \cup \{f_{i+1}\} \cup (\bigcup_z \{f^z_i\}))$. Intuitively, there exists a vertex $x$ of $S'$ such that $\zeta(x) = \{f_i\} \cup (\bigcup_z \{f^z_i\})$, hence $f_{i+1} \notin \zeta(x)$ (Case 2), or not (Case 1). Case 1 can have two possible subcases: there exists a vertex $y$ of $S'$ such that $\zeta(y) = \{f_{i+1}\} \cup (\bigcup_z \{f^z_i\})$ (Case 1.a) or not (Case 1.b), in which case there is a vertex $w$ of $S'$ such that $f_i, f_{i+1} \in \zeta(w)$ and $f^z_i \notin \zeta(w)$, for some $z$.

Figure 6: Illustration of Lemma 2, when for each $f^z_i \in L^i_f$, $lca_S(f_i, f^z_i) > lca_S(f_{i+1}, f^z_i)$. Vertices $x_{i+1}, \ldots, x^{z+1}_i$ of $T_5$ are duplicated.

(Case 1) Assume that for each $f^z_i \in L^i_f$, $lca_S(f_i, f^z_i) \ge lca_S(f_{i+1}, f^z_i)$. Notice that there exist two possible cases: $lca_S(f_i, f^z_i) > lca_S(f_{i+1}, f^z_i)$ for each $f^z_i$ (Case 1.a, see Fig. 6), or there exists at least one $f^z_i$ such that $lca_S(f_i, f^z_i) > lca_S(f_{i+1}, f_i)$ (Case 1.b, see Fig. 7).

(Case 1.a) In this case $lca_S(f_i, f^z_i) > lca_S(f_{i+1}, f^z_i)$, for each $f^z_i$. Property 2 applies to each internal vertex between $x_i$, $x_{i+1}$ (excluding $x_i$), with $v = x^z_i$, $v' = x^{z+1}_i$, $x = f^{z+1}_i$ and $l = f_i$ (see Fig. 6). Hence $d(F, S) \ge q = 6|V_G| + 3|E_G| + 1$ since each $x^j_i$ with $1 \le j \le q - 1$ and $x_{i+1}$ are duplicated.

(Case 1.b) In this case, for some $f^z_i$, it holds that $lca_S(f_i, f^z_i) > lca_S(f_{i+1}, f_i)$. Let us consider the leaves in $\zeta(w^j_i) \setminus \zeta(w^j_{i+1})$ and let $S^j_i$ be the set of internal vertices of $T_j$, $1 \le j \le 4$, between $w^j_{i+1}$ and $w^j_i$. Property 2 applies to the vertex $w^j_i$, as $lca_S(f_{i+1}, f_i)$ contains $f_i$, $f^z_i$ but not $f_{i+1}$, which is contained in both $\zeta(w^j_{i+1})$ and $\zeta(w^j_i)$. Hence $w^j_i$ is duplicated (see Fig. 7a). Now, we have to consider the vertices between $w^j_i$ and $w^j_{i+1}$.

Figure 7: Illustration of Lemma 2 when there exists at least one $f^z_i$ such that $lca_S(f_i, f^z_i) > lca_S(f_{i+1}, f_i)$. In case a) the vertices of $T_j$, $1 \le j \le 4$, between $w^j_i$ and $w^j_{i+1}$ are duplicated. In case b), since vertex $w^j_{i+1}$ is not mapped in $z_1$, all the vertices between $x_{i+1}$ and $x^{z+1}_i$ in $T_5$ are duplicated.

Consider the lowest vertex $s^j_i \in S^j_i$ not duplicated and denote by $t^j_i$ its child which is not on the spine of $T_j$, $1 \le j \le 4$. Let $z_1$ be the vertex of $S$ where $s^j_i$ is mapped (i.e. $M(s^j_i) = z_1$), and notice that $z_1 \ge lca_S(f^z_i, f_i)$. Since $s^j_i$ is not duplicated, the cluster of one of the children of $z_1$ contains $\zeta(t^j_i)$, while the other contains $\{f^z_i, f_i, f_{i+1}\}$, for each $1 \le z \le q$. But then, since there exists an $m^2_e \in \zeta(s^j_i)$, it follows that Property 2 applies to each internal vertex between $x_i$, $x_{i+1}$ (excluding $x_i$), with $v = x^z_i$, $v' = x^{z+1}_i$, $x = f^{z+1}_i$ and $l = m^2_e$, for all $1 \le z \le q$ (see Fig. 7b). Hence $d(F, S) \ge q = 6|V_G| + 3|E_G| + 1$ since each $x^j_i$ with $1 \le j \le q - 1$ and $x_{i+1}$ are duplicated.

Figure 8: Illustration of Lemma 2 when $lca_S(f_i, f^z_i) < lca_S(f_{i+1}, f^z_i)$. In case a) the vertices of $T_j$, $1 \le j \le 4$, between $w^j_i$ and $w^j_{i+1}$ are duplicated. In case b), due to the leaf label $m^2_e$ in $S$, vertex $w^j_{i+1}$ is not mapped in $z_1$, and all the vertices between $x_{i+1}$ and $x^{z+1}_i$ in $T_5$ are duplicated.

(Case 2) In this case there exists an $f^z_i \in L^i_f$ such that $lca_S(f_i, f^z_i) < lca_S(f_{i+1}, f^z_i)$. Property 2 applies to the vertex $w^j_i$, as $\zeta(w^j_i) = \zeta(s^j_i) \setminus \{f_i\}$, while $lca_T(f_i, f^z_i)$ contains $f_i$, but not $f_{i+1}$. Hence $w^j_i$ is duplicated (see Fig. 8a). Now, we have to consider the vertices between $w^j_i$ and $w^j_{i+1}$.

Consider the lowest vertex $s^j_i \in S^j_i$ not duplicated and denote by $t^j_i$ its child which is not on the spine of $T_j$, $1 \le j \le 4$. Let $z_1$ be the vertex of $S$ where $s^j_i$ is mapped (i.e. $M(s^j_i) = z_1$), and notice that $z_1 \ge lca_S(f^z_i, f_{i+1})$. Since $s^j_i$ is not duplicated, the cluster of one of the children of $z_1$ contains $\zeta(t^j_i)$, while the other contains $\{f^z_i, f_i, f_{i+1}\}$, for each $1 \le z \le q$. But then, since there exists an $m^2_e \in \zeta(s^j_i)$, it follows that Property 2 applies to each internal vertex between $x_i$, $x_{i+1}$ (excluding $x_i$), with $v = x^z_i$, $v' = x^{z+1}_i$, $x = f^{z+1}_i$ and $l = m^2_e$, for all $1 \le z \le q$ (see Fig. 8b). Hence $d(F, S) \ge q = 6|V_G| + 3|E_G| + 1$ since each $x^j_i$ with $1 \le j \le q - 1$ and $x_{i+1}$ are duplicated.

Since we have shown that for each pair of leaves $f_i$, $f_{i+1}$, either $d(F, S) \ge 6|V_G| + 3|E_G| + 1$, or all the internal nodes between $w^x_i$ (included) and $w^x_{i+1}$ (not included) are duplicated, the lemma follows.

While in the previous lemma we have focused on the duplications induced by vertices of the spine of the gene trees $T_1, \ldots, T_4$, in what follows we will focus on the duplications induced in the subtrees representing the vertices and edges (i.e. $T_v$, $T^1_e$, $T^2_e$).

Lemma 3. Let $S$ be a solution of Min-5-Dup over instance $F = \{T_1, \ldots, T_5\}$ and let $T^1_{e_x}$, $T^2_{e_x}$, $T_{v_i}$, $T_{v_j}$ be four subtrees of $T_1, \ldots, T_4$, s.t. $e_x = \{v_i, v_j\} \in E_G$ and $i < j$. Then (1) the root of at least one of $T^1_{e_x}$, $T^2_{e_x}$ is duplicated with respect to $S$; (2) the roots of at least two of $T^1_{e_x}$, $T^2_{e_x}$, $T_{v_i}$, $T_{v_j}$ are duplicated with respect to $S$.

Proof. (1) The proof follows from Property 1, since the roots of $T^1_{e_x}$, $T^2_{e_x}$ induce two different bipartitions of the set $\{m^2_x, l^1_x, l^2_x\}$.
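Part (1) can be checked exhaustively on the three labels involved, since only the topology of $S$ restricted to $\{m^2_x, l^1_x, l^2_x\}$ matters for the roots of the two edge trees. The Python sketch below enumerates the three rooted binary trees on these labels and confirms that each one duplicates the root of at least one of $T^1_{e_x}$, $T^2_{e_x}$. The nested-pair encoding, the label strings and the function names are assumptions of this illustration, which is a sanity check and not a substitute for the proof.

```python
def cluster(tree):
    """Set of leaf labels of a tree encoded as nested pairs."""
    if isinstance(tree, str):
        return frozenset([tree])
    return cluster(tree[0]) | cluster(tree[1])


def lca_image(labels, species):
    """Cluster of the lowest node of the species tree containing `labels`."""
    node = species
    while not isinstance(node, str):
        left, right = node
        if labels <= cluster(left):
            node = left
        elif labels <= cluster(right):
            node = right
        else:
            break
    return cluster(node)


def root_duplicated(gene, species):
    """True if the root of `gene` has the same image as one of its children."""
    image = lca_image(cluster(gene), species)
    return any(lca_image(cluster(child), species) == image for child in gene)


# Edge trees T^1_e = C(<l1, m2, l2>) and T^2_e = C(<l2, m2, l1>).
t1_e = ("l1", ("m2", "l2"))
t2_e = ("l2", ("m2", "l1"))

# The three rooted binary topologies on {m2, l1, l2}.
species_trees = [(("l1", "m2"), "l2"), (("l2", "m2"), "l1"), (("l1", "l2"), "m2")]

for s in species_trees:
    print(s, root_duplicated(t1_e, s), root_duplicated(t2_e, s))
# (('l1', 'm2'), 'l2') True False
# (('l2', 'm2'), 'l1') False True
# (('l1', 'l2'), 'm2') True True
```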

(2) Now, let us prove the second part of the lemma. We have shown that any species tree $S$ induces a duplication in the root of at least one of $T^1_{e_x}$, $T^2_{e_x}$. If $S$ induces a duplication in the roots of both $T^1_{e_x}$ and $T^2_{e_x}$, then the lemma holds. Hence assume that $S$ induces a duplication in exactly one of $T^1_{e_x}$, $T^2_{e_x}$, w.l.o.g. $T^1_{e_x}$. Thus, assume that $S$ does not induce a duplication in the root of $T^2_{e_x}$.

Let us define $L_x = \{m^1_x, m^2_x, l^1_x, l^2_x\}$ and consider $T_{v_i}|L_x$, $T_{v_j}|L_x$. The roots of $T_{v_i}|L_x$ and $T_{v_j}|L_x$ induce, by construction, the following bipartitions: $B(v_i) = (\{m^1_x, l^1_x\}; \{m^2_x\})$ and $B(v_j) = (\{m^1_x, l^2_x\}; \{m^2_x\})$.

Let $v$ be the vertex of $S$ which is the lowest common ancestor of $\{m^2_x, l^1_x, l^2_x\}$. Since we have assumed that the root of $T^2_{e_x}$ is not duplicated, it follows that the subtree rooted at $v$ restricted to $\{m^2_x, l^1_x, l^2_x\}$ must induce the bipartition $(\{m^2_x, l^1_x\}; \{l^2_x\})$ (as defined in $T^2_{e_x}$). Now, assume that both the root of $T_{v_i}$ and the root of $T_{v_j}$ are mapped to $v$ and consider where the leaf $m^1_x$ is possibly placed in the subtree $S(v)$. If $m^1_x$ is in the same set of the bipartition with $l^2_x$, then the root of $T_{v_i}$ is duplicated (as $B(v_i) = (\{m^1_x, l^1_x\}; \{m^2_x\})$, see Fig. 9a). If $m^1_x$ is in the same set of the bipartition with $l^1_x$ and $m^2_x$, then the root of $T_{v_j}$ is duplicated (as $B(v_j) = (\{m^1_x, l^2_x\}; \{m^2_x\})$, see Fig. 9b).

Assume now that the root of $T_{v_i}$ or the root of $T_{v_j}$, w.l.o.g. $\vartheta_{T_{v_i}}$, is not mapped to $v$. Then, $\vartheta_{T_{v_i}}$ is either mapped to a descendant $v'$ of $v$ on the path from $v$ to $lca_S(m^2_x, l^1_x)$ or to an ancestor $v''$ of $v$. If $\vartheta_{T_{v_i}}$ is mapped to $v'$, then $m^1_x$ is also a descendant of $v'$, leading to a duplication of the root of $T_{v_j}$ (see Fig. 10a). If $\vartheta_{T_{v_i}}$ is mapped to $v''$, then $v''$ induces the bipartition $(\{l^2_x, m^2_x, l^1_x\} \cup X; \{m^1_x\} \cup Y)$, for some sets $X, Y \subset \Lambda$, leading to a duplication of the root of $T_{v_j}$ (see Fig. 10b). A similar proof can be derived if $\vartheta_{T_{v_j}}$ is not mapped to $v$.

Figure 9: Illustration of the first part of Lemma 3. Case a) shows that, when the leaf labeled by $m^1_x$ is placed in the same set of the bipartition with $l^2_x$, the root of $T_{v_i}$ is duplicated. Case b) shows that, when the leaf labeled by $m^1_x$ is placed in the same set of the bipartition with $l^1_x$, the root of $T_{v_j}$ is duplicated.

Figure 10: Illustration of the second part of Lemma 3. Case a) shows that, when the root of $T_{v_i}$ is mapped to a descendant $v'$ of $v$, the root of $T_{v_j}$ is duplicated. Case b) shows that, when the root of $T_{v_i}$ is mapped to an ancestor $v''$ of $v$, the root of $T_{v_j}$ is duplicated.

Applying Lemma 2 and Lemma 3, we can prove the following fundamental result.

Lemma 4. Let $G = (V_G, E_G)$ be an instance of MVCC and let $F = \{T_1, \dots, T_5\}$ be the corresponding instance of Min-5-Dup. Then, starting from a cover $V'_G$ of $G$, we can compute in polynomial time a solution $S$ of Min-5-Dup for $F = \{T_1, \dots, T_5\}$ s.t. $d(F, S) \le 5|V_G| + 3|E_G| + |V'_G|$; starting from a solution $S$ of Min-5-Dup for $F$ s.t. $d(F, S) \le 5|V_G| + 3|E_G| + p$, we can compute in polynomial time a cover of $G$ of size at most $p$.

Proof. First, consider a cover $V'_G$ of $G = (V_G, E_G)$; we define a solution $S$ of Min-5-Dup of cost at most $5|V_G| + 3|E_G| + |V'_G|$ (see Fig. 11 for an example).

Define first $S'$ as a tree isomorphic to $T_5$. $S$ is obtained by inserting some subtrees in $S'$. More precisely, consider the subtree of $S'$ having as leaf set $M_1 \cup M_2$. Let $x$ be the root of this subtree, with children $x_l$, $x_r$. Define the following comb graphs $K_1$ and $K_2$ (the order of the leaves in the two comb graphs is induced by the order on the corresponding edges of the graph $G$, and if two leaves $l^1_x$, $l^2_x$ belong to the same comb graph, then $l^1_x < l^2_x$). Let $e_x = \{v_i, v_j\}$, $e_y = \{v_i, v_h\}$, $e_z = \{v_i, v_k\}$ be the three edges incident on $v_i$. Assume that $v_i$ is the $a$-th ($b$-th, $c$-th respectively) vertex in $\{v_i, v_j\}$ ($\{v_i, v_h\}$, $\{v_i, v_k\}$ respectively), where $a, b, c \in \{1, 2\}$. This means that by construction the subtree $T(v_i)$ contains the leaves $l^a_x$, $l^b_y$, $l^c_z$. Then $K_1$ is a comb graph on the set $L_1$ defined as follows:

$L_1 = \{l^a_x, l^b_y, l^c_z : v_i \in (V_G \setminus V'_G)\}$

The comb graph $K_2$ is on the set $L_2 = L \setminus L_1$.

The two comb graphs $K_1$ and $K_2$ are inserted in the edges $\{x, x_l\}$ and $\{x, x_r\}$ (respectively). Next, we will show the duplications induced by $S$ in the subtrees $T^1_{e_x}$, $T^2_{e_x}$ and $T_{v_i}$.

First, assume that $v_i \in (V_G \setminus V'_G)$. Then the corresponding subtree $T(v_i)$ does not contain duplications as it is isomorphic to $S|\Lambda(T(v_i))$.

Assume that $v_i \in V'_G$. Then by construction the subtree $T(v_i)|\{m^1_x, m^1_y, m^1_z\}$ ($T(v_i)|\{m^2_x, m^2_y, m^2_z\}$, $T(v_i)|\{l^a_x, l^b_y, l^c_z\}$ respectively) is isomorphic to the subtree $S|\{m^1_x, m^1_y, m^1_z\}$ ($S|\{m^2_x, m^2_y, m^2_z\}$, $S|\{l^a_x, l^b_y, l^c_z\}$ respectively). A duplication is induced in the root of $T(v_i)$ as $l^a_x, l^b_y, l^c_z \in L_2$ (hence

Now, consider the subtrees $T^1_{e_x}$, $T^2_{e_x}$ associated with the edge $e_x = \{v_i, v_j\}$. If both $v_i, v_j \in V'_G$, then all the leaves $l^1_x, l^2_x, m^2_x$ belong to $L_2$ (hence to $\Lambda(K_2)$). Since we assume that $l^1_x < l^2_x$ in the order of leaves of $K_2$, it follows that no duplication is induced in $T^1_{e_x}$, and a duplication is induced in the root of $T^2_{e_x}$. Assume that exactly one of $v_i, v_j$ belongs to $V'_G$ (w.l.o.g. $v_i \in V'_G$). Then the leaves $l^1_x, m^2_x$ belong to $L_2$ (hence to $\Lambda(K_2)$), while $l^2_x \in L_1$. It follows that no duplication is induced in $T^1_{e_x}$, and a duplication is induced in the root of $T^2_{e_x}$.

Hence duplications are induced in: (1) the root of exactly one of the subtrees $T^1_{e_x}$, $T^2_{e_x}$; (2) the root of each subtree $T_{v_i}$, where $v_i \in V'_G$. Since all the nodes on the spine of each $T_x$ are duplicated, it follows that $S$ induces $5|V_G| + 3|E_G| + |V'_G|$ duplications.

Figure 11: A solution $S$ for the instance of Fig. 5, where the vertex cover $V'_G = \{v_1, v_2, v_3\}$. The labels $l^2_3$, $l^2_5$, $l^2_6$ belong to the tree $T(v_4)$ (and notice that $v_4$ is the only vertex not in $V'_G$).

Now, consider a species tree $S$ inducing at most $5|V_G| + 3|E_G| + p$ duplications. By Lemma 2, we can assume that $S$ induces a duplication in the spine of each tree $T_x$, with $x \in \{1, \dots, 4\}$. Now, we compute a vertex cover $V'_G$ of $G$ of size at most $p$ as follows. For each subtree $T_{v_i}$ such that a duplication is induced in its root, add $v_i$ to $V'_G$. If, for some edge $\{v_i, v_j\}$, a duplication is not induced in the root of either of the subtrees $T_{v_i}$, $T_{v_j}$, then add one of $v_i$, $v_j$ to $V'_G$. By construction and by Lemma 3, for each edge $\{v_i, v_j\} \in E_G$, at least one of $v_i$, $v_j$ belongs to $V'_G$. Since all the nodes on the spine of each $T_x$ are duplicated (hence a total of $5|V_G| + 2|E_G|$ duplications), it follows that $|V'_G| = p' \le p$, hence the lemma holds.

Lemma 4 concludes the reduction.

Theorem 1. The Minimum Duplication problem is APX-hard, even when the input consists of five uniquely leaf-labelled gene trees.

Proof. First, notice that in a cubic graph $G = (V_G, E_G)$, $|E_G| = \frac{3}{2}|V_G|$, and a vertex cover $V'_G$ of $G$ has size at least $\frac{|V_G|}{4}$. Hence by Lemma 4 it follows that we have designed an L-reduction from MVCC to Min-5-Dup. Since MVCC is APX-hard [30], we can conclude from this L-reduction that Min-5-Dup is APX-hard.
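The constants of the L-reduction are not spelled out in the proof. One possible way to make them explicit is sketched below, assuming the standard two-parameter definition of an L-reduction; the names $opt(G)$, $opt(F)$ and the constants 39 and 1 are our notation and our arithmetic, not taken from the text.

```latex
% Notation (an assumption of this sketch): opt(G) is the size of a minimum
% vertex cover of G, opt(F) is the minimum of d(F,S) over all species trees S.
\begin{align*}
opt(F) &\le 5|V_G| + 3|E_G| + opt(G) && \text{(Lemma 4, first part)}\\
       &= 5|V_G| + \tfrac{9}{2}|V_G| + opt(G) && (|E_G| = \tfrac{3}{2}|V_G|)\\
       &\le 38\, opt(G) + opt(G) = 39\, opt(G) && (|V_G| \le 4\, opt(G))
\end{align*}
% For a solution S with d(F,S) = 5|V_G| + 3|E_G| + p, Lemma 4 (second part)
% yields a cover V'_G with |V'_G| \le p, hence
\begin{align*}
|V'_G| - opt(G) \;\le\; p - opt(G)
  \;=\; d(F,S) - \bigl(5|V_G| + 3|E_G| + opt(G)\bigr)
  \;\le\; d(F,S) - opt(F).
\end{align*}
```

Under these assumptions the two L-reduction inequalities hold with constants 39 and 1 respectively.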

4. A randomized approach

In this section, we investigate the complexity of the Minimum Duplication Bipartite problem and show that it can be solved efficiently by a randomized algorithm when the input gene trees have bounded depth. A randomized algorithm can be seen simply as an algorithm that is allowed to make some random decisions as it processes the input. Whereas defining a randomized algorithm is quite easy, analyzing its performance is more complicated. Indeed, first, one has to compute the probability of success of the randomized algorithm (i.e. the probability of ending up with an optimal solution). Then, one can amplify the probability of success simply by repeatedly running the algorithm, with independent random choices, and taking the best solution found. If one, moreover, proves that the overall running time required to get a high probability of success is polynomial in the size of the input, then it implies that the problem is randomized polynomial (in the class RP). For further details on randomized algorithms, the reader should consult the book of Kleinberg and Tardos [31].

In order to prove that the Minimum Duplication Bipartite problem is randomized polynomial, we first provide a randomized algorithm for a variant of the Minimum Cut problem, called Minimum Cut in Colored Graph. The Minimum Duplication Bipartite problem can be translated into a Minimum Cut in Colored Hypergraph problem that can be solved efficiently by applying our randomized algorithm on hypergraphs with bounded hyperedge degree. It is important to note that, as far as we know, this is the first attempt at solving the minimum cut in colored hypergraph problem by randomization. Providing a randomized algorithm for general hypergraphs with unbounded hyperedge degree is still open.

Let us first introduce the Minimum Cut in Colored Graph problem:

Minimum Cut in Colored Graph

• Input: A set of colors C, and an undirected colored graph G = (V, E) where any edge is colored with a color from C.

• Output: A colored cut of G – that is a partition of V into two non-empty sets A and B.

• Measure: The number of colors used by the edges having one end in A and the other in B.
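To make the measure concrete, the following minimal sketch evaluates a given colored cut. The edge representation as `(u, v, color)` triples and the function name `colored_cut_size` are illustrative choices, not part of the problem definition.

```python
from typing import Hashable, Iterable, Set, Tuple

# An edge is a triple (u, v, color); parallel edges are allowed.
Edge = Tuple[Hashable, Hashable, Hashable]


def colored_cut_size(edges: Iterable[Edge], side_a: Set[Hashable]) -> int:
    """Number of distinct colors on edges with exactly one end in side_a."""
    crossing_colors = {c for (u, v, c) in edges if (u in side_a) != (v in side_a)}
    return len(crossing_colors)


# A 4-cycle colored with two colors (hypothetical example).
example_edges = [(1, 3, "red"), (2, 4, "red"), (1, 2, "blue"), (3, 4, "blue")]

# Two edges cross the cut ({1, 2}, {3, 4}), but both are red: the measure is 1.
print(colored_cut_size(example_edges, {1, 2}))  # 1
# The cut ({1}, {2, 3, 4}) is crossed by one red and one blue edge.
print(colored_cut_size(example_edges, {1}))     # 2
```

The first cut illustrates the difference with the classical Minimum Cut problem: it has two crossing edges but uses a single color.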

For ease, let $col : E \to C$ be a function returning the color of a given edge and $mul(c) = |\{e : e \in E \text{ and } col(e) = c\}|$ be a function returning the multiplicity of a given color. Moreover, given a graph $G = (V, E)$, let $col(G) = \bigcup_{e \in E} col(e)$ denote the set of colors used in $G$. Let us now describe an algorithm inspired by the folklore Contraction Algorithm [31], a randomized algorithm for solving the classical Minimum Cut problem (i.e. minimizing the number of edges having one end in $A$ and the other in $B$) on uncolored graphs.

As in [31], our Colored Contraction Algorithm uses a connected multigraph $G = (V, E)$ (that is, an undirected graph that is allowed to have more than one edge between the same pair of vertices) which is moreover colored. The algorithm starts by choosing, uniformly at random, a color $c \in col(G)$ and contracting every edge $e \in E$ such that $col(e) = c$. Contracting an edge $\{u, v\} \in E$ produces a new graph $G' = (V', E')$ in which $u$ and $v$ are identified as a single new vertex $w$ whereas all other vertices keep their original identity (i.e. $V' = (V \cup \{w\}) \setminus \{u, v\}$). In $G'$, $E' = (E \cup \{\{w, v''\} : v' \in \{u, v\}, \{v', v''\} \in E\}) \setminus \{\{v', v''\} : v' \in \{u, v\}, v'' \in V\}$. Roughly, $E'$ is a copy of $E$ where any edge $\{u, v\}$ has been removed whereas any other edge has been preserved, but if one of its ends was equal to $u$ or $v$, then this end is updated to be equal to the new node $w$. Note that the contraction operation may end up in a multigraph even when starting from a classical graph $G$. In this process, contracting all the edges that have the selected color $c$ roughly corresponds to a sequence of $mul(c)$ contractions, each reducing the number of vertices by one. The Colored Contraction Algorithm then continues recursively on $G'$, by choosing, uniformly at random, a color $c \in col(G')$ and contracting every edge $e \in E'$ such that $col(e) = c$. As these recursive calls proceed, the vertices of $V'$ should be viewed as supervertices: each supervertex $w$ corresponds to the subset $S(w) \subseteq V$ that has been "swallowed up" in the contractions that produced $w$. The algorithm ends when it reaches a graph $G'$ with only two supervertices $v_A$ and $v_B$. We output $(A = S(v_A), B = S(v_B))$ as the colored-cut found by the algorithm.
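The procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: supervertices are tracked with a union-find structure instead of explicitly rebuilding $G'$, edges are `(u, v, color)` triples, and all function names are hypothetical. One point is an assumption of the sketch: when contracting the chosen color leaves fewer than two supervertices, the run is reported as failed (`None`), a case the description does not address.

```python
import random
from typing import Dict, Hashable, List, Optional, Set, Tuple

Vertex = Hashable
Edge = Tuple[Vertex, Vertex, str]  # (u, v, color); parallel edges are allowed


def _find(parent: Dict[Vertex, Vertex], x: Vertex) -> Vertex:
    """Return the supervertex currently containing x (with path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def colored_contraction(vertices: List[Vertex], edges: List[Edge],
                        rng: random.Random) -> Optional[Tuple[Set[Vertex], Set[Vertex]]]:
    """One run: contract random colors until two supervertices remain.

    Returns (A, B), or None when the run cannot end with exactly two
    supervertices (an assumption of this sketch).
    """
    parent = {v: v for v in vertices}
    remaining = len(parent)
    while remaining > 2:
        # col(G'): colors of edges joining two distinct supervertices.
        live = sorted({c for u, v, c in edges
                       if _find(parent, u) != _find(parent, v)})
        if not live:
            return None  # the input graph was not connected
        chosen = rng.choice(live)
        for u, v, c in edges:
            if c == chosen:
                ru, rv = _find(parent, u), _find(parent, v)
                if ru != rv:          # skip edges that became self-loops
                    parent[ru] = rv   # identify the two endpoints
                    remaining -= 1
    if remaining != 2:
        return None  # the last color merged everything into one supervertex
    groups: Dict[Vertex, Set[Vertex]] = {}
    for v in vertices:
        groups.setdefault(_find(parent, v), set()).add(v)
    side_a, side_b = groups.values()
    return side_a, side_b


def cut_colors(edges: List[Edge], side_a: Set[Vertex]) -> int:
    """Measure of a colored cut: number of distinct colors on crossing edges."""
    return len({c for u, v, c in edges if (u in side_a) != (v in side_a)})


def best_of_runs(vertices: List[Vertex], edges: List[Edge], runs: int,
                 seed: int = 0) -> Optional[Tuple[Set[Vertex], Set[Vertex]]]:
    """Repeat independent runs and keep the cut using the fewest colors."""
    rng = random.Random(seed)
    best = None
    for _ in range(runs):
        cut = colored_contraction(vertices, edges, rng)
        if cut is not None and (best is None or
                                cut_colors(edges, cut[0]) < cut_colors(edges, best[0])):
            best = cut
    return best


# Hypothetical usage on a path a - b - c with two colors.
path_vertices = ["a", "b", "c"]
path_edges = [("a", "b", "red"), ("b", "c", "blue")]
found = best_of_runs(path_vertices, path_edges, runs=5)
print(found, cut_colors(path_edges, found[0]))
```

The union-find representation keeps each supervertex $w$ implicitly as the set $S(w)$ of original vertices sharing a root, so the output sides $A$ and $B$ can be read off directly at the end.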

Let us now analyze the performance of the Colored Contraction Algorithm, which cannot be derived directly from that of the original Contraction Algorithm. Since the algorithm makes random choices, there is some probability that it will succeed in finding a minimum colored-cut (and some probability that it will not). In order to prove that this algorithm is worthwhile, we will prove that the probability of success is only polynomially small, implying that, by running the algorithm a polynomial number of times and returning the best colored-cut found in any run, one is able to produce an optimal colored-cut with high probability.

Theorem 2. The Colored Contraction Algorithm returns an optimal colored-cut of $G$ with probability at least $(|V|^{2k})^{-1}$, where $k = \max_{c \in C} mul(c)$.

Proof. Let us assume that the optimal minimum colored-cut $(A, B)$ of $G$ is of size $opt$; that is, the set of edges having one end in $A$ and the other end in $B$ (referred to afterwards as the cut-set) is colored using $opt$ colors of $C$. Note that unlike the classical minimum cut problem, the goal here is to minimize the number of colors in the cut-set itself. Moreover, let $G_{opt} = G[A \cup B, \{(u, v) : (u, v) \in E \text{ and } u \in A, v \in B\}]$ be the bipartite graph representing the cut-set of $(A, B)$. In order to compute a lower bound on the probability that the Colored Contraction Algorithm returns the minimum colored-cut $(A, B)$, we first notice some important properties. First, remark that any vertex $v \in V$ cannot have a degree less than $opt$. Indeed, otherwise, $(\{v\}, V \setminus \{v\})$ would correspond to a colored-cut inducing at most $opt - 1$ colors, contradicting our hypothesis that $(A, B)$ is an optimal minimum colored-cut of $G$. Therefore, any vertex of $G$ is of degree at least $opt$, inducing the following lower bound on $|E|$: $|E| \ge \frac{opt \cdot |V|}{2}$. We know moreover that, since each color of $C$ can be used at most $k = \max_{c \in C} mul(c)$ times in $E$, we have that $|E| \le k \cdot |C|$. This leads to the following inequalities.

$$|V| \cdot opt \le 2 \cdot |E| \le 2k \cdot |C| \tag{1}$$

Let us now evaluate the probability $Pr[F_j]$ that the Colored Contraction Algorithm fails at the $j$th step of the recursion (that is, when $j - 1$ contractions have already been done). Considering what could go wrong in the $j$th step of the Colored Contraction Algorithm, one can check that the only issue is that the uniformly random choice of a color $c$ selects one of the $opt$ colors used by the cut-set. This color would then be contracted, so that the algorithm would not be able to find the optimal colored-cut $(A, B)$, since at least a node of $A$ and a node of $B$ would be contracted into the same supervertex. Hence the probability that an edge of the current graph $G'$ is both in the optimal cut-set and contracted is at most $\frac{opt}{|C'|}$, since there are at most $opt$ colors to be chosen among $|C'|$ colors, where $C' = col(G')$. According to Inequality 1, considering that the graph at the $j$th step is $G' = (V', E')$ and $C' = col(G')$:

$$Pr[F_j] \le \frac{opt}{|C'|} \le \frac{2k \cdot |C'|}{|V'| \cdot |C'|} = \frac{2k}{|V'|} \tag{2}$$

The colored-cut $(A, B)$ will actually be returned by the algorithm if no edge of the cut-set is contracted in any of the at most $|V| - 2$ iterations. If we write $S_j$ for the event that an edge of the cut-set has not been contracted until the $j$th step, then, according to Inequality 2, $Pr[S_j] \ge 1 - Pr[F_j] \ge 1 - \frac{2k}{|V'|}$, where the graph at the $j$th step is $G' = (V', E')$. For ease, let us consider the sequence of color choices as being $Sc = (c_1, c_2, \dots)$ and $\lambda_j = \sum_{i < j \text{ and } c_i \in Sc} mul(c_i)$. On the whole, the probability that the Colored Contraction Algorithm returns the optimal colored-cut $(A, B)$ is thus at least

$$Pr[Success] \ge \prod_{i=0}^{\lambda_1 - 1} \left(1 - \frac{2k}{|V| - i}\right) \cdot \prod_{i=\lambda_2}^{\lambda_3 - 1} \left(1 - \frac{2k}{|V| - i}\right) \cdots \prod_{i=\lambda_{|Sc|-1}}^{\lambda_{|Sc|} - 1} \left(1 - \frac{2k}{|V| - i}\right) \tag{3}$$

$$\ge \prod_{i=0}^{\lambda_1 - 1} \left(\frac{|V| - i - 2k}{|V| - i}\right) \cdot \prod_{i=\lambda_2}^{\lambda_3 - 1} \left(\frac{|V| - i - 2k}{|V| - i}\right) \cdots \prod_{i=\lambda_{|Sc|-1}}^{\lambda_{|Sc|} - 1} \left(\frac{|V| - i - 2k}{|V| - i}\right) \tag{4}$$

$$\ge \prod_{i=0}^{\lambda_{|Sc|} - 1} \left(\frac{|V| - i - 2k}{|V| - i}\right) = \frac{\cancel{|V| - 2k}}{|V|} \cdots \frac{|V| - 2k - 2k}{\cancel{|V| - 2k}} \cdots \frac{|V| - (\lambda_{|Sc|} - 1) - 2k}{\cancel{|V| - (\lambda_{|Sc|} - 1)}} \tag{5}$$

$$\ge \frac{\prod_{i=2k}^{\lambda_{|Sc|} - 1} (|V| - i - 2k)}{\prod_{i=0}^{2k-1} (|V| - i)} \ge \frac{1}{|V|^{2k}} = (|V|^{2k})^{-1} \tag{6}$$

Then according to Theorem 2, we know that a single run of the Colored Contraction Algorithm fails to find an optimal colored-cut with probability at most $1 - (|V|^{2k})^{-1}$. One can then amplify the probability of success simply by repeatedly running the algorithm, with independent random choices, and taking the best colored-cut found. It is known that the function $(1 - n^{-1})^n$ converges monotonically from $\frac{1}{4}$ up to $\frac{1}{e}$ as $n$ increases from 2 [31]. Thus, if we run the algorithm $|V|^{2k}$ times, then the probability that we fail to find an optimal colored-cut in any run is at most $(1 - (|V|^{2k})^{-1})^{|V|^{2k}} \le \frac{1}{e}$. As is usually done, the failure probability can be reduced further with more repetitions: running the algorithm $|V|^{2k} \ln |V|$ times induces a probability of failure of at most $e^{-\ln |V|} = \frac{1}{|V|}$. Overall, the running time required to get a high probability of success is polynomial in $|V|$ if $k$ is bounded, since each run of the Colored Contraction Algorithm takes polynomial time, and we run it a polynomial number of times.
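The amplification argument can be checked numerically. The short sketch below computes the number of repetitions and the resulting upper bound on the failure probability, taking the per-run success probability to be exactly the lower bound $(|V|^{2k})^{-1}$ of Theorem 2; the function names are ours and the numbers are only an illustration.

```python
import math


def runs_for_high_probability(n_vertices: int, k: int) -> int:
    """Number of repetitions |V|^(2k) * ln|V|, rounded up."""
    return math.ceil(n_vertices ** (2 * k) * math.log(n_vertices))


def failure_bound(n_vertices: int, k: int, runs: int) -> float:
    """Upper bound on the probability that all `runs` independent runs fail,
    assuming each run succeeds with probability exactly 1 / |V|^(2k)."""
    p_success = 1.0 / n_vertices ** (2 * k)
    return (1.0 - p_success) ** runs


# Hypothetical instance: |V| = 4 and every color used at most once (k = 1).
n, k = 4, 1
print(failure_bound(n, k, n ** (2 * k)) <= 1 / math.e)  # True: at most 1/e
more_runs = runs_for_high_probability(n, k)
print(more_runs)                                         # 23
print(failure_bound(n, k, more_runs) <= 1 / n)           # True: at most 1/|V|
```

The number of repetitions grows as $|V|^{2k}$, which is why the approach is only polynomial when $k$ is bounded.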

Let us now demonstrate how this result can be used in order to solve the Minimum Duplication Bipartite problem.

Theorem 3. The Minimum Duplication Bipartite problem is randomized polynomial time solvable when the gene trees are of bounded depth.

Proof. Recall that, given a binary tree $T = (V, E)$ and a vertex $v \in V$, $v^L$ (resp. $v^R$) denotes the left (resp. right) child of $v$ and $\zeta(v)$ denotes the cluster of $v$, i.e. the set of all leaves belonging to the subtree rooted in $v$. Moreover, for ease, $\vartheta_T$ denotes the root of the tree $T$. Given a gene tree forest $F = \{T_1 = (V_1, E_1), T_2 = (V_2, E_2), \dots\}$ built on $\Lambda$, considering the definition of the Minimum Duplication Bipartite problem, one wants to define a bipartition $(\Lambda_1, \Lambda_2)$ of $\Lambda = \bigcup_{T_i \in F} V_i$ inducing the minimum number of duplications, where a vertex $v$ of a gene tree is a duplication if $\exists v' \in \{v^L, v^R\}$ such that $(\Lambda_1 \cap \zeta(v') \ne \emptyset) \wedge (\Lambda_2 \cap \zeta(v') \ne \emptyset)$ is true. In other words, $v$ is a duplication if for one of its children, say $v'$, $\zeta(v')$ contains two leaves not belonging to the same part of the bipartition $(\Lambda_1, \Lambda_2)$.

Figure 12: Illustration of the construction of $G_F$ and $G'$ given $F = (T_1, T_2)$. Considering the minimum colored-cut $\{1, 2, 3, 4, 5\}, \{6, 7, 8, 9\}$ of size 1, the only induced duplication is represented as a star on $T_1$.
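The duplication criterion used in this proof can be illustrated with a small sketch that counts the duplications induced by a given bipartition on one gene tree. The encoding of a binary tree as nested pairs with integer leaf labels, and the function name, are assumptions made for the example.

```python
from typing import Set, Tuple, Union

# A tree is either a leaf label or a pair (left subtree, right subtree).
Tree = Union[int, Tuple["Tree", "Tree"]]


def count_duplications(tree: Tree, part1: Set[int], part2: Set[int]) -> int:
    """Count vertices v having a child whose cluster meets both parts."""

    def mixed(cluster: frozenset) -> bool:
        return bool(cluster & part1) and bool(cluster & part2)

    def visit(node):
        # Returns (cluster of node, number of duplications below and at node).
        if not isinstance(node, tuple):
            return frozenset([node]), 0
        (left_cluster, left_dups), (right_cluster, right_dups) = visit(node[0]), visit(node[1])
        is_dup = 1 if mixed(left_cluster) or mixed(right_cluster) else 0
        return left_cluster | right_cluster, left_dups + right_dups + is_dup

    return visit(tree)[1]


# Hypothetical gene tree on the leaves 1..5.
gene_tree = (((1, 2), 3), (4, 5))
print(count_duplications(gene_tree, {1, 2, 3}, {4, 5}))  # 0
print(count_duplications(gene_tree, {1, 2}, {3, 4, 5}))  # 1 (the root)
print(count_duplications(gene_tree, {1, 4}, {2, 3, 5}))  # 2
```

In the first call the bipartition agrees with the split at the root, so no vertex has a child whose cluster contains leaves from both parts.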

Given $F$ and a set of colors $C$, we define the following colored hypergraph $G_F = (V, E)$ associated to $F$. Let $V = \Lambda = \bigcup_{T \in F} \zeta(\vartheta_T)$ and, for any node $v_k$ of the tree $T_i$, there are two hyperedges $\alpha^i_k = \{\zeta(v^L_k) : |\zeta(v^L_k)| \ge 2\}$ and $\beta^i_k = \{\zeta(v^R_k) : |\zeta(v^R_k)| \ge 2\}$, colored with color $col(\alpha^i_k) = col(\beta^i_k) = c^i_k \in C$, in $E$. An illustration of such a construction is provided in Fig. 12. Then in $G_F$, a colored-cut of size $k'$ corresponds to a bipartition of the set $\Lambda$ inducing $k'$ duplications. Indeed, if the hyperedge $\alpha^i_k$ (resp. $\beta^i_k$) belongs to the cut-set, then it induces a duplication for the corresponding vertex $v_k$ in $T_i$, since there exist at least two leaves in $\zeta(v^L_k)$ (resp. $\zeta(v^R_k)$) not belonging to the same part of the bipartition $(\Lambda_1, \Lambda_2)$.

Thus, if one can find a minimum colored-cut in such hypergraphs, then one would be able to solve in polynomial time the Minimum Duplication Bipartite problem. Just consider the Colored Contraction Algorithm presented previously in this section. From any colored hypergraph $G_F = (V, E)$, one may build a colored graph $G' = (V, E')$ where any hyperedge $e = \{v_{i_1}, v_{i_2}, \dots, v_{i_k}\}$ colored with color $c = col(e)$ has been replaced by a path $v_{i_1}, v_{i_2}, \dots, v_{i_k}$ colored with $c$ in $E'$ (i.e. $E' = \{\{v_{i_k}, v_{i_{k+1}}\} : v_{i_k} \in e, e \in E\}$). Notice that an edge $e \in E'$ colored with $c$ is cut if and only if a hyperedge colored $c$ of $G_F$ is cut. Once this colored graph has been obtained, one may apply the Colored Contraction Algorithm, which will produce a minimum colored-cut of $G'$ which also induces a minimum colored-cut in $G_F$. Since this algorithm has a complexity exponential in the maximum multiplicity of any color of the considered graph, when the size of each hyperedge is bounded, so is the multiplicity of any color, since the maximal size of a hyperedge corresponds to the maximal depth of the input gene trees. This leads to a randomized polynomial solution for the Minimum Duplication Bipartite problem.
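The two construction steps of this proof (one pair of hyperedges per internal node, then one colored path per hyperedge) can be sketched together. The nested-pair tree encoding, the color naming scheme `T<i>/<path>` and the function names are hypothetical choices made for illustration; leaf labels are assumed to be integers so that the path order within a hyperedge is well defined.

```python
from typing import List, Tuple, Union

Tree = Union[int, Tuple["Tree", "Tree"]]  # a leaf label or (left, right)
Edge = Tuple[int, int, str]               # (u, v, color)


def leaves(node: Tree) -> List[int]:
    """Cluster of a node: the leaf labels of the subtree rooted at it."""
    if not isinstance(node, tuple):
        return [node]
    return leaves(node[0]) + leaves(node[1])


def colored_path_edges(tree: Tree, tree_id: int) -> List[Edge]:
    """Edges of G' contributed by one gene tree.

    For every internal node, the clusters of its two children give two
    hyperedges sharing one color; each hyperedge is replaced by a path.
    """
    edges: List[Edge] = []

    def visit(node: Tree, path: str) -> None:
        if not isinstance(node, tuple):
            return
        color = f"T{tree_id}/{path or 'root'}"  # one color per internal node
        for child in node:
            members = sorted(leaves(child))
            # A cluster with a single leaf yields no pair, hence no edge,
            # which matches the requirement that hyperedges have size >= 2.
            edges.extend((a, b, color) for a, b in zip(members, members[1:]))
        visit(node[0], path + "L")
        visit(node[1], path + "R")

    visit(tree, "")
    return edges


# Hypothetical caterpillar gene tree on the leaves 1..4.
print(colored_path_edges((((1, 2), 3), 4), tree_id=1))
# [(1, 2, 'T1/root'), (2, 3, 'T1/root'), (1, 2, 'T1/L')]
```

The resulting list of `(u, v, color)` triples is a colored multigraph of the kind expected as input by a colored contraction procedure.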

5. Conclusion

In this paper we have investigated the complexity of two variants of the Minimum Duplication problem. We have proved that the Minimum Duplication problem is APX-hard, even when the input consists of five uniquely leaf-labelled gene trees. Then, we have shown that the Minimum Duplication Bipartite problem can be solved efficiently by a randomized algorithm when the input gene trees have bounded depth.

A natural open problem is the complexity of the Minimum Duplication Bipartite problem when the gene trees have unbounded depth. Furthermore, it would be interesting to deepen the analysis of the complexity of the Minimum Duplication problem when the input consists of fewer than five uniquely leaf-labelled gene trees.

Acknowledgements

The authors acknowledge partial funding from ANR project BIRDS JCJC SIMI 2-2010, and also would like to thank the anonymous reviewers for valuable arguments and remarks. Paola Bonizzoni and Riccardo Dondi have been supported by the PRIN 2010/11 grant “Automi e Linguaggi Formali: Aspetti Matematici e Applicativi”, code H41J12000190001.

References

[1] G. Blin, P. Bonizzoni, R. Dondi, R. Rizzi, F. Sikora, Complexity insights of the minimum duplication problem, in: M. Bieliková, G. Friedrich, G. Gottlob, S. Katzenbeisser, G. Turán (Eds.), SOFSEM, Vol. 7147 of Lecture Notes in Computer Science, Springer, 2012, pp. 153–164.

[2] J. Felsenstein, Phylogenies from molecular sequences: Inference and reliability, Ann. Review Genet. 22 (1988) 521–565.

[3] E. E. Eichler, D. Sankoff, Structural dynamics of eukaryotic chromosome evolution, Science 301 (5634) (2003) 521–565.

[4] W. M. Fitch, Homology—a personal view on some of the problems, Trends Genet. 16 (2000) 227–231.

[5] L. Arvestad, J. Lagergren, B. Sennblad, The gene evolution model and computing its associated probabilities, J. ACM 56 (2).

[6] L. Arvestad, A.-C. Berglung, J. Lagergren, B. Sennblad, Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence evolution, in: D. Gusfield (Ed.), RECOMB 2004, ACM, New York, 2004, pp. 326–335.

[7] P. Bonizzoni, G. Della Vedova, R. Dondi, Reconciling a gene tree to a species tree under the duplication cost model., Theoretical Computer Science 347 (2005) 36–53.

[8] W. Chang, O. Eulenstein, Reconciling gene trees with apparent polytomies, in: D. Chen, D. T. Lee (Eds.), COCOON 2006, Vol. 4112 of LNCS, Heidelberg, 2006, pp. 235–244.

[9] C. Chauve, N. El-Mabrouk, New Perspectives on Gene Family Evolution: Losses in Reconciliation and a Link with Supertrees, in: S. Batzoglou (Ed.), RECOMB, Vol. 5541 of LNCS, Springer, 2009, pp. 46–58.

[10] J. Cotton, R. Page, Rates and patterns of gene duplication and loss in the human genome, Proceedings of the Royal Society of London. Series B 272 (2005) 277–283.

[11] D. Durand, B. Haldórsson, B. Vernot, A hybrid micro-macroevolutionary approach to gene tree reconstruction, Journal of Computational Biology 13 (2006) 320–335.

[12] B. Ma, M. Li, L. Zhang, From Gene Trees to Species Trees, SIAM J. Comput. 30 (3) (2000) 729–752.

[13] R. Page, Genetree: comparing gene and species phylogenies using reconciled trees, Bioinformatics 14 (1998) 819–820.

[14] R. Page, J. Cotton, Vertebrate phylogenomics: reconciled trees and gene duplications, in: Pacific Symposium on Biocomputing, 2002, pp. 536–547.

[15] J.-P. Doyon, C. Scornavacca, K. Gorbunov, G. Szolloso, V. Ranwez, V. Berry, An effi. algo. for gene/species trees parsim. reconc. with losses, dup. and transf., J. Comp. Biol. 6398 (2010) 93- 108.

[16] M. Hallett, J. Lagergren, A. Tofigh, Simultaneous identification of duplications and lateral transfers, in: RECOMB, ACM, 2004, pp. 347–356.

[17] A. Tofigh, M. Hallett, J. Lagergren, Simultaneous identification of duplications and lateral gene transfers, IEEE/ACM Trans. Comput. Biol. Bioinform. 8 (2011) 517–535.

[18] M. S. Bansal, E. J. Alm, M. Kellis, Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer and loss, Bioinformatics 28 (12) (2012) 283–291.

[19] M. S. Bansal, E. J. Alm, M. Kellis, Reconciliation revisited: Handling multiple optima when reconciling with duplication, transfer, and loss, Journal of Computational Biology 20 (10) (2013) 738–754.

[20] M. T. Hallett, J. Lagergren, New algorithms for the duplication-loss model, in: RECOMB, ACM, 2000, pp. 138–146.

[21] U. Stege, Gene Trees and Species Trees: The Gene-Duplication Problem is Fixed-Parameter Tractable, in: F. K. H. A. Dehne, A. Gupta, J.-R. Sack, R. Tamassia (Eds.), 6th International Workshop on Algorithms and Data Structures (WADS’99), Vol. 1663 of LNCS, Springer, 1999, pp. 288–293.

[22] M. S. Bansal, R. Shamir, A Note on the Fixed Parameter Tractability of the Gene-Duplication Problem, IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 8 (3) (2011) 848–850.

[23] J. Byrka, S. Guillemot, J. Jansson, New results on optimizing rooted triplets consistency, Discrete Appl Math 158 (11) (2010) 1136–1147.

[24] A. Chester, R. Dondi, A. Wirth, Resolving rooted triplet inconsistency by dissolving multigraphs, in: T.-H. H. Chan, L. C. Lau, L. Trevisan (Eds.), TAMC, Vol. 7876 of Lecture Notes in Computer Science, Springer, 2013, pp. 260–271.

[25] M. S. Bansal, J. G. Burleigh, O. Eulenstein, A. Wehe, Heuristics for the Gene-Duplication Problem: A Θ(n) -Speed-Up for the Local Search, in: T. P. Speed, H. Huang (Eds.), RECOMB, Vol. 4453 of LNCS, Springer, 2007, pp. 238–252.

[26] M. S. Bansal, O. Eulenstein, A. Wehe, The Gene-Duplication Problem: Near-Linear Time Algorithms for NNI-Based Local Searches, IEEE/ACM Trans. Comput. Biology Bioinform. 6 (2) (2009) 221–231.

[27] W.-C. Chang, J. G. B. and David F Fernández-Baca, O. Eulenstein, An ILP solution for the gene duplication problem, BMC Bioinformatics (Suppl 1):S14 (12).

[28] A. Ouangraoua, K. M. Swenson, C. Chauve, An Approximation Algorithm for Computing a Parsimonious First Speciation in the Gene Duplication Model, in: E. Tannier (Ed.), RECOMB-CG, Vol. 6398 of LNCS, Springer, Ottawa, Canada, 2010, pp. 290–301.

[29] D. J. A. Welsh, M. B. Powell, An upper bound for the chromatic number of a graph and its application to timetabling problems, The Computer Journal 10 (1) (1967) 85–86.

[30] P. Alimonti, V. Kann, Some APX-completeness results for cubic graphs, Theoretical Comput. Sci. 237 (1–2) (2000) 123–134.