Compound Poisson Approximation and Testing for Gene Clusters with Multigene Families


HAL Id: hal-00636411

https://hal.archives-ouvertes.fr/hal-00636411

Submitted on 27 Oct 2011


Simona Grusea, Etienne Pardoux, Olivier Chabrol, Pierre Pontarotti

To cite this version:

Simona Grusea, Etienne Pardoux, Olivier Chabrol, Pierre Pontarotti. Compound Poisson Approximation and Testing for Gene Clusters with Multigene Families. Journal of Computational Biology, Mary Ann Liebert, 2011, 18 (4), pp. 579-594. doi:10.1089/cmb.2010.0043. hal-00636411.


S. GRUSEA,¹,² E. PARDOUX,¹ O. CHABROL,¹ and P. PONTAROTTI¹

ABSTRACT

We present in this article a compound Poisson approximation for computing probabilities involved in significance tests for conserved genomic regions between different species. We consider the case when the conserved genomic regions are found by the reference region approach. An important aspect of our computations is the fact that we are taking into account the existence of multigene families. We obtain convergence results for the error of our approximation by using the Stein-Chen method for compound Poisson approximation.

Key words: compound Poisson approximation, multigene families, reference-region approach, significance test for gene clusters, Stein-Chen method.

1. INTRODUCTION

Orthologous genes are two genes, in two different species, that descend from the same gene in the ancestor of the two species, as the result of a speciation event. We call conserved genomic region or gene cluster two chromosomal regions, in two different species, that have in common a certain number of orthologous genes, not necessarily adjacent or in the same order in the two genomes. We do not impose any restriction on the gap length between consecutive orthologs. In the literature, various definitions for gene clusters exist (Bergeron et al., 2002; Danchin et al., 2004; Durand et al., 2003; Hoberman and Durand, 2005; Hoberman et al., 2005; Raghupathy et al., 2005, 2009). We have chosen here a very unrestrictive definition, in order to be able to detect evolutionary signals even between very distant species. The conserved genomic regions can represent signs of evolutionary relatedness between species or of functional selective pressures acting on certain groups of genes. But for this to be the case, the conserved genomic regions have to be significant, i.e., very unlikely to have appeared by chance.

Over evolutionary time, the gene order in a genome can be affected by various genome rearrangement events such as inversions, translocations, transpositions, chromosomal fissions, and fusions. Therefore, in the absence of constraints due to functional selective pressures, the gene order is randomized during evolution. This is one reason why, in general, the null hypothesis taken in the significance tests for gene clusters is the hypothesis of random gene order.

There exist different approaches when searching for gene clusters (Durand et al., 2003). In this article, we focus on the case when the gene clusters are found by the "reference region" approach, which consists in starting with a fixed genomic region in a certain species A (called the reference region) and searching for significant orthologous gene clusters in the genome of another species B.

¹LATP–UMR CNRS 6632, Équipe Évolution Biologique et Modélisation, Université de Provence, Marseille, France.

²Institut de Mathématiques de Toulouse, INSA de Toulouse, Université de Toulouse, Toulouse, France.

In general, the orthology relation between the genes of two species is not one-to-one. For a given gene in one species, we may find more than one orthologous gene in another species, as the result of duplication events that happened after the separation of the two species. The genes in one species which are orthologous to the same gene in another species are called co-orthologs of this gene and form what we call a multigene family. The existence of multigene families is an important fact which needs to be considered when testing for gene cluster significance, but very few of the existing statistical tests consider it. Danchin et al. (2004) propose to weight the orthologs in inverse proportion to the sizes of the multigene families, but their use of a binomial distribution is not adequate in this setting. Raghupathy et al. (2005, 2009) also take into account the existence of multigene families, but their test is suitable only for clusters found by the window-sampling approach, and not by the reference region approach, as in our case.

In this article, we adopt the idea of Danchin et al. (2004) for taking into account the multigene families and propose a compound Poisson approximation for computing the probabilities, under the null hypothesis, of different gene clusters.

The article is organized as follows. In Section 2, we present the mathematical framework. We explain the simplified mathematical model that we use and we describe the way in which we take into account the existence in the genome B of multiple co-orthologs for the genes in the reference region. We give the mathematical formulation of the problem and we start by considering, for technical convenience, the case of a circular genome. In Subsection 2.3, we give a very short presentation of the coupling approach to the Stein-Chen method for compound Poisson approximation. We present in Theorem 1 a result of Roos (1993a), which bounds the error of the approximation under the existence of a certain coupling. Section 3 is the core of the article, containing the compound Poisson approximations for our probability of interest, together with the convergence results that we have obtained using the Stein-Chen method. Following the approach of Roos (1993a,b), we construct explicitly the coupling needed in Theorem 1, and we estimate the terms appearing in the error bound in Theorem 1. Theorem 7 states the obtained convergence result. In Subsection 3.2, we describe a "Markovian" approximation for computing, in practice, the parameters of the compound Poisson distribution. In Subsection 3.3, we extend the results to the case of a linear genome. In Section 4, we present some numerical results, both in the circular and in the linear case, for a set of selected values of the parameters which are interesting in our biological framework. We also discuss the biological implications of our results. Section 5 presents three applications of our results to real biological data: the first is a comparison between the human genome and the genome of Oryzias latipes, the second is a comparison between the human genome and the genome of Ciona intestinalis, and the third is a comparison between the human genome and the genome of Danio rerio.

2. MATHEMATICAL FRAMEWORK

2.1. Mathematical formulation of the problem

We model the genome as an ordered set of genes, the length of a genomic region being measured in number of genes. We ignore the separation into chromosomes and the physical distances between genes.

The data at our disposal are: the number $m$ of genes in the reference region of the genome A which have at least one ortholog in the genome B; for each of those genes $i = 1, \dots, m$, the number of orthologs it has in B, which we denote by $\phi_i$; the positions in B of these orthologs; and the total number $N$ of genes in the genome B.

We make the (natural) assumption that there exists a maximal size $\phi_{\max}$ for the multigene families.

Based on the fact that we are in the case $m \ll N$, we make a further approximation and consider the genome B as the continuous interval $[0, 1]$, in which the "new" positions of the orthologs are obtained by dividing their real positions in the genome by $N$.

We will use a pure significance test, with the null hypothesis $H_0$: random gene order in the genome B.

All the probabilities and distributions appearing throughout the paper are implicitly considered under the null hypothesis $H_0$.

For $i = 1, \dots, m$, we let $U_{ij}$, $j = 1, \dots, \phi_i$, represent the positions in B of the orthologs of the gene $i$ from the reference region. Under $H_0$, the r.v.'s $U_{ij}$, $j = 1, \dots, \phi_i$, $i = 1, \dots, m$, are i.i.d. uniformly distributed on $[0, 1]$.

Let $n := \phi_1 + \cdots + \phi_m$ denote the total number of genes in B which are orthologous to genes in the reference region in A. We are interested only in these $n$ genes. We want to test whether they cluster together in a significant way, i.e., in a way which is very improbable by chance, under the null hypothesis.

To take into account the existence in B of multiple orthologs for the genes in the reference region, we consider the following counting measure:

$$\mu_m := \sum_{i=1}^{m} \frac{1}{\phi_i} \sum_{j=1}^{\phi_i} \delta_{U_{ij}}.$$

For an ortholog belonging to a multigene family of size $\phi_i$, we call $\phi_i^{-1}$ its label. For an interval $I \subset [0, 1]$, we will refer to $\mu_m(I)$ as its weight.
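The weight of an interval can be illustrated with a minimal Python sketch. The function name `interval_weight` and the toy positions are hypothetical and only serve the illustration; each family is assumed to contain at least one ortholog, as in the text.

```python
from typing import Sequence

def interval_weight(positions: Sequence[Sequence[float]], start: float, end: float) -> float:
    """Weight mu_m([start, end]) of a sub-interval of [0, 1].

    positions[i] lists the positions U_ij of the phi_i orthologs of gene i;
    each of these orthologs carries the label 1 / phi_i.
    """
    weight = 0.0
    for family in positions:
        label = 1.0 / len(family)  # label 1/phi_i of this multigene family
        weight += label * sum(start <= u <= end for u in family)
    return weight

# Hypothetical toy data: three reference genes with 1, 2 and 3 orthologs in B.
toy_positions = [[0.10], [0.12, 0.80], [0.11, 0.13, 0.50]]
toy_weight = interval_weight(toy_positions, 0.10, 0.15)  # 1 + 1/2 + 2/3
```

With one-to-one orthology every label equals 1, so the weight reduces to the number of orthologs in the interval.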

To apply a statistical test we need to compute, for a given weight $h$ and a given length $r$, the probability, under the null hypothesis, of finding somewhere in the genome B an orthologous cluster of weight greater than $h$ and of length smaller than $r$. We will call such a cluster of type $(h : r)$.

In this article, we focus on the computation of this probability. For technical simplicity, we first consider the case when B is a circular genome, hence the circle of length 1 in our model.

2.2. The circular case

Let $h$ be fixed, of the form $h = \sum_{i=1}^{m} n_i / \phi_i$, with $0 \le n_i \le \phi_i$, $i = 1, \dots, m$. Let $r \in (0, 1)$ also be fixed.

We denote by $U_{(1)} \le U_{(2)} \le \cdots \le U_{(n)}$ the ordered positions in B of the $n$ orthologs, i.e., the order statistics of $n$ i.i.d. r.v.'s uniformly distributed on the circle of length 1.

Let $W_m = W_m(h, r)$ denote the r.v. representing the number of (possibly overlapping) clusters of type $(h : r)$ in the genome B. Let also

$$A_k = A_k(h, r) := \{\mu_m([U_{(k)}, U_{(k)} + r]) \ge h\}$$

denote the event of having in B a cluster of type $(h : r)$ starting with the $k$-th ortholog.

We have $W_m = \sum_{k=1}^{n} \mathbf{1}_{A_k}$. We are interested in computing $P(W_m \ge 1)$, the probability of finding, somewhere in the genome B, at least one cluster of type $(h : r)$.
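As an illustration of the definition of $W_m$ on the circle, the sketch below counts the starting orthologs whose arc of length $r$ has weight at least $h$. The function name `count_clusters`, the floating-point tolerance and the toy data are assumptions made for this example only.

```python
from typing import List, Tuple

def count_clusters(orthologs: List[Tuple[float, float]], h: float, r: float) -> int:
    """Number W_m of (possibly overlapping) clusters of type (h:r) on the circle of length 1.

    orthologs: (position, label) pairs with positions in [0, 1).
    The cluster starting at the k-th ortholog is counted when the labels of the
    orthologs lying in the arc [U_(k), U_(k) + r] sum to at least h.
    """
    pts = sorted(orthologs)
    n = len(pts)
    eps = 1e-12  # tolerance for floating-point sums of labels (assumption)
    count = 0
    for k in range(n):
        start = pts[k][0]
        weight = 0.0
        for d in range(n):  # walk along the circle from the k-th ortholog
            pos, label = pts[(k + d) % n]
            if (pos - start) % 1.0 > r:  # circular distance from the start
                break
            weight += label
        if weight >= h - eps:
            count += 1
    return count

# Hypothetical toy data: labels 1 (single-copy genes) and 1/2 (families of size 2).
toy = [(0.10, 1.0), (0.11, 1.0), (0.12, 0.5), (0.50, 1.0), (0.99, 0.5)]
w_toy = count_clusters(toy, h=2.0, r=0.05)
```

The event of interest, $\{W_m \ge 1\}$, corresponds to `count_clusters(...) >= 1`.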

We will further simplify the parametrization of the problem.

Let $\phi'_1 < \cdots < \phi'_J$ denote all the different values among the multigene families' sizes $\phi_1, \dots, \phi_m$, and let $g_j = |\{i = 1, \dots, m : \phi_i = \phi'_j\}|$, $j = 1, \dots, J$, denote their multiplicities.

Remark 1. We can represent the measure $\mu_m$ as $\mu_m = \sum_{i=1}^{n} L_i \delta_{U_{(i)}}$, where $\delta_x$ denotes the Dirac measure at $x$ and $L = (L_1, \dots, L_n)$ is a random vector independent of the $U_{(i)}$'s and uniformly distributed over the set of all possible labelings of the $n$ orthologs:

$$\Lambda = \left\{ \ell = (\ell_1, \dots, \ell_n) \in \left\{ \frac{1}{\phi'_1}, \dots, \frac{1}{\phi'_J} \right\}^{n} : \left| \left\{ i : \ell_i = \frac{1}{\phi'_j} \right\} \right| = g_j \phi'_j, \ \forall j \right\}.$$

Let $n_{\min} := g_1 = |\{i : \phi_i = \phi'_1\}|$, where $\phi'_1 = \min\{\phi_i : i = 1, \dots, m\}$. We also let $h_* := \lceil h \phi'_1 \rceil$ and we assume that $n_{\min} \ge h_*$, so that $h_*$ is the minimal number of orthologs in a cluster of weight greater than $h$.

For every labeling $\ell \in \Lambda$ and every $k = 1, \dots, n$, let

$$h_k(\ell) := \min\{d : \ell_k + \cdots + \ell_{k+d-1} \ge h\}$$

be the minimal number of orthologs in a cluster starting with the $k$-th ortholog so as to be of weight greater than $h$. Therefore,

$$A_k \cap \{L = \ell\} = \{U_{(k + h_k(\ell) - 1)} - U_{(k)} \le r\} \cap \{L = \ell\}.$$

Let $h^* := \max_{\ell, k} \{h_k(\ell)\}$. We have $h^* \le \lceil h \phi'_J \rceil$, where $\phi'_J = \max\{\phi_i : i = 1, \dots, m\}$.
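The quantities $h_k(\ell)$ can be computed directly from a labeling. The sketch below is only an illustration: the function name `min_cluster_size`, the 0-based circular indexing and the example labeling (family sizes 1 and 2, $h = 2$) are assumptions of this example.

```python
import math
from typing import Sequence

def min_cluster_size(labels: Sequence[float], k: int, h: float) -> int:
    """h_k(l): smallest d with l_k + ... + l_{k+d-1} >= h (0-based k, circular indexing)."""
    n = len(labels)
    total = 0.0
    for d in range(1, n + 1):
        total += labels[(k + d - 1) % n]
        if total >= h - 1e-12:  # small tolerance for floating-point label sums
            return d
    raise ValueError("the total weight of the labeling is smaller than h")

# Hypothetical labeling: four single-copy genes (label 1), two families of size 2 (label 1/2).
labels = [1.0, 0.5, 0.5, 1.0, 1.0, 0.5, 0.5, 1.0]
h = 2.0
sizes = [min_cluster_size(labels, k, h) for k in range(len(labels))]
h_lower = math.ceil(h * 1)        # h_* = ceil(h * phi'_1), with phi'_1 = 1 here
h_upper_bound = math.ceil(h * 2)  # ceil(h * phi'_J), with phi'_J = 2 here
```

For this particular labeling every $h_k(\ell)$ lies between $h_*$ and $\lceil h \phi'_J \rceil$, in line with the bounds stated above.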

Remark 2. We place ourselves in the asymptotic setting $m \to \infty$ or, equivalently, $n \to \infty$.

Note that we are dealing with a sum of indicators with short-range dependence and long-range (almost) independence. Because of the strong dependence between neighboring indicators, the events $A_k$ will tend to occur in clumps. Consequently, it seems reasonable to approximate the distribution of $W_m$ by a compound Poisson distribution $CP(\lambda)$, with a "good" choice of the parameter $\lambda$.

We will quantify the error using the Kolmogorov distance. We recall that the Kolmogorov distance between two measures $\mu$ and $\nu$ on $\mathbb{R}_+$ is

$$d_K(\mu, \nu) = \sup_{k \in \mathbb{N}} |\mu([k, \infty)) - \nu([k, \infty))|.$$

We will approximate our probability of interest $P(W_m \ge 1)$ by the corresponding probability $CP(\lambda)([1, \infty)) = 1 - \exp\{-\sum_{i=1}^{\infty} \lambda_i\}$ for the compound Poisson distribution, with an error

$$|P(W_m \ge 1) - CP(\lambda)([1, \infty))| \le d_K(\mathcal{L}(W_m), CP(\lambda)).$$
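Both quantities above are straightforward to evaluate numerically. The following sketch assumes the parameters $\lambda_i$ are given as a finite list and that the two distributions compared are supported on the non-negative integers; the function names are hypothetical.

```python
import math
from typing import Sequence

def cp_prob_at_least_one(lam: Sequence[float]) -> float:
    """CP(lambda)([1, inf)) = 1 - exp(-sum_i lambda_i), with lam[i-1] = lambda_i."""
    return 1.0 - math.exp(-sum(lam))

def kolmogorov_distance(pmf_a: Sequence[float], pmf_b: Sequence[float]) -> float:
    """sup_k |mu([k, inf)) - nu([k, inf))| for two pmfs on {0, 1, 2, ...}."""
    size = max(len(pmf_a), len(pmf_b))
    a = list(pmf_a) + [0.0] * (size - len(pmf_a))
    b = list(pmf_b) + [0.0] * (size - len(pmf_b))
    tail_a = tail_b = 0.0
    best = 0.0
    for k in range(size - 1, -1, -1):  # accumulate the upper tails from the right
        tail_a += a[k]
        tail_b += b[k]
        best = max(best, abs(tail_a - tail_b))
    return best
```

Since $[1, \infty)$ is one of the sets over which the supremum is taken, the error on $P(W_m \ge 1)$ can never exceed the Kolmogorov distance.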

It is therefore sufficient to obtain bounds for the Kolmogorov distance between the two distributions. For bounding the Kolmogorov distance, we will use the Stein-Chen method for compound Poisson approximation.

The Stein-Chen method is a general method to obtain bounds on the distance between two probability distributions with respect to a probability metric. It was originally formulated for normal approximation by Stein (1972), to obtain a bound for the Kolmogorov distance between the distribution of the sum of an m-dependent sequence of random variables and a standard normal distribution. The Poisson approximation version was developed by Chen (1975). See also Stein (1986).

The Stein-Chen method for compound Poisson approximation was introduced by Barbour et al. (1992). In this article, we use, more precisely, the coupling approach developed by Roos (1993a,b).

2.3. The Stein-Chen method, the coupling approach

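Let $W = \sum_{\alpha \in \Gamma} I_\alpha$ be a sum of indicators. We assume that there exists a local-dependence structure between these indicators (of the type short-range dependence, long-range independence), so that we can, for every $\alpha \in \Gamma$, divide $\Gamma$ into four disjoint subsets $\{\alpha\}$, $\Gamma^{vs}_\alpha$, $\Gamma^{vw}_\alpha$ and $\Gamma^{b}_\alpha$:

$$\Gamma^{vs}_\alpha := \{\beta \in \Gamma \setminus \{\alpha\} : I_\beta \text{ "very strongly" dependent on } I_\alpha\},$$

$$\Gamma^{vw}_\alpha := \{\beta \in \Gamma \setminus \{\alpha\} : I_\beta \text{ "very weakly" dependent on } \{I_\gamma, \gamma \in \{\alpha\} \cup \Gamma^{vs}_\alpha\}\},$$

$$\Gamma^{b}_\alpha := \Gamma \setminus (\{\alpha\} \cup \Gamma^{vs}_\alpha \cup \Gamma^{vw}_\alpha).$$

We let $U_\alpha := \sum_{\beta \in \Gamma^{vs}_\alpha} I_\beta$, $Z_\alpha := I_\alpha + U_\alpha$, $X_\alpha := \sum_{\beta \in \Gamma^{b}_\alpha} I_\beta$.

The canonical choice for the parameter of the approximating compound Poisson distribution is the following: $\lambda = \sum_{i=1}^{\Gamma + 1} \lambda_i \delta_i$, where $\Gamma = \max_{\alpha \in \Gamma} \{|\Gamma^{vs}_\alpha|\}$ and

$$\lambda_i = \frac{1}{i} \sum_{\alpha \in \Gamma} E(I_\alpha \mathbf{1}_{\{Z_\alpha = i\}}).$$

In practice, it is not always easy to compute the canonical parameters. Sometimes it is useful to keep only a smaller number of parameters: $\hat\lambda = \sum_{i=1}^{\ell} \hat\lambda_i \delta_i$, with $\ell < \Gamma + 1$, where $\hat\lambda_i = \lambda_i$ for $i = 2, \dots, \ell$, $\hat\lambda_i = 0$ for $i \ge \ell + 1$ and $\hat\lambda_1 = E(W) - \sum_{i=2}^{\ell} i \lambda_i$.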
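The truncation step can be sketched as follows, assuming the canonical parameters are already available as a list. It uses the identity $E(W) = \sum_i i \lambda_i$, which follows from the definition of the canonical parameters; the function name and the numerical values are hypothetical.

```python
from typing import List, Sequence

def truncate_parameters(lam: Sequence[float], ell: int) -> List[float]:
    """Truncated parameters (lam_hat_1, ..., lam_hat_ell) from canonical ones.

    lam[i-1] holds lambda_i. lam_hat_i = lambda_i for i = 2..ell, and
    lam_hat_1 = E(W) - sum_{i=2}^{ell} i * lambda_i, with E(W) = sum_i i * lambda_i.
    """
    if not 1 <= ell <= len(lam):
        raise ValueError("ell must lie between 1 and len(lam)")
    mean_w = sum(i * l for i, l in enumerate(lam, start=1))
    kept = list(lam[1:ell])  # lambda_2, ..., lambda_ell
    lam_hat_1 = mean_w - sum(i * l for i, l in enumerate(kept, start=2))
    return [lam_hat_1] + kept

# Hypothetical canonical parameters lambda_1, lambda_2, lambda_3, truncated at ell = 2.
lam_hat = truncate_parameters([0.1, 0.02, 0.004], ell=2)
```

The truncated parameter keeps the mean of the approximating distribution equal to $E(W)$, at the price of moving the mass of the large clumps to clumps of size 1.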

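For every $\alpha \in \Gamma$, let $V_\alpha$ be a r.v. and $\mathcal{V}_\alpha$ its set of values.

We have the following theorem (see Theorem 4.F. in Roos, 1993a).

Theorem 1. Assume that for every $\alpha \in \Gamma$ and $v \in \mathcal{V}_\alpha$ we can construct, on the same probability space, the indicators $\{I''_{\beta i v}(\alpha), \beta \in \Gamma, i = 1, \dots, |\Gamma^{vs}_\alpha| + 1\}$ and $\{I'_{\beta v}(\alpha), \beta \in \Gamma\}$ in such a way that

$$\mathcal{L}(I''_{\beta i v}(\alpha), \beta \in \Gamma) = \mathcal{L}(I_\beta, \beta \in \Gamma \mid I_\alpha \mathbf{1}_{\{Z_\alpha = i\}} = 1, V_\alpha = v), \ \forall i, \qquad (1)$$

$$\mathcal{L}(I'_{\beta v}(\alpha), \beta \in \Gamma) = \mathcal{L}(I_\beta, \beta \in \Gamma). \qquad (2)$$

Then, for all choices of the sets $\Gamma^{vs}_\alpha$ and $\Gamma^{vw}_\alpha$ and for all bounded functions $g : \mathbb{N} \to \mathbb{R}$, we have

$$d_K(\mathcal{L}(W), CP(\hat\lambda)) \le c_K(\hat\lambda) \left\{ \sum_{\alpha \in \Gamma} \left[ (E I_\alpha)^2 + E I_\alpha \, E(U_\alpha + X_\alpha) + E(I_\alpha X_\alpha) + \sum_{i=1}^{|\Gamma^{vs}_\alpha| + 1} \sum_{\beta \in \Gamma^{vw}_\alpha} E(I_\alpha \mathbf{1}_{\{Z_\alpha = i\}} \eta_{\beta, \alpha, i}(V_\alpha)) \right] + \sum_{i=\ell+1}^{\Gamma + 1} i(i-1)\lambda_i \right\},$$

where $c_K(\hat\lambda) = \sup_{A \in \mathcal{K}} \sup_{i \ge 1} |g_{\hat\lambda, A}(i+1) - g_{\hat\lambda, A}(i)|$, with $\mathcal{K} = \{[k, \infty) : k \in \mathbb{N}\}$ and $g_{\hat\lambda, A}$ being the solution of the Stein-Chen equation for compound Poisson approximation (Roos, 1993a), and $\eta_{\beta, \alpha, i}(v) = E|I''_{\beta i v}(\alpha) - I'_{\beta v}(\alpha)|$.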

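Barbour et al. (2000) showed that, if the following condition is fulfilled:

$$\lambda_1 \ge 2\lambda_2 \ge 3\lambda_3 \ge \cdots, \qquad (3)$$

then

$$c_K(\lambda) \le \min\left\{ \frac{1}{2}, \frac{1}{\lambda_1 + 1} \right\}. \qquad (4)$$

In the next section, we apply this method to our $W_m$, using Theorem 1.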

3. COMPOUND POISSON APPROXIMATION FOR $P(W_m \ge 1)$

3.1. The circular case

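We place ourselves in the asymptotic setting $n \to \infty$ and $r \to 0$, with $nr \to 0$.

We let $I_k := \mathbf{1}_{A_k}$, $k = 1, \dots, n$.

For every $k \in \{1, \dots, n\}$, we choose the dependence sets as follows:

$$\Gamma^{vs}_k := \{k - h_* + 2, \dots, k - 1, k + 1, \dots, k + h_* - 2\},$$

$$\Gamma^{vw}_k := \{j : |j - k| > 2(h^* - 2)\},$$

$$\Gamma^{b}_k := \Gamma \setminus (\{k\} \cup \Gamma^{vs}_k \cup \Gamma^{vw}_k) = \{j : h_* - 2 < |j - k| \le 2(h^* - 2)\}.$$

Here, $\Gamma = \max_{k = 1, \dots, n} |\Gamma^{vs}_k| = 2(h_* - 2)$ and $Z_k = \sum_{j = k - h_* + 2}^{k + h_* - 2} I_j$. We will explicitly construct the coupling described in Theorem 1.

Let us define the spacings $S_j := U_{(j+1)} - U_{(j)}$, $j = 1, \dots, n$, with the circular convention modulo $n$.

Notation 1. For a sequence $(a_j)_{j \ge 1}$ we will denote $a_{i,k} := a_i + \cdots + a_{i+k-1}$.

For every $k \in \{1, \dots, n\}$ and $\ell \in \Lambda$, we have $A_k \cap \{L = \ell\} = \{S_{k, h_k(\ell) - 1} \le r\} \cap \{L = \ell\}$.

Let $k \in \{1, \dots, n\}$ be fixed. The indicators appearing in the expression of $Z_k$ are those from $I_{k - h_* + 2}$ to $I_{k + h_* - 2}$. Consequently, if $L = \ell$, the spacings appearing in the expression of $Z_k$ are $S_{k - h_* + 2}, \dots, S_{k + h_* + h_{k + h_* - 2}(\ell) - 4}$.

Let $V_k := (L, S_{k - h_* + 2}, \dots, S_{k + h_* + h^* - 4})$. Note that $V_k$ contains all the spacings which may appear in $Z_k$, for different values of $\ell$.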

For every $v = (\ell, z_1, \dots, z_{2h_* + h^* - 5})$, with $\ell \in \Lambda$, $z_1, \dots, z_{2h_* + h^* - 5} > 0$ and $z_1 + \cdots + z_{2h_* + h^* - 5} < 1$, we will construct on the same probability space the indicators $\{I''_{jiv}(k), j = 1, \dots, n\}$ and $\{I'_j(k), j = 1, \dots, n\}$ (not depending on $v$) verifying the relations (1) and (2) in Theorem 1.

Note that the event $\{I_k \mathbf{1}_{\{Z_k = i\}} = 1\}$ is $V_k$-measurable and thus, for condition (1) to be fulfilled, it suffices to construct the family of indicators $\{I''_{jv}(k), j = 1, \dots, n\}$ (not depending on $i$), s.t.

$$\mathcal{L}(I''_{jv}(k), j = 1, \dots, n) = \mathcal{L}(I_j, j = 1, \dots, n \mid V_k = (\ell, z_1, \dots, z_{2h_* + h^* - 5})).$$

Let $U'_1, \dots, U'_n$ be r.v.'s independent of $L$ and such that $\mathcal{L}(U'_1, \dots, U'_n) = \mathcal{L}(U_{(1)}, \dots, U_{(n)})$.

Define the corresponding spacings $S'_j = U'_{j+1} - U'_j$, $\forall j = 1, \dots, n$ (with the circular convention $U'_{n+1} = U'_1$). We thus have $\mathcal{L}(S'_1, \dots, S'_n) = \mathcal{L}(S_1, \dots, S_n)$.

For $v = (\ell, z_1, \dots, z_{2h_* + h^* - 5})$ with $\ell \in \Lambda$, $z_1, \dots, z_{2h_* + h^* - 5} > 0$ and $z_1 + \cdots + z_{2h_* + h^* - 5} < 1$, we let

$$S''_j = \frac{1 - \sum_{i=1}^{2h_* + h^* - 5} z_i}{1 - \sum_{i = k - h_* + 2}^{k + h_* + h^* - 4} S'_i} \, S'_j, \quad j \in \{1, \dots, n\} \setminus \{k - h_* + 2, \dots, k + h_* + h^* - 4\},$$

$$S''_{k - h_* + 2} = z_1, \ \dots, \ S''_{k + h_* + h^* - 4} = z_{2h_* + h^* - 5}.$$

Note that

$$\mathcal{L}(S''_1, \dots, S''_n) = \mathcal{L}(S_1, \dots, S_n \mid S_{k - h_* + 2} = z_1, \dots, S_{k + h_* + h^* - 4} = z_{2h_* + h^* - 5}).$$

Let also $\mu'_m := \sum_{i=1}^{n} L_i \delta_{U'_i}$.

For every $j \in \{1, \dots, n\}$ we construct the indicators needed in Theorem 1 as

$$I'_j(k) := \mathbf{1}\{\mu'_m([U'_j, U'_j + r]) \ge h\}, \qquad I''_{jv}(k) := \mathbf{1}\{S''_j + \cdots + S''_{j + h_j(\ell) - 2} \le r\}.$$

It is easy to see that the indicators defined above verify the conditions (1) and (2). It remains to compute all the quantities appearing in the error bound in Theorem 1.

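The canonical choice for the parameters of the compound Poisson distribution is $\lambda = \sum_{i=1}^{2h_* - 3} \lambda_i \delta_i$, with $\lambda_i = \frac{1}{i} \sum_{k=1}^{n} E(I_k \mathbf{1}_{\{Z_k = i\}})$.

In our approximation we will use only half of the parameters, by truncating at $\ell = h_* - 1$. Instead of $\lambda$ we will use $\hat\lambda = \sum_{i=1}^{h_* - 1} \hat\lambda_i \delta_i$, where $\hat\lambda_i = \lambda_i$ for $i = 2, \dots, h_* - 1$ and $\hat\lambda_1 = E(W_m) - \sum_{i=2}^{h_* - 1} i\lambda_i = \lambda_1 + \sum_{i = h_*}^{2h_* - 3} i\lambda_i$.

We will approximate the probability of interest $P(W_m \ge 1)$ by

$$p := 1 - \exp\left\{ -\sum_{i=1}^{h_* - 1} \hat\lambda_i \right\}.$$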

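Remark 3. As the indicators $\{I''_{jv}(k), j = 1, \dots, n\}$ do not depend on $i$, the term $\eta_{j,k}(v) = E|I''_{jv}(k) - I'_j(k)|$ appearing in Theorem 1 does not depend on $i$ either, and thus

$$d_K(\mathcal{L}(W_m), CP(\hat\lambda)) \le c_K(\hat\lambda) \left\{ \sum_{k=1}^{n} \left[ (E I_k)^2 + E I_k \, E(U_k + X_k) + E(I_k X_k) + \sum_{j \in \Gamma^{vw}_k} E(I_k \eta_{j,k}(V_k)) \right] + \sum_{i = h_*}^{2h_* - 3} i(i-1)\lambda_i \right\},$$

where $U_k = \sum_{j = k - h_* + 2}^{k + h_* - 2} I_j - I_k$ and $X_k = \sum_{j = k - 2h^* + 4}^{k - h_* + 1} I_j + \sum_{j = k + h_* - 1}^{k + 2h^* - 4} I_j$.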

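Using classic results on uniform spacings (Pyke, 1965, 1972), one can easily prove

Lemma 2. For fixed $k$, assume that $n \to \infty$, $r \to 0$ s.t. $nr \to 0$. Then, uniformly with respect to $0 < nr < 1$, we have

$$P(S_{1,k} \le r) = \frac{(nr)^k}{k!} \left( 1 + O\left(\frac{1}{n}\right) + O(nr) \right)$$

and for fixed $i$ and $j$,

$$\text{if } i < k: \quad P(S_{1,i} \le r, S_{k,j} \le r) = \frac{(nr)^{i+j}}{i! \, j!} \left( 1 + O\left(\frac{1}{n}\right) + O(nr) \right),$$

$$\text{if } i \ge k: \quad P(S_{1,i} \le r, S_{k,j} \le r) = \frac{(2k - i + j - 2)!}{(k + j - 1)! \, (k - 1)! \, (k - i + j - 1)!} (nr)^{k + j - 1} \left( 1 + O\left(\frac{1}{n}\right) + O(nr) \right).$$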
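The first estimate of Lemma 2 can be checked numerically. The sketch below relies on two standard facts that are assumptions of this example rather than statements of the article: the sum of $k$ consecutive spacings of $n$ uniform points on the circle follows a Beta$(k, n-k)$ distribution, and a Beta distribution function with integer parameters equals a binomial tail. The function names are hypothetical, and the direct summation is only meant for moderate $n$.

```python
import math

def spacing_sum_cdf(n: int, k: int, r: float) -> float:
    """Exact P(S_{1,k} <= r) for n uniform points on the circle, 1 <= k < n.

    S_{1,k} ~ Beta(k, n - k); for integer parameters its distribution function
    equals P(Binomial(n - 1, r) >= k).
    """
    return sum(math.comb(n - 1, j) * r**j * (1.0 - r) ** (n - 1 - j)
               for j in range(k, n))

def spacing_sum_leading_term(n: int, k: int, r: float) -> float:
    """Leading term (nr)^k / k! of Lemma 2."""
    return (n * r) ** k / math.factorial(k)

# Hypothetical values with nr = 0.1: the ratio exact / leading term should be close to 1.
ratio = spacing_sum_cdf(1000, 3, 1e-4) / spacing_sum_leading_term(1000, 3, 1e-4)
```

The gap between the ratio and 1 is of order $nr$, as expected from the $O(nr)$ correction in the lemma.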

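For every $k = 1, \dots, n$, we have $E(I_k) = |\Lambda|^{-1} \sum_{\ell \in \Lambda} P(S_{k, h_k(\ell) - 1} \le r)$.

For every $\ell \in \Lambda$, from Lemma 2 and the exchangeability of the spacings, we have

$$P(S_{k, h_k(\ell) - 1} \le r) = P(S_{1, h_k(\ell) - 1} \le r) = \frac{(nr)^{h_k(\ell) - 1}}{(h_k(\ell) - 1)!} \left( 1 + O\left(\frac{1}{n}\right) + O(nr) \right).$$

We hence obtain, for $0 < nr < 1$, the following upper bound:

$$E(I_k) \le \frac{(nr)^{h_* - 1}}{(h_* - 1)!} \left( 1 + O\left(\frac{1}{n}\right) + O(nr) \right). \qquad (5)$$

We make the following (biologically realistic) assumption on the data.

Assumption 1. We assume that $n_{\min} \asymp n$, i.e., $n_{\min} = a n (1 + O(\frac{1}{n}))$, with $a \le 1$ fixed.

Based on Assumption 1, we obtain

$$|\{\ell \in \Lambda : h_1(\ell) = h_*\}| \asymp |\Lambda|. \qquad (6)$$

This further implies that $E(I_k) \asymp (nr)^{h_* - 1}$ and $E(W_m) \asymp n (nr)^{h_* - 1}$.

Let $k < j$. We have $E(I_k I_j) = |\Lambda|^{-1} \sum_{\ell \in \Lambda} P(S_{k, h_k(\ell) - 1} \le r, S_{j, h_j(\ell) - 1} \le r)$, where for each $\ell \in \Lambda$, using Lemma 2, we have

$$P(S_{k, h_k(\ell) - 1} \le r, S_{j, h_j(\ell) - 1} \le r) = P(S_{1, h_k(\ell) - 1} \le r, S_{j - k + 1, h_j(\ell) - 1} \le r)$$

$$= \frac{(2(j - k) + h_j(\ell) - h_k(\ell))!}{(j - k)! \, (j - k + h_j(\ell) - h_k(\ell))! \, (j - k + h_j(\ell) - 1)!} (nr)^{j - k + h_j(\ell) - 1} \left( 1 + O\left(\frac{1}{n}\right) + O(nr) \right), \qquad (7)$$

if $k < j \le k + h_k(\ell) - 2$ (we say that the two clusters intersect), and

$$= \frac{1}{(h_k(\ell) - 1)! \, (h_j(\ell) - 1)!} (nr)^{h_k(\ell) + h_j(\ell) - 2} \left( 1 + O\left(\frac{1}{n}\right) + O(nr) \right), \qquad (8)$$

if $j > k + h_k(\ell) - 2$ (we say that the two clusters do not intersect).

From Assumption 1, we can also obtain that

$$|\{\ell \in \Lambda : h_k(\ell) = h_*, h_j(\ell) = h_*\}| \asymp |\Lambda|. \qquad (9)$$

Next we will estimate the error terms appearing in Theorem 1. We have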

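Proposition 3. Assume that $n \to \infty$, $r \to 0$ s.t. $nr \to 0$ and $n_{\min} \asymp n$. Then, uniformly in $\frac{1}{n} \le nr < 1$ and $n > 2(2h_* + h^* - 4) \vee \exp\left\{ \frac{4(2h_* + h^* - 4)}{3(h_* - 1) + h^*} \right\}$, we have the following estimates:

(a) $$\sum_{k=1}^{n} (E I_k)^2 \le \frac{n (nr)^{2(h_* - 1)}}{[(h_* - 1)!]^2} \left( 1 + O\left(\frac{1}{n}\right) + O(nr) \right);$$

(b) $$\sum_{k=1}^{n} E(I_k) E(U_k + X_k) \le 4(h^* - 2) \frac{n (nr)^{2(h_* - 1)}}{[(h_* - 1)!]^2} \left( 1 + O\left(\frac{1}{n}\right) + O(nr) \right);$$

(c) $$\sum_{k=1}^{n} E(I_k X_k) \le 2(2h^* - h_* - 2) \frac{n (nr)^{2(h_* - 1)}}{[(h_* - 1)!]^2} \left( 1 + O\left(\frac{1}{n}\right) + O(nr) \right);$$

(d) $$\sum_{k=1}^{n} \sum_{j \in \Gamma^{vw}_k} E(I_k \eta_{j,k}(V_k)) \le 2(h_* - 1) \left\{ 2h_* + h^* - 5 + 2^{h_* - 2}(h_* + h^* - 4) \right\} \frac{n (nr)^{2(h_* - 1)}}{[(h_* - 1)!]^2} \left( 1 + O\left(\frac{1}{n}\right) + O(nr) \right);$$

(e) $$\sum_{i = h_*}^{2h_* - 3} i(i-1)\lambda_i \le (h_* - 2) 2^{2h_* - 5} \frac{n (nr)^{2(h_* - 1)}}{[(h_* - 1)!]^2} \left( 1 + O\left(\frac{1}{n}\right) + O(nr) \right).$$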

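Proof. The bounds in (a) and (b) follow easily using (5). □

Proof of (c). We have

$$\sum_{k=1}^{n} E(I_k X_k) = \sum_{k=1}^{n} \left\{ \sum_{j = k - 2h^* + 4}^{k - h_* + 1} E(I_j I_k) + \sum_{j = k + h_* - 1}^{k + 2h^* - 4} E(I_k I_j) \right\},$$

where $E(I_k I_j) = |\Lambda|^{-1} \sum_{\ell \in \Lambda} P(S_{k, h_k(\ell) - 1} \le r, S_{j, h_j(\ell) - 1} \le r)$.

For $j = k + h_* - 1$, using Lemma 2,

– if $h_j(\ell) = h_*$ and $h_k(\ell) = h_*$, then the two clusters do not intersect and, from (8), we obtain

$$P(S_{k, h_k(\ell) - 1} \le r, S_{j, h_j(\ell) - 1} \le r) = \frac{(nr)^{2(h_* - 1)}}{[(h_* - 1)!]^2} \left( 1 + O\left(\frac{1}{n}\right) + O(nr) \right);$$

– if $h_j(\ell) = h_*$ and $h_k(\ell) > h_*$, then the two clusters intersect and, using (7), we obtain

$$P(S_{k, h_k(\ell) - 1} \le r, S_{j, h_j(\ell) - 1} \le r) \le \frac{1}{2} \frac{(nr)^{2(h_* - 1)}}{[(h_* - 1)!]^2} \left( 1 + O\left(\frac{1}{n}\right) + O(nr) \right);$$

– for every other $\ell$ we have $P(S_{k, h_k(\ell) - 1} \le r, S_{j, h_j(\ell) - 1} \le r) = (nr)^{2(h_* - 1)} O(nr)$.

It follows from (9) that

$$E(I_k I_{k + h_* - 1}) \le \frac{(nr)^{2(h_* - 1)}}{[(h_* - 1)!]^2} \left( 1 + O\left(\frac{1}{n}\right) + O(nr) \right).$$

The other cases for $j$ can be treated in a similar manner and the upper bound stated in (c) easily follows. □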

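Proof of (d). We will condition on the r.v. $V_k = (L, S_{k - h_* + 2}, \dots, S_{k + h_* + h^* - 4})$. Given that $V_k = (\ell, z_1, \dots, z_{2h_* + h^* - 5})$, we have $I_k = \mathbf{1}\{z_{h_* - 1, h_k(\ell) - 1} \le r\}$ (hence deterministic) and thus

$$\sum_{k=1}^{n} \sum_{j \in \Gamma^{vw}_k} E(I_k \eta_{j,k}(V_k)) = \sum_{k=1}^{n} \sum_{j \in \Gamma^{vw}_k} \frac{1}{|\Lambda|} \sum_{\ell \in \Lambda} d(k, j, \ell),$$

where for each $k = 1, \dots, n$, $j \in \Gamma^{vw}_k$ and $\ell \in \Lambda$, we let

$$d(k, j, \ell) := E[I_k \eta_{j,k}(V_k) \mid L = \ell] = d_1(k, j, \ell) + d_2(k, j, \ell),$$

$$d_1(k, j, \ell) := \int \mathbf{1}\{z_{h_* - 1, h_k(\ell) - 1} \le r\} \, P(S''_{j, h_j(\ell) - 1} \le r, S'_{j, h_j(\ell) - 1} > r) \, dF(z_1, \dots, z_{2h_* + h^* - 5}),$$

$$d_2(k, j, \ell) := \int \mathbf{1}\{z_{h_* - 1, h_k(\ell) - 1} \le r\} \, P(S''_{j, h_j(\ell) - 1} > r, S'_{j, h_j(\ell) - 1} \le r) \, dF(z_1, \dots, z_{2h_* + h^* - 5}),$$

with $F$ being the distribution of $(S_{k - h_* + 2}, \dots, S_{k + h_* + h^* - 4})$.

We further decompose $d_1(k, j, \ell) := d'_1(k, j, \ell) + d''_1(k, j, \ell)$, where

$$d'_1(k, j, \ell) := \int \mathbf{1}\{z_{h_* - 1, h_k(\ell) - 1} \le r\} \, \mathbf{1}\{z_{1, 2h_* + h^* - 5} > ar\} \, P(S''_{j, h_j(\ell) - 1} \le r, S'_{j, h_j(\ell) - 1} > r) \, dF(z_1, \dots, z_{2h_* + h^* - 5}),$$

$$d''_1(k, j, \ell) := \int \mathbf{1}\{z_{h_* - 1, h_k(\ell) - 1} \le r\} \, \mathbf{1}\{z_{1, 2h_* + h^* - 5} \le ar\} \, P(S''_{j, h_j(\ell) - 1} \le r, S'_{j, h_j(\ell) - 1} > r) \, dF(z_1, \dots, z_{2h_* + h^* - 5}),$$

with $a = a(n)$ to be chosen a little further.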

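We will simplify the notation by writing $h_k$ instead of $h_k(\ell)$. We have:

$$d'_1(k, j, \ell) \le \int \mathbf{1}\{z_{h_* - 1, h_k - 1} \le r\} \, \mathbf{1}\{z_{1, 2h_* + h^* - 5} > ar\} \, dF(z_1, \dots, z_{2h_* + h^* - 5})$$

$$= \int_0^r \frac{n (nu)^{h_k - 2}}{(h_k - 2)!} \int_{ar - u}^{1} \frac{n (nv)^{2h_* + h^* - h_k - 5}}{(2h_* + h^* - h_k - 5)!} (1 - u - v)^{n - (2h_* + h^* - 4)} \left( 1 + O\left(\frac{1}{n}\right) \right) dv \, du$$

$$\le \int_0^{nr} \frac{x^{h_k - 2}}{(h_k - 2)!} e^{-x/2} \int_{anr - x}^{n} \frac{y^{2h_* + h^* - h_k - 5}}{(2h_* + h^* - h_k - 5)!} e^{-y/2} \left( 1 + O\left(\frac{1}{n}\right) \right) dy \, dx \quad \text{(by a change of variable and Lemma 5)}$$

$$\le 2^{2h_* + h^* - 5} \int_0^{nr/2} \frac{z^{h_k - 2}}{(h_k - 2)!} e^{-z} \int_{anr/2 - z}^{\infty} \frac{t^{2h_* + h^* - h_k - 5}}{(2h_* + h^* - h_k - 5)!} e^{-t} \left( 1 + O\left(\frac{1}{n}\right) \right) dt \, dz$$

$$\le 4 \frac{(nr)^{h_k - 1}}{(h_k - 1)!} \frac{(anr)^{2h_* + h^* - h_k - 5}}{(2h_* + h^* - h_k - 5)!} e^{-anr/2} \left( 1 + O\left(\frac{1}{n}\right) \right) \quad \text{(by Lemma 4)}$$

$$\le 4 \frac{(nr)^{h_k + h_* - 2}}{(h_* - 1)!} \cdot \frac{1}{n} \cdot \frac{n}{(nr)^{h_* - 1}} \frac{(anr)^{2h_* + h^* - h_k - 5}}{(2h_* + h^* - h_k - 5)!} e^{-anr/2} \left( 1 + O\left(\frac{1}{n}\right) \right)$$

$$\le \frac{1}{n} (nr)^{2(h_* - 1)} O(nr),$$

if $\frac{1}{n} \le nr$ and $(3h_* + h^* - 3) \log n \le anr \le \sqrt{n}$, entailing that $n (anr)^{2h_* + h^* - h_k - 5} e^{-anr/2} \le (nr)^{h_*}$, and if moreover $a > 1$, $nr < 1$ and $anr > 4(2h_* + h^* - 4)$ for applying Lemma 4 and Lemma 5. The last inequality is hence valid for $4(2h_* + h^* - 4) \vee (3h_* + h^* - 3) \log n \le anr \le \sqrt{n}$ and $\frac{1}{n} \le nr < 1$.

In a similar manner we can bound $d''_1(k, j, \ell)$, then decompose and bound $d_2(k, j, \ell)$.

We finally obtain that, if we take $a := \frac{(3h_* + h^* - 3) \log n}{nr}$, then, uniformly in $\frac{1}{n} \le nr < 1$ and $n > 4(2h_* + h^* - 4) \vee \exp\left\{ \frac{4(2h_* + h^* - 4)}{3h_* + h^* - 3} \right\}$, we have the upper bound stated in (d).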

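Proof of (e). For every $k = 1, \dots, n$ we let $\mathcal{C}_{ik}$ denote the class of all the subsets of size $i - 1$ of $\Gamma^{vs}_k = \{k - h_* + 2, \dots, k - 1, k + 1, \dots, k + h_* - 2\}$. We obtain

$$i\lambda_i = \sum_{k=1}^{n} \sum_{C \in \mathcal{C}_{ik}} E\left( I_k \prod_{t \in C} I_t \prod_{t \in \Gamma^{vs}_k \setminus C} (1 - I_t) \right) \le \sum_{k=1}^{n} \sum_{C \in \mathcal{C}_{ik}} E(I_{\inf C} I_{\sup C}).$$

For every $k = 1, \dots, n$ and $C \in \mathcal{C}_{ik}$ we have

$$E(I_{\inf C} I_{\sup C}) = \frac{1}{|\Lambda|} \sum_{\ell \in \Lambda} P(S_{\inf C, h_{\inf C}(\ell) - 1} \le r, S_{\sup C, h_{\sup C}(\ell) - 1} \le r)$$

and $h_* - 1 \le i - 1 \le \sup C - \inf C$.

If $h_{\inf C}(\ell) = h_{\sup C}(\ell) = h_*$, then the two clusters do not intersect and we have

$$P(S_{\inf C, h_{\inf C}(\ell) - 1} \le r, S_{\sup C, h_{\sup C}(\ell) - 1} \le r) = \frac{(nr)^{2(h_* - 1)}}{[(h_* - 1)!]^2} \left( 1 + O\left(\frac{1}{n}\right) + O(nr) \right).$$

It follows that $E(I_{\inf C} I_{\sup C}) \le \frac{(nr)^{2(h_* - 1)}}{[(h_* - 1)!]^2} \left( 1 + O(\frac{1}{n}) + O(nr) \right)$ and hence

$$i\lambda_i \le \binom{2(h_* - 2)}{i - 1} \frac{n (nr)^{2(h_* - 1)}}{[(h_* - 1)!]^2} \left( 1 + O\left(\frac{1}{n}\right) + O(nr) \right).$$

The bound stated in (e) then follows from

$$\sum_{i = h_*}^{2h_* - 3} (i - 1) \binom{2(h_* - 2)}{i - 1} = (2h_* - 4) \sum_{j = h_* - 2}^{2h_* - 5} \binom{2h_* - 5}{j} = (h_* - 2) 2^{2h_* - 5}.$$

□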
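The combinatorial identity used at the end of the proof of (e) can be verified numerically for small values of $h_*$. The helper names `identity_lhs` and `identity_rhs` are hypothetical, and the check is only a sanity test, not a proof.

```python
import math

def identity_lhs(h_low: int) -> int:
    """Sum over i = h_*, ..., 2h_* - 3 of (i - 1) * C(2(h_* - 2), i - 1)."""
    n_vs = 2 * (h_low - 2)  # size of the strong-dependence neighborhood
    return sum((i - 1) * math.comb(n_vs, i - 1)
               for i in range(h_low, 2 * h_low - 2))

def identity_rhs(h_low: int) -> int:
    """Closed form (h_* - 2) * 2^(2h_* - 5), valid for h_* >= 3."""
    return (h_low - 2) * 2 ** (2 * h_low - 5)

# Compare both sides for a range of values of h_*.
checks = [identity_lhs(h) == identity_rhs(h) for h in range(3, 13)]
```

The identity follows from $j \binom{N}{j} = N \binom{N-1}{j-1}$ with $N = 2(h_* - 2)$ and from the symmetry of the binomial coefficients of the odd row $2h_* - 5$.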

We have used the following two elementary lemmas that we state without proof. For a proof of Lemma 4, see Grusea (2008), and for a proof of Lemma 5, see Roos (1993a).

Lemma 4. Let $X_1, \dots, X_n$ be i.i.d. r.v.'s with distribution $\mathrm{Exp}(1)$ and let $i, k \ge 1$ s.t. $i + k - 1 \le n$. Then, uniformly in $a \ge 1$, $b < 1$, $ab > 2(n - k - 1)$, we have the following inequality:

$$P(X_{1,n} > ab, X_{i,k} < b) \le 2 \frac{b^k}{k!} \frac{(ab)^{n - k - 1}}{(n - k - 1)!} e^{-ab}.$$

Lemma 5. For $0 \le x \le 1$ and $n \ge 2(m + 1)$ we have $(1 - x)^{n - (m + 1)} \le e^{-nx/2}$.

In the following lemma, we show that the chosen parameters for the approximating compound Poisson distribution verify the relation (3), and hence we can use the bound (4) of Barbour et al. (2000).

Lemma 6. If $0 < nr < 1$ and $n_{\min} \asymp n$, then for every $i \in \{1, \dots, h_* - 1\}$ we have $\hat\lambda_i \asymp n (nr)^{i + h_* - 2}$. If $n_{\min} \asymp n$ and $nr \le g$, where $g$ is a fixed constant $g < 1$, then $i \hat\lambda_i \ge (i + 1) \hat\lambda_{i+1}$, $\forall i$.

Proof. We have $i\lambda_i = \sum_{k=1}^{n} E(I_k \mathbf{1}_{\{Z_k = i\}})$. One can easily show that the leading terms in the expression of $i\lambda_i$ are those which are expectations of products of $i$ consecutive indicators. For a term with $i$ consecutive indicators, of the form

$$E(I_j \cdots I_{j + i - 1}) = \frac{1}{|\Lambda|} \sum_{\ell \in \Lambda} E(I_j \cdots I_{j + i - 1} \mid L = \ell),$$

we have that for each $\ell$ the extreme clusters intersect (because $j + i - 1 \le j + h_j(\ell) - 2$, as $i < h_* \le h_j(\ell)$, $\forall j$) and hence

$$P(S_j + \cdots + S_{j + i - 1 + h_{j + i - 1}(\ell) - 2} \le r) \le E(I_j \cdots I_{j + i - 1} \mid L = \ell) \le E(I_j I_{j + i - 1} \mid L = \ell),$$

implying that $E(I_j \cdots I_{j + i - 1} \mid L = \ell) \asymp (nr)^{i + h_{j + i - 1}(\ell) - 2}$.

Using (6) we obtain $E(I_j \cdots I_{j + i - 1}) \asymp (nr)^{i + h_* - 2}$, $\forall j$.

The results in the statement easily follow. □

For the detailed proofs of Proposition 3 and Lemma 4, see Grusea (2008).

From Proposition 3 and Lemma 6, together with Theorem 1 and relation (4), we obtain the following upper bound on the error of approximating $P(W_m \ge 1)$ by $p = 1 - \exp\{-\sum_{i=1}^{h_* - 1} \hat\lambda_i\}$.

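Theorem 7. Suppose that $n \to \infty$, $r \to 0$ and $n_{\min} \asymp n$. Then, uniformly in $\frac{1}{n} \le nr < 1$ and $n > 2(2h_* + h^* - 4) \vee \exp\left\{ \frac{4(2h_* + h^* - 4)}{3(h_* - 1) + h^*} \right\}$, we have:

$$|P(W_m \ge 1) - p| \le C \frac{n (nr)^{2(h_* - 1)}}{[(h_* - 1)!]^2} \left( 1 + O\left(\frac{1}{n}\right) + O(nr) \right),$$

where $C = 4h^* - h_* - 6 + (h_* - 1)\{2h_* + h^* - 5 + 2^{h_* - 2}(h_* + h^* - 4)\} + (h_* - 2) 2^{2h_* - 6}$.

Moreover, if $E(W_m) = p_1$ is held constant when $n \to \infty$, then

$$|P(W_m \ge 1) - p| = O\left(\frac{1}{n}\right).$$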

3.2. The computation of the parameters

In practice, based on the fact that the leading terms in the expression of $\hat\lambda_i$ are those containing products of $i$ consecutive indicators, and using a "Markovian" approximation, we make the following approximation for the parameters:

$$\hat\lambda_i \approx n p q^{i-1} (1 - q)^2, \quad \text{for } i = 2, \dots, h_* - 1,$$

$$\hat\lambda_1 = np - \sum_{i=2}^{h_* - 1} i \hat\lambda_i,$$

where $p := P(A_1)$ and $q := P(A_2 \mid A_1)$.

For computing $p$ we sum over all labelings $\ell$:

$$p = \frac{1}{|\Lambda|} \sum_{\ell \in \Lambda} P(S_{1, h_1(\ell) - 1} \le r),$$

where $P(S_{1, h_1(\ell) - 1} \le r)$ is given by a Beta distribution function (Glaz et al., 1983; Pyke, 1965). Note that it suffices to sum only over all different possible $(\ell_1, \dots, \ell_{h^*})$.

We compute $q = P(A_2 \mid A_1) = P(A_1 \cap A_2)/p$ in a similar way. We have

$$P(A_1 \cap A_2) = \frac{1}{|\Lambda|} \sum_{\ell \in \Lambda} P(A_1 \cap A_2 \mid L = \ell) = \frac{1}{|\Lambda|} \sum_{\ell \in \Lambda} P(S_{1, h_1(\ell) - 1} \le r, S_{2, h_2(\ell) - 1} \le r).$$

In this case it suffices to sum over all different $(\ell_1, \dots, \ell_{h^* + 1})$.

For calculating $P(S_{1, h_1(\ell) - 1} \le r)$ and $P(S_{1, h_1(\ell) - 1} \le r, S_{2, h_2(\ell) - 1} \le r)$ we use classic results on uniform spacings (Glaz et al., 1983).
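Once $P(A_1)$ and $P(A_2 \mid A_1)$ are available, the Markovian approximation of this subsection reduces to a few lines. The sketch below takes these two probabilities as inputs (their computation from uniform spacings is not reproduced here); the function names and the numerical values are hypothetical.

```python
import math
from typing import List

def markov_parameters(n: int, p: float, q: float, h_low: int) -> List[float]:
    """Approximate (lam_hat_1, ..., lam_hat_{h_* - 1}).

    p = P(A_1), q = P(A_2 | A_1), h_low = h_*.
    lam_hat_i ~ n p q^(i-1) (1-q)^2 for i = 2..h_*-1, and
    lam_hat_1 = n p - sum_{i>=2} i * lam_hat_i.
    """
    tail = [n * p * q ** (i - 1) * (1.0 - q) ** 2 for i in range(2, h_low)]
    lam_1 = n * p - sum(i * l for i, l in enumerate(tail, start=2))
    return [lam_1] + tail

def approx_cluster_probability(n: int, p: float, q: float, h_low: int) -> float:
    """Compound Poisson approximation 1 - exp(-sum_i lam_hat_i) of P(W_m >= 1)."""
    return 1.0 - math.exp(-sum(markov_parameters(n, p, q, h_low)))

# Hypothetical inputs, chosen only to exercise the formulas.
lam_hat = markov_parameters(n=100, p=1e-4, q=0.3, h_low=4)
approx = approx_cluster_probability(n=100, p=1e-4, q=0.3, h_low=4)
```

By construction $\sum_i i \hat\lambda_i = np$, so the approximating distribution keeps the mean $E(W_m) = np$ of the circular case.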

3.3. The linear case

Next we briefly consider the case of a linear genome. As in the circular case, we see the genome B as the interval $[0, 1]$ and the positions of the $n$ orthologs as i.i.d. r.v.'s uniformly distributed on $[0, 1]$. The events $A_k$ are defined as before, but in this case we have a smaller number of possible events, precisely $n - h_* + 1$.

We also have a boundary effect, which consists in the fact that for $k = n - h^* + 2, \dots, n - h_* + 1$ the events $A_k$ have a smaller probability.

Similarly to the circular case, we approximate the distribution of the number of clusters of type $(h : r)$ in the genome B, $W_m := \sum_{k=1}^{n - h_* + 1} \mathbf{1}_{A_k}$, by a compound Poisson distribution of parameter $\hat\lambda = \sum_{i=1}^{h_* - 1} \hat\lambda_i \delta_i$, where

$$\hat\lambda_i = \frac{1}{i} \sum_{k=1}^{n - h_* + 1} E(I_k \mathbf{1}_{\{Z_k = i\}}), \quad i = 2, \dots, h_* - 1, \qquad \hat\lambda_1 = E(W_m) - \sum_{i=2}^{h_* - 1} i \hat\lambda_i$$

and $Z_k = \sum_{j = k - h_* + 2}^{k + h_* - 2} \mathbf{1}_{A_j}$. Notice that Theorem 7 is valid in this case, too.

For the computation of the parameters we ignore the boundary effects and use a Markovian approximation as in the circular case. Note that, based on Assumption 1 (see also the relations (6) and (9)), the error introduced in the computation of the parameters by ignoring the boundary effects is negligible.

4. NUMERICAL RESULTS AND DISCUSSION

We denote by $\phi' := (\phi'_1, \dots, \phi'_J)$ the vector containing all the distinct values $\phi'_1 < \cdots < \phi'_J$ among the sizes of the multigene families in the genome B, and we denote by $g := (g_1, \dots, g_J)$ the vector containing their multiplicities. We present two sets of numerical results for our compound Poisson approximation in the two tables below: Table 1 gives the results for the circular case and Table 2 for the linear case.

We have selected values for $\phi'$, $g$, $h$, $r$ which are interesting in practice, for our biological purpose of statistically testing the significance of gene clusters found by the reference region approach. In both tables, $p$ is our compound Poisson approximation for the probability of interest $P(W_m(h, r) \ge 1)$ and $\hat p_{MC} \pm \varepsilon$ is a Monte Carlo estimate, based on $10^6$ simulations, of the 95% confidence interval for the same probability. We have estimated $\varepsilon$ using the Central Limit Theorem.
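A Monte Carlo estimate with a confidence half-width based on the Central Limit Theorem can be sketched as follows. This is not the authors' simulation code: the function names, the seed handling, the normal quantile `z = 1.96` and the restriction of `has_cluster` to the one-to-one circular case (all labels equal to 1) are assumptions of this illustration.

```python
import math
import random
from typing import Callable, Tuple

def monte_carlo_ci(event: Callable[[random.Random], bool], n_sim: int,
                   seed: int = 0, z: float = 1.96) -> Tuple[float, float]:
    """Return (p_hat, eps): the empirical frequency of the event and the
    half-width z * sqrt(p_hat (1 - p_hat) / n_sim) of a normal-approximation interval."""
    rng = random.Random(seed)
    hits = sum(event(rng) for _ in range(n_sim))
    p_hat = hits / n_sim
    eps = z * math.sqrt(p_hat * (1.0 - p_hat) / n_sim)
    return p_hat, eps

def has_cluster(rng: random.Random, n: int = 100, h: int = 8, r: float = 0.01) -> bool:
    """One-to-one orthology on the circle: is there a k such that h consecutive
    orthologs, starting from the k-th, fit in an arc of length r?"""
    u = sorted(rng.random() for _ in range(n))
    return any((u[(k + h - 1) % n] - u[k]) % 1.0 <= r for k in range(n))

# Small illustrative run; the tables in the text rely on 10^6 simulations.
p_hat, eps = monte_carlo_ci(has_cluster, n_sim=2000)
```

The number of simulations drives the half-width, which decreases like $1/\sqrt{n_{sim}}$.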

Notice that, although Theorem 7 does not apply very well for these selected values and the theoretical bound given by the theorem is poor, the numerical results are very satisfactory.

We present here numerical results only for some selected values of $\phi'$, $g$, $h$, $r$, but the approximation remains very good for a wide range of values of the parameters.

Note that in the case of a one-to-one orthology mapping ($\phi_i = 1$, $\forall i$) the weight of an interval is exactly the number of orthologs it contains. The probability $P(W_m(h, r) \ge 1)$ can then be expressed in terms of the distribution of a continuous conditional scan statistic, for which many approximations exist (Glaz, 2001) and also an exact (even if quite computationally demanding) expression (Huntington et al., 1975).

However, in the more general case that we treat in this article, trying to find an exact expression for this probability by using the method in Huntington et al. (1975) seems very difficult.

In biological applications, $h$ and $r$ will be, respectively, the weight and the (normalized) length of a given observed orthologous cluster in the genome B, whose significance we want to test. The so-called p-value of this cluster is exactly the probability $P(W_m(h, r) \ge 1)$, for which we give here a compound Poisson approximation. The observed cluster is significant, and hence interesting from the biological point of view, provided its p-value is smaller than a fixed threshold (0.01 for example).

Table 1. Numerical Results for the Circular Case

(φ', g, h, r)                                      p̂_MC ± ε             p
(1, 100, 8, 0.01)                                  0.0053 ± 0.000146    0.0053
(1, 100, 8, 0.012)                                 0.0150 ± 0.000245    0.0153
((1, 2), (100, 10), 8, 0.01)                       0.0061 ± 0.000156    0.0060
((1, 2), (100, 10), 8, 0.012)                      0.0174 ± 0.000264    0.0178
((1, 2, 3), (100, 15, 5), 8, 0.01)                 0.0070 ± 0.000167    0.0071
((1, 2, 3), (100, 15, 5), 8, 0.02)                 0.0208 ± 0.000288    0.0212
((1, 2, 3, 4), (100, 15, 5, 3), 8, 0.01)           0.0072 ± 0.000170    0.0073
((1, 2, 3, 4), (100, 15, 5, 3), 8, 0.012)          0.0219 ± 0.000296    0.0222
((1, 2, 3, 4, 5), (100, 15, 5, 3, 2), 8, 0.01)     0.0071 ± 0.000169    0.0075
((1, 2, 3, 4, 5), (100, 15, 5, 3, 2), 8, 0.012)    0.0219 ± 0.000296    0.0227

The vector φ' contains all the distinct sizes of the multigene families in the genome B and the vector g contains their multiplicities. p is our compound Poisson approximation for P(W_m(h, r) ≥ 1) and p̂_MC ± ε is a 95% confidence interval for the same probability based on 10⁶ simulations.

Table 2. Numerical Results for the Linear Case

(φ', g, h, r)                                      p̂_MC ± ε             p
(1, 100, 8, 0.01)                                  0.0052 ± 0.000144    0.0052
(1, 100, 8, 0.012)                                 0.0146 ± 0.000242    0.0151
((1, 2), (100, 10), 8, 0.01)                       0.0060 ± 0.000155    0.0060
((1, 2), (100, 10), 8, 0.012)                      0.0173 ± 0.000263    0.0176
((1, 2, 3), (100, 15, 5), 8, 0.01)                 0.0068 ± 0.000165    0.0070
((1, 2, 3), (100, 15, 5), 8, 0.02)                 0.0203 ± 0.000285    0.0211
((1, 2, 3, 4), (100, 15, 5, 3), 8, 0.01)           0.0072 ± 0.000170    0.0073
((1, 2, 3, 4), (100, 15, 5, 3), 8, 0.012)          0.0216 ± 0.000294    0.0220
((1, 2, 3, 4, 5), (100, 15, 5, 3, 2), 8, 0.01)     0.0073 ± 0.000171    0.0074
((1, 2, 3, 4, 5), (100, 15, 5, 3, 2), 8, 0.012)    0.0217 ± 0.000295    0.0226

The vector φ' contains all the distinct sizes of the multigene families in the genome B and the vector g contains their multiplicities. p is our compound Poisson approximation for P(W_m(h, r) ≥ 1) and p̂_MC ± ε is a 95% confidence interval for the same probability based on