An algorithmic view of gene teams


HAL Id: hal-00619206

https://hal-upec-upem.archives-ouvertes.fr/hal-00619206

Submitted on 5 Sep 2011


Marie-Pierre Béal, Anne Bergeron, Sylvie Corteel, Mathieu Raffinot

To cite this version:

Marie-Pierre Béal, Anne Bergeron, Sylvie Corteel, Mathieu Raffinot. An algorithmic view of gene teams. Theoretical Computer Science, Elsevier, 2004, 320 (2-4), pp. 395-418. hal-00619206

Marie-Pierre Béal*, Anne Bergeron†, Sylvie Corteel‡, Mathieu Raffinot§

Abstract

Comparative genomics is a growing field in computational biology, and one of its typical problems is the identification of sets of orthologous genes that have virtually the same function in several genomes. Many different bioinformatics approaches have been proposed to define these groups, often based on the detection of sets of genes that are "not too far" in all genomes. In this paper, we propose a unifying concept, called gene teams, which can be adapted to various notions of distance. We present two algorithms for identifying gene teams formed by $n$ genes placed on $m$ linear chromosomes. The first one runs in $O(mn \log^2 n)$ time and uses a divide-and-conquer approach based on the formal properties of gene teams. We next propose an optimization of the original algorithm and, in order to better understand the complexity bound of the algorithms, we recast the problem in Hopcroft's partition refinement framework. This allows us to analyze the complexity of the algorithms with elegant amortized techniques. Both algorithms require linear space. We also discuss extensions to circular chromosomes that achieve the same complexity.

Résumé

Genome comparison is a growing field in computational biology, and one of its typical problems is the identification of sets of orthologous genes that have virtually the same function in several genomes. Several distinct bioinformatics approaches have been proposed to define these groups. They are often based on the detection of sets of genes that are not "too far apart" in all the genomes considered. In this article, we propose a unifying concept, called gene teams, which can be adapted to different notions of distance. We present two algorithms to identify the gene teams formed by $n$ genes located on $m$ linear chromosomes. The first has time complexity $O(mn \log^2 n)$ and uses a divide-and-conquer approach based on formal properties of gene teams. We then propose an optimization of this algorithm and, in order to better understand the bound on its complexity, we recast the problem in the setting of Hopcroft's partition refinement scheme. This allows us to analyze the complexity with more elegant amortized-complexity techniques. Both algorithms have linear space complexity. We also consider extensions to the case of circular chromosomes, which have the same complexity.



* Institut Gaspard-Monge, Université de Marne-la-Vallée, Cité Descartes, Champs-sur-Marne, 77454 Marne-la-Vallée Cedex 2, France. Marie-Pierre.Beal@univ-mlv.fr

† LaCIM, Université du Québec à Montréal, Canada. anne@lacim.uqam.ca

‡ CNRS - Laboratoire PRiSM, Université de Versailles, 45 Avenue des Etats-Unis, 78035 Versailles cedex, France. E-mail: syl@prism.uvsq.fr

§ CNRS - Laboratoire Génome et Informatique, Tour Evry 2, 523, Place des Terrasses de l'Agora, 91034 Evry, France. raffinot@genopole.cnrs.fr

1 Introduction

In the last few years, research in genomics science has evolved rapidly. More and more complete genomes are now available due to the development of semi-automatic sequencer machines. Many of these sequences, particularly prokaryotic ones, are well annotated: the positions of their genes are known, and sometimes parts of their regulation or metabolic pathways.

A new computational challenge is to extract gene or protein knowledge from high-level comparison of genomes. For example, the knowledge of sets of orthologous or paralogous genes on different genomes helps to infer putative functions from one genome to the other. Many researchers have explored this avenue, trying to identify groups or clusters of orthologous genes that have virtually the same function in several genomes [1, 6, 5, 7, 9, 11, 13, 14, 15, 18]. These studies are often based on a simple, but biologically verified fact: proteins that interact are often coded by genes closely placed in the genomes of different species. With the knowledge of the positions of genes, it becomes possible to automate the identification of groups of closely placed genes in several genomes. For a more complete, biologically oriented discussion of these groups of genes, we refer the reader to [12].

From an algorithmic and combinatorial point of view, the formalizations of the concept of closely placed genes are still fragmentary, and sometimes confusing. The distance between genes is variously defined as differences between physical locations on a chromosome, distance from a specified target, or as a discrete count of intervening actual or predicted genes. The algorithms often lack the necessary grounds to prove their correctness, or assess their complexity. This paper contributes to a research movement of clarification of these notions. We aim to formalize, in the simplest and most comprehensive ways, the concepts underlying the notion of distance-based clusters of genes. We can then make use of these concepts, and their formal properties, to design sound and efficient algorithms.

A first step in that direction was taken in [9, 19] with the concept of common intervals. A common interval is a set of orthologous genes that appear consecutively, possibly in different orders, on a chromosome of two or more species. This concept covers the simplest cases of sets of closely placed genes, but does not take into account the nature of the gaps between genes. Common intervals can be defined on chromosomes with paralogous genes, that is, each gene could have multiple locations on the chromosomes. However, the algorithms in [9, 19] are designed only for the case where each gene occurs once on each chromosome.

In this paper, we extend this notion by relaxing the "consecutive" constraint. We assume that each gene occurs once on each chromosome. We allow genes to be separated by gaps that do not exceed a fixed threshold. We develop a simple formal setting for these concepts, and give two polynomial algorithms that detect maximal sets of closely placed genes, called gene teams, in $m$ chromosomes. Note that we focus in this paper on the algorithmic part of the gene team concept. A complete study validating this model from a biological point of view is available in [12], and the results concerning the divide-and-conquer algorithm were announced in [3].

The first algorithm refines the partitions induced by gene chains of two or more chromosomes. It uses a divide-and-conquer approach based on the existence of small classes of the partitions. The apparent simplicity hides a complex underlying problem that first appeared in the non-trivial complexity of this first algorithm.

Next, in order to better understand the complexity bounds, and analysis, of this algorithm, we recast the problem in Hopcroft's partition refinement framework [10], which covers a wide range of applications [8, 16]. We develop a new algorithm based on the first [...] Hopcroft-like algorithm. The close links between the two algorithms allow us to derive an elegant complexity analysis, based on amortized techniques, which is much more intuitive than the equational approach. Moreover, the fact that Hopcroft-like algorithms have been extensively studied confirms the intrinsic difficulties of the gene teams identification problem.

This paper is organized as follows. In Section 2, we formalize the concept of gene teams, which unifies most of the current approaches, and discuss their basic properties. In Section 3 we present two algorithms that identify the gene teams of two chromosomes. The links between Hopcroft's partitioning framework and gene teams identification are explored in Section 4. Finally, in Section 5, we extend our algorithms to $m$ chromosomes, and to circular chromosomes. An extended abstract of this paper appeared in [3].

2 Gene Teams and their Properties

Many of the following definitions refer to sets of genes and chromosomes. These are biological concepts whose definitions are outside the scope of this paper. However, we will assume some elementary formal properties relating genes and chromosomes: a chromosome is an ordering device for genes that belong to it, and a gene can belong to several chromosomes. If a gene belongs to a chromosome, we assume that its position is known, and unique.

2.1 Definitions and Examples

Let $\Sigma$ be a set of $n$ genes that belong to a chromosome $C$, and let $P_C$ be a function

$$P_C : \Sigma \to \mathbb{R}$$

that associates to each gene $g$ in $\Sigma$ a real number $P_C(g)$, called its position.

Functions of this type are quite general, and cover a wide variety of applications. The position can be, as in [14, 15, 11], the physical location of an actual sequence of nucleotides on a chromosome. In more qualitative studies, such as [1, 13], the positions are positive integers reflecting the relative ordering of genes in a given set. In other studies [5], positions are both negative and positive numbers computed in relation to a target sequence.

The function $P_C$ induces a permutation on any subset $S$ of $\Sigma$, ordering the genes of $S$ from the gene of lowest position to the gene of highest position. We will denote the permutation corresponding to the whole set $\Sigma$ by $\pi_C$. If $g$ and $g'$ are two genes in $\Sigma$, their distance $\Delta_C(g, g')$ in chromosome $C$ is given by $|P_C(g') - P_C(g)|$.

For example, if $\Sigma = \{a, b, c, d, e\}$, consider the following chromosome $X$, in which genes not in $\Sigma$ are identified by the star symbol:

    X = c * * e d a * b.

Define $P_X(g)$ as the number of genes appearing to the left of $g$. Then $\Delta_X(c, d) = |P_X(d) - P_X(c)| = 4$, $\pi_X = (c\ e\ d\ a\ b)$, and the permutation induced on the subset $\{a, c, e\}$ is $(c\ e\ a)$.

Definition 1 Let $S$ be a subset of $\Sigma$, and $(g_1 \ldots g_k)$ be the permutation induced on $S$ on a given chromosome $C$. For $\delta > 0$, the set $S$ is called a $\delta$-chain of chromosome $C$ if $\Delta_C(g_j, g_{j+1}) \le \delta$ for $1 \le j < k$.

[...] elements in the permutation $(c\ e\ a)$ is distant by less than $\delta$.

We will also refer to maximal $\delta$-chains with respect to the partial order induced on the subsets by the inclusion relation. For example, with $\delta = 2$, the maximal $\delta$-chains of $X$ are $\{c\}$ and $\{a, b, d, e\}$. Note that singletons are always $\delta$-chains, regardless of the value of $\delta$.
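The maximal $\delta$-chains of a single chromosome can be read off in one left-to-right pass over the genes sorted by position. The following is a minimal sketch, assuming positions are given as a dictionary; the function name `maximal_delta_chains` and the dictionary encoding are illustrative choices, not part of the paper.

```python
def maximal_delta_chains(positions, delta):
    """Split the genes of one chromosome into maximal delta-chains.

    positions: dict mapping gene -> position (one position per gene).
    Returns a list of chains, each a list of genes in increasing position.
    """
    ordered = sorted(positions, key=positions.get)
    chains = []
    for gene in ordered:
        # Extend the current chain while the gap to the previous gene is <= delta.
        if chains and positions[gene] - positions[chains[-1][-1]] <= delta:
            chains[-1].append(gene)
        else:
            chains.append([gene])
    return chains


# Chromosome X = c * * e d a * b, position = number of genes to the left.
X = {"c": 0, "e": 3, "d": 4, "a": 5, "b": 7}
print(maximal_delta_chains(X, 2))  # [['c'], ['e', 'd', 'a', 'b']]
```

With $\delta = 2$ this reproduces the two maximal chains $\{c\}$ and $\{a, b, d, e\}$ of the example.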

Definition 2 A subset $S$ of $\Sigma$ is a $\delta$-set of chromosomes $C$ and $D$ if $S$ is a $\delta$-chain both in $C$ and $D$. A $\delta$-team of the chromosomes $C$ and $D$ is a maximal $\delta$-set with respect to inclusion. A $\delta$-team with only one element is called a lonely gene.

Consider, for example, the two chromosomes:

    X = c * * e d a * b
    Y = a b * * * c * d e.

For $\delta = 3$, $\{d, e\}$ and $\{c, d, e\}$ are $\delta$-sets, but not $\{c, d\}$, since the latter is not a $\delta$-chain in $X$. The $\delta$-teams of $X$ and $Y$, for values of $\delta$ from 1 to 4, are given in the following table.

    δ   δ-teams              Lonely genes
    1   {d, e}               {a}, {b}, {c}
    2   {a, b}, {d, e}       {c}
    3   {a, b}, {c, d, e}
    4   {a, b, c, d, e}
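The table can be checked by brute force on such a small example. The sketch below enumerates every subset, keeps the $\delta$-sets, and returns the maximal ones. It is exponential and only meant as a check of the definitions; the names are hypothetical, and the chromosome $Y$ is taken here as a b * * * c * d e, which is an assumed reading of the example that agrees with the table.

```python
from itertools import combinations


def is_delta_chain(genes, positions, delta):
    """True if consecutive genes (by position) are at distance <= delta."""
    ordered = sorted(positions[g] for g in genes)
    return all(b - a <= delta for a, b in zip(ordered, ordered[1:]))


def brute_force_teams(pos_x, pos_y, delta):
    """Return the delta-teams as sorted lists, by exhaustive enumeration."""
    genes = sorted(pos_x)
    delta_sets = [
        frozenset(s)
        for size in range(1, len(genes) + 1)
        for s in combinations(genes, size)
        if is_delta_chain(s, pos_x, delta) and is_delta_chain(s, pos_y, delta)
    ]
    # A team is a delta-set that is not strictly contained in another delta-set.
    teams = [s for s in delta_sets if not any(s < t for t in delta_sets)]
    return sorted(sorted(t) for t in teams)


X = {"c": 0, "e": 3, "d": 4, "a": 5, "b": 7}  # X = c * * e d a * b
Y = {"a": 0, "b": 1, "c": 5, "d": 7, "e": 8}  # assumed: Y = a b * * * c * d e
for delta in range(1, 5):
    print(delta, brute_force_teams(X, Y, delta))
```

Lonely genes appear in the output as teams of size one.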

Note that two gene teams can overlap. For instance, if X = a c b d, Y = a b * * c d and $\delta = 2$, then $\{a, b\}$ and $\{c, d\}$ are two overlapping gene teams.

Our goal is to develop algorithms for the efficient identification of gene teams. The main pitfalls are illustrated in the next two examples.

The intersection of $\delta$-chains is not always a $\delta$-set. A naive approach to construct $\delta$-sets is to identify maximal $\delta$-chains in each sequence, and intersect them. Although this works on some examples, the approach does not hold in the general case. For example, in the chromosomes:

    X = a b c
    Y = a c * b,

with $\delta = 1$, the maximal $\delta$-chain of $X$ is $\{a, b, c\}$, and the maximal $\delta$-chains of $Y$ are $\{a, c\}$ and $\{b\}$. But $\{a, c\}$ is not a $\delta$-team.

Gene teams cannot be grown from smaller $\delta$-sets. A typical approach for constructing maximal objects is to start with initial objects that have the desired property, and cluster them with a suitable operation. For gene teams, the singletons are perfect initial objects, but there is no obvious operation that, applied to two small $\delta$-sets, produces a bigger $\delta$-set. Consider the following chromosomes:

    X = a b c d
    Y = c a d b.

Here, with $\delta = 1$, the only $\delta$-sets are the singletons and the whole set $\{a, b, c, d\}$. In general, it is possible to construct pairs of chromosomes with an arbitrary number of genes, such that the only $\delta$-sets are the singletons and the whole set. For example, consider the following chromosomes, in which the genes are represented by numbers in order to illustrate the construction:

    X = 1 2 3 ... ... 2k
    Y = 2 4 6 ... 2k 1 3 5 ... 2k-1.

For $\delta = 1$, any $\delta$-set larger than a singleton must contain both odd and even genes because they alternate in chromosome $X$, but any $\delta$-chain in $Y$ that contains odd and even genes must contain genes 1 and $2k$, implying that the only team with more than one gene is the whole set.
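The odd/even construction can be checked exhaustively for a small $k$. The sketch below builds the two chromosomes and lists the sizes of all $\delta$-sets for $\delta = 1$. The helper names `hard_pair` and `delta_sets` are hypothetical, and the enumeration is exponential, so it is only usable for a handful of genes.

```python
from itertools import combinations


def is_delta_chain(genes, positions, delta):
    ordered = sorted(positions[g] for g in genes)
    return all(b - a <= delta for a, b in zip(ordered, ordered[1:]))


def hard_pair(k):
    """X = 1 2 ... 2k and Y = 2 4 ... 2k 1 3 ... 2k-1, as gene -> position."""
    x_order = list(range(1, 2 * k + 1))
    y_order = list(range(2, 2 * k + 1, 2)) + list(range(1, 2 * k, 2))
    pos_x = {g: i for i, g in enumerate(x_order)}
    pos_y = {g: i for i, g in enumerate(y_order)}
    return pos_x, pos_y


def delta_sets(pos_x, pos_y, delta):
    """All non-empty subsets that are delta-chains in both chromosomes."""
    genes = sorted(pos_x)
    return [
        set(s)
        for size in range(1, len(genes) + 1)
        for s in combinations(genes, size)
        if is_delta_chain(s, pos_x, delta) and is_delta_chain(s, pos_y, delta)
    ]


pos_x, pos_y = hard_pair(4)
sizes = sorted(len(s) for s in delta_sets(pos_x, pos_y, 1))
print(sizes)  # eight singletons, then the whole set of size 8
```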

Instead of growing teams from smaller $\delta$-sets, we will extract them from larger sets that contain only teams. This leads to the following definition:

Definition 3 A $\delta$-league of chromosomes $C$ and $D$ is a union of $\delta$-teams of the chromosomes $C$ and $D$.

As the two last examples show, the combinatorial properties of $\delta$-sets are not elementary, and we need to establish them in order to develop and prove our algorithms.

2.2 Properties of $\delta$-sets and teams.

The first crucial property of $\delta$-teams is that they form a partition of the set of genes $\Sigma$. It is a consequence of the following lemma:

Lemma 1 If $S$ and $T$ are two $\delta$-chains of chromosome $C$, and $S \cap T \neq \emptyset$, then $S \cup T$ is also a $\delta$-chain.

Proof: Consider the permutation induced on the set $S \cup T$, and let $g$ and $g'$ be two consecutive elements in the permutation. If $g$ and $g'$ both belong to $S$ (or to $T$), then they are consecutive in the permutation induced by $S$ (or by $T$), and $\Delta(g, g') \le \delta$. If $g$ is in $S$ but not in $T$, and $g'$ is in $T$ but not in $S$, then either $g$ is between two consecutive elements of $T$, or $g'$ is between two consecutive elements of $S$. Otherwise, the two sets $S$ and $T$ would have an empty intersection. If $g$ is between two consecutive elements of $T$, for example, then one of them is $g'$, implying $\Delta(g, g') \le \delta$.

We now have easily:

Proposition 1 For a given set of genes $\Sigma$, the $\delta$-teams of chromosomes $C$ and $D$ form a partition of the set $\Sigma$.

Proof: Since any singleton of $\Sigma$ is a $\delta$-set, any gene of $\Sigma$ belongs to a $\delta$-team. If the intersection of two different $\delta$-teams $T_1$ and $T_2$ is not empty, then the intersection of the two underlying $\delta$-chains is empty neither in $C$ nor in $D$; therefore their union is also a $\delta$-chain in both sequences, implying that $T_1 \cup T_2$ is a $\delta$-set, and contradicting the maximality of $T_1$ and $T_2$.

Corollary 1 [...] $\delta$-team.

Proof: Since the maximal $\delta$-sets form a partition of $\Sigma$, any $\delta$-set is contained in a unique $\delta$-team.

The algorithms described in the next section work on leagues, splitting them while ensuring that a league is split in smaller leagues. The process stops when each league is a $\delta$-set. Corollary 1 provides a simple proof that such an algorithm correctly identifies the teams. The next proposition gives the "initial" leagues for the first algorithm.

Proposition 2 Any maximal $\delta$-chain of $C$ or of $D$ is a league.

Proof: First observe that the set of maximal $\delta$-chains in a chromosome also forms a partition of $\Sigma$. Therefore, any $\delta$-chain is included in a unique maximal $\delta$-chain. If $T$ is a team of $C$ and $D$, then $T$ is a $\delta$-chain in both chromosomes, thus $T$ is included in a single maximal chain in both chromosomes.

3 Algorithms to Find Gene Teams

It is quite straightforward to develop $O(n^2)$ algorithms that find gene teams in two chromosomes. In the following subsection, we present some of the pitfalls of naive approaches to partition refinement that can lead to an $O(n^2)$ worst-case scenario. However, since the ultimate goal is to be able to upgrade the definitions and algorithms to more than two chromosomes, such a threshold is too high. In Section 3.2, we develop an $O(n \log^2 n)$ algorithm, whose complexity is analysed in Section 3.3. We then propose in Section 3.4 an optimization of the first algorithm, reducing its time complexity to $O(n \log n \log \delta')$, where $\delta'$ is, for all practical purposes, a small constant.

3.1 Partition Refinement

Assume that we are given two permutations on $\Sigma$, $\pi_C$ and $\pi_D$, each already partitioned into maximal $\delta$-chains of chromosomes $C$ and $D$:

$$\pi_C = (c_1 \ldots c_{k_1})(c_{k_1+1} \ldots c_{k_2}) \ldots (c_{k_s+1} \ldots c_n)$$
$$\pi_D = (d_1 \ldots d_{l_1})(d_{l_1+1} \ldots d_{l_2}) \ldots (d_{l_t+1} \ldots d_n).$$

Let $(c_i \ldots c_j)$ be one of the classes of the partition of $\pi_C$; by Proposition 2, $(c_i \ldots c_j)$ is a league. Our goal is to split this class in $v$ subclasses $S_1, \ldots, S_v$ such that: a) each subclass is a league; b) each subclass is a $\delta$-chain in $C$; and c) each subclass is contained in one of the classes of $\pi_D$.

Consider, for example, the following two chromosomes, in which we identified the genes as numbers, and $k \ge 1$:

    X = (3 1 5 2 7 4 9 ... 2k+1 2k-2 2k+3 2k) (2k+2)
    Y = (1 2 3 4 5 ... 2k+1 2k+2 2k+3).

[...] can observe that genes $2k+2$ and $2k+3$ must be isolated in both partitions. But the resulting problem

    X' = (3 1 5 2 7 4 9 ... 2k+1 2k-2) (2k+3) (2k) (2k+2)
    Y' = (1 2 3 4 5 ... 2k+1) (2k+2) (2k+3),

has the same format as the original one, showing that a bad choice of leagues to compare can yield $O(n)$ iterations of the process. This partition refinement approach has the drawback that big leagues must be read over and over again, in order to extract the small leagues that are buried in them. In the next section, we take the point of view of the small classes, and show that their extraction can be done efficiently.
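To make the cost of rereading big leagues concrete, here is a sketch of a naive refinement loop. It alternately splits every class into maximal $\delta$-chains on $C$ and then on $D$ until nothing changes. This is an illustration written for this discussion, not the algorithm of the paper: each round rescans all genes, and the number of rounds can grow linearly with $n$. The function names and the returned round counter are assumptions, and the input is assumed non-empty.

```python
def split_into_chains(genes, positions, delta):
    """Split a non-empty gene list into maximal delta-chains of one chromosome."""
    ordered = sorted(genes, key=positions.get)
    chains = [[ordered[0]]]
    for prev, gene in zip(ordered, ordered[1:]):
        if positions[gene] - positions[prev] <= delta:
            chains[-1].append(gene)
        else:
            chains.append([gene])
    return chains


def naive_teams(pos_c, pos_d, delta):
    """Refine the one-class partition until every class is a delta-set.

    Returns (teams, rounds). Every class stays a league, because a team is a
    delta-chain in both chromosomes and so is never cut by a split.
    """
    classes = [list(pos_c)]
    rounds = 0
    while True:
        rounds += 1
        refined = []
        for cls in classes:
            for chain_c in split_into_chains(cls, pos_c, delta):
                refined.extend(split_into_chains(chain_c, pos_d, delta))
        if len(refined) == len(classes):  # no class was split: fixed point
            return sorted(sorted(c) for c in refined), rounds
        classes = refined


X = {"c": 0, "e": 3, "d": 4, "a": 5, "b": 7}
Y = {"a": 0, "b": 1, "c": 5, "d": 7, "e": 8}
print(naive_teams(X, Y, 3))  # ([['a', 'b'], ['c', 'd', 'e']], 2)
```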

3.2 A Divide-and-Conquer Algorithm

The following algorithm to identify teams is a divide-and-conquer algorithm that works by extracting small leagues from larger ones. Its basic principle is described in the following paragraph.

Assume that $S$ is a league of chromosomes $C$ and $D$, and that the genes of $S$ are respectively ordered in $C$ and $D$ as:

$$(c_1 \ldots c_n) \quad \text{and} \quad (d_1 \ldots d_n).$$

By Proposition 1, if $S$ is a $\delta$-set, then $S$ is a $\delta$-team. If $S$ is not a $\delta$-set, there are at least two consecutive elements, say $c_i$ and $c_{i+1}$, that are distant by more than $\delta$. Therefore, both $(c_1 \ldots c_i)$ and $(c_{i+1} \ldots c_n)$ are leagues, splitting the initial problem in two sub-problems. The following two lemmas explain how to split a problem efficiently.

Lemma 2 If $S$ is a league, but not a team, of chromosomes $C$ and $D$, then there exists a sub-league of $S$ with at most $|S|/2$ genes.

Proof: Let $|S| = n$. If all sub-leagues of $S$ have more than $n/2$ genes, it follows that each team included in $S$ has more than $\lfloor n/2 \rfloor$ genes, and the intersection of two such teams cannot be empty.

The above lemma implies that if $S$ is a league, but not a team, and if the sequences $(c_1 \ldots c_n)$ and $(d_1 \ldots d_n)$ are the corresponding permutations in chromosomes $C$ and $D$, then there exists a value $p \le n/2$ such that at least one of the following sequences is a league:

$$(c_1 \ldots c_p), \quad (c_{n-p+1} \ldots c_n), \quad (d_1 \ldots d_p), \quad (d_{n-p+1} \ldots d_n).$$

For example, if

    X = a b c * d e f g
    Y = c a e d b g f

and $\delta = 1$, then $(a\ b\ c)$ is easily identified as a league, since the distance between $c$ and $d$ is greater than 1 in chromosome $X$. The next problem is to extract the corresponding permutation in chromosome $Y$. This is taken care of by the following lemma, which describes the behavior of the function "Extract($(c_1 \ldots c_p)$, $D$)":

Lemma 3 [...] genes ordered in increasing position in chromosome $C$, then the corresponding permutation $(d'_1 \ldots d'_p)$ on chromosome $D$ can be obtained in time $O(p \log p)$.

Proof: Given $(c_1 \ldots c_p)$, we first construct the array $A = (\pi_D^{-1}(c_1), \ldots, \pi_D^{-1}(c_p))$. Sorting $A$ requires $O(p \log p)$ operations, yielding the array $A'$. The sequence $(d'_1 \ldots d'_p)$ is given by $(\pi_D(A'_1) \ldots \pi_D(A'_p))$.
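The proof translates almost line by line into code. The sketch below assumes that $\pi_D$ is stored as a list indexed by rank and $\pi_D^{-1}$ as a dictionary from gene to rank; these encodings and the name `extract` are illustrative choices.

```python
def extract(league_c, pi_d, pi_d_inv):
    """Reorder the genes of league_c according to chromosome D.

    pi_d:     list, pi_d[r] is the gene of rank r on D.
    pi_d_inv: dict, pi_d_inv[g] is the rank of gene g on D.
    """
    ranks = sorted(pi_d_inv[g] for g in league_c)  # array A, sorted into A'
    return [pi_d[r] for r in ranks]                # the sequence (d'_1 ... d'_p)


pi_y = list("caedbgf")  # Y = c a e d b g f
pi_y_inv = {gene: rank for rank, gene in enumerate(pi_y)}
print(extract(["a", "b", "c"], pi_y, pi_y_inv))  # ['c', 'a', 'b']
```

The cost is dominated by sorting $p$ ranks, which matches the $O(p \log p)$ bound of the lemma.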

The last operation needed to split a league is to construct the ordered complement of an ordered league. For example, for the league $\pi_Y = (c\ a\ e\ d\ b\ g\ f)$, the complement of the league $(c\ a\ b)$ is the league $(e\ d\ g\ f)$.

More formally, if $(d'_1 \ldots d'_p)$ is a subsequence of $(d_1 \ldots d_n)$, we will denote by

$$(d_1 \ldots d_n) \setminus (d'_1 \ldots d'_p)$$

the subsequence of $(d_1 \ldots d_n)$ obtained by deleting the elements of $(d'_1 \ldots d'_p)$. In our particular context, this operation can be done in $O(p)$ steps. Indeed, once a problem is split in two sub-problems, there is no need to backtrack in the former problems. Therefore, at any point in the algorithm, each gene belongs to exactly two ordered leagues, one in each chromosome. If the gene data structure contains pointers to the previous and the following gene, if any, in both leagues, the structure can be updated in constant time as soon as an extracted gene is identified. Since $p$ genes are extracted, the operation can be done in $O(p)$ steps. An example of such an "extraction" operation is shown in Fig. 1.

Figure 1: Extraction of a league $P$ out of $D$. The initial problem on $(C, D)$ is split in two sub-problems on $(C', D')$ and $(C'', D'')$.

Fig. 2 contains the formal description of the algorithm FindTeams. The three cases that are not shown correspond to the tests $\Delta_C(c_{n-p}, c_{n-p+1}) > \delta$, $\Delta_D(d_p, d_{p+1}) > \delta$ and $\Delta_D(d_{n-p}, d_{n-p+1}) > \delta$, and are duplications of the first case, up to indices.

    FindTeams($(c_1 \ldots c_n)$, $(d_1 \ldots d_n)$)
    1.  SubLeagueFound ← False
    2.  p ← 1
    3.  While (not SubLeagueFound) and $p \le \lfloor n/2 \rfloor$ Do
    4.      If $\Delta_C(c_p, c_{p+1}) > \delta$ or $\Delta_C(c_{n-p}, c_{n-p+1}) > \delta$ or
    5.         $\Delta_D(d_p, d_{p+1}) > \delta$ or $\Delta_D(d_{n-p}, d_{n-p+1}) > \delta$ Then
    6.          SubLeagueFound ← True
    7.      Else p ← p + 1
    8.      End of if
    9.  End of while
    10. If SubLeagueFound Then
    11.     If $\Delta_C(c_p, c_{p+1}) > \delta$ Then
    12.         $(d'_1 \ldots d'_p)$ ← Extract($(c_1 \ldots c_p)$, $D$)
    13.         FindTeams($(c_1 \ldots c_p)$, $(d'_1 \ldots d'_p)$)
    14.         FindTeams($(c_{p+1} \ldots c_n)$, $(d_1 \ldots d_n) \setminus (d'_1 \ldots d'_p)$)
    15.     Else If ...
    16.         /* The three other cases are similar */
    17.     End of if
    18. Else $(c_1 \ldots c_n)$ is a Team
    19. End of if

Figure 2: Fast recursive algorithm for gene teams identification.

Theorem 1 On input $\pi_C$ and $\pi_D$, algorithm FindTeams correctly identifies the $\delta$-teams of chromosomes $C$ and $D$.

Proof: Since $\Sigma$ is a league, the first input to FindTeams will be a league. The correctness of the algorithm comes from the fact that if a league $S$ is supplied to the algorithm, then either $S$ is a $\delta$-team, which is the condition tested by the four tests within the loop of line 3, or it has a "small" sub-league, whose complement is also a league.

The space needed to execute algorithm FindTeams is easily seen to be $O(n)$, since it needs the four arrays containing $\pi_C$, $\pi_D$, $\pi_C^{-1}$, $\pi_D^{-1}$, and the $n$ genes, each with four pointers coding implicitly for the ordered leagues.
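The recursion of Fig. 2 can be sketched in a few lines, with all four cases written out. This version is a simplification under stated assumptions: leagues are plain Python lists, and both the extraction and the ordered complement are obtained by filtering the lists, which costs $O(n)$ per split instead of the $O(p \log p)$ and $O(p)$ of the text. It therefore illustrates the control flow and the correctness argument, not the complexity bound. The name `find_teams` and its signature are hypothetical.

```python
def find_teams(c_order, d_order, pos_c, pos_d, delta, teams=None):
    """c_order / d_order: the same league, sorted by position on C / on D."""
    if teams is None:
        teams = []
    n = len(c_order)
    for p in range(1, n // 2 + 1):
        # The four tests of lines 4-5: a gap larger than delta, p genes from an end.
        if pos_c[c_order[p]] - pos_c[c_order[p - 1]] > delta:
            small = c_order[:p]
        elif pos_c[c_order[n - p]] - pos_c[c_order[n - p - 1]] > delta:
            small = c_order[n - p:]
        elif pos_d[d_order[p]] - pos_d[d_order[p - 1]] > delta:
            small = d_order[:p]
        elif pos_d[d_order[n - p]] - pos_d[d_order[n - p - 1]] > delta:
            small = d_order[n - p:]
        else:
            continue
        inside = set(small)
        # Recurse on the small sub-league, then on its ordered complement.
        for keep in (True, False):
            find_teams([g for g in c_order if (g in inside) == keep],
                       [g for g in d_order if (g in inside) == keep],
                       pos_c, pos_d, delta, teams)
        return teams
    teams.append(sorted(c_order))  # no gap found: the league is a team
    return teams


X = {"c": 0, "e": 3, "d": 4, "a": 5, "b": 7}
Y = {"a": 0, "b": 1, "c": 5, "d": 7, "e": 8}
cx, cy = sorted(X, key=X.get), sorted(Y, key=Y.get)
print(sorted(find_teams(cx, cy, X, Y, 3)))  # [['a', 'b'], ['c', 'd', 'e']]
```

Because the recursion depth can reach $n$ when every split removes a single gene, an iterative formulation with an explicit stack would be preferable for large inputs.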

3.3 Time Complexity of Algorithm FindTeams

In the last section, we saw that algorithm FindTeams splits a problem of size $n$ in two similar problems of size $p$ and $n - p$, with $p \le n/2$. The number of operations needed to split the problem is $O(p \log p)$, but the value of $p$ is not fixed from one iteration to the other. In order to keep the formalism manageable, we will "... neglect certain technical details when we state and solve recurrences. A good example of a detail that is glossed over is the assumption of integer arguments to functions." [17], p. 53.

Assume that the number of operations needed to split the problem is bounded by $c\,p \log p$, and let $F(n)$ denote the number of operations needed to solve a problem of size $n$. Then $F(n)$ is bounded by the function $T(n)$ described by the following equation:

$$T(n) = \max_{1 \le p \le \lfloor n/2 \rfloor} \{\, c\,p \log p + T(p) + T(n-p) \,\}. \qquad (1)$$

[...] split in half. Indeed, we will show that $T(n)$ is equal to the function:

$$T_2(n) = c\,\frac{n}{2} \log \frac{n}{2} + 2\,T_2\!\left(\frac{n}{2}\right), \qquad (2)$$

with $T_2(1) = 1$. One direction is easy:

Lemma 4 $T(n) \ge T_2(n)$.

Proof: Suppose that $T(i) \ge T_2(i)$ for all $i < n$, then

$$\begin{aligned}
T(n) &\ge \max_{1 \le p \le n/2} \{\, c\,p \log p + T_2(p) + T_2(n-p) \,\} \\
     &\ge (cn/2) \log(n/2) + T_2(n/2) + T_2(n - n/2) \\
     &= T_2(n).
\end{aligned}$$

In order to show the converse, we first obtain a closed form for $T_2(n)$.

Lemma 5 $T_2(n) = n - (cn/4) \log n + (cn/4) \log^2 n$.

Proof: Substituting the value of $T_2(n/2)$ in the right-hand side of Equation (2), and using the identity $\log(n/2) = (\log n) - 1$, yields:

$$\begin{aligned}
T_2(n) &= (cn/2) \log(n/2) + 2\,[\, n/2 - (cn/8) \log(n/2) + (cn/8) \log^2(n/2) \,] \\
       &= n - (cn/4) \log n + (cn/4) \log^2 n.
\end{aligned}$$
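The relation between recurrence (1) and the closed form can be checked numerically for small sizes. The sketch below evaluates $T(n)$ by memoized recursion over integer split sizes and compares it with the closed form at powers of two. It assumes base-2 logarithms, the boundary value $T(1) = 1$ (the text only states $T_2(1) = 1$), and an arbitrary value $c = 1$ for the constant; all three are assumptions made for the check.

```python
from functools import lru_cache
from math import isclose, log2

C = 1.0  # hypothetical value of the constant c in the splitting cost


@lru_cache(maxsize=None)
def T(n):
    """Recurrence (1) over integer split sizes, with T(1) = 1 assumed."""
    if n == 1:
        return 1.0
    return max(C * p * log2(p) + T(p) + T(n - p) for p in range(1, n // 2 + 1))


def T2_closed(n):
    """Closed form for T_2(n): n - (cn/4) log n + (cn/4) log^2 n."""
    return n - (C * n / 4) * log2(n) + (C * n / 4) * log2(n) ** 2


for k in range(1, 8):
    n = 2 ** k
    print(n, T(n), T2_closed(n), isclose(T(n), T2_closed(n)))
```

At powers of two the maximum in (1) is reached by the even split, so the two columns agree.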

We use this relation to show the following remarkable property of $T_2(n)$. It says that when a problem is split in two, the more unequal the parts, the better.

Proposition 3 If $x < y$ then $T_2(x) + T_2(y) + c\,x \log x < T_2(x + y)$.

Proof: Consider the variable $z = y/x$. The following identities are easy to derive:

$$\begin{aligned}
\log(x+y) - \log x &= \log(1+z) \\
\log(x+y) - \log y &= \log(1+1/z) \\
\log^2(x+y) - \log^2 x &= [\,2\log x + \log(1+z)\,]\,\log(1+z) \\
\log^2(x+y) - \log^2 y &= [\,2\log x + \log(1+z) + \log z\,]\,\log(1+1/z).
\end{aligned}$$

Define $H(z) = \log(1+z) + z \log(1+1/z)$. Its value for $z = 1$ is 2, and its derivative is $\log(1+1/z)$, implying that $H(z)$ is strictly increasing. We will show that $[T_2(x+y) - T_2(x) - T_2(y)]/(cx) > \log x$. Using the closed form for $T_2$, we have:

$$\begin{aligned}
[T_2(x+y) - T_2(x) - T_2(y)]/(cx) ={}& (1/4)[\log^2(x+y) - \log^2 x] + (y/4x)[\log^2(x+y) - \log^2 y] \\
& - (1/4)[\log(x+y) - \log x] - (y/4x)[\log(x+y) - \log y].
\end{aligned}$$

Substituting the four identities, with $y/x = z$, this is equal to

$$(H(z)/4)\,[\,2\log x + \log(1+z) - 1\,] + (1/4)\,z \log z \,\log(1+1/z) \;\ge\; (H(z)/2) \log x \;>\; \log x,$$

since $H(z) > 2$ when $z > 1$.

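Using Proposition 3, we get:

Proposition 4 $T(n) \le T_2(n)$.

Proof: Suppose that $T(i) \le T_2(i)$ for all $i < n$, then

$$\begin{aligned}
T(n) &= \max_{1 \le p \le \lfloor n/2 \rfloor} \{\, c\,p \log p + T(p) + T(n-p) \,\} \\
     &\le \max_{1 \le p \le \lfloor n/2 \rfloor} \{\, c\,p \log p + T_2(p) + T_2(n-p) \,\} \\
     &\le \max_{1 \le p \le \lfloor n/2 \rfloor} \{\, T_2(p + n - p) \,\} \\
     &\le \max_{1 \le p \le \lfloor n/2 \rfloor} \{\, T_2(n) \,\} \\
     &= T_2(n).
\end{aligned}$$

We thus have:

Theorem 2 The time complexity of algorithm FindTeams is $O(n \log^2 n)$.

Theorem 2 is truly a worst-case behavior. It is easy to construct examples in which its behavior will be linear, taking, for example, an input in which one chromosome has only singletons as maximal $\delta$-chains.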

3.4 A faster algorithm

Algorithm FindTeams can be optimized by using a parameter $\delta'$ that depends on gene density and the value of $\delta$:

Definition 4 Let $\delta'$ be the maximal number of genes contained in a moving window of size $\delta$, over all the chromosomes.

The optimization focuses on how to extract the small league $P$, or the pivot of Hopcroft's framework (see Section 4). Assume $P$ to be of size $p$. The extraction algorithm will run in $O(p \log \delta')$ instead of $O(p \log p)$. The idea is to locally sort the genes in small zones, and then consider consecutive zones to find the maximal $\delta$-teams. These consecutive zones are built by extending the neighborhood of each zone, without sorting the zones.

Each chromosome is cut in at most $2n$ zones $Z_i$ of length $\delta$, and each gene on this chromosome is associated with a specific zone. A table $Z = Z_1 \ldots Z_h$ is built for each chromosome to ensure direct access to a zone.

The zone building algorithm for a chromosome is given in Fig. 3. The genes are scanned from left to right (line 2), the current position is initialized with the position of the first gene, the initial gene to the first gene, and the zone number to 1 (line 1). Then, if the distance between the current gene and the initial gene is greater than $2\delta$, we build two zones and reset the process. If this distance is between $\delta$ and $2\delta$, it means that we entered a consecutive zone and we also reset the process, but increment the number of zones only by one. Finally, if the distance is at most $\delta$, we stay in the same zone.

    Build_zones($(c_1 \ldots c_n)$)
    1.  CurrentZone ← 1 ; InitGene ← $c_1$
    2.  For $i = 1 \ldots n$ Do
    3.      If $\Delta_C(\text{InitGene}, c_i) > 2\delta$ Then
    4.          CurrentZone ← CurrentZone + 2 ; InitGene ← $c_i$
    5.      Else
    6.          If $\Delta_C(\text{InitGene}, c_i) > \delta$ Then
    7.              CurrentZone ← CurrentZone + 1 ; InitGene ← $c_i$
    8.          End of if
    9.      End of if
    10.     Zone$_C(c_i)$ ← CurrentZone
    11. End of for

Figure 3: Algorithm for assigning a zone to each gene of a chromosome $C$.
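The procedure of Fig. 3 is short enough to transcribe directly. The sketch below assumes the genes are passed as a list already sorted by position, with positions in a dictionary; the function name `build_zones` and the returned dictionary are illustrative choices.

```python
def build_zones(genes, pos, delta):
    """Assign a zone number to each gene of one chromosome.

    genes: genes sorted by increasing position; pos: dict gene -> position.
    Returns a dict gene -> zone number (zones are numbered from 1).
    """
    zone_of = {}
    current_zone = 1
    init_gene = genes[0]
    for gene in genes:
        distance = pos[gene] - pos[init_gene]
        if distance > 2 * delta:
            # A whole empty zone lies between the previous zone and this gene.
            current_zone += 2
            init_gene = gene
        elif distance > delta:
            current_zone += 1
            init_gene = gene
        zone_of[gene] = current_zone
    return zone_of


X = {"c": 0, "e": 3, "d": 4, "a": 5, "b": 7}
print(build_zones(sorted(X, key=X.get), X, 2))
# {'c': 1, 'e': 2, 'd': 2, 'a': 2, 'b': 3}
```

All genes of a zone lie within $\delta$ of its initial gene, which is why a zone holds at most $\delta'$ genes.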

The $h$ zones $Z_1, \ldots, Z_h$ computed with Build_zones have some obvious properties. There are at most $\delta'$ genes associated with the same zone. The total number $h$ of zones is less than or equal to $2n$, since a gene creates at most 2 zones (line 4).

3.4.2 Sorting all zones

Assume now that we want to extract a league $P$ of size $p$ out of a chromosome $C$. We first group together the genes of $P$ that are associated to the same zone of the table $Z$ of $C$. Suppose we considered $l$ zones $Z_{i_1}, \ldots, Z_{i_l}$ of sizes $z_j$, $1 \le j \le l$. This takes time proportional to $p$. We now sort each such zone using a classical optimal sort algorithm. Sorting $Z_{i_j}$ requires $O(z_j \log z_j)$ time, which is, as $z_j \le \delta'$, less than or equal to $O(z_j \log \delta')$. The total complexity is then less than or equal to

$$O\Big(\sum_{j=1}^{l} z_j \log \delta'\Big) = O(p \log \delta').$$

Note that for the rest of the extraction algorithm, we keep track, for each non-empty zone $Z_{i_j}$, of the minimal and maximal position of the genes in $Z_{i_j}$. This is given by the sorting procedure without additional cost.

3.4.3 Extracting maximal $\delta$-chains

At this point, we have a list of $l$ sorted zones $Z_{i_1}, \ldots, Z_{i_l}$ of genes, in a table $Z = Z_1 \ldots Z_h$. The zones are not sorted among each other, in the sense that we cannot address the zones [...] this information we can extract $P$ in $C$. The idea is simply to consider for each zone $Z_{i_j}$, $1 \le j \le l$, the zone to its left in the table $Z$, that is $Z_{i_j - 1}$ (if it exists), and chain $Z_{i_j}$ with $Z_{i_j - 1}$ if necessary. The zone $Z_{i_j - 1}$ is accessible in constant time through the table $Z$. The order in which the zones $Z_{i_j}$ are considered is irrelevant. There are three main cases:

1. Zone $Z_{i_j - 1}$ does not exist ($i_j = 1$). Zone $Z_{i_j}$ is directly marked as an initial zone.

2. Zone $Z_{i_j - 1}$ is empty. Then, the way zones are built by algorithm Build_zones (Fig. 3) ensures that the genes in $Z_{i_j}$ cannot be $\delta$-connected to other genes to the left, since an empty zone means a distance greater than $\delta$ to any preceding gene. The zone $Z_{i_j}$ is then marked as an initial zone.

3. Zone $Z_{i_j - 1}$ is not empty. Then, if the distance between the last element of $Z_{i_j - 1}$ and the first element of $Z_{i_j}$ is less than or equal to $\delta$, then $Z_{i_j}$ is chained to $Z_{i_j - 1}$ as a following zone. Otherwise, we apply a process similar to case 2.

At the end of that process, after having considered all zones in which at least one element of $P$ was found, all zones are either chained to the zone to their left, or initial. To finish the process, for all the initial zones, we follow the links of chained zones and concatenate the genes. This forms the maximal $\delta$-chains, since: (a) inside a zone, the genes are $\delta$-connected; (b) if two zones $Z_{i_j - 1}$ and $Z_{i_j}$ are chained, the genes of these two zones are $\delta$-connected, since we test whether the maximal gene of $Z_{i_j - 1}$ is connected to the minimal gene of $Z_{i_j}$ or not; (c) if the $\delta$-chain was not maximal, another zone (to the left or to the right) would have been chained.
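The grouping, local sorting and chaining steps can be put together as follows. This is a sketch under assumptions: the zone assignment is given as a dictionary from gene to zone number (for instance the output of a zone-building pass), a hash map stands in for the direct-access table $Z$, and the function name `chains_from_zones` is hypothetical.

```python
from collections import defaultdict


def chains_from_zones(pivot, zone_of, pos, delta):
    """Decompose the genes of `pivot` into maximal delta-chains of one chromosome.

    zone_of: dict gene -> zone number; pos: dict gene -> position.
    """
    by_zone = defaultdict(list)
    for gene in pivot:                  # grouping takes time proportional to p
        by_zone[zone_of[gene]].append(gene)
    for genes in by_zone.values():      # local sorts, each on at most delta' genes
        genes.sort(key=pos.get)

    follower = {}                       # zone -> zone chained to its right
    initial = []
    for zone, genes in by_zone.items():  # the visiting order is irrelevant
        left = by_zone.get(zone - 1)
        if left and pos[genes[0]] - pos[left[-1]] <= delta:
            follower[zone - 1] = zone   # case 3: chained to the zone on its left
        else:
            initial.append(zone)        # cases 1 and 2, or a gap wider than delta

    chains = []
    for zone in initial:                # follow the links from each initial zone
        chain = []
        while zone is not None:
            chain.extend(by_zone[zone])
            zone = follower.get(zone)
        chains.append(chain)
    return chains


X = {"c": 0, "e": 3, "d": 4, "a": 5, "b": 7}
zones = {"c": 1, "e": 2, "d": 2, "a": 2, "b": 3}  # zones of X for delta = 2
print(chains_from_zones(["b", "a", "c", "e", "d"], zones, X, 2))
# [['e', 'd', 'a', 'b'], ['c']]
```

The chains come out in the order in which their initial zones were met, since the zones themselves are never sorted.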

3.4.4 Complexity

Proposition 5 Splitting a league $P$ of size $p$ can be done in $O(p \log \delta')$ worst-case time.

Using the analysis of Section 3.3 or the amortized techniques of Hopcroft's framework (see Section 4), we get a new algorithm with $O(n \log n \log \delta')$ worst-case time complexity. The optimization still requires $O(n)$ space, since there are at most $2n$ zones per chromosome. The complexity analysis extends to the case of $m$ chromosomes, yielding an $O(mn \log n \log \delta')$ algorithm.

4 Hopcroft's partitioning framework

Partition refinement with pivots is a widely used technique to solve a large class of problems on graphs, strings, etc. [4, 8]. The first designer was Hopcroft, who used it to minimize deterministic automata [10]. We propose another version of the faster algorithm, based on partition refinement with pivots, for the computation of the $\delta$-teams of two chromosomes. The algorithm extends to an arbitrary number $m$ of chromosomes.

4.1 Gene teams and Hopcroft's partitioning framework

Refining a partition $L$ can be done by splitting its classes into smaller ones, according to a subset $S$ of $\Sigma$ called the pivot set: each class $X$ of $L$ is replaced by $X \cap S$ and $X \setminus S$. We [...] $\delta$-teams, pivots will always be $\delta$-chains of one of the chromosomes.

Let $L_C$ and $L_D$ be the two initial partitions induced by maximal $\delta$-chains of chromosomes $C$ and $D$. We distinguish two types of pivots, called type $C$ and type $D$. Pivots of type $C$ split the partition $L_D$ while pivots of type $D$ split the partition $L_C$. Partitions are implemented by sorted lists. Therefore partitions are implicitly ordered. A partition $Q$ is compatible with a partition $P$ if every class of $Q$ is included in a class of $P$ and if the ordering in $P$ respects the ordering in $Q$ (i.e. if in $P$ the class $X$ is before the class $Y$, then any class $X' \subseteq X$ of $Q$ is before any class $Y' \subseteq Y$). A pivot splits a partition into a compatible one. Moreover, and this point differs slightly from general partition refinement schemes, each class of a partition is also implemented by a sorted list. Each class of the partition $L_C$ is sorted according to the gene order given by chromosome $C$, and each class of the partition $L_D$ is sorted according to the order given by $D$.

Definition 5 We say that a class $X$ overlaps a set $S$ if $X \not\subseteq S$ and $X \cap S \neq \emptyset$. Given a subset $S$ of $\Sigma$, a partition $L$ of $\Sigma$ is said to be $S$-stable when no class of $L$ overlaps $S$.

Note that after a refinement step of $L$ by $S$, the new partition is $S$-stable.
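A generic refinement step and the stability test of Definition 5 can be sketched as below. This shows only the set-level operation $X \mapsto (X \cap S, X \setminus S)$; it does not model the ordering constraints or the further splitting into $\delta$-chains that the gene team algorithm requires. The function names are hypothetical.

```python
def refine(partition, pivot):
    """Replace each class X by X ∩ S and X minus S, dropping empty parts."""
    pivot = set(pivot)
    refined = []
    for cls in partition:
        inside = [g for g in cls if g in pivot]
        outside = [g for g in cls if g not in pivot]
        refined.extend(part for part in (inside, outside) if part)
    return refined


def overlaps(cls, pivot):
    """True if the class meets the pivot without being contained in it."""
    pivot = set(pivot)
    return any(g in pivot for g in cls) and not all(g in pivot for g in cls)


def is_stable(partition, pivot):
    """True if no class of the partition overlaps the pivot."""
    return not any(overlaps(cls, pivot) for cls in partition)


L = [["c", "e", "d", "a", "b"]]
S = {"a", "b"}
R = refine(L, S)
print(R, is_stable(L, S), is_stable(R, S))
# [['a', 'b'], ['c', 'e', 'd']] False True
```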

The PartitionRe nement algorithm is des ribed in Fig. 4. While Hop roft's original algorithm pro esses the \small half", we pro ess several \small parts": initially, the sta k pivots ontainsall lassesofthetwopartitions. Then,ea h lassinthesta kiseitherrepla ed bysmaller ones, ornewsmallsub lassesare sta ked. Thealgorithm allsSort zones(P), a pro edurewhi h omputesa de ompositionof thepivotP of typeC (resp. D) into anunion of maximalÆ- hainsof D(resp. C). Thispro edure isdes ribed inSe tions 3.4.2 and 3.4.3.

Procedure Split(X, P), Fig. 5, is the main part of the algorithm. If a class $X$ of $L_C$ (resp. $L_D$) properly overlaps the pivot set, the pivot splits $X$ into at least two classes according to the pivot set. The resulting subclasses are still $\delta$-chains of $C$ (resp. $D$). The sizes of the subclasses are computed in parallel during the process, in order to avoid parsing a possible (necessarily unique) large subclass. The code uses the following functions. If $X$ is a $\delta$-chain of the chromosome $C$, let $(g_1, \ldots, g_k)$ be the permutation of $X$ induced by $C$. We denote by next$(g_i, X)$ the gene $g_{i+1}$ when it exists, in which case hasnext$(g_i, X)$ is true. If it does not exist, hasnext$(g_i, X)$ is false.
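The parallel computation of sizes mentioned above can be illustrated by the following sketch. It is not the Split procedure itself: the function name `sizes_in_lockstep` is hypothetical, Python iterators stand in for next/hasnext, and the total size is assumed to be known in advance (as size[$X$] is in the paper). All chains are advanced one element per round; a chain that ends has a known size, and the last surviving chain is never scanned to its end because its size is deduced by subtraction.

```python
def sizes_in_lockstep(chains, total):
    """Return the size of every chain without fully scanning the last survivor.

    `chains` is a list of iterables whose sizes sum to `total`.
    """
    cursors = {i: iter(chain) for i, chain in enumerate(chains)}
    counted = {i: 0 for i in cursors}
    sizes = {}
    remaining = total
    while len(cursors) > 1:
        for i in list(cursors):
            try:
                next(cursors[i])          # plays the role of next(g, X)
                counted[i] += 1
            except StopIteration:         # hasnext(g, X) is false
                sizes[i] = counted[i]
                remaining -= counted[i]
                del cursors[i]
    for i in cursors:                     # at most one chain is left
        sizes[i] = remaining
    return [sizes[i] for i in range(len(chains))]
```

The number of elements read in the surviving chain is bounded by the size of the largest of the other chains (plus one), which is the point of the technique.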

The correctness of Algorithm PartitionRefinement follows from the following invariants of the while loop (line 6).

Proposition 6 Partitions $L_C$ and $L_D$ always verify:

1. Each class of $L_C$ (resp. $L_D$) is a $\delta$-chain of $C$ (resp. $D$).
2. The union of two distinct classes of $L_C$ (resp. $L_D$) is not a $\delta$-set.

 1. Initializations
 2.    $L_C$ (resp. $L_D$) $\leftarrow$ the collection of maximal $\delta$-chains of $C$ (resp. $D$)
       (each class of $L_C$ (resp. $L_D$) is ordered by $C$ (resp. $D$)).
 3.    Let pivots be an empty stack of pivots.
 4.    Add each class of $L_C$ (resp. $L_D$) in pivots as a pivot of type $C$ (resp. $D$).
 5. Refinements
 6.    While (pivots is not empty) Do
 7.       Pick a pivot $P$ in pivots.
 8.       Sort zones($P$)
 9.       If $P$ has type $D$ (the case type $C$ is similar) Then
10.          If $L_C$ is not $P$-stable Then
11.             Let $M$ be the set of classes of $L_C$ properly overlapping $P$.
12.             For each class $X \in M$ Do
13.                Let $(X_1, X_2, \ldots, X_r)$ = Split($X$, $P$)
14.                If ($X$ is contained in the stack pivots) Then
15.                   Remove $X$ from pivots and add $X_1, X_2, \ldots, X_r$
16.                   as pivots of type $C$.
17.                Else
18.                   For each class $X_i$ such that size[$X_i$] $\le$ size[$X$]/2 Do
19.                      Add $X_i$ in pivots as a pivot of type $C$.
20.                   End of for
21.                End of if
22.             End of for
23.          End of if
24.       End of if
25.    End of while

Figure 4: Hopcroft-like algorithm for gene teams identification.

Proof. During the initialization of Algorithm PartitionRefinement, the classes of $L_C$ and $L_D$ are $\delta$-chains of $C$ and $D$ respectively, and Procedure Split transforms a collection of $\delta$-chains into a collection of $\delta$-chains.

The conservation of property 2 follows from the following property 2': for any pivot $P$, any element $g$ of $P$ and any element $g' \notin P$, $g$ and $g'$ cannot be in the same maximal $\delta$-set. Properties 2' and 2 are true after the initialization step. Let us assume that they are both satisfied at some time. Then, after the splitting of a class $X$ under a pivot, any two elements of two distinct subclasses cannot belong to the same maximal $\delta$-set, by construction. Thus the new pivots of the stack obtained from lines 15-16 or from lines 18-19 of Algorithm PartitionRefinement still verify 2', and the refined partition still verifies 2. □

Proposition 6 implies that no $\delta$-team will be split during the process. The next proposition ensures that there are always enough pivots in the stack to properly identify all $\delta$-teams.

Proposition 7 If the partition $L_C$ is not $Y$-stable for every class $Y \in L_D$ (or if the partition $L_D$ is not $X$-stable for every class $X \in L_C$), then some pivot of type $D$ (resp. $C$) in the stack pivots will strictly refine this partition.

In the case of more than two chromosomes, at the end of the execution of the algorithm, each partition of one chromosome is $X$-stable for each class $X$ of a partition of another chromosome.

Split($X$, $P$) outputs a list of classes $L$ with their sizes
 1. Let $L$ be the empty list.
 2. Extract maximal $\delta$-chains $X_1, \ldots, X_r$ of elements from $X \cap P$
 3. Extract maximal $\delta$-chains $X'_1, \ldots, X'_s$ of elements from $X \cap (\Sigma \setminus P)$
 4. For (each chain $X_i$) Do
 5.    Compute size[$X_i$] with an exploration of the chain $X_i$.
 6.    Add $X_i$ to $L$.
 7.    size[$X$] $\leftarrow$ size[$X$] $-$ size[$X_i$]
 8. End of for
 9. Let $L' = (X'_1, \ldots, X'_s)$
10. For (each chain $X' \in L'$) Do
11.    Set $g(X')$ as the first element of $X'$.
12.    size[$X'$] $\leftarrow$ 1.
13. End of for
14. While ($L'$ contains more than one chain) Do
15.    While (hasnext($g(X')$, $X'$) for each $X' \in L'$) Do
16.       For (each $X' \in L'$) Do
17.          $g(X') \leftarrow$ next($g(X')$, $X'$).
18.          size[$X'$] $\leftarrow$ size[$X'$] $+$ 1.
19.       End of for
20.    End of while
21.    For (each $X' \in L'$ such that not hasnext($g(X')$, $X'$)) Do
22.       Add $X'$ to $L$.
23.       Remove $X'$ from $L'$.
24.       size[$X$] $\leftarrow$ size[$X$] $-$ size[$X'$].
25.    End of for
26. End of while
27. If ($L'$ is nonempty, and hence contains a unique chain $X'$) Then
28.    Add $X'$ to $L$.
29.    size[$X'$] $\leftarrow$ size[$X$].
30. End of if
31. return $L$.

Figure 5: Splitting a class under a pivot.

Proof. We show that if the partition $L_C$ is not $Y$-stable for every class $Y \in L_D$, then some pivot in pivots will strictly refine the partition $L_C$. Let us assume that there is a class $X \in L_C$ such that $X$ properly overlaps a class $Y \in L_D$. Let $g \in Y \cap X$ and $g' \in (\Sigma \setminus Y) \cap X$. Consider the first time $g$ and $g'$ are split apart into two different classes $Z_1$ and $Z_2$ of $L_D$. If these classes are classes of the initial partition $L_D$, then $Z_1$ is an initial pivot. Otherwise, there is a splitting of a class $Z \ni g, g'$ into $Z_1 \ni g, Z_2 \ni g', \ldots, Z_r$. Then either $Z$ was already in the stack of pivots, and all subclasses $Z_i$ have been added as pivots (lines 15-16 of Algorithm PartitionRefinement), or $Z$ was not in the stack, and all subclasses $Z_i$ but at most one have been added as pivots (lines 18-19 of Algorithm PartitionRefinement). This produces a pivot containing either $g$ and not $g'$, or $g'$ and not $g$. Such a pivot cannot go out of the stack, since pivoting on it would split $X$ into at least two classes. If it is itself split inside the stack (lines 15-16 of Algorithm PartitionRefinement), another pivot separating $g$ and $g'$ still remains in the stack. Thus the stack contains a pivot able to strictly refine $L_C$. □

As a consequence, at the end of the execution of the process, $L_C$ is $Y$-stable for every class $Y \in L_D$, and $L_D$ is $X$-stable for every class $X \in L_C$. Thus $L_C$ and $L_D$ are collections of the same $\delta$-sets. It follows from Proposition 6, property 2, that these $\delta$-sets are maximal. We obtain the expected $\delta$-teams as $L_C$ or $L_D$.
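The fixed point reached at the end of the process (every class is a $\delta$-chain on every chromosome, and no class can be merged with another) can be checked on small inputs with a naive refinement loop. The sketch below is not the PartitionRefinement algorithm: it uses no pivot stack, it may take quadratic time, and the names `delta_chains` and `naive_teams` as well as the gene positions are hypothetical. It relies only on the fact that a $\delta$-set contained in a set $X$ always lies inside a single maximal $\delta$-chain of $X$ on each chromosome, so repeated splitting never separates a team.

```python
def delta_chains(genes, pos, delta):
    """Split `genes` into maximal delta-chains for the positions `pos`."""
    ordered = sorted(genes, key=pos.__getitem__)
    chains, current = [], [ordered[0]]
    for g in ordered[1:]:
        if pos[g] - pos[current[-1]] <= delta:
            current.append(g)
        else:
            chains.append(current)
            current = [g]
    chains.append(current)
    return chains


def naive_teams(chromosomes, delta):
    """Refine until every class is a delta-chain on every chromosome."""
    classes = [sorted(chromosomes[0])]
    changed = True
    while changed:
        changed = False
        for pos in chromosomes:
            refined = []
            for cls in classes:
                parts = delta_chains(cls, pos, delta)
                changed = changed or len(parts) > 1
                refined.extend(parts)
            classes = refined
    return sorted(sorted(c) for c in classes)


# Hypothetical gene positions on two chromosomes, for illustration only.
C = {"a": 1, "b": 2, "c": 5, "d": 6, "e": 9}
D = {"a": 2, "b": 1, "c": 9, "d": 4, "e": 5}
teams = naive_teams([C, D], delta=2)
```

Such a brute-force version is mainly useful as a reference against which a faster implementation can be tested.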

To achieve a good complexity, we use the following data structures. Any class of $L_C$ (resp. $L_D$) is stored in a doubly linked list, ordered by $C$ (resp. $D$). All the classes of a partition are stored in a doubly linked list. Each element of a class has a pointer to its class. Moreover, each gene can be accessed directly in $L_C$ and in $L_D$, by the use of a table. This data structure is illustrated by Figure 6, which represents the initial partition $L_C$ for the two following chromosomes $C$, $D$ with $\delta = 2$.

C =   e d a  b D = a b     d e:

The initializations are performed in linear time $O(n)$ for two chromosomes.
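A minimal sketch of such a structure is given below, under several assumptions: the class and function names (`Node`, `GeneClass`, `initial_partition`) are hypothetical, the list of classes is a plain Python list rather than a doubly linked list, and the gene positions are invented for illustration. It shows the three ingredients named above: a doubly linked list per class, a pointer from each element to its class, and a table giving direct access to each gene. The initial partition is built in one pass over the genes taken in chromosome order.

```python
class Node:
    """One gene inside a class, linked to its neighbours and to its class."""
    def __init__(self, gene, cls):
        self.gene, self.cls = gene, cls
        self.prev = self.next = None


class GeneClass:
    """A class of the partition, kept as a doubly linked list."""
    def __init__(self):
        self.head = self.tail = None
        self.size = 0

    def append(self, gene):
        node = Node(gene, self)
        node.prev = self.tail
        if self.tail is None:
            self.head = node
        else:
            self.tail.next = node
        self.tail = node
        self.size += 1
        return node

    def remove(self, node):
        """Unlink `node` in constant time."""
        if node.prev is None:
            self.head = node.next
        else:
            node.prev.next = node.next
        if node.next is None:
            self.tail = node.prev
        else:
            node.next.prev = node.prev
        self.size -= 1

    def __iter__(self):
        node = self.head
        while node is not None:
            yield node.gene
            node = node.next


def initial_partition(order, pos, delta):
    """Build the maximal delta-chains in one pass over `order`.

    `order` lists the genes by increasing position `pos[g]`.
    Returns the list of classes and the table gene -> node.
    """
    classes, table = [], {}
    previous = None
    for gene in order:
        if previous is None or pos[gene] - pos[previous] > delta:
            classes.append(GeneClass())   # gap too large: start a new chain
        table[gene] = classes[-1].append(gene)
        previous = gene
    return classes, table


# Hypothetical positions, chosen only for illustration.
pos = {"c": 1, "e": 3, "d": 4, "a": 7, "b": 9}
classes, table = initial_partition(["c", "e", "d", "a", "b"], pos, delta=2)
```

With the table, a gene met while exploring a pivot can be located and removed from its class without scanning the class.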

[Figure 6 shows the classes $X_1$ and $X_2$ of $L_C$ and the access table for the genes a, b, c, d, e.]

Figure 6: The initial partition $L_C$.

The complexity analysis uses amortized techniques, especially the pointed parts technique used in [4] or [2, p. 331]. We consider pairs $(P, g)$ made of a pivot $P$ going out of the stack of pivots (line 7 of the algorithm PartitionRefinement) and an element $g$ in $P$. The basic result is the following:

Proposition 8 Each gene $g$ appears at most $2 \log n$ times in a pivot $P$ going out of the stack.

Proof. If a pivot $P$ containing an element $g$ goes out of the stack and has size $p$, a pivot containing $g$ which enters the stack later is included in $P$, and has size at most $p/2$. Thus, it will also have size at most $p/2$ when going out of the stack. A gene $g$ belongs initially to two pivots, one of type $C$ and one of type $D$. □

Let $c(P, g)$ be the amortized cost of processing the pointed pair $(P, g)$. Then, by Proposition 8, the global cost of the algorithm will be given by $2n\,c(P, g) \log n$. We establish, in the next proposition, that $c(P, g)$ is $O(\log \delta')$.

[Note that the complexity analysis assumes the data structures described above: doubly linked lists for the classes and for the partitions, a pointer from each element to its class, and direct access to each gene in $L_C$ and in $L_D$ through a table.]

Proposition 9 The amortized cost is $c(P, g) = c_1 \log \delta' + c_2$, where $c_1$ and $c_2$ are constants.

Proof. A pivot $P$ of size $p$ going out of the stack is first processed by Sort zones in time $O(p \log \delta')$. We assign to each $(P, g)$ a cost $\log \delta'$, so that the sum of these costs for all $g$ in $P$ equals the cost of the sorting operation. The computation of the set $M$ of lines 10-11 of the algorithm is done in time $O(p)$ by exploring $P$ and using the direct links from a gene to its position in a class. This increments the cost of each $(P, g)$ by a constant.

We now consider the cost induced by Procedure Split. Let $h$ be the size of the class $X$ to be split. We claim that the extractions of lines 2-3 are also performed in time $O(p)$. Indeed, one extracts a $\delta$-chain $X_i$ of elements of $X \cap P$ by exploring the list $P$, and by checking the $\delta$-connection for the order induced by $C$. More precisely, when an element, candidate to be added in $X_i$, is not $\delta$-connected to the previous ones for the order $C$, one builds a new class $X_{i+1}$. If it is $\delta$-connected, it is removed from $X$ in constant time. If $X$ is no longer $\delta$-connected, we cut it into a $\delta$-chain $X'_j$ of elements in $X \cap (\Sigma \setminus P)$, and a new $\delta$-chain $X$. This increments the cost of each $(P, g)$ only by another constant. Remark that this implies that there are at most $p$ subclasses $X_i$. Note also that, at this time, the sizes of the subclasses, and the pointers from each element in a class to its class, have not been updated.

We next consider the cost of the computation of the sizes of the subclasses. The computation of the sizes of the subclasses $X_i$ is performed in lines 4-8 of Procedure Split in time $O(p)$, since the sum of the sizes of these subclasses is at most $p$. This charges $(P, g)$ with a constant again. The computation of the sizes $s'_j$ of the subclasses $X'_j$ is done in lines 14-30. Recall that a small subclass has a size less than or equal to $h/2$. Since $L'$ in lines 14-26 has at least two subclasses, the subclasses removed in line 23 are small. At line 26, all subclasses that have been read completely are small, and the beginning of a possible unique large subclass $Y$ may have been explored. Nevertheless, the maximal number of elements of $Y$ read is the maximal size of all other subclasses. The pointers from each element in a class to its class are recomputed for all subclasses but $Y$. Thus the cost of the computation of the sizes and pointers of all subclasses is at most $2 \sum_{j \in J} s'_j$, where $J$ is the index set of all subclasses but $Y$. Since all subclasses but $Y$ are at some time contained in the stack of pivots, and can go out of it by being removed in line 14, one charges again each $(P, g)$ with one more constant, in order to count the cost of these operations. □

Proposition 10 The time complexity of the algorithm PartitionRefinement is $O(n \log n \log \delta')$ for two chromosomes and $O(mn \log n \log \delta')$ for $m$ chromosomes.
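For two chromosomes, the bound can be read as a direct combination of Propositions 8 and 9. The following derivation is only a sketch of that accounting, assuming $\delta' \ge 2$ so that the constant $c_2$ is absorbed by the $\log \delta'$ term:

```latex
\begin{align*}
\text{total cost}
  &= \sum_{(P,g)} c(P,g) \\
  &\le \underbrace{n}_{\text{genes}} \cdot
       \underbrace{2 \log n}_{\text{Proposition 8}} \cdot
       \underbrace{(c_1 \log \delta' + c_2)}_{\text{Proposition 9}} \\
  &= O(n \log n \log \delta').
\end{align*}
```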

4.3 From Hopcroft-like algorithm to FindTeams

The two algorithms PartitionRefinement and FindTeams are very close. The algorithm FindTeams is in fact a recursive simplification of the Hopcroft-like one. The simplification is based on the two following remarks.

First, the stack pivots of lines 6-8 of Algorithm PartitionRefinement is simulated in FindTeams by the recursive calls to itself of lines 13-14. This uses a property of the problem that is not valid for all Hopcroft-like algorithms, and allows us to divide the original problem into two subproblems. Indeed, assume that in line 11 of PartitionRefinement a pivot $P$ (say of $L_D$) splits the set of classes $M$ of $L_C$ whose alphabet intersects that of $P$. The split is performed using Split, which partitions the resulting classes of $L_C$ into two sets, those that contain elements of $P$ and the others. Some of these classes will be reintegrated in the list pivots in lines 18-19 of PartitionRefinement and reused later to split other classes. However, the classes of $L_C$ built with elements of $P$, used as pivots, would only cut classes of $L_D$ built with elements of $P$. This property allows us to derive two sub-problems after a Split: on one hand, all classes of $L_C$ built of elements of $P$ together with $P$ on $L_D$, and, on the other hand, all the classes remaining on $L_C$ and $L_D$. This is used in FindTeams to recursively call the same algorithm on these two sets in lines 13-14 of Algorithm FindTeams.

A second remark concerns the computation of the sizes of the classes. In the Hopcroft-like algorithm, when splitting a class $X$ with a pivot $P$, the sizes of the resulting classes of size less than or equal to size[$X$]/2 are computed in lines 14-30 of Split. After the split, in lines 18-19 of algorithm PartitionRefinement, the classes are kept as potential pivots. Algorithm FindTeams simplifies these steps by finding a small class of size $p$ (if it exists) in $O(p)$ and considering it as a pivot.

5 Extensions

5.1 Multiple Chromosomes

The most natural extension of the definition of $\delta$-teams to a set $\{C_1, \ldots, C_m\}$ of chromosomes is to define a $\delta$-set $S$ as a $\delta$-chain in each chromosome $C_1$ to $C_m$, and to consider maximal $\delta$-sets as in Definition 2. For example, with $\delta = 2$, the only $\delta$-team of chromosomes:

X =   e d a  b Y = a b    d e Z = b a e    d;

that is not a lonely gene is the set $\{a, b\}$.

All the definitions and results of Section 2 apply directly to this new context, replacing $C$ and $D$ by the $m$ chromosomes.

Algorithm FindTeams can be readily adapted to $m$ chromosomes by modifying its two main tasks of finding and extracting small leagues. Identifying a small league in $m$ partitions can be done in $O(mp)$. This small league must then be extracted from $m - 1$ chromosomes, yielding two sub-problems, one of which is of size $p$. The analysis of Section 3.3 directly yields an $O(mn \log^2 n)$ time bound for this algorithm, since the parameter $c$ in Equation 1 was arbitrary.

5.2 Extension to Circular Chromosomes

In the case of circular chromosomes, we first slightly modify the assumptions and definitions. The positions of genes are given here as values on a finite interval:

$$P_C : \Sigma \to [0..L],$$

in which position $L$ is equivalent to position 0. The distance between two genes $g$ and $g'$ such that $P_C(g) < P_C(g')$ is given by:

$$\Delta_C(g, g') = \min \{\, P_C(g') - P_C(g),\; P_C(g) + L - P_C(g') \,\}.$$

In addition to the permutation $(g_1 \ldots g_n)$ induced by $C$, we consider the permutations, for $1 < m \le n$:

$$(g_m \ldots g_n \, g_1 \ldots g_{m-1}).$$

A $\delta$-chain in a circular chromosome is any $\delta$-chain of at least one of these permutations. A circular $\delta$-chain is a $\delta$-chain $(g_1 \ldots g_k)$ such that $\Delta_C(g_k, g_1) \le \delta$: it goes all around the chromosome. All other definitions of Section 2 apply without modification.
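The circular distance and the two notions of chain can be made concrete with a short sketch. The function names are hypothetical, genes are represented directly by their positions, and a sequence is assumed to be given in the order of one of the permutations above (that is, starting from an arbitrary gene and going around the chromosome).

```python
def circular_distance(p, q, length):
    """Distance between positions p and q when `length` is equivalent to 0."""
    lo, hi = sorted((p, q))
    return min(hi - lo, lo + length - hi)


def is_delta_chain(positions, length, delta):
    """Consecutive genes of the sequence are at distance at most delta."""
    return all(circular_distance(p, q, length) <= delta
               for p, q in zip(positions, positions[1:]))


def is_circular_delta_chain(positions, length, delta):
    """A delta-chain whose last gene is also within delta of its first gene."""
    return (is_delta_chain(positions, length, delta)
            and circular_distance(positions[-1], positions[0], length) <= delta)
```

For instance, with length 10 and $\delta = 2$, the sequence of positions 8, 9, 1 is a $\delta$-chain that crosses position 0 but is not circular, whereas 0, 2, 4, 6, 8 goes all around the chromosome.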

Adapting algorithm FindTeams to circular chromosomes requires a special case for the treatment of circular $\delta$-chains. Indeed, in Section 3.2, the beginning and end of a chromosome provided obvious starting places to detect leagues. In the case of circular chromosomes, assume that $S$ is a league of chromosomes $C$ and $D$, and that the genes of $S$ are respectively ordered in $C$ and $D$, from arbitrary starting points, as:

$$(c_1 \ldots c_n) \quad \text{and} \quad (d_1 \ldots d_n).$$

If none of these sequences is a circular $\delta$-chain, then there is a gap of length greater than $\delta$ on each chromosome, and the problem is reduced to a problem of linear chromosomes. If both are circular $\delta$-chains, then $S$ is a $\delta$-team. Thus, the only special case is when one is a circular $\delta$-chain, and the other, say $(c_1 \ldots c_n)$, has a gap greater than $\delta$ between two consecutive elements, or between the last one and the first one. Without loss of generality, we can assume that the gap is between $c_n$ and $c_1$. Then, if $S$ is not a team, there exists a value $p \le n/2$ such that one of the following sequences is a league:

$$(c_1 \ldots c_p) \qquad (c_{n-p+1} \ldots c_n).$$

The extraction procedure is similar to the one in Section 3.2, but both the extracted leagues can again be circular $\delta$-chains, as illustrated in Fig. 7.

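[Figure 7 shows the chromosomes $C$ and $D$ with the league $P$, and the chromosomes $C'$, $C''$, $D'$ and $D''$ obtained after the extraction of $P$.]

Figure 7: Special case that might occur when extracting the league $P$ out of a circular league of $D$. Both extracted leagues are again circular $\delta$-chains of $D'$ and $D''$.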

The circularity can be detected in $O(p)$ steps, since the property is destroyed if and only if an extracted gene creates a gap of length greater than $\delta$ between its two neighbors.

A particular case of the team problem is to find, for various values of $\delta$, all $\delta$-teams that contain a designated gene $g$. Clearly, the output of algorithm FindTeams can be filtered for the designated gene, but it is possible to do better. In lines 13 and 14 of Fig. 2, the original problem is split into two subproblems. Consider the first case, in which the sub-league $(c_1 \ldots c_p)$ is identified:

1. If gene $g$ belongs to $(c_1 \ldots c_p)$, then the second recursive call is unnecessary.
2. If gene $g$ does not belong to $(c_1 \ldots c_p)$, then the extraction of $(d'_1 \ldots d'_p)$ and the first recursive call are not necessary.

These observations lead to a simpler recurrence for the time complexity of this problem, since roughly half of the work can be skipped at each iteration. With arguments similar to those in Section 3.3, we get that the number of operations is bounded by a function of the form:

$$T(n) = c\,(n/2) \log(n/2) + T(n/2),$$

where $T(1) = 1$, and whose solution is:

$$T(n) = c\,n \log n - 2cn + 2c + 1.$$
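The closed form can be checked numerically against the recurrence. The sketch below assumes logarithms in base 2 and values of $n$ that are powers of two; the function names `T_recurrence` and `T_closed` are hypothetical, and the constant $c$ is an arbitrary test value.

```python
from math import log2


def T_recurrence(n, c):
    """T(n) = c (n/2) log2(n/2) + T(n/2), with T(1) = 1, n a power of two."""
    if n == 1:
        return 1
    return c * (n // 2) * log2(n // 2) + T_recurrence(n // 2, c)


def T_closed(n, c):
    """Closed form c n log2(n) - 2 c n + 2 c + 1."""
    return c * n * log2(n) - 2 * c * n + 2 * c + 1


# Compare both expressions for n = 1, 2, 4, ..., 2048.
agree = all(abs(T_recurrence(2 ** k, 3) - T_closed(2 ** k, 3)) < 1e-9
            for k in range(12))
```

Unfolding the recurrence gives $c \sum_{i=0}^{k-1} i\,2^i + 1$ for $n = 2^k$, which is where the closed form comes from.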

6 Conclusions and perspectives

We defined the unifying notion of gene teams and we constructed two distinct identification algorithms for $n$ genes belonging to two or more chromosomes, the faster one achieving $O(mn \log n \log \delta')$ time for $m$ linear or circular chromosomes. Both algorithms require only linear space.

The gene team identification problem is more complex than one could think in view of the simplicity of the first recursive algorithm. We showed in a second part that this algorithm is in fact a nice simplification of a full Hopcroft partitioning algorithm. However, instead of leading to a faster algorithm, this strong link reinforces our estimation of the intrinsic complexity of the gene team identification problem. In some particular Hopcroft-like algorithms, a clever pivot choice can reduce the complexity from $O(n \log n)$ to $O(n)$ [8]. Obtaining faster algorithms or lower bounds for the gene team identification problem remains open.

We intend to extend our work in two directions that will further clarify and simplify the concepts and algorithms used in comparative genomics. The first is to relax some aspects of the definition of gene teams. For large values of $m$, the constraint that a set $S$ be a $\delta$-chain in all $m$ chromosomes might be too strong. Sets that are $\delta$-chains in a quorum of the $m$ chromosomes could have biological significance as well. We also assumed, in this paper, that each gene in the set had a unique position in each chromosome. Biological reality can be more complex. Genes can go missing in a certain species, their function being taken over by others, and genes can have duplicates.

In a second phase, we plan to extend our notions and algorithms to combine distance with other relations between genes. For example, interactions between proteins are often studied through metabolic or regulatory pathways, and these graphs impose further constraints on teams.

[...] at http://www-igm.univ-mlv.fr/~raffinot/geneteam.html.

Acknowledgment

We would like to thank Marie-France Sagot for interesting discussions, Laure Vescovo and Nicolas Luc for a careful reading, and Myles Tierney for suggestions on the terminology. Special thanks go to Laurent Labarre for his bibliography work.

References

[1] A. K. Bansal. An automated comparative analysis of 17 complete microbial genomes. Bioinformatics, 15(11):900-908, 1999.

[2] D. Beauquier, J. Berstel, and P. Chrétienne, editors. Éléments d'algorithmique. Masson, Paris, 1992.

[3] A. Bergeron, S. Corteel, and M. Raffinot. The algorithmic of gene teams. In Workshop on Algorithms in Bioinformatics (WABI), number 2452 in Lecture Notes in Computer Science, pages 464-476. Springer-Verlag, Berlin, 2002.

[4] A. Cardon and M. Crochemore. Partitioning a graph in $O(|A| \log_2 |V|)$. Theoretical Computer Science, 19(1):85-98, 1982.

[5] T. Colombo, A. Guénoche, and Y. Quentin. Inférence fonctionnelle par l'analyse du contexte génétique: une application aux transporteurs ABC, 2002. Journées ALBIO, Montpellier, March.

[6] T. Dandekar, B. Snel, M. Huynen, and P. Bork. Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem. Sci., 23(9):324-328, 1998.

[7] W. Fujibuchi, H. Ogata, H. Matsuda, and M. Kanehisa. Automatic detection of conserved gene clusters in multiple genomes by graph comparison and P-quasi grouping. Nucleic Acids Research, 28(20):4029-4036, 2000.

[8] M. Habib, C. Paul, and L. Viennot. Partition refinement techniques: an interesting algorithmic toolkit. International Journal of Foundations of Computer Science, 10(2):147-170, 1999.

[9] S. Heber and J. Stoye. Finding all common intervals of k permutations. In Combinatorial Pattern Matching (CPM), number 2089 in Lecture Notes in Computer Science, pages 207-218. Springer-Verlag, Berlin, 2001.

[10] J. E. Hopcroft. An n log n algorithm for minimizing the states in a finite automaton. In Z. Kohavi, editor, The Theory of Machines and Computations, pages 189-196. Academic Press, 1971.

[11] M. Huynen, B. Snel, W. Lathe, and P. Bork. Predicting Protein Function by Genomic Context: Quantitative Evaluation and Qualitative Inferences. Genome Research, 10:1024-1210, 2000.

[12] [...] of Gene Clusters For Comparative Genomics. Computational Biology and Chemistry (ex. Computer and Chemistry), 27(1):59-67, 2002.

[13] A. Morgat. Synténies bactériennes, 2001. Entretiens Jacques Cartier on Comparative Genomics, Lyon, December.

[14] H. Ogata, W. Fujibuchi, S. Goto, and M. Kanehisa. A heuristic graph comparison algorithm and its application to detect functionally related enzyme clusters. Nucl. Acids. Res., 28(20):4021-4028, 2000.

[15] R. Overbeek, M. Fonstein, M. D'Souza, G. D. Pusch, and N. Maltsev. The use of gene clusters to infer functional coupling. Proc. Natl. Acad. Sci. USA, 96(6):2896-2901, 1999.

[16] R. Paige and R. E. Tarjan. Three partition refinement algorithms. SIAM Journal on Computing, 16(6):973-989, 1987.

[17] T. Cormen, C. Leiserson, and R. Rivest, editors. Introduction to Algorithms. MIT Press, 1992.

[18] R. L. Tatusov, M. Y. Galperin, D. A. Natale, and E. V. Koonin. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucl. Acids. Res., 28(1):33-36, 2000.

[19] T. Uno and M. Yagiura. Fast algorithms to enumerate all common intervals of two permutations. Algorithmica, 26(2):290-309, 2000.
