Modeling local repeats on genomic sequences


HAL Id: inria-00353690

https://hal.inria.fr/inria-00353690

Submitted on 16 Jan 2009

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Jacques Nicolas, Christine Rousseau, Anne Siegel, Pierre Peterlongo, François Coste, Patrick Durand, Sébastien Tempel, Anne-Sophie Valin, Frédéric Mahé

To cite this version:

Jacques Nicolas, Christine Rousseau, Anne Siegel, Pierre Peterlongo, François Coste, et al. Modeling local repeats on genomic sequences. [Research Report] RR-6802, INRIA. 2008, pp.43. ⟨inria-00353690⟩

Research report

ISSN 0249-6399, ISRN INRIA/RR--6802--FR+ENG

Theme BIO: Biological systems
Project-Team Symbiose

Research Report No. 6802, December 2008, 40 pages

Centre de recherche INRIA Rennes – Bretagne Atlantique

Jacques Nicolas∗†, Christine Rousseau, Anne Siegel, Pierre Peterlongo, François Coste, Patrick Durand, Sébastien Tempel§, Anne-Sophie Valin, Frédéric Mahé∥

Abstract: This paper deals with the specification and search of repeats of biological interest, i.e. repeats that may have a role in genomic structures or functions. Although some particular repeats such as tandem repeats have been well formalized, models developed so far remain of limited expressivity with respect to known forms of repeats in biological sequences. This paper introduces new general and realistic concepts characterizing potentially useful repeats in a sequence: Locality and several refinements around the Maximality concept. Locality is related to the distribution of occurrences of repeated elements and characterizes the way occurrences are clustered in this distribution. The associated notion of neighborhood makes it possible to indirectly exhibit words with a distribution of occurrences that is correlated to a given distribution. Maximality is related to the contextual delimitation of the repeated units. We have extended the usual notion of maximality, working on the inclusion relation between repeats and taking into account larger contexts. Mainly, we introduced a new repeat concept, largest maximal repeats, looking for the existence of a subset of maximal occurrences of a repeated word instead of a global maximization.

We propose algorithms checking for local and refined maximal repeats using at the conceptual level a suffix tree data structure. Experiments on natural and artificial data further illustrate various aspects of this new setting. All programs are available on the genouest platform, at http://genouest.org/modulome.

Key-words: local repeats, largest maximal repeats, repeats, genomes, bioinformatics

∗ To whom correspondence should be addressed
† Irisa/Inria, Campus de Beaulieu, 35042 Rennes Cedex, France
‡ Korilog SARL, BP 34, 56190 Muzillac, France
§ Giri, 1925 Landings Dr., Mountain View, CA 94043, USA
¶ Ligue Nationale Contre le Cancer, 14 Rue Corvisart, 75013 Paris, France
∥ Ecobio, CNRS UMR 6553, Campus de Beaulieu, 35042 Rennes Cedex, France


1 Introduction

A large number of studies have revealed that genome sequences contain repeated sub-sequences playing major roles in the structure, the function, the dynamics and the evolution of genomes [18, 22, 11].

There exists a large literature covering the problem of finding repeats, which can be mainly divided into three categories depending on the type of targeted repeats: exact repeats, repeats with errors and structured repeats.

Exact repeats have been extensively studied, leading to various concepts such as longest repeats [10, 16], maximal repeats [8, 12, 13, 23] and super-maximal repeats [8, 1]. The concept of maximal repeat is quite attractive and simply focuses on sequences present in at least two largest common blocks, without possible left or right extension, and without any biological a priori.

A second category of algorithms introduces an error model in the specification of repeated units, such as longest repeats with a block of don't cares [6], maximal pairs with bounded gap [5, 14], tandem repeats [25, 26, 4] and repeats with edit distance [15]. These kinds of algorithms are better suited to analyzing the many repeats contained in genome sequences, which usually contain copies with multiple variations.

Finally, the third kind of approach targets the search of structured motifs that consist of an ordered collection of p > 1 parts separated from one another by constrained spacers [17, 9, 19, 20]. These algorithms are of particular interest in studying gene expression and gene regulation.

Our concern in this paper is the specification and search of repeats of biological interest, i.e. repeats that may have a role in genomic structures or functions. This problem is already addressed by algorithms from the last two categories. Algorithms allowing the treatment of errors can be used to locate gene n-plications (n > 1), and various types of tandem repeats which are, among others, constituents of centromeres and telomeres. Algorithms looking for structured repeats have been proposed to tackle the difficult problem of locating the set of short motifs constituting the regulatory factors involved in gene transcription.

However, all models developed so far remain of limited expressivity with respect to known forms of repeats in biological sequences. Transposable elements for instance exhibit complex copying patterns that are only partially understood so far. Further studies are needed to develop more realistic formal settings, while preserving the generality of the concepts. This paper is a contribution towards this goal. We introduce some new variations on repeats that capture important characteristics of observed repeats in the context of molecular biology. We examine the problem of defining and matching repeats appearing at particular positions or in particular contexts.

The search for repeats is always based on the detection of elementary units with several copies occurring in the studied sequence. Maximal repeats [footnote 1] have largely been used for this purpose throughout the literature since they can represent all other repeats and have a strong mathematical structure. Maximal repeats contain longest repeats and have a well defined structure of inclusion, their number is linear (at most n exact maximal repeats in a sequence of size n), they can be computed in linear time using a suffix-tree based algorithm, and they can be used as basic blocks to compute error-prone repeats.

Footnote 1: A maximal repeat in a string S is a substring w such that there are two substrings awc and bwd of $S$ such that a ≠ b and c ≠ d ($ is a special character that does not appear in S).
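The footnote definition of a maximal repeat can be checked directly on small strings. The Python sketch below is a naive illustration only (roughly cubic time), not the linear-time suffix-tree algorithm mentioned in the text; the function name `maximal_repeats` and the default sentinel `$` are assumptions made for this example.

```python
from collections import defaultdict


def maximal_repeats(s, sentinel="$"):
    """Return the set of maximal repeats of s (naive illustration).

    A word w is kept when two of its occurrences in sentinel + s + sentinel
    differ both in their left neighbor and in their right neighbor.
    """
    if sentinel in s:
        raise ValueError("the sentinel must not appear in the sequence")
    t = sentinel + s + sentinel
    n = len(s)
    found = set()
    for length in range(1, n):
        # For each word of this length, collect its (left, right) contexts.
        contexts = defaultdict(set)
        for i in range(1, n - length + 2):
            word = t[i:i + length]
            contexts[word].add((t[i - 1], t[i + length]))
        for word, pairs in contexts.items():
            # Two distinct context pairs necessarily come from two occurrences.
            if any(a != b and c != d for (a, c) in pairs for (b, d) in pairs):
                found.add(word)
    return found
```

For instance, in `"abcabc"` only `abc` qualifies: `ab` is always followed by `c` and `bc` is always preceded by `a`, so neither is maximal.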

However, they suffer from important drawbacks with respect to real repeats.

First of all, it is hard in practice to distinguish maximal repeats from background noise. Pointing at large repeated words generally leads to meaningful units because the probability that they appear by chance is very low. In contrast, short words such as those that appear in gene regulation can occur at a frequency that is comparable to the frequency of random words of similar size. Maximality is a global concept that is defined with respect to all occurrences of a word. What makes short copies relevant generally has a local nature. This explains why a large part of this work is dedicated to the simple but powerful concept of locality of repeats. The clustering of occurrences in compact structures in the neighborhood of an initial position is itself described using a particular constraint of locality. We thus introduce a complementary notion, neighbor repeats, that is necessary to take into account the presence of elements that are indirectly local because they are in the vicinity of local units, some of which may be degenerate and non-observable. Moreover, repeated units locally have an inclusion structure and it is generally the largest ones that are of interest. Following this idea, we have introduced a formal notion of largest maximal repeat, a restriction of maximal repeats to those for which at least one occurrence is not covered by a bigger repeat. Furthermore, it is worth noticing that no level of noise or variation is allowed between two copies. In practical cases, most copies share a very similar sequence but are not fully identical. To tackle this problem, we extended the notion of context of maximal repeats, usually limited to one single nucleotide. This makes it possible to take into account small variations like SNPs (Single Nucleotide Polymorphisms). Another way to look at micro-variations on a set of occurrences is to observe the existence of a set of overlapping maximal repeats. A notion of unit reflects this structure. A repeated unit may either be a single word or be made of an overlapping assembly of more elementary units.

In addition to modeling these new concepts about repeats and their locality in genomic sequences, we propose methods and provide algorithms, with their proofs, for identifying them in a sequence S.

The paper is organized as follows: the next section provides a formalization of the algorithmic framework. Section 3 details the new concepts one by one, presents their identification algorithm and provides results on artificial and biological sequences. Before concluding, Section 4 briefly discusses the choices that have been made in this study, emphasizing a more general research track on flexibility. The paper offers supplementary material at the end for detailed algorithmic or mathematical aspects.

2 Approach

We propose in this work several refinements around the concept of repeats in sequences and about their locality. For each concept, an algorithm is provided for its detection in genomic sequences. It is worth noting that tools were actually developed, both for validating models on real genomic sequences and for making our approach available to the community. Please refer to the web site http://genouest.org/modulome to download the code.

We use the suffix tree data structure to describe the search for particular repeats. We recall that a (compact) suffix tree for a sequence S is a tree whose edges are labeled with non-empty words; all internal nodes have at least two children and each suffix of S corresponds to exactly one path from the tree's root to a leaf. Each node may be associated with the word made of letters read on the path from the root to this node. Constructing such a tree for S can be achieved in time and space linear in the length of S (see [7] for a review).

We introduce a basic generic procedure computing an attribute on each node of this structure. This procedure, presented in Algorithm 1 (in the supplementary material), is called Attributes. It is a simple depth-first recursive visit of the tree, computing on each node a value synthesizing the values of its children. It is linear in time and space with respect to the size of the analyzed sequence. People familiar with the computation of maximal repeats will find its description within this framework after proposition 3 in section 3.4. The procedure Attributes allows a more abstract presentation of algorithms while ensuring the linear basic complexity. Although such a procedure does not introduce a fundamentally new algorithmic scheme, it requires some non-trivial choices in its description and it seems to be the first time that such an explicit formal description is provided for a systematic exploitation of the suffix tree data structure. This may help in describing and extending the myriad virtues of this data structure [3] with compact and precise code.
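The bottom-up scheme described above (each node receives a value synthesized from the values of its children) can be illustrated as follows. This is a minimal Python sketch under stated assumptions, not the authors' Algorithm 1, which is in the supplementary material and not reproduced here. For brevity it uses a naive, quadratic-size suffix trie instead of a compact suffix tree, so it does not have the linear complexity claimed for the real procedure; the names `Node`, `build_suffix_trie`, `attributes` and `find` are hypothetical.

```python
class Node:
    """A node of the (uncompacted) suffix trie."""

    def __init__(self):
        self.children = {}  # letter -> child Node
        self.start = None   # 0-based start of the suffix (leaves only)
        self.attr = None    # attribute filled in by attributes()


def build_suffix_trie(s, sentinel="$"):
    """Naive suffix trie of s + sentinel; every suffix ends at its own leaf."""
    text = s + sentinel
    root = Node()
    for start in range(len(text)):
        node = root
        for letter in text[start:]:
            node = node.children.setdefault(letter, Node())
        node.start = start
    return root


def attributes(node, leaf_value, combine):
    """Depth-first recursive visit: a node's attribute is computed
    from the attributes of its children (leaves use leaf_value)."""
    if not node.children:
        node.attr = leaf_value(node)
    else:
        values = [attributes(child, leaf_value, combine)
                  for child in node.children.values()]
        node.attr = combine(values)
    return node.attr


def find(root, word):
    """Node reached by reading word from the root (KeyError if absent)."""
    node = root
    for letter in word:
        node = node.children[letter]
    return node
```

With `leaf_value=lambda leaf: [leaf.start]` and `combine=lambda values: sorted(p for v in values for p in v)`, the attribute of the node for a word is the sorted list of its 0-based occurrence positions. On `GCGATATAGAG`, the node for `G` gets `[0, 2, 8, 10]`. The recursion depth grows with the sequence length in this naive trie, so the sketch is only suitable for short strings.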

Of course, the tree can be considered only at the logical level, with enhanced suffix arrays used instead [2] in practical implementations. However, since the suffix tree is more intuitive to use, we prefer it for didactic purposes.

Our algorithms are all based on calls to the Attributes function, each time using specific procedures on the nodes of the tree.

In order to illustrate the interest of the newly introduced concepts, experiments were conducted on several genomes. Sequence data were made of complete genomes of Archaea and Bacteria with a size in the range [2.4 Mb, 3.5 Mb] and of random sequences resulting from shuffled versions of these genomes. Shuffling was achieved via the shuffleseq function of the Emboss 4.0 package [24].
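A shuffled control keeps the letter composition of a genome while destroying the order of its letters. The Python sketch below illustrates that idea with a uniform random permutation. It is not the EMBOSS shuffleseq program, whose exact behavior is not described here; `shuffle_sequence` and its `seed` parameter are hypothetical names chosen for the example.

```python
import random


def shuffle_sequence(sequence, seed=None):
    """Return a random permutation of the letters of sequence.

    The letter counts are preserved; only the order changes.
    A fixed seed makes the result reproducible.
    """
    rng = random.Random(seed)
    letters = list(sequence)
    rng.shuffle(letters)
    return "".join(letters)
```

Comparing the repeats found in a genome with those found in such a permutation gives a rough idea of which repeats could arise by chance from composition alone.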

3 Enhancing repeats concepts

In this section we introduce one by one the new concepts we propose. The Locality notion, applicable to any kind of repeat seen later, is presented first (Section 3.1). Then we explore various refinements of maximal repeats in Sections 3.2, 3.3 and 3.4. In each section we first provide the context and the formalization, and then present some experimental results.

3.1 Local repeats

The first property to be introduced in this paper concerns the distribution of occurrences of repeats. Locality is a simple restriction on repeats introducing a bounded size on the range of their occurrences. This range is formalized using a notion of scope.

The scope of a repeat gives access to the size of the regions where the repeat occurs in the sequences. Typically, it will get a low value for clustered repeats. Let us start with a very simple definition.

Figure 1: Locality of repeats and notion of scope.

Definition 1 (scope) The scope of a set of occurrences of words in a sequence is the difference between the positions of the last and the first occurrences.

The scope is a simple index related to the randomness of a distribution: a narrow distribution is typical of particularly interesting words. As a first estimate, one could define the scope of a word in a sequence as the scope of its set of occurrences. Biological sequences in fact exhibit a more refined notion of locality, since the same repeat may be found in different clusters (see for instance the CRISPR structure in [21]). If the distribution of a word is multimodal and if two clusters are far away, then its scope may be artificially large. This is why we introduce a parameter, the number of modes denoted by µ, that limits the maximum number of allowed clusters corresponding to the same repeat: allowing µ modes expresses the fact that occurrences may be clustered in 1 to µ groups. Then, we define the locality of a repeated word as the average scope of all groups.

Definition 2 (µ-locality) Let $\mu \in \mathbb{N}^{+}$ and w be a repeat with occurrence positions pos. Given an integer µ, a µ-partition is a partition $P = \{P_1, \ldots, P_{|P|}\}$ of pos that contains at most µ blocks (mutually exclusive subsets), each one with at least two elements. For each block $P_i$, we define $\min P_i$ (resp. $\max P_i$) as the smallest (resp. biggest) occurrence position contained in $P_i$.

We denote by scope(P) the sum of scopes of all clusters of repeats, $scope(P) = \sum_{i=1}^{|P|} (\max P_i - \min P_i)$, and we denote by $P^{*}$ the minimal µ-partition with respect to scope, then with respect to size. The µ-locality of the repeat is defined as the mean value of the scopes of clusters in this optimal partition:

$$\mu\_locality(w) = \frac{scope(P^{*})}{|P^{*}|}.$$

Blocks of a partition represent clusters of repeats with an associated scope; each block has at least two occurrences in order to avoid considering locally isolated elements with null scope. The locality is calculated for an optimal partition from the point of view of the set of positions of all clusters. The 1-locality is just the scope, that is, the maximal difference between positions of occurrences.

Example 1 Let S = GCGATATAGAG. The scope of G is 10, the scope of A is 6, the scope of ATA, T or TA is 2 (see figure 1). The 2-locality of G is (2 + 2)/2 = 2, corresponding to the partition {{1,3},{9,11}} of its set of occurrences.
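The scope values of Example 1 can be reproduced with a few lines of Python. This is a small sketch assuming 1-based positions and overlapping occurrences, as in the example; the helper names `occurrences` and `scope` are hypothetical.

```python
def occurrences(sequence, word):
    """1-based start positions of all (possibly overlapping) occurrences."""
    last_start = len(sequence) - len(word)
    return [i + 1 for i in range(last_start + 1)
            if sequence.startswith(word, i)]


def scope(positions):
    """Difference between the last and the first occurrence positions."""
    return max(positions) - min(positions)
```

On S = GCGATATAGAG, `occurrences(S, "G")` gives `[1, 3, 9, 11]`, so the scope of G is 10, while the two blocks {1, 3} and {9, 11} each have scope 2.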

Increasing the number of modes µ never increases the value of the µ-locality since it is computed using a minimum over all partitions into 1 to µ subsets. It has a noticeable effect on the sum of scopes only if clearly separated regions of occurrences exist. In practical applications, the user can fix the maximum number of modes by looking at the stabilization of this sum. In fact, if the number of modes is sufficiently large, the value of the locality converges to the mean interval between two occurrences and reflects the way occurrences are grouped together. We thus define the asymptotic locality or simply locality of a repeat as its µ-locality when µ tends towards its maximal value (the number of occurrences divided by two, since each block contains at least two elements).

Algorithm for computing the locality of a repeat

Enumerating all possible partitions of occurrences in order to get the µ-locality makes its computation infeasible in most practical cases. We have established a nice property that drastically reduces the set of interesting partitions and results in a quadratic algorithm, using a dynamic programming approach.

Proposition 1 Let $\mu \in \mathbb{N}^{+}$ and w be a repeat with n occurrences in a sequence whose positions are stored in pos[1..n] ($n \ge 2$). The µ_locality of w is equal to

$$\mu\_locality(w) = \frac{pos[n] - pos[1] - opt(\mu, n)}{|P|},$$

where there exists a partition $\{P_1, \cdots, P_{|P|}\}$ of pos[1..n], $|P| \le \mu$, such that:

• each block $P_i$ corresponds to an interval on pos[1..n] containing at least two positions, $a_i$ and $b_i$: $a_i = \min P_i < b_i = \max P_i < a_{i+1}$;

• the extremes of blocks satisfy the relation $\sum_{i=1}^{|P|-1} (a_{i+1} - b_i) = opt(\mu, n)$;

• the function opt satisfies the following recurrence formulae:

$$\begin{aligned} opt(1, j) &= opt(k, 2) = opt(k, 3) = 0, \\ opt(k+1, j+2) &= \max\{\, opt(k+1, j+1),\; opt(k, j) + pos[j+1] - pos[j] \,\} \qquad (k \ge 1,\ j \ge 2). \end{aligned} \qquad (1)$$
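Recurrence (1) translates almost directly into a table-filling procedure. The Python sketch below is one possible reading of Proposition 1, not the authors' implementation. It assumes that "minimal with respect to size" means the smallest number of blocks among the partitions reaching the optimum, which is recovered here as the smallest k such that opt(k, n) = opt(µ, n). The function name `mu_locality` is hypothetical.

```python
def mu_locality(positions, mu):
    """mu-locality of a repeat from its occurrence positions.

    opt[k][j] is the largest total gap that can be removed when the
    first j sorted positions are cut into at most k blocks of at least
    two consecutive positions each, following recurrence (1).
    """
    pos = sorted(positions)
    n = len(pos)
    if n < 2 or mu < 1:
        raise ValueError("need at least two positions and mu >= 1")
    # Base cases opt(1, j) = opt(k, 2) = opt(k, 3) = 0 come from the zeros.
    opt = [[0] * (n + 1) for _ in range(mu + 1)]
    for k in range(2, mu + 1):
        for j in range(4, n + 1):
            # Either the last block has more than two positions, or it is
            # exactly the last two and the gap just before it is removed.
            gap = pos[j - 2] - pos[j - 3]
            opt[k][j] = max(opt[k][j - 1], opt[k - 1][j - 2] + gap)
    best = opt[mu][n]
    # Assumption: |P| is the smallest number of blocks reaching the optimum.
    blocks = next(k for k in range(1, mu + 1) if opt[k][n] == best)
    return (pos[-1] - pos[0] - best) / blocks
```

For the occurrences of G in Example 1, `mu_locality([1, 3, 9, 11], 2)` returns 2.0, and `mu_locality([1, 3, 9, 11], 1)` returns the plain scope, 10.0. The table has (µ + 1) × (n + 1) cells and µ is at most n/2, which is consistent with the quadratic bound stated in the text.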

Proof 1 of proposition 1:

We first need to prove that the µ-locality of a repeat is obtained for a partition of its occurrences in at most µ clusters of consecutive occurrences that each contain at least two elements. Consider a partition of the set of occurrences $\{P_1, \cdots, P_{|P|}\}$ with $|P| \le \mu$. The scope of each block is denoted by $scope(P_i) = \max P_i - \min P_i$.

The partition is ordered so that $\min P_i < \min P_{i+1}$ for every $i \le |P|$. Suppose now that there exists an index i such that $\max P_i > \min P_{i+1}$.

We show that swapping these two elements leads to a better partition.

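Let $P'_i = P_i \cup \{\min P_{i+1}\} \setminus \{\max P_i\}$ and $P'_{i+1} = P_{i+1} \cup \{\max P_i\} \setminus \{\min P_{i+1}\}$.

Then $\max P'_i = \max\{(P_i \setminus \{\max P_i\}), \min P_{i+1}\} < \max P_i$ and $\min P'_i = \min\{(P_i \setminus \{\max P_i\}), \min P_{i+1}\} \ge \min P_i$.

Similarly, $\min P'_{i+1} > \min P_{i+1}$ and $\max P'_{i+1} \ge \max P_{i+1}$. Hence $scope(P'_i) <$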
