HAL Id: inria-00353690
https://hal.inria.fr/inria-00353690
Submitted on 16 Jan 2009
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Jacques Nicolas, Christine Rousseau, Anne Siegel, Pierre Peterlongo, François Coste, Patrick Durand, Sébastien Tempel, Anne-Sophie Valin, Frédéric Mahé
To cite this version:
Jacques Nicolas, Christine Rousseau, Anne Siegel, Pierre Peterlongo, François Coste, et al. Modeling local repeats on genomic sequences. [Research Report] RR-6802, INRIA. 2008, pp.43. ⟨inria-00353690⟩
Research report
ISSN 0249-6399, ISRN INRIA/RR--6802--FR+ENG
Thème BIO
Modeling local repeats on genomic sequences
Jacques Nicolas — Christine Rousseau — Anne Siegel
— Pierre Peterlongo — François Coste — Patrick Durand
— Sébastien Tempel — Anne-Sophie Valin — Frédéric Mahé
N° 6802
December 2008
Centre de recherche INRIA Rennes – Bretagne Atlantique
Jacques Nicolas∗†, Christine Rousseau†, Anne Siegel†, Pierre Peterlongo†, François Coste†, Patrick Durand‡, Sébastien Tempel§, Anne-Sophie Valin¶, Frédéric Mahé‖
Theme BIO: Biological systems
Project-Team Symbiose
Research report no. 6802, December 2008, 40 pages
Abstract: This paper deals with the specification and search of repeats of biological interest, i.e. repeats that may have a role in genomic structures or functions. Although some particular repeats such as tandem repeats have been well formalized, models developed so far remain of limited expressivity with respect to known forms of repeats in biological sequences. This paper introduces new general and realistic concepts characterizing potentially useful repeats in a sequence: Locality and several refinements around the Maximality concept. Locality is related to the distribution of occurrences of repeated elements and characterizes the way occurrences are clustered in this distribution. The associated notion of neighborhood makes it possible to exhibit, indirectly, words with a distribution of occurrences that is correlated to a given distribution. Maximality is related to the contextual delimitation of the repeated units. We have extended the usual notion of maximality, working on the inclusion relation between repeats and taking into account larger contexts. Mainly, we introduced a new repeat concept, largest maximal repeats, looking for the existence of a subset of maximal occurrences of a repeated word instead of a global maximization.
We propose algorithms checking for local and refined maximal repeats using, at the conceptual level, a suffix tree data structure. Experiments on natural and artificial data further illustrate various aspects of this new setting. All programs are available on the genouest platform, at http://genouest.org/modulome.
Key-words: local repeats, largest maximal repeats, repeats, genomes, bioinformatics
∗ To whom correspondence should be addressed
† Irisa / Inria, Campus de Beaulieu, 35042 Rennes Cedex, France
‡ Korilog SARL, BP 34, 56190 Muzillac, France
§ Giri, 1925 Landings Dr., Mountain View, CA 94043, USA
¶ Ligue Nationale Contre le Cancer, 14 Rue Corvisart, 75013 Paris, France
‖ Ecobio, CNRS UMR 6553, Campus de Beaulieu, 35042 Rennes Cedex, France
Résumé: This article studies the modeling and search of particular repeats of biological interest, that is, repeats that may play a role in genomic structures or functions. Although some types of repeats, such as tandem repeats, have already been well formalized, the models developed so far suffer from limited expressivity with respect to the known forms of biologically meaningful repeats.
This paper introduces new generic and realistic concepts that characterize repeats of interest in sequences: locality and several refinements around the notion of maximality. Locality is related to the distribution of the occurrences of repeated elements and characterizes the way occurrences are grouped in this distribution. In addition, the associated notion of neighborhood makes it possible to exhibit correlations between distributions of occurrences and thus to uncover other repeats. The notion of maximality is related to the delimitation of the repeated units. We have extended the commonly accepted notion of maximality through an approach based on inclusion between repeats and by considering contexts larger than a single character. In particular, we have proposed a new concept of repeats, called largest maximal repeats, which seeks to verify the existence of a subset of maximal occurrences of a repeat rather than relying on a global maximization.
For all the newly introduced concepts, we propose algorithms for their detection in sequences, based at the conceptual level on the suffix tree. Experimental results on real and simulated data illustrate the interest of our approach. All programs are available on the genouest platform, http://genouest.org/modulome.
Keywords: local repeats, largest maximal repeats, repeats, genomes, bioinformatics
1 Introduction
A large number of studies have revealed that genome sequences contain repeated sub-sequences playing major roles in the structure, the function, the dynamics and the evolution of genomes [18, 22, 11].
There exists a large literature covering the problem of finding repeats, which can be mainly divided into three categories depending on the type of targeted repeats: exact repeats, repeats with errors and structured repeats.
Exact repeats have been extensively studied, leading to various concepts such as longest repeats [10, 16], maximal repeats [8, 12, 13, 23] and super-maximal repeats [8, 1]. The concept of maximal repeat is quite attractive and simply focuses on sequences present in at least two largest common blocks, without possible left or right extension, and without any biological a priori.
A second category of algorithms introduces an error model in the specification of repeated units, such as longest repeats with a block of don't cares [6], maximal pairs with bounded gap [5, 14], tandem repeats [25, 26, 4] and repeats with edit distance [15]. These kinds of algorithms are better suited to analyzing the many repeats contained in genome sequences, which usually contain copies with multiple variations.
Finally, the third kind of approach targets the search of structured motifs that consist of an ordered collection of p > 1 parts separated from one another by constrained spacers [17, 9, 19, 20]. These algorithms are of particular interest in studying gene expression and gene regulation.
Our concern in this paper is the specification and search of repeats of biological interest, i.e. repeats that may have a role in genomic structures or functions. This problem is already addressed by algorithms from the last two categories. Algorithms allowing the treatment of errors can be used to locate gene n-plication (n > 1), and various types of tandem repeats which are, among others, constituents of centromeres and telomeres. Algorithms looking for structured repeats have been proposed to tackle the difficult problem of locating the set of short motifs constituting the regulatory factors involved in gene transcription.
However, all models developed so far remain of limited expressivity with respect to known forms of repeats in biological sequences. Transposable elements for instance exhibit complex copying patterns that are only partially understood so far. Further studies are needed to develop more realistic formal settings, while preserving the generality of the concepts. This paper is a contribution towards this goal. We introduce some new variations on repeats that capture important characteristics of observed repeats in the context of molecular biology. We examine the problem of defining and matching repeats appearing at particular positions or in particular contexts.
The search for repeats is always based on the detection of elementary units with several copies occurring in the studied sequence. Maximal repeats (footnote 1) have largely been used for this purpose throughout the literature since they can represent all other repeats and have a strong mathematical structure. Maximal repeats contain longest repeats and have a well defined structure of inclusion, their number is linear (at most n exact maximal repeats in a sequence of size n), they can be computed in linear time using a suffix-tree based algorithm, and they can be used as basic blocks to compute error-prone repeats.
Footnote 1: A maximal repeat in a string $S$ is a substring $w$ such that there are two substrings $awc$ and $bwd$ of $\$S\$$ such that $a \neq b$ and $c \neq d$ (\$ is a special character that does not appear in $S$).
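To make the footnote definition concrete, the following is a minimal brute-force sketch in Python. It is not the linear-time suffix-tree algorithm referred to above: it enumerates every substring, so it is only usable on very short strings. The function name `maximal_repeats` and the choice of `$` as the sentinel are assumptions made for this illustration.

```python
from collections import defaultdict


def maximal_repeats(s, sentinel="$"):
    """Return {word: number of occurrences} for every maximal repeat of s.

    Brute-force illustration of the definition (hypothetical helper): w is kept
    when two of its occurrences differ both in their left context and in their
    right context inside sentinel + s + sentinel.
    """
    if sentinel in s:
        raise ValueError("the sentinel must not appear in the sequence")
    padded = sentinel + s + sentinel
    n = len(s)
    contexts = defaultdict(list)  # word -> [(left letter, right letter), ...]
    for start in range(1, n + 1):
        for end in range(start + 1, n + 2):
            contexts[padded[start:end]].append((padded[start - 1], padded[end]))
    result = {}
    for word, ctx in contexts.items():
        found = any(
            a != b and c != d
            for i, (a, c) in enumerate(ctx)
            for (b, d) in ctx[i + 1:]
        )
        if found:
            result[word] = len(ctx)
    return result


print(maximal_repeats("xabcyabcz"))  # -> {'abc': 2}
```

In this toy string, `ab` and `bc` are repeated but not maximal, since all their occurrences share the same right (respectively left) context; only `abc` qualifies.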
However, they suffer from important drawbacks with respect to real repeats.
First of all, it is hard in practice to distinguish maximal repeats from background noise. Pointing at large repeated words generally leads to meaningful units because the probability that they appear by chance is very low. In contrast, short words such as those that appear in gene regulation can occur at a frequency that is comparable to the frequency of random words of similar size. Maximality is a global concept that is defined with respect to all occurrences of a word. What makes short copies relevant generally has a local nature. This explains why a large part of this work is dedicated to the simple but powerful concept of locality of repeats. The clustering of occurrences in compact structures in the neighborhood of an initial position is itself described using a particular constraint of locality. We thus introduce a complementary notion, neighbor repeats, that is necessary to take into account the presence of elements that are indirectly local because they are in the vicinity of local units, some of which may be degenerated and not observable. Moreover, repeated units locally have an inclusion structure and it is generally the largest ones that are of interest. Within this idea, we have introduced a formal notion of largest maximal repeat, that is, a restriction of maximal repeats to those for which at least one occurrence is not covered by a bigger repeat. Furthermore, it is worth noticing that no level of noise or variation is allowed between two copies. In practical cases, most copies share a very similar sequence but are not fully identical. To tackle this problem, we extended the notion of context of the maximal repeats, usually limited to one single nucleotide. This makes it possible to take into account small variations like SNPs (Single Nucleotide Polymorphisms). Another way to look at micro-variations on a set of occurrences is to observe the existence of a set of overlapping maximal repeats. A notion of unit reflects this structure. A repeated unit may either be a single word or be made of an overlapping assembly of more elementary units.
In addition to modeling these new concepts about repeats and their locality in genomic sequences, we propose methods and provide algorithms, with their proofs, for their identification in a sequence S.
The paper is organized as follows: the next section provides a formalization of the algorithmic framework. Section 3 details the new concepts one by one, presents their identification algorithms and provides results on artificial and biological sequences. Before concluding, Section 4 briefly discusses the choices that have been made in this study, emphasizing a more general research track on flexibility. The paper offers supplementary material at the end for detailed algorithmic or mathematical aspects.
2 Approach
We propose in this work several refinements around the concept of repeats in sequences and about their locality. For each concept, an algorithm is provided for its detection in genomic sequences. It is worth noticing that tools were actually developed, both for validating models on real genomic sequences and for making our approach available to the community. Please refer to the web site http://genouest.org/modulome for downloading the code.
We use the suffix tree data structure to describe the search for particular repeats. We recall that a (compact) suffix tree for a sequence S is a tree whose edges are labeled with non empty words; all internal nodes have at least two children and each suffix of S corresponds to exactly one path from the tree's root to a leaf. Each node may be associated with the word made of letters read on the path from the root to this node. Constructing such a tree for S can be achieved in time and space linear in the length of S (see [7] for a review).
We introduce a basic generic procedure computing an attribute on each node of this structure. This procedure, presented in Algorithm 1 (in the supplementary material), is called Attributes. It is a simple depth-first recursive visit of the tree, computing on each node a value synthesizing the values of its children. It is linear in time and space with respect to the size of the analyzed sequence. People familiar with the computation of maximal repeats will find its description within this framework after Proposition 3 in Section 3.4. The procedure Attributes allows giving a more abstract presentation of algorithms while ensuring the linear basic complexity. Although such a procedure does not introduce a fundamentally new algorithmic scheme, it requires some non trivial choices in its description, and it seems to be the first time that such an explicit formal description is provided for a systematic exploitation of the suffix tree data structure. This may help in describing and extending the myriad virtues of this data structure [3] with compact and precise code.
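The idea of a depth-first visit that synthesizes a value on each node from the values of its children can be sketched as follows. This is not the paper's Algorithm 1 (which is given in the supplementary material and works on a compact suffix tree in linear time); it is a minimal Python illustration on a naive, uncompacted suffix trie built in quadratic time. The names `Node`, `build_suffix_trie`, `attributes` and `find` are hypothetical, and the attribute chosen here (the number of leaves below a node, i.e. the number of occurrences of the corresponding word) is just one possible example.

```python
class Node:
    """Node of a naive, uncompacted suffix trie (one edge per letter)."""

    def __init__(self):
        self.children = {}     # letter -> Node
        self.attribute = None  # filled in by attributes()


def build_suffix_trie(sequence, sentinel="$"):
    """Insert every suffix of sequence + sentinel (quadratic, illustration only)."""
    text = sequence + sentinel
    root = Node()
    for start in range(len(text)):
        node = root
        for letter in text[start:]:
            node = node.children.setdefault(letter, Node())
    return root


def attributes(node, leaf_value, combine):
    """Depth-first visit: a node's value is synthesized from its children's values."""
    if not node.children:
        node.attribute = leaf_value(node)
    else:
        node.attribute = combine(
            [attributes(child, leaf_value, combine)
             for child in node.children.values()]
        )
    return node.attribute


def find(root, word):
    """Return the node reached by reading word from the root, or None."""
    node = root
    for letter in word:
        node = node.children.get(letter)
        if node is None:
            return None
    return node


root = build_suffix_trie("GCGATATAGAG")
# Attribute = number of leaves below the node = number of occurrences of its word.
attributes(root, leaf_value=lambda leaf: 1, combine=sum)
print(find(root, "G").attribute, find(root, "ATA").attribute)
```

Other attributes (for example the smallest and largest occurrence positions below a node) would only require different `leaf_value` and `combine` arguments. Because the visit is recursive and the trie is as deep as the sequence is long, this sketch is limited to short sequences by Python's recursion limit.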
Of course, the tree can be considered only at the logical level, and enhanced suffix arrays can be used instead [2] in practical implementations. However, since its usage is more intuitive, we prefer using the suffix tree for didactic purposes.
Our algorithms are all based on Attributes function calls, using each time specific procedures on nodes of the tree.
In order to illustrate the interest of newly introduced concepts, experiments were conducted on several genomes. Sequence data were made of complete genomes of Archaea and Bacteria with a size in the range [2.4 Mb, 3.5 Mb] and of random sequences resulting from shuffled versions of these genomes. Shuffling was achieved via the shuffleseq function of the Emboss 4.0 package [24].
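For readers who want to reproduce this kind of random control without EMBOSS, a comparable shuffle can be sketched in a few lines of Python. This is not the shuffleseq implementation: it is a minimal sketch that assumes the control only needs to preserve the letter composition of the genome, and the function name `shuffle_sequence` and its `seed` parameter are hypothetical.

```python
import random


def shuffle_sequence(sequence, seed=None):
    """Return a random permutation of the letters of sequence.

    The letter composition is preserved, but any higher-order structure
    (repeats, clusters of occurrences) is destroyed. Assumed behavior for
    this sketch, not a port of EMBOSS shuffleseq.
    """
    rng = random.Random(seed)  # a seed makes the control reproducible
    letters = list(sequence)
    rng.shuffle(letters)
    return "".join(letters)


shuffled = shuffle_sequence("GCGATATAGAG", seed=0)
print(shuffled)
```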
3 Enhancing repeats concepts
In this section we introduce one by one the new concepts we propose. The Locality notion, applicable to any kind of repeat seen later, is first exposed (Section 3.1). Then we explore various maximal repeats refinements in Sections 3.2, 3.3 and 3.4. In each section we provide first the context and the formalization, and second, we propose some experimental results.
3.1 Local repeats
The first property to be introduced in this paper concerns the distribution of occurrences of repeats. Locality is a simple restriction on repeats introducing a bounded size on the range of their occurrences. This range is formalized using a notion of scope.
The scope of a repeat gives access to the size of the regions where the repeat occurs in the sequences. Typically, it will get a low value for clustered repeats. Let us start with a very simple definition.
Figure 1: Locality of repeats and notion of scope.
Definition 1 (scope) The scope of a set of occurrences of words in a sequence is the difference between the positions of the last and the first occurrences.
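Definition 1 translates directly into code. The sketch below is a minimal Python illustration on the sequence GCGATATAGAG used in the paper's Example 1; positions are 1-based start positions, overlapping occurrences are counted, and the helper names `occurrences` and `scope` are assumptions of this sketch.

```python
def occurrences(sequence, word):
    """1-based start positions of all (possibly overlapping) occurrences of word."""
    return [
        i + 1
        for i in range(len(sequence) - len(word) + 1)
        if sequence.startswith(word, i)
    ]


def scope(positions):
    """Difference between the last and the first occurrence positions."""
    return max(positions) - min(positions)


S = "GCGATATAGAG"
for word in ("G", "A", "ATA", "T", "TA"):
    print(word, occurrences(S, word), scope(occurrences(S, word)))
```

Note that `scope` raises a `ValueError` on an empty list, i.e. when the word does not occur in the sequence.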
The scope is a simple index related to the randomness of a distribution: a narrow distribution is typical of particularly interesting words. At first estimate, one could define the scope of a word in a sequence as the scope of its set of occurrences. Biological sequences exhibit in fact a more refined notion of locality since a same repeat may be found in different clusters (see for instance the CRISPR structure in [21]). If the distribution of a word is multimodal and if two clusters are far away, then its scope may be artificially large. This is why we introduce a parameter, the number of modes denoted by µ, that limits the maximum number of allowed clusters corresponding to a same repeat: allowing µ modes expresses the fact that occurrences may be clustered in 1 to µ groups. Then, we define the locality of a repeated word as the average scope of all groups.
Definition 2 (µ-locality) Let $\mu \in \mathbb{N}^+$ and $w$ be a repeat with occurrence positions $pos$. Given an integer $\mu$, a $\mu$-partition is a partition $P = \{P_1, \ldots, P_{|P|}\}$ of $pos$ that contains at most $\mu$ blocks (mutually exclusive subsets), each one with at least two elements. For each block $P_i$, we define $\min P_i$ (resp. $\max P_i$) as the smallest (resp. biggest) occurrence position contained in $P_i$.
We denote by $\mathrm{scope}(P)$ the sum of scopes of all clusters of repeats,
$$\mathrm{scope}(P) = \sum_{i=1}^{|P|} (\max P_i - \min P_i),$$
and we denote by $P^*$ the minimal $\mu$-partition with respect to scope, then with respect to size. The $\mu$-locality of the repeat is defined as the mean value of the scopes of clusters in this optimal partition:
$$\mu\_\mathrm{locality}(w) = \frac{\mathrm{scope}(P^*)}{|P^*|}.$$
Blocks of a partition represent clusters of repeats with an associated scope; each block has at least two occurrences in order to avoid considering locally isolated elements with null scope. The locality is calculated for an optimal partition from the point of view of the set of positions of all clusters. The 1-locality is just the scope, that is, the maximal difference between positions of occurrences.
Example 1 Let S = GCGATATAGAG. The scope of G is 10, the scope of A is 6, the scope of ATA, T or TA is 2 (see Figure 1). The 2-locality of G is (2 + 2)/2 = 2, corresponding to the partition {{1,3},{9,11}} of its set of occurrences.
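Definition 2 can be checked on Example 1 by exhaustive enumeration. The following Python sketch lists every partition of the occurrence positions into at most µ blocks of at least two elements and keeps the one with the smallest total scope; it is exponential and only meant to illustrate the definition on tiny inputs. The function names are hypothetical, and reading "then with respect to size" as "fewest blocks among partitions of equal scope" is an assumption of this sketch.

```python
def mu_partitions(positions, mu):
    """Yield every partition of positions into at most mu blocks of size >= 2."""
    items = sorted(positions)

    def place(index, blocks):
        if index == len(items):
            if all(len(block) >= 2 for block in blocks):
                yield [list(block) for block in blocks]
            return
        for block in blocks:  # put the current position in an existing block
            block.append(items[index])
            yield from place(index + 1, blocks)
            block.pop()
        if len(blocks) < mu:  # or open a new block with it
            blocks.append([items[index]])
            yield from place(index + 1, blocks)
            blocks.pop()

    yield from place(0, [])


def mu_locality_bruteforce(positions, mu):
    """Return (mu-locality, optimal partition) by exhaustive search."""
    def total_scope(partition):
        return sum(max(block) - min(block) for block in partition)

    # Smallest total scope first; ties broken by number of blocks (assumption).
    best = min(mu_partitions(positions, mu),
               key=lambda p: (total_scope(p), len(p)))
    return total_scope(best) / len(best), best


print(mu_locality_bruteforce([1, 3, 9, 11], 2))  # -> (2.0, [[1, 3], [9, 11]])
```

With µ = 1 the only admissible partition is the whole set of positions, so the function returns the plain scope, 10 for the occurrences of G.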
Increasing the number of modes µ never increases the value of the µ-locality since it is computed using a minimum over all partitions into 1 to µ subsets. It has a noticeable effect on the sum of scopes only if clearly separated regions of occurrences exist. In practical applications, the user can fix the maximum number of modes by looking at the stabilization of this sum. In fact, if the number of modes is sufficiently large, the value of the locality converges to the mean interval between two occurrences and reflects the way occurrences are grouped together. We thus define the asymptotic locality or simply locality of a repeat as its µ-locality when µ tends towards its maximal value (the number of occurrences divided by two, since each block contains at least two elements).
Algorithm for computing the locality of a repeat
Enumerating all possible partitions of occurrences in order to get the µ-locality renders its computation unfeasible in most practical cases. We have established a nice property that drastically reduces the set of interesting partitions and results in a quadratic algorithm, applying a dynamic programming approach.
Proposition 1 Let $\mu \in \mathbb{N}$ and $w$ be a repeat with $n$ occurrences in a sequence whose positions are stored in $pos[1..n]$ ($n \ge 2$). The $\mu$-locality of $w$ is equal to
$$\mu\_\mathrm{locality}(w) = \frac{pos[n] - pos[1] - opt(\mu, n)}{|P|},$$
where there exists a partition $\{P_1, \cdots, P_{|P|}\}$ of $pos[1..n]$, $|P| \le \mu$, such that:
• each block $P_i$ corresponds to an interval on $pos[1..n]$ containing at least two positions, $a_i$ and $b_i$: $a_i = \min P_i < b_i = \max P_i < a_{i+1}$;
• the extremes of blocks satisfy the relation $\sum_{i=1}^{|P|-1} (a_{i+1} - b_i) = opt(\mu, n)$;
• the function $opt$ satisfies the following recurrent formulae:
$$opt(1, j) = opt(k, 2) = opt(k, 3) = 0, \qquad opt(k+1, j+2) = \max \begin{cases} opt(k+1, j+1), \\ opt(k, j) + pos[j+1] - pos[j] \end{cases} \quad (k \ge 1,\ j \ge 2). \tag{1}$$
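Recurrence (1) can be turned into a small dynamic program. The Python sketch below is one possible reading of Proposition 1, not the authors' implementation: `best[k][j]` stores the largest sum of inter-block gaps that can be removed from the scope when the first `j` positions are split into at most `k` blocks of consecutive positions, together with the number of blocks used, which is needed for the denominator $|P|$. Tracking the block count as a tie-breaker (fewest blocks for an equal gap sum) and the function name `mu_locality` are assumptions of this sketch.

```python
def mu_locality(positions, mu):
    """mu-locality of a repeat from its occurrence positions, in O(mu * n) steps."""
    pos = sorted(positions)
    n = len(pos)
    if n < 2 or mu < 1:
        raise ValueError("need at least two occurrences and mu >= 1")
    # best[k][j] = (removed gap sum, -number of blocks) for the first j positions
    # and at most k blocks. A single block removes nothing: (0, -1).
    best = [[(0, -1)] * (n + 1) for _ in range(mu + 1)]
    for k in range(2, mu + 1):
        for j in range(4, n + 1):
            # Case 1: position j extends the last block of the first j - 1 positions.
            extend = best[k][j - 1]
            # Case 2: the last block is exactly {j - 1, j}; the gap between
            # positions j - 2 and j - 1 (1-based) is removed from the scope.
            removed, neg_blocks = best[k - 1][j - 2]
            gap = pos[j - 2] - pos[j - 3]  # 0-based indices into pos
            split = (removed + gap, neg_blocks - 1)
            best[k][j] = max(extend, split)
    removed, neg_blocks = best[mu][n]
    return (pos[-1] - pos[0] - removed) / -neg_blocks


print(mu_locality([1, 3, 9, 11], 2))  # -> 2.0
```

Since µ is at most half the number of occurrences, the double loop performs a number of steps that is at most quadratic in the number of occurrences, in line with the complexity stated above. Note that the optimal partition does not always use the maximal number of blocks: for positions 0, 1, 2, 102, 103, 104 and µ = 3, two blocks separated by the large gap give a smaller total scope than three blocks of two positions.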
Proof 1 of Proposition 1:
We first need to prove that the $\mu$-locality of a repeat is obtained for a partition of its occurrences in at most $\mu$ clusters of consecutive occurrences that each contain at least two elements. Consider a partition of the set of occurrences $\{P_1, \cdots, P_{|P|}\}$ with $|P| \le \mu$. The scope of each block is denoted by $\mathrm{scope}(P_i) = \max P_i - \min P_i$.
The partition is ordered so that $\min P_i < \min P_{i+1}$ for every $i < |P|$. Suppose now that there exists an index $i$ such that $\max P_i > \min P_{i+1}$.
We show that swapping these two elements leads to a better partition.
Let $P_i' = P_i \cup \{\min P_{i+1}\} \setminus \{\max P_i\}$ and $P_{i+1}' = P_{i+1} \cup \{\max P_i\} \setminus \{\min P_{i+1}\}$.
Then $\max P_i' = \max\{(P_i \setminus \{\max P_i\}), \min P_{i+1}\} < \max P_i$ and $\min P_i' = \min\{(P_i \setminus \{\max P_i\}), \min P_{i+1}\} \ge \min P_i$.
Similarly, $\min P_{i+1}' > \min P_{i+1}$ and $\max P_{i+1}' \le \max P_{i+1}$. Hence $\mathrm{scope}(P_i') <$