HAL Id: hal-00452240
https://hal.archives-ouvertes.fr/hal-00452240
Submitted on 20 Mar 2013
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Algorithms for computing approximate repetitions in musical sequences
Emilios Cambouropoulos, Maxime Crochemore, Costas S. Iliopoulos, Laurent Mouchard, Yoan J. Pinzón Ardila
To cite this version:
Emilios Cambouropoulos, Maxime Crochemore, Costas S. Iliopoulos, Laurent Mouchard, Yoan J. Pinzón Ardila. Algorithms for computing approximate repetitions in musical sequences. Australasian Workshop On Combinatorial Algorithms, 1999, Australia. pp.129-144. hal-00452240
Algorithms for Computing Approximate Repetitions in Musical Sequences

Emilios Cambouropoulos 1*, Maxime Crochemore 2**, Costas S. Iliopoulos 3***, Laurent Mouchard 4†, and Yoan J. Pinzon 3

1 Austrian Research Institute for Artificial Intelligence, Schottengasse 3, 1010 Wien, Austria
emilios@ai.univie.ac.at
www.ai.univie.ac.at/emilios

2 Institut Gaspard Monge, Université de Marne-la-Vallée, 77454 Marne-la-Vallée CEDEX 2, France.
mac@univ-mlv.fr
www-igm.univ-mlv.fr/mac

3 Dept. Computer Science, King's College London, London WC2R 2LS, England, and School of Computing, Curtin University of Technology, GPO Box 1987 U, WA.
{csi,pinzon}@dcs.kcl.ac.uk
www.dcs.kcl.ac.uk/staff/csi, www.dcs.kcl.ac.uk/pg/pinzon

4 LIFAR-ABISS, Université de Rouen, 76821 Mont Saint Aignan, France, and School of Computing, Curtin University of Technology, GPO Box 1987 U, WA.
lm@dir.univ-rouen.fr
www.dir.univ-rouen.fr/lm
Abstract. Here we introduce two new notions of approximate matching with application in computer-assisted music analysis. We present algorithms for each notion of approximation: for approximate string matching and for computing approximate squares.

Keywords: String algorithms, approximate string matching, dynamic programming, computer-assisted music analysis.
1 Introduction

This paper focuses on a set of string pattern-matching problems that arise in musical analysis, and especially in musical information retrieval. A musical score can be viewed as a string: at a very rudimentary level, the alphabet could simply be the set of notes in the chromatic or diatonic notation, or the set of intervals that appear between notes (e.g. pitch may be represented as MIDI numbers and
* Supported by the START programme Y99-INF, Austrian Federal Ministry of Science and Transport.
** Partially supported by the C.N.R.S. Program "Genomes".
*** Partially supported by the EPSRC grant GR/J17844.
† Partially supported by the C.N.R.S. Program "Genomes".
musical works play a crucial role in discovering similarities between different musical entities and may be used for establishing "characteristic signatures" (see [6]). Such algorithms can be particularly useful for melody identification and musical retrieval.
Both exact and approximate matching techniques have been used for a variety of musical applications (see overviews in McGettrick [23]; Crawford et al. [6]; Rolland et al. [28]; Cambouropoulos et al. [4]). The specific problem studied in this paper is pattern-matching for numeric strings where a certain tolerance is allowed during the matching procedure. This type of pattern-matching has been considered necessary for various musical applications and has been used by some researchers (see, for instance, Cope [5]). A number of efficient algorithms that tackle various aspects of this problem will be presented in this paper.
Most computer-aided musical applications adopt an absolute numeric pitch representation (most commonly MIDI pitch and pitch intervals in semitones; duration is also encoded in a numeric form). The absolute pitch encoding, however, may be insufficient for applications in tonal music as it disregards tonal qualities of pitches and pitch-intervals (e.g. a tonal transposition from a major to a minor key results in a different encoding of the musical passage and thus exact matching cannot detect the similarity between the two passages). One way to account for similarity between closely related but non-identical musical strings is to use what will be referred to as $\delta$-approximate matching (and $\gamma$-approximate matching). In $\delta$-approximate matching, equal-length patterns consisting of integers match if each corresponding integer differs by not more than $\delta$; e.g. a C-major $\{60, 64, 65, 67\}$ and a C-minor $\{60, 63, 65, 67\}$ sequence can be matched if a tolerance $\delta = 1$ is allowed in the matching process ($\gamma$-approximate matching is described in the next section). Two simple musical examples that illustrate the usefulness of the proposed pattern-matching techniques are presented in Appendices I and II.
Exact repetitions have been studied extensively. The repetitions can be either concatenated with the original substring, or they may overlap, or they may not. Algorithms for finding non-overlapping repetitions in a given string can be found in [1, 8, 15, 21, 18, 26] and algorithms for computing overlapping repetitions can be found in [3, 13, 14, 25]. A natural extension of the repetitions problem is to allow the presence of errors; that is, the identification of substrings that are duplicated to within a certain tolerance $k$ (usually edit distance or Hamming distance). Moreover, the repeated substring may be subject to other constraints: it may be required to be of at least a certain length, and certain positions in it may be required to be invariant.
Furthermore, efficient algorithms for computing the approximate repetitions are also directly applicable to molecular biology (see [11, 17, 24]) and in particular in DNA sequencing by hybridization ([27]), reconstruction of DNA sequences from known DNA fragments (see [29, 30]), in human organ and bone marrow transplantation, as well as the determination of evolutionary trees among distinct species ([29]).
that of finding evolutionary chains: given a string $t$ (the "text") and a pattern $p$ (the "motif"), find whether there exists a sequence $u_1 = p, u_2, \ldots, u_\ell$ occurring in the text $t$ such that $u_{i+1}$ occurs to the right of $u_i$ in $t$ and $u_i$ and $u_{i+1}$ are "similar" for $1 \le i < \ell$ (i.e. they differ by a certain number of symbols). In [9] and [7] algorithms for overlapping and non-overlapping evolutionary chains were presented and several variants of the problem were studied: computing the longest chain, computing the chain with the least number of errors.
The paper is organised as follows. In the next section we present some basic definitions for strings and background notions for approximate matching. In Section 3 we present an algorithm for $\delta$-approximate (the first notion of approximation) pattern matching. In Section 4 we present an algorithm for $\{\delta,\gamma\}$-approximate (the second notion of approximation) pattern matching. In Section 5 we present algorithms for computing all $\delta$- and $\{\delta,\gamma\}$-approximate squares in a given text. Finally, in Section 6 we present our conclusions and open problems.
2 Background and basic string definitions

A string is a sequence of zero or more symbols from an alphabet $\Sigma$; the string with zero symbols is denoted by $\epsilon$. The set of all strings over the alphabet $\Sigma$ is denoted by $\Sigma^*$. A string $x$ of length $n$ is represented by $x_1 \ldots x_n$, where $x_i \in \Sigma$ for $1 \le i \le n$. A string $w$ is a substring of $x$ if $x = uwv$ for $u, v \in \Sigma^*$; we equivalently say that the string $w$ occurs at position $|u|+1$ of the string $x$. The position $|u|+1$ is said to be the starting position of $w$ in $x$ and the position $|w|+|u|$ the end position of $w$ in $x$. A string $w$ is a prefix of $x$ if $x = wu$ for $u \in \Sigma^*$. Similarly, $w$ is a suffix of $x$ if $x = uw$ for $u \in \Sigma^*$.
The string $xy$ is a concatenation of two strings $x$ and $y$. The concatenation of $k$ copies of $x$ is denoted by $x^k$. For two strings $x = x_1 \ldots x_n$ and $y = y_1 \ldots y_m$ such that $x_{n-i+1} \ldots x_n = y_1 \ldots y_i$ for some $i \ge 1$, the string $x_1 \ldots x_n y_{i+1} \ldots y_m$ is a superposition of $x$ and $y$. We say that $x$ and $y$ overlap.
Let $x$ be a string of length $n$. A prefix $x_1 \ldots x_p$, $1 \le p < n$, of $x$ is a period of $x$ if $x_i = x_{i+p}$ for all $1 \le i \le n-p$. The period of a string $x$ is the shortest period of $x$. A string $y$ is a border of $x$ if $y$ is a prefix and a suffix of $x$.
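The definitions of superposition, period and border can be checked on small inputs with a few lines of Python. The following is only an illustrative sketch: the function names `superposition`, `periods` and `borders` are hypothetical, periods are reported by their length $p$, and `borders` is restricted to non-empty proper borders, which is an assumption made here for convenience.

```python
def superposition(x, y):
    """Superposition of x and y using the largest overlap i >= 1, or None if none exists."""
    for i in range(min(len(x), len(y)), 0, -1):
        if x[len(x) - i:] == y[:i]:      # suffix of x of length i equals prefix of y
            return x + y[i:]
    return None


def periods(x):
    """Lengths p (1 <= p < n) such that x[i] == x[i + p] wherever both are defined."""
    n = len(x)
    return [p for p in range(1, n) if all(x[i] == x[i + p] for i in range(n - p))]


def borders(x):
    """Non-empty proper borders of x: strings that are both a prefix and a suffix."""
    return [x[:k] for k in range(1, len(x)) if x[:k] == x[len(x) - k:]]


print(superposition("abcab", "cabd"))  # abcabd
print(periods("abaab"))                # [3]
print(borders("abaab"))                # ['ab']
```

In the last two lines the period of length 3 and the border of length 2 add up to the string length 5, as expected from the definitions.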
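Let $\Sigma$ be an alphabet of integers and $\delta$ an integer. Two symbols $a, b$ of $\Sigma$ are said to be $\delta$-approximate, denoted $a =_{\delta} b$, if and only if

$$|a - b| \le \delta$$

We say that two strings $x, y$ are $\delta$-approximate, denoted $x \stackrel{\delta}{=} y$, if and only if

$$|x| = |y| \quad \text{and} \quad x_i =_{\delta} y_i, \ \forall i \in \{1..|x|\} \qquad (2.1)$$

Let $\gamma$ be an integer. Two strings $x, y$ are said to be $\gamma$-approximate, denoted $x \stackrel{\gamma}{=} y$, if and only if

$$|x| = |y| \quad \text{and} \quad \sum_{i=1}^{|x|} |x_i - y_i| < \gamma \qquad (2.2)$$

Furthermore, we say that two strings $x, y$ are $\{\gamma,\delta\}$-approximate, denoted $x \stackrel{\delta,\gamma}{=} y$, if and only if $x$ and $y$ satisfy conditions (2.1) and (2.2).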
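Conditions (2.1) and (2.2) translate directly into predicates on integer sequences. The sketch below is a minimal illustration, not part of the algorithms of this paper; the function names are hypothetical, and the strict comparison in `gamma_approx` follows (2.2) as written.

```python
def delta_approx(x, y, delta):
    """Condition (2.1): equal length and every pair of symbols differs by at most delta."""
    return len(x) == len(y) and all(abs(a - b) <= delta for a, b in zip(x, y))


def gamma_approx(x, y, gamma):
    """Condition (2.2): equal length and the sum of absolute differences is below gamma."""
    return len(x) == len(y) and sum(abs(a - b) for a, b in zip(x, y)) < gamma


def delta_gamma_approx(x, y, delta, gamma):
    """Both (2.1) and (2.2) hold."""
    return delta_approx(x, y, delta) and gamma_approx(x, y, gamma)


c_major = [60, 64, 65, 67]
c_minor = [60, 63, 65, 67]
print(delta_approx(c_major, c_minor, 1))           # True
print(delta_approx(c_major, c_minor, 0))           # False
print(delta_gamma_approx(c_major, c_minor, 1, 2))  # True
```

The two sequences are the C-major and C-minor MIDI examples from the introduction; they differ in one position by one semitone.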
3 $\delta$-Approximate Pattern Matching

The problem of $\delta$-approximate pattern matching is formally defined as follows: given a string $t = t_1 \ldots t_n$ and a pattern $p = p_1 \ldots p_m$, compute all positions $j$ of $t$ such that

$$p \stackrel{\delta}{=} t[j..j+m-1]$$
The algorithm is based on the $O(1)$-time computation of the "Delta states" $DState_j$, $j \in \{1..n\}$, by using bit operations under the assumption that $m \le w$, where $w$ is the number of bits in a machine word. The basic steps of the algorithm are as follows:
1. First we compute the "Delta table" $DT$: we set $DT(\alpha) = r$, where $\alpha$ denotes a symbol occurring in $t$ and $r = r_1 \ldots r_m$ is a binary word with $r_i$ equal to 1 if $|\alpha - p_i| \le \delta$, otherwise $r_i$ is equal to 0, for $i \in \{1..m\}$.
2. Let LeftShift be a bit-wise operation that shifts the bits of a binary word by one position to the left. We define

$$DState_j = (\text{LeftShift}(DState_{j-1}) \text{ OR } 1) \text{ AND } DT[t_j] \qquad (3.1)$$

for $j = 1 \ldots n$ and $DState_0 = 0$; hence this procedure is called "Shift-And". Once we have computed the $DT$ table, we can use it to compute $DState_j$ for $j = 1 \ldots n$, using the recursive formula (3.1).
3. We say that there is a $\delta$-approximate match (or simply $\delta$-match) at position $j-m+1$ if and only if the $m$-th bit of $DState_j$ is 1, or equivalently if and only if $DState_j$ is greater than or equal to $2^{m-1}$ when it is viewed as a decimal integer.
Example. For $\Sigma = \{1, \ldots, 9\}$ let us consider $p = 3,4,6,2$, $t = 3,4,6,2,8,2,4,5,7,1$ and $\delta = 1$. In the preprocessing table, $DT(\alpha)$ denotes the positions where $|\alpha - p_i| \le \delta$. For example, $DT[3] = 1011$ because $|3 - p_i| \le 1$ for $i = 1, 2, 4$.
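 i  p_i  DT[1]  DT[2]  DT[3]  DT[4]  DT[5]  DT[6]  DT[7]  DT[8]  DT[9]
 4   2     1      1      1      0      0      0      0      0      0
 3   6     0      0      0      0      1      1      1      0      0
 2   4     0      0      1      1      1      0      0      0      0
 1   3     0      1      1      1      0      0      0      0      0

Table 1. The table $DT$ for pattern $p = 3, 4, 6, 2$ and alphabet $\Sigma = \{1, \ldots, 9\}$. Rows are listed from $i = 4$ down to $i = 1$, so each column read from top to bottom gives the binary word $DT[\alpha]$.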
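Table 1 can be reproduced with a short Python sketch. The helper name `delta_table` is hypothetical, and storing $r_i$ in bit $i-1$ of an integer (so that $r_m$ is the most significant bit when the word is printed) is an assumption chosen to match the way the words are written in the example.

```python
def delta_table(p, alphabet, delta):
    """Map each symbol a to a word whose bit i-1 is 1 when |a - p_i| <= delta."""
    table = {}
    for a in alphabet:
        word = 0
        for i, p_i in enumerate(p):       # i = 0 corresponds to p_1
            if abs(a - p_i) <= delta:
                word |= 1 << i
        table[a] = word
    return table


dt = delta_table([3, 4, 6, 2], range(1, 10), 1)
print(format(dt[3], "04b"))  # 1011
```

Printing `dt[a]` with four binary digits for every `a` from 1 to 9 gives the columns of Table 1.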
The table below evaluates $DState$ using the relation (3.1). For example,

DState_4 = (LeftShift(DState_3) OR 1) AND DT[t_4]
         = (LeftShift(0100) OR 1) AND DT[2]
         = (1000 OR 1) AND 1001
         = 1001 AND 1001
         = 1001

which implies that there is a match starting at position 1 of $t$, since the 4-th bit of $DState_4$ is 1.
 j                               1     2     3     4     5     6     7     8     9     10
 t_j                             3     4     6     2     8     2     4     5     7     1
 LeftShift(DState_{j-1}) OR 1    0001  0011  0111  1001  0011  0001  0011  0111  1101  1001
 DT[t_j]                         1011  0011  0100  1001  0000  1001  0011  0110  0100  1000
 DState_j                        0001  0011  0100  1001  0000  0001  0011  0110  0100  1000
 [DState_j]_10                   1     3     4     9     0     1     3     6     4     8

Table 2. Computing the DStates and finding the $\delta$-approximate matches.
A $\delta$-approximate match occurs at position $j-m+1$ of $t$ if $[DState_j]_{10} \ge 2^{m-1}$, where $[DState_j]_{10}$ denotes $DState_j$ as a decimal integer. Therefore, there is one match ending at position 4 of $t$ ($\{3,4,6,2\}$) and another one ending at position 10 of $t$ ($\{4,5,7,1\}$), since $DState_4 = 9$ and $DState_{10} = 8$ are both $\ge 2^3$.
3.1 Pseudo-code

Fig. 1 gives a complete specification of the algorithm. Line 3 is the preprocessing phase, which computes the $DT$ table. In line 6 we use the recursive formula to compute the DStates. Finally, in line 7 we apply the matching criterion to see whether there is a $\delta$-approximate match or not.
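1. procedure Shift-And($p$, $t$, $\delta$)    { $n = |t|$; $m = |p|$ }
2. begin
3.    $DT_i[\alpha] \leftarrow 1$ if $|\alpha - p_i| \le \delta$, $0$ otherwise, $\forall i \in \{1..m\}$, $\forall \alpha \in \Sigma$
4.    $DState_0 \leftarrow 0$
5.    for $j \leftarrow 1$ to $n$ do
6.       $DState_j \leftarrow$ (LeftShift($DState_{j-1}$) OR 1) AND $DT[t_j]$
7.       if $DState_j \ge 2^{m-1}$ then write $j-m+1$
8.    od
9. end

Fig. 1. The Shift-And Procedure.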
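For readers who want to experiment, the procedure of Fig. 1 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used for the timings: the function name `shift_and` is hypothetical, $p_i$ is stored in bit $i-1$, the state is masked to $m$ bits explicitly, and Python integers are unbounded, so the machine-word assumption $m \le w$ is not enforced.

```python
def shift_and(p, t, delta):
    """Return the 1-based starting positions of all delta-approximate matches of p in t."""
    m = len(p)
    mask = (1 << m) - 1                   # keep only the m low-order bits

    # Preprocessing (line 3 of Fig. 1), restricted to the symbols occurring in t.
    dt = {}
    for a in set(t):
        word = 0
        for i, p_i in enumerate(p):
            if abs(a - p_i) <= delta:
                word |= 1 << i
        dt[a] = word

    # Searching (lines 4 to 8 of Fig. 1).
    matches = []
    state = 0
    for j, t_j in enumerate(t, start=1):
        state = (((state << 1) | 1) & mask) & dt[t_j]
        if (state >> (m - 1)) & 1:        # m-th bit set, i.e. state >= 2**(m-1)
            matches.append(j - m + 1)
    return matches


print(shift_and([3, 4, 6, 2], [3, 4, 6, 2, 8, 2, 4, 5, 7, 1], 1))  # [1, 7]
```

The printed positions 1 and 7 are the starting positions of the two matches that end at positions 4 and 10 in the worked example.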
Assuming that the pattern length is no longer than the memory word size of the machine (thus $O(1)$ size), the time complexity of the preprocessing phase is $O(n)$ (since we need to evaluate $DT$ only for the symbols that occur in $t$) and the time complexity of the searching phase is $O(n)$. Figure 2 shows the timing$^1$ for different text sizes.
[Figure 2: two plots of Time (in secs.) against Text Size (k), one for Pattern Size = 15, $\delta = 2$ and one for Pattern Size = 20, $\delta = 2$.]

Fig. 2. Timing curves for the Shift-And Procedure.
4 $\{\delta,\gamma\}$-Approximate Pattern Matching

The problem of $\{\delta,\gamma\}$-approximate pattern matching is formally defined as follows: given a string $t = t_1 \ldots t_n$ and a pattern $p = p_1 \ldots p_m$, compute all positions $j$ of $t$ such that

$$p \stackrel{\delta,\gamma}{=} t[j..j+m-1]$$
In order to solve this problem we first make use of the Shift-And algorithm to find the $\delta$-approximate matches of the pattern $p$ in $t$. Once we find a $\delta$-approximate match we want to know whether it is also a $\gamma$-approximate match. To do so, we seek to compute successive "Delta States" $DState_j$ and "Gamma States" $GState_j$ in $O(1)$ time using bit operations under the assumption that $m \le w$, where $w$ is the number of bits in a machine word. The main steps of the algorithm are as follows:
1. We need to compute the "Delta Table" $DT$ as we did before and the "Gamma Table" $GT$; we set $GT(\alpha) = r$, where $\alpha$ denotes a symbol in the alphabet and $r = r_1 \ldots r_m$ is a word with $r_i$ equal to $|\alpha - p_i|$ if $|\alpha - p_i| \le \delta$, otherwise $r_i$ is equal to 0, for $i \in \{1..m\}$. Each $r_i$, $i \in \{1..m\}$, is stored as a binary number of $d$ bits where $d = \lceil \log(\delta m) \rceil$.
$^1$ Using a SUN Ultra Enterprise 300 MHz running Solaris Unix.
2. Let LeftShift be a bit-wise operation that shifts the bits of a binary word by one position to the left, and let RightShift shift the bits of a binary word $d$ positions to the right. Once we have computed the $DT$ and $GT$ tables, we can use them to compute $DState_j$ and $GState_j$ for $j = 1 \ldots n$, using the recursive formulas

$$DState_j = (\text{LeftShift}(DState_{j-1}) \text{ OR } 1) \text{ AND } DT[t_j] \qquad (4.1)$$

$$GState_j = \text{RightShift}(GState_{j-1}, d) + GT[t_j] \qquad (4.2)$$

We also need to define the seeds $DState_0 = 0$ and $GState_0 = 0$. We call this procedure "Shift-Plus" because we use the "shift" and "plus" operators to compute each new state.
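3. We say that there is a match ($\{\delta,\gamma\}$-approximate match) at position $j-m+1$ if and only if the $m$-th bit of $DState_j$ is 1 and the $m$-th block of $d$ bits of $GState_j$, taken as an integer, is $< \gamma$, as required by condition (2.2).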
Example. For our example let $\Sigma = \{1, \ldots, 9\}$, the pattern $p = 3,4,6,2$, the text $t = 3,4,6,2,8,2,4,5,7,1$, $\delta = 1$ and $\gamma = 3$. We will use blocks of size 3 ($d = 3$) to store the $|\alpha - p_i|$ values where $|\alpha - p_i| \le \delta$. For example, $GT[3] = 000\ 100\ 000\ 100$ because $|3 - p_i| \le 1$ for $i = 1, 2, 4$ and the differences are $0, 1, 1$ respectively (see left hand table of Table 3).
Left hand table (GT[\alpha] for \alpha = 1..9, one 3-bit block per pattern position):

 i  p_i    1    2    3    4    5    6    7    8    9
 1   3    000  100  000  100  000  000  000  000  000
 2   4    000  000  100  000  100  000  000  000  000
 3   6    000  000  000  000  100  000  100  000  000
 4   2    100  000  100  000  000  000  000  000  000

Right hand table (GState_j for j = 1..10, one 3-bit block per pattern position):

 j         1    2    3    4    5    6    7    8    9    10
 t_j       3    4    6    2    8    2    4    5    7    1
 p_1 = 3  000  100  000  100  000  100  100  000  000  000
 p_2 = 4  100  000  100  000  100  000  100  010  000  000
 p_3 = 6  000  100  000  100  000  100  000  010  110  000
 p_4 = 2  100  000  100  000  100  000  100  000  010  001

Each 3-bit block is written with its least significant bit first, so 100 denotes the value 1 and 001 the value 4.

Table 3. The left hand side table is the "Gamma Table" $GT$ and the right hand side table is the table for finding $\{\gamma,\delta\}$-approximate matches.
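The right hand table above shows the computation of the GStates using (4.2). For example,

GState_9 = RightShift(000010010000, 3) + 000000100000
         = 000000010010 + 000000100000
         = 000000110010

We already know that there are two $\delta$-approximate matches ending at positions 4 and 10 of $t$. Now we can use the last three bits of $GState_4$ and $GState_{10}$ to find the accumulated sums of differences to be compared with $\gamma$, which are 0 and 4 respectively (see right hand table of Table 3). With $\gamma = 3$, only the match ending at position 4 is a $\{\delta,\gamma\}$-approximate match.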
4.1 Pseudo-code

Fig. 3 below gives a complete description of the algorithm. Lines 3 and 4 are the preprocessing phase, which computes the $DT$ table and the $GT$ table respectively. In lines 8 and 9 we compute the next DState and GState respectively. Finally, in line 10 we apply the matching criterion to see whether there is a match or not.
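1. procedure Shift-Plus($p$, $t$, $\delta$, $\gamma$)    { $n = |t|$; $m = |p|$ }
2. begin
3.    $DT_i[\alpha] \leftarrow 1$ if $|\alpha - p_i| \le \delta$, $0$ otherwise, $\forall i \in \{1..m\}$, $\forall \alpha \in \Sigma$
4.    $GT_{di-d \ldots di-1}[\alpha] \leftarrow |\alpha - p_i|$ if $DT_i[\alpha] = 1$, $0$ otherwise, $\forall i \in \{1..m\}$, $\forall \alpha \in \Sigma$
5.    $DState_0 \leftarrow 0$
6.    $GState_0 \leftarrow 0$
7.    for $j \leftarrow 1$ to $n$ do
8.       $DState_j \leftarrow$ (LeftShift($DState_{j-1}$) OR 1) AND $DT[t_j]$
9.       $GState_j \leftarrow$ RightShift($GState_{j-1}$, $d$) + $GT[t_j]$
10.      if $DState_j \ge 2^{m-1}$ AND $GState_{dm-d \ldots dm-1} < \gamma$ then write $j-m+1$
11.   od
12. end

Fig. 3. The Shift-Plus Algorithm.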
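The Shift-Plus procedure of Fig. 3 can be sketched in Python along the same lines. Several choices below are assumptions made for illustration: the name `shift_plus` is hypothetical; block $i$ of a Gamma word is stored at bit offsets $d(i-1) \ldots di-1$ with bit 0 as the least significant bit, so the RightShift of the text (drawn with bit 0 on the left) becomes `<< d` on Python integers; $d$ is taken as the bit length of $\delta m$, which guarantees $\delta m \le 2^d - 1$; and the sum is compared with $\gamma$ using the strict inequality of (2.2). The function also returns the accumulated sum with each position so that the values can be inspected.

```python
def shift_plus(p, t, delta, gamma):
    """Return (start position, sum of differences) for each {delta, gamma}-approximate match."""
    m = len(p)
    d = max(1, (delta * m).bit_length())  # block width, so that delta*m <= 2**d - 1
    d_mask = (1 << m) - 1                 # m bits of the Delta state
    g_mask = (1 << (d * m)) - 1           # m blocks of d bits of the Gamma state
    block_mask = (1 << d) - 1

    # Preprocessing: Delta and Gamma tables for the symbols occurring in t.
    dt, gt = {}, {}
    for a in set(t):
        d_word = g_word = 0
        for i, p_i in enumerate(p):
            diff = abs(a - p_i)
            if diff <= delta:
                d_word |= 1 << i
                g_word |= diff << (d * i)
        dt[a], gt[a] = d_word, g_word

    # Searching: one shift, one AND and one addition per text symbol.
    matches = []
    d_state = g_state = 0
    for j, t_j in enumerate(t, start=1):
        d_state = (((d_state << 1) | 1) & d_mask) & dt[t_j]
        g_state = ((g_state << d) & g_mask) + gt[t_j]
        total = (g_state >> (d * (m - 1))) & block_mask   # m-th block as an integer
        if (d_state >> (m - 1)) & 1 and total < gamma:
            matches.append((j - m + 1, total))
    return matches


text = [3, 4, 6, 2, 8, 2, 4, 5, 7, 1]
print(shift_plus([3, 4, 6, 2], text, 1, 3))  # [(1, 0)]
print(shift_plus([3, 4, 6, 2], text, 1, 5))  # [(1, 0), (7, 4)]
```

With $\gamma = 3$ only the match starting at position 1 is reported; raising $\gamma$ to 5 also admits the match starting at position 7, whose sum of differences is 4. No carry can cross a block boundary, because block $i$ never holds more than $\delta i \le 2^d - 1$.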
4.2 Running time

Assuming that $\delta m \le 2^d - 1$, the time complexity of the preprocessing phase is $O(\delta m + |\Sigma|)$ and the time complexity of the searching phase is $O(n)$, thus independent of the alphabet size and the pattern length. Figure 4 shows the timing for different text sizes.
[Figure 4: two plots of Time (in secs.) against Text Size (k), one for Pattern Size = 20, $\delta = 2$, $\gamma = 20$ and one for Pattern Size = 15, $\delta = 2$, $\gamma = 20$.]

Fig. 4. Timing curves for the Shift-Plus Algorithm.