The complexity of the Approximate Multiple Pattern Matching Problem for random strings

HAL Id: hal-03214615

https://hal.archives-ouvertes.fr/hal-03214615

Submitted on 2 May 2021


The complexity of the Approximate Multiple Pattern Matching Problem for random strings

Frédérique Bassino, Tsinjo Rakotoarimalala, Andrea Sportiello

To cite this version:

Frédérique Bassino, Tsinjo Rakotoarimalala, Andrea Sportiello. The complexity of the Approximate Multiple Pattern Matching Problem for random strings. AofA 2020, Jun 2020, Klagenfurt, Austria. DOI: 10.4230/LIPIcs.AOFA.2020.24. hal-03214615

The complexity of the Approximate Multiple Pattern Matching Problem for random strings

Frédérique Bassino
Université Paris 13, Sorbonne Paris Cité, LIPN, CNRS UMR 7030, 99 av. J.-B. Clément, F-93430 Villetaneuse, France
https://lipn.univ-paris13.fr/~bassino/
bassino@lipn.univ-paris13.fr

Tsinjo Rakotoarimalala
Université Paris 13, Sorbonne Paris Cité, LIPN, CNRS UMR 7030, 99 av. J.-B. Clément, F-93430 Villetaneuse, France
https://lipn.univ-paris13.fr/~rakotoarimalala/
rakotoarimalala@lipn.univ-paris13.fr

Andrea Sportiello
Université Paris 13, Sorbonne Paris Cité, LIPN, CNRS UMR 7030, 99 av. J.-B. Clément, F-93430 Villetaneuse, France
https://lipn.univ-paris13.fr/~sportiello/
sportiello@lipn.univ-paris13.fr

Abstract

We describe a multiple string pattern matching algorithm which is well-suited for approximate search and dictionaries composed of words of different lengths. We prove that this algorithm has optimal complexity rate up to a multiplicative constant, for arbitrary dictionaries. This extends to arbitrary dictionaries the classical results of Yao [SIAM J. Comput. 8, 1979], and Chang and Marr [Proc. CPM94, 1994].

2012 ACM Subject Classification Formal languages and automata theory; Design and analysis of algorithms

Keywords and phrases Average-case analysis of algorithms, String Pattern Matching, Computational Complexity bounds.

Digital Object Identifier 10.4230/LIPIcs.AOFA.2020.24

Funding Andrea Sportiello: Supported by the French ANR-MOST MetAConC.

1 The problem

1.1 Definition of the problem

Let $\Sigma$ be an alphabet of $s$ symbols, $\xi = \xi_0 \ldots \xi_{n-1} \in \Sigma^n$ a word of $n$ characters (the input text string), and $D = \{w_1, \ldots, w_\ell\}$, $w_i \in \Sigma^*$, a collection of words (the dictionary). We say that $w = x_1 \ldots x_m$ occurs in $\xi$ with final position $j$ if $w = \xi_{j-m+1} \xi_{j-m+2} \cdots \xi_j$. We say that $w$ occurs in $\xi$ with final position $j$, with no more than $k$ errors, if the letters $x_1, \ldots, x_m$ can be aligned to the letters $\xi_{j-m'}, \ldots, \xi_j$ with no more than $k$ errors of insertion, deletion or substitution type, i.e., it has Levenshtein distance at most $k$ to the string $\xi_{j-m'} \ldots \xi_j$ (see an example in Figure 1). Let $r_m(D)$ be the number of distinct words of length $m$ in $D$. We call $r(D) = \{r_m(D)\}_{m \geq 1}$ the content of $D$, a notion of crucial importance in this paper.

The approximate multiple string pattern matching problem (AMPMP), for the datum $(D, \xi, k)$, is the problem of identifying all the pairs $(a, j)$ such that $w_a \in D$ occurs in $\xi$ with final position $j$, and no more than $k$ errors (cf. Figure 1). This is a two-fold generalisation of the classical string pattern matching problem (PMP), for which the exact search is considered, and the dictionary consists of a single word.
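As a concrete reference for the definitions above, the AMPMP admits a trivial (and heavily suboptimal) solution by dynamic programming. The sketch below is ours, not the paper's — all names are illustrative — and simply tests every candidate text factor against the Levenshtein-distance criterion.

```python
# Brute-force AMPMP reference: report every pair (a, j) such that word w_a occurs
# in the text xi with final position j and at most k errors.

def levenshtein(u: str, v: str) -> int:
    """Textbook dynamic-programming Levenshtein distance."""
    prev = list(range(len(v) + 1))
    for i, cu in enumerate(u, 1):
        cur = [i]
        for j, cv in enumerate(v, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (cu != cv)))    # substitution / match
        prev = cur
    return prev[-1]

def ampmp_naive(dictionary, xi, k):
    """All pairs (a, j): w_a occurs in xi with final position j and <= k errors."""
    out = set()
    for a, w in enumerate(dictionary):
        m = len(w)
        for j in range(len(xi)):
            # an occurrence with <= k errors consumes between m-k and m+k text chars
            for length in range(max(1, m - k), min(j + 1, m + k) + 1):
                if levenshtein(w, xi[j - length + 1 : j + 1]) <= k:
                    out.add((a, j))
                    break
    return out
```

Against a random text, the interesting question — the subject of this paper — is how little of `xi` a smarter algorithm needs to read to produce the same output.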

© F. Bassino, T. Rakotoarimalala and A. Sportiello; licensed under Creative Commons License CC-BY

[Figure 1: alignment of the dictionary words sour, intent, galore and therein against the text now.is.the.winter.of.our.discontent.made.glorious.summer, with error symbols marking the approximate occurrences.]

Figure 1 Typical output of an approximate multiple string pattern matching problem, on an English text (alphabet of 26 characters plus the space symbol "."). In this case $k = 2$ and $r(D) = (0,0,0,0,1,0,2,1,0,\ldots)$. The symbols D, S and I stand for deletion, substitution and insertion errors, while X corresponds to an insertion or a substitution.

A precise historical account of this problem, and a number of theoretical facts, are presented in Navarro's review [8]. The first seminal works have concerned the PMP. Results included the design of efficient algorithms (notably Knuth–Morris–Pratt and Boyer–Moore), and have led to the far-reaching definition of the Aho–Corasick automata [1, 3, 7, 11]. In particular, Yao [11] is the first paper that provides rigorous bounds for the complexity of PMP in random texts. To make a long story short, it is argued that an interesting notion of complexity is the asymptotic average fraction of text that needs to be accessed (in particular, at least at this stage, it is not the time complexity of the algorithm), and is of order $\ln(m)/m$ for a word of length $m$. The first works on approximate search, yet again for a single word (APMP), are the description of the appropriate data structure, in [10, 4], and, more relevant to our aims here, the derivation of rigorous complexity bounds in Chang and Marr [5]. Yet again in simplified terms, if we allow for $k$ errors, the complexity result of Yao is deformed into order $[\ln(m) + k]/m$. More recent works have concerned the case of dictionaries composed of several words, all of the same length [9];¹ however, also in the light of unfortunate flaws in previous literature, the rigorous derivation of the average complexity for the MPMP has been missing even in the case of words of the same length, up to our recent paper [2], where it is established that the Yao scaling $\ln(m)/m$ is (roughly) modified into $\max_m \ln(m\,r_m)/m$ (a more precise expression is given later on). By combining the formula of Chang and Marr for APMP, and our formula for MPMP, it is thus natural to expect that the AMPMP may have a complexity of the order $\max_m [\ln(m\,r_m) + k]/m$. This paper has the aim of establishing a result in this fashion.

Of course, the present work uses results, ideas and techniques already presented in [2] for the MPMP. A main difference is that in [2] we show that, for any dictionary, a slight modification of an algorithm by Fredriksson and Grabowski [6] is optimal within a constant, while this is not true anymore for approximate search with Levenshtein distance (we expect that it remains optimal for approximate search in which only substitution errors are allowed, although we do not investigate this idea here). As a result, we have to modify this algorithm more substantially, by combining it with the algorithmic strategy presented in Chang and Marr [5], and including one more parameter (to be tuned for optimality). This generalised algorithm is presented in Section 2.2.

Also, a large part of our work in [2] is devoted to the determination of a relatively tight lower bound, while the determination of the upper bound consists of a simple complexity analysis of the Fredriksson–Grabowski algorithm. Here, instead, we will make considerable efforts in order to determine an upper bound for the complexity of our algorithm, which is the content of Section 2.4, while we will content ourselves with a rather crude lower bound, derived with small effort in Section 1.3 by combining the results of [5] and [2].

¹ This is the reason why, before our paper [2], which deals with dictionaries having words of different lengths, the aforementioned notion of "content" of a dictionary did not appear in the literature.

                  exact                                                                  approximate

  single word     $C_{\mathrm{Yao}}\,\frac{\ln m}{m}$  (Yao)                             $C_{\mathrm{CM}}\,\frac{\ln m + k}{m}$  (Chang and Marr)

  dictionary      $C_1^{\mathrm{ex}}\,\frac{1}{m_{\min}} + C_2^{\mathrm{ex}}\max_m \frac{\ln(s\,m\,r_m)}{m}$        $\frac{C_1 k + C_1'}{m_{\min}} + C_2\max_m \frac{\ln(s\,m\,r_m)}{m}$

Table 1 Summary of average complexities for exact and approximate search, for a single word or on arbitrary dictionaries. The results are derived from Yao [11], Chang and Marr [5], our previous paper [2], and the present paper, respectively.

1.2 Complexity of pattern matching problems

In our previous paper [2] we have established a lower bound for the (exact search) multiple pattern matching problem, in terms of the size $s$ of the alphabet and the content $r = \{r_m\}$ of the dictionary, involving the length $m_{\min}$ of the shortest word in the dictionary, and a function $\phi(r)$ with the specially simple structure $\phi(r) = \max_m f(m, r_m)$. More precisely, calling $\Phi_{\mathrm{aver}}(r)$ (resp. $\Phi_{\max}(r)$) the average over random texts, of the average (resp. maximum) over dictionaries $D$ of content $r$, of the asymptotic fraction of text characters that need to be accessed, we have

▶ Theorem 1 (Bassino, Rakotoarimalala and Sportiello, [2]). Let $s \geq 2$ and $m_{\min} \geq 2$, and define $\kappa_s = 5\sqrt{s}$. For all contents $r$, the complexity of the MPMP on an alphabet of size $s$ satisfies the bounds

$$\frac{1}{\kappa_s}\left(\phi(r) + \frac{1}{2s\,m_{\min}}\right) \;\leq\; \Phi_{\mathrm{aver}}(r) \;\leq\; \Phi_{\max}(r) \;\leq\; 2\left(\phi(r) + \frac{1}{2s\,m_{\min}}\right), \qquad (1)$$

where

$$\phi(r) := \max_m \frac{1}{m}\,\ln(s\,m\,r_m)\,. \qquad (2)$$
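For illustration, the content $r(D)$ and the function $\phi(r)$ of Eq. (2) are straightforward to compute. The sketch below does so for a small dictionary; the function names are our own, not the paper's.

```python
# Compute the content r(D) = {r_m} and phi(r) = max_m (1/m) ln(s*m*r_m) of Eq. (2).
from collections import Counter
from math import log

def content(dictionary):
    """r_m = number of distinct words of length m in D."""
    return Counter(len(w) for w in set(dictionary))

def phi(r, s):
    """phi(r), maximised over the lengths m actually present in D."""
    return max(log(s * m * rm) / m for m, rm in r.items())

# The dictionary of Figure 1, over the 27-symbol alphabet (26 letters + space):
r = content(["sour", "intent", "galore", "therein"])
```

Here the maximum is attained at the shortest length, $m = 4$: short words dominate the complexity, which is the reason $m_{\min}$ appears throughout the bounds.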

Note a relative factor $\ln s$ between the statement of the result above and its original formulation in [2], due to a slightly different definition of complexity.

As we have anticipated, such a result is in agreement with the result of Yao [11] for dictionaries composed of a single word, which is simply of the form $\ln(m)/m$. Combining this formula with the complexity result for APMP, derived in Chang and Marr [5], it is natural to expect that the AMPMP has a complexity whose functional dependence on $k$ and $r$ is as in Table 1. Indeed, the bottom-right corner of the table is consistent both with the entry above it, and the entry at its left. Furthermore, it is easily seen that, up to redefining the constants, several other natural guesses would have this same functional form in disguise. Let us give some examples of this mechanism. Write $X \gtrless a_{L/U} Y + b_{L/U} Z$ as a shortcut for $a_L Y + b_L Z \leq X \leq a_U Y + b_U Z$. Now, suppose that we establish that $\Phi(r,k) \gtrless a_{L/U}(k+1)/m_{\min} + b_{L/U} \max_m (\ln(m\,r_m)+k)/m$. Then we also have $\Phi(r,k) \gtrless a'_{L/U}(k+1)/m_{\min} + b_{L/U} \max_m \ln(m\,r_m)/m$, with $a'_U = a_U + b_U$ (and all other constants unchanged). On the other side, if we have $\Phi(r,k) \gtrless a_{L/U}(k+1)/m_{\min} + b_{L/U} \max_m \ln(m\,r_m)/m$, with $a_L > b_L$, then we also have $\Phi(r,k) \gtrless a_{L/U}(k+1)/m_{\min} + b'_{L/U} \max_m (\ln(m\,r_m)+k)/m$, with $b'_L = a_L - b_L$.

The precise result that we obtain in this paper is the following:

▶ Theorem 2. For the AMPMP, with $k$ errors and a dictionary $D$ of content $\{r_m\}$, the complexity rate $\Phi(D)$ is bounded in terms of the quantity

$$\widetilde\Phi(D) := \frac{C_1 (k+1)}{m_{\min}} + C_2 \max_m \frac{\ln(s\,m\,r_m)}{m} \qquad (3)$$

as

$$\frac{1}{C_1 + \kappa_s C_2}\,\widetilde\Phi(D) \;\leq\; \Phi(D) \;\leq\; \widetilde\Phi(D)\,, \qquad (4)$$

with $a = \ln(2s^2/(2s+1))$, $a' = \ln(4s^2-1)$, $\kappa_s = 5\sqrt{s}$ (as in Theorem 1) and

$$C_1 = \frac{a + 2a'}{a}\,; \qquad C_2 = \frac{2(a + 2a')}{a\,a'} = \frac{2}{a'}\,C_1\,. \qquad (5)$$

1.3 The lower bound

Now, let us derive a lower bound of the functional form of Table 1 for the AMPMP, by combining our results in [2] for the MPMP and the results in [5] for the APMP. Let us first observe a simple fact. Suppose that we have two bounds $A_{\mathrm{LB}}(r,k) \leq \Phi(r,k) \leq A_{\mathrm{UB}}(r,k)$ and $B_{\mathrm{LB}}(r,k) \leq \Phi(r,k) \leq B_{\mathrm{UB}}(r,k)$ (with $A_{\mathrm{LB}}(r,k)$ and $B_{\mathrm{LB}}(r,k)$ positive). Then, for all functions $p(r,k)$ valued in $[0,1]$, we have

$$p(r,k)\,A_{\mathrm{LB}}(r,k) + (1-p(r,k))\,B_{\mathrm{LB}}(r,k) \;\leq\; \Phi(r,k) \;\leq\; A_{\mathrm{UB}}(r,k) + B_{\mathrm{UB}}(r,k)\,.$$

We want to exploit this fact by using as bounds $A_{\mathrm{LB/UB}}(r,k)$ our previous result for the exact search, and as lower bound $B_{\mathrm{LB}}(r,k)$ the simple quantity $(k+1)/m_{\min}$. Then, later on, in Section 2, we will work on the determination of a bound $B_{\mathrm{UB}}(r,k)$ which has the appropriate form for our strategy above to apply. Let us discuss why $\Phi(r,k) \geq (k+1)/m_{\min}$.

We will prove that this quantity is a bound to the minimal density of a certificate, over a single word of length $m = m_{\min}$, and text $\xi$. A certificate, as described in [11], is a subset $I \subseteq \{1, \ldots, n\}$ such that, for the given text, the characters $\{\xi_i\}_{i \in I}$ imply that no occurrences of words of the dictionary may be possible, besides the ones which are fully contained in $I$. Some reflection shows that: (1) for the interesting case $m > k$, the smallest density $|I|/n$ of a certificate is realised on a negative certificate, that is, on a text $\xi$ with no occurrences of the word $w$; (2) the smallest density is realised, for example, by the text $\xi = bbb\cdots b$ and the word $w = aaa\cdots a$; (3) in such a certificate, we must have read at least $k+1$ characters in every interval of size $m$, otherwise the alignment of $w$ to this portion of text, in which we perform all the substitutions on the disclosed characters, would still be a viable candidate. Note in particular that deletion and insertion errors do not lead to higher lower bounds (although, for large $m$, they lead to bounds which are only slightly smaller).
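The counting behind point (3) is easy to make concrete. In the illustrative sketch below (function names are ours), the adversarial pair $w = a^m$, $\xi = b^n$ shows that a window with at most $k$ disclosed characters never excludes $w$, while reading $k+1$ characters in each of the disjoint length-$m$ windows certifies the absence of occurrences at density $(k+1)/m$.

```python
# In the all-b text, a length-m window excludes w = a^m only if more than k of its
# characters have been read: each disclosed b forces one substitution error.
def window_excludes(chars_read_in_window: int, k: int) -> bool:
    return chars_read_in_window > k

# A negative certificate: k+1 reads in each disjoint length-m window of a length-n
# text.  Its density is (k+1)/m when m divides n.
def certificate_density(n: int, m: int, k: int) -> float:
    reads = set()
    for start in range(0, n - m + 1, m):
        reads.update(range(start, start + k + 1))
    return len(reads) / n
```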

As a result, recalling the expression for the lower bound in Theorem 1, by choosing $p(r,k)$ to satisfy $\frac{p}{1-p} = \frac{\kappa_s C_2}{C_1}$ we have

$$\Phi(r,k) \;\geq\; (1-p)\,\frac{k+1}{m_{\min}} + \frac{p}{\kappa_s}\,\phi(r) \;=\; \frac{p}{\kappa_s C_2}\left(C_1\,\frac{k+1}{m_{\min}} + C_2\,\phi(r)\right) \;=\; \frac{C_1\,\frac{k+1}{m_{\min}} + C_2\,\phi(r)}{C_1 + \kappa_s C_2}\,.$$

This proves the lower bound part of Theorem 2. Note that we could confine all the dependence on $\{r_m\}$ to the function $\phi$ (in particular, the choice $\frac{p}{1-p} = \frac{\kappa_s C_2}{C_1}$ only depends on the size $s$ of the alphabet).

2 The (q, L) search algorithm, and the upper bound

2.1 Definition of alignment

We define a partial alignment $\alpha$ of the word $w = x_1 \ldots x_m$ to the portion of text $\xi_{i_1} \ldots \xi_{i_2}$, with $k$ errors and boundary parameters $(\varepsilon, \varepsilon') \in \mathbb{N}^2$, as the datum $\alpha = (w; i_1, i_2; \varepsilon, \varepsilon'; u)$, where $u$ is a string in $\{C, S_a, D, I_a\}^*$ (these letters stand for correct, substitution, deletion and insertion, respectively, and the index $a$ runs from 1 to $s$). Two integer parameters (for example $i_2$ and $\varepsilon'$) are not independent, as they are deduced (say) from $i_1$, $\varepsilon$ and the length of $u$. Indeed, say that the string $u$ has $m_C$ symbols $C$, $m_D$ symbols $D$, $m_S$ symbols of type $S_a$ (for all $a$'s altogether) and $m_I$ symbols of type $I_a$; then

$$k = m_S + m_D + m_I \qquad \text{(number of errors)}$$
$$\varepsilon + \varepsilon' = m - (m_C + m_S + m_D) \qquad \text{(portion of the word on the sides)}$$
$$i_2 - i_1 + 1 = m_C + m_S + m_I \qquad \text{(length of the aligned portion of text)}$$
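The three relations above are pure bookkeeping on the string $u$, as the small sketch below makes explicit (the token encoding "C", "So", "D", "Iu", etc., with the character as second letter, is our own; the paper uses subscripted symbols). The sample string is the $u$ of the counteroffers example discussed next.

```python
# Check the three counting relations for an alignment string u: tokens are
# classified by their leading letter, C / S / D / I.
from collections import Counter

def alignment_counts(u):
    c = Counter(tok[0] for tok in u)
    return {
        "errors": c["S"] + c["D"] + c["I"],       # k  = m_S + m_D + m_I
        "word_chars": c["C"] + c["S"] + c["D"],   # m - (eps + eps')
        "text_chars": c["C"] + c["S"] + c["I"],   # i2 - i1 + 1
    }

# u = CCCC I CC S S_o I_u C, as in the counteroffers example:
u = ["C", "C", "C", "C", "I", "C", "C", "S", "So", "Iu", "C"]
```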

The alignment has the following pattern (with a dash $-$ denoting a skipped character, in the text or in the word): the first $\varepsilon$ characters $w_1 \cdots w_\varepsilon$ and the last $\varepsilon'$ characters $w_{m-\varepsilon'+1} \cdots w_m$ of the word lie outside the aligned portion $\xi_{i_1} \cdots \xi_{i_2}$, while the string $u$, read left to right, describes how word and text are matched in between: $C$ aligns the current word and text characters, $S_a$ substitutes the current word character by $a$, $D$ skips (deletes) the current word character, and $I_a$ inserts the character $a$ in the text.

For example, if $w = \texttt{counteroffers}$, in our reference text of Figure 1 we have the alignment $\alpha = (w; i_1, i_2; \varepsilon, \varepsilon'; u) = (w; 14, 24; 3, 1; u)$ with $k = 4$ and $u = CCCC\,I\,CC\,S\,S_o\,I_u\,C$: the portion of text $\xi_{14} \cdots \xi_{24}$ (inside ...the.winter.of.our.discontent...) is aligned to the factor nteroffer of $w$, with cou ($\varepsilon = 3$) and s ($\varepsilon' = 1$) left out, through four matches, one insertion, two matches, one substitution, one substitution by o, one insertion of u, and one final match.

This example shows an important feature of this notion: several strings $u$ may correspond to equivalent alignments among the same word and the same portion of text, and with the same offset $\varepsilon$. For example, the three last errors of $u = \cdots S\,S_o\,I_u\,C$ can be replaced as in $u' = \cdots S\,I_o\,S_u\,C$ or as in $u'' = \cdots I\,S_o\,S_u\,C$. As the underlying idea in producing an upper bound from an explicit algorithm is to analyse the algorithm while using the union bound on the possible alignments, it will be useful to recognise classes of equivalent alignments and, in the bound, 'count' just the classes, instead of the elements (we are more precise on this notion in Section 2.3).

We define a full alignment to be likewise a partial alignment, but with $\varepsilon = \varepsilon' = 0$. That is, the goal of any algorithm for the AMPMP is to output the list of (say) positions $i_2$ of the full alignments among the given text and dictionary. Note that we can always complete a partial alignment with $k$ errors and boundary parameters $(\varepsilon, \varepsilon')$ to a full alignment with no more than $k + \varepsilon + \varepsilon'$ errors, and no less than $k$ errors, by including substitution or insertion errors at the two sides.

[Figure 2: the pattern deformed pattern aligned in two blocks against the reference text now.is.the.winter.of.our.discontent.made.glorious; a second, alternative alignment replaces the insertion in the second block by a substitution.]

Figure 2 Typical outcome for the search of the pattern deformed pattern in our reference text. In this example $L = 3$ and $q = 12$, the number of full blocks is $c(\alpha) = 2$, and the word can be aligned to the disclosed portion of the text (denoted by underline) with $k = 3$ errors: one deletion on the first block, one insertion in the second block, and one deletion somewhere in between the two blocks. On the bottom line, another alignment of the same word, in which, instead of inserting the letter r in the second block, we have substituted n by r, still with $k = 3$. These two alignments are sufficiently different to contribute separately to our estimate of the complexity, within our version of the union bound (because the values of $\varepsilon$ are different).

We define a c-block partial alignment as the generalisation of the notion of partial alignment to the case in which the portion of text consists of $c$ non-adjacent blocks. In this case, besides the natural alignment parameters $\varepsilon$, $\varepsilon'$, and $i_{1,a}$, $i_{2,a}$, and $u_a$, for the blocks $a = 1, \ldots, c$, we have $c-1$ parameters $\delta_a \in \mathbb{Z}$, associated to the offset between the alignments of the word to the blocks with index $a$ and $a+1$. As a result, in order to extend a c-block partial alignment to a full alignment, we need to perform at least $-\delta_a$ further insertion errors, or $+\delta_a$ further deletion errors, depending on the sign of $\delta_a$, for each of the $c-1$ intervals between the portions of text. That is, any c-block partial alignment $\alpha$ with $k$ errors can be completed to a full alignment with no less than $k + \sum_a |\delta_a|$ errors.

Note that in the following we will not need to count all of the possible ways in which these deletions or insertions can be performed, as it may seem natural in a naïve perspective on the use of the union bound. This fact will allow us to efficiently bound the number of possible multi-block partial alignments arising in our algorithm analysis (instead of counting directly the possible full alignments, which would result in a too large bound).

2.2 The algorithm

Here we introduce an algorithm for AMPMP, concentrating on the pertinent notion of complexity, which is the ratio between the number of accesses to the text and the length of the text, and neglecting all implementation issues and analysis of time complexity.

The algorithm is determined by two integers $q$ and $L$, such that $k+1 \leq L < q \leq m_{\min} - k$. The emerging inequality $2k+1 < m_{\min}$ is not a limitation, as when this inequality is not satisfied we have to read a fraction $\Theta(1)$ of the text, and in this regime there is no point in showing that some algorithm can reach a complexity which is optimal up to a multiplicative constant. When $L = 1$, the algorithm coincides with the one described by Fredriksson and Grabowski [6], and already analysed in detail in [2] for the MPMP. When we have a single word of length $m$, and $q$ has the maximal possible value $q = m - k$, the algorithm coincides with the one used by Chang and Marr [5] for their proof of complexity of the APMP. As we will see in Section 2.5, choosing the optimal values of $q$ and $L$ for a given dictionary $D$ (when the words are of different lengths) is not a trivial task.

Call the interval $\xi_{bq}\,\xi_{bq+1} \cdots \xi_{bq+L-1}$ of the text $\xi$ the b-th block of text. The text is thus decomposed in a list of blocks of length $L$, and of intervals between the blocks, of length $q - L$. To every possible full alignment $\alpha$ of the word $w$ to the text are associated two integers: $c(\alpha)$ is the number of blocks which are fully contained in the alignment, and $b(\alpha)$ is the index of the rightmost of these blocks. Furthermore, we define $c(w)$ as the minimum of $c(\alpha)$ among the possible alignments involving $w$ (indeed, either $c(\alpha) = c(w)$ for all $\alpha$, or $c(\alpha) \in \{c(w), c(w)+1\}$ for all $\alpha$, and, of course, at fixed $q$ and $L$, $c(w)$ only depends on the length $|w|$ of the word).

Our algorithm accesses the text in three steps, namely, for every block index $b = 0, 1, \ldots, \lceil n/q \rceil - 1$:

We read all the characters $\xi_i$ of the text, for $bq \leq i < bq + L$, that is, we read the b-th block;

We consider the possible c-block partial alignments $\alpha$ (with $c = c(\alpha)$) such that $b(\alpha) = b$, and associated to the intervals of text read so far. If any of these alignments is not excluded or determined positively, we read also the characters $\xi_i$ for $i = bq-1, bq-2, \ldots$, one by one, in this order, up to when all partial alignments are either excluded, or reach $\varepsilon = 0$. For a given instance of the problem, call $E_L(b)$ (left-excess at block $b$) the set of positions of further characters that we need to access by this second step (with indices shifted so that the block starts at 1), and $e_L(b) = |E_L(b)|$.

If at the previous step we still have partial alignments which are not excluded, we read also the characters at positions $i = bq+L, bq+L+1, \ldots$, in this order, up to when all partial alignments are either excluded, or completed to a full alignment. Similarly to above, introduce $E_R(b)$ and $e_R(b) = |E_R(b)|$ (right-excess at block $b$).

An example with $c(\alpha) = 2$ is in Figure 2. Note that, at all steps, the pattern of the accessed part of the text consists of some blocks of length $L$ and spacing $q$, plus one rightmost block with length $L' \geq L$ and spacing $q' \leq q$. A typical situation within the second step is as follows (here $c = 5$, $L = 3$, $q = 8$, $L' = 12$ and $q' = 7$):

[schematic: blocks of length $L$ at spacing $q$, followed by one longer rightmost block of length $L'$ at spacing $q'$]

Call $E(b) = E_L(b) \cup E_R(b)$, and $e(b) = e_L(b) + e_R(b)$. Call $\Psi_h^{\mathrm{exact}}$ the average over random texts of the indicator function for the event that $e(b) \geq h$. Clearly, the average complexity rate of our algorithm is bounded by the expression

$$\Phi_{\mathrm{alg}}(D) \;\leq\; \frac{L + \mathbb{E}(e(b))}{q} \;=\; \frac{L + \sum_{h \geq 1} \Psi_h^{\mathrm{exact}}}{q}\,,$$

where the average is taken over random texts, at fixed dictionary. Note that, because of our choice of range for $q$ and $L$, $c(\alpha) \geq 1$ for all $\alpha$, and $c(|w|) \geq 1$ for all $w$.

Let $\alpha$ be a full alignment associated to the block $b$. Call $E[\alpha]$ the set of extra positions of the text (besides the blocks) that we need to access in order to determine the alignment $\alpha$. Then clearly $E(b) = \bigcup_\alpha E[\alpha]$.
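The three-step scan of this section can be summarised in code. The sketch below is our own schematic, with the verification logic abstracted into callbacks (the paper's algorithm maintains the set of surviving partial alignments explicitly); it also reports the empirical access rate, to be compared with $(L + \mathbb{E}(e(b)))/q$.

```python
# Schematic (q, L) scan: read a length-L block every q positions; only where the
# block leaves some partial alignment alive, pay for extra left/right accesses.
def ql_scan(xi, q, L, block_test, verify_around):
    """block_test(block) -> True if some partial alignment survives the block;
       verify_around(xi, b) -> (full_matches, number_of_extra_reads)."""
    n = len(xi)
    accessed, matches = 0, []
    for b in range(-(-n // q)):                 # b = 0, 1, ..., ceil(n/q) - 1
        block = xi[b * q : b * q + L]
        accessed += len(block)                  # step 1: read the b-th block
        if block_test(block):                   # steps 2-3: extra accesses e(b)
            found, extra = verify_around(xi, b)
            matches += found
            accessed += extra
    return matches, accessed / n                # rate ~ (L + E[e(b)]) / q
```

On a random text, a well-tuned block test rejects at almost every block, so the access rate stays close to $L/q$.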

2.3 Proof strategy for the upper bound

Our proof strategy is to prove that there exists a choice of parameters $L$ and $q$, with the properties that $q = \Theta(m_{\min})$, $L/q = \Theta(\phi(r(D)))$, and $\mathbb{E}(e(b)) = \Theta(1)$. This last condition is equivalent to the requirement that $\Psi_h^{\mathrm{exact}}$ is a summable series, and we will see that indeed the first can be bounded by a geometric series, and the second is rather small. Up to calculating the pertinent multiplicative constants, such a pattern would imply the functional form of the complexity anticipated in Section 1.2.

The idea is that the exact calculation of $\mathbb{E}(e(b))$ or of $\Psi_h^{\mathrm{exact}}$, even at $q$ and $L$ fixed (which is easier than optimising w.r.t. these parameters), is rather difficult, but we can produce a simpler upper bound by:

For alignments $\alpha$ with $c(\alpha) > 1$, neglect the information coming from the $e(b')$ extra characters that we have accessed at blocks $b' < b$. This allows us to separate the analysis on the different blocks of text.

Naïvely, for different (full) alignments $\alpha$, we could perform a union bound, that is, $e(b) = |E(b)| = |\bigcup_\alpha E[\alpha]| \leq \sum_\alpha |E[\alpha]|$, which thus separates the analysis over the different alignments. We will make an improved version of this bound, namely we use this bound, not with full alignments, but rather with "classes of equivalent partial alignments". As we anticipated, the crucial point is that we count partial alignments instead of full alignments. A further slight improvement of the bound comes from considering these 'classes of equivalent partial alignments', instead of just the partial alignments. These two facts are motivated by the same argument, that we now elucidate.

Consider the two following notions: (1) Each set $A_h(w)$ of partial alignments is partitioned into classes $I$. (2) There is a subset $\bar A_h(w) \subseteq A_h(w)$ of alignments, that we shall call basic alignments. Now, suppose that the two following properties hold: (i) $I \cap \bar A_h(w) \neq \emptyset$ for all classes $I$ of $A_h(w)$. (ii) For each $\alpha \in I$, there exists an $\bar\alpha \in I \cap \bar A_h(w)$ such that $E(\alpha) \subseteq E(\bar\alpha)$. In this case it is easily established that the bound above can be improved into $e(b) = |E(b)| = |\bigcup_\alpha E[\alpha]| \leq \sum_{\bar\alpha} |E[\bar\alpha]|$, where the sum runs only on basic partial alignments. Thus, calling $\Psi_h := \sum_{w \in D} \sum_{\bar\alpha \in \bar A_h(w)} \mathbb{P}(|E[\bar\alpha]| \geq h)$, we have $\Psi_h \geq \Psi_h^{\mathrm{exact}}$.

We propose the following definition of basic alignment. Let $\alpha$ be in $A_h(w)$. In the string $u$, suppose that we write $C_a$ instead of $C$ whenever the well-aligned character is $a$, and $D_a$ when the deleted character is $a$ (this is clearly just a bijective decoration of $u$). For $\alpha \in \bar A_h(w)$, we require that there are no occurrences of $C_a I_a$ as factors of $u$ (as these are equivalent to $I_a C_a$), of $C_a D_a$ (as these are equivalent to $D_a C_a$), and of $I_a D_b$ or $D_b I_a$ (as these are equivalent to $C_a$ or $S_a$, depending on whether $a = b$ or not). If $\alpha$ can be obtained from $\alpha'$ by a sequence of these rewriting rules, then $\alpha$ and $\alpha'$ are in the same class $I$.

It is easy to see that this definition of basic alignments and classes has the defining properties above.
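The three rewriting rules can be applied mechanically. The sketch below (our own encoding: every token carries its character as a second letter, e.g. "Ca", "Ia") repeatedly rewrites a decorated string $u$ until no forbidden factor remains, producing a basic alignment of the same class.

```python
# Normalise a decorated alignment string by the rules of Section 2.3:
#   Ca Ia -> Ia Ca,   Ca Da -> Da Ca,   Ia Db / Db Ia -> Ca (a == b) or S<ins> (a != b).
def normalize(u):
    u = list(u)
    changed = True
    while changed:
        changed = False
        for i in range(len(u) - 1):
            x, y = u[i], u[i + 1]
            if x[0] == "C" and y[0] in "ID" and x[1] == y[1]:
                u[i], u[i + 1] = y, x          # push I/D to the left of a matching C
                changed = True
                break
            if {x[0], y[0]} == {"I", "D"}:
                ins = x[1] if x[0] == "I" else y[1]    # inserted character
                dele = y[1] if x[0] == "I" else x[1]   # deleted character
                u[i : i + 2] = [("C" if ins == dele else "S") + ins]
                changed = True
                break
    return u
```

Each swap moves an insertion or deletion leftward and each merge shortens $u$, so the rewriting terminates; this is what lets the union bound run over basic alignments only.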

2.4 Evaluation of an upper bound at q and L fixed

Let us call $p_{c,h,\varepsilon'}(w)$ the probability that, for a given word $w$ and parameter $\varepsilon'$, there exists an alignment $\alpha \in A_h(w)$, to a text consisting of $c-1$ blocks of length $L$ and one block of length $L+h$, which is visited by the algorithm (that is, it makes at most $k$ errors); that is, in particular,

$$\Psi_h \;\leq\; \sum_{\varepsilon'=0}^{q-1} p_{c,h,\varepsilon'}(w)\,. \qquad (6)$$

We have the important fact:

▶ Proposition 3.
$$p_{c,h,\varepsilon'}(w) \;\leq\; \beta\, s^{-(cL+h)}\, B_{cL+h+c-1,\,k} \qquad (7)$$
for all $\varepsilon'$, where $\beta = \frac{(2s-1)L+k}{(2s-1)L-k}$ and $B_{L,k} = (2s-1)^k \binom{L+k}{k}$.

The proof of this proposition is slightly complicated, and is presented in Appendix A. Note however that for the special case $c = 1$, and with exactly $k$ errors (instead of at most $k$ errors), the bound $s^{-(L+h)} (2s)^k \binom{L+k}{k}$ can be established trivially. Also note that the bound does not depend on $\varepsilon'$, and, in particular, it only depends on $h = |E_L| + |E_R|$ for the alignments $\alpha$ at given $w$ and $\varepsilon'$, and not separately on the two summands.

296

We are now ready to evaluate the expressions for the upper bound on the quantity Ψh

297

in (6), in light of (7). CallRc=P

m:c(m)=crm=Pq(c+1)+L−2

m=qc+L−1 rm, andpc,h asq times the

298

(10)

RHS of (7) (that is, an upper bound toPq−1

ε0=0pc,h,ε0(w)). We have the bound

299

X

h

Ψh6X

c

RcX

h

pc,h=X

c

RcX

h

β q s−(cL+h)BcL+h+c−1,k. (8)

300 301

Recalling that

$$\sum_{h \geq 0} s^{-h} \binom{a+k+h}{k} \;\leq\; \frac{1}{1 - \frac{1}{s}\,\frac{a+k+1}{a+1}}\,\binom{a+k}{k}\,,$$

(and that $q < m_{\min}$), substituting in (8) gives

$$\Phi_{\mathrm{alg}}(D) \;\leq\; \frac{1}{q}\left[L + \beta q \sum_c R_c\, \frac{1}{1 - \frac{1}{s}\,\frac{cL+c+k}{cL+c}}\, s^{-cL} \binom{cL+c-1+k}{k} (2s-1)^k\right] \;\leq\; \frac{1}{q}\left[L + \frac{\beta\, m_{\min}\, (2s-1)^k}{1 - \frac{1}{s}\,\frac{L+k}{L}} \sum_c R_c\, s^{-cL} \binom{c(L+1)+k}{k}\right]. \qquad (9)$$
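The geometric-series estimate used in the first step can be checked numerically; the sketch below is our own verification, not part of the proof, comparing a truncated sum with the closed-form bound.

```python
# Check: sum_{h>=0} s^{-h} C(a+k+h, k) <= C(a+k, k) / (1 - (a+k+1)/(s(a+1))),
# valid when the geometric ratio (a+k+1)/(s(a+1)) is smaller than 1.
from math import comb

def lhs(s, a, k, terms=500):
    return sum(comb(a + k + h, k) / s ** h for h in range(terms))

def rhs(s, a, k):
    return comb(a + k, k) / (1 - (a + k + 1) / (s * (a + 1)))
```

The bound holds because the ratio of consecutive summands, $\frac{1}{s}\frac{a+k+h+1}{a+h+1}$, is decreasing in $h$, so the series is dominated by the geometric series with the $h = 0$ ratio.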

305

We want to prove that

306

Φalg(D)6C1k+C10

mmin +C2max

m

ln(smrm)

m , (10)

307

with suitable constantsC1,C10 andC2 (it will turn out at the end that we can setC10 =C1

308

andC1, C2 to be as in Theorem 2, but at this point it is convenient to let them be three

309

separate variables). This would prove the upper bound part of Theorem 2.

310

Note that, if k/mmin≥1/C1, the upper bound expression (10) is larger than the trivial

311

bound Φalg(D)61, and there is nothing to prove. So we can assume that k/mmin<1/C1.

312

2.5 Optimisation of q and L

We now have to analyse the expression (9), in order to understand which values of $q$ and $L$ make the bound smaller. The sum over $c$ is the most complicated term. We simplify it by using the fact that, for all $\xi \in \mathbb{R}_+$, $\ln \binom{a+k}{k} \leq k \ln(1+\xi) + a \ln(1+\xi^{-1})$, which gives

$$T := m_{\min} (2s-1)^k \sum_c R_c\, s^{-cL} \binom{c(L+1)+k}{k}$$
$$\leq\; \sum_c \frac{1}{c^2} \exp\!\left[-c\left(LA - \frac{\ln(R_c\, m_{\min}) + k \ln((1+\xi)(2s-1))}{c} - \frac{\ln c^2}{c} - \ln(1+\xi^{-1})\right)\right]$$
$$=\; \sum_c \frac{1}{c^2} \exp\!\left[-c\left(LA - \phi_0(c) - \ln(1+\xi^{-1})\right)\right], \qquad (11)$$

where $A = \ln(s\xi/(1+\xi))$, $A' = \ln((1+\xi)(2s-1))$ and

$$\phi_0(c) = \frac{\ln(c^2 R_c\, m_{\min}) + k A'}{c}\,. \qquad (12)$$

Ultimately, we want to choose $L$ such that $T$ is bounded by a constant, as its summands over $c$ are bounded by a convergent series. With this goal, let $c^*$ be the value maximising the expression $\phi_0(c)$, and $\phi^*$ the value of the maximum. The sum above is then bounded by

$$\sum_c \frac{1}{c^2} \exp\!\left[-c\left(LA - \phi^* - \ln(1+\xi^{-1})\right)\right].$$

For any value of $\xi$ such that $A > 0$ (that is, for $\xi > (s-1)^{-1}$), there exists a positive smallest value of $L$ such that the exponent in the expression above is negative. So we set

$$L^* = \frac{\phi^* + \ln(1+\xi^{-1})}{A}$$

(as the choice of $\xi$ is free, we can tune it at the end so that the ratio is an integer), and recognise that the RHS of equation (11), specialised to $L = L^*$, is bounded by $\sum_c \frac{1}{c^2} = \pi^2/6$. Note that

$$\phi^* \;\geq\; \phi_0(1) \;\geq\; k A'$$

so that

$$L^* \;\geq\; \frac{k A'}{A} \;=\; k\,\frac{\ln((1+\xi)(2s-1))}{\ln(s\xi/(1+\xi))}\,,$$

which implies that we can set $\beta = \frac{2s-1+A/A'}{2s-1-A/A'}$, and

$$\frac{1}{1 - \frac{1}{s}\,\frac{L+k}{L}} \;\leq\; \frac{1}{1 - \frac{1}{s}\left(1 + \frac{A}{A'}\right)} \;=\; \frac{1}{1 - \frac{1}{s}\,\frac{\ln(s\xi(2s-1))}{\ln((1+\xi)(2s-1))}}\,.$$

Now, let us choose $q = \left\lfloor \frac{m_{\min}-k}{2} \right\rfloor$, which coincides with the choice of the analogous parameter in Chang and Marr [5]. This is the largest possible value such that $c(w) \geq 1$ for all $w \in D$. With this choice,

$$\frac{1}{q} \;\leq\; \frac{2}{m_{\min}}\,\frac{C_1}{C_1 - 1}\,.$$

Collecting the various factors calculated above, we get that the expression (9) is bounded by

$$\Phi_{\mathrm{alg}}(D) \;\leq\; \frac{2}{m_{\min}}\,\frac{C_1}{C_1-1}\left[L^* + \frac{\beta\,\pi^2/6}{1 - \frac{1}{s}\left(1 + \frac{A}{A'}\right)}\right].$$

We are left with two tasks: choosing suitable values for $\xi$ and $C_1$ (both of order 1), and recognising that the expression for $L^*$ (and for $\phi^*$) can be related to the quantity $\phi(r)$ in (2). Let us start from the latter. Note that, as for any $m \geq m_{\min}$

$$\frac{m-k}{q} - 2 \;\leq\; c(m) \;\leq\; \frac{m}{q}\,,$$

we can write² $\frac{m}{2} \leq m_{\min}\, c(m) \leq s^2 m$, which gives

$$\max_c \frac{1}{c} \ln(c^2 m_{\min} R_c) \;\leq\; \max_m \frac{m_{\min}}{m} \ln(s^2 m^2 r_m) \;\leq\; 2\, m_{\min}\, \phi(r)\,.$$

As, of course, $\max_c (X(c) + Y(c)) \leq \max_c X(c) + \max_c Y(c)$, we have in particular that

$$\phi^* \;\leq\; 2\, m_{\min}\, \phi(r) + k A' \qquad\qquad L^* \;\leq\; \frac{2\, m_{\min}\, \phi(r) + k A' + \ln(1+\xi^{-1})}{A}\,,$$

² Because $s \geq 2$, and we anticipate that, under our choice, $C_1 \geq 5$: on one side $m_{\min}\, c(m) \leq m_{\min}\,\frac{m}{q} \leq \frac{2 C_1}{C_1-1}\, m \leq s^2 m$; on the other side $m_{\min}\, c(m) \geq \max\!\big(m_{\min},\, 2(m-k-2q)\big) \geq \frac{m}{2}$.

Figure 3 Plot of the constants $C_1(s)$, $C_1'(s)$ and $C_2(s)$, as given by the expressions in (13) (respectively, in blue, green and red). The asymptotic values are 5, $5\pi^2/12$ and 0, respectively.

which thus implies

$$\Phi_{\mathrm{alg}}(D) \;\leq\; \frac{2}{m_{\min}}\,\frac{C_1}{C_1-1}\left[\frac{2 m_{\min}}{A}\,\phi(r) + \frac{k A'}{A} + \frac{\beta\,\pi^2/6}{1 - \frac{1}{s}\left(1+\frac{A}{A'}\right)} + \frac{\ln(1+\xi^{-1})}{A}\right]$$
$$=\; \frac{2 C_1}{C_1-1}\left[\frac{A'}{A}\,\frac{k}{m_{\min}} + \left(\frac{\beta\,\pi^2/6}{1 - \frac{1}{s}\left(1+\frac{A}{A'}\right)} + \frac{\ln(1+\xi^{-1})}{A}\right)\frac{1}{m_{\min}} + \frac{2}{A}\,\phi(r)\right].$$

Let us choose $C_1 = 2A'/A + 1$. The expression above simplifies into

$$\Phi_{\mathrm{alg}}(D) \;\leq\; \frac{C_1\, k}{m_{\min}} + \frac{2A'+A}{A\,A'}\left[\left(\frac{A\,\beta\,\pi^2/6}{1 - \frac{1}{s}\left(1+\frac{A}{A'}\right)} + \ln(1+\xi^{-1})\right)\frac{1}{m_{\min}} + 2\,\phi(r)\right];$$

356

in particular, this justifies the notation $C_1$, which in the introduction was chosen to denote the coefficient in front of the $\frac{k}{m_{\min}}$ summand. Now we shall choose the optimal value of $\xi$. The dependence on $\xi$ is mild, provided that we are in the appropriate range $\xi>1/(s-1)$. The choice of $\xi$, in turn, determines the ratio between the lower and upper bound, which has the functional form $C_1'+\kappa_s C_2$ (with notations as in the theorem). A choice which is a good trade-off among the three summands in this expression, and for which the analytic expression is relatively simple, is to take $\xi=2s$. Under this choice we have

$$
C_1 = 1+\frac{2\ln(4s^2-1)}{\ln(2s^2/(2s+1))}\,, \qquad
C_2 = \frac{4}{\ln(2s^2/(2s+1))}+\frac{2}{\ln(4s^2-1)}\,,
$$
$$
C_1' = \frac{C_2}{2}\,\ln\frac{2s+1}{2s} \;+\; \frac{\beta\,\pi^2}{6}\,\frac{s\,\ln(2s^2/(2s+1))\,\ln(4s^2-1)}{(s-1)\ln(4s^2-1)-\ln(2s^2/(2s+1))}\,,
$$

or, in a more compact way, calling $a=A|_{\xi=2s}=\ln(2s^2/(2s+1))$ and $a'=A'|_{\xi=2s}=\ln(4s^2-1)$, and substituting back the value of $\beta$,
$$
C_1 = \frac{a+2a'}{a}\,, \qquad C_2 = \frac{a+2a'}{a}\,\frac{2}{a'}\,, \tag{13a}
$$
$$
C_1' = \frac{a+2a'}{a}\left(\frac{1}{a'}\,\ln\frac{2s+1}{2s} \;+\; \frac{\pi^2}{6}\,\frac{(2s-1)a'+a}{(2s-1)a'-a}\,\frac{a\,s}{(s-1)a'-a}\right). \tag{13b}
$$
The behaviour in $s$ of these constants is depicted in Figure 3.
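The curves of Figure 3 can be reproduced numerically. The sketch below evaluates $C_1(s)$, $C_2(s)$ and $C_1'(s)$ from the expressions in (13) as transcribed above, and checks the comparison $C_1'<C_1$ used in the concluding argument:

```python
from math import log, pi

def constants(s):
    """Evaluate C1(s), C2(s) and C1'(s) from (13) for an alphabet of size s."""
    a = log(2 * s * s / (2 * s + 1))   # a  = ln(2s^2/(2s+1))
    ap = log(4 * s * s - 1)            # a' = ln(4s^2 - 1)
    C1 = (a + 2 * ap) / a
    C2 = (a + 2 * ap) / a * 2 / ap
    C1p = (a + 2 * ap) / a * (
        log((2 * s + 1) / (2 * s)) / ap
        + (pi**2 / 6)
        * ((2 * s - 1) * ap + a) / ((2 * s - 1) * ap - a)
        * (a * s) / ((s - 1) * ap - a)
    )
    return C1, C2, C1p

vals = [constants(s) for s in range(2, 11)]
C1s = [v[0] for v in vals]
assert all(x > y for x, y in zip(C1s, C1s[1:]))   # C1(s) decreases towards 5
assert all(C1 > 5 and C2 > 0 for C1, C2, _ in vals)
assert all(C1p < C1 for C1, _, C1p in vals)       # C1' < C1 for s = 2, ..., 10
```

Under this transcription one finds, for instance, $C_1(2)\approx 12.5$ and $C_1'(2)\approx 10.7$, in agreement with the ranges shown in the plot.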

It can be verified that, with our choice of $\xi$, $C_1'<C_1$ for all $s\geq 2$.$^3$ We can therefore replace $C_1'$ by $C_1$ in the functional form (10) for the bound on $\Phi_{\mathrm{alg}}(D)$, and thus obtain the statement of Theorem 2. This concludes our proof.

References

1 Alfred V. Aho and Margaret J. Corasick. Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6):333–340, 1975.
2 Frédérique Bassino, Tsinjo Rakotoarimalala, and Andrea Sportiello. The complexity of the Multiple Pattern Matching Problem for random strings. In 2018 Proceedings of the Fifteenth Workshop on Analytic Algorithmics and Combinatorics (ANALCO), pages 40–53. SIAM, 2018.
3 Robert S. Boyer and J Strother Moore. A fast string searching algorithm. Communications of the ACM, 20(10):762–772, 1977.
4 William I. Chang and Eugene L. Lawler. Sublinear approximate string matching and biological applications. Algorithmica, 12(4-5):327–344, 1994.
5 William I. Chang and Thomas G. Marr. Approximate string matching and local similarity. In Combinatorial Pattern Matching - CPM 1994, volume 807 of Lecture Notes in Computer Science, pages 259–273. Springer, 1994.
6 Kimmo Fredriksson and Szymon Grabowski. Average-optimal string matching. Journal of Discrete Algorithms, 7(4):579–594, 2009.
7 Donald E. Knuth, James H. Morris, and Vaughan R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323–350, 1977.
8 Gonzalo Navarro. A guided tour to approximate string matching. ACM Computing Surveys (CSUR), 33(1):31–88, 2001.
9 Gonzalo Navarro and Kimmo Fredriksson. Average complexity of exact and approximate multiple string matching. Theoretical Computer Science, 321(2-3):283–290, 2004.
10 Peter H. Sellers. The theory and computation of evolutionary distances: pattern recognition. Journal of Algorithms, 1(4):359–373, 1980.
11 Andrew Chi-Chih Yao. The complexity of pattern matching for a random string. SIAM Journal on Computing, 8(3):368–387, 1979.

$^3$ One way to see this is by proving that both $C_1(s)$ and $C_1'(s)$ decrease monotonically as functions on the real interval $[2,+\infty[$, that $\lim_{s\to\infty}C_1(s)=5$, that $C_1(s)>C_1'(s)$ for $s\in\{2,3,\ldots,7\}$, and that $C_1'(8)<5$.
