HAL Id: hal-01930623
https://hal.archives-ouvertes.fr/hal-01930623
Submitted on 22 Nov 2018
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Consensus Strings with Small Maximum Distance and Small Distance Sum
Laurent Bulteau, Markus L. Schmid
To cite this version:
Laurent Bulteau, Markus Schmid. Consensus Strings with Small Maximum Distance and Small Distance Sum. 43rd International Symposium on Mathematical Foundations of Computer Science (MFCS 2018), Aug 2018, Liverpool, United Kingdom. 10.4230/LIPIcs.MFCS.2018.1. hal-01930623
Consensus Strings with Small Maximum Distance and Small Distance Sum

Laurent Bulteau
Université Paris-Est, LIGM (UMR 8049), CNRS, ENPC, ESIEE Paris, UPEM, F-77454, Marne-la-Vallée, France
laurent.bulteau@u-pem.fr

Markus L. Schmid
Fachbereich 4 – Abteilung Informatikwissenschaften, Universität Trier, 54286 Trier, Germany
mlschmid@mlschmid.de
Abstract

The parameterised complexity of consensus string problems (Closest String, Closest Substring, Closest String with Outliers) is investigated in a more general setting, i.e., with a bound on the maximum Hamming distance and a bound on the sum of Hamming distances between solution and input strings. We completely settle the parameterised complexity of these generalised variants of Closest String and Closest Substring, and partly settle that of Closest String with Outliers; in addition, we answer some open questions from the literature regarding the classical problem variants with only one distance bound. Finally, we investigate the question of polynomial kernels and respective lower bounds.

2012 ACM Subject Classification Theory of computation → Problems, reductions and completeness; Theory of computation → Fixed parameter tractability; Theory of computation → W hierarchy

Keywords and phrases Consensus String Problems, Closest String, Closest Substring, Parameterised Complexity, Kernelisation

Digital Object Identifier 10.4230/LIPIcs.MFCS.2018.1
1 Introduction
Consensus string problems have the following general form: given input strings S = {s_1, ..., s_k} and a distance bound d, find a string s with distance at most d from the input strings. With the Hamming distance as the central distance measure for strings, there are two obvious types of distance between a single string s and a set S of strings: the maximum distance between s and any string from S (called radius) and the sum of all distances between s and the strings from S (called distance sum). The most basic consensus string problem is Closest String, where we get a set S of k length-ℓ strings and a bound d, and ask whether there exists a length-ℓ solution string s with radius at most d. This problem is NP-complete (see [?]), but fixed-parameter tractable for many variants (see [?]), including the parameterisation by d, which in biological applications can often be assumed to be small (see [?, ?]). A classical extension is Closest Substring, where the strings of S have length at most ℓ, the solution string must have a given length m, and the radius bound d is w.r.t. some length-m substrings of the input strings. A parameterised complexity analysis (see [?, ?, ?]) has shown Closest Substring to be harder than Closest String. If we bound the distance sum instead of the radius, then Closest String collapses to a trivial problem, while Closest Substring, which is then called Consensus Patterns, remains NP-complete. Closest String with Outliers is a recent extension, which is defined like Closest String, but with the possibility to ignore a given number t of input strings.
The main motivation for consensus string problems comes from the important task of finding similar regions in DNA or protein sequences, which arises in many different contexts of computational biology, e.g., universal PCR primer design [?, ?, ?, ?], genetic probe design [?], antisense drug design [?, ?], finding transcription factor binding sites in genomic data [?], determining an unbiased consensus of a protein family [?], and motif recognition [?, ?, ?]. The consensus string problems are a formalisation of this computational task and most variants of them are NP-hard. However, due to their high practical relevance, it is necessary to solve them despite their intractability, which has motivated the study of their approximability, on the one hand, but also of their fixed-parameter tractability, on the other (see the survey [?] for an overview of the parameterised complexity of consensus string problems). This work is a contribution to the latter branch of research.
Problem Definition. Let Σ be a finite alphabet, let Σ* be the set of all strings over Σ, including the empty string ε, and let Σ+ = Σ* \ {ε}. For w ∈ Σ*, |w| is the length of w and, for every i, 1 ≤ i ≤ |w|, by w[i] we refer to the symbol at position i of w. For every n ∈ N ∪ {0}, let Σ^n = {w ∈ Σ* | |w| = n} and Σ^{≤n} = ⋃_{i=0}^{n} Σ^i. By ⊑, we denote the substring relation over the set of strings, i.e., for u, v ∈ Σ*, u ⊑ v if v = xuy, for some x, y ∈ Σ*. We use the concatenation of sets of strings as usually defined, i.e., for A, B ⊆ Σ*, A·B = {uv | u ∈ A, v ∈ B}.

For strings u, v ∈ Σ* with |u| = |v|, d_H(u, v) is the Hamming distance between u and v. For a multi-set S = {u_i | 1 ≤ i ≤ n} ⊆ Σ^ℓ and a string v ∈ Σ^ℓ, for some ℓ ∈ N, the radius of S (w.r.t. v) is defined by r_H(v, S) = max{d_H(v, u) | u ∈ S} and the distance sum of S (w.r.t. v) is defined by s_H(v, S) = Σ_{u∈S} d_H(v, u).¹ Next, we state the problem (r,s)-Closest String in full detail, from which we then derive the other considered problems:
(r,s)-Closest String
Instance: A multi-set S = {s_i | 1 ≤ i ≤ k} ⊆ Σ^ℓ, ℓ ∈ N, and integers d_r, d_s ∈ N.
Question: Is there an s ∈ Σ^ℓ with r_H(s, S) ≤ d_r and s_H(s, S) ≤ d_s?
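As a purely illustrative reference point, the two distance measures and a brute-force decision procedure for (r,s)-Closest String can be sketched in Python. This is our own sketch (the names d_H, r_H, s_H and closest_string are ours); the exhaustive search is exponential in ℓ, so it is usable for tiny instances only and is not one of the algorithms discussed later.

```python
from itertools import product

def d_H(u, v):
    # Hamming distance between two equal-length strings
    assert len(u) == len(v)
    return sum(a != b for a, b in zip(u, v))

def r_H(s, S):
    # radius of S w.r.t. s: maximum distance to any input string
    return max(d_H(s, u) for u in S)

def s_H(s, S):
    # distance sum of S w.r.t. s
    return sum(d_H(s, u) for u in S)

def closest_string(S, d_r, d_s, alphabet):
    # brute force over all candidate strings in Sigma^ell
    ell = len(S[0])
    for cand in product(alphabet, repeat=ell):
        s = "".join(cand)
        if r_H(s, S) <= d_r and s_H(s, S) <= d_s:
            return s
    return None
```

For example, for S = {aaab, aaba, aaaa} the string aaaa has radius 1 and distance sum 2, so it witnesses the instance with d_r = 1 and d_s = 2.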
For (r,s)-Closest Substring, we have S ⊆ Σ^{≤ℓ} and an additional input integer m ∈ N, and we ask whether there is a multi-set S' = {s'_i | s'_i ⊑ s_i, 1 ≤ i ≤ k} ⊆ Σ^m with r_H(s, S') ≤ d_r and s_H(s, S') ≤ d_s. For (r,s)-Closest String with Outliers (or (r,s)-Closest String-wo for short), we have an additional input integer t ∈ N, and we ask whether there is a multi-set S' ⊆ S with |S'| = k − t such that r_H(s, S') ≤ d_r and s_H(s, S') ≤ d_s. We also call (r,s)-Closest String the general variant of Closest String, while (r)-Closest String and (s)-Closest String denote the variants where the only distance bound is d_r or d_s, respectively; we shall also call them the (r)- and (s)-variant of Closest String. Analogous notation applies to the other consensus string problems. The problem names that are also commonly used in the literature translate into our terminology as follows: Closest String = (r)-Closest String, Closest Substring = (r)-Closest Substring, Consensus Patterns = (s)-Closest Substring and Closest String-wo = (r)-Closest String-wo.
The motivation for our more general setting with respect to the bounds d_r and d_s is the following. While the distance measures of radius and distance sum are well motivated, they also have, if considered individually, obvious deficiencies. In the distance sum variant, we may consider strings as close enough that are very close to some, but totally different from the other input strings. In the radius variant, on the other hand, we may consider strings as too different, even though they are very similar to all input strings except one, for which the bound is exceeded by only a small amount. Using an upper bound on the distance per input string and an upper bound on the total sum of distances caters for these cases.²

¹ Note that we slightly abuse notation with respect to the subset relation: for a multi-set A and a set B, A ⊆ B means that A' ⊆ B, where A' is the set obtained from A by deleting duplicates; for multi-sets A, B, A ⊆ B is defined as usual. Moreover, whenever it is clear from the context that we talk about multi-sets, we also simply use the term set.
For any problem K, by K(p_1, p_2, ...), we denote the variant of K parameterised by the parameters p_1, p_2, .... For unexplained concepts of parameterised complexity, we refer to the textbooks [?, ?, ?].
Known Results. In contrast to graph problems, where interesting parameters are often hidden in the graph structure, string problems typically contain a variety of obvious, but nevertheless interesting, parameters that could be exploited in terms of fixed-parameter tractability. For the consensus string problems, these are the number of input strings k, their length ℓ, the radius bound d_r, the distance sum bound d_s, the alphabet size |Σ|, the substring length m (in the case of (r,s)-Closest Substring), and the numbers of outliers t and inliers k − t (in the case of (r,s)-Closest String-wo). This leads to a large number of different parameterisations, which justifies the hope for fixed-parameter tractable variants.
The parameterised complexity (w.r.t. the above-mentioned parameters) of the radius as well as the distance sum variant of Closest String and Closest Substring has been settled by a sequence of papers (see [?, ?, ?, ?, ?] and, for a survey, [?]), except for (s)-Closest Substring with respect to the parameter ℓ, which has been neglected in these papers and mentioned as an open problem in [?], where it is shown that the fixed-parameter tractability results for (r)-Closest String carry over to (r)-Closest Substring if we additionally parameterise by (ℓ − m). The parameterised complexity analysis of the radius variant of Closest String with Outliers has been started more recently in [?] and, to the knowledge of the authors, the distance sum variant has not yet been considered.
The parameterised complexity of the general variants, where we have a bound on both the radius and the distance sum, has not yet been considered in the literature. While there are obvious reductions from the (r)- and (s)-variants to the general variant, these three variants describe, especially in the parameterised setting, rather different problems.
Our Contribution. In this work, we answer some open questions from the literature regarding the (r)- and (s)-variants of the consensus string problems, and we initiate the parameterised complexity analysis of the general variants.
We extend all the FPT results from (r)-Closest String to the general variant; thus, we completely settle the fixed-parameter tractability of (r,s)-Closest String. While for some parameterisations this is straightforward, the case of parameter d_r follows from a non-trivial extension of the known branching algorithm for (r)-Closest String(d_r) (see [?]).
For (r,s)-Closest Substring, we classify all parameterised variants as being in FPT or W[1]-hard, which is done by answering the open question whether (s)-Closest Substring(ℓ) is in FPT (see [?]) in the negative (which also settles the parameterised complexity of (s)-Closest Substring) and by slightly adapting the existing FPT algorithms.
Regarding (r,s)-Closest String-wo, we solve an open question from [?] w.r.t. the radius variant, we show W[1]-hardness for a strong parameterisation of the (s)-variant, we show fixed-parameter tractability for some parameter combinations of the general variant and, as our main result, we present an FPT algorithm (for the general variant) for the parameters d_r and t (which is the same algorithm that shows (r,s)-Closest String(d_r) ∈ FPT mentioned above). Many other cases are left open for further research.

² To the knowledge of the authors, optimising both the radius and the distance sum was first considered in [?], where algorithms for the special case k = 3 are given.
Finally, we investigate the question whether the fixed-parameter tractable variants of the considered consensus string problems allow polynomial kernels, thus continuing a line of work initiated by Basavaraju et al. [?], in which kernelisation lower bounds for (r)-Closest String and (r)-Closest Substring are proved. Our respective main result is a cross-composition from (r)-Closest String into (r)-Closest String-wo.

Due to space constraints, proofs for results marked with (∗) are omitted.
2 Closest String and Closest String-wo
In this section, we investigate (r,s)-Closest String and (r,s)-Closest String-wo (and their (r)- and (s)-variants); we shall first give some useful definitions.
It will be convenient to treat a set S = {s_i | 1 ≤ i ≤ k} ⊆ Σ^ℓ as a k×ℓ matrix with entries from Σ. By the term column of S, we refer to the transpose of a column of the matrix S, which is an element of Σ^k; thus, the introduced string notation applies, e.g., if c is the i-th column of S, then c[j] corresponds to s_j[i]. A string s ∈ Σ^ℓ is a majority string (for a set S ⊆ Σ^ℓ) if, for every i, 1 ≤ i ≤ ℓ, s[i] is a symbol with majority in the i-th column of S. Obviously, s_H(s, S) = min{s_H(s', S) | s' ∈ Σ^ℓ} if and only if s is a majority string for S. We call a string s ∈ Σ^ℓ radius optimal or distance sum optimal (with respect to a set S ⊆ Σ^ℓ) if r_H(s, S) = min{r_H(s', S) | s' ∈ Σ^ℓ} or s_H(s, S) = min{s_H(s', S) | s' ∈ Σ^ℓ}, respectively.
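The majority-string notion can be made concrete in a few lines of Python (a sketch; majority_string is our own name, and ties between equally frequent symbols are broken arbitrarily, which is harmless since any majority string is distance sum optimal):

```python
from collections import Counter

def majority_string(S):
    # pick, in every column, one most frequent symbol
    ell = len(S[0])
    cols = [[x[i] for x in S] for i in range(ell)]
    return "".join(Counter(col).most_common(1)[0][0] for col in cols)
```

For S = {abc, abd, aad} this yields abd with distance sum 2, matching the column-wise optimum Σ_i (k − maximum symbol count in column i) = 0 + 1 + 1.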
It is a well-known fact that (r)-Closest String allows FPT algorithms for each of the single parameters k, d_r and ℓ, and that it is still NP-hard for |Σ| = 2 (see [?]). While the latter hardness result trivially carries over to (r,s)-Closest String (by setting d_s = k·d_r), we have to modify the FPT algorithms in order to extend the fixed-parameter tractability results to (r,s)-Closest String. We start with the parameter k, for which we can extend the ILP approach that is used in [?] to show (r)-Closest String(k) ∈ FPT.
▶ Theorem 1 (∗). (r,s)-Closest String(k) ∈ FPT.
Next, we consider the parameter d_r. For the (r)-variant of (r,s)-Closest String, fixed-parameter tractability with respect to d_r is shown in [?] by a branching algorithm, which has proved rather versatile: it has successfully been extended in [?] to (r)-Closest String-wo(d_r, t) and in [?] to (r)-Closest Substring(d_r, (ℓ − m)). We propose an extension of the same branching algorithm that allows for a bound d_s on the distance sum; thus, it works for (r,s)-Closest String(d_r). In fact, we prove in Theorem 7 an even stronger result, where we also extend the algorithm to exclude up to t outlier strings from the input set S, i.e., we extend it to the problem (r,s)-Closest String-wo(d_r, t). Since Theorem 3 can therefore be seen as a corollary of this result by taking t = 0, we only give an informal description of a direct approach that solves (r,s)-Closest String(d_r) (and refer to Theorem 7 for a formal proof).
The core idea is to apply the branching algorithm starting with the majority string for the input set S, instead of an arbitrary string from S. Then, as in [?], the algorithm replaces some characters of the current string with characters of the solution string. This way, it can be shown that the distance sum of the current string is always a lower bound on the distance sum of the optimal string, which allows us to cut any branch where the distance sum exceeds the threshold d_s. We prove the following lemma, which allows us to bound the depth of the search tree (and shall also be used in the proof of Theorem 7 later on):
Table 1 Results for (r,s)-Closest String ("p" marks the parameter of the respective row).

  k  | d_r | d_s | |Σ| |  ℓ  | Result  | Note/Ref.
  p  |  –  |  –  |  –  |  –  | FPT     | Thm. 1
  –  |  p  |  –  |  –  |  –  | FPT     | Thm. 3
  –  |  –  |  p  |  –  |  –  | FPT     | Cor. 4
  –  |  –  |  –  |  2  |  –  | NP-hard | from (r)-variant [?]
  –  |  –  |  –  |  –  |  p  | FPT     | Cor. 4
▶ Lemma 2 (∗). Let S ⊆ Σ^ℓ and s ∈ Σ^ℓ such that r_H(s, S) ≤ d_r, and let s_m be a majority string for S. Then d_H(s_m, s) ≤ 2d_r.

▶ Theorem 3. (r,s)-Closest String(d_r) ∈ FPT.
Obviously, we can assume d_r ≤ ℓ, and we can further assume that every column of S contains at least two different symbols (all columns without this property can be removed), which implies s_H(s, S) ≥ ℓ for every s ∈ Σ^ℓ; thus, we can assume ℓ ≤ d_s. Consequently, we obtain the following corollary:

▶ Corollary 4. (r,s)-Closest String(ℓ) ∈ FPT, (r,s)-Closest String(d_s) ∈ FPT.
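The column-cleaning step behind Corollary 4 (dropping columns in which all input strings agree, which changes no distance to any candidate solution restricted to the remaining columns) might look as follows; strip_constant_columns is our own name for this preprocessing:

```python
def strip_constant_columns(S):
    # keep exactly the columns that contain two different symbols
    keep = [i for i in range(len(S[0])) if len({x[i] for x in S}) >= 2]
    return ["".join(x[i] for i in keep) for x in S], keep
```

After this step, every remaining column contributes at least one mismatch to any candidate string, so s_H(s, S) ≥ ℓ and hence ℓ ≤ d_s may be assumed.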
This completely settles the parameterised complexity of (r,s)-Closest String with respect to the parameters k, d_r, d_s, |Σ| and ℓ; recall that the (r)-variant is already settled, while the (s)-variant is trivial.
2.1 (r,s)-Closest String-wo
We now turn to the problem (r,s)-Closest String-wo. We first prove several fixed-parameter tractability results for the general variant; in Sec. 2.2, we consider the (r)- and (s)-variants separately.

First, we note that solving an instance of (r,s)-Closest String-wo(k) can be reduced to solving f(k) many (r,s)-Closest String(k) instances, which, due to the fixed-parameter tractability of the latter problem, yields the fixed-parameter tractability of the former.

▶ Theorem 5 (∗). (r,s)-Closest String-wo(k) ∈ FPT.

If the number k − t of inliers exceeds d_s, then an (r,s)-Closest String-wo instance becomes easily solvable; thus, k − t can be bounded by d_s, which yields the following result:

▶ Theorem 6 (∗). (r,s)-Closest String-wo(d_s, t) ∈ FPT.
The algorithm introduced in [?] to prove (r)-Closest String(d_r) ∈ FPT has been extended in [?] with an additional branching that guesses whether a string s_j should be considered an outlier or not, thus yielding the fixed-parameter tractability of (r)-Closest String-wo(d_r, t). We present a non-trivial extension of this algorithm, with a carefully selected starting string, to obtain the fixed-parameter tractability of (r,s)-Closest String-wo(d_r, t) (and, as explained in Section 2, also of (r,s)-Closest String(d_r)):

▶ Theorem 7. (r,s)-Closest String-wo(d_r, t) ∈ FPT.
Proof. Let (S, d_s, d_r, t) be a positive instance of (r,s)-Closest String-wo(d_r, t) with k ≥ 5t (otherwise, k can be considered as a parameter). A character x is frequent in column i if it has at least as many occurrences as the majority character minus t (thus, for any S' ⊆ S with |S'| ≥ |S| − t, all majority characters for S' are frequent characters). A column i is disputed if it contains at least two frequent characters. Let D be the number of disputed columns.

[Figure 1: Example for Algorithm 1 on an instance of (r,s)-Closest String-wo with six input strings s_1, ..., s_6 of length 14, d_r = 5, d_s = 14, t = 2 and D = 10 disputed columns. The shown steps correspond to one branch that yields a correct solution. The algorithm starts with the majority string where disputed characters are replaced by ⋆. At each step, the algorithm either inserts a character from an input string at maximal distance from s' (note that even non-disputed characters may be replaced), or removes one such string. When t = 0, it is checked whether the completion s'' of s' is a correct solution. At step 7, a solution with r_H(s'', S') = 5 and s_H(s'', S') = 14 is returned.]
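Under the definitions just given, frequent characters and disputed columns are straightforward to compute; a small sketch (our own function name):

```python
def frequent_and_disputed(S, t):
    # per column: characters occurring at least (majority count - t) times;
    # a column is disputed if it has two or more frequent characters
    freq_per_col, disputed = [], []
    for i in range(len(S[0])):
        col = [x[i] for x in S]
        maj_count = max(col.count(c) for c in set(col))
        freq = {c for c in set(col) if col.count(c) >= maj_count - t}
        freq_per_col.append(freq)
        if len(freq) >= 2:
            disputed.append(i)
    return freq_per_col, disputed
```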
Let (S*, s*) be a solution for this instance. In a disputed column i, no character occurs more than (k + t)/2 times; hence, among the k − t strings of S*, there are at least (k − t) − (k + t)/2 = (k − 3t)/2 mismatches at position i. The disputed columns thus introduce at least D(k − 3t)/2 mismatches. Since the overall number of mismatches is upper-bounded by d_r(k − t), we have D ≤ 2d_r(k − t)/(k − 3t) = 2d_r(1 + 2t/(k − 3t)), and, with k ≥ 5t, the upper bound D ≤ 4d_r follows.
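Written out, the counting argument bounding the number of disputed columns is:

```latex
D \cdot \frac{k-3t}{2}
  \;\le\; \sum_{s \in S^*} d_H(s^*, s)
  \;\le\; d_r (k-t)
\quad\Longrightarrow\quad
D \;\le\; \frac{2 d_r (k-t)}{k-3t}
  \;=\; 2 d_r \left( 1 + \frac{2t}{k-3t} \right)
  \;\le\; 4 d_r ,
```

where the last inequality uses k ≥ 5t, i.e., k − 3t ≥ 2t and thus 2t/(k − 3t) ≤ 1.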
We introduce a new character ⋆ ∉ Σ. A string s' ∈ (Σ ∪ {⋆})^ℓ is a lower bound for a solution s* if, for every i such that s'[i] ≠ s*[i], either i is a disputed column and s'[i] = ⋆, or i is not disputed and s'[i] is the majority character for column i of S* (which is equal to the majority character for column i of S). Intuitively speaking, whenever a character s'[i] differs from s*[i], it is the majority character of its column (except for disputed columns, in which we use the "undecided" character ⋆). Finally, the completion for S' of a string s' ∈ (Σ ∪ {⋆})* is the string obtained by replacing each occurrence of ⋆ by a majority character of the corresponding column in S'.
We now prove that Algorithm 1 solves (r,s)-Closest String-wo in time at most O*((d_r + 1)^{6d_r} · 2^{6d_r + t}), using the following three claims (see Figure 1 for an example).

Claim 1: Any call to Solve Closest String-wo(S', t, s', d) always returns after time O*((d_r + 1)^d · 2^{d+t}).
ALGORITHM 1: Solve Closest String-wo
Input: S' ⊆ S, t ∈ N, s' ∈ (Σ ∪ {⋆})^ℓ, d ∈ N
Output: a pair (S*, s*) or the symbol ∅

1   if t = 0 then
2       s'' = completion of s' in S';
3       if s_H(s'', S') ≤ d_s and r_H(s'', S') ≤ d_r then return (S', s'');
4       if d = 0 then return ∅;
5   Let s_j ∈ S' be such that d_H(s', s_j) is maximal;
6   if t > 0 then
7       sol = Solve Closest String-wo(S' \ {s_j}, t − 1, s', d);
8       if sol ≠ ∅ then return sol;
9   if d > 0 then
10      Let I ⊆ {1, ..., ℓ} contain d_r + 1 indices i s.t. s'[i] ≠ s_j[i] (or all such indices if d_H(s_j, s') ≤ d_r);
11      for i ∈ I do
12          s'' = s', s''[i] = s_j[i];
13          sol = Solve Closest String-wo(S', t, s'', d − 1);
14          if sol ≠ ∅ then return sol;
15  return ∅;

Proof of Claim 1: We prove this running time by induction: if d = t = 0, then the function returns in Line 3 or 4; thus, it returns after polynomial time. Otherwise, it performs at most d_r + 1 recursive calls with parameters (d − 1, t), and one recursive call with parameters (d, t − 1). By induction, the complexity of this step is O*((d_r + 1) · (d_r + 1)^{d−1} · 2^{d+t−1} + (d_r + 1)^d · 2^{d+t−1}) = O*((d_r + 1)^d · 2^{d+t}). ◀ (Claim 1)
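A direct, unoptimised Python transcription of Algorithm 1 together with the start configuration of Claim 3 may help to see the control flow. Everything below is our own sketch: "*" stands for the undecided character ⋆, None plays the role of ∅, and the k ≥ 5t case distinction from the proof is not enforced.

```python
STAR = "*"

def d_H(u, v):
    return sum(a != b for a, b in zip(u, v))

def r_H(s, S):
    return max(d_H(s, x) for x in S)

def s_H(s, S):
    return sum(d_H(s, x) for x in S)

def completion(s, S):
    # replace every undecided position by a majority character of S
    out = list(s)
    for i, c in enumerate(s):
        if c == STAR:
            col = [x[i] for x in S]
            out[i] = max(set(col), key=col.count)
    return "".join(out)

def solve(S, t, s, d, d_r, d_s):
    # Lines 1-4: with no outliers left, test the completion of s
    if t == 0:
        s2 = completion(s, S)
        if s_H(s2, S) <= d_s and r_H(s2, S) <= d_r:
            return (S, s2)
        if d == 0:
            return None
    # Line 5: a string at maximal distance from the current string
    j = max(range(len(S)), key=lambda j: d_H(s, S[j]))
    sj = S[j]
    # Lines 6-8: branch "sj is an outlier"
    if t > 0:
        sol = solve(S[:j] + S[j+1:], t - 1, s, d, d_r, d_s)
        if sol is not None:
            return sol
    # Lines 9-14: branch on at most d_r + 1 mismatch positions
    if d > 0:
        mism = [i for i in range(len(s)) if s[i] != sj[i]]
        for i in (mism if len(mism) <= d_r else mism[:d_r + 1]):
            sol = solve(S, t, s[:i] + sj[i] + s[i+1:], d - 1, d_r, d_s)
            if sol is not None:
                return sol
    return None

def closest_string_wo(S, t, d_r, d_s):
    # start configuration of Claim 3: majority string with STAR in
    # every disputed column, and depth budget 2*d_r + D
    s0, D = [], 0
    for i in range(len(S[0])):
        col = [x[i] for x in S]
        maj_count = max(col.count(c) for c in set(col))
        freq = {c for c in set(col) if col.count(c) >= maj_count - t}
        if len(freq) >= 2:
            s0.append(STAR)
            D += 1
        else:
            s0.append(freq.pop())
    return solve(S, t, "".join(s0), 2 * d_r + D, d_r, d_s)
```

For instance, on S = {aaaa, aaab, abab, bbbb} with t = 1, d_r = 1 and d_s = 2, the start string is a⋆ab and the branch that removes bbbb yields the solution string aaab.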
A tuple (S', t', s', d) is valid if |S'| − t' = |S| − t and there exists an optimal solution (S*, s*) for which S* ⊆ S', |S*| = |S'| − t', d_H(s', s*) ≤ d, and s' is a lower bound for s*. A call of the algorithm is valid if its parameters form a valid tuple; its witness is the pair (S*, s*).

Claim 2: Any valid call to Solve Closest String-wo either directly returns a solution or performs at least one valid recursive call.
Proof of Claim 2: Let S' ⊆ Σ^ℓ, t' ≥ 0, s' ∈ (Σ ∪ {⋆})^ℓ, and d ≥ 0. Consider the call Solve Closest String-wo(S', t', s', d) and assume that it is valid, with witness (S*, s*).

Case 1: If d = t' = 0, then s* = s' and S* = S'. The completion s'' of s' is exactly s', and since (S', s') is a solution, it satisfies the conditions of Line 3 and is returned on Line 3.
Case 2: If t' = 0 and d_H(s, s') ≤ d_r for all s ∈ S'. Then S* = S' and s' is a lower bound for s*. Let s'' be the completion of s'. We show that s_H(s'', S') ≤ s_H(s*, S') ≤ d_s. Indeed, consider any column i with s''[i] ≠ s*[i]. Either s'[i] = ⋆, in which case s''[i] is the majority character for column i of S', or s'[i] ≠ ⋆, in which case, by the definition of lower bound, i is not a disputed column and s'[i] = s''[i] is the only frequent character of this column, which is the majority character for S'. In both cases, s''[i] is a majority character for S' in any column where it differs from s*; thus, it satisfies the upper bound on the distance sum. Since it also satisfies the distance radius (by the case hypothesis: d_H(s, s'') ≤ d_H(s, s') ≤ d_r for all s ∈ S'), it satisfies the conditions of Line 3; thus, the solution (S', s'') is returned on Line 3.
In the following cases, we can thus assume that the algorithm reaches Line 5. Indeed, if it returns on Line 3, then it returns a solution, and if it returns on Line 4, then we have d = t' = 0, which is dealt with in Case 1 above (the algorithm cannot return on this line when it has a valid input). We can thus define s_j to be the string selected in Line 5.
Case 3: s_j ∈ S' \ S*. Then, in particular, t' > 0; and since S* ⊆ S' \ {s_j}, the recursive call in Line 7 is valid, with the same witness (S*, s*).
Case 4: s_j ∈ S*, d = 0 and t' > 0. Then s' = s*; let s'_j be any string of S' \ S*, and let S+ = (S* \ {s_j}) ∪ {s'_j}. The pair (S+, s*) is a solution, since d_H(s*, s'_j) ≤ d_H(s*, s_j) by the definition of s_j. Thus, the recursive call on Line 7 is valid, with witness (S+, s*).
Case 5: s_j ∈ S*, d > 0 and d_H(s_j, s') > d_r. Consider the set I defined in Line 10. I has size d_r + 1; hence, there exists i_0 ∈ I such that s_j[i_0] = s*[i_0]. Then the recursive call with parameters (S', t', s'', d − 1) in Line 13 with i = i_0 is valid, with the same witness (S*, s*). Indeed, s'' is obtained from s' by setting s''[i_0] = s*[i_0] ≠ s'[i_0]; hence, all mismatches between s'' and s* already exist between s' and s*, which implies that s'' is still a lower bound for s*. Moreover, d_H(s'', s*) = d_H(s', s*) − 1 ≤ d − 1.
From now on, we can assume that d > 0 and t' > 0. Indeed, d = 0 is dealt with in Cases 1, 3 and 4, and t' = 0, d > 0 is dealt with in Cases 2 and 5. Moreover, by Cases 3 and 5, we can assume that s_j ∈ S* and d_H(s_j, s') ≤ d_r (i.e., d_H(s, s') ≤ d_r for all s ∈ S*).
Case 6: There exists i_0 such that s_j[i_0] = s*[i_0] ≠ s'[i_0]. Then, again, consider the set I defined in Line 10. Since d_H(s_j, s') ≤ d_r, we have i_0 ∈ I and, with the same argument as in Case 5, there is a valid recursive call in Line 13 when i = i_0.
Case 7: For all i, s_j[i] ≠ s'[i] ⇒ s_j[i] ≠ s*[i]. In this case, no character from s_j can be used to improve our current solution, so the character-switching procedure of Line 13 will not improve the solution; but still s_j is part of our witness set S*, so it is not clear a priori that we can remove s_j from our current solution, i.e., that the recursive call on Line 7 is valid. We handle this situation as follows. Let s+ be obtained from s' by filling the ⋆-positions of s' with the corresponding symbols of s*. We now show that (S*, s+) is a solution. To this end, let s ∈ S*. For every i, 1 ≤ i ≤ ℓ, if s[i] ≠ s+[i], then either s'[i] = ⋆ or s'[i] ∈ Σ with s'[i] = s+[i]. In both cases, we have s[i] ≠ s'[i], which implies d_H(s, s+) ≤ d_H(s, s') ≤ d_r, i.e., the radius is satisfied. Regarding the distance sum, we note that if s+[i] ≠ s*[i], then, since the occurrences of ⋆ in s' have been replaced by the corresponding symbols from s*, s'[i] ≠ ⋆, which, by the definition of lower bound, implies that s+[i] = s'[i] is the majority character for column i of S*. Consequently, Σ_{s∈S*} d_H(s+[i], s[i]) ≤ Σ_{s∈S*} d_H(s*[i], s[i]), which implies s_H(s+, S*) ≤ s_H(s*, S*) ≤ d_s.

Having defined a new solution string s+ (with respect to S*), we now prove that s+ is also a solution string with respect to S+ = (S* \ {s_j}) ∪ {s'_j}, where s'_j is any string of S' \ S*. To this end, we prove that d_H(s'_j, s+) ≤ d_H(s_j, s+); together with the fact that d_H(s'_j, s') ≤ d_r, this implies that (S+, s+) is a solution. For two strings s_1, s_2 ∈ Σ^ℓ, let d_⋆(s_1, s_2) be the number of mismatches between s_1 and s_2 at positions i such that s'[i] = ⋆, and let d_Σ(s_1, s_2) be the number of mismatches at the other positions. Clearly, d_H(s_1, s_2) = d_⋆(s_1, s_2) + d_Σ(s_1, s_2). Comparing the strings s_j and s'_j to s', we have d_⋆(s_j, s') = d_⋆(s'_j, s') (both distances are equal to the number of occurrences of ⋆ in s'). Since d_H(s_j, s') is maximal, we have d_Σ(s'_j, s') ≤ d_Σ(s_j, s'). Consider now s+. Since s+ is equal to s' in every non-⋆ character, we have d_Σ(s'_j, s+) ≤ d_Σ(s_j, s+). Finally, for any i such that s'[i] = ⋆, by the hypothesis of this case we have s_j[i] ≠ s*[i] = s+[i]; hence, d_⋆(s_j, s+) is equal to the number of occurrences of ⋆ in s', which is an upper bound for d_⋆(s'_j, s+). Overall, d_⋆(s'_j, s+) ≤ d_⋆(s_j, s+), and (S+, s+) is a solution.

Thus, (S+, s+) is a solution such that S+ ⊆ S' \ {s_j}, s' is a lower bound for s+, and d_H(s', s+) ≤ d; hence, the recursive call in Line 7 is valid. ◀ (Claim 2)
It follows from the claim above that any valid call to Solve Closest String-wo returns a solution. Indeed, if it does not directly return a solution, then it receives a solution of a more constrained instance from a valid recursive call, which is returned on Line 8 or 14.
Claim 3: Let s' be the majority string for S where, for every disputed column i, s'[i] = ⋆. Then Solve Closest String-wo(S, t, s', 2d_r + D) is a valid call.
Proof of Claim 3: Consider a solution (S*, s*). We need to check whether d_H(s*, s') ≤ 2d_r + D and whether s' is a lower bound for s*. The fact that s' is a lower bound follows from the definition, since ⋆ is selected in every disputed column and the majority character is selected in the other columns. The string s* can be seen as a solution of (r,s)-Closest String over S*, d_r, d_s; thus, we can use Lemma 2: the distance between s* and the majority string of S* is at most 2d_r. Hence, there are at most 2d_r mismatches between s' and s* in non-disputed columns (since in those columns, the majority characters are identical in S and S*). Adding the D mismatches from disputed columns, we get the 2d_r + D upper bound. ◀ (Claim 3) ◀
2.2 The (r)- and (s)-Variants of Closest String-wo
In [?], the fixed-parameter tractability of (r)-Closest String-wo w.r.t. the parameter k and w.r.t. the parameters (|Σ|, d_r, k − t) is reported as an open problem. Since Theorem 5 also applies to (r)-Closest String-wo, the only open cases left for the (r)-variant are the following:

▶ Open Problem 8. What is the fixed-parameter tractability of (r)-Closest String-wo with respect to (|Σ|, k − t), (|Σ|, d_r) and (|Σ|, d_r, k − t)?
Next, we consider the (s)-variant of Closest String-wo. We recall that replacing the radius bound by a bound on the distance sum turns (r)-Closest String into a trivial problem, while (s)-Closest Substring remains hard. The next result shows that Closest String-wo behaves like Closest Substring in this regard. For the proof, we use Multi-Coloured Clique (which is W[1]-hard, see [?]), which is identical to the standard parameterisation of Clique, except that the input graph G = (V, E) has a partition V = V_1 ∪ ... ∪ V_{k_c} such that every V_i, 1 ≤ i ≤ k_c, is an independent set (we denote the parameter by k_c to avoid confusion with the number k of input strings).
▶ Theorem 9. (s)-Closest String-wo(d_s, ℓ, k − t) is W[1]-hard.
Proof. Let G = (V_1 ∪ ... ∪ V_{k_c}, E) be a Multi-Coloured Clique instance. We assume that, for some q ∈ N, V_i = {v_{i,1}, v_{i,2}, ..., v_{i,q}}, 1 ≤ i ≤ k_c, i.e., each vertex has an index depending on its colour class and its rank within its colour class. Let Σ = V ∪ Γ, where Γ is some alphabet with |Γ| = |E|(k_c − 2). For every e = (v_{i,j}, v_{i',j'}) ∈ E, let s_e ∈ Σ^{k_c} with s_e[i] = v_{i,j} and s_e[i'] = v_{i',j'}, where all other positions are filled with symbols from Γ such that each x ∈ Γ has exactly one occurrence in the strings s_e, e ∈ E. We set S = {s_e | e ∈ E}, t = |E| − (k_c choose 2) (i.e., the number of inliers is (k_c choose 2)) and d_s = (k_c choose 2)(k_c − 2).

Let K be a clique of G of size k_c, let s ∈ Σ^{k_c} be defined by {s[i]} = K ∩ V_i, 1 ≤ i ≤ k_c, and let S' = {s_e | e ⊆ K}. Since d_H(s, s') = k_c − 2 for every s' ∈ S', we have s_H(s, S') = d_s. Consequently, S' together with s is a solution for the (s)-Closest String-wo instance S, t, d_s.
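The construction of the instance (S, t, d_s) from G can be sketched in Python. The encoding below (reduction, with vertices as ("v", colour, rank) pairs and fresh Γ-symbols as ("g", counter) pairs) is our own concrete choice of symbols, not the paper's notation:

```python
def reduction(k_c, edges):
    # edges: pairs ((i, j), (i2, j2)) of vertices from different
    # colour classes; one string of length k_c per edge
    S, filler = [], 0
    for (i, j), (i2, j2) in edges:
        s = [None] * k_c
        s[i] = ("v", i, j)
        s[i2] = ("v", i2, j2)
        for pos in range(k_c):
            if s[pos] is None:
                s[pos] = ("g", filler)  # a fresh symbol from Gamma
                filler += 1
        S.append(tuple(s))
    t = len(edges) - k_c * (k_c - 1) // 2
    d_s = (k_c * (k_c - 1) // 2) * (k_c - 2)
    return S, t, d_s
```

For a clique K, choosing S' as the edge strings inside K and s as the vertex tuple of K gives d_H(s, s_e) = k_c − 2 for each of the (k_c choose 2) chosen strings, i.e., distance sum exactly d_s.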
Now let s ∈ Σ^{k_c} and S' ⊆ S with |S'| = (k_c choose 2) be a solution for the (s)-Closest String-wo instance S, t, d_s. If, for some s'_1 ∈ S', d_H(s'_1, s) ≥ k_c − 1, then there is an s'_2 ∈ S' with d_H(s'_2, s) ≤ k_c − 3. Thus, for some i, 1 ≤ i ≤ k_c, s[i] = s'_2[i] and s'_2[i] ∈ Γ, which implies that replacing s[i] by s'_1[i] does not increase s_H(s, S'). Moreover, after this modification, d_H(s'_1, s) has decreased by 1, while d_H(s'_2, s) ≤ k_c − 2. By repeating such operations, we can transform s such that d_H(s', s) ≤ k_c − 2 for every s' ∈ S'. Next, assume that, for some i, 1 ≤ i ≤ k_c, there is an S'' ⊆ S' with |S''| = k_c and, for every s' ∈ S'', s[i] = s'[i]. Since d_H(s', s) ≤ k_c − 2 for every s' ∈ S'', the pigeon-hole principle implies that there are s'_1, s'_2 ∈ S'' with s'_1[i'] = s'_2[i'] = s[i'], for some i', 1 ≤ i' ≤ k_c, with i' ≠ i, which, by the structure of the strings of S, is a contradiction. Consequently, for every i, 1 ≤ i ≤ k_c, s matches with at most k_c − 1 strings from S' at position i. Since there are at least 2·(k_c choose 2) = k_c(k_c − 1) matches, we conclude that, for every