Sequence graphs realizations and ambiguity in language models


HAL Id: hal-02495333

https://hal.archives-ouvertes.fr/hal-02495333v3

Submitted on 13 Jan 2021


Sammy Khalife, Yann Ponty, Laurent Bulteau

To cite this version:

Sammy Khalife, Yann Ponty, Laurent Bulteau. Sequence graphs realizations and ambiguity in language models. COCOON 2021 - 27th International Computing and Combinatorics Conference, Oct 2021, Tainan, Taiwan. ⟨hal-02495333v3⟩


Sammy Khalife

LIX, CNRS, Ecole Polytechnique, Institut Polytechnique de Paris, 91128 Palaiseau, France
khalife@lix.polytechnique.fr

Yann Ponty
LIX, CNRS, Ecole Polytechnique, Institut Polytechnique de Paris, 91128 Palaiseau, France
yann.ponty@lix.polytechnique.fr

Laurent Bulteau
LIGM, CNRS, Université Gustave Eiffel, 77454 Marne-la-Vallée, France
laurent.bulteau@univ-eiffel.fr

Abstract

Several language models rely on an assumption modeling each local context as a (potentially oriented) bag of words, and have proven to be very efficient baselines. Sequence graphs are the natural structures encoding their information. However, a sequence graph may have several realizations as a sequence, leading to a degree of ambiguity. In this paper, we study this degree of ambiguity from a combinatorial and computational point of view. In particular, we present theoretical results concerning the family of sequence graphs. Several combinatorial problems are presented, depending on three levels of generalisation (window size, graph orientation, and weights), and whether some of these are NP-complete is left open. We establish different algorithms, including an integer program and a dynamic programming formulation, to respectively recognize a sequence graph and count the number of its distinct realizations. This allows us to show that this model assumption can cause a large number of sentences to have the same representation. We empirically compare the representations obtained with recurrent neural networks for different realizations of sequence graphs.

2012 ACM Subject Classification Mathematics of computing → Combinatoric problems; Mathematics of computing → Combinatorics on words; Mathematics of computing → Graph algorithms; Theory of computation → Complexity classes; Theory of computation → Problems, reductions and completeness

Keywords and phrases Graphs, Sequences, Combinatorics, Inverse problem, Complexity class

1 Introduction

The automated treatment of familiar objects, either natural or artifacts, always relies on a translation into entities manageable by computer programs. However, the correspondence between the object to be treated and "its" representation is not necessarily one-to-one. The representations used for learning algorithms are no exception to this rule. In particular, natural language words and textual documents representations are essential for several tasks, including document classification [23], role labelling [19], and named entity recognition [16]. The traditional models based on pointwise mutual information, or graph-of-words (GOW) [9, 17, 20], supplement the content of bag-of-words (TF, TFIDF) with statistics of co-occurrences within a window of fixed size $w$, introduced to mitigate the degree of ambiguity. Several models [2, 14, 18, 21] also use the same type of information and constitute strong baselines for natural language processing.

While these representations are more precise than the traditional bag-of-words (e.g. Parikh vectors), they still induce some level of ambiguity, i.e. a given graph can represent several sequences. Our study is thus motivated by a quantification of the level of ambiguity, seen as an algorithmic problem, coupled with an empirical assessment of the consequences of ambiguity for the representations.

[Figure 1 panels: (a) No ambiguity ($w = 3$); (b) Ambiguity ($w = 2$).]

Figure 1 Sequence graphs (or graphs-of-words) built for the sentence “Linux is not UNIX but Linux” using window sizes 3 (a) and 2 (b). In the second case, the sequence graph is ambiguous, since any circular permutation of the words admits the same representation.

[Figure 2 panels: (a) $w = 2$, $G$ has 30 realizations; (b) $w = 3$, $G$ has 6 realizations; (c) $w = 4$, $G$ has 3 realizations ("a b r a c a d a b r a", "a b r a c a a d b r a", "a b r c a a d a b r a"); (d) $w = 5$, $G$ has one realization ("a b r a c a d a b r a").]

Figure 2 Sequence graphs (or graphs-of-words) built for the sentence “a b r a c a d a b r a” using window sizes 2 (a), 3 (b), 4 (c) and 5 (d).

After introducing in Section 2 the formal definition of a sequence graph and the descriptions of our main problems, we establish in Section 3.1 complexity aspects of deciding the existence of, and counting, sequences in GOWs associated with a window size $w = 2$. Then we consider in Section 3.2 the general case $w \geq 3$, and propose an integer program and a dynamic programming algorithm to respectively recognize a sequence graph and count admissible sequences. Finally, we assess the prevalence of ambiguity within a synthetic dataset, and observe that sequences invariant with respect to the GOW representation do not lead to invariance with respect to recurrent neural networks such as Long Short-Term Memory networks (LSTMs).

Related work

Sequence graphs encode the information of several co-occurrence-based models [2, 15, 18]. To the best of our knowledge, the ambiguity and realizability questions addressed in this work have never been systematically addressed by prior work in computational linguistics. Furthermore, we believe the problems studied in this paper are interesting from an algorithmic point of view, and they do not appear to reduce to other well-known problems.

However, some similarities exist between our problem and others studied in the Distance Geometry (DG) literature. In distance geometry, the input consists of a set of pairwise distances between points having unknown positions in a $d$-dimensional space. The problem then consists in determining (the existence of) a set of positions for the points satisfying the distance constraints. Since a position is fully characterized from $d + 1$ constraining neighbors, the problem can be solved by finding a sequential order for processing points, such that the assignment of a point is always constrained by at least $d + 1$ of its neighbors [13]. This statement shares some level of similarity with our problem, since an admissible sequence for a window $w = d + 2$ also represents a linear ordering of its nodes, in which $w - 1 = d + 1$ of the neighbors have lower value with respect to the order.

The reasons for the insufficiency of linear ordering in DG to solve our realizability problem are threefold. First, in DG each element of the sequence $x$ associated to the protein backbone is associated with a unique vertex. This is not the case we investigate here, since a symbol can be repeated several times, but only one vertex is created in the graph. This implies that the vertex associated to the $i$-th element ($i \geq w$) of $x$ can have strictly less than $w - 1$ distinct neighbors among its predecessors in $x$. Second, there are no loops in distance geometry, because an element is at distance 0 from itself. Finally, the graphs are always undirected in distance geometry.

2 Definitions and problem statement

Let $x = x_1, x_2, \ldots, x_p$ be a finite sequence of discrete elements among a finite vocabulary $X$. Without loss of generality, we can suppose that $X = \{1, \ldots, n\}$. In the following, let $I_p = \{1, \ldots, p\}$. This motivates the following definition:

▶ Definition 1. $G = (V, E)$ is the graph of the sequence $x$ with window size $w \in \mathbb{N}^*$ if and only if $V = \{x_i \mid i \in I_p\}$, and

$$(i, j) \in E \iff \exists (k, k') \in I_p^2,\ |k - k'| \leq w - 1,\ x_k = i \text{ and } x_{k'} = j \qquad (1)$$

For digraphs, Eq. (1) is replaced with

$$(i, j) \in E \iff \exists (k, k') \in I_p^2,\ k \leq k' \leq k + w - 1,\ x_k = i \text{ and } x_{k'} = j. \qquad (2)$$

Finally, a weighted sequence graph $G$ is endowed with a matrix $\Pi(G) = (\pi_{ij})$ such that

$$\pi_{ij} = \mathrm{Card}\{(k, k') \in I_p^2 \mid k \leq k' \leq k + w - 1,\ x_k = i \text{ and } x_{k'} = j\} \qquad (3)$$

We say that $x$ is a $w$-admissible sequence for $G$ (or a realization of $G$), if $G$ is the graph of sequence $x$ with window size $w$.
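As an illustration of Definition 1, the following Python sketch builds the weighted sequence digraph of a sequence and its unweighted counterparts. The function names `weighted_sequence_digraph` and `sequence_graph` are ours and do not come from the paper. The sketch assumes that co-occurrences are counted over pairs of distinct positions $k < k'$, which is the reading consistent with the loop-free examples of Fig. 3; it is not a transcription of the appendix algorithm.

```python
from collections import Counter


def weighted_sequence_digraph(x, w):
    """Return pi as a Counter: pi[(i, j)] is the number of position pairs
    k < k' <= k + w - 1 with x[k] == i and x[k'] == j.
    Assumption: pairs with k == k' are not counted."""
    pi = Counter()
    for k in range(len(x)):
        for k2 in range(k + 1, min(k + w, len(x))):
            pi[(x[k], x[k2])] += 1
    return pi


def sequence_graph(x, w, directed=False):
    """Unweighted graph of x: a set of arcs, or of frozenset edges if undirected."""
    arcs = set(weighted_sequence_digraph(x, w))
    return arcs if directed else {frozenset(arc) for arc in arcs}


# Example of Fig. 1: a circular permutation of the sentence.
sentence = "Linux is not UNIX but Linux".split()
cycle = sentence[:-1]
rotated = cycle[1:] + cycle[:1] + cycle[1:2]  # is not UNIX but Linux is

same_for_w2 = sequence_graph(sentence, 2) == sequence_graph(rotated, 2)
same_for_w3 = sequence_graph(sentence, 3) == sequence_graph(rotated, 3)
```

On this example the two sentences share the same graph for $w = 2$ but not for $w = 3$, in line with the caption of Fig. 1.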

The natural integers $\pi_{ij}$ represent the number of co-occurrences of $i$ and $j$ in a window of size $w$. Hence, the graph of a sequence is unique. A linear-time algorithm to construct a weighted sequence digraph is presented in Sec. A of the appendix. Other cases are obtained similarly. The procedure in Algorithm 1 defines a correspondence from the sequence set $X^*$ into the graph set $\mathcal{G}$: $\varphi_w : X^* \to \mathcal{G},\ x \mapsto G_w(x)$. Based on these definitions, we consider the following problems:

▶ Problem 1 (Weighted-Realizable (W-Realizable)).
Input: Possibly directed graph $G$, matrix weights $\Pi$, window size $w$
Output: True if $(G, \Pi)$ is the $w$-sequence graph of some sequence $x$, False otherwise.

▶ Problem 2 (Unweighted-Realizable (U-Realizable)).
Input: Possibly directed graph $G$, window size $w$
Output: True if $G$ is the $w$-sequence graph of some sequence $x$, False otherwise.

We denote D-Realizable (resp. G-) the restricted version of Realizable where the input graph $G$ is directed (resp. undirected), and W-Realizable (resp. U-) the restricted version of Realizable where the input graph $G$ is weighted (resp. unweighted), possibly in combination with the D- or G- variants. We write Realizable$_w$ for the case where $w$ is a fixed (given) constant. We also consider the variants of W-Realizable, denoted WG-Realizable and WD-Realizable, where the input graph is restricted to be respectively undirected and directed. We define UG-Realizable and UD-Realizable similarly. Finally, we write (WG-, WD-, ...)Realizable$_w$ for the case where $w$ is a fixed strictly positive integer.

▶ Problem 3 (Unweighted-NumRealizations (U-NumRealizations)).
Input: Possibly directed graph $G$, window size $w$
Output: The number of realizations of $G$, i.e. of preimages of $G$ through $\varphi_w$: $|\{x \in X^* \mid \varphi_w(x) = G\}|$ if finite, or $+\infty$ otherwise.

▶ Problem 4 (Weighted-NumRealizations (W-NumRealizations)).
Input: Possibly directed graph $G$, matrix weights $\Pi$, window size $w$
Output: The number of realizations of $G$ in the weighted sense.

Similarly, we use the same prefixes for the directed or undirected versions (D-, G-; e.g. DU- for directed and unweighted). We also denote NumRealizations$_w$ the case where $w$ is a fixed strictly positive integer. Note that NumRealizations strictly generalizes Realizable, as Realizable can be solved by testing whether the number of suitable realizations computed by NumRealizations is zero.

DW: Directed weighted; DU: Directed unweighted; GW: Undirected weighted; GU: Undirected unweighted
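To make the counting problem concrete, here is a brute-force Python sketch for the weighted directed case (DW-NumRealizations with $w \geq 2$). It is exponential in general and is not one of the algorithms of this paper; the names `weighted_sequence_digraph` and `count_weighted_realizations` are ours, and co-occurrences are assumed to be counted over pairs of distinct positions $k < k'$. The search extends a candidate sequence one symbol at a time and consumes the weights of the arcs created with the previous $w - 1$ symbols.

```python
from collections import Counter


def weighted_sequence_digraph(x, w):
    """pi[(i, j)]: number of position pairs k < k' <= k + w - 1 (assumption)."""
    pi = Counter()
    for k in range(len(x)):
        for k2 in range(k + 1, min(k + w, len(x))):
            pi[(x[k], x[k2])] += 1
    return pi


def count_weighted_realizations(pi, w):
    """Count the sequences whose weighted w-sequence digraph is exactly pi (w >= 2)."""
    remaining = Counter(pi)
    vertices = sorted({v for arc in pi for v in arc})

    def extend(window, left):
        # window: last (at most w - 1) symbols; left: total weight still to create
        if left == 0:
            return 1
        total = 0
        for v in vertices:
            needed = Counter((u, v) for u in window)
            if all(remaining[arc] >= c for arc, c in needed.items()):
                remaining.subtract(needed)
                total += extend((window + (v,))[-(w - 1):], left - len(window))
                remaining.update(needed)
        return total

    return sum(extend((v,), sum(pi.values())) for v in vertices)


pi = weighted_sequence_digraph("abracadabra", 2)
print(count_weighted_realizations(pi, 2))  # 30
```

The value 30 agrees with the count reported in Fig. 2(a), under the assumption that the figure counts realizations of the weighted digraph.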

3 Theoretical results

In this section, we present our main theoretical results. Due to length limitations, some of the proofs are left in the appendix.

3.1 A complete characterization of 2-sequence graphs

A graph has a sequential realization with $w = 2$ when there exists a path visiting every vertex and covering all of its edges (at least once for the unweighted case and exactly $\pi_e$ times for the edge $e$ in the weighted case). This characterization enables a relatively simple algorithmic treatment, leading to the results summarized in Table 1.
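For the unweighted undirected case, the characterization above can be sketched in Python as follows: a depth-first traversal that walks each edge forth and back yields a covering walk when the graph is connected, and that walk is a 2-admissible sequence. The function name `covering_walk` is ours, and the graph is assumed to be given as an edge list without isolated vertices (loops are allowed).

```python
from collections import defaultdict


def covering_walk(edges):
    """Return a walk covering every edge of a connected undirected graph,
    or None if the graph is disconnected."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    start = min(adj)
    walk, seen_vertices, seen_edges = [start], {start}, set()

    def dfs(u):
        # Invariant: walk[-1] == u when dfs(u) starts and when it returns.
        for v in sorted(adj[u]):
            edge = frozenset((u, v))
            if edge in seen_edges:
                continue
            seen_edges.add(edge)
            walk.append(v)
            if v not in seen_vertices:
                seen_vertices.add(v)
                dfs(v)
            if v != u:
                walk.append(u)  # come back to u (not needed for a loop)

    dfs(start)
    return walk if len(seen_vertices) == len(adj) else None


edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 4)]
walk = covering_walk(edges)
graph_of_walk = {frozenset(pair) for pair in zip(walk, walk[1:])}
```

Here `graph_of_walk` coincides with the input edge set, so `walk` is a realization for $w = 2$.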

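Table 1 Complexity for various instances of our problems ($w = 2$)

| Data instance | NumRealizations$_2$: complexity | NumRealizations$_2$: #sequences | Realizable$_2$: complexity | Realizable$_2$: characterization |
|---|---|---|---|---|
| Unweighted graph | P | $\{0, +\infty\}$ | P | $G$ connected |
| Weighted graph | #P-hard | $\{0, 1\} \cup 2\mathbb{N}^*$ | P | $\psi(G)$ (semi) Eulerian |
| Unweighted digraph | P | $\{0, 1, +\infty\}$ | P | Theorem 14 |
| Weighted digraph | P | $\mathbb{N}$ (BEST Theorem) | P | $\psi(G)$ (semi) Eulerian |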

3.2 General sequence graphs and Realizable$_{w \geq 3}$

The characterization of more general sequence graphs, such as 3-graphs, differs from that of 2-graphs, as shown by the counter-example in Fig. 3a: the depicted graph has no self-loop, so any 3-admissible sequence would have to create at least one clique of size 3. Similarly, Fig. 3b depicts a counter-example for directed graphs: $G$ does not have loops, so if it had a 3-admissible sequence, such a sequence would have to be of one of the forms $\{1\,2\,3\,1\ldots,\ 1\,3\,2\,1\ldots,\ 2\,3\,1\,2\ldots,\ 3\,2\,1\,3\ldots,\ 2\,1\,3\,2\ldots\}$, but then $(2, 1)$ would form an edge.

[Figure 3 panels: (a) $G$ is connected but not a 3-sequence graph; (b) $G$ is strongly connected but not a 3-sequence graph.]

Figure 3 Counter-examples for $w = 3$

3.3 A polynomial time algorithm for GU-Realizable$_w$

Similarly to the procedure in Sec. B, we will use an auxiliary graph built on $G$. Let $H(G) = (E, H_E)$ be the new graph obtained with the following procedure. Two edges $e = (v_1, v_2)$, $f = (v_3, v_4)$ of $E$ are connected in $H(G)$ if and only if:

$$v_2 = v_3 \text{ and } (v_1, v_4) \in E \qquad (4)$$

This defines an injective function $\tilde{h} : H_E \to V^3$: an edge of $H(G)$ can be seen as a unique triplet $v_1, v_2, v_3$ where $(v_1, v_2)$, $(v_1, v_3)$ and $(v_2, v_3) \in E$. Therefore, by definition, a walk $P$ in $H(G)$ is always of the form:

$$P = (t_1, t_2), \ldots, (t_{p-1}, t_p) \text{ s.t. } \forall i \in \{1, \ldots, p-1\},\ (t_i, t_{i+1}) \in E \qquad (5)$$

It is clear that if $H(G)$ is a 2-graph, then $G$ is a 3-graph, since there is a walk going through all edges of $H(G)$ (so visiting every non-isolated node and creating all edges of $G$). However, the converse is not true, as depicted in Fig. 4. In order to determine if $G = (V, E)$ has an admissible sequence in the general case, a procedure is to recursively merge pairs of vertices, maintaining constraints depending on $E$. These constraints are similar to Eq. 4. We adopt the following notations: $u_{i,j} = (u_i, u_j)$ and $u_{1:k} = (u_1, \ldots, u_k)$. The iterative procedure for $w \geq 3$ is summed up in the following equation. Namely, $\forall k \in \{2, \ldots, w-2\}$, one has

$$E^{(k)} = \{u_{1:k+1} \in V^{k+1} \mid u_{1:k} \in E^{(k-1)},\ u_{2:k+1} \in E^{(k-1)} \wedge (u_1, u_{k+1}) \in E\} \qquad (6)$$

Let $H^{(k)} = (E^{(k)}, E^{(k+1)})$; it can be defined recursively through:

$$H^{(0)} = G, \qquad \forall k \in \mathbb{N}^*,\ H^{(k)} = f(H^{(k-1)}) \qquad (7)$$

where $f$ transforms edges into vertices and creates edges between new vertices that verify Eq. 6.
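The recursion of Eq. 6 can be sketched in Python as below. This is only an illustration under our own conventions: `E` is a set of ordered pairs (both orientations are stored for an undirected graph), $E^{(1)} = E$ is taken as the starting level, and the names `symmetric_arcs`, `next_level` and `tuples_at_level` are hypothetical.

```python
from collections import defaultdict


def symmetric_arcs(edges):
    """Ordered pairs (both orientations) of an undirected edge list."""
    return {(u, v) for a, b in edges for u, v in ((a, b), (b, a))}


def next_level(level, E):
    """Compute E^(k) from level = E^(k-1), following Eq. (6)."""
    by_prefix = defaultdict(list)
    for t in level:
        by_prefix[t[:-1]].append(t)
    return {
        u + (t[-1],)                 # u_{1:k+1}
        for u in level               # u = u_{1:k}
        for t in by_prefix[u[1:]]    # t = u_{2:k+1}
        if (u[0], t[-1]) in E        # (u_1, u_{k+1}) in E
    }


def tuples_at_level(E, k):
    """Return E^(k), starting from E^(1) = E."""
    level = set(E)
    for _ in range(k - 1):
        level = next_level(level, E)
    return level


triangle = symmetric_arcs([(1, 2), (2, 3), (1, 3)])
path = symmetric_arcs([(1, 2), (2, 3)])
```

With these inputs, `tuples_at_level(triangle, 2)` contains the six orderings of the triangle, whereas `tuples_at_level(path, 2)` is empty: a loop-free path on three vertices offers no window of size 3, which illustrates why a connected graph need not be a 3-sequence graph.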

▶ Definition 2. Let $u$ be a vertex of $H^{(k)}$ for $k \in \mathbb{N}$, $u = (u_1, \ldots, u_k, u_{k+1})$. The sequence $u_1, \ldots, u_{k+1}$ is the authentic sequence of $u$. We also call the authentic sequence of a walk on $H^{(k)}$, $P = (x_1, \ldots, x_{k+1}), (x_2, \ldots, x_{k+2}), \ldots, (x_v, \ldots, x_{v+k})$, the sequence $x_1, x_2, \ldots, x_{v+k}$.
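Definition 2 translates directly into code. In the minimal Python sketch below, the helper name `authentic_sequence` is ours; the walk is given as a list of tuples, and the example reuses the walk mentioned in the caption of Fig. 4.

```python
def authentic_sequence(walk):
    """Authentic sequence of a walk on H^(k): the first tuple, followed by the
    last element of every subsequent tuple. Consecutive tuples must overlap."""
    for prev, nxt in zip(walk, walk[1:]):
        if prev[1:] != nxt[:-1]:
            raise ValueError("consecutive vertices of the walk do not overlap")
    first, *rest = walk
    return list(first) + [t[-1] for t in rest]


# Walk 34 -> 42 -> 23 -> 34 -> 41 on H = H^(1) (cf. Fig. 4)
sequence = authentic_sequence([(3, 4), (4, 2), (2, 3), (3, 4), (4, 1)])
```

The result is `[3, 4, 2, 3, 4, 1]`, the authentic sequence given in the caption of Fig. 4.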

In order to obtain admissible sequences of length $p$, the computation of $H^{(p)}$ requires $p$ iterations, and the number of vertices and edges of $H^{(k)}$ can increase during iterations (the complete graph is an example for which these numbers increase exponentially).

▶ Proposition 3. Let $x = x_1, \ldots, x_p$ be a $w$-admissible sequence of a graph (or digraph) $G = (V, E)$. If $w \leq p$, then $x$ is an authentic sequence of a walk of length $p - w + 1$ on $H^{(w-2)}$.

Proof. Due to length limitations, we provide a proof sketch; the full proof is left in the appendix. We prove the following property by induction on $k$:

$$\forall k \in \{w, \ldots, p\},\ \exists \text{ walk } P \text{ on } H^{(w-2)} \text{ such that: } x_{1:k} = P[1]_1, P[2]_1, \ldots, P[k-w]_1, P[k-(w-1)]_{1:(w-1)}$$

• Initialisation: $k = 1$. By construction of $H^{(w-2)}$, $x_1$ is the first element of the “static walk”: $x_{1:w-1} \in H^{(w-2)}$.
• Induction: Verification that if $x_{1:k}$ is a walk of length $k - w + 1$, one can find a walk of length $(k + 1) - w + 1$ to generate $x_{1:(k+1)}$. ∎

▶ Theorem 4. Let $w \in \mathbb{N}^*$. GU-Realizable$_w$ is in P.

Proof. The case $w = 1$ is trivial, and $w = 2$ has been treated. For $w \geq 3$, an algorithm is obtained by going through all the connected components of $H^{(w-2)}$. Let $C_1, \ldots, C_m$ be the connected components of $H^{(w-2)}$. On the one hand, it is possible to compute them in polynomial time. On the other hand, it is possible to construct walks covering all of their respective edges in polynomial time (for instance iteratively using shortest paths). Let $W_1, \ldots, W_m$ be such walks and $X_1, \ldots, X_m$ their respective admissible sequences.

Using Prop. 3, $G$ is a $w$-sequence graph if and only if there exists a walk $\tilde{W}_{i_0}$ on some $C_{i_0}$ creating exactly the edges of $G$. However, by construction, $W_{i_0}$ creates at least as many edges as any walk on $C_{i_0}$.

In conclusion, the assertion:

$$\exists i \in \{1, \ldots, m\},\ \varphi_w(X_i) = G$$

characterizes the fact that $G$ is a $w$-sequence graph. This assertion is decidable in polynomial time since for all $i$, $\varphi_w(X_i)$ is computable in polynomial time (cf. Algorithm 1). ∎

For digraphs, the analogue of the aforementioned procedure would consist in enumerating all paths in the DAG $R(H^{(w-2)})$. However, the number of paths can be exponential, even for a sequence graph. In the next subsection, we will prove that DU-Realizable$_w$ is actually NP-hard. Finally, if $x_1, \ldots, x_c$ are vertices of a strongly connected component of $H^{(w-2)}$, which order should be considered to form a new vertex attribute $x_C$? The following lemma shows that this order is not important, as long as it represents a walk in the component. Moreover, it is possible to reconstruct all admissible sequences from walks on $R(H^{(w-2)})$. With the same notations:

▶ Lemma 5. Let $x$ be a walk on $H^{(w-2)}$ whose authentic sequence is $w$-admissible for $G$. If $x$ goes through a strongly connected component $C$ of $H^{(w-2)}$, adding any supplementary path included in $C$ is stable for $w$-admissibility. Any graph generated by a walk on $H^{(w-2)}$ can be generated by a walk on $R(H^{(w-2)})$.

[Figure 4 panels: (a) $G$; (b) $H$; (c) $R(H)$.]

Figure 4 Procedure to find a 3-admissible sequence. The walk 34234, 41 is 3-admissible, with authentic sequence 3 4 2 3 4 1.

Proof. We present a proof sketch. The first statement concerning stability requires a straightforward verification using the definition of $H^{(w-2)}$. Second, a procedure to generate $G$ from a walk on $R(H^{(w-2)})$ using a walk $x_{1:p}$ on $H^{(w-2)}$ is to consider an iterative scheme, and discuss three cases:
(i) $x_i$ and $x_{i+1}$ are not in a strongly connected component (SCC)
(ii) $x_i$ is not in a SCC and $x_{i+1}$ is in a SCC
(iii) $x_i$ and $x_{i+1}$ are both in SCCs
For case (i), we just keep $x_i$ and $x_{i+1}$. For cases (ii) and (iii), we use the first part of the Lemma and add covering walks over the strongly connected components. ∎

3.4 Main complexity results

In this subsection we present the remaining complexity results, which are summarized in Table 2. In the previous subsection, we proved that GU-Realizable$_w \in$ P, $\forall w \geq 3$. Besides, for GU, the number of realizations of a graph $G$ is either 0 (not realizable), $+\infty$ (realizable and there exists a cycle in a component of $H$ generating $G$), or 1 (realizable but no cycle in any component of $H$ generating $G$). These three cases can be tested in polynomial time using our algorithm, showing that GU-NumRealizations$_w \in$ P, $\forall w \geq 3$. In the remainder of this section, we present the reductions we used for the other instances.

Table 2 Complexity for various instances of our problems ($w \geq 3$). We remind that a para-NP-hard problem does not admit any XP algorithm unless P=NP.

| Variation | NumRealizations$_w$ (constant $w \geq 3$) | Realizable$_w$ (constant $w \geq 3$) | NumRealizations (parameter $w$) | Realizable (parameter $w$) |
|---|---|---|---|---|
| GU | P | P | W[1]-hard; XP | W[1]-hard; XP |
| GW | NP-hard | NP-hard | para-NP-hard | para-NP-hard |
| DU | NP-hard | NP-hard | para-NP-hard | para-NP-hard |
| DW | NP-hard | NP-hard | para-NP-hard | para-NP-hard |

▶ Proposition 6. Clique admits a polynomial time parameterized reduction into GU-Realizable.

Proof. Let $G = (V, E)$ be a simple graph. Let $G'$ be a graph constructed from $G$ by adding two nodes $a$ and $b$ with loops, such that $a$ and $b$ are connected to each vertex of $G$. Let $k$ be a strictly positive integer and $w = k + 1$. We will show that $G$ has a $k$-clique if and only if $G'$ is $w$-realizable.

First, let us suppose that $G$ has a $k$-clique. Let $C$ be an arbitrary sequence of the vertices of one of its $k$-cliques. Let $v_1, \ldots, v_{|V|}$ be the vertices of $G$ and $(u_1, u'_1), \ldots, (u_{|E|}, u'_{|E|})$ be its edges. In the following, $x^w$ represents the $w$-repetition of $x$. Then, the following sequence is a $w$-realization of $G'$:

$$a^w u_1 u'_1 a^w u_2 u'_2 a^w \ldots a^w u_{|E|} u'_{|E|} a^w\ C\ b^w v_1 b^w v_2 b^w \ldots b^w v_{|V|}$$

Now let us suppose that $G'$ is $w$-realizable and let $x = x_1, \ldots, x_p$ be a $w$-realization of $G'$. Without loss of generality, let us suppose $a$ appears before $b$ in $x$. Let $i_b$ be the index of the first appearance of $b$ and let $i_a$ be the largest index of an appearance of $a$ before $i_b$. Then $i_b - i_a \geq w$, otherwise there would be an edge between $a$ and $b$. Furthermore, since $G$ is simple, there cannot be two repetitions of a vertex in the sequence $x_{i_a+1}, \ldots, x_{i_a+w-1}$. Due to the definition of a sequence graph, all vertices $\{x_{i_a+1}, \ldots, x_{i_a+w-1}\}$ are connected, forming a clique in $G$ of size $w - 1 = k$, which ends the proof. ∎

▶ Corollary 7. GU-Realizable is W[1]-hard for parameter $w$.
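The forward direction of the reduction in Proposition 6 can be checked mechanically on small instances. The Python sketch below builds $G'$ and the sequence from the proof, then compares the $w$-sequence graph of that sequence with $G'$. All function names are ours, vertices of $G$ are assumed to be distinct from the labels `"a"` and `"b"`, $G$ is assumed to have no isolated vertex, and edges are created only between distinct positions of the sequence.

```python
def undirected_sequence_graph(x, w):
    """Edges (as frozensets) between positions k < k' <= k + w - 1 of x."""
    return {
        frozenset((x[k], x[k2]))
        for k in range(len(x))
        for k2 in range(k + 1, min(k + w, len(x)))
    }


def reduction_graph(vertices, edges):
    """G': G plus two looped nodes 'a' and 'b' adjacent to every vertex of G."""
    g = {frozenset(e) for e in edges}
    g |= {frozenset((d, v)) for d in ("a", "b") for v in vertices}
    g |= {frozenset(("a",)), frozenset(("b",))}  # loops on a and b
    return g


def realization_from_clique(vertices, edges, clique):
    """Sequence a^w u1 u1' ... a^w C b^w v1 ... b^w v|V| with w = |C| + 1."""
    w = len(clique) + 1
    seq = []
    for u, v in edges:
        seq += ["a"] * w + [u, v]
    seq += ["a"] * w + list(clique)
    for v in vertices:
        seq += ["b"] * w + [v]
    return seq, w


vertices = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
seq, w = realization_from_clique(vertices, edges, clique=[1, 2, 3])
is_realization = undirected_sequence_graph(seq, w) == reduction_graph(vertices, edges)
```

On this instance ($k = 3$, $w = 4$) the check succeeds; the sketch does not cover the converse direction of the proof.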

DU-Realizable is NP-hard for $w \geq 3$

Consider the following intermediate problem:

OptionalRealizable$_w$: Given a directed unweighted graph $D = (V, A)$, a subset $A' \subseteq A$ of compulsory arcs, and two distinguished vertices $s, s' \in V$, is there a sequence $S$ such that the graph of $S$ contains only arcs in $A$ and (at least) all arcs in $A'$?

We first prove that this problem is NP-hard, then show how it reduces to DU-Realizable.

OptionalRealizable$_w$, $w \geq 3$, is NP-hard

Given $G = (V, A)$ and a start vertex $s$, build a directed graph $G' = (V', A')$ as follows:

Vertex set: $V' = \bigcup_{v \in V} \{v_0, v_1\} \cup \{x^i_p \mid 1 \leq p \leq 2n + 1,\ 1 \leq i \leq w - 2\}$

Arc set:
- optional arcs $(x^i_{2p-1}, v_0)$, $(v_0, x^i_{2p})$, $(x^i_{2p}, v_1)$, $(v_1, x^i_{2p+1})$ for each $v \in V$, $1 \leq p \leq n$, $1 \leq i \leq w - 2$
- optional arcs $(u_1, v_0)$ for each $(u, v)$ in $A$
- compulsory arcs $(v_0, v_1)$ for each $v \in V$
- optional arcs $(x^i_p, x^j_p)$ for $i < j$ and $(x^i_p, x^j_{p+1})$ for $j \leq i$

Start vertices are $(x^1_0, \ldots, x^{k-2}_0, s)$.

249

⇐ Let v^p be the pth vertex of V in the hamiltonian path. Let X^p be the sequence

250

x⁰_2p−1. . . x^w−2_2p−1v^p₀x⁰_2p. . . x^w−2_2p v^p₁. Let Xⁿ⁺¹ = x⁰_2n. . . x^w−2_2n , and S be the concatenation

251

X¹. . . Xⁿ⁺¹. It can be checked thatS contains only arcs ofAand all compulsory arcs.

252

⇒Consider a sequence S, an occurrence ofxⁱ_p inS for some 1≤i≤w−2, 1≤p≤n

253

(note thatp6=n+ 1), and letS⁰ be the subsequence ofS containing thew−1 characters

254

following xⁱ_2p+1. Let T = xⁱ⁺¹_p . . . x^w−2_p and T = x¹_p+1. . . xⁱ_p+1 (note that T is possibly

255

empty). T andU are seen both as strings and as sets of vertices. The out-neighborhood

256

ofxⁱ_p contains all vertices of T∪U, as well as all verticesv_q forv∈V, where q= 0 ifpis

257

odd andq = 1 if pis even. Since there are k−2 vertices inT ∪U, and no vertex has a

258

self-loop, then by the pigeon-hole principle stringS⁰ must contain at least one vertexv^q,

259

v∈V. Since there are no arc (v^q, v^0q) forv, v⁰ ∈V,S⁰ contains exactly one vertexv^q, thus

260

(10)

it also contains all vertices ofT ∪U. Based on the direction of the arcs inT ∪U∪ {v^q}, it

261

follows that S⁰=T·v^q·U.

262

LetXpbe the string x¹_p. . . x^w−2_p . From the arguments above, and the fact thatS starts withX₁, there exist indicesi₁, j₁, . . . , i_n, j_n such that

S=X₁v_i⁰

1X₂v¹_j

1X₃v_i⁰

2X₄v¹_j

3X₅. . . X_2n+1 From the window sizew, there must exist an arc (v_i⁰

p, v_j¹

p) for each p, so by construction

263

ip = jp. Furthermore, these arcs are compulsory for each vertex v⁰, so (i1, . . . , in) is a

264

permutation of{1, . . . , n}. Finally, there also exist an arc (v¹_j

p, v⁰_i

p+1) inG⁰, so there exists

265

an arc (vi_p, vi_p+1) inG. Thus, (vi₁, . . . , vi_n) is a hamiltonian path inG.

266

DU-Realizable$_w$ is NP-hard

By reduction from OptionalRealizable$_w$. We are given a directed unweighted graph $G = (V, A)$, a subset $A' \subseteq A$ of compulsory arcs (let $A'' = A \setminus A'$ be the set of optional arcs), an integer $w$, and $w - 1$ distinguished vertices $s_1 \ldots s_{w-1} \in V$.

Let $m = |A''|$, write $A'' = \{(u_1, v_1), \ldots, (u_m, v_m)\}$. Create $G'$ by adding $w(m + 1)$ separator vertices $y^i_p$, $1 \leq p \leq m + 1$ and $1 \leq i \leq w$, and $m$ vertices $z_p$. Build the strings

$$Z = \left( \prod_{p=1}^{m} \left( y_p^1 \ldots y_p^w\, u_p\, z_p\, v_p \right) \right) y_{m+1}^1 \ldots y_{m+1}^w$$

$$Z' = Z\, s_1 \ldots s_{w-1}.$$

Add all arcs realized by $Z'$ involving $y_p^i$ and/or $z_p$ to $G'$.

$G$ has a realization with optional arcs $\Leftrightarrow$ $G'$ has a realization

$\Rightarrow$ Build a realization for $G'$ by concatenating $Z$ with the realization for $G$ starting with $s_1 \ldots s_{w-1}$. All optional arcs of $G'$ are realized in $Z$, all compulsory arcs of $G'$ are realized in the suffix (the realization for $G$), and all arcs involving a separator are realized in $Z'$. No forbidden arc is realized.

$\Leftarrow$ Let $S$ be a realization of $G'$. The set of in-neighbors of any separator has size at most $w - 1$ and induces a tournament in $G'$ (this is clear for all arcs involving separators, and it is also true for a potential pair of vertices $(u_i, v_i)$ of $G$ since $G$ has no length-2 cycle). So the $w - 1$ characters before any separator are ordered as in $Z$. Furthermore each separator (except $y_1^1$) contains at least one other separator in its in-neighborhood, so any occurrence of a separator is actually the last character of a substring of $S$ equal to a prefix of $Z$. Since $y_1^1$ has in-degree 0, it may only appear as the first character of $S$, and any prefix of $Z$ in $S$ is also a prefix of $S$. Moreover, since $y_{m+1}^w$ must appear in $S$, we have $S = Z S'$ with no separator appearing in $S'$. Thus $S'$ realizes only arcs from $G$. From the out-neighborhood of $y_{m+1}^1, \ldots, y_{m+1}^w$, we have that $S'$ starts with $s_1, \ldots, s_{w-1}$. Moreover no compulsory arc of $G$ is realized in $Z$, nor with one vertex in $Z$ and one in $S'$ (since such arcs start with a separator), so all compulsory arcs are realized in $S'$. Overall, $G$ is a yes-instance of OptionalRealizable$_w$ with sequence $S'$.

GW-Realizable$_w$, DW-Realizable$_w$ are NP-hard for all $w \geq 3$

By reduction from a variant of hamiltonian path:
Input: Undirected graph $G$ with two degree-1 vertices.
Question: Does $G$ have a hamiltonian path?

[Figure 5 panels: Start Gadget; Queue Gadget; Vertex Gadget (for each vertex $u$, including $s$ and $t$); Edge Gadget (for each $\{u, v\}$).]

Figure 5 Subgraphs used in the reduction from Hamiltonian Path to DW-Realizable$_3$. Weights on double arcs apply to both directions. Note that some arcs appear in different gadgets, in which case the weights should be summed (in particular, loops on $s$ and $t$ have total weight $2 d_u \binom{k}{2} + \binom{k+1}{2} + \binom{k}{2}$).

Note that this variant of HP is easily shown to be NP-hard from Hamiltonian cycle via the following reduction: given a graph $G$ on which we need to find a hamiltonian cycle, pick any vertex $v$, duplicate it into $v_1, v_2$ (each edge $\{u, v\}$ becomes two edges $\{u, v_1\}$ and $\{u, v_2\}$), and add pendant vertices $s$ and $t$ connected to $v_1$ and $v_2$ respectively.
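The vertex-splitting step described above is simple enough to sketch in Python. The function names `split_vertex` and `has_hamiltonian_path` are ours; the input graph is assumed to be simple, and the labels `"s"`, `"t"`, `(v, 1)` and `(v, 2)` are assumed not to clash with existing vertices. The Hamiltonian path test is a brute force intended only for tiny examples.

```python
from itertools import permutations


def split_vertex(edges, v):
    """Duplicate v into (v, 1) and (v, 2), then attach pendant vertices s and t."""
    v1, v2 = (v, 1), (v, 2)
    new_edges = []
    for a, b in edges:
        if v in (a, b):
            other = b if a == v else a
            new_edges += [(other, v1), (other, v2)]
        else:
            new_edges.append((a, b))
    return new_edges + [("s", v1), ("t", v2)]


def has_hamiltonian_path(edges):
    """Brute-force test over all vertex orderings (exponential)."""
    adjacent = {frozenset(e) for e in edges}
    nodes = list({x for e in edges for x in e})
    return any(
        all(frozenset(pair) in adjacent for pair in zip(order, order[1:]))
        for order in permutations(nodes)
    )


cycle4 = [(1, 2), (2, 3), (3, 4), (4, 1)]   # has a hamiltonian cycle
star = [(0, 1), (0, 2), (0, 3)]             # has no hamiltonian cycle
```

Splitting vertex 1 of `cycle4` gives a graph with a hamiltonian path between the two degree-1 vertices, whereas splitting vertex 1 of `star` does not.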

297

Reduction for DW-Realizable

298

Given G = (V, E) with degree-1 vertices s and t, build a directed weighted graph

299

G⁰ = (V⁰, A) as follows:

300

Vertex set. For eachu∈V, create a vertex denotedu⁰. Create two additional dummy

301

verticesa andb. LetV⁰:={a, b, s0, s⁰₀} ∪S

u∈V{u, u⁰}. The arcs are given in Figure 5, as

302

the union of the start gadget, the queue gadget, and the vertex and edge gadgets respectively

303

for each vertex and edge ofG.

304

Reduction for GW-Realizable

Build the directed graph $G'$ as above, and let $G'_u$ be the undirected version of $G'$: remove arc orientations; for $u \neq v$ the weight of $\{u, v\}$ is the sum of the weights of $(u, v)$ and $(v, u)$ in $G'$ (the weight of loops is unchanged).

Main claims

We prove the following three claims:
(i) $G$ hamiltonian $\Rightarrow$ $G'$ has a realization
(ii) $G'$ has a realization $\Rightarrow$ $G'_u$ has a realization
(iii) $G'_u$ has a realization $\Rightarrow$ $G$ hamiltonian

All together, they show the correctness of the reductions for both GW-Realizable and DW-Realizable since they yield:
$G$ hamiltonian $\Leftrightarrow$ $G'$ has a realization
$G$ hamiltonian $\Leftrightarrow$ $G'_u$ has a realization

Proof of Claim (i). $G$ has a hamiltonian path; let $(u_1 = s, u_2, \ldots, u_n = t)$ be this path and $(v_1, w_1), \ldots, (v_{m'}, w_{m'})$ be the pairs of connected vertices except pairs $(u_i, u_{i+1})$ (i.e. the set $\bigcup_{\{u,v\} \in E} \{(u, v), (v, u)\} \setminus \{(u_i, u_{i+1}) \mid 1 \leq i < n\}$). Note that $m' = 2m - (n - 1)$. Define sequence $S$ as follows.

$$S := s'_0\, s_0^k\, a\, s^k s' s^k\, a\, u_2^k u'_2 u_2^k\, a \ldots a\, u_{n-1}^k u'_{n-1} u_{n-1}^k\, a\, t^k t' t^k\, a\, b^w v_1^k\, b\, w_1^k\, b^w v_2^k\, b\, w_2^k \ldots b^w v_{m-n}^k\, b\, w_{m-n}^k\, b^w$$

Note that a sequence of the form $x^k a\, y^k$ yields $\binom{k}{2}$ loops for $x$, $\binom{k}{2}$ loops for $y$, as well as $\binom{k+1}{2}$ arcs $(x, y)$ (indeed, there are $1 + 2 + \ldots + (w - 2) = \binom{k+1}{2}$ such arcs). A sequence of the form $b\, x^k b^w$ yields in particular an arc $(b, x)$ of weight $k$ and an arc $(x, b)$ of weight $\binom{k+1}{2} + k$. ∎
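The weight identities used at the end of this proof can be checked numerically. The Python sketch below assumes $k = w - 2$, which is the value for which the identity $1 + 2 + \ldots + (w - 2) = \binom{k+1}{2}$ quoted in the proof holds; this value of $k$, the function names, and the convention of counting co-occurrences over positions $k_1 < k_2$ are our assumptions.

```python
from collections import Counter
from math import comb


def weighted_sequence_digraph(x, w):
    """pi[(i, j)]: number of position pairs k1 < k2 <= k1 + w - 1 (assumption)."""
    pi = Counter()
    for k1 in range(len(x)):
        for k2 in range(k1 + 1, min(k1 + w, len(x))):
            pi[(x[k1], x[k2])] += 1
    return pi


def gadget_weights_hold(w):
    """Check the two weight claims of the proof for a window w, with k = w - 2."""
    k = w - 2
    first = weighted_sequence_digraph(["x"] * k + ["a"] + ["y"] * k, w)
    second = weighted_sequence_digraph(["b"] + ["x"] * k + ["b"] * w, w)
    return (
        first[("x", "x")] == comb(k, 2)
        and first[("y", "y")] == comb(k, 2)
        and first[("x", "y")] == comb(k + 1, 2)
        and second[("b", "x")] == k
        and second[("x", "b")] == comb(k + 1, 2) + k
    )


all_hold = all(gadget_weights_hold(w) for w in range(3, 10))
```

Under this assumption, the identities hold for every window size tested ($3 \leq w \leq 9$).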

Proof of Claim (ii). Clear: any realization for $G'$ is a realization for $G'_u$. ∎

Proof of Claim (iii). Pick a realization $S$ of $G'_u$. Define the weight of a vertex in $G'_u$ as the sum of the weights of its incident edges (counting loops twice). From the construction, we obtain the following weights for a selection of vertices:
- $s'_0$ has weight $w - 1$
- $u'$ has weight $2(w - 1)$ for $u \in V$
- $a$ has weight $2(n + 1)(w - 1)$

From the weight of $s'_0$, it follows that this vertex must be an endpoint of $S$ (wlog, $S$ starts with $s'_0$). It follows that any other vertex $v$ with weight $2i(w - 1)$ must have exactly $i$ occurrences in $S$ (in general it can be either $i$ or $i + 1$, but if $v$ has $i + 1$ occurrences it must be both the first and last character of $S$, i.e. $v = s'_0$: a contradiction). Thus each $u'$ occurs once and $a$ occurs $n + 1$ times in $S$.

Each $u'$ occurs once, so order the vertices of $V$ according to their occurrence in $S$ (i.e. $V = \{u_1, \ldots, u_n\}$ with $u'_1$ appearing before $u'_2$, etc.). For each $i$, the neighborhood of $u'_i$ in $S$ contains $a$ twice, one $a$ on each side (since there is no $(a, a)$ loop). Other neighbors of $u'_i$ may only be occurrences of $u_i$, so each $u'_i$ belongs to a factor, denoted $X_i$, of the form $a\, u_i^*\, u'_i\, u_i^*\, a$. Two consecutive factors $X_i$, $X_{i+1}$ may overlap by at most one character ($a$), and if they do, then there exists an edge $\{u_i, u_{i+1}\}$ (since $w \geq 3$) in $G$. There are $n$ such factors $X_i$, and only $n + 1$ occurrences of $a$, so all $a$'s except the extreme ones belong to the overlap of two consecutive $X_i$'s, and there exists an edge $\{u_i, u_{i+1}\}$ for each $i$. Thus $(u_1, \ldots, u_n)$ is a hamiltonian path of $G$. ∎

4 Effective general algorithms

4.1 Realizable$_w$: Linear integer programming formulation

Let $G = (V, E)$ be a graph with integer weights $(\pi_e)_{e \in E}$. In this model, we represent a sequence $x$ over the alphabet $\{1, \ldots, n\}$ as a $(0-1)$ matrix $X \in M_{n,p}(\{0, 1\})$ encoding the sequence $x$:

$$X_{i,j} = \begin{cases} 1 & \text{if } x_j = i \\ 0 & \text{otherwise} \end{cases}$$

It should be noted that the set of sequences over the alphabet $\{1, \ldots, n\}$ is exactly represented by the $(0-1)$ matrices such that

$$\forall j \in \{1, \ldots, p\}, \quad \sum_{i=1}^{n} X_{i,j} = 1$$

Given a window size $w$, a unit of $\pi_{e=(v_1, v_2)}$ corresponds to the appearance of two elements $v_1$, $v_2$ at a distance $i \in \{1, \ldots, w - 1\}$ in the sequence. Now, let us consider a fixed distance $i$ and a starting index $j \in \{1, \ldots, p - i\}$; we use an intermediary slack variable $y_j^e(i) \in \{0, 1\}$ to model the presence of such an appearance using the constraint:

$$X_{v_1, j}\, X_{v_2, j+i} = y_j^e(i) \qquad (8)$$

Then, the Boolean variable $y_j^e(i)$ is equal to 1 exactly when $v_1$ is located at position $j$ and $v_2$ at position $j + i$. We linearise Eq. 8 as:

$$\begin{aligned} -X_{v_1, j} + y_j^e(i) &\leq 0 \\ -X_{v_2, j+i} + y_j^e(i) &\leq 0 \\ X_{v_1, j} + X_{v_2, j+i} - y_j^e(i) &\leq 1 \end{aligned} \qquad (9)$$
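The equivalence between the product constraint of Eq. 8 and the three linear inequalities of Eq. 9 can be verified by enumerating all binary assignments. The short Python check below uses generic names `x1`, `x2`, `y` (ours) for $X_{v_1, j}$, $X_{v_2, j+i}$ and $y_j^e(i)$.

```python
from itertools import product


def satisfies_linearisation(x1, x2, y):
    """The three inequalities of Eq. (9) for binary values x1, x2, y."""
    return (-x1 + y <= 0) and (-x2 + y <= 0) and (x1 + x2 - y <= 1)


feasible = {
    (x1, x2, y)
    for x1, x2, y in product((0, 1), repeat=3)
    if satisfies_linearisation(x1, x2, y)
}
expected = {(x1, x2, x1 * x2) for x1, x2 in product((0, 1), repeat=2)}
```

The two sets coincide: over binary variables, the inequalities are satisfied exactly when `y == x1 * x2`.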

Each slack variable $y^e_k(i)$ is attributed to an edge $e$, a relative distance $i \in \{1, \ldots, w - 1\}$ and a starting position $k \in \{1, \ldots, p - i\}$. Given our constraint formulation, every slack variable is attributed 3 constraints. For a digraph, the number of possible pair positions for a unit of $\pi_{e=(v_1, v_2)}$ is given by:

$$C = \sum_{i=1}^{w-1} (p - i) = p(w - 1) - \frac{w(w - 1)}{2} = (w - 1)\left(p - \frac{w}{2}\right)$$

Therefore, in our model, $C$ corresponds to the number of slack variables attributed to constraints for an edge of the graph.
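As a quick sanity check of the closed form for $C$, the following Python sketch enumerates the pairs (distance $i$, starting position $j$) directly and compares the count with $(w - 1)(p - w/2)$, written with integers as $(w - 1)(2p - w)/2$. The function names are ours, and the check assumes $p \geq w - 1$ so that every term $p - i$ is non-negative.

```python
def count_pair_positions(p, w):
    """Number of (distance i, start j) pairs with 1 <= i <= w - 1, 1 <= j <= p - i."""
    return sum(1 for i in range(1, w) for j in range(1, p - i + 1))


def closed_form(p, w):
    """(w - 1)(p - w / 2), computed exactly with integers."""
    return ((w - 1) * (2 * p - w)) // 2


agree = all(
    count_pair_positions(p, w) == closed_form(p, w)
    for w in range(1, 8)
    for p in range(max(w - 1, 1), 20)
)
```

For instance, with $p = 11$ and $w = 3$ both expressions give $10 + 9 = 19$.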

Conversely, the absence of an edge $e = (v_1, v_2)$, corresponding to $\pi_e = 0$, can be modeled for a distance $i \in \{1, \ldots, w - 1\}$ and a starting position $j \in \{1, \ldots, p - i\}$ as:

$$X_{v_1, j} + X_{v_2, j+i} \leq 1$$

Then, Realizable$_w$ can be formulated as the following linear integer program:

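$$\min_{X \in \{0,1\}^{p \times n},\ y \in \{0,1\}^{|E| \times C}} \ \sum_{e \in E} \ \sum_{i \in \{1, \ldots, w-1\}} \left( y_1^e(i) + \ldots + y_{p-i}^e(i) \right)$$

under the constraints

$$\forall j \in \{1, \ldots, p\}, \quad \sum_{i=1}^{n} X_{i,j} = 1$$

and, $\forall e = (v_1, v_2) \in E$, $\forall e' = (v'_1, v'_2) \notin E$, $\forall i \in \{1, \ldots, w - 1\}$:

$$\begin{cases}
-X_{v_1, 1} + y^e_1(i) \leq 0 \\
-X_{v_2, 1+i} + y^e_1(i) \leq 0 \\
X_{v_1, 1} + X_{v_2, 1+i} - y^e_1(i) \leq 1 \\
\quad \vdots \\
-X_{v_1, p-i} + y^e_{p-i}(i) \leq 0 \\
-X_{v_2, p} + y^e_{p-i}(i) \leq 0 \\
X_{v_1, p-i} + X_{v_2, p} - y^e_{p-i}(i) \leq 1 \\
X_{v'_1, 1} + X_{v'_2, 1+i} \leq 1 \\
\quad \vdots \\
X_{v'_1, p-i} + X_{v'_2, p} \leq 1
\end{cases}$$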