HAL Id: hal-02495333
https://hal.archives-ouvertes.fr/hal-02495333v3
Submitted on 13 Jan 2021
Sequence graphs realizations and ambiguity in language models
Sammy Khalife, Yann Ponty, Laurent Bulteau
To cite this version:
Sammy Khalife, Yann Ponty, Laurent Bulteau. Sequence graphs realizations and ambiguity in language models. COCOON 2021 - 27th International Computing and Combinatorics Conference, Oct 2021, Tainan, Taiwan. hal-02495333v3
Sammy Khalife
LIX, CNRS, Ecole Polytechnique, Institut Polytechnique de Paris, 91128 Palaiseau, France
khalife@lix.polytechnique.fr

Yann Ponty
LIX, CNRS, Ecole Polytechnique, Institut Polytechnique de Paris, 91128 Palaiseau, France
yann.ponty@lix.polytechnique.fr

Laurent Bulteau
LIGM, CNRS, Université Gustave Eiffel, 77454 Marne-la-Vallée, France
laurent.bulteau@univ-eiffel.fr

Abstract
Several language models rely on an assumption modeling each local context as a (potentially oriented) bag of words, and have proven to be very efficient baselines. Sequence graphs are the natural structures encoding their information. However, a sequence graph may have several realizations as a sequence, leading to a degree of ambiguity. In this paper, we study this degree of ambiguity from a combinatorial and computational point of view. In particular, we present theoretical results concerning the family of sequence graphs. Several combinatorial problems are presented, depending on three levels of generalisation (window size, graph orientation, and weights), and whether some of these are NP-complete is left open. We establish different algorithms, including an integer program and a dynamic programming formulation, to respectively recognize a sequence graph and count the number of its distinct realizations. This allows us to show that this model assumption can cause a large number of sentences to share the same representation. We empirically compare the representations obtained with recurrent neural networks for different realizations of sequence graphs.

2012 ACM Subject Classification Mathematics of computing → Combinatoric problems; Mathematics of computing → Combinatorics on words; Mathematics of computing → Graph algorithms; Theory of computation → Complexity classes; Theory of computation → Problems, reductions and completeness

Keywords and phrases Graphs, Sequences, Combinatorics, Inverse problem, Complexity class

1 Introduction

The automated treatment of familiar objects, either natural or artifacts, always relies on a translation into entities manageable by computer programs. However, the correspondence between the object to be treated and "its" representation is not necessarily one-to-one. The representations used for learning algorithms are no exception to this rule. In particular, representations of natural language words and textual documents are essential for several tasks, including document classification [23], role labelling [19], and named entity recognition [16]. The traditional models based on pointwise mutual information, or graph-of-words (GOW) [9, 17, 20], supplement the content of bag-of-words (TF, TFIDF) with statistics of co-occurrences within a window of fixed size w, introduced to mitigate the degree of ambiguity. Several models [2, 14, 18, 21] also use the same type of information and constitute strong baselines for natural language processing.

While these representations are more precise than the traditional bag-of-words (e.g. Parikh vectors), they still induce some level of ambiguity, i.e. a given graph can represent several sequences. Our study is thus motivated by a quantification of the level of ambiguity, seen as an algorithmic problem, coupled with an empirical assessment of the consequences of ambiguity for the representations.

Figure 1 Sequence graphs (or graphs-of-words) built for the sentence “Linux is not UNIX but Linux” using window sizes 3 (a) and 2 (b). Panel (a): no ambiguity (w = 3). Panel (b): ambiguity (w = 2). In the second case, the sequence graph is ambiguous, since any circular permutation of the words admits the same representation.

Figure 2 Sequence graphs (or graphs-of-words) built for the sentence “a b r a c a d a b r a” using window sizes 2 (a), 3 (b), 4 (c) and 5 (d). Panel (a): w = 2, G has 30 realizations. Panel (b): w = 3, G has 6 realizations. Panel (c): w = 4, G has 3 realizations. Panel (d): w = 5, G has one realization.

After introducing in Section 2 the formal definition of a sequence graph and the descriptions of our main problems, we establish in Section 3.1 the complexity of deciding the existence of, and counting, sequences in GOWs associated with a window size w = 2. Then we consider in Section 3.2 the general case w ≥ 3, and propose an integer program and a dynamic programming algorithm to respectively recognize a sequence graph and count admissible sequences. Finally, we assess the prevalence of ambiguity within a synthetic dataset, and observe that sequences invariant with respect to the GOW representation do not lead to invariance with respect to recurrent neural networks such as Long Short Term Memory networks (LSTMs).

Related work

Sequence graphs encode the information of several co-occurrence-based models [2, 15, 18]. To the best of our knowledge, the ambiguity and realizability questions addressed in this work have never been systematically addressed by prior work in computational linguistics. Furthermore, we believe the problems studied in this paper are interesting from an algorithmic point of view, and appear to be devoid of reduction to other well-known problems.

However, some similarities exist between our problem and others studied in the Distance Geometry (DG) literature. In distance geometry, the input consists of a set of pairwise distances between points having unknown positions in a d-dimensional space. The problem then consists in determining (the existence of) a set of positions for the points, satisfying the distance constraints. Since a position is fully characterized from d + 1 constraining neighbors, the problem can be solved by finding a sequential order for processing points, such that the assignment of a point is always determined by at least d + 1 of its neighbors [13]. This statement shares some level of similarity with our problem, since an admissible sequence for a window w = d + 2 also represents a linear ordering of its nodes, in which w − 1 = d + 1 of the neighbors have lower value with respect to the order.

The reasons for the insufficiency of linear ordering in DG to solve our realizability problem are threefold. First, in DG each element of the sequence x associated with the protein backbone is associated with a unique vertex. This is not the case we investigate here, since a symbol can be repeated several times, but only one vertex is created in the graph. This implies that the vertex associated with the i-th element (i ≥ w) of x can have strictly fewer than w − 1 distinct neighbors among its predecessors in x. Second, there are no loops in distance geometry, because an element is at distance 0 from itself. Finally, the graphs are always undirected in distance geometry.

2 Definitions and problem statement

Let $x = x_1, x_2, \dots, x_p$ be a finite sequence of discrete elements from a finite vocabulary $X$. Without loss of generality, we can suppose that $X = \{1, \dots, n\}$. In the following, let $I_p = \{1, \dots, p\}$. This motivates the following definition:

Definition 1. $G = (V, E)$ is the graph of the sequence $x$ with window size $w \in \mathbb{N}^*$ if and only if $V = \{x_i \mid i \in I_p\}$, and

$$(i, j) \in E \iff \exists (k, k') \in I_p^2,\ |k - k'| \le w - 1,\ x_k = i \text{ and } x_{k'} = j \qquad (1)$$

For digraphs, Eq. (1) is replaced with

$$(i, j) \in E \iff \exists (k, k') \in I_p^2,\ k \le k' \le k + w - 1,\ x_k = i \text{ and } x_{k'} = j. \qquad (2)$$

Finally, a weighted sequence graph $G$ is endowed with a matrix $\Pi(G) = (\pi_{ij})$ such that

$$\pi_{ij} = \mathrm{Card}\,\{(k, k') \in I_p^2 \mid k \le k' \le k + w - 1,\ x_k = i \text{ and } x_{k'} = j\} \qquad (3)$$

We say that $x$ is a $w$-admissible sequence for $G$ (or a realization of $G$) if $G$ is the graph of the sequence $x$ with window size $w$.
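
To make Definition 1 concrete, the following minimal sketch builds the weighted sequence graph of a sequence by sliding a window over it. The function name `sequence_graph` is hypothetical, and the sketch assumes that only pairs of distinct positions $k < k'$ are counted, which matches the loop-free examples used later in the paper; for the undirected variant it stores each pair once as a sorted tuple.

```python
from collections import Counter

def sequence_graph(x, w, directed=True):
    """Weighted sequence graph of x for window size w (hypothetical helper).

    Returns a Counter mapping each arc (i, j) to its weight pi_ij.
    Assumption: only pairs of distinct positions k < k' <= k + w - 1 count.
    In the undirected case each pair is stored once, as a sorted tuple.
    """
    weights = Counter()
    for k in range(len(x)):
        for kp in range(k + 1, min(k + w, len(x))):
            arc = (x[k], x[kp])
            if not directed:
                arc = tuple(sorted(arc))
            weights[arc] += 1
    return weights

sentence = "Linux is not UNIX but Linux".split()
rotated = "is not UNIX but Linux is".split()

# With w = 2 a circular permutation yields the same weighted digraph,
# whereas with w = 3 the two graphs differ.
same_w2 = sequence_graph(sentence, 2) == sequence_graph(rotated, 2)
same_w3 = sequence_graph(sentence, 3) == sequence_graph(rotated, 3)
```

The unweighted graph is obtained by keeping only the keys of the returned counter.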

The natural integers $\pi_{ij}$ represent the number of co-occurrences of $i$ and $j$ in a window of size $w$. Hence, the graph of a sequence is unique. A linear-time algorithm to construct a weighted sequence digraph is presented in Sec. A of the appendix. Other cases are obtained similarly. The procedure in Algorithm 1 defines a map from the sequence set $X^\star$ to the graph set $\mathcal{G}$: $\phi_w : X^\star \to \mathcal{G},\ x \mapsto G_w(x)$. Based on these definitions, we consider the following problems:

Problem 1 (Weighted-Realizable (W-Realizable)).
Input: Possibly directed graph $G$, matrix weights $\Pi$, window size $w$
Output: True if $(G, \Pi)$ is the $w$-sequence graph of some sequence $x$, False otherwise.

Problem 2 (Unweighted-Realizable (U-Realizable)).
Input: Possibly directed graph $G$, window size $w$
Output: True if $G$ is the $w$-sequence graph of some sequence $x$, False otherwise.

We denote by D-Realizable (resp. G-Realizable) the restricted version of Realizable where the input graph $G$ is directed (resp. undirected), and by W-Realizable (resp. U-Realizable) the restricted version of Realizable where the input graph $G$ is weighted (resp. unweighted), possibly in combination with the D- or G- variants. We write Realizable_w for the case where $w$ is a fixed (given) constant. We also consider the variants of W-Realizable, denoted WG-Realizable and WD-Realizable, where the input graph is restricted to be respectively undirected and directed. We define UG-Realizable and UD-Realizable similarly. Finally, we write (WG-, WD-, ...)Realizable_w for the case where $w$ is a fixed strictly positive integer.

Problem 3 (Unweighted-NumRealizations (U-NumRealizations)).
Input: Possibly directed graph $G$, window size $w$
Output: The number of realizations of $G$, i.e. of preimages of $G$ through $\phi_w$: $|\{x \in X^\star \mid \phi_w(x) = G\}|$ if finite, or $+\infty$ otherwise.

Problem 4 (Weighted-NumRealizations (W-NumRealizations)).
Input: Possibly directed graph $G$, matrix weights $\Pi$, window size $w$
Output: The number of realizations of $G$ in the weighted sense.

Similarly, we use the same prefixes for the directed or undirected versions (D-, G-, e.g. DU- for directed and unweighted). We also denote by NumRealizations_w the case where $w$ is a fixed strictly positive integer. Note that NumRealizations strictly generalizes Realizable, as Realizable can be solved by testing whether the number of realizations computed by NumRealizations is nonzero.

| Prefix | Meaning |
|---|---|
| DW | Directed weighted |
| DU | Directed unweighted |
| GW | Undirected weighted |
| GU | Undirected unweighted |
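
As an illustration of W-NumRealizations on a tiny instance, the sketch below counts realizations by exhaustive enumeration. It is a brute-force baseline that is exponential in the sequence length, not the dynamic programming algorithm mentioned in the introduction; the helper names and the choice to pass the sequence length `p` explicitly are assumptions made for the example, as is the convention that only pairs of distinct positions are counted.

```python
from collections import Counter
from itertools import product

def weighted_digraph(x, w):
    """Weights pi_ij of the directed sequence graph (pairs k < k' <= k + w - 1)."""
    weights = Counter()
    for k in range(len(x)):
        for kp in range(k + 1, min(k + w, len(x))):
            weights[(x[k], x[kp])] += 1
    return weights

def count_realizations(target, alphabet, w, p):
    """Count the sequences of length p over alphabet whose weighted digraph is target."""
    return sum(1 for x in product(alphabet, repeat=p)
               if weighted_digraph(x, w) == target)

sentence = ("Linux", "is", "not", "UNIX", "but", "Linux")
alphabet = sorted(set(sentence))
p = len(sentence)

count_w2 = count_realizations(weighted_digraph(sentence, 2), alphabet, 2, p)
count_w3 = count_realizations(weighted_digraph(sentence, 3), alphabet, 3, p)
```

On the sentence of Figure 1, the weighted digraph for w = 2 is a directed cycle, so every rotation is a realization, while for w = 3 the original sentence is the only one.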

3 Theoretical results

In this section, we present our main theoretical results. Due to length limitations, some of the proofs are left in the appendix.

3.1 A complete characterization of 2-sequence graphs

A graph has a sequential realization with w = 2 when there exists a path visiting every vertex and covering all of its edges (at least once for the unweighted case and exactly $\pi_e$ times for the edge $e$ in the weighted case). This characterization enables a relatively simple algorithmic treatment, leading to the results summarized in Table 1.

Table 1 Complexity for various instances of our problems (w = 2)

| Data instance | NumRealizations_2: complexity | NumRealizations_2: #sequences | Realizable_2: complexity | Realizable_2: characterization |
|---|---|---|---|---|
| Unweighted graph | P | {0, +∞} | P | G connected |
| Weighted graph | #P-hard | {0, 1} ∪ 2ℕ* | P | ψ(G) (semi-)Eulerian |
| Unweighted digraph | P | {0, 1, +∞} | P | Theorem 14 |
| Weighted digraph | P | ℕ (BEST theorem) | P | ψ(G) (semi-)Eulerian |
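
For the first row of Table 1 (unweighted undirected graphs, w = 2), the characterization "G connected" can be turned into a constructive check. The sketch below is one possible way to do it, not the procedure of the paper: it performs a depth-first traversal that crosses every edge and backtracks, so that the resulting walk creates exactly the edges of G. The names `graph_w2` and `realize_w2` are hypothetical, edges are represented as frozensets (a loop is a frozenset of size 1), and a non-empty vertex set is assumed.

```python
def graph_w2(x):
    """Undirected 2-sequence graph of x: (vertex set, edge set)."""
    return set(x), {frozenset(pair) for pair in zip(x, x[1:])}

def realize_w2(vertices, edges):
    """Return a 2-admissible sequence for (vertices, edges), or None if disconnected."""
    adj = {v: set() for v in vertices}
    for e in edges:
        u, v = (tuple(e) * 2)[:2]          # a loop {u} unpacks as (u, u)
        adj[u].add(v)
        adj[v].add(u)
    start = next(iter(vertices))
    seen, done, walk = {start}, set(), [start]

    def visit(u):
        # Invariant: the walk currently ends at u.
        for v in adj[u]:
            e = frozenset((u, v))
            if e in done:
                continue
            done.add(e)
            if v == u:
                walk.append(u)             # loop: repeat u
            elif v in seen:
                walk.extend([v, u])        # cross the edge and come back
            else:
                seen.add(v)
                walk.append(v)
                visit(v)
                walk.append(u)             # backtrack along the tree edge

    visit(start)
    return walk if seen == set(vertices) else None

V, E = graph_w2(list("abracadabra"))
walk = realize_w2(V, E)
```

Since such a walk can be extended by going back and forth along any edge, a realizable unweighted graph has infinitely many realizations, in line with the {0, +∞} entry of the table.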
3.2 General sequence graphs and Realizable_w, w ≥ 3

The characterization of more general sequence graphs, such as 3-graphs, is not the same as for 2-graphs, as shown by the counter-example in Fig. 3a: the depicted graph has no self-edge, so there must be at least one clique of size 3. Similarly, Fig. 3b depicts a counter-example for directed graphs: $G$ does not have loops, so if it had a 3-admissible sequence, this sequence would have to be of the form {1 2 3 1..., 1 3 2 1..., 2 3 1 2..., 3 2 1 3..., 2 1 3 2...}, but then (2, 1) would form an edge.

Figure 3 Counter-examples for w = 3. (a) G is connected but not a 3-sequence graph. (b) G is strongly connected but not a 3-sequence graph.

3.3 A polynomial time algorithm for GU-Realizable_w

Similarly to the procedure in Sec. B, we will use an auxiliary graph built on $G$. Let $H(G) = (E, H_E)$ be the new graph obtained with the following procedure. Two edges $e = (v_1, v_2)$, $f = (v_3, v_4)$ of $E$ are connected in $H(G)$ if and only if:

$$v_2 = v_3 \text{ and } (v_1, v_4) \in E \qquad (4)$$

This defines an injective function $\tilde{h} : H_E \to V^3$: an edge of $H(G)$ can be seen as a unique triplet $v_1, v_2, v_3$ where $(v_1, v_2)$, $(v_1, v_3)$ and $(v_2, v_3) \in E$. Therefore, by definition, a walk $P$ in $H(G)$ is always of the form:

$$P = (t_1, t_2), \dots, (t_{p-1}, t_p) \quad \text{s.t.} \quad \forall i \in \{1, \dots, p-1\},\ (t_i, t_{i+1}) \in E \qquad (5)$$

It is clear that if $H(G)$ is a 2-graph, then $G$ is a 3-graph, since there is a walk going through all edges of $H(G)$ (so visiting every non-isolated node and creating all edges of $G$). However, the converse is not true, as depicted in Fig. 4. In order to determine whether $G = (V, E)$ has an admissible sequence in the general case, a procedure is to recursively merge pairs of vertices, maintaining constraints depending on $E$. These constraints are similar to Eq. 4. We adopt the following notations: $u_{i,j} = (u_i, u_j)$ and $u_{1:k} = (u_1, \dots, u_k)$. The iterative procedure for $w \ge 3$ is summed up in the following equation. Namely, $\forall k \in \{2, \dots, w-2\}$, one has

$$E^{(k)} = \{u_{1:k+1} \in V^{k+1} \mid u_{1:k} \in E^{(k-1)},\ u_{2:k+1} \in E^{(k-1)} \wedge (u_1, u_{k+1}) \in E\} \qquad (6)$$

Let $H^{(k)} = (E^{(k)}, E^{(k+1)})$; it can be defined recursively through:

$$H^{(0)} = G, \qquad \forall k \in \mathbb{N}^*,\ H^{(k)} = f(H^{(k-1)}) \qquad (7)$$

where $f$ transforms edges into vertices and creates edges between new vertices that verify Eq. 6.
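
The recursion of Eq. (6) can be sketched as follows. This is an illustrative implementation under stated assumptions: $E^{(0)}$ is taken to be the set of 1-tuples over $V$ and $E^{(1)} = E$ (which is what $H^{(0)} = G$ suggests), edges are given as a set of ordered pairs (both orientations for an undirected graph), and the names `extend`, `tuples_of_H` and `symmetric` are hypothetical.

```python
def extend(prev, edges):
    """One step of Eq. (6): from E^(k-1) (tuples of length k) to E^(k)."""
    by_prefix = {}
    for t in prev:
        by_prefix.setdefault(t[:-1], []).append(t)
    out = set()
    for u in prev:                              # u plays the role of u_{1:k}
        for t in by_prefix.get(u[1:], []):      # t plays the role of u_{2:k+1}
            if (u[0], t[-1]) in edges:          # closing constraint (u_1, u_{k+1}) in E
                out.add(u + (t[-1],))
    return out

def tuples_of_H(vertices, edges, k):
    """E^(k), starting from E^(0) = V seen as 1-tuples."""
    level = {(v,) for v in vertices}
    for _ in range(k):
        level = extend(level, edges)
    return level

def symmetric(pairs):
    """Ordered-pair encoding of an undirected edge list."""
    return {(u, v) for u, v in pairs} | {(v, u) for u, v in pairs}

path = symmetric([(1, 2), (2, 3)])
triangle = symmetric([(1, 2), (2, 3), (1, 3)])
path_triples = tuples_of_H({1, 2, 3}, path, 2)
triangle_triples = tuples_of_H({1, 2, 3}, triangle, 2)
```

On the loop-free path 1 - 2 - 3, no triple survives, which reflects the fact that this connected graph is not a 3-sequence graph; on the triangle, the six orderings of the three vertices are obtained.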

Definition 2. Let $u$ be a vertex of $H^{(k)}$ for $k \in \mathbb{N}$, $u = (u_1, \dots, u_k, u_{k+1})$. The sequence $u_1, \dots, u_{k+1}$ is the authentic sequence of $u$. Similarly, the authentic sequence of a walk on $H^{(k)}$, $P = (x_1, \dots, x_{k+1}), (x_2, \dots, x_{k+2}), \dots, (x_v, \dots, x_{v+k})$, is the sequence $x_1, x_2, \dots, x_{v+k}$.

In order to obtain admissible sequences of length $p$, the computation of $H^{(p)}$ requires $p$ iterations, and the number of vertices and edges of $H^{(k)}$ can increase during iterations (the complete graph is an example for which these numbers increase exponentially).

Proposition 3. Let $x = x_1, \dots, x_p$ be a $w$-admissible sequence of a graph (or digraph) $G = (V, E)$. If $w \le p$, then $x$ is an authentic sequence of a walk of length $p - w + 1$ on $H^{(w-2)}$.

Proof. Due to length limitations, we provide a proof sketch; the full proof is left in the appendix. The following property is shown by induction on $k$:

$$\forall k \in \{w, \dots, p\},\ \exists \text{ walk } P \text{ on } H^{(w-2)} \text{ such that } x_{1:k} = P[1]_1, P[2]_1, \dots, P[k-w]_1, P[k-(w-1)]_{1:(w-1)}$$

• Initialisation: $k = 1$. By construction of $H^{(w-2)}$, $x_1$ is the first element of the “static walk”: $x_{1:w-1} \in H^{(w-2)}$.
• Induction: Verification that if $x_{1:k}$ is a walk of length $k - w + 1$, one can find a walk of length $(k + 1) - w + 1$ to generate $x_{1:(k+1)}$.

Theorem 4. Let $w \in \mathbb{N}^*$. GU-Realizable_w is in P.

Proof. The case $w = 1$ is trivial, and $w = 2$ has been treated. For $w \ge 3$, an algorithm is obtained by going through all the connected components of $H^{(w-2)}$. Let $C_1, \dots, C_m$ be the connected components of $H^{(w-2)}$. On the one hand, it is possible to compute them in polynomial time. On the other hand, it is possible to construct walks covering all of their respective edges in polynomial time (for instance iteratively using shortest paths). Let $W_1, \dots, W_m$ be such walks and $X_1, \dots, X_m$ their respective admissible sequences.

Using Prop. 3, $G$ is a $w$-sequence graph if and only if there exists a walk $\tilde{W}_{i_0}$ on some $C_{i_0}$ creating exactly the edges of $G$. However, by construction, $W_{i_0}$ creates every edge created by any walk on $C_{i_0}$.

In conclusion, the assertion

$$\exists i \in \{1, \dots, m\},\ \phi_w(X_i) = G$$

characterizes the fact that $G$ is a $w$-sequence graph. This assertion is decidable in polynomial time since, for all $i$, $\phi_w(X_i)$ is computable in polynomial time (cf. Algorithm 1).

For digraphs, the analogue of the aforementioned procedure would consist in enumerating all paths in the DAG $R(H^{(w-2)})$. However, the number of paths can be exponential, even for a sequence graph. In the next subsection, we will prove that DU-Realizable_w is actually NP-hard. Finally, if $x_1, \dots, x_c$ are vertices of a strongly connected component of $H^{(w-2)}$, which order should be considered to form a new vertex attribute $x_C$? The following lemma shows that this order is not important, as long as it represents a walk in the component. Moreover, it is possible to reconstruct all admissible sequences from walks on $R(H^{(w-2)})$. With the same notations:

Lemma 5. Let $x$ be a walk on $H^{(w-2)}$ whose authentic sequence is $w$-admissible for $G$. If $x$ goes through a strongly connected component $C$ of $H^{(w-2)}$, adding any supplementary path included in $C$ preserves $w$-admissibility. Any graph generated by a walk on $H^{(w-2)}$ can be generated by a walk on $R(H^{(w-2)})$.

Figure 4 Procedure to find a 3-admissible sequence. Panels: (a) G, (b) H, (c) R(H). The walk 34234, 41 is 3-admissible, with authentic sequence 3 4 2 3 4 1.

Proof. We present a proof sketch. The first statement, concerning stability, requires a straightforward verification using the definition of $H^{(w-2)}$. Second, a procedure to generate $G$ from a walk on $R(H^{(w-2)})$ using a walk $x_{1:p}$ on $H^{(w-2)}$ is to consider an iterative scheme, and discuss three cases:
(i) $x_i$ and $x_{i+1}$ are not in a strongly connected component (SCC)
(ii) $x_i$ is not in a SCC and $x_{i+1}$ is in a SCC
(iii) $x_i$ and $x_{i+1}$ are both in SCCs
For case (i), we just keep $x_i$ and $x_{i+1}$. For cases (ii) and (iii), we use the first part of the Lemma and add covering walks over the strongly connected components.

3.4 Main complexity results

In this subsection we present the remaining complexity results, which are summarized in Table 2. In the previous subsection, we proved that GU-Realizable_w ∈ P for all w ≥ 3. Besides, for GU, the number of realizations of a graph $G$ is either 0 (not realizable), +∞ (realizable and there exists a cycle in a component of $H$ generating $G$), or 1 (realizable but no cycle in any component of $H$ generating $G$). These three cases can be tested in polynomial time using our algorithm, showing that GU-NumRealizations_w ∈ P for all w ≥ 3. In the remainder of this section, we present the reductions we used for the other instances.

Table 2 Complexity for various instances of our problems (w ≥ 3). Recall that a para-NP-hard problem does not admit any XP algorithm unless P = NP.

| Variation | NumRealizations_w (constant w ≥ 3) | Realizable_w (constant w ≥ 3) | NumRealizations (parameter w) | Realizable (parameter w) |
|---|---|---|---|---|
| GU | P | P | W[1]-hard; XP | W[1]-hard; XP |
| GW | NP-hard | NP-hard | para-NP-hard | para-NP-hard |
| DU | NP-hard | NP-hard | para-NP-hard | para-NP-hard |
| DW | NP-hard | NP-hard | para-NP-hard | para-NP-hard |

Proposition 6. Clique admits a polynomial time parameterized reduction into GU-Realizable.

Proof. Let $G = (V, E)$ be a simple graph. Let $G'$ be a graph constructed from $G$ by adding two nodes $a$ and $b$ with loops, such that $a$ and $b$ are connected to each vertex of $G$. Let $k$ be a strictly positive integer and $w = k + 1$. We will show that $G$ has a $k$-clique if and only if $G'$ is $w$-realizable.

First, let us suppose that $G$ has a $k$-clique. Let $C$ be an arbitrary sequence of the vertices of one of its $k$-cliques. Let $v_1, \dots, v_{|V|}$ be the vertices of $G$ and $(u_1, u'_1), \dots, (u_{|E|}, u'_{|E|})$ be its edges. In the following, $x^w$ represents the $w$-repetition of $x$. Then, the following sequence is a $w$-realization of $G'$:

$$a^w u_1 u'_1\, a^w u_2 u'_2\, a^w \dots a^w u_{|E|} u'_{|E|}\, a^w\, C\, b^w v_1\, b^w v_2\, b^w \dots b^w v_{|V|}$$

Now let us suppose that $G'$ is $w$-realizable and let $x = x_1, \dots, x_p$ be a $w$-realization of $G'$. Without loss of generality, let us suppose that $a$ appears before $b$ in $x$. Let $i_b$ be the index of the first appearance of $b$ and let $i_a$ be the largest index of an appearance of $a$ before $i_b$. Then $i_b - i_a \ge w$, otherwise there would be an edge between $a$ and $b$. Furthermore, since $G$ is simple, no vertex can be repeated in the sequence $x_{i_a+1}, \dots, x_{i_a+w-1}$. Due to the definition of a sequence graph, all vertices $\{x_{i_a+1}, \dots, x_{i_a+w-1}\}$ are connected, forming a clique in $G$ of size $w - 1 = k$, which ends the proof.
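
The forward direction of this reduction can be checked mechanically on a small instance. The sketch below builds $G'$ and the sequence given in the proof, then compares the $w$-sequence graph of that sequence with $G'$. The helper names are hypothetical, the labels "a" and "b" are assumed not to clash with vertex names of $G$, and the example assumes that $G$ has no isolated vertex (otherwise the edge between $a$ and an isolated vertex outside the clique would not be created by this particular sequence).

```python
def undirected_graph(x, w):
    """Unweighted undirected w-sequence graph; edges are frozensets (loops have size 1)."""
    return {frozenset((x[k], x[kp]))
            for k in range(len(x))
            for kp in range(k + 1, min(k + w, len(x)))}

def reduction_graph(vertices, edges):
    """G': G plus looped vertices 'a' and 'b', each adjacent to every vertex of G."""
    g = {frozenset(e) for e in edges}
    g |= {frozenset(("a",)), frozenset(("b",))}
    g |= {frozenset((c, v)) for c in "ab" for v in vertices}
    return g

def realization_from_clique(vertices, edges, clique):
    """Sequence a^w u1 u1' ... a^w C b^w v1 ... b^w v|V| with w = |clique| + 1."""
    w = len(clique) + 1
    seq = []
    for u, v in edges:
        seq += ["a"] * w + [u, v]
    seq += ["a"] * w + list(clique)
    for v in vertices:
        seq += ["b"] * w + [v]
    return seq, w

vertices = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]      # triangle 1-2-3 plus a pendant vertex 4
seq, w = realization_from_clique(vertices, edges, clique=[1, 2, 3])
matches = undirected_graph(seq, w) == reduction_graph(vertices, edges)
```

Blocks of $w$ copies of $a$ or $b$ act as separators: two symbols on either side of such a block are at distance at least $w + 1$, so no unwanted edge is created.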

Corollary 7. GU-Realizable is W[1]-hard for parameter $w$.

DU-Realizable is NP-hard for w ≥ 3

Consider the following intermediate problem:

OptionalRealizable_w: Given a directed unweighted graph $D = (V, A)$, a subset $A' \subseteq A$ of compulsory arcs, and two distinguished vertices $s, s' \in V$, is there a sequence $S$ such that the graph of $S$ contains only arcs in $A$ and (at least) all arcs in $A'$?

We first prove that this problem is NP-hard, then show how it reduces to DU-Realizable.

OptionalRealizable_w, w ≥ 3, is NP-hard

Given $G = (V, A)$ and a start vertex $s$, build a directed weighted graph $G' = (V', A')$ as follows:

Vertex set: $V' = \bigcup_{v \in V} \{v^0, v^1\} \cup \{x_p^i \mid 1 \le p \le 2n + 1,\ 1 \le i \le w - 2\}$

Arc set:
- optional arcs $(x_{2p-1}^i, v^0)$, $(v^0, x_{2p}^i)$, $(x_{2p}^i, v^1)$, $(v^1, x_{2p+1}^i)$ for each $v \in V$, $1 \le p \le n$, $1 \le i \le w - 2$
- optional arcs $(u^1, v^0)$ for each $(u, v)$ in $A$
- compulsory arcs $(v^0, v^1)$ for each $v \in V$
- optional arcs $(x_p^i, x_p^j)$ for $i < j$ and $(x_p^i, x_{p+1}^j)$ for $j \le i$

Start vertices are $(x_0^1, \dots, x_0^{k-2}, s)$.

249
⇐ Let vp be the pth vertex of V in the hamiltonian path. Let Xp be the sequence
250
x02p−1. . . xw−22p−1vp0x02p. . . xw−22p vp1. Let Xn+1 = x02n. . . xw−22n , and S be the concatenation
251
X1. . . Xn+1. It can be checked thatS contains only arcs ofAand all compulsory arcs.
252
⇒Consider a sequence S, an occurrence ofxip inS for some 1≤i≤w−2, 1≤p≤n
253
(note thatp6=n+ 1), and letS0 be the subsequence ofS containing thew−1 characters
254
following xi2p+1. Let T = xi+1p . . . xw−2p and T = x1p+1. . . xip+1 (note that T is possibly
255
empty). T andU are seen both as strings and as sets of vertices. The out-neighborhood
256
ofxip contains all vertices of T∪U, as well as all verticesvq forv∈V, where q= 0 ifpis
257
odd andq = 1 if pis even. Since there are k−2 vertices inT ∪U, and no vertex has a
258
self-loop, then by the pigeon-hole principle stringS0 must contain at least one vertexvq,
259
v∈V. Since there are no arc (vq, v0q) forv, v0 ∈V,S0 contains exactly one vertexvq, thus
260
it also contains all vertices ofT ∪U. Based on the direction of the arcs inT ∪U∪ {vq}, it
261
follows that S0=T·vq·U.
262
LetXpbe the string x1p. . . xw−2p . From the arguments above, and the fact thatS starts withX1, there exist indicesi1, j1, . . . , in, jn such that
S=X1vi0
1X2v1j
1X3vi0
2X4v1j
3X5. . . X2n+1 From the window sizew, there must exist an arc (vi0
p, vj1
p) for each p, so by construction
263
ip = jp. Furthermore, these arcs are compulsory for each vertex v0, so (i1, . . . , in) is a
264
permutation of{1, . . . , n}. Finally, there also exist an arc (v1j
p, v0i
p+1) inG0, so there exists
265
an arc (vip, vip+1) inG. Thus, (vi1, . . . , vin) is a hamiltonian path inG.
266
DU-Realizable_w is NP-hard

By reduction from OptionalRealizable_w. We are given a directed unweighted graph $G = (V, A)$, a subset $A' \subseteq A$ of compulsory arcs (let $A'' = A \setminus A'$ be the set of optional arcs), an integer $w$, and $w - 1$ distinguished vertices $s_1 \dots s_{w-1} \in V$.

Let $m = |A''|$, and write $A'' = \{(u_1, v_1), \dots, (u_m, v_m)\}$. Create $G'$ by adding $w(m + 1)$ separator vertices $y_p^i$, $1 \le p \le m + 1$ and $1 \le i \le w$, and $m$ vertices $z_p$. Build the strings

$$Z = \left( \prod_{p=1}^{m} \left( y_p^1 \dots y_p^w\, u_p\, z_p\, v_p \right) \right) y_{m+1}^1 \dots y_{m+1}^w, \qquad Z' = Z\, s_1 \dots s_{w-1}.$$

Add all arcs realized by $Z'$ involving $y_p^i$ and/or $z_p$ to $G'$.

$G$ has a realization with optional arcs ⇔ $G'$ has a realization

⇒ Build a realization for $G'$ by concatenating $Z$ with the realization for $G$ starting with $s_1 \dots s_{w-1}$. All optional arcs of $G'$ are realized in $Z$, all compulsory arcs of $G'$ are realized in the suffix (the realization of $G$), and all arcs involving a separator are realized in $Z'$. No forbidden arc is realized.

⇐ Let $S$ be a realization of $G'$. The set of in-neighbors of any separator has size at most $w - 1$ and induces a tournament in $G'$ (this is clear for all arcs involving separators, and it is also true for a potential pair of vertices $(u_i, v_i)$ of $G$, since $G$ has no length-2 cycle). So the $w - 1$ characters before any separator are ordered as in $Z$. Furthermore, each separator (except $y_1^1$) contains at least one other separator in its in-neighborhood, so any occurrence of a separator is actually the last character of a substring of $S$ equal to a prefix of $Z$. Since $y_1^1$ has in-degree 0, it may only appear as the first character of $S$, and any prefix of $Z$ in $S$ is also a prefix of $S$. Moreover, since $y_{m+1}^w$ must appear in $S$, we have $S = Z S'$ with no separator appearing in $S'$. Thus $S'$ realizes only arcs from $G$. From the out-neighborhood of $y_{m+1}^1, \dots, y_{m+1}^w$, we have that $S'$ starts with $s_1, \dots, s_{w-1}$. Moreover, no compulsory arc of $G$ is realized in $Z$, nor with one vertex in $Z$ and one in $S'$ (since such arcs start with a separator), so all compulsory arcs are realized in $S'$. Overall, $G$ is a yes-instance of OptionalRealizable_w with sequence $S'$.

GW-Realizable_w, DW-Realizable_w are NP-hard for all w ≥ 3

By reduction from a variant of Hamiltonian path:
Input: Undirected graph $G$ with two degree-1 vertices.
Question: Does $G$ have a Hamiltonian path?

Figure 5 Subgraphs used in the reduction from Hamiltonian Path to DW-Realizable_3. Panels: Start gadget; Queue gadget; Vertex gadget (for each vertex $u$, including $s$ and $t$); Edge gadget (for each $\{u, v\}$). Weights on double arcs apply to both directions. Note that some arcs appear in different gadgets, in which case the weights should be summed (in particular, loops on $s$ and $t$ have total weight $2 d_u \binom{k}{2} + \binom{k+1}{2} + \binom{k}{2}$).

Note that this variant of Hamiltonian path is easily shown to be NP-hard from Hamiltonian cycle via the following reduction: given a graph $G$ in which we need to find a Hamiltonian cycle, pick any vertex $v$, duplicate it into $v_1, v_2$ (each edge $\{u, v\}$ becomes two edges $\{u, v_1\}$ and $\{u, v_2\}$), and add pendant vertices $s$ and $t$ connected to $v_1$ and $v_2$ respectively.
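
The vertex-splitting step described above can be sketched and sanity-checked by brute force on very small graphs. The function names are hypothetical, the copies of $v$ are encoded as the tuples `(v, 1)` and `(v, 2)`, the labels "s" and "t" are assumed to be unused in the input graph, and the Hamiltonicity tests enumerate all permutations, so they are only meant for tiny examples.

```python
from itertools import permutations

def split_vertex(vertices, edges, v):
    """Duplicate v into (v, 1), (v, 2) and attach pendant vertices 's' and 't'."""
    v1, v2 = (v, 1), (v, 2)
    new_edges = set()
    for x, y in edges:
        if v in (x, y):
            other = y if x == v else x
            new_edges |= {frozenset((other, v1)), frozenset((other, v2))}
        else:
            new_edges.add(frozenset((x, y)))
    new_edges |= {frozenset(("s", v1)), frozenset(("t", v2))}
    new_vertices = [u for u in vertices if u != v] + [v1, v2, "s", "t"]
    return new_vertices, new_edges

def has_hamiltonian_path(vertices, edges):
    """Brute force: some ordering of the vertices uses only edges of the graph."""
    return any(all(frozenset(pair) in edges for pair in zip(order, order[1:]))
               for order in permutations(vertices))

def has_hamiltonian_cycle(vertices, edges):
    """Brute force: some cyclic ordering of the vertices uses only edges of the graph."""
    return any(all(frozenset(pair) in edges
                   for pair in zip(order, order[1:] + order[:1]))
               for order in permutations(vertices))

cycle4 = [(0, 1), (1, 2), (2, 3), (3, 0)]     # has a Hamiltonian cycle
star = [(0, 1), (0, 2), (0, 3)]               # has none
cycle_instance = split_vertex([0, 1, 2, 3], cycle4, 0)
star_instance = split_vertex([0, 1, 2, 3], star, 0)
```

Because $s$ and $t$ have degree 1, any Hamiltonian path of the transformed graph must start and end at them, which is what links it back to a Hamiltonian cycle through $v$.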

Reduction for DW-Realizable

Given $G = (V, E)$ with degree-1 vertices $s$ and $t$, build a directed weighted graph $G' = (V', A)$ as follows:

Vertex set. For each $u \in V$, create a vertex denoted $u'$. Create two additional dummy vertices $a$ and $b$. Let $V' := \{a, b, s_0, s'_0\} \cup \bigcup_{u \in V} \{u, u'\}$. The arcs are given in Figure 5, as the union of the start gadget, the queue gadget, and the vertex and edge gadgets respectively for each vertex and edge of $G$.

Reduction for GW-Realizable

Build the directed graph $G'$ as above, and let $G'_u$ be the undirected version of $G'$: remove arc orientations; for $u \ne v$, the weight of $\{u, v\}$ is the sum of the weights of $(u, v)$ and $(v, u)$ in $G'$ (the weight of loops is unchanged).

Main claims

We prove the following three claims:
(i) $G$ Hamiltonian ⇒ $G'$ has a realization
(ii) $G'$ has a realization ⇒ $G'_u$ has a realization
(iii) $G'_u$ has a realization ⇒ $G$ Hamiltonian

Altogether, they show the correctness of the reductions for both GW-Realizable and DW-Realizable, since they yield:
$G$ Hamiltonian ⇔ $G'$ has a realization
$G$ Hamiltonian ⇔ $G'_u$ has a realization

Proof of Claim (i). $G$ has a Hamiltonian path; let $(u_1 = s, u_2, \dots, u_n = t)$ be this path and let $(v_1, w_1), \dots, (v_{m'}, w_{m'})$ be the pairs of connected vertices except the pairs $(u_i, u_{i+1})$ (i.e. the set $\bigcup_{\{u,v\} \in E} \{(u, v), (v, u)\} \setminus \{(u_i, u_{i+1}) \mid 1 \le i < n\}$). Note that $m' = 2m - (n - 1)$. Define the sequence $S$ as follows.

$$S := s'_0\, s_0^k\, a\, s^k s' s^k\, a\, u_2^k u'_2 u_2^k\, a \dots a\, u_{n-1}^k u'_{n-1} u_{n-1}^k\, a\, t^k t' t^k\, a\, b^w v_1^k\, b\, w_1^k\, b^w v_2^k\, b\, w_2^k \dots b^w v_{m-n}^k\, b\, w_{m-n}^k\, b^w$$

Note that a sequence of the form $x^k a\, y^k$ yields $\binom{k}{2}$ loops for $x$, $\binom{k}{2}$ loops for $y$, as well as $\binom{k+1}{2}$ arcs $(x, y)$ (indeed, there are $1 + 2 + \dots + (w - 2) = \binom{k+1}{2}$ such arcs). A sequence of the form $b\, x^k b^w$ yields in particular an arc $(b, x)$ of weight $k$ and an arc $(x, b)$ of weight $\binom{k+1}{2} + k$.
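
The weight counts used in this proof can be verified numerically. The sketch below assumes $k = w - 2$, which is what the identity $1 + 2 + \dots + (w - 2) = \binom{k+1}{2}$ implies but is not stated explicitly in this part of the text, and it counts only pairs of distinct positions; `weighted_digraph` and `gadget_weights` are hypothetical helpers.

```python
from collections import Counter
from math import comb

def weighted_digraph(x, w):
    """Weights pi_ij of the directed sequence graph (pairs k < k' <= k + w - 1)."""
    weights = Counter()
    for k in range(len(x)):
        for kp in range(k + 1, min(k + w, len(x))):
            weights[(x[k], x[kp])] += 1
    return weights

def gadget_weights(w):
    """Weights created by the factors x^k a y^k and b x^k b^w, assuming k = w - 2."""
    k = w - 2
    g1 = weighted_digraph(["x"] * k + ["a"] + ["y"] * k, w)
    g2 = weighted_digraph(["b"] + ["x"] * k + ["b"] * w, w)
    return k, g1, g2

k, g1, g2 = gadget_weights(5)   # w = 5, hence k = 3
```

For w = 5 this gives 3 loops on x, 6 arcs (x, y), an arc (b, x) of weight 3 and an arc (x, b) of weight 9, in agreement with the binomial expressions above.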

Proof of Claim (ii). Clear: any realization for $G'$ is a realization for $G'_u$.

Proof of Claim (iii). Pick a realization $S$ of $G'_u$. Define the weight of a vertex in $G'_u$ as the sum of the weights of its incident edges (counting loops twice). From the construction, we obtain the following weights for a selection of vertices:
- $s'_0$ has weight $w - 1$
- $u'$ has weight $2(w - 1)$ for $u \in V$
- $a$ has weight $2(n + 1)(w - 1)$

From the weight of $s'_0$, it follows that this vertex must be an endpoint of $S$ (wlog, $S$ starts with $s'_0$). It follows that for any other vertex $v$ with weight $2i(w - 1)$, $v$ must have exactly $i$ occurrences in $S$ (in general it can be either $i$ or $i + 1$, but if $v$ has $i + 1$ occurrences it must be both the first and last character of $S$, i.e. $v = s'_0$: a contradiction). Thus each $u'$ occurs once and $a$ occurs $n + 1$ times in $S$.

Each $u'$ occurs once, so order the vertices of $V$ according to their occurrence in $S$ (i.e. $V = \{u_1, \dots, u_n\}$ with $u'_1$ appearing before $u'_2$, etc.). For each $i$, the neighborhood of $u'_i$ in $S$ contains $a$ twice, one $a$ on each side (since there is no $(a, a)$ loop). Other neighbors of $u'_i$ may only be occurrences of $u_i$, so each $u'_i$ belongs to a factor, denoted $X_i$, of the form $a\, u_i^*\, u'_i\, u_i^*\, a$. Two consecutive factors $X_i, X_{i+1}$ may overlap by at most one character ($a$), and if they do, then there exists an edge $\{u_i, u_{i+1}\}$ in $G$ (since $w \ge 3$). There are $n$ such factors $X_i$, and only $n + 1$ occurrences of $a$, so all occurrences of $a$ except the extreme ones belong to the overlap of two consecutive factors $X_i$, and there exists an edge $\{u_i, u_{i+1}\}$ for each $i$. Thus $(u_1, \dots, u_n)$ is a Hamiltonian path of $G$.

4 Effective general algorithms

4.1 Realizable_w: Linear integer programming formulation

Let $G = (V, E)$ be a graph with integer weights $\pi_e$, $e \in E$. In this model, we represent a sequence $x$ over the alphabet $\{1, \dots, n\}$ as a (0−1) matrix $X \in M_{n,p}(\{0, 1\})$ encoding the sequence $x$:

$$X_{i,j} = \begin{cases} 1 & \text{if } x_j = i \\ 0 & \text{otherwise} \end{cases}$$

It should be noted that the set of sequences over the alphabet $\{1, \dots, n\}$ is exactly represented by the (0−1) matrices such that

$$\forall j \in \{1, \dots, p\}, \quad \sum_{i=1}^{n} X_{i,j} = 1$$

Given a window size $w$, a unit of $\pi_{e=(v_1, v_2)}$ corresponds to the appearance of two elements $v_1$, $v_2$ at a distance $i \in \{1, \dots, w - 1\}$ in the sequence. Now, let us consider a fixed distance $i$ and a starting index $j \in \{1, \dots, p - i\}$. We use an intermediate slack variable $y_e^j(i) \in \{0, 1\}$ to model the presence of such an appearance, using the constraint:

$$X_{v_1, j}\, X_{v_2, j+i} = y_e^j(i) \qquad (8)$$

Then, the Boolean variable $y_e^j(i)$ is equal to 1 exactly when $v_1$ is located at position $j$ and $v_2$ at position $j + i$. We linearise Eq. 8 as:

$$\begin{aligned} -X_{v_1, j} + y_e^j(i) &\le 0 \\ -X_{v_2, j+i} + y_e^j(i) &\le 0 \\ X_{v_1, j} + X_{v_2, j+i} - y_e^j(i) &\le 1 \end{aligned} \qquad (9)$$

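
The linearisation (9) can be checked exhaustively on binary values, and the link between the products of Eq. (8) and the weights $\pi_e$ can be illustrated on a short sequence. The sketch below uses 0-based Python indices, and the names `one_hot` and `satisfies_eq9` are hypothetical.

```python
from itertools import product

def one_hot(x, n):
    """Matrix X with X[i][j] = 1 iff the (j+1)-th symbol of x equals i + 1."""
    return [[1 if xj == i + 1 else 0 for xj in x] for i in range(n)]

def satisfies_eq9(a, b, y):
    """Constraints (9) for binary a = X_{v1,j}, b = X_{v2,j+i} and y = y_e^j(i)."""
    return (-a + y <= 0) and (-b + y <= 0) and (a + b - y <= 1)

# For each binary pair (a, b), list the values of y allowed by (9).
feasible = {(a, b): [y for y in (0, 1) if satisfies_eq9(a, b, y)]
            for a, b in product((0, 1), repeat=2)}

# Summing the products of Eq. (8) over distances and positions gives pi_e.
x, w = [1, 2, 1, 2, 1], 3
X = one_hot(x, 2)
pi_12 = sum(X[0][j] * X[1][j + i]
            for i in range(1, w)
            for j in range(len(x) - i))
```

In every case the only feasible value of the slack variable is the product $a \cdot b$, so (9) is an exact linear encoding of Eq. (8) over binary variables.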
Each slack variable $y_e^k(i)$ is attributed to an edge $e$, a relative distance $i \in \{1, \dots, w - 1\}$ and a starting position $k \in \{1, \dots, p - i\}$. Given our constraint formulation, every slack variable is attributed 3 constraints. For a digraph, the number of possible pair positions for a unit of $\pi_{e=(v_1, v_2)}$ is given by:

$$C = \sum_{i=1}^{w-1} (p - i) = p(w - 1) - \frac{w(w - 1)}{2} = (w - 1)\left(p - \frac{w}{2}\right)$$

Therefore, in our model, $C$ corresponds to the number of slack variables attributed to constraints for an edge of the graph.
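
As a quick sanity check of the closed form for $C$, the sketch below compares the sum with the expression $(w - 1)(p - w/2)$, written with integers as $(w - 1)(2p - w)/2$ to avoid floating-point arithmetic. The function name is hypothetical.

```python
def num_pair_positions(p, w):
    """Number C of (distance, start position) pairs for one edge: sum of p - i for i < w."""
    return sum(p - i for i in range(1, w))

# Example: a sequence of length 11 (such as "a b r a c a d a b r a") with w = 3.
c_example = num_pair_positions(11, 3)
```

With 3 constraints per slack variable, an edge therefore contributes $3C$ linearisation constraints to the program.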

On the contrary, the absence of an edge $e = (v_1, v_2)$, corresponding to $\pi_e = 0$, can be modeled for a distance $i \in \{1, \dots, w - 1\}$ and a starting position $j \in \{1, \dots, p - i\}$ as:

$$X_{v_1, j} + X_{v_2, j+i} \le 1$$

Then, Realizable_w can be formulated as the following linear integer program:

$$\min_{X \in \{0,1\}^{p \times n},\ y \in \{0,1\}^{|E| \times C}} \ \sum_{e \in E} \ \sum_{i \in \{1, \dots, w-1\}} \left( y_e^1(i) + \dots + y_e^{p-i}(i) \right)$$

under the constraints

$$\forall j \in \{1, \dots, p\}, \quad \sum_{i=1}^{n} X_{i,j} = 1$$

and, for all $e = (v_1, v_2) \in E$, all $e' = (v'_1, v'_2) \notin E$, all $i \in \{1, \dots, w - 1\}$ and all $j \in \{1, \dots, p - i\}$:

$$\begin{aligned} -X_{v_1, j} + y_e^j(i) &\le 0 \\ -X_{v_2, j+i} + y_e^j(i) &\le 0 \\ X_{v_1, j} + X_{v_2, j+i} - y_e^j(i) &\le 1 \\ X_{v'_1, j} + X_{v'_2, j+i} &\le 1 \end{aligned}$$