
Sequence graphs realizations and ambiguity in language models


Academic year: 2021


HAL Id: hal-02495333

https://hal.archives-ouvertes.fr/hal-02495333v3

Submitted on 13 Jan 2021

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Sammy Khalife, Yann Ponty, Laurent Bulteau

To cite this version:

Sammy Khalife, Yann Ponty, Laurent Bulteau. Sequence graphs realizations and ambiguity in language models. COCOON 2021 - 27th International Computing and Combinatorics Conference, Oct 2021, Tainan, Taiwan. ⟨hal-02495333v3⟩

Sammy Khalife
LIX, CNRS, Ecole Polytechnique, Institut Polytechnique de Paris, 91128 Palaiseau, France
khalife@lix.polytechnique.fr

Yann Ponty
LIX, CNRS, Ecole Polytechnique, Institut Polytechnique de Paris, 91128 Palaiseau, France
yann.ponty@lix.polytechnique.fr

Laurent Bulteau
LIGM, CNRS, Université Gustave Eiffel, 77454 Marne-la-Vallée, France
laurent.bulteau@univ-eiffel.fr


Abstract

Several language models rely on an assumption modeling each local context as a (potentially oriented) bag of words, and have proven to be very efficient baselines. Sequence graphs are the natural structures encoding their information. However, a sequence graph may have several realizations as a sequence, leading to a degree of ambiguity. In this paper, we study this degree of ambiguity from a combinatorial and computational point of view. In particular, we present theoretical results concerning the family of sequence graphs. Several combinatorial problems are presented, depending on three levels of generalisation (window size, graph orientation, and weights), and whether some of these are NP-complete is left open. We establish different algorithms, including an integer program and a dynamic programming formulation, to respectively recognize a sequence graph and count the number of its distinct realizations. This allows us to show that this model assumption can cause a large number of sentences to share the same representation. We empirically compare the representations obtained with a recurrent neural network for different realizations of sequence graphs.


2012 ACM Subject Classification: Mathematics of computing → Combinatoric problems; Mathematics of computing → Combinatorics on words; Mathematics of computing → Graph algorithms; Theory of computation → Complexity classes; Theory of computation → Problems, reductions and completeness

Keywords and phrases: Graphs, Sequences, Combinatorics, Inverse problem, Complexity class


1 Introduction

The automated treatment of familiar objects, either natural or artifacts, always relies on a translation into entities manageable by computer programs. However, the correspondence between the object to be treated and "its" representation is not necessarily one-to-one. The representations used for learning algorithms are no exception to this rule. In particular, representations of natural language words and textual documents are essential for several tasks, including document classification [23], role labelling [19], and named entity recognition [16]. The traditional models based on pointwise mutual information, or graph-of-words (GOW) [9, 17, 20], supplement the content of bag-of-words (TF, TFIDF) with statistics of co-occurrences within a window of fixed size $w$, introduced to mitigate the degree of ambiguity. Several models [2, 14, 18, 21] also use the same type of information and constitute strong baselines for natural language processing.


While these representations are more precise than the traditional bag-of-words (e.g. Parikh vectors), they still induce some level of ambiguity, i.e. a given graph can represent several sequences. Our study is thus motivated by a quantification of the level of ambiguity, seen as an algorithmic problem, coupled with an empirical assessment of the consequences of ambiguity for the representations.

Figure 1: Sequence graphs (or graphs-of-words) built for the sentence “Linux is not UNIX but Linux” using window sizes 3 (a) and 2 (b). Panel (a): no ambiguity ($w = 3$). Panel (b): ambiguity ($w = 2$). In the second case, the sequence graph is ambiguous, since any circular permutation of the words admits the same representation.

Figure 2: Sequence graphs (or graphs-of-words) built for the sentence “a b r a c a d a b r a” using window sizes 2 (a), 3 (b), 4 (c) and 5 (d). Panel (a): $w = 2$, $G$ has 30 realizations. Panel (b): $w = 3$, $G$ has 6 realizations. Panel (c): $w = 4$, $G$ has 3 realizations. Panel (d): $w = 5$, $G$ has one realization.


After introducing in Section 2 the formal definition of a sequence graph and the descriptions of our main problems, we establish in Section 3.1 complexity aspects of deciding the existence and counting sequences in GOWs associated with a window size $w = 2$. Then we consider in Section 3.2 the general case $w \geq 3$, and propose an integer program and a dynamic programming algorithm to respectively recognize a sequence graph and count admissible sequences. Finally, we assess the prevalence of ambiguity within a synthetic dataset, and observe that sequences invariant with respect to the GOW representation do not lead to invariance with respect to recurrent neural networks such as Long Short Term Memory networks (LSTMs).


Related work

Sequence graphs encode the information of several co-occurrence-based models [2, 15, 18]. To the best of our knowledge, the ambiguity and realizability questions addressed in this work have never been systematically studied in computational linguistics. Furthermore, we believe the problems studied in this paper are interesting from an algorithmic point of view, and they do not appear to reduce directly to other well-known problems.


However, some similarities exist between our problem and others studied in the Distance Geometry (DG) literature. In distance geometry, the input consists of a set of pairwise distances between points, having unknown positions in a $d$-dimensional space. The problem then consists in determining (the existence of) a set of positions for the points, satisfying the distance constraints. Since a position is fully characterized from $d + 1$ constraining neighbors, the problem can be solved by finding a sequential order for processing points, such that the assignment of a point is always preceded by that of at least $d + 1$ of its neighbors [13]. This statement shares some level of similarity with our problem, since an admissible sequence for a window $w = d + 2$ also represents a linear ordering of its nodes, in which $w - 1 = d + 1$ of the neighbors have lower value with respect to the order.


There are three reasons why linear orderings in DG are insufficient to solve our realizability problem. First, in DG each element of the sequence $x$ associated with the protein backbone is associated with a unique vertex. This is not the case we investigate here, since a symbol can be repeated several times, but only one vertex is created in the graph. This implies that the vertex associated with the $i$th element ($i \geq w$) of $x$ can have strictly fewer than $w - 1$ distinct neighbors among its predecessors in $x$. Second, there are no loops in distance geometry, because an element is at distance 0 from itself. Finally, the graphs are always undirected in distance geometry.


2 Definitions and problem statement

Let $x = x_1, x_2, \ldots, x_p$ be a finite sequence of discrete elements among a finite vocabulary $X$. Without loss of generality, we can suppose that $X = \{1, \ldots, n\}$. In the following, let $I_p = \{1, \ldots, p\}$. This motivates the following definition:

▶ Definition 1. $G = (V, E)$ is the graph of the sequence $x$ with window size $w \in \mathbb{N}$ if and only if $V = \{x_i \mid i \in I_p\}$, and

$$(i, j) \in E \iff \exists (k, k') \in I_p^2,\ |k - k'| \leq w - 1,\ x_k = i \text{ and } x_{k'} = j \tag{1}$$

For digraphs, Eq. (1) is replaced with

$$(i, j) \in E \iff \exists (k, k') \in I_p^2,\ k < k' \leq k + w - 1,\ x_k = i \text{ and } x_{k'} = j. \tag{2}$$

Finally, a weighted sequence graph $G$ is endowed with a matrix $\Pi(G) = (\pi_{ij})$ such that

$$\pi_{ij} = \mathrm{Card}\,\{(k, k') \in I_p^2 \mid k < k' \leq k + w - 1,\ x_k = i \text{ and } x_{k'} = j\} \tag{3}$$

We say that $x$ is a $w$-admissible sequence for $G$ (or a realization of $G$) if $G$ is the graph of sequence $x$ with window size $w$.

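As an illustration of Definition 1, the following Python sketch builds the (weighted) sequence graph of a sequence. It is not the algorithm of the appendix; the function name `sequence_graph` and the dictionary representation of $\Pi(G)$ are our own choices for this example, and we assume that co-occurrences are counted over pairs of distinct positions $k < k' \leq k + w - 1$, so that a loop only appears when a symbol is repeated inside a window.

```python
from collections import Counter


def sequence_graph(x, w, directed=True):
    """Return the weights pi_ij of the w-sequence graph of x as a Counter.

    Assumption of this sketch: pairs of distinct positions
    k < k' <= k + w - 1 are counted, so loops need a repeated symbol.
    """
    weights = Counter()
    p = len(x)
    for k in range(p):
        # Positions k' with k < k' <= k + w - 1, clipped to the sequence.
        for k2 in range(k + 1, min(k + w, p)):
            i, j = x[k], x[k2]
            if not directed:
                # Undirected edge: store each pair under a canonical order.
                i, j = sorted((i, j))
            weights[(i, j)] += 1
    return weights


sentence = "Linux is not UNIX but Linux".split()
rotated = sentence[1:] + [sentence[1]]  # "is not UNIX but Linux is"

# Compare the unweighted digraphs (sets of arcs) for w = 2 and w = 3.
same_w2 = set(sequence_graph(sentence, 2)) == set(sequence_graph(rotated, 2))
same_w3 = set(sequence_graph(sentence, 3)) == set(sequence_graph(rotated, 3))
print(same_w2, same_w3)  # prints: True False
```

On the sentence of Figure 1, the rotated sentence yields the same arcs for $w = 2$ but not for $w = 3$, which matches the ambiguity described in the caption. The keys of the returned counter give the unweighted graph, and the values give the weights $\pi_{ij}$.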

The natural integers $\pi_{ij}$ represent the number of co-occurrences of $i$ and $j$ in a window of size $w$. Hence, the graph of a given sequence is unique. A linear-time algorithm to construct a weighted sequence digraph is presented in Sec. A of the appendix. Other cases are obtained similarly. The procedure in Algorithm 1 defines a correspondence from the sequence set $X^\star$ into the graph set $\mathcal{G}$: $\phi_w : X^\star \to \mathcal{G},\ x \mapsto G_w(x)$. Based on these definitions, we consider the following problems:


▶ Problem 1 (Weighted-Realizable (W-Realizable)).
Input: Possibly directed graph $G$, matrix weights $\Pi$, window size $w$
Output: True if $(G, \Pi)$ is the $w$-sequence graph of some sequence $x$, False otherwise.

▶ Problem 2 (Unweighted-Realizable (U-Realizable)).
Input: Possibly directed graph $G$, window size $w$
Output: True if $G$ is the $w$-sequence graph of some sequence $x$, False otherwise.


We denote by D-Realizable (resp. G-Realizable) the restricted version of Realizable where the input graph $G$ is directed (resp. undirected), and by W-Realizable (resp. U-Realizable) the restricted version where the input graph $G$ is weighted (resp. unweighted). Both kinds of prefix can be combined: WG-Realizable and WD-Realizable denote the variants of W-Realizable where the input graph is restricted to be undirected and directed respectively, and UG-Realizable and UD-Realizable are defined similarly. Finally, we write (WG-, WD-, ...)Realizable$_w$ for the case where $w$ is a fixed strictly positive integer.


▶ Problem 3 (Unweighted-NumRealizations (U-NumRealizations)).
Input: Possibly directed graph $G$, window size $w$
Output: The number of realizations of $G$, i.e. the number of preimages of $G$ through $\phi_w$, $|\{x \in X^\star \mid \phi_w(x) = G\}|$, if finite, or $+\infty$ otherwise.

▶ Problem 4 (Weighted-NumRealizations (W-NumRealizations)).
Input: Possibly directed graph $G$, matrix weights $\Pi$, window size $w$
Output: The number of realizations of $G$ in the weighted sense.

Similarly, we use the same prefixes (D-, G-) for the directed and undirected versions (e.g. DU- for directed and unweighted). We also denote by NumRealizations$_w$ the case where $w$ is a fixed strictly positive integer. Note that NumRealizations strictly generalizes Realizable, since Realizable can be solved by testing whether the number of realizations computed by NumRealizations is zero.

Prefix | Meaning
DW | Directed weighted
DU | Directed unweighted
GW | Undirected weighted
GU | Undirected unweighted


3 Theoretical results

In this section, we present our main theoretical results. Due to length limitations, some of the proofs are left to the appendix.


3.1 A complete characterization of 2-sequence graphs

A graph has a sequential realization with $w = 2$ when there exists a path visiting every vertex and covering all of its edges (at least once in the unweighted case, and exactly $\pi_e$ times for each edge $e$ in the weighted case). This enables a relatively simple characterization and algorithmic treatment, leading to the results summarized in Table 1.

Table 1: Complexity for various instances of our problems ($w = 2$)

Data instance | NumRealizations$_2$: Complexity | NumRealizations$_2$: #Sequences | Realizable$_2$: Complexity | Realizable$_2$: Characterization
Unweighted graph | P | $\{0, +\infty\}$ | P | $G$ connected
Weighted graph | #P-hard | $\{0, 1\} \cup 2\mathbb{N}$ | P | $\psi(G)$ (semi) Eulerian
Unweighted digraph | P | $\{0, 1, +\infty\}$ | P | Theorem 14
Weighted digraph | P | $\mathbb{N}$ (BEST Theorem) | P | $\psi(G)$ (semi) Eulerian

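To make the weighted digraph row of Table 1 concrete, here is a minimal Python sketch of the (semi) Eulerian test for $w = 2$. We assume, as an interpretation not spelled out in this section, that $\psi(G)$ denotes the multigraph in which each arc $e$ is repeated $\pi_e$ times; a realization is then an Eulerian trail of that multigraph visiting every vertex. The function name `is_realizable_w2` and the input format are hypothetical.

```python
def is_realizable_w2(vertices, weights):
    """Decide whether a weighted digraph is a 2-sequence graph.

    vertices: iterable of vertex labels.
    weights:  dict mapping arcs (u, v) to positive multiplicities pi_uv.
    Assumption: psi(G) is the multigraph with pi_uv copies of each arc.
    """
    vertices = set(vertices)
    if not weights:
        # Without arcs, only a one-symbol sequence is possible.
        return len(vertices) == 1
    balance = {v: 0 for v in vertices}        # out-degree minus in-degree
    neighbours = {v: set() for v in vertices}  # underlying undirected graph
    for (u, v), mult in weights.items():
        balance[u] += mult
        balance[v] -= mult
        neighbours[u].add(v)
        neighbours[v].add(u)
    # Every vertex must be reachable in the underlying undirected graph.
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        for nxt in neighbours[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    if seen != vertices:
        return False
    # Eulerian circuit (all balanced) or Eulerian trail (one source, one sink).
    unbalanced = sorted(b for b in balance.values() if b != 0)
    return unbalanced in ([], [-1, 1])


# Weighted 2-sequence digraph of "a b r a c a d a b r a".
abracadabra = {("a", "b"): 2, ("b", "r"): 2, ("r", "a"): 2,
               ("a", "c"): 1, ("c", "a"): 1, ("a", "d"): 1, ("d", "a"): 1}
print(is_realizable_w2("abcdr", abracadabra))
```

The test only decides existence; counting the realizations of a weighted digraph is where the BEST theorem mentioned in Table 1 comes in, and it is not covered by this sketch.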

3.2 General sequence graphs and Realizable$_w$, $w \geq 3$

The characterization of more general sequence graphs, such as 3-graphs, differs from that of 2-graphs, as shown by the counterexample in Fig. 3a: the depicted graph has no self-loop, so if it were a 3-sequence graph it would have to contain at least one clique of size 3. Similarly, Fig. 3b depicts a counterexample for directed graphs: $G$ does not have loops, so if it had a 3-admissible sequence, such a sequence would have to be of one of the forms $\{1\,2\,3\,1\ldots,\ 1\,3\,2\,1\ldots,\ 2\,3\,1\,2\ldots,\ 3\,2\,1\,3\ldots,\ 2\,1\,3\,2\ldots\}$, but then $(2, 1)$ would form an edge.

Figure 3: Counterexamples for $w = 3$, both on the vertices 1, 2, 3. Panel (a): $G$ is connected but not a 3-sequence graph. Panel (b): $G$ is strongly connected but not a 3-sequence graph.

3.3 A polynomial time algorithm for GU-Realizable$_w$

Similarly to the procedure in Sec. B, we will use an auxiliary graph built on $G$. Let $H(G) = (E, E_H)$ be the new graph obtained with the following procedure. Two edges $e = (v_1, v_2)$, $f = (v_3, v_4)$ of $E$ are connected in $H(G)$ if and only if:

$$v_2 = v_3 \text{ and } (v_1, v_4) \in E \tag{4}$$

This defines an injective function $\tilde{h} : E_H \to V^3$: an edge of $H(G)$ can be seen as a unique triplet $v_1, v_2, v_3$ where $(v_1, v_2)$, $(v_1, v_3)$ and $(v_2, v_3) \in E$. Therefore, by definition, a walk $P$ in $H(G)$ is always of the form:

$$P = (t_1, t_2), \ldots, (t_{p-1}, t_p) \quad \text{s.t. } \forall i \in \{1, \ldots, p-1\},\ (t_i, t_{i+1}) \in E \tag{5}$$

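The construction of $H(G)$ from Eq. 4 can be sketched in a few lines of Python. The function name `build_H` is hypothetical, and the arc set used in the example is an assumption: it is read off the vertex labels of $H$ shown in Figure 4, interpreting a label such as 31 as the arc $(3, 1)$.

```python
def build_H(arcs):
    """Return the edge set of H(G): pairs (e, f) of arcs of G satisfying Eq. 4."""
    arcs = set(arcs)
    h_edges = set()
    for (v1, v2) in arcs:
        for (v3, v4) in arcs:
            # e = (v1, v2) and f = (v3, v4) are linked when they chain
            # (v2 = v3) and the pair at distance two, (v1, v4), is in E.
            if v2 == v3 and (v1, v4) in arcs:
                h_edges.add(((v1, v2), (v3, v4)))
    return h_edges


# Arcs assumed from the labels of H in Figure 4.
arcs = {(3, 1), (2, 4), (2, 3), (4, 3), (4, 2), (4, 1), (3, 4), (3, 2)}
H = build_H(arcs)

# The sequence 3 4 2 3 4 1 corresponds to this walk on H(G).
walk = [(3, 4), (4, 2), (2, 3), (3, 4), (4, 1)]
is_walk = all((e, f) in H for e, f in zip(walk, walk[1:]))
print(is_walk)
```

Each step of the walk adds one symbol to the sequence, and the condition $(v_1, v_4) \in E$ is what guarantees that the pair of symbols at distance two is also an edge of $G$, as required for a window of size 3.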

It is clear that if $H(G)$ is a 2-graph, then $G$ is a 3-graph, since there is a walk going through all edges of $H(G)$ (so visiting every non-isolated node and creating all edges of $G$). However, the converse is not true, as depicted in Fig. 4. In order to determine whether $G = (V, E)$ has an admissible sequence in the general case, one procedure is to recursively merge pairs of vertices, maintaining constraints depending on $E$. These constraints are similar to Eq. 4. We adopt the following notations: $u_{i,j} = (u_i, u_j)$ and $u_{1:k} = (u_1, \ldots, u_k)$. The iterative procedure for $w \geq 3$ is summed up in the following equation. Namely, $\forall k \in \{2, \ldots, w-2\}$, one has

$$E^{(k)} = \{u_{1:k+1} \in V^{k+1} \mid u_{1:k} \in E^{(k-1)},\ u_{2:k+1} \in E^{(k-1)} \wedge (u_1, u_{k+1}) \in E\} \tag{6}$$

Let $H^{(k)} = (E^{(k)}, E^{(k+1)})$; it can be defined recursively through:

$$H^{(0)} = G, \qquad \forall k \in \mathbb{N},\ H^{(k)} = f(H^{(k-1)}) \tag{7}$$

where $f$ transforms edges into vertices and creates edges between new vertices that verify Eq. 6.


▶ Definition 2. Let $u$ be a vertex of $H^{(k)}$ for $k \in \mathbb{N}$, $u = (u_1, \ldots, u_k, u_{k+1})$. The sequence $u_1, \ldots, u_{k+1}$ is the authentic sequence of $u$. We also call an authentic sequence of a walk on $H^{(k)}$, $P = (x_1, \ldots, x_{k+1}), (x_2, \ldots, x_{k+2}), \ldots, (x_v, \ldots, x_{v+k})$, the sequence $x_1, x_2, \ldots, x_{v+k}$.


In order to obtain admissible sequences of length $p$, the computation of $H^{(p)}$ requires $p$ iterations, and the number of vertices and edges of $H^{(k)}$ can increase during iterations (the complete graph is an example for which these numbers increase exponentially).


▶ Proposition 3. Let $x = x_1, \ldots, x_p$ be a $w$-admissible sequence of a graph (or digraph) $G = (V, E)$. If $w \leq p$, then $x$ is an authentic sequence of a walk of length $p - w + 1$ on $H^{(w-2)}$.

Proof. Due to length limitations, we provide a proof sketch; the full proof is left to the appendix. We show the following property by induction on $k$:

$$\forall k \in \{w, \ldots, p\},\ \exists \text{ walk } P \text{ on } H^{(w-2)} \text{ such that: } x_{1:k} = P[1]_1, P[2]_1, \ldots, P[k-w]_1, P[k-(w-1)]_{1:(w-1)}$$

• Initialisation: $k = 1$. By construction of $H^{(w-2)}$, $x_1$ is the first element of the “static walk”: $x_{1:w-1} \in H^{(w-2)}$.
• Induction: Verification that if $x_{1:k}$ is a walk of length $k - w + 1$, one can find a walk of length $(k + 1) - w + 1$ to generate $x_{1:(k+1)}$. ◀


▶ Theorem 4. Let $w \in \mathbb{N}$. GU-Realizable$_w$ is in P.

Proof. The case $w = 1$ is trivial, and $w = 2$ has been treated. For $w \geq 3$, an algorithm is obtained by going through all the connected components of $H^{(w-2)}$. Let $C_1, \ldots, C_m$ be the connected components of $H^{(w-2)}$. On the one hand, it is possible to compute them in polynomial time. On the other hand, it is possible to construct walks covering all of their respective edges in polynomial time (for instance iteratively using shortest paths). Let $W_1, \ldots, W_m$ be such walks and $X_1, \ldots, X_m$ their respective authentic sequences.

Using Prop. 3, $G$ is a $w$-sequence graph if and only if there exists a walk $\tilde{W}_{i_0}$ on some $C_{i_0}$ creating exactly the edges of $G$. However, by construction, $W_{i_0}$ creates every edge created by any walk on $C_{i_0}$.

In conclusion, the assertion

$$\exists i \in \{1, \ldots, m\},\ \phi_w(X_i) = G$$

characterizes the fact that $G$ is a $w$-sequence graph. This assertion is decidable in polynomial time since, for all $i$, $\phi_w(X_i)$ is computable in polynomial time (cf. Algorithm 1). ◀


For digraphs, the analogue of the aforementioned procedure would consist in enumerating all paths in the DAG $R(H^{(w-2)})$. However, the number of paths can be exponential, even for a sequence graph. In the next subsection, we will prove that DU-Realizable$_w$ is actually NP-hard. Finally, if $x_1, \ldots, x_c$ are vertices of a strongly connected component of $H^{(w-2)}$, in which order should they be considered to form a new vertex attribute $x_C$? The following lemma shows that this order is not important, as long as it represents a walk in the component. Moreover, it is possible to reconstruct all admissible sequences from walks on $R(H^{(w-2)})$. With the same notations:


▶ Lemma 5. Let $x$ be a walk on $H^{(w-2)}$ whose authentic sequence is $w$-admissible for $G$. If $x$ goes through a strongly connected component $C$ of $H^{(w-2)}$, adding any supplementary path included in $C$ is stable for $w$-admissibility. Any graph generated by a walk on $H^{(w-2)}$ can be generated by a walk on $R(H^{(w-2)})$.


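Figure 4: Procedure to find a 3-admissible sequence. Panel (a): $G$, on the vertices 1, 2, 3, 4. Panel (b): $H$, with vertices 31, 24, 23, 43, 42, 41, 34, 32. Panel (c): $R(H)$, with vertices 31, 24, 43, 41, 32 and the merged vertex 34234. The walk 34234, 41 is 3-admissible, with authentic sequence 3 4 2 3 4 1.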

Proof. We present a proof sketch. The first statement, concerning stability, requires a straightforward verification using the definition of $H^{(w-2)}$. Second, a procedure to generate $G$ from a walk on $R(H^{(w-2)})$ using a walk $x_{1:p}$ on $H^{(w-2)}$ is to consider an iterative scheme, and discuss three cases:
(i) $x_i$ and $x_{i+1}$ are not in a strongly connected component (SCC)
(ii) $x_i$ is not in an SCC and $x_{i+1}$ is in an SCC
(iii) $x_i$ and $x_{i+1}$ are both in SCCs
For case (i), we just keep $x_i$ and $x_{i+1}$. For cases (ii) and (iii), we use the first part of the Lemma and add covering walks over the strongly connected components. ◀


3.4 Main complexity results

In this subsection we present the remaining complexity results, which are summarized in Table 2. In the previous subsection, we proved that GU-Realizable$_w \in$ P for all $w \geq 3$. Besides, for GU, the number of realizations of a graph $G$ is either 0 (not realizable), $+\infty$ (realizable and there exists a cycle in a component of $H$ generating $G$), or 1 (realizable but no cycle in any component of $H$ generating $G$). These three cases can be tested in polynomial time using our algorithm, showing that GU-NumRealizations$_w \in$ P for all $w \geq 3$. In the remainder of this section, we present the reductions we used for the other instances.

Table 2: Complexity for various instances of our problems ($w \geq 3$). Recall that a para-NP-hard problem does not admit any XP algorithm unless P = NP.

Variation | Constant $w$, $w \geq 3$: NumRealizations$_w$ | Constant $w$, $w \geq 3$: Realizable$_w$ | Parameter $w$: NumRealizations | Parameter $w$: Realizable
GU | P | P | W[1]-hard; XP | W[1]-hard; XP
GW | NP-hard | NP-hard | para-NP-hard | para-NP-hard
DU | NP-hard | NP-hard | para-NP-hard | para-NP-hard
DW | NP-hard | NP-hard | para-NP-hard | para-NP-hard


▶ Proposition 6. Clique admits a polynomial time parameterized reduction to GU-Realizable.


Proof. Let $G = (V, E)$ be a simple graph. Let $G'$ be the graph constructed from $G$ by adding two nodes $a$ and $b$ with loops, such that $a$ and $b$ are connected to each vertex of $G$. Let $k$ be a strictly positive integer and $w = k + 1$. We will show that $G$ has a $k$-clique if and only if $G'$ is $w$-realizable.

First, let us suppose that $G$ has a $k$-clique. Let $C$ be an arbitrary sequence of the vertices of one of its $k$-cliques. Let $v_1, \ldots, v_{|V|}$ be the vertices of $G$ and $(u_1, u'_1), \ldots, (u_{|E|}, u'_{|E|})$ be its edges. In the following, $x^w$ represents the $w$-fold repetition of $x$. Then, the following sequence is a $w$-realization of $G'$:

$$a^w u_1 u'_1\, a^w u_2 u'_2\, a^w \ldots a^w u_{|E|} u'_{|E|}\, a^w\ C\ b^w v_1\, b^w v_2\, b^w \ldots b^w v_{|V|}$$

Now let us suppose that $G'$ is $w$-realizable and let $x = x_1, \ldots, x_p$ be a $w$-realization of $G'$. Without loss of generality, let us suppose that $a$ appears before $b$ in $x$. Let $i_b$ be the index of the first appearance of $b$ and let $i_a$ be the largest index of an appearance of $a$ before $i_b$. Then $i_b - i_a \geq w$, otherwise there would be an edge between $a$ and $b$. Furthermore, since $G$ is simple, there cannot be two repetitions of a vertex in the sequence $x_{i_a+1}, \ldots, x_{i_a+w-1}$. Due to the definition of a sequence graph, all vertices $\{x_{i_a+1}, \ldots, x_{i_a+w-1}\}$ are connected, forming a clique in $G$ of size $w - 1 = k$, which ends the proof. ◀
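The forward direction of this proof can be checked on a small instance. The Python sketch below builds the sequence displayed in the proof and compares its undirected, unweighted sequence graph with $G'$. The helper names, the example graph and the convention that co-occurrences involve distinct positions are assumptions made for this illustration.

```python
def unweighted_graph(x, w):
    """Undirected, unweighted w-sequence graph of x, as a set of frozensets."""
    edges = set()
    for k in range(len(x)):
        for k2 in range(k + 1, min(k + w, len(x))):
            # A loop {v} arises when v is repeated inside a window.
            edges.add(frozenset((x[k], x[k2])))
    return edges


def clique_reduction_sequence(vertices, edges, clique, w):
    """Sequence a^w u1 u1' a^w ... a^w C b^w v1 b^w v2 ... b^w v_|V|."""
    seq = ["a"] * w
    for (u, v) in edges:
        seq += [u, v] + ["a"] * w
    seq += list(clique)
    for v in vertices:
        seq += ["b"] * w + [v]
    return seq


# Example: a triangle {1, 2, 3} with a pendant vertex 4, and k = 3.
vertices = [1, 2, 3, 4]
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
k = 3
w = k + 1

# G' = G plus looped vertices a and b, each adjacent to every vertex of G.
g_prime = {frozenset(e) for e in edges}
g_prime |= {frozenset((c, v)) for c in "ab" for v in vertices}
g_prime |= {frozenset("a"), frozenset("b")}

seq = clique_reduction_sequence(vertices, edges, [1, 2, 3], w)
print(unweighted_graph(seq, w) == g_prime)
```

Replacing the clique by three vertices that are not pairwise adjacent, for instance `[1, 2, 4]`, creates edges absent from $G'$, which is the intuition behind the converse direction.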

▶ Corollary 7. GU-Realizable is W[1]-hard for parameter $w$.

DU-Realizable is NP-hard for $w \geq 3$


Consider the following intermediate problem:

OptionalRealizable$_w$: Given a directed unweighted graph $D = (V, A)$, a subset $A' \subseteq A$ of compulsory arcs, and two distinguished vertices $s, s' \in V$, is there a sequence $S$ such that the graph of $S$ contains only arcs in $A$ and (at least) all arcs in $A'$?

We first prove that this problem is NP-hard, then show how it reduces to DU-Realizable.

OptionalRealizable$_w$, $w \geq 3$, is NP-hard


Given $G = (V, A)$ and a start vertex $s$, build a directed weighted graph $G' = (V', A')$ as follows:

Vertex set: $V' = \bigcup_{v \in V} \{v^0, v^1\} \cup \{x_p^i,\ 1 \leq p \leq 2n + 1,\ 1 \leq i \leq w - 2\}$

Arc set:
- optional arcs $(x_{2p-1}^i, v^0)$, $(v^0, x_{2p}^i)$, $(x_{2p}^i, v^1)$, $(v^1, x_{2p+1}^i)$ for each $v \in V$, $1 \leq p \leq n$, $1 \leq i \leq w - 2$
- optional arcs $(u^1, v^0)$ for each $(u, v)$ in $A$
- compulsory arcs $(v^0, v^1)$ for each $v \in V$
- optional arcs $(x_p^i, x_p^j)$ for $i < j$ and $(x_p^i, x_{p+1}^j)$ for $j \leq i$

Start vertices are $(x_0^1, \ldots, x_0^{k-2}, s)$.

$G'$ is a yes-instance $\Leftrightarrow$ $G$ admits a hamiltonian path


$\Leftarrow$ Let $v_p$ be the $p$th vertex of $V$ in the hamiltonian path. Let $X_p$ be the sequence $x_{2p-1}^0 \ldots x_{2p-1}^{w-2}\, v_p^0\, x_{2p}^0 \ldots x_{2p}^{w-2}\, v_p^1$. Let $X_{n+1} = x_{2n}^0 \ldots x_{2n}^{w-2}$, and let $S$ be the concatenation $X_1 \ldots X_{n+1}$. It can be checked that $S$ contains only arcs of $A$ and all compulsory arcs.

$\Rightarrow$ Consider a sequence $S$, an occurrence of $x_p^i$ in $S$ for some $1 \leq i \leq w - 2$, $1 \leq p \leq n$ (note that $p \neq n + 1$), and let $S'$ be the subsequence of $S$ containing the $w - 1$ characters following $x_{2p+1}^i$. Let $T = x_p^{i+1} \ldots x_p^{w-2}$ and $U = x_{p+1}^1 \ldots x_{p+1}^i$ (note that $T$ is possibly empty). $T$ and $U$ are seen both as strings and as sets of vertices. The out-neighborhood of $x_p^i$ contains all vertices of $T \cup U$, as well as all vertices $v^q$ for $v \in V$, where $q = 0$ if $p$ is odd and $q = 1$ if $p$ is even. Since there are $k - 2$ vertices in $T \cup U$, and no vertex has a self-loop, by the pigeonhole principle the string $S'$ must contain at least one vertex $v^q$, $v \in V$. Since there is no arc $(v^q, v'^q)$ for $v, v' \in V$, $S'$ contains exactly one vertex $v^q$, thus it also contains all vertices of $T \cup U$. Based on the direction of the arcs in $T \cup U \cup \{v^q\}$, it follows that $S' = T \cdot v^q \cdot U$.

Let $X_p$ be the string $x_p^1 \ldots x_p^{w-2}$. From the arguments above, and the fact that $S$ starts with $X_1$, there exist indices $i_1, j_1, \ldots, i_n, j_n$ such that

$$S = X_1\, v_{i_1}^0\, X_2\, v_{j_1}^1\, X_3\, v_{i_2}^0\, X_4\, v_{j_2}^1\, X_5 \ldots X_{2n+1}$$

From the window size $w$, there must exist an arc $(v_{i_p}^0, v_{j_p}^1)$ for each $p$, so by construction $i_p = j_p$. Furthermore, these arcs are compulsory for each vertex $v^0$, so $(i_1, \ldots, i_n)$ is a permutation of $\{1, \ldots, n\}$. Finally, there also exists an arc $(v_{j_p}^1, v_{i_{p+1}}^0)$ in $G'$, so there exists an arc $(v_{i_p}, v_{i_{p+1}})$ in $G$. Thus, $(v_{i_1}, \ldots, v_{i_n})$ is a hamiltonian path in $G$.


DU-Realizable$_w$ is NP-hard

By reduction from OptionalRealizable$_w$. We are given a directed unweighted graph $G = (V, A)$, a subset $A' \subseteq A$ of compulsory arcs (let $A'' = A \setminus A'$ be the set of optional arcs), an integer $w$, and $w - 1$ distinguished vertices $s_1 \ldots s_{w-1} \in V$.

Let $m = |A''|$, and write $A'' = \{(u_1, v_1), \ldots, (u_m, v_m)\}$. Create $G'$ by adding $w(m + 1)$ separator vertices $y_p^i$, $1 \leq p \leq m + 1$ and $1 \leq i \leq w$, and $m$ vertices $z_p$. Build the strings

$$Z = \left( \prod_{p=1}^{m} \left( y_p^1 \ldots y_p^w\, u_p\, z_p\, v_p \right) \right) y_{m+1}^1 \ldots y_{m+1}^w$$

$$Z' = Z\, s_1 \ldots s_{w-1}$$

Add all arcs realized by $Z'$ involving $y_p^i$ and/or $z_p$ to $G'$.


$G$ has a realization with optional arcs $\Leftrightarrow$ $G'$ has a realization

$\Rightarrow$ Build a realization for $G'$ by concatenating $Z$ with the realization for $G$ starting with $s_1 \ldots s_{w-1}$. All optional arcs of $G'$ are realized in $Z$, all compulsory arcs of $G'$ are realized in the suffix (the realization of $G$), and all arcs involving a separator are realized in $Z'$. No forbidden arc is realized.

$\Leftarrow$ Let $S$ be a realization of $G'$. The set of in-neighbors of any separator has size at most $w - 1$ and induces a tournament in $G'$ (this is clear for all arcs involving separators; it is also true for a potential pair of vertices $(u_i, v_i)$ of $G$, since $G$ has no length-2 cycle). So the $w - 1$ characters before any separator are ordered as in $Z$. Furthermore, each separator (except $y_1^1$) has at least one other separator in its in-neighborhood, so any occurrence of a separator is actually the last character of a substring of $S$ equal to a prefix of $Z$. Since $y_1^1$ has in-degree 0, it may only appear as the first character of $S$, and any prefix of $Z$ in $S$ is also a prefix of $S$. Moreover, since $y_{m+1}^w$ must appear in $S$, we have $S = Z S'$ with no separator appearing in $S'$. Thus $S'$ realizes only arcs from $G$. From the out-neighborhood of $y_{m+1}^1, \ldots, y_{m+1}^w$, we have that $S'$ starts with $s_1, \ldots, s_{w-1}$. Moreover, no compulsory arc of $G$ is realized in $Z$, nor with one vertex in $Z$ and one in $S'$ (since such arcs start with a separator), so all compulsory arcs are realized in $S'$. Overall, $G$ is a yes-instance of OptionalRealizable$_w$ with sequence $S'$.


GW-Realizable$_w$, DW-Realizable$_w$ are NP-hard for all $w \geq 3$

By reduction from a variant of hamiltonian path:
Input: Undirected graph $G$ with two degree-1 vertices.
Question: Does $G$ have a hamiltonian path?


Figure 5: Subgraphs used in the reduction from Hamiltonian Path to DW-Realizable$_3$: the start gadget, the queue gadget, the vertex gadget (one for each vertex $u$, including $s$ and $t$) and the edge gadget (one for each edge $\{u, v\}$). Weights on double arcs apply to both directions. Note that some arcs appear in different gadgets, in which case the weights should be summed (in particular, loops on $s$ and $t$ have total weight $2 d_u \binom{k}{2} + \binom{k+1}{2} + \binom{k}{2}$).

Note that this variant of HP is easily shown to be NP-hard from Hamiltonian cycle via the following reduction: given a graph $G$ on which we need to find a hamiltonian cycle, pick any vertex $v$, duplicate it into $v_1, v_2$ (each edge $\{u, v\}$ becomes two edges $\{u, v_1\}$ and $\{u, v_2\}$), and add pendant vertices $s$ and $t$ connected to $v_1$ and $v_2$ respectively.


Reduction for DW-Realizable

Given $G = (V, E)$ with degree-1 vertices $s$ and $t$, build a directed weighted graph $G' = (V', A)$ as follows:

Vertex set. For each $u \in V$, create a vertex denoted $u'$. Create two additional dummy vertices $a$ and $b$. Let $V' := \{a, b, s_0, s'_0\} \cup \bigcup_{u \in V} \{u, u'\}$. The arcs are given in Figure 5, as the union of the start gadget, the queue gadget, and the vertex and edge gadgets respectively for each vertex and edge of $G$.

Reduction for GW-Realizable

Build the directed graph $G'$ as above, and let $G'_u$ be the undirected version of $G'$: remove arc orientations; for $u \neq v$ the weight of $\{u, v\}$ is the sum of the weights of $(u, v)$ and $(v, u)$ in $G'$ (the weight of loops is unchanged).


Main claims

We prove the following three claims:
(i) $G$ hamiltonian $\Rightarrow$ $G'$ has a realization
(ii) $G'$ has a realization $\Rightarrow$ $G'_u$ has a realization
(iii) $G'_u$ has a realization $\Rightarrow$ $G$ hamiltonian

Together, they show the correctness of the reductions for both GW-Realizable and DW-Realizable, since they yield:
$G$ hamiltonian $\Leftrightarrow$ $G'$ has a realization
$G$ hamiltonian $\Leftrightarrow$ $G'_u$ has a realization


Proof of Claim (i). $G$ has a hamiltonian path; let $(u_1 = s, u_2, \ldots, u_n = t)$ be this path and $(v_1, w_1), \ldots, (v_{m'}, w_{m'})$ be the pairs of connected vertices except the pairs $(u_i, u_{i+1})$, i.e. the set $\bigcup_{\{u,v\} \in E} \{(u, v), (v, u)\} \setminus \{(u_i, u_{i+1}) \mid 1 \leq i < n\}$. Note that $m' = 2m - (n - 1)$. Define the sequence $S$ as follows.

$$S := s'_0\, s_0^k\, a\ s^k s' s^k\, a\ u_2^k u'_2 u_2^k\, a \ldots a\ u_{n-1}^k u'_{n-1} u_{n-1}^k\, a\ t^k t' t^k\, a\ b^w v_1^k\, b\, w_1^k\ b^w v_2^k\, b\, w_2^k \ldots b^w v_{m'}^k\, b\, w_{m'}^k\ b^w$$

Note that a sequence of the form $x^k a\, y^k$ yields $\binom{k}{2}$ loops for $x$, $\binom{k}{2}$ loops for $y$, as well as $\binom{k+1}{2}$ arcs $(x, y)$ (indeed, there are $1 + 2 + \ldots + (w - 2) = \binom{k+1}{2}$ such arcs). A sequence of the form $b\, x^k b^w$ yields in particular an arc $(b, x)$ of weight $k$ and an arc $(x, b)$ of weight $\binom{k+1}{2} + k$. ◀

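The arc counts used in the proof of Claim (i) can be verified numerically. The following Python sketch assumes $k = w - 2$, which is what the identity $1 + 2 + \ldots + (w - 2) = \binom{k+1}{2}$ suggests, and counts co-occurrences over pairs of positions $j < j' \leq j + w - 1$. The helper names are hypothetical.

```python
from collections import Counter
from math import comb


def weighted_digraph(x, w):
    """Arc weights of the w-sequence digraph of x."""
    pi = Counter()
    for j in range(len(x)):
        for j2 in range(j + 1, min(j + w, len(x))):
            pi[(x[j], x[j2])] += 1
    return pi


def claim_counts_hold(w):
    """Check the counts stated for x^k a y^k and b x^k b^w, with k = w - 2."""
    k = w - 2  # assumption on the relation between k and w
    first = weighted_digraph(["x"] * k + ["a"] + ["y"] * k, w)
    second = weighted_digraph(["b"] + ["x"] * k + ["b"] * w, w)
    return (
        first[("x", "x")] == comb(k, 2)
        and first[("y", "y")] == comb(k, 2)
        and first[("x", "y")] == comb(k + 1, 2)
        and second[("b", "x")] == k
        and second[("x", "b")] == comb(k + 1, 2) + k
    )


print(all(claim_counts_hold(w) for w in range(3, 9)))
```

The check covers only the two local patterns discussed in the proof; it does not rebuild the full sequence $S$ or the gadget weights of Figure 5.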

Proof of Claim (ii). Clear: any realization of $G'$ is a realization of $G'_u$. ◀


Proof of Claim (iii). Pick a realization $S$ of $G'_u$. Define the weight of a vertex in $G'_u$ as the sum of the weights of its incident edges (counting loops twice). From the construction, we obtain the following weights for a selection of vertices:
- $s'_0$ has weight $w - 1$
- $u'$ has weight $2(w - 1)$ for $u \in V$
- $a$ has weight $2(n + 1)(w - 1)$

From the weight of $s'_0$, it follows that this vertex must be an endpoint of $S$ (wlog, $S$ starts with $s'_0$). It follows that any other vertex $v$ with weight $2i(w - 1)$ must have exactly $i$ occurrences in $S$ (in general it can be either $i$ or $i + 1$, but if $v$ has $i + 1$ occurrences it must be both the first and last character of $S$, i.e. $v = s'_0$: a contradiction). Thus each $u'$ occurs once and $a$ occurs $n + 1$ times in $S$.

Each $u'$ occurs once, so we order the vertices of $V$ according to their occurrence in $S$ (i.e. $V = \{u_1, \ldots, u_n\}$ with $u'_1$ appearing before $u'_2$, etc.). For each $i$, the neighborhood of $u'_i$ in $S$ contains $a$ twice, one $a$ on each side (since there is no $(a, a)$ loop). Other neighbors of $u'_i$ may only be occurrences of $u_i$, so each $u'_i$ belongs to a factor, denoted $X_i$, of the form $a\, u_i^* u'_i u_i^*\, a$. Two consecutive factors $X_i, X_{i+1}$ may overlap by at most one character ($a$), and if they do, then there exists an edge $\{u_i, u_{i+1}\}$ in $G$ (since $w \geq 3$). There are $n$ such factors $X_i$, and only $n + 1$ occurrences of $a$, so all $a$s except the extreme ones belong to the overlap of two consecutive $X_i$s, and there exists an edge $\{u_i, u_{i+1}\}$ for each $i$. Thus $(u_1, \ldots, u_n)$ is a hamiltonian path of $G$. ◀


4 Effective general algorithms

4.1 Realizable$_w$: linear integer programming formulation

Let $G = (V, E)$ be a graph with integer weights $\pi_e$, $e \in E$. In this model, we represent a sequence $x$ over the alphabet $\{1, \ldots, n\}$ as a $(0{-}1)$ matrix $X \in M_{n,p}(\{0, 1\})$ encoding the sequence $x$:

$$X_{i,j} = \begin{cases} 1 & \text{if } x_j = i \\ 0 & \text{otherwise} \end{cases}$$


It should be noted that the set of sequences over the alphabet $\{1, \ldots, n\}$ is exactly represented by the $(0{-}1)$ matrices such that

$$\forall j \in \{1, \ldots, p\}, \quad \sum_{i=1}^{n} X_{i,j} = 1$$

Given a window size $w$, a unit of $\pi_{e=(v_1,v_2)}$ corresponds to the appearance of two elements $v_1$, $v_2$ at a distance $i \in \{1, \ldots, w - 1\}$ in the sequence. Now, let us consider a fixed distance $i$ and a starting index $j \in \{1, \ldots, p - i\}$. We use an intermediary slack variable $y_e^j(i) \in \{0, 1\}$ to model the presence of such an appearance using the constraint:

$$X_{v_1,j}\, X_{v_2,j+i} = y_e^j(i) \tag{8}$$

Then, the Boolean variable $y_e^j(i)$ is equal to 1 when $v_1$ is located at position $j$ and $v_2$ at position $j + i$. We linearise Eq. 8 as:

$$\begin{cases} -X_{v_1,j} + y_e^j(i) \leq 0 \\ -X_{v_2,j+i} + y_e^j(i) \leq 0 \\ X_{v_1,j} + X_{v_2,j+i} - y_e^j(i) \leq 1 \end{cases} \tag{9}$$


C=

w−1

X

i=1

(pi) =p(w−1)−w(w−1)

2 = (w−1)(pw 2)

Therefore, in our model, C corresponds to the number of slack variables attributed to

361

constraints for an edge of the graph.

362
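The closed form for $C$ follows from $\sum_{i=1}^{w-1} i = w(w-1)/2$. The short Python check below compares the sum with the closed form on a range of values; the function names are ours, and exact fractions are used to avoid rounding when $w$ is odd.

```python
from fractions import Fraction


def slack_count(p, w):
    """Number of (distance, start) pairs: sum over i = 1..w-1 of (p - i)."""
    return sum(p - i for i in range(1, w))


def closed_form(p, w):
    """Closed form (w - 1)(p - w / 2), computed exactly."""
    return (w - 1) * (Fraction(p) - Fraction(w, 2))


agree = all(
    slack_count(p, w) == closed_form(p, w)
    for w in range(2, 10)
    for p in range(w, 30)
)
print(agree, slack_count(11, 3))
```

For instance, a sequence of length $p = 11$ with $w = 3$ gives $10 + 9 = 19$ slack variables per edge.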

Conversely, the absence of an edge $e = (v_1, v_2)$, corresponding to $\pi_e = 0$, can be modeled for a distance $i \in \{1, \ldots, w - 1\}$ and a starting position $j \in \{1, \ldots, p - i\}$ as:

$$X_{v_1,j} + X_{v_2,j+i} \leq 1$$

Then, Realizable$_w$ can be formulated as the following linear integer program:


$$\min_{X \in \{0,1\}^{n \times p},\ y \in \{0,1\}^{|E| \times C}} \ \sum_{e \in E} \ \sum_{i \in \{1, \ldots, w-1\}} \left( y_e^1(i) + \ldots + y_e^{p-i}(i) \right)$$

under the constraints

$$\forall j \in \{1, \ldots, p\}, \quad \sum_{i=1}^{n} X_{i,j} = 1$$

and, for all $e = (v_1, v_2) \in E$, all $e' = (v'_1, v'_2) \notin E$, all $i \in \{1, \ldots, w - 1\}$ and all $j \in \{1, \ldots, p - i\}$:

$$\begin{cases} -X_{v_1,j} + y_e^j(i) \leq 0 \\ -X_{v_2,j+i} + y_e^j(i) \leq 0 \\ X_{v_1,j} + X_{v_2,j+i} - y_e^j(i) \leq 1 \\ X_{v'_1,j} + X_{v'_2,j+i} \leq 1 \end{cases}$$

