HAL Id: hal-01521890
https://hal-enac.archives-ouvertes.fr/hal-01521890
Submitted on 18 May 2017
HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
Knowledge Discovery in Graphs Through Vertex Separation
Marc Sarfati, Marc Queudot, Catherine Mancel, Marie-Jean Meurs
To cite this version:
Marc Sarfati, Marc Queudot, Catherine Mancel, Marie-Jean Meurs. Knowledge Discovery in Graphs
Through Vertex Separation. AI 2017, 30th Canadian Conference on Artificial Intelligence, May 2017,
Edmonton, Canada. pp 203-214, �10.1007/978-3-319-57351-9_25�. �hal-01521890�
Knowledge Discovery in Graphs through Vertex Separation
Marc Sarfati
1,2, Marc Queudot
2, Catherine Mancel
3, and Marie-Jean Meurs
21
Ecole Polytechnique, Palaiseau, France ´
2
Universit´e du Qu´ebec `a Montr´eal, Montr´eal, QC, Canada
3
ENAC ( ´ Ecole Nationale de l’Aviation Civile), Universit´e de Toulouse, Toulouse, France
Abstract. This paper presents our ongoing work on the Vertex Separator Prob- lem (VSP), and its application to knowledge discovery in graphs representing real data. The classic VSP is modeled as an integer linear program. We propose several variants to adapt this model to graphs with various properties. To evaluate the relevance of our approach on real data, we created two graphs of different size from the IMDb database. The model was applied to the separation of these graphs. The results demonstrate how the model is able to semantically separate graphs into clusters.
Keywords: graph partitioning, knowledge discovery, Vertex Separator Problem
1 Introduction
Extracting knowledge from graphs is often performed by clustering algorithms. While several methods are based on random walks or spectral decomposition for example [12], this paper focuses on graph separation through the Vertex Separator Problem (VSP) and its applications. The VSP can be formally defined as follows. Given a connected undi- rected graph G = (V, E), a vertex separator in G is a subset of vertices whose removal disconnects G. In other words, finding a vertex separator is finding a partition of V composed of 3 non-empty classes A, B and C such that there is no edge between A and B, as depicted in Figure 1. C is usually called the separator while A and B are the pil- lars. The VSP appears in a wide range of applications like very-large-scale integration design [15], matrix factorization [4], bioinformatics [6], or network security [13].
Fig. 1: Example of vertex separation
A C B
Fig. 2: Trade-off between separator-size and balanced components
The VSP consists in finding a minimum-sized separator in any connected graph. In any finite connected graph which is not complete, one can find a minimum-sized sepa- rator since the number of subsets of the vertex set is finite. The VSP is NP-hard [8,3], even if the graph is planar [7]. In [10], Kanevsky provided an algorithm to find all minimum-sized separators in a graph. However, the method is not efficient for comput- ing balanced pillars, especially on large instances. Indeed, computing a minimum-sized separator often leads to finding pillars of highly unbalanced cardinalities. Since most of the applications to real-life problems look for reasonably balanced pillars, adding cardinality constraints often results in finding slightly larger separators. Figure 2 shows on a small graph this trade-off between finding a small-sized separator and finding a balanced result. A minimum-sized separator is only composed of the red vertex, which separates the graph in two subsets whose cardinalities are 9 and 1. Any graph which contains a terminal vertex (i.e. which degree is 1). In Figure 2, the separator composed of blue vertices may be a better choice because it separates the graph into balanced dense subsets.
Other approaches to solve the VSP have been recently proposed. Among them, Benlic and Hao present an approximate solution, which uses Breakout Local Search [2].
The algorithm requires an initial solution, and then looks for a local optimum using a mountain climbing method. In order to avoid getting stuck on a local optimum, once the algorithm reaches a local optimum, a perturbation is added to leap to another search zone. A continuous polyhedral solution is proposed by Hager and Hungerford in [9]
where the optimization problem relaxes the separation constraint (|(A × B) ∩ E| = 0) to include it in the objective function. However, the optimization problem is quadratic thus not easily solvable by a mathematical programming solver. Following the seminal work from Balas and De Souza [1,14], the work we present in this paper is based on the polyhedral approach introduced by Didi Biha and Meurs. In [5], they presented a formulation of a Mixed-Integer Linear Programming (MILP) to solve the VSP. This solution is useful and elegant since it is exact and allows to easily add constraints on the sizes of the pillars.
Our initial model and the proposed variants are presented in the next Section. To evaluate the efficiency of our approach and its ability to provide semantic sets of ver- tices (pillars), we applied our approach to several graphs extracted from the Internet Movie Database (IMDb)
1. Section 3 describes these experiments and results, while we conclude and discuss future work in Section 4.
1
Information courtesy of IMDb (http://www.imdb.com). Used with permission.
2 Solving the VSP
2.1 Initial Model
Since this paper presents an extension of [5], it uses the same notations for the sake of clarity. Let G = (V, E) be a finite undirected graph, the goal is to find a parti- tion (A, B, C) as defined previously. Let x and y be two vectors indexed by the ver- tices in V , which are indicators of the presence of a vertex in A or B (i.e. x
u= 1 if u ∈ A, x
u= 0 otherwise). Computing the VSP is equivalent to finding a par- tition (A, B, C) such as there is no edge between A and B and if n denotes |V |,
|A| + |B| = n − |C| = P
v∈V
(x
v+ y
v) is maximized. The integer linear program is hence formulated as follows:
max
x,y∈{0,1}n
X
v∈V
(x
v+ y
v) (VSP)
Subject to
∀v ∈ V, x
v+ y
v≤ 1 (1)
∀(uv) ∈ E, x
u+ y
v≤ 1 (2)
∀(uv) ∈ E, x
v+ y
u≤ 1 (3)
Note that (VSP) is equivalent to
max
x,y∈{0,1}n
|A| + |B | (VSP’)
Constraint (1) reflects the fact that a vertex can not simultaneously be in A and in B. (2) and (3) express that there is no edge between A and B . (VSP) subject to (1)- (3) computes a minimum-sized separator of a graph G. An extension of the Menger’s theorem defined in [10] states that the size of a separator in a graph is necessarily greater than or equal to the connectivity of the graph. Adding this constraint reduces the size of the polyhedron admissible solutions, shrinking down the computation time needed to solve the model. Let k
Gdenote the connectivity of the graph, the linear program can be strengthened by adding a new constraint :
X
v∈V
(x
v+ y
v) ≤ n − k
G(4)
As seen previously, it can be useful to compute separators which lead to balanced pil- lars. Didi Biha and Meurs added two constraints to the linear program (VSP) in order to avoid large imbalances. Given β(n) an arbitrary number, they forced the cardinali- ties of the pillars A and B to be smaller than or equal to β(n). The literature usually recommends β(n) = 2n/3 [11].
X
v∈V
x
v≤ β(n) (5)
X
v∈V
y
v≤ β (n) (6)
2.2 Relaxing the constraints on the partition size
(VSP) subject to (1)-(6) computes a low-sized separator of a graph while generally avoiding large imbalances in the pillars. However, β(n) is arbitrarily chosen so it may lead to inconsistency. The graph in Figure 3 is clearly composed of 2 different dense blocks (cliques) linked by a single vertex (denoted 25). The bigger clique contains more than 2n/3 vertices, hence the linear program can not choose {25} as a separator. In order to satisfy constraints (5) and (6), the solver must add unnecessary vertices in the separator, which are part of a dense connected component.
Fig. 3: Inconsistency due to the choice of β(n)
In this paper we present a new approach to compute low-sized balanced separator in a graph. Instead of fixing an arbitrary upper bound to the sizes of the pillars, we relax constraints (5) and (6) but we penalize large gaps between |A| and |B| in the objective function. The objective function becomes:
x,y∈{0,1}
max
n|A| + |B| − γ(|B| − |A|) where γ ∈ [0, 1] and
|A| ≤ |B| (7)
Note that because pillars A and B have no predefined meaning, one can set |A| ≤
|B| or |B| ≤ |A| with no loss of generality.
The objective function can be simplified as follows:
|A| + |B| + γ(|A| − |B|) = (1 + γ)|A| + (1 − γ)|B| = (1 + γ)[|A| +
1−γ1+γ|B|].
Since (1 + γ) is positive, we can remove this term from the objective function and still find the same solutions x and y. Moreover γ 7→
1−γ1+γinduces a bijection between [0, 1]
and [0, 1], thus the objective function of the relaxed problem is:
max
x,y∈{0,1}n
|A| + λ|B| with λ ∈ [0, 1] (VSPr)
The new linear program is formulated as (VSPr) subject to (1)-(4) and (7). We can notice that λ measures the trade-off between finding a small size separator and finding balanced pillars. If λ = 1, (VSPr) is equivalent to the first linear program (VSP) so the solver will output a minimum-sized separator. If λ = 0 then the program will maximize
|A| given (7), thus generating a well balanced partition.
2.3 Adding weights to the vertices
Unlike Kanevsky’s algorithm, the linear programming solution allows to take into ac- count vertex weights. Let c denote the vector containing the vertex weights. For any subset S ⊂ V , we define the weight of S, c(S) = P
v∈S
c
v. The program which uses vertex weights is formulated as is :
max
x,y∈{0,1}n
c(A) + λ c(B) (VSPw)
subject to (1)-(4) and
c(A) ≤ c(B) (8)
This version, though taking into account vertex weights, is as simple as the previous one. Note that c(A) (respectively c(B)) can be calculated as c
>x (respectively c
>y).
2.4 Taking distances into account
We have previously seen that the VSP can be used in order to split a graph into balanced pillars. In such a context, the data points (graph vertices) have relational links (graph edges) but these links can also be associated to intrinsic information which leads to defining metrics, as for example distances between vertices. In order to have more con- sistent results, it is interesting to take these distances into account. The main goal here is to find a well-balanced partition separated by a small size separator while minimizing intern distances in the resulting pillars.
Let d : (u, v) 7→ d(u, v) denote the metric between the vertices. Since we only care about the distances between vertices that are both in A or B, we only consider pairs (u, v) where (x
u= 1 ∧ x
v= 1) or (y
u= 1 ∧ y
v= 1). This means we only take into account the pairs (u, v) where x
ux
v= 1 or y
uy
v= 1, i.e. x
ux
v+ y
uy
v= 1 because both terms can not be equal to 1.
The constraints of the optimization program remain (1)-(4), (8) but the objective function becomes:
max
x,y∈{0,1}n
(1 − µ)(c(A) + λ c(B)) − µ X
{u,v}∈V2
d(u, v)(x
ux
v+ y
uy
v) (VSPd)
where µ ∈ [0, 1]. As for λ, µ measures the trade-off between finding a partition with a small size separator and whose weights are balanced, and minimizing the distances within the pillars A and B .
If µ = 1 the solver will minimize the distance within the subsets, so it will surely put all the vertices in the separator C to reach its optimal value. If µ = 0, the optimization program is exactly (VSPw). The higher µ is, the more the optimization program takes the distances into account.
The program (VSPd) is not linear anymore since it contains quadratic terms in
the objective function. Common free and commercial solvers are also able to solve
Mixed-Integer Non-Linear Programs (MINLP), however the computation time is usu-
ally much higher. To linearize (VSPd), we introduce a new family of variables (z
uv),
for all (u, v) ∈ V
2, v > u. We want z
uvto be an indicator whether u and v are both in
the same pillar. We can not define z
uv= x
ux
v+ y
uy
vbecause it would add quadratic constraints to the linear program. Hence in order to have the family (z
uv) be such indi- cators, we add the following constraints to the linear program.
∀(u, v) ∈ V
2, v > u, z
uv≥ x
u+ x
v− 1 (9)
∀(u, v) ∈ V
2, v > u, z
uv≥ y
u+ y
v− 1 (10)
∀(u, v) ∈ V
2, v > u, z
uv∈ {0, 1} (11) The new objective function is :
max
x,y∈{0,1}n
(1 − µ)(c(A) + λ c(B)) − µ X
{u,v}∈V2