HAL Id: hal-00200109
https://hal.archives-ouvertes.fr/hal-00200109
Preprint submitted on 20 Dec 2007
HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from
L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de
Graph kernels between point clouds
Francis Bach
To cite this version:
Francis Bach. Graph kernels between point clouds. 2007. �hal-00200109�
hal-00200109, version 1 - 20 Dec 2007
Graph kernels between point clouds
Francis R. Bach francis.bach@mines.org
INRIA - Willow project
D´ epartement d’Informatique, Ecole Normale Sup´ erieure 45, rue d’Ulm
75230 Paris, France
Abstract
Point clouds are sets of points in two or three dimensions. Most kernel methods for learning on sets of points have not yet dealt with the specific geometrical invariances and practical constraints associated with point clouds in computer vision and graphics. In this paper, we present extensions of graph kernels for point clouds, which allow to use kernel methods for such objects as shapes, line drawings, or any three-dimensional point clouds. In order to design rich and numerically efficient kernels with as few free parameters as possible, we use kernels between covariance matrices and their factorizations on graphical models. We derive polynomial time dynamic programming recursions and present applications to recognition of handwritten digits and Chinese characters from few training examples.
1. Introduction
In recent years, kernels for structured data have been designed in many domains, such as bioin- formatics [1], speech processing [2], text processing [3] and computer vision [4]. They provide an elegant way of including known a priori information, by using directly the natural topological struc- ture of objects. Using a priori knowledge through structured kernels have proved beneficial because it allows to reduce the number of training examples, and to re-use existing data representations that are already well developed by experts of those domains.
In this paper, we propose a kernel between point clouds, with applications to classification of line drawings (such as handwritten digits [5] or Chinese characters [6, 7]) or shapes [8]. The natural geometrical structure of point clouds is hard to represent in a few real-valued features [9], in particular because of (a) the required local or global invariances by rotation, scaling, and/or translation, (b) the lack of pre-established registrations of the point clouds (i.e., points from one cloud are not matched to points from another cloud), and (c) the noise and occlusion that impose that only portions of two point clouds ought to be compared.
Following one of the leading principles for designing kernels between structured data, we propose to look at all possible partial matches between two point clouds [10]. More precisely, we assume that each point cloud has a graph structure (most often a neighborhood graph), and we consider recently introduced graph kernels [11, 12]. Intuitively, these kernels consider all possible subgraphs and compare and count the matching subgraphs. However, the set of subgraphs (or even the set of paths) has exponential size and cannot be efficiently described recursively; so larger sets of substructures are commonly used, e.g., walks and tree-walks. As shown in Section 2, by choosing appropriate substructures and fully factorized local kernels, efficient dynamic programming implementations allow to sum over an exponential number of substructures in polynomial time. The kernel thus provides an efficient and elegant way of considering very large feature spaces (see, e.g., [10]).
However, in the context of computer vision, substructures correspond to matched sets of points, and dealing with local invariances imposes to use a local kernel that cannot be readily expressed as a product of separate terms for each pair of points, and the usual dynamic programming approaches cannot then be applied. The main contribution of this paper is to design a local kernel that is not fully factorized but can be instead factorized according to the graph underlying the substructure.
This is naturally done through graphical models and the design of positive kernels for covariance
Figure 1: (left) path, (center) 1-walk which is not a 2-walk, (right) 2-walk which is not a 3-walk.
matrices that factorize on graphical models (Section 3). With this novel local kernel, we derive new polynomial time dynamic programming recursions in Section 4. In Section 5, we present simulations on handwritten character recognition.
2. Graph kernels
In this section, we consider two labelled undirected graphs G = (V, E, a, x) and H = (W, F, b, y).
Two types of labels are considered: attributes, which are denoted a(v) ∈ A for vertex v ∈ V and b(w) ∈ A for vertex w ∈ W and positions, which are denoted x(v) ∈ X and y(w) ∈ X . Our motivating examples are line drawings, where X = A = R
2.
The definition of graph kernels between G and H relies on a set of substructures of the graphs.
The most natural ones are paths, subtrees and more generally subgraphs; however, they do not lead to efficient enumerations, and recent work [11, 13] has focused on larger sets of substructures that we now present.
2.1 Paths, walks, subtrees and tree-walks
Given the undirected graph G with vertex set V , a path is a sequence of distinct connected vertices, while a walk is a sequence of possibly non distinct connected vertices. For any positive integer β, we define β-walks as walks such that any β + 1 successive vertices are distinct (1-walks are regular walks). Note that when the graph G is a tree (no cycles), then the set of 2-walks is equal to the set of paths (see examples in Figure 1). More generally, for any graph, β -walks of length β + 1 are exactly paths of length β + 1.
A subtree is a subgraph of with no cycles. We can represent a subtree of G by a tree structure T over the vertex set {1, . . . , |T |}, where |T | is the number of nodes in T , and a sequence of distinct consistent labels I ∈ V
|T|(i.e., that are neighbors in G when neighbors in T ). In this paper, we consider only rooted subtrees, i.e., subtrees where a specific node is identified as the root.
1The notion of walk is extending the notion of path by allowing nodes to be equal. Similarly, we can extend the notion of subtrees to tree-walks, which can have nodes that are equal. More precisely, we define an α-ary tree-walk of depth γ of G as a (non complete) labelled α-ary tree of depth γ with nodes labelled by vertices in G, and such that the labels of neighbors in the tree-walk are neighbors in G. A tree-walk may be represented by a tree structure T over the vertex set {1, . . . , |T |} and a sequence of consistent but possibly non distinct labels I ∈ V
|T|. We can also define β-tree-walks, as tree-walks such that for each node in T , its label (i.e. an element of V ) and the ones of all its descendants up to the β-th generation are all distinct. With that definition, 1-tree-walks are regular tree-walks (see Figure 2). Note that if α = 1, we get back β-walks and the graph kernels that we use are often referred to as random walk kernels [11]. From now on, we refer to the descendants up to the β-th generation as the β-descendants.
We let denote T
α,γthe set of tree structures of depth less than γ and with at most α children per node. For T ∈ T
α,γ, we define J
β(T, G) the set of consistent labellings of T by vertices in V leading to β-tree-walks. With these definitions, a β-tree-walk of G is characterized by a tree structure T ∈ T
α,γand a labelling I ∈ J
β(T, G).
1. Moreover, all the trees that we consider are unordered trees (i.e., no order is considered among siblings).
Figure 2: (left) 2-tree-walk, (center) 1-tree-walk which is not a 2-tree-walk.
2.2 Graph kernels
Assuming a local kernel q
T,I,J(G, H) between two tree-walks that share the same structure is given, following [11], we can define the tree-kernel as the sum over all matching tree-walks of G and H of the local kernel:
k
Tα,β,γ(G, H) = X
T∈Tα,γ
f
λ,ν(T ) X
I∈Jβ(T,G)
X
J∈Jβ(T,H)
q
T,I,J(G, H). (1)
If the kernel q
T,I,J(G, H) has positive values and is equal to 1 if the two tree-walks are equal, it can be seen as a soft matching indicator, and then the kernel in Eq. (1) simply counts the softly matched tree-walks in the two graphs.
We add a nonnegative penalization f
λ,ν(T ) depending only on the tree-structure. Besides the usual penalization of the number of nodes |T |, we also add a penalization of the number of leaf nodes ℓ(T ). More precisely, we use the penalization f
λ,ν= λ
|T|ν
ℓ(T). This penalization suggested by [13]
is essential in our situation to avoid that trees with nodes of higher degrees dominate the sum.
If q
T,I,J(G, H) is a positive kernel between G and H, then k
α,β,γT(G, H) also defines a positive kernel. The kernel k
α,β,γT(G, H) sums the local kernel q
T,I,J(G, H) overall all tree-walks of G and H that share the same tree structure. The number of matching tree-walks is exponential in the depth γ, thus, in order to deal with potentially deep trees, a recursive definition is needed. It requires a specific type of local kernels.
2.3 Local kernel
We use a combination (product) of a kernel for attributes and a kernel for positions. For attributes, we use the following usual factorized form k
A(a(I), b(J )) = Q
|I|p=1
k
A(a(I
p), b(J
p), where k
Ais a positive definite kernel on A × A. This allows the separate comparison of each matched pair of points and efficient dynamic programming recursions [11, 4]. However, for our local kernel on positions, we need a kernel that jointly depends on the whole vectors x(I) and y(J), and not only on pairs (x(I
p), y(J
p)).
In this paper, we focus on X = R
dand translation invariant local kernels, which implies that the local kernel for positions may only depend on differences x(i)−x(i
′) and y(j)−y(j
′) for (i, i
′) ∈ I ×I and (j, j
′) ∈ J × J. We further reduce these to the kernel matrices corresponding to a translation invariant kernel k
X(x− x
′). Depending on the application, k
Xmay or may not be rotation invariant.
That is, we define full kernel matrices K ∈ R
|V|×|V|and L ∈ R
|W|×|W|for each graph, defined as K(v, v
′) = k
X(x(v) − x(v
′)) (and similarly for L). For simplicity, we assume that these matrices are positive definite (i.e., invertible), which can be enforced by adding a multiple of the identity matrix.
The local kernel will thus only depend on the submatrices K
I= K
I,Iand L
J= L
J,J, which are positive definite matrices. Note that we use kernel matrices K and L to represent the geometry of each graph, and that we use a kernel on such kernel matrices.
We use the following kernel on positive matrices K and L, the (squared) Bhattacharyya kernel k
B, defined as [14]:
k
B(K, L) = |K|
1/2|L|
1/2K+L2
−1
, (2)
c e
k
i j
h d
b
a
o m n
l f g
c
h d
i k
e
j b
a
o m n
l f g
f g d e
b c
cfg
flm gno dhi ejk
bde abc
Figure 3: (left) rooted tree structure T , (middle) decomposable graphical model Q
1(T ), defined in Section 3.4; (right) corresponding rooted junction tree with cliques (ellipses) and separa- tors (rectangles). By convention, all trees are rooted from top to bottom.
which is a positive kernel with pointwise positive values and such that k
B(K, K ) = 1 (|K| denotes the determinant of K).
By taking the product of the attribute-based local kernel and the position-based local kernel, we get the following local kernel q
T,I,J0(G, H) = k
B(K
I, L
J)k
A(A
I, B
J). However, this local kernel q
0T,I,J(G, H) does not yet depend on the tree structure T and the recursion may be efficient only if q
Tcan be computed recursively. The factorized term k
A(A
I, B
J) does not cause any problems;
however, for the term k
B(K
I, L
J), we need an approximation based on T . As we show in Section 3, this can be obtained by a factorization according to the appropriate graphical model.
3. Positive matrices and graphical models
The main idea underlying the factorization of the kernel is to consider symmetric positive matrices as covariance matrices and look at graphical models defined for Gaussian random vectors with those covariance matrices. In this section we assume that we have n random variables Z
1, . . . , Z
nwith probability distribution p(z) = p(z
1, . . . , z
n). Given a kernel matrix K (in our case defined as K
ij= e
−αkxi−xjk2, for positions x
1, . . . , x
n), we consider random variables Z
1, . . . , Z
nsuch that cov(Z
i, Z
j) = K
ij. In this section, with this identification, we consider covariance matrices as kernel matrices, and vice-versa.
3.1 Graphical models and junction trees
Graphical models provide a flexible and intuitive way of defining factorized probability distributions.
Given any undirected graph Q with vertices in {1, . . . , n}, the distribution p(z) is said to factorize in Q if it can be written as a product of potentials over all cliques (completely connected subgraphs) of the graph Q. When the distribution is Gaussian with covariance matrix K ∈ R
n×n, the distribution factorizes if and only if (K
−1)
ij= 0 for each (i, j) which is not an edge in Q [15, 16].
In this paper, we only consider decomposable graphical models, for which the graph Q is tri- angulated (i.e., there exists no chordless cycles of length larger than 4). In this case, the joint distribution is uniquely defined from its marginals p(z
C) on the cliques C of the graph Q. Namely, if C(Q) is the set of maximal cliques of Q, we can build a tree of cliques, a junction tree, such that p(z) = Q
C∈C(Q)
p(z
C)/ Q
C,C′∈C(Q),C∼C′
p(z
C∩C′). (see Figure 3 for an example of graphical model
and junction tree). The sets C ∩ C
′are usually referred to as separators and we let denote S(Q) the
set of such separators.
3.2 Graphical models and projections
We let denote Π
Q(K) the covariance matrix that factorizes in Q which is closest to K for the Kullback-Leibler divergence between normal distributions with covariance matrices K and L. In this paper, we essentially replace K by Π
Q(K); i.e., we project all our covariance matrices onto a graphical model, which is a classical tool in probabilistic modelling [16]. We leave the study of the approximation properties of such a projection (in particular along the lines of [17]) to future work.
Practically, since our kernel on kernel matrices involves determinant, we simply need to compute
|Π
Q(K)| efficiently. For decomposable graphical models, Π
Q(K) can be obtained in closed form [15]
and its determinant has the following simple expression:
log |Π
Q(K)| = P
C∈C(Q)
log |K
C| − P
S∈S(Q)
log |K
S|. (3)
If the junction tree is rooted (by choosing any clique as the root), then for each clique but the root, a unique parent clique p
Q(C) is defined, and we have:
log |Π
Q(K)| = P
C∈C(Q)
log
|K|KC|pQ(C)|
= P
C∈C(Q)
log |K
C|pQ(C)|, (4) where p
Q(C) is the parent clique of Q (and ∅ for the root clique) and the conditional covariance is defined, as usual, as K
C|pQ(C)= K
C,C− K
C,pQ(C)K
p−1Q(C),pQ(C)K
pQ(C),C.
3.3 Graphical models and kernels
We now propose several ways of defining a kernel adapted to graphical models. All of them are based on replacing determinants |M | by |Π
Q(M )|, and their different decompositions in Eq. (3) and Eq. (4). Using Eq. (3), we obtain the first kernel:
k
QB,0(K, L) = Q
C∈C(Q)
k
B(K
C, L
C) Q
S∈S(Q)
k
B(K
S, L
S)
−1. (5)
However, this is not always a positive kernel for general covariance matrices:
Proposition 1 For any decomposable model Q, the kernel k
B,0Qdefined in Eq. (5) is a positive kernel on the set of covariance matrices K such that for all separators S ∈ S (Q), K
S,S= I. In particular, when all separators have cardinal one, this is a kernel on correlation matrices.
In order to remove the condition on separators, we consider the rooted junction tree representa- tion in Eq. (4) and define another kernel as
k
BQ(K, L) = Q
C∈C(Q)
k
BC|pQ(C)(K, L). (6)
For the root, we define k
BR|∅(K, L) = k
B(K
R, L
R) and the kernels k
C|pB Q(C)(K, L) are defined as kernels between conditional Gaussian distributions of Z
Cgiven Z
pQ(C). We use
k
BC|pQ(C)(K, L) = |K
C|pQ(C)|
1/2|L
C|pQ(C)|
1/21
2
K
C|pQ(C)+
12L
C|pQ(C)+
14(K
C,pQ(C)K
p−1Q(C)−L
C,pQ(C)L
−1pQ(C))
⊗2, (7)
which corresponds to putting a prior with identity covariance matrix on variables Z
Π(Q)and consid- ering the kernel between the resulting joint covariance matrices on variables indexed by (C, p
Q(C)) (we use the notation M
⊗2= M M
⊤). We now have a positive kernel on all covariance matrices:
Proposition 2 For any decomposable model Q, the kernel k
BQ(K, L) defined in Eq. (6) and Eq. (7)
is a positive kernel on the set of covariance matrices.
21 32 43
42 24
34 23 12
3 4
2 1
Figure 4: (left) undirected graph G, (right) graph G
1,1.
Note that the kernel is not invariant by the choice of the particular root of the junction tree.
However, in our setting, this is not an issue because we have a natural way of rooting the junction trees (i.e, following the rooted tree-walk).
In Section 4, we will use the notation k
BI1|I2,J1|J2(K, L) for |I
1| = |I
2| and |J
1| = |J
2| to denote the kernel between covariance matrices K
I1∪I2and L
I1∪I2adapted to the conditional distributions I
1|I
2and J
1|J
2(defined through Eq. (7)).
3.4 Choice of graphical models
Given the rooted tree structure T of a β-tree-walk, we now need to define the graphical model Q
β(T ) that we use to project our kernel matrices. We define Q
β(T ) such that for all nodes in T , the node together with all its β-descendants form a clique, i.e., a node is connected to its β-descendants and all β-descendants are also mutually connected (see Figure 3 for example for β = 1): the set of cliques are thus the set of families of depth β (i.e., with β + 1 generations). Thus our final kernel is:
k
Tα,β,γ(G, H) = X
T∈Tα,γ
f
λ,ν(T ) X
I∈Jβ(T,G)
X
J∈Jβ(T,H)
k
BQβ(T)(K
I, L
J)k
A(A
I, B
J). (8)
The main intuition behind this definition is to sum local similarities over all matching subgraphs.
In order to obtain a tractable formulation, we simply needed to (a) extend the set of subgraphs (to tree-walks of depth γ) and (b) factorize the local similarities along the graphs. Note that the graph Q
β(T ) that we chose is the densest graph for which the following dynamic programming recursions may hold.
4. Dynamic programming recursions
In order to derive dynamic programming recursions, we follow [13] and rely on the fact that α-ary β-tree-walks of G can essentially be defined through 1-tree-walks on the augmented graph of all subtrees of G of depth at most β − 1 and arity less than α.
We thus consider the set V
α,βof non complete rooted (unordered) subtrees of G = (V, E ), of depths less than β − 1 and arity less than α. Given two different rooted unordered labelled trees, they are said equivalent if they share the same tree structure, and this is denoted ∼
t.
On this set V
α,β, we define a directed graph with edge set E
α,βas follows: R
0∈ V
α,βis connected to R
1∈ V
α,βif “the tree R
1extends the tree R
0one generation further”, i.e., if and only if (a) the first β − 2 generations of R
1are exactly equal to one of the complete subtree of R
0rooted at a child of the root of R
0, and (b) the nodes of depth β − 1 of R
1are distinct from the nodes in R
0. This defines a graph G
α,β= (V
α,β, E
α,β) (see Figure 4). Similarly we define a graph H
α,β= (W
α,β, F
α,β) for the graph H .
For a β-tree-walk, the root with its β − 1-descendants must have distinct vertices and thus
correspond exactly to elements of V
α,β. We let k
α,β,γT(G, H, R
0, S
0) denote the same kernel as
defined in Eq. (8), but restricted to tree-walks that start respectively start with R
0and S
0. Note
that if R
0and S
0are not equivalent, then k
α,β,γT(G, H, R
0, S
0) = 0
Figure 5: For digits and Chinese characters: (left) Original characters, (right) thinned and subsam- pled characters
We obtain the following recursion for all R
0∈ V
α,βand and S
0∈ W
α,βsuch that R
0∼
tS
0: k
Tα,β,γ(G, H, R
0, S
0) =
α
X
p=0
λ
pν
(p−1)+X
R
1, . . . , R
p⊂ N
Gα,β(R
0) R
1, . . . , R
pdisjoint
X
S
1, . . . , S
p⊂ N
Hα,β(S
0) S
1, . . . , S
pdisjoint
k
A(A
r(R), B
r(S)) k
∪p
i=1Ri|R0,∪pi=1Si|S0
B
(K, L)
Q
pi=1
k
RBi,Si(K, L)
p
Y
i=1
k
α,β,γ−1T(G, H, R
i, S
i)
! . Note that if any of the trees R
iis not equivalent to S
i, it does not contribute to the sum. The recursion is initialized with k
Tα,β,γ(G, H, R
0, S
0) = λ
|R0|ν
ℓ(R0)k
A(A
R0, B
S0)k
B(K
R0, L
S0) and the final kernel is obtained as k
Tα,β,γ(G, H) = P
R0∼tS0