Application to the Analysis of Dynamic Scenes

S.V.N. Vishwanathan

Statistical Machine Learning Program, National ICT Australia

Canberra, 0200 ACT, Australia

Tel: +61 (2) 6125 8657  E-mail: SVN.Vishwanathan@nicta.com.au

René Vidal

Center for Imaging Science, Department of Biomedical Engineering, Johns Hopkins University

308B Clark Hall, 3400 N. Charles St., Baltimore MD 21218

Tel: +1 (410) 516 7306  E-mail: rvidal@cis.jhu.edu

Alexander J. Smola

Statistical Machine Learning Program, National ICT Australia

Canberra, 0200 ACT, Australia

Tel: +61 (2) 6125 8652  E-mail: Alex.Smola@nicta.com.au

December 16, 2005

Abstract. We derive a family of kernels on dynamical systems by applying the Binet-Cauchy theorem to trajectories of states. Our derivation provides a unifying framework for all kernels on dynamical systems currently used in machine learning, including kernels derived from the behavioral framework, diffusion processes, marginalized kernels, kernels on graphs, and the kernels on sets arising from the subspace angle approach. In the case of linear time-invariant systems, we derive explicit formulae for computing the proposed Binet-Cauchy kernels by solving Sylvester equations, and relate the proposed kernels to existing kernels based on cepstrum coefficients and subspace angles.

Besides their theoretical appeal, these kernels can be used efficiently in the comparison of video sequences of dynamic scenes that can be modeled as the output of a linear time-invariant dynamical system. One advantage of our kernels is that they take the initial conditions of the dynamical systems into account. As a first example, we use our kernels to compare video sequences of dynamic textures. As a second example, we apply our kernels to the problem of clustering short clips of a movie. Experimental evidence shows superior performance of our kernels.

Keywords: Binet-Cauchy theorem, ARMA models and dynamical systems, Sylvester equation, Kernel methods, Reproducing Kernel Hilbert Spaces, Dynamic scenes, Dynamic textures.

Parts of this paper were presented at SYSID 2003 and NIPS 2004.


1. Introduction

The past few years have witnessed an increasing interest in the application of system-theoretic techniques to modeling visual dynamical processes. For instance, Doretto et al. [14, 36] model the appearance of dynamic textures, such as water, foliage, hair, etc. as the output of an Auto-Regressive Moving Average (ARMA) model; Saisan et al. [30] use ARMA models to represent human gaits, such as walking, running, jumping, etc.; Aggarwal et al. [1] use ARMA models to describe the appearance of moving faces; and Vidal et al. [40] combine ARMA models with GPCA for clustering video shots in dynamic scenes.

However, in practical applications we may not only be interested in obtaining a model for the process (identification problem), but also in determining whether two video sequences correspond to the same process (classification problem) or identifying which process is being observed in a given video sequence (recognition problem). Since the space of models for dynamical processes typically has a non-Euclidean structure,1 the above problems have naturally raised the issue of how to do estimation, classification and recognition on such spaces.

The study of classification and recognition problems has been a mainstream area of research in machine learning for the past decades. Among the various methods for nonparametric estimation that have been developed, kernel methods have become one of the mainstays as witnessed by a large number of books [38, 39, 12, 20, 32]. However, not much of the existing literature has addressed the design of kernels in the context of dynamical systems. To the best of our knowledge, the metric for ARMA models based on comparing their cepstrum coefficients [26] is one of the first attempts to address this problem. De Cock and De Moor [10] extended this concept to ARMA models in state-space representation by considering the subspace angles between the observability subspaces. Recently, Wolf and Shashua [47] demonstrated how subspace angles can be efficiently computed by using kernel methods.

A downside of these approaches is that the resulting kernel is in- dependent of the initial conditions. While this might be desirable in many applications, it might cause potential difficulties in others. For instance, not every initial configuration of the joints of the human body is appropriate for modeling human gaits.

1 For instance, the model parameters of a linear dynamical model are determined only up to a change of basis, hence the space of models has the structure of a Stiefel manifold.


1.1. Paper Contributions

In this paper, we attempt to bridge the gap between nonparametric estimation methods and dynamical systems. We build on previous work using explicit embedding constructions and regularization theory [34, 47, 35] to propose a family of kernels explicitly designed for dynamical systems. More specifically, we propose to compare two dynamical systems by computing traces of compound matrices of order q built from the system trajectories. The Binet-Cauchy theorem [2] is then invoked to show that such traces satisfy the properties of a kernel.

We then show that the two strategies which have been used in the past are particular cases of the proposed unifying framework.

− The first family is based directly on the time-evolution properties of the dynamical systems [34, 47, 35]. Examples include the diffusion kernels of [23], the graph kernels of [16, 22], and the similarity measures between graphs of [7]. We show that all these kernels can be found as particular cases of our framework, the so-called trace kernels, which are obtained by setting q = 1.

− The second family is based on the idea of extracting coefficients of a dynamical system (viz. ARMA models), and defining a metric in terms of these quantities. Examples include the Martin distance [26] and the distance based on subspace angles between observability subspaces [10]. We show that these metrics arise as particular cases of our framework, the so-called determinant kernels, by suitable preprocessing of the trajectory matrices and by setting the order of the compound matrices q to be equal to the order of the dynamical systems. The advantage of our framework is that it can take the initial conditions of the systems into account explicitly, while the other methods cannot.

Finally, we show how the proposed kernels can be used to classify dynamic textures and segment dynamic scenes by modeling them as ARMA models and computing kernels on the ARMA models. Experimental evidence shows the superiority of our kernels in capturing differences in the dynamics of video sequences.

1.2. Paper Outline

After a very brief overview of kernel methods and Support Vector Machines (SVMs) in Section 2 and an outline of the basic concepts involved in computing kernels on dynamical systems in Section 3, we derive explicit formulae for so-called trajectory kernels in Section 4.

Extensions to kernels on Markov processes are discussed in Section 5.


The application of our kernels to the analysis of dynamic scenes is presented in Section 6. We conclude with a discussion in Section 7.

2. Support Vector Machines and Kernels

In this section we give a brief overview of binary classification with SVMs and kernels. For a more extensive treatment, including extensions such as multi-class settings, the ν-parameterization, loss functions, etc., we refer the reader to [32, 12, 20] and the references therein.

Given m observations (x_i, y_i) drawn iid (independently and identically distributed) from a distribution over X × Y, our goal is to find a function f : X → Y which classifies observations x ∈ X into classes +1 and −1. In particular, SVMs assume that f is given by

$$f(x) = \operatorname{sign}(\langle w, x\rangle + b),$$

where the sets {x | f(x) ≥ 0} and {x | f(x) ≤ 0} denote half-spaces separated by the hyperplane H(w, b) := {x | ⟨w, x⟩ + b = 0}.

In the case of linearly separable datasets, that is, if we can find (w, b) such that all (x_i, y_i) pairs satisfy y_i f(x_i) = 1, the optimal separating hyperplane is given by the (w, b) pair for which all x_i have maximum distance from H(w, b). In other words, it is the hyperplane which separates the two sets with the largest possible margin.

Unfortunately, not all sets are linearly separable, hence we need to allow a slack in the separation of the two sets. Without going into details (which can be found in [32]) this leads to the optimization problem:

$$\begin{aligned} \underset{w,\,b,\,\xi}{\text{minimize}}\quad & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i \\ \text{subject to}\quad & y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad \forall\, 1 \le i \le m. \end{aligned} \tag{1}$$

Here the constraint y_i(⟨w, x_i⟩ + b) ≥ 1 ensures that each (x_i, y_i) pair is classified correctly. The slack variable ξ_i relaxes this condition at penalty Cξ_i. Finally, minimization of ‖w‖² ensures maximization of the margin by seeking the shortest w for which the condition y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i is still satisfied.

An important property of SVMs is that the optimal solution w can be computed as a linear combination of some of the data points via w = Σ_i y_i α_i x_i. Typically many coefficients α_i are 0, which considerably speeds up the estimation step, since these terms can be dropped from the calculation of the optimal separating hyperplane. The data points associated with the nonzero terms are commonly referred to as support vectors, since they support the optimal separating hyperplane. Efficient codes exist for the solution of (1) [28, 8, 21, 42].

2.1. Kernel Expansion

To obtain a nonlinear classifier, one simply replaces the observations x_i by Φ(x_i). That is, we extract features Φ(x_i) from x_i and compute a linear classifier in terms of the features. Note that there is no need to compute Φ(x_i) explicitly, since Φ only appears in terms of dot products:

− ⟨Φ(x), w⟩ can be computed by exploiting the linearity of the scalar product, which leads to Σ_i α_i y_i ⟨Φ(x), Φ(x_i)⟩.

− Likewise, ‖w‖² can be expanded in terms of a linear combination of scalar products by exploiting the linearity of scalar products twice to obtain Σ_{i,j} α_i α_j y_i y_j ⟨Φ(x_i), Φ(x_j)⟩.

Furthermore, note that if we define

$$k(x, x') := \langle \Phi(x), \Phi(x')\rangle,$$

we may use k(x, x′) wherever ⟨x, x′⟩ occurs. The mapping Φ now takes X to what is called a Reproducing Kernel Hilbert Space (RKHS).

This mapping is often referred to as the kernel trick and the resulting hyperplane (now in the RKHS defined by k(·,·)) leads to the following classification function

$$f(x) = \langle \Phi(x), w\rangle + b = \sum_{i=1}^{m} \alpha_i y_i k(x_i, x) + b.$$

The family of methods relying on the kernel trick is popularly called kernel methods, and SVMs are among the most prominent of them. The main advantage of kernel methods stems from the fact that they do not assume a vector space structure on X. As a result, non-vectorial representations can be handled elegantly as long as we can define meaningful kernels. In the following sections, we show how to define kernels on dynamical systems, i.e., kernels on a set of points {x_i} whose temporal evolution is governed by a dynamical system.
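To make this concrete, the following minimal sketch shows how a precomputed kernel matrix (for instance, one of the dynamical-system kernels derived later) can be plugged into an SVM via scikit-learn's precomputed-kernel interface. It is only an illustration; the Gram matrices are assumed to have been evaluated elsewhere.

```python
# Minimal sketch: SVM classification from a precomputed Gram matrix.
import numpy as np
from sklearn.svm import SVC

def train_svm_from_gram(K_train, y_train):
    """K_train: (m, m) Gram matrix with entries k(x_i, x_j); y_train: labels in {-1, +1}."""
    clf = SVC(kernel="precomputed", C=1.0)
    clf.fit(K_train, y_train)
    return clf

def predict_from_gram(clf, K_test_train):
    """K_test_train: (n_test, m) matrix of kernel values k(x_test, x_train)."""
    return clf.predict(K_test_train)
```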

3. Kernels, Dynamical Systems, and Trajectories

3.1. Dynamical Systems

We begin with some definitions. The state of a dynamical system is defined by some x ∈ X, where X is assumed to be a RKHS with its kernel denoted by κ(·, ·). Moreover, we assume that the temporal evolution of x ∈ X is governed by

$$x_A(t) := A(x, t) \quad\text{for } t \in T,$$

where t is the time of the measurement and A : X × T → X denotes a set of operators indexed by T. Furthermore, unless stated otherwise, we will assume that for every t ∈ T, the operator A(·, t) : X → X is linear and drawn from some fixed set A.² We will choose T = ℕ₀ or T = ℝ₀⁺, depending on whether we wish to deal with discrete-time or continuous-time systems.

We can also assume that x is a random variable drawn from X and A(·, t) is a random operator drawn from the set A. In this case, we need to assume that both X and A are endowed with suitable probability measures. For instance, we may want to consider initial conditions corrupted by additional noise or LTI systems with additive noise.

Also note that x_A(t) need not be the only variable involved in the time evolution process. For instance, for partially observable models, we may only see y_A(t), which depends on the evolution of a hidden state x_A(t). These cases will be discussed in more detail in Section 4.

3.2. Trajectories

When dealing with dynamical systems, one may compare their similarities by checking whether they satisfy similar functional dependencies. For instance, for LTI systems one may want to compare the transfer functions, the system matrices, the poles and/or zeros, etc. This is indeed useful in determining when systems are similar whenever suitable parameterizations exist. However, it is not difficult to find rather different functional dependencies which, nonetheless, behave almost identically, e.g., as long as the domain of initial conditions is sufficiently restricted. For instance, consider the maps

$$x \leftarrow a(x) = |x|^p \quad\text{and}\quad x \leftarrow b(x) = \min(|x|^p, |x|)$$

for p > 1. While a and b clearly differ, the two systems behave identically for all initial conditions satisfying |x| ≤ 1. On the other hand, for |x| > 1 system b is stable while a could potentially diverge. This example may seem contrived, but for more complex maps and higher dimensional spaces such statements are not quite as easily formulated.

One way to amend this situation is to compare trajectories of dy- namical systems and derive measures of similarity from them. The

2 If further generality is required, A(·, t) can be assumed to be non-linear, albeit at the cost of significant technical difficulties.


advantage of this approach is that it is independent of the parameterization of the system. This approach is in spirit similar to the behavioral framework of [44, 45, 46], which identifies systems by identifying trajectories. However, it has the disadvantage that one needs efficient mechanisms to compute the trajectories, which may or may not always be available. We will show later that in the case of ARMA models or LTI systems such computations can be performed efficiently by solving a series of Sylvester equations.

In keeping with the spirit of the behavioral framework, we will consider the pairs (x,A) only in terms of the trajectories which they generate. We will focus our attention on the map

$$\operatorname{Traj}_A : X \to X^T \quad\text{where}\quad \operatorname{Traj}_A(x) := A(x, \cdot). \tag{2}$$

In other words, we define Traj_A(x)(t) = A(x, t). The following lemma provides a characterization of this mapping.

LEMMA 1. Let A : X × T → X be such that for every t ∈ T, A(·, t) is linear. Then the mapping Traj_A defined by (2) is linear.

Proof. Let x, y ∈ X and α, β ∈ ℝ. For a fixed t the operator A(·, t) is linear and hence we have

$$\operatorname{Traj}_A(\alpha x + \beta y)(t) = A(\alpha x + \beta y, t) = \alpha A(x, t) + \beta A(y, t).$$

Since the above holds for every t ∈ T we can write

$$\operatorname{Traj}_A(\alpha x + \beta y) = \alpha \operatorname{Traj}_A(x) + \beta \operatorname{Traj}_A(y),$$

from which the claim follows. ∎

The fact that the state space X is a RKHS and TrajA defines a linear operator will be used in the sequel to define kernels on trajectories.

3.3. Binet-Cauchy Theorem and Kernels on Trajectories

In order to define kernels on trajectories we need to introduce the Binet-Cauchy theorem on matrices. All the concepts discussed in this section can be naturally extended to well-behaved linear operators³, but we will restrict our attention to matrices for ease of exposition. Details about the operator version of the Binet-Cauchy theorem and the proof of a more general version of Theorem 4 can be found in a companion paper [43].

We begin by defining compound matrices, which are formed by choosing subsets of entries of a matrix and computing their determinants. The definition can also be extended to operators.

3 The so-called Fredholm operators.


DEFINITION 2 (Compound Matrix). Let A ∈ ℝ^{m×n}. If q ≤ min(m, n), define I_q^n = {i = (i_1, i_2, . . . , i_q) : 1 ≤ i_1 < . . . < i_q ≤ n, i_j ∈ ℕ} and likewise I_q^m. The compound matrix of order q, C_q(A), is defined as

$$[C_q(A)]_{i,j} := \det\big(A(i_k, j_l)\big)_{k,l=1}^{q} \quad\text{where } i \in I_q^m \text{ and } j \in I_q^n.$$

Here i, j are assumed to be arranged in lexicographical order.

We are now ready to state the Binet-Cauchy theorem on compound matrices.

THEOREM 3 (Binet-Cauchy). Let A ∈ ℝ^{l×m} and B ∈ ℝ^{l×n}. For all q ≤ min(m, n, l) we have C_q(A^⊤B) = C_q(A)^⊤ C_q(B).

When q = m = n = l we have C_q(A) = det(A) and the Binet-Cauchy theorem becomes the well-known identity det(A^⊤B) = det(A) det(B). On the other hand, when q = 1 we have C_1(A) = A, so the above theorem reduces to a tautology. The Binet-Cauchy theorem allows us to state a strengthened version of [47, Corollary 5]:

THEOREM 4. For any two matrices A, B ∈ ℝ^{n×k} and 1 ≤ q ≤ min(n, k), the function

$$k_q(A, B) := \operatorname{tr} C_q(A^\top B)$$

is a kernel. Therefore, the following functions are also kernels:

$$k_1(A, B) = \operatorname{tr} C_1(A)^\top C_1(B) = \operatorname{tr} A^\top B \tag{3}$$

$$k_n(A, B) = \det A^\top B. \tag{4}$$

Proof. From Theorem 3 we have k_q(A, B) = tr C_q(A)^⊤ C_q(B). Since tr C_q(A)^⊤ C_q(B) is a scalar product between the entries of C_q(A) and C_q(B), we may write k_q(A, B) as the dot product Φ_q(A)^⊤ Φ_q(B), where Φ_q(X) is a vector containing all the entries of the matrix C_q(X). It follows that k_q is a kernel with associated embedding Φ_q. ∎

Moreover, notice that we can modify (3) and (4) to include a positive semidefinite weighting matrix W. This leads to kernels of the form tr A^⊤WB and det A^⊤WB, respectively. Efficient computation strategies for both (3) and (4) when A and B are arbitrary matrices can be found in [43]. In the sequel we will focus on applying Theorem 4 to trajectories of dynamical systems, with special emphasis on trajectories of ARMA models.
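As a sanity check on the definitions above, the following minimal sketch evaluates k_q(A, B) = tr C_q(A^⊤B) directly as the sum of the principal q × q minors of A^⊤B. It is exponential in the number of columns and is meant only for small examples; the efficient strategies of [43] are not reproduced here.

```python
# Minimal sketch: Binet-Cauchy trace kernel from its definition.
from itertools import combinations
import numpy as np

def binet_cauchy_trace_kernel(A, B, q):
    """k_q(A, B) = tr C_q(A^T B): sum of all principal q x q minors of A^T B."""
    M = A.T @ B                                     # k x k matrix
    k = M.shape[0]
    return sum(np.linalg.det(M[np.ix_(idx, idx)])
               for idx in combinations(range(k), q))

# q = 1 recovers tr(A^T B); q = k recovers det(A^T B), matching (3) and (4).
```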


3.4. Trajectories and Kernels

To use Theorem 4 for computing kernels on trajectories, we notice from Lemma 1 that Traj_A is a linear operator. Therefore, we can apply (3) and (4) to Traj_A and Traj_{A′} in order to obtain a measure of similarity between dynamical systems A and A′. As we shall see, (3) and some of its refinements lead to kernels which carry out comparisons of state pairs. Equation (4), on the other hand, can be used to define kernels via subspace angles. We describe both variants below.

3.4.1. Trace Kernels

Computing Traj_A(x)^⊤ Traj_{A′}(x′) means taking the sum over scalar products between the rows of Traj_A(x) and Traj_{A′}(x′). Recall that scalar products in the state space X are given by the kernel function κ. Therefore, for trajectories, computing the trace amounts to summing over κ(x_A(t), x′_{A′}(t)) with respect to t. In order to ensure that the sum over t converges we need to use an appropriate probability measure µ over the domain T. Exponential discounting schemes of the form

$$\mu(t) = \lambda e^{-\lambda t} \ \text{ for } \lambda > 0 \text{ and } T = \mathbb{R}_0^+, \qquad \mu(t) = \frac{1}{1 - e^{-\lambda}}\, e^{-\lambda t} \ \text{ for } \lambda > 0 \text{ and } T = \mathbb{N}_0,$$

are a popular choice in reinforcement learning [37, 5] and control theory. Another possible measure is

$$\mu(t) = \delta_\tau(t), \tag{5}$$

where δ_τ corresponds to the Kronecker δ for T = ℕ₀ and to the Dirac δ-distribution for T = ℝ₀⁺.

The particular choice of µ gives rise to a number of popular and novel kernels for graphs and similar discrete objects [23, 16]. We discuss this in more detail in Section 5.2. Using the measure µ we can extend (3) to obtain the following kernel⁴

$$k((x, A), (x', A')) := \mathbf{E}_{t\sim\mu(t)}\big[\kappa\big(x_A(t), x'_{A'}(t)\big)\big]. \tag{6}$$

We can specialize the above equation to define kernels on model parameters as well as kernels on initial conditions.

Kernels on Model Parameters: We can define a kernel on model parameters simply by using (6) and taking expectations over initial conditions. In other words,

$$k(A, A') := \mathbf{E}_{x,x'}\big[k((x, A), (x', A'))\big]. \tag{7}$$

4 Recall that a convex combination of kernels is a kernel. Hence, taking expectations over kernel values is still a valid kernel.


However, we need to show that (7) actually satisfies Mercer's condition [27, 24].

THEOREM 5. Assume that {x, x′} are drawn from an infinitely extendable and exchangeable probability distribution. Then (7) satisfies Mercer's condition.

Proof. By De Finetti's theorem [13], infinitely exchangeable distributions arise from conditionally independent random variables. In practice this means that

$$p(x, x') = \int p(x \mid c)\, p(x' \mid c)\, dp(c)$$

for some p(x|c) and a measure p(c). Hence we can rearrange the integral in (7) to obtain

$$k(A, A') = \int k\big((x, A), (x', A')\big)\, dp(x \mid c)\, dp(x' \mid c)\, dp(c).$$

Here the result of the inner integral is a kernel by the decomposition admitted by Mercer's theorem. Taking a convex combination of such kernels preserves Mercer's condition, hence k(A, A′) is a kernel. ∎

In practice, one typically uses either p(x, x′) = p(x)p(x′) if independent averaging over the initial conditions is desired, or p(x, x′) = δ(x − x′)p(x) whenever the averaging is assumed to occur synchronously.

Kernels on Initial Conditions: By the same strategy, we can also define kernels exclusively on initial conditions x, x′ simply by averaging over the dynamical systems they should be subjected to:

$$k(x, x') := \mathbf{E}_{A,A'}\big[k((x, A), (x', A'))\big]. \tag{8}$$

As before, whenever p(A, A′) is infinitely exchangeable in A, A′, (8) is a proper kernel.⁵ Note that in this fashion the metric imposed on initial conditions is one that follows naturally from the particular dynamical system under consideration. For instance, differences in directions which are rapidly contracting carry less weight than a discrepancy in a divergent direction of the system.

5 Note that we made no specific requirements on the parameterization of A, A′. For instance, for certain ARMA models the space of parameters has the structure of a manifold. The joint probability distribution, by its definition, has to take such facts into account. Often the averaging simply takes additive noise of the dynamical system into account.


3.4.2. Determinant Kernels

Instead of computing traces of Traj_A(x)^⊤ Traj_{A′}(x′) we can follow (4) and compute determinants of such expressions. As before, we need to assume the existence of a suitable measure which ensures the convergence of the sums.

For the sake of computational tractability one typically chooses a measure with finite support. This means that we are computing the determinant of a kernel matrix, which in turn is then treated as a kernel function. This allows us to give an information-theoretic interpretation to the so-defined kernel function. Indeed, [18, 3] show that such determinants can be seen as measures of the independence between sequences. This means that independent sequences can now be viewed as orthogonal in some feature space, whereas a large overlap indicates statistical correlation.

4. Kernels on ARMA Models

A special yet important class of dynamical systems is that of ARMA models of the form

$$y_t = C x_t + w_t \quad\text{where } w_t \sim \mathcal{N}(0, R),$$
$$x_{t+1} = A x_t + v_t \quad\text{where } v_t \sim \mathcal{N}(0, Q), \tag{9}$$

where the driving noise w_t, v_t are assumed to be zero-mean iid normal random variables, y_t ∈ Y are the observed random variables at time t, x_t ∈ X are the latent variables, and the matrices A, C, Q, and R are the parameters of the dynamical system. These systems are also commonly known as Kalman Filters. Throughout the rest of the paper we will assume that Y = ℝ^m and X = ℝ^n with m ≫ n.

In this section, we show how to efficiently compute an expression for the trace and determinant kernels between two ARMA models (x₀, A, C) and (x₀′, A′, C′), where x₀ and x₀′ are the initial conditions.

We also establish a connection between the determinant kernel, the Martin kernel [26], and the kernels based on subspace angles [10].

4.1. Trace kernels on ARMA models

If one assumes an exponential discounting µ(t) = e^{−λt} with rate λ > 0, then the trace kernel for ARMA models is

$$k\big((x_0, A, C), (x_0', A', C')\big) := \mathbf{E}_{v,w}\left[\sum_{t=1}^{\infty} e^{-\lambda t}\, y_t^\top W y_t'\right], \tag{10}$$


where W ∈ ℝ^{m×m} is a user-defined positive semidefinite matrix specifying the kernel κ(y_t, y_t′) := y_t^⊤ W y_t′. By default, one may choose W = 1, which leads to the standard Euclidean scalar product between y_t and y_t′.

A major difficulty in computing (10) is that it involves the computation of an infinite sum. But, as the following lemma shows, this is easily achieved by solving two Sylvester equations. Note that Sylvester equations of the type SXT + UXV = E, where S, T, U, V ∈ ℝ^{n×n}, can be readily solved in O(n³) time with freely available code [15].

THEOREM 6. If e^{−λ}‖A‖ ‖A′‖ < 1, and the two ARMA models (x₀, A, C) and (x₀′, A′, C′) evolve with the same noise realization, then the kernel of (10) is given by

$$k = x_0^\top M x_0' + \frac{1}{e^{\lambda} - 1}\operatorname{tr}\!\big[Q\tilde{M} + WR\big], \tag{11}$$

where M, M̃ satisfy the Sylvester equations

$$M = e^{-\lambda} A^\top C^\top W C' A' + e^{-\lambda} A^\top M A', \qquad \tilde{M} = C^\top W C' + e^{-\lambda} A^\top \tilde{M} A'.$$

Proof. See Appendix A. ∎

In the above theorem, we assumed that the two ARMA models evolve with the same noise realization. In many applications this might not be a realistic assumption. The following corollary addresses the issue of computation of the kernel when the noise realizations are assumed to be independent.

COROLLARY 7. Same setup as above, but the two ARMA models (x₀, A, C) and (x₀′, A′, C′) now evolve with independent noise realizations. Then (10) is given by

$$k = x_0^\top M x_0', \tag{12}$$

where M satisfies the Sylvester equation

$$M = e^{-\lambda} A^\top C^\top W C' A' + e^{-\lambda} A^\top M A'.$$

Proof [Sketch]. The second term on the RHS of (11) is the contribution due to the covariance of the noise model. If we assume that the noise realizations are independent (uncorrelated), the second term vanishes. ∎
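As an illustration, the following minimal sketch evaluates the kernel of Corollary 7 for small state dimensions by solving the Stein-type Sylvester equation through vectorization; this is only a toy implementation, and the O(n³) solvers mentioned above would be preferred in practice.

```python
# Minimal sketch: trace kernel (12) between two ARMA models with independent noise.
import numpy as np

def solve_stein(F, G, E):
    """Solve M = F M G + E via vec(M) = (I - G^T kron F)^{-1} vec(E) (column-major vec)."""
    n, m = E.shape
    lhs = np.eye(n * m) - np.kron(G.T, F)
    vec_m = np.linalg.solve(lhs, E.flatten(order="F"))
    return vec_m.reshape(n, m, order="F")

def arma_trace_kernel(x0, A, C, x0p, Ap, Cp, W, lam):
    """Trace kernel between ARMA models (x0, A, C) and (x0', A', C')."""
    E = np.exp(-lam) * A.T @ C.T @ W @ Cp @ Ap   # constant term of the Sylvester equation
    M = solve_stein(np.exp(-lam) * A.T, Ap, E)   # M = e^{-lam} A^T M A' + E
    return float(x0 @ M @ x0p)                   # k = x0^T M x0'
```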


The complexity of solving the Sylvester equations is O(n³), where n ≪ m is the latent variable dimension. On the other hand, O(nm²) time is required to perform the matrix multiplications and compute A^⊤C^⊤WC′A′ and C^⊤WC′. Computation of WR requires O(m³) time, while QM̃ can be computed in O(n³) time. Therefore, the overall time complexity of computing (12) is O(nm²), while that of computing (11) is O(m³).

In case we wish to be independent of the initial conditions, we can simply take the expectation over x₀, x₀′. Since the only dependency of (11) on the initial conditions manifests itself in the first term of the sum, the kernel on the parameters of the dynamical systems (A, C) and (A′, C′) is

$$k((A, C), (A', C')) := \mathbf{E}_{x_0, x_0'}\big[k((x_0, A, C), (x_0', A', C'))\big] = \operatorname{tr} \Sigma M + \frac{1}{e^{\lambda}-1}\operatorname{tr}\!\big[Q\tilde{M} + WR\big],$$

where Σ is the covariance matrix of the initial conditions x₀, x₀′ (assuming that we have zero mean). As before, we can drop the second term if we assume that the noise realizations are independent.

An important special case is that of fully observable ARMA models, i.e., systems with C = 1 and R = 0. In this case, the expression for the trace kernel (11) reduces to

$$k = x_0^\top M x_0' + \frac{1}{e^{\lambda}-1}\operatorname{tr}(Q\tilde{M}), \tag{13}$$

where M, M̃ satisfy

$$M = e^{-\lambda} A^\top W A' + e^{-\lambda} A^\top M A', \qquad \tilde{M} = W + e^{-\lambda} A^\top \tilde{M} A'.$$

This special kernel was first derived in [35], though with the summation starting at t = 0.

4.2. Determinant kernels on ARMA models

As before, we consider an exponential discounting µ(t) = e^{−λt}, and obtain the following expression for the determinant kernel on ARMA models:

$$k\big((x_0, A, C), (x_0', A', C')\big) := \mathbf{E}_{v,w}\,\det\left[\sum_{t=1}^{\infty} e^{-\lambda t}\, y_t\, y_t'^\top\right]. \tag{14}$$


Notice that in this case the effect of the weight matrix W is just a scaling of the kernel value by det(W); thus, we assume w.l.o.g. that W = 1.

Also, for the sake of simplicity and in order to compare with other kernels, we assume an ARMA model with no noise, i.e., v_t = 0 and w_t = 0. Then, we have the following:

THEOREM 8. If e^{−λ}‖A‖ ‖A′‖ < 1, then the kernel of (14) is given by

$$k((x_0, A, C), (x_0', A', C')) = \det C M C'^\top, \tag{15}$$

where M satisfies the Sylvester equation

$$M = e^{-\lambda} A x_0 x_0'^\top A'^\top + e^{-\lambda} A M A'^\top.$$

Proof. We have

$$k((x_0, A, C), (x_0', A', C')) = \det\left[\sum_{t=1}^{\infty} e^{-\lambda t}\, C A^t x_0\, x_0'^\top (A'^t)^\top C'^\top\right] = \det C M C'^\top,$$

where, by using Lemma 10, we can write

$$M = \sum_{t=1}^{\infty} e^{-\lambda t}\, A^t x_0\, x_0'^\top (A'^t)^\top = e^{-\lambda} A x_0 x_0'^\top A'^\top + e^{-\lambda} A M A'^\top. \qquad\blacksquare$$
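A corresponding minimal sketch for the determinant kernel of Theorem 8 is given below, again with the Sylvester equation solved naively by vectorization; all matrices are placeholders.

```python
# Minimal sketch: determinant kernel (15) between two noise-free ARMA models.
import numpy as np

def arma_det_kernel(x0, A, C, x0p, Ap, Cp, lam):
    """Determinant kernel between (x0, A, C) and (x0', A', C')."""
    n, n2 = A.shape[0], Ap.shape[0]
    E = np.exp(-lam) * A @ np.outer(x0, x0p) @ Ap.T           # constant term
    F, G = np.exp(-lam) * A, Ap.T                              # M = F M G + E
    lhs = np.eye(n * n2) - np.kron(G.T, F)                     # column-major vec convention
    M = np.linalg.solve(lhs, E.flatten(order="F")).reshape(n, n2, order="F")
    return float(np.linalg.det(C @ M @ Cp.T))                   # det(C M C'^T)
```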

As pointed out in [47], the determinant is not invariant to permutations of the columns of Traj_A(x) and Traj_{A′}(x′). Since different columns or linear combinations of them are determined by the choice of the initial conditions x₀ and x₀′, this means, as is obvious from the formula for M, that the determinant kernel does depend on the initial conditions.

In order to make it independent of the initial conditions, as before, we can take expectations over x₀ and x₀′. Unfortunately, although M is linear in both x₀ and x₀′, the kernel given by (15) is multi-linear. Therefore, the kernel depends not only on the covariance Σ of the initial conditions, but also on higher-order statistics. Only in the case of single-output systems do we obtain the simple expression

$$k((A, C), (A', C')) = C M C'^\top, \quad\text{where}\quad M = e^{-\lambda} A \Sigma A'^\top + e^{-\lambda} A M A'^\top,$$

for the kernel on the model parameters only.

4.3. Kernel via subspace angles and Martin kernel

Given an ARMA model with parameters (A, C) as in (9), if we neglect the noise terms w_t and v_t, then the Hankel matrix of the output satisfies

$$Y = \begin{bmatrix} y_0 & y_1 & y_2 & \cdots \\ y_1 & y_2 & \cdots & \\ y_2 & \vdots & \ddots & \\ \vdots & & & \end{bmatrix} = \mathcal{O}\,\begin{bmatrix} x_0 & x_1 & x_2 & \cdots \end{bmatrix} = \begin{bmatrix} C \\ CA \\ CA^2 \\ \vdots \end{bmatrix}\begin{bmatrix} x_0 & x_1 & x_2 & \cdots \end{bmatrix}.$$

Therefore, the infinite output sequence y_t lives in an n-dimensional observability subspace spanned by the columns of the extended observability matrix O. De Cock and De Moor [10] propose to compare ARMA models by using the subspace angles among their observability subspaces. More specifically, a kernel between ARMA models (A, C) and (A′, C′) is defined as

$$k\big((A, C), (A', C')\big) = \prod_{i=1}^{n} \cos^2(\theta_i), \tag{16}$$

where θ_i is the i-th subspace angle between the column spaces of the extended observability matrices O = [C^⊤ A^⊤C^⊤ · · ·]^⊤ and O′ = [C′^⊤ A′^⊤C′^⊤ · · ·]^⊤. Since the subspace angles should not depend on a particular choice of a basis for each one of the observability subspaces, the calculation is typically done by choosing a canonical basis via the QR decomposition. More specifically, the above kernel is computed as det(Q^⊤Q′)², where O = QR and O′ = Q′R′ are the QR decompositions of O and O′, respectively.⁶ Therefore, the kernel based on subspace angles is essentially the square of a determinant kernel⁷ formed from a whitened version of the outputs (via the QR decomposition) rather than from the outputs directly, as with the determinant kernel in (15).
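For comparison, the subspace-angle kernel (16) can be approximated numerically from finite truncations of the observability matrices, as in the following minimal sketch; the truncation depth and the assumption that both models share the same state dimension n are simplifications made here, not part of [10].

```python
# Minimal sketch: subspace-angle kernel via QR bases of truncated observability matrices.
import numpy as np

def observability_matrix(A, C, depth):
    """Stack C, CA, CA^2, ... up to the given truncation depth."""
    blocks, Ak = [], np.eye(A.shape[0])
    for _ in range(depth):
        blocks.append(C @ Ak)
        Ak = Ak @ A
    return np.vstack(blocks)

def subspace_angle_kernel(A, C, Ap, Cp, depth=20):
    """Approximate prod_i cos^2(theta_i) as det(Q^T Q')^2 (assumes equal state dimensions)."""
    Q, _ = np.linalg.qr(observability_matrix(A, C, depth))
    Qp, _ = np.linalg.qr(observability_matrix(Ap, Cp, depth))
    return float(np.linalg.det(Q.T @ Qp) ** 2)
```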

Yet another way of defining a kernel on dynamical systems is via cepstrum coefficients, as proposed by Martin [26].

6 As shown in [10, Note 1 and Equation 8], the squared cosines of the principal angles between the column spaces of O and O′ can be computed as the n eigenvalues of (O^⊤O)^{−1} O^⊤O′ (O′^⊤O′)^{−1} O′^⊤O. From the QR decompositions, Q^⊤Q = Q′^⊤Q′ = I, hence ∏_{i=1}^{n} cos²(θ_i) = det(O^⊤O′)² / (det(O^⊤O) det(O′^⊤O′)).

7 Recall that the square of a kernel is a kernel thanks to the product property.


More specifically, if H(z) is the transfer function of the ARMA model described in (9), then the cepstrum coefficients c_n are defined via the Laplace transformation of the logarithm of the ARMA power spectrum,

$$\log H(z) H(z^{-1}) = \sum_{n \in \mathbb{Z}} c_n z^{-n}. \tag{17}$$

The Martin kernel between (A, C) and (A′, C′) with corresponding cepstrum coefficients c_n and c_n′ is defined as

$$k((A, C), (A', C')) := \sum_{n=1}^{\infty} n\, c_n c_n'. \tag{18}$$

As it turns out, the Martin kernel and the kernel based on subspace angles are the same, as shown in [10].

It is worth noticing that, by definition, the Martin kernel and the kernel based on subspace angles depend exclusively on the model parameters (A, C) and (A′, C′), and not on the initial conditions. This could be a drawback in practical applications where the choice of the initial condition is important when determining how similar two dynamical systems are, e.g., in modeling human gaits [30].

5. Kernels on Continuous Models

The kernels described above assumed discrete-time linear systems. In this section we concentrate on continuous-time systems. The equations are considerably more involved, but the basic ideas still remain the same, i.e., we replace the standard Euclidean metric by the covariance of two trajectories under a dynamical system. We also apply the kernels we derive to graphs and show that a wide range of well-known graph kernels arise as special cases of our approach. This points to deeper connections between various disparate-looking graph kernels, and also demonstrates the utility of our approach.

5.1. Continuous LTI Systems

The time evolution of a continuous-time LTI system (x₀, A) can be modeled as

$$\frac{d}{dt} x(t) = A x(t).$$

It is well known (cf. [25, Chapter 2]) that for a continuous LTI system, x(t) is given by

$$x(t) = \exp(At)\, x(0),$$


where x(0) denotes the initial conditions.

Given two LTI systems (x₀, A) and (x₀′, A′), a positive semi-definite matrix W, and a measure µ(t), we can define the trace kernel on the trajectories as

$$k((x_0, A), (x_0', A')) := x_0^\top \left[\int_0^{\infty} \exp(At)^\top\, W \exp(A' t)\, d\mu(t)\right] x_0'. \tag{19}$$

As before, we need to assume a suitably decaying measure in order to ensure that the above integral converges. The following theorem characterizes the solution when µ(t) = e^{−λt}, i.e., the measure is exponentially decaying.

THEOREM 9. Let A, A′ ∈ ℝ^{m×m}, and let W ∈ ℝ^{m×m} be non-singular. Furthermore, assume that ‖A‖, ‖A′‖ ≤ Λ, ‖W‖ is bounded, and µ(t) = e^{−λt}. Then, for all λ > 2Λ, (19) is well defined and can be computed as

$$k((x_0, A), (x_0', A')) = x_0^\top M x_0',$$

where

$$\left(A - \frac{\lambda}{2}\mathbf{1}\right)^{\!\top} M + M \left(A' - \frac{\lambda}{2}\mathbf{1}\right) = -W.$$

Furthermore, if µ(t) = δ_τ(t) we have the somewhat trivial kernel

$$k((A, x_0), (A', x_0')) = x_0^\top \big[\exp(A\tau)^\top W \exp(A'\tau)\big]\, x_0'. \tag{20}$$

Proof. See Appendix B. ∎
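The Sylvester equation in Theorem 9 has the standard continuous-time form and can be handed to an off-the-shelf solver, as in the following minimal sketch; it assumes λ > 2Λ so that the equation is well posed.

```python
# Minimal sketch: trace kernel of Theorem 9 for two continuous-time LTI systems.
import numpy as np
from scipy.linalg import solve_sylvester

def continuous_lti_trace_kernel(x0, A, x0p, Ap, W, lam):
    """Solves (A - lam/2 I)^T M + M (A' - lam/2 I) = -W and returns x0^T M x0'."""
    m = A.shape[0]
    left = (A - 0.5 * lam * np.eye(m)).T
    right = Ap - 0.5 * lam * np.eye(m)
    M = solve_sylvester(left, right, -W)   # solves left @ M + M @ right = -W
    return float(x0 @ M @ x0p)
```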

5.2. Graphs

A graph G consists of a set of vertices V numbered from 1 to n, and a set of edges E ⊂ V × V. A vertex v_i ∈ V is said to be a neighbor of v_j ∈ V if they are connected by an edge. In some cases, such as in the analysis of molecules, the vertices will be equipped with labels l. The adjacency matrix P of G is an n×n real matrix, with P_{ij} = 1 if (i, j) ∈ E, and 0 otherwise. An undirected graph is a graph whose adjacency matrix is symmetric, while the adjacency matrix of a weighted graph contains non-zero entries P_{ij} ∈ (0, ∞) iff (i, j) ∈ E. Let D be an n×n diagonal matrix with D_{ii} = Σ_j P_{ij}. The Laplacian of G is defined as L := D − P and the normalized Laplacian is L̃ := D^{−1/2} L D^{−1/2} = I − D^{−1/2} P D^{−1/2}. The normalized graph Laplacian often plays an important role in graph segmentation [33].


A walk on G is a sequence of vertices denoted by v₁, v₂, . . ., where v_i and v_{i+1} are connected by an edge for every i. A random walk is a walk where the vertex v_{i+1} is picked uniformly at random from the neighbors of v_i. Let x_t denote a probability distribution over the set of vertices at time t. Then, the time evolution of x_t under a random walk can be modeled as x_{t+1} = P D^{−1} x_t. Observe that this is a fully observed ARMA model and the trace kernel on such a model can be computed using (13).

On the other hand, if we assume a continuous diffusion process on G, then the evolution of x_t can be modeled as the following continuous LTI system:

$$\frac{d}{dt} x(t) = L x(t).$$

− We begin by studying a very simple case. Let e_i denote the canonical unit vector of appropriate dimension, with 1 at the i-th location and zero elsewhere. Given a graph G whose Laplacian is L, set A = A′ = L, W = 1, x₀ = e_i, and x₀′ = e_j in (20) to obtain

$$k((L, e_i), (L, e_j)) = \big[\exp(L\tau)^\top \exp(L\tau)\big]_{ij}.$$

Observe that the kernel now measures the probability that any other state l could have been reached jointly from i and j [23]. In other words, it measures the probability of diffusing from either node i to node j or vice versa. The n×n kernel matrix K can be written as

$$K = \exp(L\tau)^\top \exp(L\tau), \tag{21}$$

and viewed as a covariance matrix between the distributions over vertices (see the sketch after this list).

− Kondor and Lafferty [23] suggested to study diffusion on undirected graphs with W = 1. Their derivations are a special case of (20). Note that for undirected graphs the matrices D and L are symmetric. If x and x′ denote two different initial conditions over vertices of a graph G with Laplacian L, (20) can be further simplified and written as

$$k(x, x') = x^\top \exp(L\tau)^\top \exp(L\tau)\, x' = x^\top \exp(2L\tau)\, x'.$$

This is the kernel proposed in [23].

− If we study an even more particular example of a graph, namely the Cayley graph of a group, we thereby have a means of imposing a kernel on the elements of a group (see also [23] for further details).


− The kernel proposed by Gärtner et al. [16] can be recovered by setting W to have entries in {0, 1} according to whether vertices i and j bear the same label, and considering a discrete-time random walk rather than a continuous-time diffusion process (various measures µ(t) take care of the exponential and harmonic weights).

− Finally, to recover kernels similar to the ones proposed by [22], we need to use a weight matrix W whose (i, j)-th entry measures the overlap between the labels of state i and state j.
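As a small worked example of Eq. (21), the following minimal sketch builds the kernel matrix from an adjacency matrix using the matrix exponential of the graph Laplacian; the toy path graph is a placeholder.

```python
# Minimal sketch: diffusion-type kernel matrix K = exp(L*tau)^T exp(L*tau) of Eq. (21).
import numpy as np
from scipy.linalg import expm

def diffusion_kernel_matrix(P, tau):
    """P: (n, n) adjacency matrix; returns the n x n kernel matrix K."""
    D = np.diag(P.sum(axis=1))
    L = D - P                          # graph Laplacian L = D - P
    E = expm(L * tau)
    return E.T @ E

# Example: a path graph on three vertices.
P = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
K = diffusion_kernel_matrix(P, tau=0.5)
```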

6. Application to the Analysis of Dynamic Scenes

Applications of kernel methods to computer vision have so far been largely restricted to methods which analyze a single image at a time, possibly with some post-processing to take advantage of the fact that images are ordered. Essentially there are two main approaches [19]: kernels which operate directly on the raw pixel data [31, 6], such as the Hausdorff kernel which matches similar pixels in a neighborhood [4], and kernels which exploit the statistics of a small set of features on the image, such as intensity, texture, and color histograms [9]. However, when dealing with video sequences, such similarity measures are highly inadequate as they do not take the temporal structure of the data into account.

The work of Doretto and coworkers [14, 36] points out a means of comparing dynamic textures by first approximating them with an ARMA model, and subsequently computing the subspace angles between the observability subspaces of such models. In this section, we present experiments showing that the proposed Binet-Cauchy kernels outperform the subspace angle based approach when used for comparing dynamic textures. In addition, we show that our kernels can be effectively used for clustering video shots in dynamic scenes.

6.1. Setup

We restrict our attention to video sequences of dynamic scenes whose temporal evolution can be modeled as the output of a linear dynamical model. This assumption has proven to be very reasonable when applied to a large number of common phenomena such as smoke, fire, water flow, foliage, wind, etc., as corroborated by the impressive results of Soatto and coworkers [14].

More specifically, let I_t ∈ ℝ^m for t = 1, . . . , τ be a sequence of τ images, and assume that at every time step we measure a noisy version y_t = I_t + w_t, where w_t ∈ ℝ^m is Gaussian noise distributed as N(0, R). To model the sequence of observed images as a Gaussian ARMA model, we will assume that there exists a positive integer n ≪ m, a process x_t ∈ ℝ^n, and symmetric positive definite matrices Q ∈ ℝ^{n×n} and R ∈ ℝ^{m×m} such that (9) holds. Without loss of generality, the scaling of the model is fixed by requiring that

$$C^\top C = \mathbf{1}.$$

Given a sequence of τ noisy images {y_t}, the identification problem is to estimate A, C, Q and R. Optimal solutions in the maximum likelihood sense exist, e.g., the n4sid method in the system identification toolbox of MATLAB, but they are very expensive to compute for large images. Instead, following Doretto et al. [14], we use a sub-optimal closed-form solution. The advantage of this approach is twofold: a) it is computationally efficient, and b) it allows us to compare our results with previous work.

Set Y := [y₁, . . . , y_τ], X := [x₁, . . . , x_τ], and W := [w₁, . . . , w_τ], and solve

$$\min \|Y - CX\|_F = \min \|W\|_F,$$

where C ∈ ℝ^{m×n}, C^⊤C = 1, and ‖·‖_F denotes the Frobenius norm. The unique solution to the above problem is given by C = U_n and X = Σ_n V_n^⊤, where U_n Σ_n V_n^⊤ is the best rank-n approximation of Y. U_n, Σ_n and V_n can be estimated in a straightforward manner from Y = UΣV^⊤, the Singular Value Decomposition (SVD) of Y [17].

In order to estimate A we solve min ‖AX − X′‖_F, where X′ = [x₂, . . . , x_τ], which again has a closed-form solution using SVD. Now, we can compute v_i = x_i − A x_{i−1} and set

$$Q = \frac{1}{\tau - 1} \sum_{i=1}^{\tau} v_i v_i^\top.$$

The covariance matrix R can also be computed from the columns of W in a similar manner. For more information, including details of an efficient implementation, we refer the interested reader to [14].
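A compact sketch of this closed-form identification step is given below. It is only a minimal rendering of the procedure just described, not the implementation used in our experiments; in particular, estimating A through the pseudo-inverse of the shifted state matrix is an implementation choice made here for brevity.

```python
# Minimal sketch: suboptimal closed-form identification of an ARMA model from frames.
import numpy as np

def identify_arma(Y, n):
    """Y: (m, tau) matrix whose columns are vectorised frames; n: latent dimension.
    Returns estimates (A, C, X, Q) of the ARMA model (9)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :n]                                   # C = U_n, hence C^T C = I
    X = np.diag(s[:n]) @ Vt[:n, :]                 # X = Sigma_n V_n^T
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])       # least-squares fit of x_t ~ A x_{t-1}
    V = X[:, 1:] - A @ X[:, :-1]                   # innovations v_t = x_t - A x_{t-1}
    Q = V @ V.T / (Y.shape[1] - 1)                 # noise covariance estimate
    return A, C, X, Q
```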

6.2. Parameters

The parameters of the ARMA model deserve some consideration. The initial conditions of a video clip are given by the vector x₀. This vector helps in distinguishing scenes which have a different stationary background, but the same dynamical texture in the foreground.⁸ In this case, if we use identical initial conditions for both systems, the similarity measure will focus on the dynamical part (the foreground). On the other hand, if we make use of x₀, we will be able to distinguish between scenes which only differ in their background.

Another important factor which influences the value of the kernel is the value of λ. If we want to give more weight to short-range interactions due to differences in initial conditions, it might be desirable to use a large value of λ, since it results in heavy attenuation of contributions as time t increases. On the other hand, when we want to identify samples of the same dynamical system, it might be desirable to use small values of λ.

Finally, A and C determine the dynamics. Note that there is no requirement that A, A′ share the same dimensionality. Indeed, the only condition on the two systems is that the spaces of observations y_t, y_t′ agree. This allows us, for instance, to determine the approximation qualities of systems with various levels of detail in the parametrization.

6.3. Comparing video sequences of dynamic textures

In our experiments we used some sequences from the MIT temporal texture database. We enhanced this dataset by capturing video clips of dynamic textures of natural and artificial objects, such as trees, water bodies, water flow in kitchen sinks, etc. Each sequence in this combined database consists of 120 frames. For compatibility with the MIT temporal texture database we downsample the images to use a grayscale colormap (256 levels) with each frame of size 115×170. Also, to study the effect of temporal structure on our kernel, we number each clip based on the sequence in which it was filmed. In other words, our numbering scheme implies that clip i was filmed before clip i + 1. This also ensures that clips from the same scene get labeled with consecutive numbers. Figure 1 shows some samples from our dataset.

We used the procedure outlined in Section 6.1 for estimating the model parameters A, C, Q, and R. Once the system parameters are estimated we compute distances between models using our trace kernel (11) as well as the Martin distance (18). We varied the value of the down-weighting parameter λ and report results for two values, λ = 0.1 and λ = 0.9. The distance matrix obtained using each of the above methods is shown in Figure 2.

8 For example, consider a sequence with a tree in the foreground and a lawn in the background versus a sequence with a tree in the foreground and a building in the background.


Figure 1. Some sample textures from our dataset

Darker colors imply a smaller distance value, i.e., two clips are more related according to the value of the metric induced by the kernel.

As mentioned before, clips that are closer to each other on an axis are closely related, that is, they are either from similar natural phenomena or are extracted from the same master clip. Hence a perfect distance measure will produce a block diagonal matrix with a high degree of correlation between neighboring clips. As can be seen from our plots, the kernel using trajectories assigns a high similarity to clips extracted from the same master clip, while the Martin distance fails to do so.

Another interesting feature of our approach is that the value of the kernel seems to be fairly independent of λ. The reason for this might be that we consider long-range interactions averaged out over infinite time steps. Two dynamic textures derived from the same source might exhibit very different short-term behavior due to the differences in the initial conditions. But once these short-range interactions are attenuated we expect the two systems to behave in a more or less similar fashion. Hence, an approach which uses only short-range interactions might not be able to correctly identify these clips.

To further test the effectiveness of our method, we introduced some “corrupted” clips (clip numbers 65–80). These are random clips which clearly cannot be modeled as dynamic textures. For instance, we used shots of people and objects taken with a very shaky camera. From Figures 2 and 3 it is clear that our kernel is able to pick up such random clips as novel. This is because our kernel compares the similarity between each frame during each time step. Hence if two clips are very dissimilar our kernel can immediately flag this as novel.


Figure 3. Typical frames from a few samples are shown. The first row shows four frames of a flushing toilet, while the last two rows show corrupted sequences that do not correspond to a dynamic texture. The distance matrix shows that our kernel is able to pick out this anomaly.

6.4. Clustering short video clips

We also apply our kernels to the task of clustering short video clips.

Our aim here is to show that our kernels are able to capture semantic information contained in scene dynamics which are difficult to model using traditional measures of similarity.

In our experiment we randomly sample 480 short clips (120 frames each) from the movie Kill Bill Vol. 1, and model each clip as the output of a linear ARMA model. As before, we use the procedure outlined in [14] for estimating the model parameters A, C, Q, and R. The kernels described in Section 3 were applied to the estimated models, and the metric defined by the kernel was used to compute the k-nearest neighbors of a clip. Locally Linear Embedding (LLE) [29] was used to cluster and embed the clips in two dimensions using the k-nearest neighbor information obtained above. The two-dimensional embedding obtained by LLE is depicted in Figure 4. Observe the linear cluster (with a projecting arm) in Figure 4. This corresponds to clips which are temporally close to each other and hence have similar dynamics.

For instance, clips on the far right depict a person rolling in the snow, those in the far left corner depict a sword fight, and clips in the center involve conversations between two characters. To visualize this, we randomly select a few data points from Figure 4 and depict the first frame of the corresponding clip in Figure 5.


Figure 4. LLE Embeddings of random clips from Kill Bill

A naive comparison of the intensity values or a dot product of the actual clips would not be able to extract such semantic information.

Even though the camera angle varies with time, our kernel is able to successfully pick out the underlying dynamics of the scene. These experiments are encouraging and future work will concentrate on applying this to video sequence querying.

7. Summary and Outlook

The current paper sets the stage for kernels on dynamical systems as they occur frequently in linear and affine systems. By using correlations between trajectories we were able to compare various systems on a behavioral level rather than via a mere functional description. This allowed us to define similarity measures in a natural way.

In particular, for a linear first-order ARMA process we showed that our kernels can be computed in closed form. Furthermore, we showed that the Martin distance used for dynamic texture recognition is a kernel. The main drawback of the Martin kernel is that it does not take the initial conditions into account and might be very expensive to compute.


Figure 5. LLE embeddings of a subset of our dataset

Our proposed kernels overcome these drawbacks and are simple to compute. Another advantage of our kernels is that by appropriately choosing downweighting parameters we can concentrate on either short-term or long-term interactions.

While the larger domain of kernels on dynamical systems is still untested, special instances of the theory have proven to be useful in areas as varied as classification with categorical data [23, 16] and speech processing [11]. This gives reason to believe that further useful applications will be found shortly. For instance, we could use kernels in combination with novelty detection to determine unusual initial conditions, or likewise, to find unusual dynamics. In addition, we can use the kernels to find a metric between objects such as Hidden Markov Models (HMMs), e.g., to compare various estimation methods. Preliminary work in this direction can be found in [41].

Future work will focus on applications of our kernels to system identification, and computation of closed form solutions for higher order ARMA processes.


Acknowledgements

We thank Frederik Schaffalitzky, Bob Williamson, Laurent El Ghaoui, Patrick Haffner, Daniela Pucci de Farias, and Stéphane Canu for useful discussions. We also thank Gianfranco Doretto for providing code for the estimation of linear models. National ICT Australia is supported by the Australian Government's Program Backing Australia's Ability. SVNV and AJS were partly supported by the IST Programme of the European Community under the Pascal Network of Excellence, IST-2002-506778. RV was supported by Hopkins WSE startup funds, and by grants NSF-CAREER ISS-0447739 and ONR N000140510836.

References

1. G. Aggarwal, A. Roy-Chowdhury, and R. Chellappa. A system identification approach for video-based face recognition. In Proceedings Int. Conf. on Pattern Recognition, Cambridge, UK, August 2004.

2. A. C. Aitken. Determinants and Matrices. Interscience Publishers, 4th edition, 1946.

3. F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.

4. A. Barla, F. Odone, and A. Verri. Hausdorff kernel for 3D object acquisition and detection. In European Conference on Computer Vision '02, number 2353 in LNCS, page 20. Springer, 2002.

5. J. Baxter and P. L. Bartlett. Direct gradient-based reinforcement learning: Gradient estimation algorithms. Technical report, Research School of Information, ANU Canberra, 1999.

6. V. Blanz, B. Schölkopf, H. Bülthoff, C. Burges, V. Vapnik, and T. Vetter. Comparison of view-based object recognition algorithms using realistic 3D models. In C. von der Malsburg, W. von Seelen, J. C. Vorbrüggen, and B. Sendhoff, editors, Artificial Neural Networks ICANN'96, volume 1112 of Lecture Notes in Computer Science, pages 251–256, Berlin, 1996. Springer-Verlag.

7. H. Burkhardt. Invariants on skeletons and polyhedrals. Technical report, Universität Freiburg, 2004. In preparation.

8. C. C. Chang and C. J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/cjlin/libsvm.

9. O. Chapelle, P. Haffner, and V. Vapnik. SVMs for histogram-based image classification. IEEE Transactions on Neural Networks, 10(5), 1999.

10. K. De Cock and B. De Moor. Subspace angles between ARMA models. Systems and Control Letters, 46:265–270, 2002.

11. C. Cortes, P. Haffner, and M. Mohri. Rational kernels. In Advances in Neural Information Processing Systems 15, volume 14, Cambridge, MA, 2002. MIT Press.

12. N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, UK, 2000.

13. B. de Finetti. Theory of Probability, volumes 1–2. John Wiley and Sons, 1990. Reprint of the 1975 translation.
