Future Work - Conjugate Mixture Models for the Modeling of Visual and Auditory Perception

The work presented in this thesis is inspired by biological principles of multimodal integration and comprises models that implement the foundations of low-level multimodal integration.

There are numerous directions in which these models can be extended for the task of audio-visual (AV) multiobject tracking or adjusted for other types of applications. Below we outline the prospective directions of research.

Motion cues for AV integration. In our multimodal integration approach we used colocalization as the core principle, binding different modalities through the 3D object location. Dynamics information could also be included in the common unobserved parameter space. Motion cues can be extracted from both auditory [Lu 2010] and visual [Shi 1994] data. On the one hand, this would reinforce multimodal integration by increasing the dimensionality of the parameter space and better separating the objects being observed. On the other hand, these cues could prove less reliable in realistic settings, such as those found in the CAVA database scenarios. As mentioned in Chapter 5, an increase in the observation covariance leads to significant losses in precision.

Modality-specific features. Multimodal tracking can be improved by extending the model with various modality-specific features. Low-level photometric and spectral characteristics and high-level appearance and acoustic models can be added to the audio-visual integration framework to reinforce clustering and perform more robust tracking. However, increasing the dimensionality of the observation spaces can increase the risk of track losses.

Adding feature spaces. The conjugate clustering model that we developed allows for an arbitrary number of feature spaces. One can include detectors of a different nature, such as sonar or infrared range finders for the localization task, to improve the model performance.

Object statistical models. In this thesis we performed AV tracking under the assumption that objects are represented by feature distributions and that features are generated independently. Other statistical models can be used. One possible generalization would be to consider features to be generated by a marked cluster point process, where the child point processes are governed by some potential function. In fact, conjugate mixtures are a particular case of such a model. This generalization allows for more sophisticated object shapes and appearances. The optimization is usually performed using a variational approach, such as mean field (or force field in physics), simulated field, etc. This kind of model is well suited to sophisticated spatial scene structures with known statistical properties.

Another possibility is to consider partially observed particle diffusion models, governed by drift and diffusion fields, such as those considered in Chapter 7, but without any independence assumptions. These models are potentially capable of reconstructing dependencies between spatial points and thus recovering object shapes. Moreover, clustering can be performed based on regularity assumptions for the drift and diffusion fields. Inference in such models is, however, a hard problem that requires efficient numerical approximations to be developed.

Considering other applications. Multimodal integration can be useful in various other domains where temporal parameter inference is performed based on unaligned data arriving from physically different sensors. Examples could include tracking the state of a chemical reaction in biophysics, airplane tracking by sonar and turbulence data from several independent stations, disease state tracking by multiple biological factors, etc.

Appendix

Contents

A.1 Manifold Sampling for the ITD function Pre-image
A.2 Parameter Inference for Student-t Mixtures

A.1 Manifold Sampling for the ITD function Pre-image.

The goal is to develop a method to sample isosurfaces of the auditory observation space (ITD) function $G$ defined by (2.4):

$$G(s; s_{M_\ell}, s_{M_r}) = \frac{1}{c}\left(\|s - s_{M_\ell}\| - \|s - s_{M_r}\|\right). \tag{A.1}$$

We assume the system to be fully calibrated and the microphones $s_{M_\ell}$ and $s_{M_r}$ to be fixed. Thus, to simplify the notation, we write $G(s)$ instead of $G(s; s_{M_\ell}, s_{M_r})$ in what follows. The sampling technique proposed below follows the general principle of sampling method construction described in Chapter 6 of [Zhigljavsky 1991].
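As a small illustration of (A.1), the ITD function can be evaluated directly from the two microphone positions. The sketch below is not taken from the thesis: the function name `itd`, the value used for the speed of sound $c$ and the microphone half-spacing are assumptions chosen only for the example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for c, not stated in this section


def itd(s, s_left, s_right, c=SPEED_OF_SOUND):
    """ITD function G(s; s_left, s_right) of (A.1), in seconds."""
    return (math.dist(s, s_left) - math.dist(s, s_right)) / c


# Hypothetical setup: microphones on the x axis, 0.2 m apart.
x_F = 0.1
s_left, s_right = (-x_F, 0.0, 0.0), (x_F, 0.0, 0.0)

# A point in the plane x = 0 is equidistant from both microphones.
g_mid = itd((0.0, 1.0, 2.0), s_left, s_right)

# A point on the x axis beyond the right microphone gives the largest ITD, 2 * x_F / c.
g_right = itd((5.0, 0.0, 0.0), s_left, s_right)
```

With this sign convention a positive value means the source is closer to the right microphone.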

Let us take the orthonormal coordinate system whose $x$ axis passes through the two microphones $s_{M_\ell}$ and $s_{M_r}$, directed from the left to the right microphone, and whose origin is located at $(s_{M_\ell} + s_{M_r})/2$. The orientation of the $y$ and $z$ axes can be arbitrary. The microphone coordinates $s_{M_\ell}$ and $s_{M_r}$ are then $(-x_F, 0, 0)$ and $(x_F, 0, 0)$ respectively, for some $x_F \ge 0$.

The locus $G(s) = g_0$ is defined by the equation

$$\|s - s_{M_\ell}\| - \|s - s_{M_r}\| = c g_0, \tag{A.2}$$

which can be written in the $(x, y, z)$ coordinates as

$$\sqrt{(x + x_F)^2 + y^2 + z^2} - \sqrt{(x - x_F)^2 + y^2 + z^2} = c g_0, \tag{A.3}$$

which after some basic algebraic transformations leads to the surface equation

$$-\frac{y^2}{x_F^2 - (c g_0/2)^2} - \frac{z^2}{x_F^2 - (c g_0/2)^2} + \frac{x^2}{(c g_0/2)^2} = 1. \tag{A.4}$$

A surface $S$ defined by (A.4) is a hyperboloid of two sheets with the microphone locations as its foci. The sign of $g_0$ determines which sheet of the hyperboloid to consider, left or right. From (A.2) and the triangle inequality we find that $x_F^2 \ge (c g_0/2)^2$, so by letting $a^2 = x_F^2 - (c g_0/2)^2$ and $b^2 = (c g_0/2)^2$ we obtain

$$-\frac{y^2}{a^2} - \frac{z^2}{a^2} + \frac{x^2}{b^2} = 1, \tag{A.5}$$

which is the canonical representation of a two-sheet hyperboloid. Its asymptotic cone, also known in auditory analysis as "the cone of confusion", is given by

$$-\frac{y^2}{a^2} - \frac{z^2}{a^2} + \frac{x^2}{b^2} = 0. \tag{A.6}$$
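The relation between the ITD value and the hyperboloid can be checked numerically. The sketch below takes $b = c g_0 / 2$ with its sign (an assumption, consistent with $b^2 = (c g_0/2)^2$, that selects the sheet on the side of the nearer microphone), solves the canonical equation for $x$ given $(y, z)$, and verifies that every such point has ITD equal to $g_0$. All names and numerical values are illustrative.

```python
import math


def hyperboloid_axes(x_F, g0, c):
    """Return (a, b) with a^2 = x_F^2 - (c*g0/2)^2 and b = c*g0/2 (signed, by assumption)."""
    b = 0.5 * c * g0
    a_sq = x_F * x_F - b * b
    if a_sq <= 0.0:
        raise ValueError("need |c*g0| < 2*x_F for a non-degenerate hyperboloid")
    return math.sqrt(a_sq), b


def point_on_sheet(y, z, a, b):
    """Solve x^2/b^2 - (y^2 + z^2)/a^2 = 1 for x on the sheet selected by the sign of b."""
    return (b * math.sqrt(1.0 + (y * y + z * z) / (a * a)), y, z)


def itd(s, x_F, c):
    """ITD for microphones at (-x_F, 0, 0) and (x_F, 0, 0)."""
    left, right = (-x_F, 0.0, 0.0), (x_F, 0.0, 0.0)
    return (math.dist(s, left) - math.dist(s, right)) / c


c, x_F, g0 = 343.0, 0.1, 2.0e-4  # illustrative values only
a, b = hyperboloid_axes(x_F, g0, c)

# Deviation of the ITD from g0 at a few points of the sheet.
errors = [abs(itd(point_on_sheet(y, z, a, b), x_F, c) - g0)
          for y, z in [(0.0, 0.0), (0.5, -1.0), (3.0, 2.0)]]
```

The deviations stay at the level of floating-point rounding, which is what one expects if the surface is indeed the pre-image of $g_0$.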

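We parametrize the surface (A.5) by

$$s(\theta) = \left(b\,t,\; a\sqrt{t^2 - 1}\,\cos\varphi,\; a\sqrt{t^2 - 1}\,\sin\varphi\right)^\top, \qquad \theta = (t, \varphi) \in \Theta. \tag{A.7}$$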



The goal is to establish a distribution $P(ds) = p(s)\,ds$ with some pre-defined density $p(s)$ on the hyperboloid (A.5), where $p \ge 0$ is such that $\int_S p(s)\,ds = 1$ and $ds$ is the surface measure on $\Omega = s(\Theta)$. We make use of a well-known fact on measure transformation (see [Schwarz 1993], §2 of Chapter 6):

$$ds = D(\theta)\,\mu(d\theta), \qquad D(\theta) = \sqrt{\det\left(J^\top J\right)},$$

where $J$ is the Jacobian matrix of $s(\theta)$ and $\mu$ is the Lebesgue measure on $\mathbb{R}^2$. In particular, for the mapping $s(\theta)$ defined by (A.7) one has

$$D(\theta) = \sqrt{a^2 b^2 (t^2 - 1) + a^4 t^2}. \tag{A.10}$$

We can now define the sampling algorithm for $P(ds)$ on the hyperboloid surface. For that, one has to draw realizations of a random vector $\zeta$ with distribution

$$P_2(d\theta) = p(s(\theta))\,D(\theta)\,\mu(d\theta), \tag{A.11}$$

and consider the random vector $\xi = s(\zeta)$, which is distributed according to $P(ds)$.
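The factor (A.10) can be checked numerically. The sketch below assumes the parametrization $s(\theta) = (b t,\ a\sqrt{t^2-1}\cos\varphi,\ a\sqrt{t^2-1}\sin\varphi)$ with $\theta = (t, \varphi)$, which satisfies (A.5) but is an assumption made for this example. It compares the standard surface-area factor $\sqrt{\det(J^\top J)}$, computed from a finite-difference Jacobian, with the closed form. Function names and numerical values are illustrative.

```python
import math


def s_of_theta(t, phi, a, b):
    """Assumed parametrization of the hyperboloid (A.5), for t >= 1."""
    r = a * math.sqrt(t * t - 1.0)
    return (b * t, r * math.cos(phi), r * math.sin(phi))


def area_density_closed_form(t, a, b):
    """D(theta) as given in (A.10); it does not depend on phi."""
    return math.sqrt(a * a * b * b * (t * t - 1.0) + a ** 4 * t * t)


def area_density_numeric(t, phi, a, b, h=1e-6):
    """sqrt(det(J^T J)) with the columns of J estimated by central differences."""
    def diff(plus, minus):
        return [(p - m) / (2.0 * h) for p, m in zip(plus, minus)]

    s_t = diff(s_of_theta(t + h, phi, a, b), s_of_theta(t - h, phi, a, b))
    s_p = diff(s_of_theta(t, phi + h, a, b), s_of_theta(t, phi - h, a, b))
    # Entries of the 2x2 Gram matrix J^T J.
    E = sum(v * v for v in s_t)
    F = sum(u * v for u, v in zip(s_t, s_p))
    G = sum(v * v for v in s_p)
    return math.sqrt(E * G - F * F)


a, b = 0.09, 0.03  # illustrative semi-axes
rel_errors = [
    abs(area_density_numeric(t, phi, a, b) / area_density_closed_form(t, a, b) - 1.0)
    for t, phi in [(1.5, 0.3), (2.0, 2.0), (4.0, 5.5)]
]
```

The test points are kept away from $t = 1$, where the derivative of $\sqrt{t^2 - 1}$ is unbounded and finite differences become unreliable.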

For the important case of $\xi$ being distributed uniformly on the hyperboloid $s(\theta)$, $\theta \in \Theta$, with parameter domain $\Theta = [1, T] \times [0, 2\pi]$, one should consider

$$\zeta \sim \alpha \sqrt{a^2 b^2 (t^2 - 1) + a^4 t^2}\,d\theta \tag{A.12}$$

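with

$$\alpha^{-1} = \int_\Theta \sqrt{a^2 b^2 (t^2 - 1) + a^4 t^2}\,d\theta.$$

The latter integral can be readily computed, which gives the following expression: $\alpha = \pi|a| \ldots$

The most natural way to sample the random variable $\zeta$ according to (A.12) is the acceptance-rejection method [Ermakov 1975].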
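A minimal acceptance-rejection sketch for the uniform case is given below. It is one possible instance of the method, not necessarily the scheme used in the thesis. It assumes the parametrization $s(\theta) = (b t,\ a\sqrt{t^2-1}\cos\varphi,\ a\sqrt{t^2-1}\sin\varphi)$, proposes $t$ uniformly on $[1, T]$, and accepts with probability $D(t)/D(T)$, which is valid because $D$ is increasing in $t$. The helper `inverse_alpha` is a closed form for the integral of $D$ over $\Theta$ derived for this example by elementary integration; it is not copied from the source. All names and numerical values are illustrative.

```python
import math
import random


def density_factor(t, a, b):
    """D(theta) of (A.10); it does not depend on the angle phi."""
    return math.sqrt(a * a * b * b * (t * t - 1.0) + a ** 4 * t * t)


def sample_uniform_on_hyperboloid(a, b, T, rng):
    """Draw one point uniformly (w.r.t. surface area) on the patch t in [1, T]."""
    d_max = density_factor(T, a, b)        # D is increasing in t, so D(T) bounds it
    while True:
        t = rng.uniform(1.0, T)            # proposal: uniform on [1, T]
        if rng.random() * d_max <= density_factor(t, a, b):
            break                          # accepted with probability D(t) / D(T)
    phi = rng.uniform(0.0, 2.0 * math.pi)  # the target density is flat in phi
    r = a * math.sqrt(t * t - 1.0)
    # Map theta = (t, phi) to the surface with the assumed parametrization.
    return (b * t, r * math.cos(phi), r * math.sin(phi))


def inverse_alpha(a, b, T):
    """Integral of D over [1, T] x [0, 2*pi] in closed form (derived for this sketch)."""
    k = math.hypot(a, b)                   # equals x_F, since a^2 + b^2 = x_F^2
    root = math.sqrt(k * k * T * T - b * b)
    log_term = math.log((k * T + root) / (k + abs(a)))
    return math.pi * abs(a) * (T * root - abs(a) - (b * b / k) * log_term)


rng = random.Random(0)                     # fixed seed for reproducibility
a, b, T = 0.09, 0.03, 5.0                  # illustrative values
points = [sample_uniform_on_hyperboloid(a, b, T, rng) for _ in range(2000)]
```

Because the proposal is uniform and $D(1) = a^2 > 0$, the acceptance probability is bounded away from zero; a proposal closer to the shape of $D$ would reduce the number of rejected draws for large $T$.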
