
We evaluated the dynamic version of the ConjEM algorithm on the meeting, tracking and cocktail party scenarios (sequences M1, TTOS1 and CTMS3 of the CAVA database presented in Chapter 2). Both auditory activity estimation and tracking accuracy were considered, as previously in Chapter 5.

Since the exact system dynamics in the general audio-visual tracking task are not known and one can only make assumptions about the speed of scene changes, we adopt the stochastic approximation approach to multimodal multiobject tracking for the diffusion part. The gain function was taken to be constant, γ(t) ≡ 0.1. We assumed independent object dynamics and took the drift term to coincide with the direction of the ConjEM algorithm optimization. Thus the stationarity condition is asymptotically fulfilled. The diffusion part of (7.8) was not included.
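A minimal sketch of this update, with the ConjEM direction replaced by the plain EM update direction for a Gaussian mean (the full ConjEM direction also couples the auditory and visual parameter spaces); names and shapes here are illustrative, not the thesis implementation:

```python
import numpy as np

GAMMA = 0.1  # constant gain, gamma(t) = 0.1

def em_mean_direction(mu, points, resp):
    """Stand-in for the ConjEM ascent direction: the responsibility-
    weighted observation mean minus the current mean estimate."""
    w = resp / resp.sum()
    return w @ points - mu

def drift_step(mus, points, resp_per_object):
    """One stochastic-approximation step with independent object
    dynamics; the diffusion term of (7.8) is omitted."""
    return [mu + GAMMA * em_mean_direction(mu, points, r)
            for mu, r in zip(mus, resp_per_object)]

# toy usage: two objects observed in the 3D ambient space
points = np.random.randn(50, 3)
resp = np.abs(np.random.rand(2, 50))
mus = drift_step([np.zeros(3), np.ones(3)], points, resp)
```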

To account for scene configuration changes (objects that enter and exit the scene, complete occlusions), we run the Initialize and Select procedures to propose new clusters and accept/reject them, or to delete existing clusters that no longer receive observations. This strategy resembles a jump-diffusion process [Grenander 1994, Jacobsen 2006], where diffusion is carried out through (7.8) and jumps are generated by the initialization and selection procedures. Similar approaches can be found in video-based tracking [Yao 2008].
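The alternation can be sketched as follows; `initialize` and `select` are hypothetical stand-ins for the thesis procedures, and the diffusion move is any parameter update of the form sketched above:

```python
def initialize(clusters, observations):
    """Hypothetical stand-in for the Initialize procedure: propose new
    clusters for observations that existing clusters explain poorly."""
    return []  # no proposals in this toy version

def select(clusters, observations):
    """Hypothetical stand-in for the Select procedure: accept/reject
    proposed clusters and drop those no longer receiving observations."""
    return [c for c in clusters if c.get("n_assigned", 1) > 0]

def track_frame(clusters, observations, drift_step):
    """One frame of the jump-diffusion-like scheme."""
    clusters = drift_step(clusters, observations)      # diffusion via (7.8)
    proposals = initialize(clusters, observations)     # jumps: births
    return select(clusters + proposals, observations)  # jumps: deaths
```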

One advantage of considering the dynamic model is that different time scales can be used for different modalities to estimate object activity. Considering longer time intervals for auditory data improves auditory activity detection; see Table 7.1 and Table 5.4. Some short-term effects of ambient sounds and reverberations are eliminated, which decreases ‘false alarm’ probabilities.
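As an illustration, auditory activity for an object can be decided over a longer sliding window than the one used for per-frame visual updates; the window length and threshold below are illustrative choices, not values from the thesis:

```python
from collections import deque

AUDIO_WINDOW = 30  # frames pooled for auditory activity (illustrative)

class ActivityDetector:
    """Declares an object 'speaking' only if enough auditory observations
    were assigned to it over the long window, which filters out short
    ambient sounds and reverberations (fewer false alarms)."""

    def __init__(self, threshold=5):
        self.counts = deque(maxlen=AUDIO_WINDOW)
        self.threshold = threshold

    def update(self, n_assigned_this_frame):
        self.counts.append(n_assigned_this_frame)
        return sum(self.counts) >= self.threshold
```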

Spatial localization results are also improved with respect to those from Chapter 5. The dynamic version of the ConjEM algorithm can handle not only partial but also complete occlusions. Different cases are demonstrated on the cocktail party (CTMS3) sequence in Figure 7.2. After the objects are initialized (a), one of them gets completely occluded (b)-(c), which results in track loss and a subsequent re-detection (d). Another occlusion happens to be more rapid (e)-(f), so that the object reappears before the cluster is eliminated.


Figure 7.1: Estimated ambient space trajectories for the (a) moving target (TTOS1) and (b) cocktail party (CTMS3) scenarios. Motion is shown with a colour gradient from darker to lighter colours. Points where the algorithm lost/regained track of an object are marked with coloured points. The green object is occluded several times. The first time its track is lost and regained, the estimate captures part of the red object that passes nearby and follows it (right green segment). But after the second occlusion the green object is re-detected and properly tracked (middle green segment). The third occlusion (leftmost green segment) does not spoil the estimate. The blue object was not lost even after one occlusion.

In general, the trajectories obtained with the dynamic version of the ConjEM algorithm are smoother and more precise: the position estimates are within 5 cm of the object location in the XZ-plane. The precision in the Y coordinate (vertical axis) is typically worse, because the clusters in the scenarios we consider are elongated and admit greater variability in the vertical direction. A summary of the estimated trajectories for the moving target (TTOS1) and cocktail party (CTMS3) scenarios is given in Figure 7.1; see Figure 5.9 for comparison. Object motion is shown with a colour gradient from darker to lighter colours. Points where the algorithm lost/regained track of an object in the CTMS3 sequence are marked with coloured points.

7.4 Discussion

The multimodal multiobject tracking task is a hard problem due to the strong noise contaminating the observations and the scene dynamics, which are usually hard to estimate even in the absence of noise. In this chapter we addressed this problem within the ConjEM framework.

We showed how our approach can be efficiently combined with different tracking techniques to benefit from the integration of both the spatial information coming from multiple modalities and the temporal information maintained by the system.

On the one hand, the powerful ConjEM framework with the efficient Initialize and Select procedures provides parameter inference from multiple modalities, automatically weighting the data according to the amount of information it contains. It enhances weak multimodal clusters that can then be detected and tracked, and is hence responsible for the scene configuration representation.

Figure 7.2: Cocktail party scenario tracking results: (a) frames 281-290, (b) frames 320-329, (c) frames 326-335, (d) frames 338-347, (e) frames 346-355, (f) frames 351-360. After the objects are initialized (a), one of them gets completely occluded (b)-(c), which results in track loss and a detection that follows (d). Another occlusion happens to be more rapid (e)-(f), so that the object reappears before the cluster is eliminated. The results are shown projected onto the left image plane. Colours encode the observation-to-cluster assignments and the auditory activity is shown with the speaker symbol. The “visual” covariance matrix associated with each Gaussian component is projected onto the image plane.

On the other hand, a well-established framework for parameter inference in dynamic systems provides proper temporal tracks of multiple multimodal objects. An object can become completely invisible for a short period of time and nevertheless still be followed using the estimated trajectory information.
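A minimal illustration of tracking through complete invisibility, assuming a constant-velocity motion model (one standard choice; the chapter does not commit to this specific form):

```python
import numpy as np

def propagate(mu, velocity, n_obs_assigned):
    """If a cluster receives no observations (complete occlusion), move
    it by pure prediction so the track survives short invisibility;
    otherwise the data-driven ConjEM update takes over."""
    if n_obs_assigned == 0:
        return mu + velocity  # prediction from estimated trajectory
    return mu  # position will be refined by the observation-driven step

# toy usage: an occluded object keeps moving along its estimated velocity
mu = propagate(np.array([1.0, 0.0, 2.0]), np.array([0.05, 0.0, 0.0]), 0)
```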

The results show a clear advantage of joint multimodal tracking over the ConjEM algorithm alone, as well as potential benefits over single-modality tracking techniques through multimodal enhancement. Both object auditory activity and ambient space position estimates were improved with respect to the ConjEM results presented in Chapter 5.

We outline the advantages of the dynamic ConjEM framework:

• Fully multimodal: the framework benefits from the ConjEM capability of putting all the modalities on an equal footing, weighting them based on the amount of information they provide and integrating the multimodal data;

• Multimodal enhancement: the ability of the Initialize and Select procedures to detect and enhance stimuli from one modality with stimuli from other modalities, reinforcing weak clusters, can also be exploited within the dynamic ConjEM framework;

• Extensibility: as with the ConjEM framework, various multimodal features can be added to the dynamic ConjEM to improve tracking;

• Robust tracking: ConjEM allows for efficient integration with well-established tracking techniques that can handle the temporary invisibility of an object.

Conclusion


The goal of my thesis was to develop a full and efficient framework for audio-visual integration and, in particular, for audio-visual object detection and localization.

I first addressed the problem making the simplifying assumption of a quasi-stationary or slowly varying object configuration. Under this assumption, I developed a full framework possessing attractive theoretical properties that solves a number of important issues:

i) hardware calibration (Chapter 3); ii) estimation of the number of objects (Chapter 6); iii) efficient and accurate initialization (Chapter 6); iv) consistent multimodal integration (Chapters 4 and 5); and v) guaranteed accuracy and reliability (Chapter 5). The ideas and models that I developed in this framework are general and can potentially be applied to any multimodal clustering task. All the theoretical facts proved about the models are application-independent. However, in the experimental results sections of this thesis I demonstrate how to tune every proposed technique for the particular case of audio-visual integration.

Then I showed that this framework can still be used without the assumption on scene dynamics, and addressed the inclusion of object dynamics in the multimodal integration model (Chapter 7). Again, the proposed approach is general and uses a well-established methodology; it can be applied to various multimodal tasks. I believe that this combination of a multimodal integration model with system dynamics is very promising, in that it further develops the conjugate clustering approach towards the conjugate filtering framework. The latter offers a broader range of applications and better performance in terms of robustness to configuration changes (such as visual occlusions) and track losses.

We proceed with a summary of the major contributions of the thesis and a discussion of perspectives for future research.

8.1 Main Contributions

This thesis contains a number of original contributions that can be split into two groups: i) theoretical models and facts on multimodal integration, and ii) their versions tuned for the task of audio-visual integration. Below we provide a summary of both groups.

Theoretical models and facts

• Conjugate mixture models (CMM): the formalism of conjugate mixture models was introduced to address the multimodal integration task. It preserves the characteristics that are specific to each modality, while reinforcing integration through the features that are common. Asymptotic identifiability of CMMs is proved and various extensions are proposed, concerning different choices of single-modality mixtures, conjugate random fields and conjugate point processes;

• Kullback Proximal optimization algorithm family for CMM: a class of optimization algorithms for Gaussian CMM was derived within the Kullback Proximal (KP) framework, and their convergence properties were discussed;

• Efficient conjugate EM implementation for CMM: the multimodal EM algorithm (ConjEM) that belongs to the KP family was improved by transforming the optimization problem to a more convenient form. Several acceleration strategies were proposed. Attractive convergence properties were proved for a large class of CMM models;

• CMM initialization based on predictive densities: an efficient method for CMM initialization was proposed, based on predictive densities constructed from multimodal data. This method is fully multimodal in the sense that it puts all the modalities on equal footing. It plays the role of a sampling technique for an optimization algorithm for CMM, providing the characteristics of a global optimization method and improving the convergence speed and the final estimate;

• Multimodal criterion for model selection: a multimodal criterion for CMMs was formulated and its consistency properties were proved. Together with the multimodal initialization strategy and the ConjEM optimization algorithm, it provides an efficient multimodal integration strategy that enables multimodal enhancement;

• Multimodal filtering algorithms: several possibilities for extending CMMs to multimodal tracking tasks were offered, and a way to efficiently combine filtering algorithms with the CMM initialization and model selection algorithms was described.

Audio-visual (AV) integration contributions

• CAVA database: a set of realistic AV scenarios was designed and acquired to provide an evaluation ground for multimodal algorithms that work with head-like devices comprising two microphones and two cameras; annotation was performed for certain scenarios;

• AV calibration: the AV calibration algorithm was developed to ensure proper alignment of A and V data; its evaluation on synthetic and real data is provided;

• AV localization and activity detection: the theoretical CMM framework was applied to the task of localizing multiple AV objects; we consider the AV integration task in 3D space, which reinforces the integration; the acceleration strategies for the case of AV data are derived and demonstrated, and different implementation aspects of the optimization algorithms are discussed; the performance is shown on simulated data and CAVA database scenarios; localization is verified for both quasi-static and dynamic scenes;

• AV object detection: the proposed AV object detection method is based on the multimodal initialization and model selection strategies; it demonstrates AV enhancement, efficiently combining input AV data to detect objects that are poorly represented in one of the modalities; AV object detection is demonstrated on simulated and real data from the CAVA database;

• AV object tracking: the AV object tracking task is addressed within the proposed framework of multimodal filtering algorithms; our approach uses all the techniques developed for the case of AV data for multimodal object localization and detection; the verification is performed on CAVA database recordings, among which we included the challenging cocktail party scenario.