Information fusion for multimedia: exploiting feature interactions for semantic feature selection and construction

Thesis

Reference

Information fusion for multimedia: exploiting feature interactions for semantic feature selection and construction

KLUDAS, Jana


KLUDAS, Jana. Information fusion for multimedia: exploiting feature interactions for semantic feature selection and construction. Thèse de doctorat : Univ. Genève, 2010, no. Sc. 4282

URN : urn:nbn:ch:unige-145394

DOI : 10.13097/archive-ouverte/unige:14539

Available at:

http://archive-ouverte.unige.ch/unige:14539

Disclaimer: layout of this document may differ from the published version.


UNIVERSITÉ DE GENÈVE

Département d'informatique

FACULTÉ DES SCIENCES

Dr. Stéphane Marchand-Maillet

Information Fusion for Multimedia:

Exploiting Feature Interactions for Semantic Feature Selection and Construction

THÈSE

présentée à la Faculté des sciences de l'Université de Genève pour obtenir le grade de Docteur ès sciences, mention informatique

par

Jana Kludas

de

Dessau (Allemagne)

Thèse No 4282

GENÈVE

2011


Résumé

Le traitement joint de données multimédia est un domaine de recherche qui a été très actif au cours des dix dernières années. Cet engouement est notamment dû au raz-de-marée de données multimédia telles que les vidéos ou les photos créées de nos jours grâce à des appareils de capture polyvalents tels que les appareils photo sur les téléphones mobiles, et à la facilité de publication sur Internet. De plus, le succès des approches basées uniquement sur les modalités visuelles reste bien inférieur à ce qu'on pouvait espérer. De sorte que la fusion d'informations provenant de sources multiples, les modalités, est devenue de plus en plus intéressante. D'autres champs d'application de la fusion multimodale sont, par exemple, l'identification biométrique et les systèmes de reconnaissance d'émotion, dont on espère améliorer la précision et la fiabilité par rapport aux systèmes unimodaux.

Le défi principal de la fusion d'information multimodale pour les champs d'application mentionnés ci-dessus est que la fusion ne peut avoir lieu qu'en concaténant les caractéristiques extraites de toutes les modalités, ce qui amène à des espaces de grande dimension. La plupart des algorithmes fonctionnent mal dans ces espaces (curse of dimensionality) : ils tendent à baisser en performance alors même qu'ils possèdent plus d'informations. Un autre problème apparaissant souvent avec les données multimédia est que le but exact de l'apprentissage est généralement inconnu. Ainsi, en général, les caractéristiques de bas niveau qui sont extraites tendent à être de grande dimension, éparses, bruitées, et possèdent un faible pouvoir de description. Le fossé entre les caractéristiques de bas niveau et le haut niveau de la signification sémantique des données est appelé semantic gap ou fossé sémantique.

Le but de ce travail est de détecter les interactions complexes et multivariées des caractéristiques dans le contexte d'une étiquette de classe. Les attributs fortement pertinents amènent une description sémantique de l'étiquette de la classe qui facilite le traitement des données multimodales, grâce à la réduction de dimension, et comble le fossé sémantique. Les interactions entre les caractéristiques sont détectées à l'aide de mesures d'information mutuelle multivariées. Il a été découvert que la capacité à apprendre la représentation d'une étiquette de classe est fortement dépendante de sa complexité, qui peut être déterminée par le type d'interactions qu'elle contient (redondance ou synergie) et le nombre de caractéristiques impliquées.

Contrairement aux autres applications de la fusion d'information, dans la fusion multimodale par concaténation le type d'interactions que sous-tend un apprentissage spécifique n'est pas clair a priori. Ainsi, la réduction de dimension, la sélection de caractéristiques et les autres techniques de fusion d'informations qui réalisent une stratégie de fusion coopérative, complémentaire ou compétitive donnent de bons résultats sur quelques tâches d'apprentissage uniquement, et jamais sur toutes. Cela peut être contourné en mettant en œuvre une stratégie de fusion flexible qui s'adapte à la structure de données sous-jacente.

Ce travail propose un algorithme de sélection et de construction de caractéristiques (feature selection and construction, FS/FC) qui réduit la grande dimensionnalité de l'espace de fusion aux caractéristiques pertinentes pour un apprentissage donné, en détectant leur synergie et/ou redondance. Pour un jeu de données artificiel de quarante fonctions booléennes avec une complexité variable, nous montrons le fonctionnement de l'algorithme et comment la complexité de la tâche influence sa capacité d'apprentissage. La comparaison des résultats de classification entre l'algorithme FS/FC et d'autres algorithmes standards de réduction de dimensionnalité et de sélection de caractéristiques utilisés en prétraitement montre la capacité du FS/FC à s'adapter aux différentes structures de données, alors que les autres méthodes se spécialisent uniquement pour un seul type.

La théorie des modèles structuraux constitue un paradigme puissant pour déterminer l'interaction des caractéristiques pour des données complexes. Malheureusement, sa mise en œuvre est coûteuse en temps de calcul et peu pratique pour des données de grande dimension. Toutefois, elle peut être utilisée pour montrer un autre avantage de la construction de caractéristiques en dehors de l'élaboration d'une hiérarchie de sous-ensembles d'attributs indépendants. Les expériences montrent que les interactions de caractéristiques de grandes dimensions peuvent être approximées par des modèles structuraux avec des composants de plus petites dimensions qui sont plus faciles à classifier.

Finalement, l'algorithme de sélection et de construction de caractéristiques (FS/FC) est appliqué à des données réelles qui varient fortement en taille et complexité : Corel, une base de données médicale, ainsi qu'un jeu de données utilisé pour la reconnaissance d'émotions.

L'algorithme s'adapte aux grandes dimensions (>1000) grâce à la création d'un index épars et linéaire pour le calcul de la probabilité jointe qui sous-tend la mesure de l'information mutuelle multivariée. Les résultats de la classification montrent que l'algorithme proposé dépasse les performances des autres algorithmes de réduction de dimension et de sélection de caractéristiques grâce à sa stratégie de fusion adaptative qui montre, dans la plupart des cas, une amélioration des performances par rapport au point de référence utilisant toutes les caractéristiques. Les autres méthodes souffrent d'une perte de performance quand la structure des données ne correspond pas à la stratégie de fusion, ce qui n'arrive jamais avec l'algorithme FS/FC.


Abstract

The joint processing of multimedia data has received a lot of attention from computer scientists and engineers over the last decade. This is due, for one, to the flood of multimedia data such as videos and photographs that are easily created nowadays with ubiquitous capture devices like mobile-phone cameras and can be instantly published on the Internet. Secondly, the success of content-only processing approaches that use only the visual modality has so far stayed behind expectations. Therefore, the fusion of multiple information sources, so-called modalities, has come into the focus of interest. Other application areas of multimodal fusion are, for example, biometric identification and emotion recognition systems, where fusion is hoped to increase accuracy and reliability over monomodal systems.

The main challenge of multimodal information fusion for the application areas mentioned above is that fusion can only be done by concatenating the features extracted from all modalities, which leads to high dimensional input spaces. Most algorithms do not perform well in high dimensions (curse of dimensionality): they tend to decrease in performance even though they have more information at hand. Another problem that occurs mostly for multimedia data is that the exact learning target is unknown. Thus, in general, low level features have to be extracted that tend to be high dimensional, sparse, noisy and of little descriptive value. The gap between the low-level features and the high level semantic meanings contained in the data is called the semantic gap.

Therefore, the goal of this work is to detect complex, multivariate feature interactions in the context of a class label. The strongly relevant attributes result in a semantic description of the class label that facilitates multimodal data processing through dimensionality reduction and bridging the semantic gap. The feature interactions are detected by means of multivariate mutual information measures. The learnability of the representation of a class label is found to be strongly dependent on its complexity, which can be determined by the type of interactions it contains (redundancy or synergy) and the number of features that are involved.

Unlike for other information fusion tasks, in multimodal fusion through concatenation it is not clear beforehand what type of interaction underlies a specific learning target. Thus, dimensionality reduction, feature selection and other information fusion techniques that implement either a cooperative, complementary or competitive fusion strategy perform well on only some of the learning tasks and never on all. This can be circumvented by implementing a flexible fusion strategy that adapts to the underlying data structure.

The work proposes a feature selection and construction (FS/FC) algorithm that reduces the high dimensional fusion space to the features relevant to a given learning target by detecting their synergistic and/or redundant interactions. Using an artificial dataset that consists of 40 Boolean functions of varying complexity, it is shown how the algorithm works and how the complexity of a learning task influences its learnability. A comparison of the classification errors that result from pre-processing with the FS/FC algorithm and with standard dimensionality reduction and feature selection methods shows the superiority of the developed algorithm due to its capability to adapt to different data structures.

The theory of structural models is a powerful framework to determine feature interactions for complex data. Unfortunately, its implementation is computationally expensive and impractical for high dimensional data. However, it can be used to reveal another benefit of feature construction besides building a hierarchy of independent attribute subsets: experiments show that high-dimensional feature interactions can be approximated with structural models whose components are of lower dimensionality and then also easier to classify.

Finally, the feature selection and construction (FS/FC) algorithm is applied to real world data that vary in size, complexity and application area: a Corel subset, a medical dataset and an emotion recognition dataset. The algorithm scales to high dimensions (>1000 attributes) due to the implementation of a linear, sparse index for the joint probability calculation that underlies the multivariate mutual information measures. The classification results show that the proposed algorithm outperforms other dimensionality reduction and feature selection methods due to its adaptive fusion strategy, which in most cases yields high performance improvements over the full feature baseline. Other methods suffer from performance losses when the data structure does not fit their fusion strategy, which does not happen with the FS/FC algorithm.


Contents

1 Introduction 1

1.1 Problem definition . . . 4

1.1.1 Search space as lattice of partially ordered sets (posets) . . . 4

1.1.2 Feature selection . . . 5

1.1.3 Feature construction . . . 5

1.2 Information theoretic basics . . . 6

1.3 Research topics and thesis overview . . . 7

2 Information fusion in general and in multimedia in particular 11

2.1 Information fusion system design . . . 11

2.2 Functional model of data fusion . . . 17

2.3 Information fusion for multimedia document retrieval and classification . . . 17

2.4 Summary . . . 18

3 Feature interaction detection in the context of computational learning theory 19

3.1 Feature relevance . . . 20

3.2 Complex feature interactions and their learnability . . . 21

3.3 Complexity measures . . . 24

3.4 Summary . . . 25

4 Related work 27

4.1 Greedy information fusion . . . 27

4.2 Dimensionality reduction . . . 28

4.3 Feature selection . . . 30

4.4 Boolean association rule learning . . . 31

4.5 Non-greedy multivariate information fusion . . . 33

4.6 Summary . . . 34

5 Feature selection and construction on the lattice of posets 37

5.1 Maximum Likelihood entropy estimation and correction . . . 37

5.2 Total information . . . 39

5.3 Multivariate mutual information . . . 39

5.4 Search strategy and feature selection/construction heuristic . . . 41

5.5 Experiments on artificial data . . . 42

5.5.1 The dataset . . . 42

5.5.2 The experimental setup and SVM classifier . . . 43

5.5.3 Feature selection/construction with complete search and how it works . . . 43

5.5.4 Fast general feature relevance detection . . . 48

5.5.5 Concept complexity vs learnability . . . 49

5.5.6 Search space pruning . . . 52


5.5.7 The effect of entropy estimation correction . . . 57

5.6 The problem with Q-measures . . . 58

5.7 Summary . . . 61

6 Feature selection and construction on the lattice of structural models 63

6.1 Structural models . . . 63

6.2 Information in structural models . . . 64

6.3 Maximum entropy probability computation with iterative scaling . . . 67

6.4 Experiments on artificial data . . . 68

6.4.1 Information in structural models with components of fixed ordinality . . . 68

6.4.2 Influence of structure on classification result . . . 70

6.5 Discussion on structural models . . . 72

6.6 Final FS/FC algorithm . . . 73

6.7 Comparison to other feature selection and dimensionality reduction methods . . . 76

6.8 Improved scalability through implementation of linear, sparse indexes . . . 79

6.9 Balanced vs unbalanced classification error . . . 81

6.10 Summary . . . 84

7 Experiments on high dimensional real world data 85

7.1 Changes in experimental setup . . . 85

7.2 Corel dataset - fusion of text and images . . . 86

7.3 Corel dataset - fusion of visual features . . . 94

7.4 Medical dataset - fusion of images and clinical data . . . 98

7.5 Medical dataset - fusion of visual features . . . 104

7.6 Emotion recognition dataset - fusion of physiological, EEG and eye gaze features . . . 106

7.6.1 Experiments on the full attribute set . . . 107

7.6.2 Experiments on reduced data with spectral powers of EEG and eye gaze features . . . 112

7.6.3 Experiments on eye gaze features . . . 115

7.7 Summary . . . 118

8 Conclusions 121

8.1 Future work . . . 124

A Appendix 125

A.1 Benchmark data set . . . 125

Bibliography 127


List of Figures

1.1 Hasse diagram of lattice of partially ordered sets (posets) . . . 5

2.1 Functional model of data fusion system design by the JDL group [Llinas et al. 2004] . . . 16

5.1 Search strategy and feature selection/construction heuristic on lattice of posets . . . 42

5.2 Average classification error of FS/FC methods for a complete search on the lattice of posets . . . 45

5.3 Feature interaction detection examples . . . 46

5.4 General feature relevance detection examples . . . 50

5.5 Results for FS with the relevant feature subset . . . 51

5.6 Results for FS with the relevant feature subset divided into synergistic, redundant and mixed concepts . . . 51

5.7 Average classification error of FS/FC method vs the concept variation . . . 53

5.8 Average classification error of FS/FC method vs the number of relevant features . . . 53

5.9 Average calculation time of FS/FC with complete search . . . 53

5.10 Results for FS/FC methods using pruning strategies . . . 55

5.11 Calculation time and number of feature subsets searched for FS/FC with pruning for the different concept types . . . 56

5.12 Results for the Miller Madow and Bayes correction for entropy estimation . . . 57

5.13 Results for the Q-measures . . . 59

5.14 Feature interaction detection examples with Q-measures . . . 60

6.1 Lattice of structural models . . . 64

6.2 Goodness of fit of structural models with components of fixed ordinality . . 69

6.3 Average classification error of structural models with components of fixed ordinality . . . 71

6.4 Final search and feature selection/construction (FS/FC) heuristic . . . 74

6.5 Results for final FS/FC algorithm with and without previous relevance detection . . . 75

6.6 Comparison of FS/FC algorithm to other feature selection and dimensionality reduction methods . . . 77

6.7 Results for FS/FC algorithm with linear, sparse indexes . . . 80

6.8 Calculation time of FS/FC algorithm vs the number of relevant features for each concept . . . 81

6.9 Results recomputed with a balanced classification error . . . 82

6.10 Balanced classification error vs concept variation and the number of relevant features . . . 84


7.1 Results for Corel dataset . . . 88

7.2 Average classification error for each class of the Corel dataset . . . 89

7.3 Average search level where minimum classification error is observed for each class in the Corel dataset . . . 93

7.4 Results for the content-only Corel dataset . . . 95

7.5 Average classification error and average search level with minimum classification error for each class of the content-only Corel dataset . . . 96

7.6 Results for the medical dataset . . . 100

7.7 Average classification error and average search level with minimum classification error for each class of the medical dataset . . . 103

7.8 Results for the content-only medical dataset . . . 105

7.9 Average classification error and average search level with minimum classification error for each class of the content-only medical dataset . . . 105

7.10 Results for emotion recognition dataset . . . 109

7.11 Average classification error for each class of the emotion recognition dataset . . . 112

7.12 Results for reduced emotion dataset with spectral powers of EEG and eye gaze features . . . 113

7.13 Average classification error and average search level with minimum classification error for each class of the reduced emotion dataset with spectral powers of EEG and eye gaze features . . . 114

7.14 Results for emotion recognition dataset with only eye gaze features . . . 116

7.15 Average classification error and average search level with minimum classification error for each class of the reduced emotion dataset with eye gaze features only . . . 117


List of Tables

3.1 Example for interaction: round ∨ (red ∧ green) ↦ tomato . . . 22

3.2 Example for interaction: round ∨ red ↦ tomato . . . 23

5.1 Artificial concepts with their complexity measures, detection level, multivariate feature MI test and training classification error . . . 47

6.1 Average classification error of structural models with components of fixed ordinality . . . 72

7.1 Relevant features selected with FS/FC for the Corel dataset . . . 91

7.2 Relevant features selected with FS/FC for the content-only Corel dataset . . . 98

7.3 Relevant features selected with FS/FC for the medical dataset . . . 102

7.4 Features of the emotion recognition dataset that were extracted from physiological signals, Electroencephalography (EEG) and eye gaze recordings . . . 108

7.5 Relevant features selected with FS/FC for the emotion recognition dataset . . . 111


Chapter 1

Introduction

'Information fusion is the study of efficient methods for automatically or semi-automatically transforming information from different sources and different points in time into a representation that provides effective support for human or automated decision making.' [Boström et al. 2008]

As the definition suggests, information fusion is a vast topic with many disparate research areas that utilize and describe some form of information combination within their theoretical context. Their common ground is that they all try to merge multimodal information observed from an event or object. This can be generally described as an information transformation that captures all possible ways of combining, aggregating and reducing. Furthermore, a successful fusion method should be more efficient in terms of computation time and resources and/or more effective in decision making than alternative methods. What exactly a modality, the sources and the decision are depends on the information fusion task. In signal processing, modalities are defined to be signals that originate from the same physical source, but are captured through different means.

In the context of image fusion, a modality is taken from different imaging techniques. For the medical domain this can be X-ray, magnetic resonance imaging, ultrasound or computed tomography. For aerial and satellite imaging, modalities are images that were acquired in different spectral bands, for example visible, infrared or microwave. The fusion is then carried out by bringing the images of the different modalities into spatial alignment, called registration, and merging the image values in each location, called integration. Due to the complementary information in each modality, the combined image will contain more information than each of the monomodal ones.
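The registration-and-integration steps above can be sketched minimally (an illustrative fragment; the known integer shift and the per-pixel mean are simplifying assumptions, not methods from this thesis):

```python
import numpy as np

def register(moving, shift):
    """Bring `moving` into spatial alignment by a known integer shift
    (real registration would estimate this offset from the images)."""
    return np.roll(moving, shift, axis=(0, 1))

def integrate(img_a, img_b):
    """Merge the aligned images value by value; here a simple mean,
    so complementary information from both modalities is retained."""
    return (img_a.astype(float) + img_b.astype(float)) / 2.0

# Two 4x4 single-channel "modalities" of the same scene.
xray = np.zeros((4, 4)); xray[1, 1] = 200.0
mri = np.zeros((4, 4)); mri[2, 2] = 100.0

aligned = register(mri, shift=(-1, -1))  # moves the MRI structure onto (1, 1)
fused = integrate(xray, aligned)
print(fused[1, 1])  # 150.0: both modalities contribute at the same location
```

The per-pixel mean stands in for any integration rule; in practice the merging operator is chosen per application.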

The same effect is used in sensor fusion where the modalities are different types of sensors that observe the environment simultaneously. For example in robot vision and navigation this can be a video camera (omni directional or normal), infrared and ultrasound distance sensor, GPS etc. The alignment and integration of the different sensors into one map will give a more accurate description of the environment.

Other application areas of information fusion by integration are multimodal speaker detection, object tracking and localization. Here, the modalities are the audio and the video signals of one or more cameras and/or microphones, which are merged over a position estimate. The result of the multimodal system will be more robust than a monomodal one, because when background noise reduces the quality of the audio, or occlusion that of the video, the detection can rely on the unaffected modality.

In summary, information fusion by integration consists of the alignment of the information sources in the input space (Chapter 7 in [Thiran et al. 2010]) and the consecutive registration or combination of the input values such that the variation in reliability of the modalities affects the final result as little as possible. The goal of combining different data sources is to improve a system's accuracy due to the complementarity of the merged signals that have been observed from the same phenomenon but capture a different aspect of the object or event. Another advantage is the increase in robustness due to the redundant observation of the object or event with several modalities.
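One simple way to make such a combination insensitive to an unreliable modality is inverse-variance weighting of the aligned estimates (an illustrative sketch; the sensor variances are made-up numbers, not values from this thesis):

```python
def fuse_positions(estimates, variances):
    """Inverse-variance weighted average: a noisy modality (large
    variance) gets a small weight, so it barely moves the result."""
    weights = [1.0 / v for v in variances]
    return sum(w * x for w, x in zip(weights, estimates)) / sum(weights)

# Audio localizer vs. video localizer for the same speaker position.
audio_pos, video_pos = 2.0, 1.0
print(fuse_positions([audio_pos, video_pos], [1.0, 1.0]))    # 1.5: equally trusted
print(fuse_positions([audio_pos, video_pos], [100.0, 1.0]))  # ≈1.0099: noisy audio nearly ignored
```

This is the standard fusion rule for independent Gaussian estimates; any scheme that down-weights the degraded modality achieves the robustness described above.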

However, the examples described so far are only a special case of information fusion problems. In these problems the different modalities have a common space where the information fusion takes place: the location in the environment. Another case comprises problems where no low-dimensional fusion space can be defined for the different modalities. There, the fusion space is equal to the concatenated input spaces of the modalities.

Consider for example a multimodal biometrics system that identifies a person out of an enrolled candidate list based on several modalities like face image, voice, fingerprints and iris image. Here, the input information cannot be aligned into a low-dimensional fusion space, but has to be merged in a space that combines all the modalities' input spaces. What can be done, for example, is that each modality is processed individually and the final decision of acceptance or rejection of an identity is determined based on the majority of the decisions taken for the modalities. This way, the use of several modalities can also decrease the system's error, which can be high for monomodal ones due to the heterogeneity of the information input and noise (e.g. deterioration of the speech signal when the person has a cold).
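The majority-of-decisions rule described above can be sketched as follows (an illustrative fragment; the per-modality decisions are assumed to be given by monomodal classifiers):

```python
def majority_decision(modal_decisions):
    """Accept an identity claim iff a strict majority of the
    per-modality classifiers accept it."""
    accepts = sum(1 for d in modal_decisions if d)
    return accepts * 2 > len(modal_decisions)

# Per-modality accept/reject for one identity claim:
# face, voice, fingerprint, iris. The cold-degraded voice
# modality rejects, but the system still accepts.
print(majority_decision([True, False, True, True]))   # True
print(majority_decision([True, False, False, True]))  # False (a tie is rejected)
```

Decision-level fusion like this is only one option; score-level or feature-level fusion trade robustness for richer information.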

Other examples are multimodal emotion recognition systems that use facial expressions, gestures and biometric signals like heart rate, EEG and galvanic skin response to assess a person's emotional state, as well as the processing, and more specifically indexing, retrieval and classification, of multimedia documents like videos, text-annotated images, web sites etc., where the modalities can be image, text, audio and meta data.

The problem of information fusion by concatenation is the focus of this work. It gives rise to new challenges, one of which is the feature extraction problem (Chapter 7 in [Thiran et al. 2010]). For tasks whose goal is a simple location estimate in space or time, distance or location features are intuitively extracted from the different modalities. But when the task is to determine a person's identity or emotional state, the features that are necessary to solve the problem are less evident. From each modality, one or more features that are tailored to the problem have to be extracted. For example, one can extract facial features from images to identify a person, or extract facial gestures (smiles, frowns etc.) to identify a person's emotional state. These features are called high level, because they abstract from the low level features, i.e. the pixels of the images. In general, they are low dimensional and highly descriptive.

Yet another, slightly different setup is observed for the detection/identification of objects in a video or an annotated photograph. Since the object could be literally anything, for example animals, persons, houses, plants etc., no generic high level features that are tailored to all the possible detection targets can be found. In general, low level features are extracted from the images like color, texture and shape. They are generally high dimensional (e.g. a histogram of all possible color values), sparse, noisy and of little descriptive value. Here, the problem is not the feature extraction itself, but the semantic gap between the low-level features and high-level semantic meanings, for example a rectangle of pixels representing a dog to a human viewer. This implies that the low-level features contain many irrelevant features once a specific object is targeted, which then obscure the relevant ones and impair the detection result.

Another challenge for these information fusion problems is the generally high dimensional input space, which is caused by the concatenation of all the modalities' features due to the lack of a low-dimensional fusion space. Most algorithms do not cope well with high dimensional input data. The often observed performance degradation for approaches with high dimensional (multimodal or not) input data is quite counter-intuitive, because in theory they have more information at hand. A disadvantage is that more training data is needed to obtain an acceptable population of the representation space, which is called the curse of dimensionality. Furthermore, high dimensional data needs more space in memory and longer computation times, and the problem can quickly become completely intractable. Therefore, the goal of feature extraction has to be to create a relevant and compact feature set out of the observed information sources, where relevance is always tied to the context of the problem.
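The training-data requirement behind the curse of dimensionality can be made concrete with a back-of-the-envelope count: a naive joint histogram with b bins per feature needs on the order of b^d cells, so a fixed training set is spread ever thinner as modality features are concatenated (the numbers below are purely illustrative):

```python
def cells(bins_per_feature, n_features):
    """Number of cells a naive joint histogram needs."""
    return bins_per_feature ** n_features

n_train = 10_000  # a fixed, hypothetical training set size
for d in (2, 5, 10, 20):
    c = cells(4, d)  # 4 bins per feature
    print(d, c, n_train / c)  # instances per cell collapse exponentially
```

Already at d = 10 there is on average less than one training instance per cell, which is why relevance-driven compaction of the feature set matters.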

In this context, this thesis proposes a feature selection and construction algorithm that exploits the interactions of (low-level) features in the context of a specific learning problem.

To that end, existing multivariate mutual information measures are adapted and further developed. The feature co-occurrences, captured by the joint probability distribution that underlies mutual information, can be seen as describing feature dependencies, which are also named interactions. When features interact strongly with the learning target, they can be utilized as its statistical model. It is claimed that this model represents a semantic description of the learning problem and that it fuses the information over the different modalities optimally.
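The central idea, that features carrying no individual information about the target may jointly determine it, can be illustrated with a toy XOR concept and plug-in entropy estimates (an illustrative sketch, not the implementation developed later in this thesis):

```python
from collections import Counter
from math import log2

def entropy(samples):
    """Plug-in (maximum likelihood) entropy estimate in bits."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in counts.values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) from empirical counts."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Synergistic concept: Y = X1 XOR X2 over all four input patterns.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y = [a ^ b for a, b in zip(x1, x2)]

print(mutual_information(x1, y))                 # 0.0 bit: X1 alone is useless
print(mutual_information(x2, y))                 # 0.0 bit: so is X2
print(mutual_information(list(zip(x1, x2)), y))  # 1.0 bit: jointly fully informative
```

A univariate filter would discard both features here; only a multivariate measure exposes their synergy.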

Information fusion is the main concern of this thesis, but in doing so other topics proved to be helpful if not necessary to tackle the problem in a systematic way: data mining and more specifically dimensionality reduction, feature selection and construction, com- putational learning theory as well as classification of the input data for evaluating the effectiveness of the information fusion.

In the last decade, multimedia data processing has received a lot of attention from research communities due to the 'multimediatisation' of the World Wide Web as well as of private and professional data collections. This flood of multimedia data such as videos and photographs is, amongst others, due to ubiquitous capturing devices like cameras in mobile phones and the fact that the data can be instantly published on the Internet. Another reason for the interest in fusion of multiple information sources is that the success of content-only information processing approaches that use only the visual modality has so far stayed behind expectations.

This trend is also visible in the rising number of participants that multimedia evaluation campaigns like TRECVID and ImageCLEF attract every year. The success of these benchmarks is also due to the fact that multimedia data processing is still a challenging topic even though many advances have been made. From experience in the ImageCLEF organization, the task of fusing images and their textual annotations, as in the INEX 2006/2007 Multimedia Task [Westerveld and van Zwol 2007] or the ImageCLEF 2008/2009/2010 Wikipedia task [Popescu et al. 2010], was so far solved best with text-only approaches. Lately, the multimodal approaches have caught up and took the lead in the ImageCLEF Wikipedia task in 2010. On the other hand, fusing multimodal sources in video retrieval, as in the TRECVID campaign (http://www-nlpir.nist.gov/projects/trecvid/), is more successful. There, for years, the top-ranked methods have all been multimodal. An explanation could be that the noisy speech transcript weakens the textual modality.

Nowadays, commercial search engines also start to implement multimedia and content-based retrieval. For example, Google Images allows, in addition to classic keyword-based search, filtering the results by size, type (face, photo, clipart, drawing) and colors.

1.1 Problem definition

The data, comprising attribute vectors of (low-level) features extracted from the different modalities of a multimedia document, is analyzed in the context of a specific learning problem, which is represented by a binary class label that discerns positive from negative instances. This means that the method works in a supervised setting, where a number of documents annotated with the class label are available for training the feature selection and construction algorithm as well as the classifier. Furthermore, it is assumed that every learning target has a unique description by which it can be discerned from the other classes in the collection.

Let $X$ be an independently distributed, binary or discrete attribute vector in the input space $\mathcal{X}$. Let $Y \in \{0, 1\}$ be the binary class label. The algorithm works in a multivariate setting, which means $N$ attributes $X_1, X_2, \ldots, X_N$ are observed over $M$ instances. The task is then to find one or several attribute sets of size $N_c < N$ that sufficiently represent the class label $Y$. The size $N_c$ of an attribute set is also referred to as its ordinality.

The problem of finding an attribute set that is relevant to the learning target and at the same time compact enough for optimal information fusion can be divided into three parts:

• search: a systematic search over all possible attribute combinations conditioned on the learning target,

• feature selection: selection of the resulting feature set based on the strength of its interaction with the learning target,

• feature construction: construction of the resulting attribute set by detecting independent feature sets.

1.1.1 Search space as lattice of partially ordered sets (posets)

The full search over all possible attribute combinations, which guarantees an optimal solution, can first be restricted so that each attribute combination is considered in only one (canonical) order, because the order of the attributes does not influence the joint probability distribution of a subset, which is needed for computing the feature interactions. It is also unnecessary to analyze an attribute's dependence on itself, which excludes from the search space all attribute sets that contain an attribute more than once.

That is why the search space can be defined as a lattice of partially ordered sets (posets). This means that for all possible subset sizes $N_c = 1 \ldots N$ all attribute combinations without replacement are considered. This is also known as the power set of $N$ elements, which results in a full search space of size $2^N$.

2 www.google.ch/images


Fig. 1.1: Hasse diagram of the lattice of partially ordered sets (posets)

Figure 1.1 shows the Hasse diagram of the lattice of posets for 3 variables. The hierarchy is defined by set inclusion/exclusion. The bottom of the lattice is the empty set Ø, the second level consists of the independent variables, and the top is the full attribute set. The number of sets of ordinality $N_c$ for $N$ variables is $N!/(N_c!(N-N_c)!)$ and is largest for $N_c = N/2$. Attribute sets are said to overlap if they share one or more variables. Search strategies on the lattice can be split into top-down and bottom-up approaches, as well as into breadth-first (search $\{X_1\}, \{X_2\}, \{X_3\}, \ldots, \{X_1, X_2\}, \{X_1, X_3\}, \ldots$) or depth-first (search $\{X_1\}, \{X_1, X_2\}, \{X_1, X_2, X_3\}, \ldots$) strategies.

Note that in this work all attribute sets that are searched have to contain the binary class label, because only feature interactions conditioned on the learning target are of interest. Hence, the class label is added to each poset of attributes.

One essential problem this work has to cope with is that the search space is exponential in the number of input attributes. This is partially mitigated by the possibility of searching the space in parallel on a cluster of machines.
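The breadth-first enumeration of this lattice can be sketched as follows (a minimal illustration; the function and variable names are chosen for this example only):

```python
from itertools import combinations

def enumerate_posets(n_attributes):
    """Enumerate all attribute subsets (the power set, minus the empty set)
    breadth-first, i.e. by increasing ordinality Nc = 1..N. Each poset is
    analyzed together with the class label Y, so Y is appended to every set."""
    attrs = list(range(n_attributes))
    for nc in range(1, n_attributes + 1):
        for subset in combinations(attrs, nc):
            yield subset + ("Y",)

# For N = 3 attributes the lattice (excluding the empty set) has 2^3 - 1 = 7 sets
sets_ = list(enumerate_posets(3))
print(len(sets_))           # 7
print(sets_[0], sets_[-1])  # (0, 'Y') (0, 1, 2, 'Y')
```

Since `combinations` emits each subset in a single canonical order, no ordered duplicates or repeated attributes appear, matching the restriction of the search space described above.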

1.1.2 Feature selection

The multivariate feature selection criterion has to reliably detect the dependence between the attribute sets and the class label. Furthermore, its characteristics can facilitate the search in the input space. For example, a monotone feature selection criterion helps to direct the search towards better values and hence reduces the search space, and a criterion with well defined boundaries allows pruning of the search space once the optimal or a near optimal value of the criterion is reached.
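To illustrate how a monotone, bounded criterion enables such pruning, consider the following schematic sketch (with a toy coverage criterion, not the multivariate criterion developed in this thesis):

```python
def search_with_pruning(attrs, criterion, upper_bound, tol=1e-9):
    """Depth-first subset search that stops expanding a branch once the
    (assumed monotone, bounded) criterion reaches its upper bound: no
    superset can then improve the score further."""
    results = []

    def expand(subset, rest):
        score = criterion(subset)
        results.append((subset, score))
        if score >= upper_bound - tol:
            return  # prune: supersets of this set are never visited
        for i, a in enumerate(rest):
            expand(subset + (a,), rest[i + 1:])

    for i, a in enumerate(attrs):
        expand((a,), attrs[i + 1:])
    return results

# toy monotone criterion: fraction of the "relevant" attributes {0, 1} covered
relevant = {0, 1}
crit = lambda s: len(relevant & set(s)) / len(relevant)
res = search_with_pruning((0, 1, 2, 3), crit, upper_bound=1.0)
visited = [s for s, _ in res]
# subsets containing both relevant attributes are never expanded further,
# e.g. (0, 1) is visited but (0, 1, 2) is not
```

The saving grows with the bound's tightness: the earlier a branch hits the bound, the larger the pruned sub-lattice.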

1.1.3 Feature construction

In feature construction, the question is how to combine the features (sets of attributes) that were detected with the help of the feature selection criterion during the search. If only one feature is found, no feature construction is needed and the attribute set is taken as the result.

But if several features are selected as being relevant to the learning target, the strategy for feature construction becomes important. To this end, several strategies can be considered: the attributes can be aligned into one vector, the attribute sets can be used to create a feature hierarchy, the construction can rely solely on attribute sets that do not overlap, and so on. Another question is whether it is advisable or necessary to limit the complexity of the feature construction result.

The goal is to create a feature construction heuristic that adapts the result to the complexity of the learning problem: it should perform a simple feature selection where that is sufficient to solve the problem, and construct a more complex feature hierarchy otherwise.

Finally, the result of the feature selection and construction has to be evaluated by performing some form of information fusion. The common method for discriminating classes in supervised settings is classification, for which different algorithms have been developed. In this work, a support vector machine (SVM) classifier that takes the constructed features as input is used for that purpose. The work focuses on feature selection and construction; no other information fusion approaches or classifiers are considered.

1.2 Information theoretic basics

The algorithms presented throughout this thesis use multivariate, information theoretic dependence measures to detect complex feature interaction patterns. That is why the basics of information theory as developed by Shannon, and their notations, are introduced here, following [Cover and Thomas 1991].

The probability mass function, also called the probability density, of a discrete random variable $X$ with alphabet $\mathcal{X}$ is defined as follows:

$$p_X(x) = \Pr\{X = x\}, \quad x \in \mathcal{X}. \tag{1.1}$$

In the following, the probability mass function will simply be written as $p(x)$. Accordingly, the joint probability densities for two or more variables are written $p(x, y)$ and $p(x_1, \ldots, x_N)$ respectively. Probability densities always fulfill $\sum_{x \in \mathcal{X}} p(x) = 1$.

The entropy of a discrete random variable $X$ is defined as

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) \tag{1.2}$$

and

$$H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y) \tag{1.3}$$

respectively for a pair of discrete random variables $X, Y$. If not stated otherwise, $\log$ refers to the binary logarithm. The joint entropy can also be calculated as the entropy of one variable plus the conditional entropy of the other:

$$H(X, Y) = H(X) + H(Y|X) \tag{1.4}$$

with

$$H(Y|X) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y|x). \tag{1.5}$$


Entropy is always non-negative, $H(X) \ge 0$. Another important characteristic is that conditioning reduces entropy, $H(X|Y) \le H(X)$, with equality iff $X$ and $Y$ are independent. Furthermore, the sum of the marginal entropies equals the joint entropy only iff the variables are independent; otherwise it is larger: $H(X_1, X_2, \ldots, X_N) \le \sum_i H(X_i)$.

The relative entropy between two probability mass functions defined on the same alphabet is defined as:

D(p||q) = X

x∈X

p(x) logp(x)

q(x) (1.6)

which is also known as the Kullback Leibler distance or divergence. Since the measure is not symmetric and does not satisfy the triangle inequality, it is not a true distance. The KL divergence fulfillsD(p||q)≥0 with equality ifp(x) =q(x).

The mutual information $I$ is the relative entropy between the joint and the product distribution:

$$I(X; Y) = H(X) + H(Y) - H(X, Y). \tag{1.7}$$

It can also be expressed with the help of the Kullback-Leibler divergence between the joint and the product distribution:

$$I(X; Y) = D(p(x, y)\|p(x)p(y)) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)p(y)} \tag{1.8}$$

and thus models the error that is made when the variables are considered independent. Mutual information is always $I(X; Y) \ge 0$, with equality only if $X$ and $Y$ are truly independent. Multivariate extensions of the joint probability distribution and the mutual information will be discussed in later chapters.
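The quantities defined in this section can be computed directly from a joint probability table. The following sketch (pure Python, on an arbitrary example distribution) computes the entropies, the conditional entropy and the mutual information, and verifies that (1.7) and (1.8) agree:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits; terms with p = 0 contribute 0 by convention."""
    return -sum(p * log2(p) for p in probs if p > 0)

# arbitrary example joint distribution p(x, y) over binary X and Y
pxy = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}
px = {x: sum(p for (xi, _), p in pxy.items() if xi == x) for x in (0, 1)}
py = {y: sum(p for (_, yi), p in pxy.items() if yi == y) for y in (0, 1)}

H_X, H_Y = entropy(px.values()), entropy(py.values())
H_XY = entropy(pxy.values())
H_Y_given_X = H_XY - H_X            # chain rule, eq. (1.4)
I_XY = H_X + H_Y - H_XY             # mutual information, eq. (1.7)

# eq. (1.8): KL divergence between the joint and the product distribution
kl = sum(p * log2(p / (px[x] * py[y])) for (x, y), p in pxy.items())
assert abs(kl - I_XY) < 1e-12       # (1.7) and (1.8) give the same value
```

For this example distribution $I(X;Y) \approx 0.125$ bits; replacing `pxy` with a product distribution would drive it to 0.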

1.3 Research topics and thesis overview

The research topics and contributions of this thesis can be roughly divided into:

• discussing the problem of multi modal information fusion by concatenation,

• implementing optimal information fusion by concatenation that handles different data structures through feature interaction detection and adaptive feature selection and construction,

• applying structural information models for approximating feature interactions and more efficient search pruning,

• implementing linear, sparse indexes for improved scalability of the feature interaction detection.


Discussing the problem of multi modal information fusion by concatenation This chapter already gave examples of this kind of information fusion problem, which is often found in multimedia data processing. The following two chapters approach the problem from two different viewpoints. In Chapter 2, the fundamentals of information fusion in terms of system design (choice of sensors and information sources, feature extraction, level of fusion and fusion strategy) and well-known problems are presented. In Chapter 3, the information fusion problem defined above is related to computational learning theory and pattern recognition. The subsections discuss the influence of feature relevance, complex feature interactions, their learnability and complexity measures.

Then, Chapter 4 contributes a thorough overview of the information fusion literature, from its greedy beginnings up to current state-of-the-art approaches that rely more and more on the data's underlying structure, like the approach developed in this work. The review includes works from different disciplines like feature selection, dimensionality reduction and pattern recognition, which also shows how vast the topic of information fusion is.

Implementing optimal information fusion by concatenation that handles different data structures through feature interaction detection and adaptive feature selection and construction In Chapter 3, theoretical basics of computational learning theory, data mining and, most relevant to this work, boolean association rule learning are reviewed. In the framework of the latter, it has been shown that the structure of data can be learned and hence represented by boolean algebra, if enough training data is available and the problem is not random.

In the present work, this idea is applied to feature interaction detection by means of multivariate mutual information measures. The advantage is that they are more general than association rules, because they can detect any type of boolean function without explicit testing. The multivariate mutual information criteria presented in Chapter 5 allow the detection of statistical dependence between the attributes and the class label across different modalities, independently of the data structure. It is shown in detail how they are able to detect directly or indirectly relevant attributes or sets of attributes towards a learning target. A fast feature relevance detector is developed that finds the dataset's relevant features in linear time. The proposed feature construction heuristic that follows feature selection is developed such that it can automatically adapt to the different types of dependencies. The resulting relevant and compact feature set can then be used to perform information fusion optimally.

Experiments on an artificial dataset, which was designed to be challenging for current machine learning algorithms, demonstrate the effectiveness of feature selection and construction based on feature interactions. The accuracy of the developed approach, which searches the full exponential search space, is paid for with a high computation time though. Search space pruning that exploits the characteristics of the multivariate mutual information criteria (also Chapter 5) reduces the search space, but the computational savings turn out to be small.


Applying structural information models for approximating feature interactions and more efficient search pruning The excessively long computation times of the basic approach led to a search for more efficient information interaction detection methods that, for example, can make better use of the lattice structure of the search space. The often used Q-measures (see Section 5.6), which are defined on the lattice of posets, had to be discarded, based on a proof of their insufficiency as an information measure by [Krippendorff 2009].

Krippendorff proposed instead a lattice of structural information models and maximum entropy probability estimation thereon to describe information in complex systems [Krippendorff 1986]. In Chapter 6, it is shown how the structural information models can be utilized for efficient feature selection and construction. In terms of feature selection, the framework of structural models provides a goodness-of-fit test for models of any complexity and a significance test for the information measures, which helps avoid selecting irrelevant features. With the help of structural models it is also shown why and when feature construction is successful. In Section 6.6, the final feature selection and construction (FS/FC) algorithm is presented, which incorporates all the findings and tools from Chapters 5 and 6.

The experiments show that the search over all possible structural models can be pruned more efficiently than on the lattice of posets and, most importantly, without loss of performance. As a matter of fact, when enough training data is available for learning, the approach based on structural models even improves the accuracy compared to the results of Chapter 5. Furthermore, it is shown that the FS/FC approach outperforms the full feature baseline, dimensionality reduction and feature selection methods, and a state-of-the-art feature interaction detection algorithm.

Implementing linear, sparse indexes for an improved scalability of the feature interaction detection To apply the developed FS/FC approach to truly high-dimensional, real-world datasets, the implementation of the joint probability density estimation is changed to work on a linear, sparse index instead of a multi-linear one (see Section 6.8). This allows the storage and processing of high order joint probability distributions in the form of simple vectors. For further speed-up, the feature interaction detection is parallelized.

The experiments in Chapter 7 on real-world datasets from very different domains show the possibilities as well as the limits of the feature interaction detection algorithm for feature selection and construction. The proposed algorithm is compared to the full feature baseline, classification on the single modalities, late information fusion over the different modalities, and some well-known dimensionality reduction and feature selection methods, most of which are outperformed by the proposed FS/FC algorithm.


Chapter 2

Information fusion in general and in multimedia in particular

Information fusion has established itself as an independent research area over the last decades, but a general formal theoretical framework to describe information fusion systems is still missing [Kokar et al. 2004]. One reason for this is the vast number of disparate research areas that utilize and describe some form of information fusion in their own theoretical context.

The Introduction already presented two types of information fusion together with example application areas. This chapter intends to give a thorough overview of information fusion in its various forms by way of a discussion of information fusion system design.

In some sense, it can be seen as covering the fundamentals of information fusion that everybody agrees on, whereas other questions, like the optimality of information fusion, are still under discussion.

2.1 Information fusion system design

A thorough formalization of information fusion system design is presented in [Kokar et al. 2004]. The following aspects are considered essential for an information fusion system:

• sensors and sources of information

• feature extraction

• fusion level

• fusion strategy

• fusion architecture

The choice of sensors and sources of information is limited by the application area itself. The available sources should be considered in terms of the amount of noise they contain, their cost of computation/processing, the diversity between the sources and, most importantly, their general ability to describe the learning target and distinguish it from other objects in the data set.

When selecting the features that are extracted from the source signals, one must realize that the feature values of different modalities can span a spectrum of different feature types: binary, discrete, continuous and non-numeric/symbolic. Modality fusion dealing with mixed feature type data is more difficult and complex [Li and Biswas 1997], especially for joint fusion at the data level, since a meaningful projection of the data to the result space has to be defined. Even in the case that only continuous or discrete features are extracted from the different modalities, information fusion is not trivial, because an appropriate normalization and, if needed, quantization has to be applied [Ross and Jain 2004]. As for the information sources, the extracted features need to be capable of describing the learning target.

The fusion level can be one of the following (see also [Thiran et al. 2010], Chapter 8):

• data, sensor or feature level,

• rank, score or decision level.

Data, sensor and feature level fusion are also called low-level or early information fusion, or data fusion. In [Llinas et al. 2004], data fusion is defined as: '... an information process that associates, correlates and combines data and information from single or multiple sensors or sources to achieve refined estimates of parameters, characteristics, events and behaviors'. The concept of data fusion initially occurred in multi-sensor processing. Typical examples are sensor fusion for robot vision and navigation, object tracking in videos, and image fusion as discussed in the Introduction. In these cases the information fusion is done by aligning and combining the input values from the different sensors or sources (modalities) in the fusion space.

In theory, information fusion at the data or sensor level can achieve the best performance improvements [Koval et al. 2007], because the raw data is the richest source of information, as it represents the input unaltered. Every further abstraction, and hence every data processing step such as feature extraction, classification or ranking, will lead to a loss of information according to the data processing inequality. The information richness of the raw data is tempered by the noise it contains; for images, for example, illumination changes and background clutter.

Information fusion problems where the data values cannot be aligned in a low-dimensional fusion space are affected by another problem. For example, in multimedia document processing the data of the different modalities needs to be concatenated. Usually, the resulting space is very high-dimensional and every single data point contributes little information. This leads to severe problems in practice due to high computational costs in terms of calculation time and storage, the need for enormous amounts of training data, and difficulties in modeling the highly complex data structure. These drawbacks can, to some extent, be reduced by feature extraction and information fusion at the feature level.

However, fusion by concatenation is the simplest and also weakest form of information fusion. It is weak because using the full, high-dimensional, low-informative feature set for further data processing is doomed to fail for the reasons named above. For successful data or feature fusion, the feature set's size has to be reduced by dimensionality reduction or feature selection. To do so, the inter-modal and/or inter-attribute relationships such as dependency, correlation, co-occurrence, causality or mutual information can be exploited. This is the common approach to combining, for example, textual and visual information resources for subsequent information retrieval [Wu and McClean 2006]. The feature selection and construction approach proposed in this work aims at contributing a new feature fusion solution to this type of information fusion problem.
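A minimal sketch of early fusion by concatenation followed by a simple univariate co-occurrence filter (illustrative only: the features, label and filter are hypothetical toy examples, and the thesis develops a multivariate criterion instead of this univariate one):

```python
def concatenate(text_feats, visual_feats):
    """Early fusion: align per-document feature vectors by concatenation."""
    return [t + v for t, v in zip(text_feats, visual_feats)]

def select_by_cooccurrence(X, y, k):
    """Keep the k binary attributes whose presence co-occurs most
    one-sidedly with one of the two classes (a crude relevance proxy)."""
    n_attrs = len(X[0])
    scores = []
    for j in range(n_attrs):
        pos = sum(1 for xi, yi in zip(X, y) if xi[j] and yi)
        neg = sum(1 for xi, yi in zip(X, y) if xi[j] and not yi)
        scores.append((abs(pos - neg), j))
    keep = sorted(j for _, j in sorted(scores, reverse=True)[:k])
    return [[xi[j] for j in keep] for xi in X], keep

# hypothetical binary features: 2 text attributes + 2 visual attributes
text = [[1, 0], [1, 0], [0, 1], [0, 1]]
visual = [[1, 1], [0, 1], [1, 0], [0, 0]]
y = [1, 1, 0, 0]
X = concatenate(text, visual)               # 4 documents x 4 attributes
X_sel, kept = select_by_cooccurrence(X, y, k=2)
```

Note that the concatenated space mixes attributes from both modalities, so the filter can retain a cross-modal attribute set; a univariate score like this, however, misses exactly the attribute interactions this thesis targets.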


Score, rank and decision level fusion are also called high-level or late information fusion. Here, each modality/sensor/source/feature is first processed individually. The results are so-called experts and can be scores in classification or ranks in retrieval. The experts' values are then combined to determine the final decision. Combining the experts' decisions is also possible, but is seen as a very rigid solution, because at this abstract level of processing only limited information is left. This means that late information fusion is done hierarchically and on an abstract level.

This type of information fusion is faster and easier to implement than early fusion, because each feature or modality is processed independently, which reduces the problem's complexity, and no data alignment or feature selection is performed.

The roots of score and decision fusion can be found in the neural network literature, where the idea of combining neural network outputs was published as early as 1965 [Tumer and Gosh 1999]. Later, its application expanded into other fields: econometrics as forecast combining, machine learning as evidence combination, and information retrieval as rank aggregation or meta-search [Wu and McClean 2006].
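A generic illustration of score-level late fusion (a minimal sketch; the scores and the equal weights are hypothetical, and this is not a method proposed in this work): per-expert scores are combined by a weighted mean and then thresholded into a decision:

```python
def fuse_scores(score_lists, weights=None):
    """Late fusion at score level: combine the per-modality expert scores
    (each in [0, 1]) for every document by a weighted mean."""
    n = len(score_lists)
    weights = weights or [1.0 / n] * n
    return [sum(w * s for w, s in zip(weights, scores))
            for scores in zip(*score_lists)]

text_scores = [0.9, 0.2, 0.6]    # hypothetical per-document expert scores
image_scores = [0.7, 0.4, 0.1]
fused = fuse_scores([text_scores, image_scores])
decisions = [int(s >= 0.5) for s in fused]   # decision-level output
```

Unequal weights can encode trust in the stronger modality; thresholding each expert first and majority-voting the results would be the still more rigid decision-level variant mentioned above.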

In general, a decision between early and late fusion must be made, but hybrid algorithms that fuse on several levels are also possible. In [Kokar et al. 2004], the authors proved with the help of the category theory framework that feature, score, rank and decision fusion are just special cases of data fusion, which means that they are not fundamentally different.

In [Ross and Jain 2004], the fusion level is approached from the perspective of what information can be fused. The paper is focused on multi modal biometrics, but the listing also applies to general information fusion problems. Note that the first three points are in line with the fusion levels presented before:

• single modality and multiple sensors

• single modality and multiple features

• single modality and multiple classifiers

• single modality and multiple sample sets (in [Poh and Bengio2005])

• multiple modalities

The first two fuse information not over different modalities, but over different sensors belonging to the same modality, as for example done in image fusion, or over different features extracted from the same modality, such as the color, shape and texture features extracted from images. They can be referred to as mono modal, low-level information fusion methods.

The third method is a mono modal, late information fusion strategy, where high-level features like face and specific object classifiers are combined in image processing, or retrieval rankings in meta-search. The fourth scenario also belongs to the late fusion methods and has not been mentioned before. It was added to the overview because of its importance in machine learning approaches like bootstrap aggregation (bagging). The information fusion is done over different models, or more specifically their parameters, each built on a different sample set (bootstrap) containing a subset of the training data.


Finally, the last scenario in the list is multi modal information fusion, which is the focus of this work. Each modality can be any of the other four scenarios, with the simplest case being that each modality contains only a single sensor, feature, classifier or sample set.

The fusion strategy is the most important point in information fusion system design, because it defines how the data structure is exploited for successful information fusion. The different strategies are as follows [Llinas et al. 2004]:

• complementary fusion

• cooperative fusion

• competitive fusion.

The fusion strategy basically explains how information fusion works and when it is successful. The fusion success, and hence the achievable performance improvement compared to single source systems, has been determined theoretically for complementary and cooperative fusion. In that context, researchers have also empirically investigated suspected influence factors such as the diversity, dependence, number, accuracy and relevance of the information sources.

The complementary fusion strategy is generally applied by all methods that exploit the diversity between the information sources. Researchers investigating the performance improvement of complementary fusion [Rosen 1996] found that decorrelated neural network ensembles outperform independently trained ones. The overall error reduction is achieved because the negatively correlated errors of the different neural networks average out in combination. [Brown and Yao 2001] confirmed that diverse classifiers improve the performance of an ensemble, even if weak classifiers are involved.

These works are examples of late, complementary fusion, where the fusion success is due to the experts' diverse errors.

More formally, this can be explained with the bias-variance decomposition of the mean square error of the information fusion result: training with more diverse sources increases the bias (ambiguity) of the result, but lowers its variance. The bias-variance-covariance relation in [Ueda and Nakano 1996, Poh and Bengio 2005] is an extension of the previous one, which shows theoretically that dependencies between classifiers increase the generalization error compared to independent ones. This means that the fusion result in late, complementary fusion is mostly affected by the dependency of the used sources and not so much by their number or accuracy [Tumer and Gosh 1999].
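The bias-variance-covariance relation referred to here can be written out explicitly. For an ensemble average $\bar{f} = \frac{1}{M}\sum_{i=1}^{M} f_i$ of $M$ experts and a target $t$, the standard form of the decomposition (stated here as background, with $\overline{\mathrm{bias}}$, $\overline{\mathrm{var}}$ and $\overline{\mathrm{covar}}$ denoting the averages over the ensemble members) is:

```latex
E\big[(\bar{f} - t)^2\big]
  = \overline{\mathrm{bias}}^2
  + \frac{1}{M}\,\overline{\mathrm{var}}
  + \Big(1 - \frac{1}{M}\Big)\,\overline{\mathrm{covar}}
```

The covariance term dominates for large $M$: positively correlated expert errors keep the ensemble error high, while diverse (negatively correlated) errors lower it, which matches the statement above that dependency among the sources matters more than their number.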

In early, complementary fusion, the idea is that the most independent and diverse features or sensors best represent the information in the data/environment and hence are also best able to describe the learning targets. This is used, for example, in principal and independent component analysis (PCA/ICA) and singular value decomposition (SVD). In [Poh and Bengio 2005] an early, complementary information fusion system for multi modal biometrics was studied in terms of its performance improvement boundaries. The system's lower bound of performance improvement was observed when highly correlated modalities were used, and the upper bound for independent ones.

In summary, this strategy generates a more complete representation of the world by combining multiple complementary information sources and generalizing over them.


The cooperative fusion strategy generally exploits the dependence between the information sources. An example of late, cooperative fusion is the majority vote method as used in meta-search approaches. It relies on the correlation of the experts' results when making a final decision. Early, cooperative fusion can be performed, for example, with latent semantic analysis (LSA), where the relationships between variables are detected to derive latent features that relate to a hidden, underlying relevance.

The success of cooperative fusion can again be explained in terms of the bias-variance decomposition of the mean square error of the fusion result: training with correlated sources lowers the bias of the result, but increases its variance. This strategy achieves a more accurate expectation value of the result due to the combination of dependent inputs. In general, by combining features cooperatively a more precise representation of the world is achieved.

Complementary vs cooperative information fusion In the beginning of information fusion research, the complementary and cooperative fusion strategies were thought to be contradictory and not exploitable at the same time. But it turns out that most information fusion problems are complementary and cooperative at the same time.

For example, consider the information fusion problems of sensor integration, multi modal tracking and image fusion: in one sense the fusion is cooperative, because it exploits the temporal or spatial co-occurrence of an object or event in each modality; in another sense it is complementary, because it exploits the different information that every modality captures. This statement is supported by the findings in [Taylor and Kleeman 2003]. In empirical tests, the researchers investigated optimal features in an application fusing visual cues from different sources. It was shown that the best results were obtained with features that are redundant in their values, but diverse in their errors.

This misunderstanding also led to contradictory research results. In multi modal biometrics, two groups independently determined performance improvement boundaries for the underlying data fusion problem: [Koval et al. 2007] claims that the best results are obtained with dependent inputs and the worst with independent ones, whereas [Poh and Bengio 2005] finds the exact opposite. This was not the only time there was a vivid discussion about whether it is better to utilize dependent or independent sources. Generally, current information fusion approaches are tailored to one or the other, so in practice a decision has to be made.

A related discussion concerns the optimality of early vs late information fusion. It is often said that cooperative fusion is best solved with early fusion, because then the dependency of the input sources can be exploited best.

It is claimed here that no fusion strategy is optimal in general. The fusion strategy has to be selected according to the fusion task, which includes considering the data structure of the learning problem (dependent or independent sources). If the data structure cannot be determined or varies over the task, the information fusion algorithm needs to be adaptive and able to perform both complementary and cooperative fusion.

The competitive strategy has rarely been investigated and is more related to source selection than to the classical combination of sources of the previous strategies. This strategy uses only the most accurate source(s) to create the 'fusion' result. Obviously, the largest problem is to determine the quality of the sources. If more than one source is selected, one of the previous strategies has to be applied to combine them.

As an example, the probabilistic 'winner-takes-all' learning algorithm can be named [Osman and Fahmy 1994]. Each competitor first calculates the probability of having caused the input, and then only the competitor with the highest probability is allowed to learn.

The same principle can be applied in a late fusion approach. Probabilistic approaches in general are appropriate for the competitive fusion strategy, because with their help the accuracy or reliability of the data models of the different inputs can be estimated. The goal of this strategy is to reduce the influence of weakly performing inputs, which can severely harm the performance of the other two fusion strategies if a majority of inputs is underperforming at the same time.

Another aspect of information fusion is the system architecture, which can be distributed or centralized. This is not directly relevant to the current work; further details on the fusion architecture can be found in [Valet et al. 2000].

In general, there are two main advantages of information fusion, presuming a properly designed fusion system. The first is that the influence of unreliable sources can be lowered compared to reliable ones [Aarabi and Dasarathy 2004]. This is of high practical relevance, because at system design time it is often not clear how the different features and modalities will perform in real-world environments. By applying information fusion, the dependency on the reliability of information sources can be reduced to some extent, which increases the system's robustness.

The second advantage of an appropriate fusion system is that it is always at least as effective as any of its parts, because the combination of several sources makes more information about the problem available [Kokar et al. 2004]. Hence, information fusion can improve the system's accuracy. At the same time, an inappropriate fusion system can lead to performance losses.
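A small simulation can illustrate both advantages (all noise levels and the inverse-variance weighting below are assumptions chosen for illustration): three noisy views of the same signal are fused with weights proportional to their estimated reliability, and the fused result has a lower error than any single source.

```python
import numpy as np

rng = np.random.default_rng(42)

# Ground truth and three sources observing it with different noise levels.
truth = rng.normal(size=10_000)
noise_levels = [0.3, 0.8, 1.5]
sources = [truth + rng.normal(scale=s, size=truth.size) for s in noise_levels]

# Weight each source by its inverse noise variance: more reliable sources
# get larger weights, so unreliable ones have less influence on the result.
weights = np.array([1.0 / s**2 for s in noise_levels])
weights /= weights.sum()
fused = sum(w * src for w, src in zip(weights, sources))

def mse(x):
    return float(np.mean((x - truth) ** 2))

print([round(mse(s), 3) for s in sources], round(mse(fused), 3))
```

The fused mean squared error stays below that of even the best single source, while a naive unweighted average would be dragged down by the noisiest input.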

Fig. 2.1: Functional model of data fusion system design by the JDL group [Llinas et al. 2004]


2.2 Functional model of data fusion

A functional model of information fusion system design was presented in [Llinas et al. 2004].

This approach to system design is closely related to learning theory and limited to data- and feature-level fusion. Since this work focuses on this kind of information fusion, the developed algorithm fits this model. Figure 2.1 depicts the model, which consists of four core levels (L0-L3) and two extension levels (L4, L5).

The first step (L0) is the discovery of patterns and meaningful relationships in the data through abductive, innovative reasoning. This is best approached with data mining and data association techniques. Thereafter, the discovered patterns are generalized and validated in models by applying inductive generalization (L1). The model parameters are estimated on a set of training data during situation refinement (L2). The last step (L3), which uses deduction, is the application of the model or template to real-time raw data in order to detect evidence of patterns similar to those the model was trained with. The extension levels (L4, L5) represent the process and user refinement levels.

This model is very fine-grained. Generally, data fusion algorithms implement only levels (L0) to (L3), which can also be combined into joint processing steps. For example, the feature selection and construction algorithm developed here covers levels (L0) to (L2), and its evaluation represents level (L3).
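The levels can be read as a conventional learning pipeline. The toy sketch below (data, feature scoring, and model are all illustrative assumptions, not the algorithm of this thesis) maps (L0)-(L3) onto simple numpy operations:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: the target depends on a hidden pattern in features 0 and 3.
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 3] > 0).astype(int)

# L0: discover candidate patterns, here via feature/target correlations.
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
selected = np.argsort(corr)[-2:]          # keep the two strongest features

# L1/L2: generalize the pattern into a model and estimate its parameters
# on the training data (a least-squares fit stands in for any learner).
w, *_ = np.linalg.lstsq(X[:, selected], y - 0.5, rcond=None)

# L3: deduction - apply the trained template to new raw data.
X_new = rng.normal(size=(50, 5))
predictions = (X_new[:, selected] @ w > 0).astype(int)
print(sorted(selected.tolist()))  # -> [0, 3], the discovered pattern
```

In this reading, the process and user refinement levels (L4, L5) would wrap the whole loop, e.g. by letting feedback adjust which features are considered at (L0).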

2.3 Information fusion for multimedia document retrieval and classification

For many information fusion tasks, it is clear which fusion strategy has to be exploited to achieve optimal results. However, in multimedia document retrieval and classification the optimal fusion strategy is situation dependent and can basically be any of the ones mentioned above. In the case of a rank aggregation task for retrieval, three different effects were observed that can be exploited for fusion [Vogt and Cottrell 1999]:

1. Skimming effect: the lists include diverse and relevant items - complementary fusion,
2. Chorus effect: the lists contain similar and relevant items - cooperative fusion,
3. Dark Horse effect: an unusually accurate result of one source - competitive fusion.
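The Chorus effect, for instance, is what makes simple score-summing rank aggregators work: items appearing in several lists accumulate evidence. A minimal CombSUM-style sketch (the document names and the 1/rank scoring are illustrative assumptions):

```python
from collections import defaultdict

def comb_sum(ranked_lists):
    """Sum a simple 1/rank score for each item over all input rankings."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / rank
    return sorted(scores, key=scores.get, reverse=True)

text_ranking = ["doc2", "doc1", "doc3"]   # e.g. from the text modality
image_ranking = ["doc1", "doc2", "doc4"]  # e.g. from the visual modality

print(comb_sum([text_ranking, image_ranking]))
# -> ['doc2', 'doc1', 'doc3', 'doc4']: doc1 and doc2 benefit from the
#    Chorus effect, while doc3 and doc4 each appear in only one list
#    (Skimming effect) and a Dark Horse source would dominate its ranks.
```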

Similar effects can be observed for general multimedia document processing tasks. For example, consider the availability of some example multimedia documents that can be used to learn a model for subsequent retrieval or classification. The Skimming effect translates to diverse features that are relevant to describe a semantic object with high concept variation; it is as if every instance shows another important aspect of the semantic object. The Chorus effect can be observed for semantic objects with small concept variation, where many of the relevant features are similar across the different instances. The Dark Horse effect appears when one instance of the multimedia documents describes the semantic object very well, whereas the other instances are less representative and hence distort the object's description.

Thus, it comes as no surprise that the task of multimedia document information retrieval and classification, i.e. the joint processing of images and texts or videos, was approached in the past with some success using cooperative as well as complementary strategies. But it was found that none of the strategies was generally successful, which is due to the different data structures that can underlie the learning task.

2.4 Summary

So far, the design of information fusion systems has been intensively investigated in terms of the selection of sensors and sources of information, feature extraction and fusion level. The importance of the fusion strategy, however, has been widely ignored.

Most approaches that are used for information fusion (see Chapter 4) are designed such that they perform complementary, cooperative or competitive fusion. It is claimed here, though, that the optimal strategy depends on the data structure of the learning target. Furthermore, it is suggested that for information fusion by concatenation the data structure differs from task to task and can even vary from learning target to learning target within the same task.

The feature selection and construction algorithm developed in this work tries to overcome this shortcoming. It is designed such that it can adapt the fusion strategy to the data structure of the learning problem through the detection and exploitation of dependent and/or diverse features.
