As was said in the previous section, the complexity of the boolean expression used to describe the learning target determines the PAC-learnability of the concept. In the following, common complexity measures found in the literature are listed.

One example of a complexity measure is the class entropy. It predicts the difficulty of a concept from how well always guessing the most frequent label would perform. Another is blurring, which averages the class entropy conditioned on each attribute over all attributes. Both are early and simplistic measures that have not proved very predictive. One problem is that they account neither for irrelevant features nor for the number of relevant features on which the complexity of a learning task depends [Perez and Rendell 1996].
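To make the two measures concrete, the following minimal Python sketch computes the class entropy and a blurring-style average of conditional class entropies on a toy boolean dataset. The function names and conventions (entropy in bits, uniform weighting over attributes) are illustrative assumptions rather than the exact formulations of the cited work.

```python
import math
from collections import Counter

def class_entropy(labels):
    """Entropy of the label distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def blurring(examples, labels):
    """Average class entropy conditioned on each single attribute.

    `examples` is a list of equal-length attribute tuples."""
    n_attrs = len(examples[0])
    per_attr = []
    for i in range(n_attrs):
        # group the labels by the value of attribute i and weight by group size
        groups = {}
        for x, label in zip(examples, labels):
            groups.setdefault(x[i], []).append(label)
        cond = sum(len(g) / len(labels) * class_entropy(g) for g in groups.values())
        per_attr.append(cond)
    return sum(per_attr) / n_attrs

# toy concept: XOR of the first two attributes, third attribute irrelevant
X = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
y = [a ^ b for a, b, _ in X]
print(class_entropy(y))   # 1.0 bit: balanced classes
print(blurring(X, y))     # 1.0 bit: no single attribute reduces the entropy
```

On this XOR toy data both measures stay at one bit, which illustrates why they flag neither the irrelevant third attribute nor the parity structure.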

Another complexity measure has been derived from the viewpoint of concept variation [Perez and Rendell 1996]. It measures the roughness of the target concept by means of its surface in the instance space. If two neighboring examples (e.g. at a Hamming distance of 1) do not belong to the same class, then concept variation is high. Put differently, different classes overlap in the instance space when purely visual similarity is considered. The measure reliably detects the difficulty of parity and random problems because it describes the complexity in terms of the type of boolean operation and the number of attributes involved. Concept variation can be analytically related to DNFs and their properties.
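The following sketch illustrates the idea on complete boolean truth tables: it counts the fraction of Hamming-distance-1 neighbour pairs that receive different labels. This is a simplified reading of the measure, not the exact normalisation used in [Perez and Rendell 1996].

```python
from itertools import product

def concept_variation(concept, n_attrs):
    """Fraction of Hamming-distance-1 neighbour pairs with different labels.

    `concept` maps a tuple of n_attrs boolean values to a class label (0/1)."""
    differing, total = 0, 0
    for x in product((0, 1), repeat=n_attrs):
        for i in range(n_attrs):
            neighbour = x[:i] + (1 - x[i],) + x[i + 1:]
            total += 1
            if concept(x) != concept(neighbour):
                differing += 1
    return differing / total

# a smooth single-attribute concept vs. the maximally rough parity concept
print(concept_variation(lambda x: x[0], 4))         # 0.25
print(concept_variation(lambda x: sum(x) % 2, 4))   # 1.0
```

A single-attribute concept scores low, while parity, whose label flips under every single-bit change, reaches the maximum.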

[Abu-Mostafa 1988] uses the Kolmogorov complexity as a measure of randomness or lack of structure. It is defined as the length of the shortest program for a universal Turing machine that generates the truth table of the machine learning problem. Since the Kolmogorov complexity is small even for short programs that run exceedingly long (as long as they eventually terminate), a time-bounded notion of randomness that penalizes long computation times was introduced. A further modification introduces a fault tolerance, which allows random problems that have a structured approximation to be detected. Furthermore, the measure has been used to prove that the majority of all 2^(2^N) possible problems that can be defined with N boolean attributes are actually random.
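Kolmogorov complexity itself is uncomputable, so any practical check has to fall back on a proxy. The sketch below uses the zlib-compressed length of a concept's truth table as such a crude stand-in; this proxy, the helper names and the toy concepts are illustrative assumptions, not the constructions used in [Abu-Mostafa 1988].

```python
import random
import zlib
from itertools import product

def truth_table_bits(concept, n_attrs):
    """Truth table of a boolean concept, one '0'/'1' byte per input assignment."""
    return bytes(ord('0') + concept(x) for x in product((0, 1), repeat=n_attrs))

def compressed_length(concept, n_attrs):
    """Crude proxy for Kolmogorov complexity: zlib-compressed size of the truth table."""
    return len(zlib.compress(truth_table_bits(concept, n_attrs), level=9))

N = 10                                       # there are 2**(2**N) distinct concepts over N bits
structured = lambda x: x[0] & x[1]           # highly regular concept
table = [random.randint(0, 1) for _ in range(2 ** N)]
random_concept = lambda x: table[int(''.join(map(str, x)), 2)]

print(compressed_length(structured, N))      # small: the table is very regular
print(compressed_length(random_concept, N))  # much larger: little structure to exploit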

[Blumer et al. 1989, Almuallim and Dietterich 1994, Laird and Laird 1990] use the Vapnik-Chervonenkis (VC) dimension to determine the PAC-learnability and hence the complexity of classes of concepts. The VC-dimension is a measure of the capacity of a learning machine in statistical classification. It is the largest number of points in the input space that the machine can shatter, i.e. for which it achieves zero training error under every possible labeling. To determine the number of instances needed to make a problem PAC-learnable, the VC-dimension is defined on the concept class and its instance space: it equals the size of the largest set of instances that can be shattered.
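A brute-force check of the shattering condition makes the definition tangible. The sketch below tests, for a small finite pool of 1-D threshold classifiers, whether a given point set can be labelled in every possible way; the hypothesis class and helper names are chosen purely for illustration.

```python
def is_shattered(points, hypotheses):
    """True if `hypotheses` realises every possible labelling of `points`."""
    realised = {tuple(h(p) for p in points) for h in hypotheses}
    return len(realised) == 2 ** len(points)

# 1-D threshold classifiers h_t(x) = 1 if x >= t else 0
thresholds = [lambda x, t=t: int(x >= t) for t in (-1e9, 0.5, 1.5, 2.5, 1e9)]

print(is_shattered([1.0], thresholds))        # True: a single point can be shattered
print(is_shattered([1.0, 2.0], thresholds))   # False: the labelling (1, 0) is unrealisable
# hence the VC-dimension of 1-D thresholds is 1
```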

The latter two measures are mostly interesting from a theoretical point of view, but have little value for practical applications. Later on, some experiments will investigate the concept variation’s ability to describe the complexity of concepts.

3.4 Summary

In the practical application of fusion by concatenation, complete relevance and complete irrelevance of the features are not assumed to occur. Partial and indirect feature relevance, on the other hand, are both likely to occur. Therefore, the algorithm developed in this work is designed to perform selective and constructive inductive learning, and thus feature selection and feature construction.

Furthermore, it is assumed that feature relevance is related to feature interactions. In this sense, redundant feature interactions describe the direct relevance of some features in the selective inductive learning scenario. This type of interaction is easy to detect and is best exploited by the cooperative fusion strategy.

Then, it is suggested that hidden feature relevance is related to synergistic feature interactions, which are best exploited by the complementary fusion strategy. As was shown in this section, synergies are difficult to learn because multivariate tests are needed to detect the indirect relevancies. Therefore, a systematic search of the multivariate search space is needed to detect all hidden feature relevancies.

Chapter 4 Related work

The main problem of multi modal information fusion by concatenation is to determine high-level meanings or semantics from uninformative, generic, low-level and high dimensional input attribute sets. This problem is called the 'semantic gap', and solving it helps to improve classification and retrieval performance on multi modal data. Otherwise, multi modal learning tasks, even though they have more information at hand, can show poor performance instead of the intuitively expected improvements. This can be explained by the actual relevant information being obscured by many irrelevant features [Koller and Sahami 1996], hidden relevance and noise. The influence of redundant features is even worse, as was shown in [Liu and Motoda 2008], since they can cause over-fitting of the data. Other important side effects of removing such features are the reduction of computational complexity, memory usage and the number of training examples needed for robust learning.

This chapter reviews works on classical information fusion, divided into greedy and non-greedy information fusion, as well as works on dimensionality reduction, feature selection and boolean association rule learning that can be used to facilitate multi modal information fusion.

4.1 Greedy information fusion

In general, an information fusion method is greedy or myopic when [Freitas 2001]:

1 it assumes independence and/or full relevance of the features,

2 it iteratively considers only one attribute at a time - full search on bivariate relations,

3 it uses at each step the locally best choice - heuristic search on multivariate relations.

(1) is the obvious case. Here, all relations between the features and between the features and the class label are ignored. The methods that fall into (2) exploit feature-class or feature-feature relations, but they are still greedy because they only take bivariate dependencies into account even though they work in a multivariate setting [Nemenman 2004]. In (3), multivariate relations between the features and the class are evaluated, but the search heuristic is guided by local and thus greedy decisions, which can lead to sub-optimal results. In summary, greedy methods cannot cope with complex attribute interactions and can therefore not detect high-level semantic meanings in data.
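A small sketch makes case (2) concrete: a myopic filter that scores each attribute by its bivariate mutual information with the class assigns zero relevance to both attributes of an XOR concept and thus cannot separate them from a genuinely irrelevant attribute. The estimator below is a plain plug-in estimate on a toy truth table, chosen only for illustration.

```python
import math
from collections import Counter
from itertools import product

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# XOR concept: no single attribute carries any information about the class,
# so a bivariate filter ranks the two relevant attributes no higher than the irrelevant one
X = list(product((0, 1), repeat=3))
y = [a ^ b for a, b, _ in X]

for i in range(3):
    print(f"I(x{i}; y) = {mutual_information([x[i] for x in X], y):.3f}")
# all three scores are 0.000, although x0 and x1 jointly determine y
```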

In the early years of information fusion research, scientists fused different information sources by assuming independence between them. One of the first works on classifier and decision fusion used this principle to fuse neural network outputs [Tumer and Gosh 1999]. The independence assumption is still widely used in machine learning, most famously in the naive Bayes classifier. Its success is based on its simplicity, both in calculation and in the learned models, as well as its robustness in estimating the evidence [Jakulin and Bratko 2003a].
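A minimal sketch of decision-level fusion under the independence assumption: per-modality posteriors are combined naive-Bayes style by multiplying them and dividing out the shared prior. The modality names and numbers are invented for illustration only.

```python
import numpy as np

def fuse_independent(posteriors_a, posteriors_b, priors):
    """Combine per-modality posteriors assuming conditional independence.

    P(c | a, b) is proportional to P(c | a) * P(c | b) / P(c)."""
    fused = posteriors_a * posteriors_b / priors
    return fused / fused.sum()

priors = np.array([0.5, 0.5])
p_image = np.array([0.7, 0.3])   # posterior from the image modality
p_text = np.array([0.6, 0.4])    # posterior from the text modality
print(fuse_independent(p_image, p_text, priors))  # approx. [0.78, 0.22]
```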

The independence assumption can be appropriate for tasks that fuse truly independent or complementary sources. This strategy has been applied successfully, for example, to multimedia retrieval [Wu et al. 2004, Bruno et al. 2008], multi modal object recognition [Wu et al. 2002] and multi-biometrics [Poh and Bengio 2005].

Despite the successful application of the independence assumption to some fusion tasks, it fails completely for others, cooperative fusion tasks for example. In [Koval et al. 2007], it is shown that violating the independence assumption hurts information fusion performance. This loss in performance was empirically demonstrated in [Dass et al. 2005], where the authors showed that the maximum performance improvement in their multi-biometrics application is achieved only if the statistical dependencies between the modalities are taken into account.

Examples of cooperative data fusion through greedy methods are found in web document retrieval [Zhao and Grosky 2002] and web image retrieval [Vinokurow et al. 2003].

These approaches exploit some form of feature dependence such as co-occurrence (LSI [Zhao and Grosky 2002, Deerwester et al. 1990], PLSA [Hofmann 1999]), correlation (kCCA [Vinokurow et al. 2003, Fortuna 2004]) or pairwise mutual information (greedy similarity based learning - SBL [Quinlan 1993]). The idea of latent semantic indexing (LSI) is to project features into a common semantic space. Pairwise mutual information expresses the co-occurrence of features through their joint probability density, which can then be used for data aggregation.
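As a rough illustration of the LSI idea, the sketch below projects a small co-occurrence matrix spanning two modalities into a low-dimensional latent space with a truncated SVD; the toy matrix and the choice of two latent dimensions are assumptions made only for this example, not the setup of the cited works.

```python
import numpy as np

def lsi_projection(X, k):
    """Project feature vectors into a k-dimensional latent semantic space via truncated SVD."""
    # X: (n_documents, n_features) co-occurrence / term-frequency matrix
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]          # document coordinates in the latent space

# toy corpus: text features (columns 0-2) and image features (columns 3-5)
X = np.array([[2, 1, 0, 1, 2, 0],
              [1, 2, 0, 2, 1, 0],
              [0, 0, 3, 0, 0, 2],
              [0, 1, 2, 0, 1, 3]], dtype=float)
print(lsi_projection(X, k=2))        # both modalities mapped into one 2-D semantic space
```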

To circumvent the problem of feature dependencies in data, other approaches try to create independence with the help of linear transformation methods such as independent component analysis (ICA), as reviewed in [Hyvarinen and Oja 2000]. Unfortunately, these methods are not able to eliminate all dependencies in the data, since they target only pairwise and linear feature dependencies [Vasconcelos and Carneiro 2002]. The authors showed empirically that their multi modal object recognition problem is affected by higher order dependency patterns, and hence the independence creation is incomplete. A similar result was found in [Vinokurow et al. 2003]. In their multimedia classification task, a Support Vector Machine (SVM) trained on data pre-processed with ICA was outperformed by an SVM on the original dataset. This shows that independence creation is not necessarily a successful strategy for multi modal information fusion.
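For completeness, a hedged sketch of ICA-based pre-processing using scikit-learn's FastICA on synthetic, linearly mixed data: the recovered components are decorrelated, but any higher-order dependency, e.g. a feature built from a product of sources, would survive the transform, which is exactly the limitation discussed above. The data and parameters are stand-ins, not the setups of the cited works.

```python
import numpy as np
from sklearn.decomposition import FastICA

# concatenated multi modal feature matrix (rows: examples, columns: features);
# the data here are synthetic stand-ins for illustration only
rng = np.random.default_rng(0)
sources = rng.uniform(-1, 1, size=(500, 3))
mixing = rng.normal(size=(3, 6))
X = sources @ mixing                       # linearly mixed, hence dependent, features

ica = FastICA(n_components=3, random_state=0)
X_independent = ica.fit_transform(X)       # components with minimised linear/pairwise dependence

# higher-order dependencies (e.g. a feature equal to the product of two sources)
# would survive this step, which is the limitation discussed above
print(np.round(np.corrcoef(X_independent, rowvar=False), 2))
```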