
6.7 Comparison to other feature selection and dimensionality reduction methods

This section compares the final FS/FC algorithm to other methods that perform dimensionality reduction or feature selection. Representative of the former are singular value decomposition (svd), principal component analysis (pca) and random projection with gaussian (rpgauss) and sparse (rpsparse) projection matrices. As feature selection methods, a greedy sequential forward and backward selection (SFS, SBS) and a FS with the maximum relevancy minimum redundancy (mrmr) criterion were chosen. Furthermore, the proposed FS/FC method is compared to the multivariate relational projections (mrp), which were developed to detect feature interactions in complex learning tasks [Perez 1997].

The MRP is thus a state-of-the-art algorithm whose goal is similar to the one of this work. Its results are taken from the publication, so only the classification error is available for comparison.

The random projection method uses once a gaussian and once a sparse matrix for projection, as presented in [Bingham and Mannila 2001]. For PCA and SVD, the functions provided by MATLAB are used, which are simple, non-optimized implementations. The feature selection with the maximum relevancy minimum redundancy criterion was run with the help of a MATLAB script provided by [Peng et al. 2005]. All these methods were validated for feature subset sizes or projection space sizes of k = [1..11], because they do not automatically determine the optimal result size. Subsequently, their minimum test classification error was averaged over 8 runs based on different training sample sets.
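The following sketch illustrates this validation protocol. It is not the thesis implementation (the original experiments used MATLAB); scikit-learn's PCA and a nearest-neighbour classifier are substituted here purely as placeholders, and the function names are hypothetical.

```python
# Minimal sketch of the validation protocol described above (assumption: the
# original experiments used MATLAB; scikit-learn and a kNN classifier are
# substituted here purely for illustration, and the function names are
# hypothetical).  For every projection size k = 1..11 the test error is
# computed, the minimum over k is kept, and the result is averaged over
# 8 runs with different training sample sets.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

def min_error_over_k(X_train, X_test, y_train, y_test, k_max=11):
    """Minimum test classification error over projection sizes k = 1..k_max."""
    errors = []
    for k in range(1, k_max + 1):
        pca = PCA(n_components=k).fit(X_train)          # fit projection on the TD only
        clf = KNeighborsClassifier().fit(pca.transform(X_train), y_train)
        errors.append(1.0 - clf.score(pca.transform(X_test), y_test))
    return min(errors)

def average_min_error(X, y, n_runs=8, train_size=0.1):
    """Average the minimum test error over several random training sample sets."""
    errs = [min_error_over_k(*train_test_split(X, y, train_size=train_size,
                                               random_state=run))
            for run in range(n_runs)]
    return float(np.mean(errs))
```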

The sequential forward and backward selection is also run using a function provided by MATLAB, where the selection criterion is the best fit with a binomial function. This method automatically determines the optimal size of the relevant feature subset.

Fig. 6.6: Average classification error (left column) and calculation time (right column) for the full feature set, multivariate relational projections (mrp), FS with the relevant features (relfeat), singular value decomposition (svd), principal component analysis (pca), random projection with gaussian (rpgauss) and sparse (rpsparse) projection matrix, the FS/FC algorithm with and without previous relevance detection (mifeatFin, mifeatFinNR), FS with maximum relevancy minimum redundancy (mrmr) and sequential forward and backward selection (SFS, SBS); panels (a, b) all concepts, (c, d) synergistic, (e, f) redundant, (g, h) mixed concepts.

The average test classification errors and the average calculation times for all these methods are plotted in Figure 6.6, both averaged over all concepts and split by concept type. For the average over all concepts (Figure 6.6(a)), it can be seen that the dimensionality reduction methods (svd, pca, rpgauss, rpsparse) and the greedy sequential forward and backward selection (SFS, SBS) perform worst.

The random projection methods do improve on the full feature baseline for small TD sizes, which is exactly the situation in which they are supposed to work well. But when more training data is used, their performance decreases and they are outperformed by the classification on the full feature set. SVD and PCA have a very similar classification error and improve the result of the baseline consistently, though only slightly. These methods also work better for smaller amounts of TD.

The plots of the classification error averaged over the different concept types (Figures 6.6(c), 6.6(e) and 6.6(g)) show that the dimensionality reduction methods perform best on synergistic data, where it has to be noted that the mixed concepts contain synergies too.

On the contrary, their performance on the redundant concepts is very poor. This is due to the nature of these methods, which try to detect diverse features in the input space. In general, the dimensionality reduction methods are at a disadvantage compared to the feature selection methods, because they do not take the learning target into account.

The performance of the sequential forward and backward selection can therefore be considered very poor, and it demonstrates the insufficiency of greedy feature selection methods. The SFS and SBS can improve on the full feature baseline only for the redundant concepts, but fail completely for the synergistic and mixed ones, where they always 'select' the full feature set.
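As an illustrative example (not taken from the thesis), consider a noise-free XOR concept: each feature on its own carries no information about the target, so a greedy forward search never finds a first feature worth adding, even though the pair of features determines the target completely.

```python
# Illustrative sketch of why greedy forward selection fails on a synergistic
# (XOR-like) concept; this example is not from the thesis.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, 1000)
x2 = rng.integers(0, 2, 1000)
y = x1 ^ x2                           # purely synergistic target

def empirical_mi(a, b):
    """Mutual information (in bits) between two discrete variables, from counts."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for i, j in zip(a, b):
        joint[i, j] += 1
    joint /= joint.sum()
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / np.outer(pa, pb)[nz])).sum())

print(empirical_mi(x1, y))            # ~0 bit: x1 alone looks irrelevant
print(empirical_mi(x2, y))            # ~0 bit: x2 alone looks irrelevant
print(empirical_mi(2 * x1 + x2, y))   # ~1 bit: jointly they determine y
```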

The feature selection with the MRMR criterion and the quick relevance detector that was proposed in Section 5.5.4 reach much lower classification errors than the previous methods.

With only a few instances available for training, the quick relevance detector happens to select features that are not relevant for the redundant concepts (see Figure 6.6(e)), which degrades its overall performance. The comparison of the approaches' performance for the different concept types shows that they improve all of them significantly over the baseline. This contrasts with the previous approaches, which perform well only on one concept type or another.

The best performance is achieved by the algorithms that detect and exploit feature interactions, namely the MRP and the FS/FC proposed in this work. Overall, the FS/FC algorithm outperforms the MRP for settings with small amounts of TD; for 10% TD their classification errors are similar. When the results on the different concept types are considered, performance differences between the two methods can be observed. For the mixed concepts the results are very similar to those over all concepts. The performances on the synergistic and redundant concepts, however, reveal the differences between the approaches: the FS/FC method targets the redundant and synergistic interactions equally and for this reason shows a better performance on the redundant concepts, which the MRP does not seem to target.

The multivariate relational projections create a model of the complex feature interactions that explicitly uses boolean operators, while the FS/FC only detects the interactions. This is probably why the MRP outperforms the FS/FC on the synergistic concepts.

The graph with the methods' average calculation times (Figure 6.6(b)) demonstrates that a gain in accuracy is usually paid for with an increased calculation time. The FS/FC algorithm that searches the full feature set is by far the slowest. The feature selection with the MRMR criterion and the quick relevance detector are faster and, additionally, their computation time is independent of the amount of TD that is used. The calculation times of the greedy FS and the dimensionality reduction methods, SVD, PCA and RP, are negligibly small.

The feature interaction detection methods cannot compete with the dimensionality reduction methods in terms of calculation time, but in terms of accuracy they clearly do. The results show that learning the data structure significantly improves the classification result for redundant and synergistic data structures. Another advantage over dimensionality reduction methods is that the results of the feature selection and construction are easily understood by humans, which in turn helps to better understand the learning problem.

6.8 Improved scalability through implementation of linear, sparse indexes

So far, the joint probability distribution estimation is done by counting the co-occurrences of the attributes and saving them in a multi-linear contingency table. This multi-dimensional representation takes up a lot of space in memory and, depending on the size of the memory, leads to memory overflow for datasets with around 15 variables. This section shows how the scalability of the FS/FC algorithm can be improved.

In a first step, the multi-linear contingency table is represented and accessed by a linear index $i_{lin}$:
$$i_{lin} = a_0 \cdot 2^0 + a_1 \cdot 2^1 + a_2 \cdot 2^2 + \ldots + a_n \cdot 2^n, \qquad (6.10)$$
where $[a_0, a_1, \ldots, a_n]$ is a binary code and each element represents one of the binary attribute values. This transfers the multi-dimensional contingency table into a simple vector that has the same number of elements as the multi-linear representation, but takes up less space in memory. It thus allows the computation of joint mutual information for higher dimensional problems without causing memory overflow, but it is still limited by the size of the full joint probability density. More specifically, the number of variables that can be handled is roughly doubled when the linear index is applied instead of the multi-linear one for representing a full joint probability density. But problem sizes of around 30 variables can still not be called truly high dimensional. This means that methods that require the full probability density, such as the iterative scaling method, are impractical for high dimensional real-world data.
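A minimal sketch of this linear indexing, assuming binary attributes stored as 0/1 integers (the helper names are hypothetical): each instance is mapped to its index $i_{lin}$ and counted into a single flat vector.

```python
# Sketch of the linear-index representation of the contingency table
# (hypothetical helper names; assumes binary 0/1 attributes).
import numpy as np

def linear_index(instance):
    """i_lin = a_0*2^0 + a_1*2^1 + ... + a_n*2^n for one binary instance."""
    return int(sum(int(a) << i for i, a in enumerate(instance)))

def dense_joint_distribution(samples):
    """Count co-occurrences in one flat vector and normalize to a distribution."""
    n_attrs = samples.shape[1]
    counts = np.zeros(2 ** n_attrs)      # still exponential in the number of attributes
    for row in samples:
        counts[linear_index(row)] += 1
    return counts / counts.sum()
```

The count vector still has $2^{n+1}$ entries, so the memory demand remains exponential; only the overhead of the multi-dimensional representation is avoided.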

The feature interaction detection in the FS/FC algorithm is based on information quantities and hence only needs the entropies, not the full probability distributions. Therefore, a sparse, linear index can be adopted that represents only the part of the probability distributions that is actually observed. When the multi-linear contingency table can be accessed by a linear index that represents a binary code, then it can also be accessed directly with the binary code.

This means that, in a kind of top-down fashion, one can also count how often the binary code of an instance is observed in the sample set to obtain its joint probability distribution. The result is again a simple vector of counts, assigned to a sparse set of binary codes that contains all instances that have been observed. To obtain a proper distribution, the counts have to be normalized to sum up to one. The sparse description of the contingency table takes up less space in memory than the full linear index when only parts of the input space are observed, which is generally the case when only a few training samples are available.
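The following sketch illustrates this sparse, top-down counting (illustrative only, not the thesis implementation; the helper names are hypothetical). Only the binary codes that actually occur in the sample set are stored, and the entropy can be computed directly from the observed part of the distribution.

```python
# Sketch of the sparse, top-down estimation of the joint distribution and its
# entropy (hypothetical helper names; not the thesis implementation).
from collections import Counter
import numpy as np

def sparse_joint_distribution(samples):
    """Map each observed binary code to its relative frequency."""
    codes = Counter(tuple(int(v) for v in row) for row in samples)
    total = sum(codes.values())
    return {code: count / total for code, count in codes.items()}

def entropy_bits(distribution):
    """Entropy in bits; unobserved codes have zero probability and contribute nothing."""
    p = np.fromiter(distribution.values(), dtype=float)
    return float(-(p * np.log2(p)).sum())
```

Memory then grows with the number of distinct codes observed in the training data rather than with $2^{n+1}$.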

Fig. 6.7: (a) Average classification error and (b) average calculation time for FS with the relevant features (relfeat) and the FS/FC algorithm with and without previous relevance detection (mifeatFin, mifeatFinNR), each of them with and without the utilization of the linear, sparse index.

This top-down Maximum Likelihood estimation is most efficient for computing high dimensional joint probability distributions and their entropies. That is why the implementation of the FS/FC algorithm uses the classic bottom-up counting and the linear index representation for joint mutual information terms of size $|I(\cdot)| < 9$; only for higher dimensional joint mutual information terms the top-down counting and the linear, sparse indexes are used.
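Building on the two sketches above, the dispatch between the two estimators could look as follows (a sketch under the assumption of binary attributes; `dense_joint_distribution`, `sparse_joint_distribution` and `entropy_bits` are the hypothetical helpers introduced earlier, and the threshold of 9 follows the text).

```python
# Sketch of the hybrid entropy estimation: dense, bottom-up counting for
# low-dimensional terms, sparse, top-down counting otherwise.
import numpy as np

def joint_entropy(samples, attribute_subset, dense_limit=9):
    """Joint entropy (in bits) of the selected attributes."""
    sub = samples[:, attribute_subset]
    if len(attribute_subset) < dense_limit:
        p = dense_joint_distribution(sub)        # flat vector over all 2^k cells
        p = p[p > 0]                             # drop unobserved cells
        return float(-(p * np.log2(p)).sum())
    return entropy_bits(sparse_joint_distribution(sub))
```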

Figure 6.7 plots the average classification error and computation time for the FS with the relevant features (relfeat) and the FS/FC algorithm with and without previous relevance detection (mifeatFin, mifeatFinNR). Each of them is computed once with a multi-linear index and once with a linear, sparse index (relfeatlin, mifeatFinlin, mifeatFinNRlin).

In terms of average classification error, it can be seen that the implementation of the linear index does not influence the algorithm's accuracy. The adoption of the linear, sparse index achieves only slight reductions of the calculation times, as can be seen in Figure 6.7(b). This can be explained by the small size of the dataset, for which the FS/FC search heuristic does not need to calculate many high dimensional mutual information criteria.

Chapter 7 presents results on datasets with up to 1400 attributes; these experiments could not have been performed without the improvements in scalability that were discussed here.

The optimized computation time of the FS/FC algorithm still rises slightly with an increase of the TD size. Furthermore, it strongly depends on the feature interaction type of the learning task as well as on its number of relevant features. Figure 6.8 plots the calculation time of the FS/FC method without previous relevance detection (mifeatFinNR) for the synergistic, redundant and mixed concepts vs their problem size for different sizes of the training set.

The dependence between the computation time and the problem size can be described with a square root function for many training samples (10%). Low dimensional learning tasks are solved in little time, whereas high dimensional concepts need exponentially more