Chapter 4 Meta-mining as a Classification Problem 39

4.6 Experiments

4.6.2 Meta-mining Results

The main limitation of the meta-learning task is that it is not possible to generalize over the learning workflows. There is no way we can predict the performance of a DM workflow w unless we have meta-learned a model based on training meta-data gathered through systematic experimentation with w itself. To address this limitation, we introduce now the three meta-mining tasks described in Section 4.4.2 which exploit both dataset and workflow descriptions, and provide the means to generalize not only over datasets but also over workflows, or both.

In all the three meta-mining tasks 1, 2 and 3, we have a single meta-mining problem in which each instance corresponds to a base-level DM experiment where some workflow w is applied to a dataset x; the class label is either best or rest, determined with the same rule as described above. We thus have 65×35 = 2275 meta-mining instances.

The description of an instance combines both dataset and workflow meta-features, i.e.

we simply concatenate for a learning instance its dataset and workflow descriptors asso-ciated with the respective class label. This representation makes it possible to predict the performance of workflows or datasets which have not been encountered in previous DM experiments, provided they are represented with the set of descriptors used in the predictive meta-model. The quality of performance predictions for such learning entities clearly depends on how similar they are to the past ones based on which the meta-model was trained. However, under this description, the instances of this meta-mining dataset are not independent, since they overlap both with respect to the dataset descriptions and the workflow descriptions. Despite this violation of the learning instance independence assumption, we also applied standard classification algorithms as a first approach where we handled performance evaluation with precaution for each of the three meta-mining tasks.

In the meta-mining task 1 (which corresponds to the algorithm selection task), we evaluated the predictive performance using leave-one-dataset-out, i.e., we removed all meta-instances associated with a given dataset x and placed them in the test set. We

4.6. Experiments

built a predictive model from the remaining instances and applied it to the test instances.

In this way, we avoided the risk of information leakage incurred in standard leave-out-out or cross-validation, where both training and test sets are likely to contain instances (experiments) concerning the same dataset. Under this formulation, the total number of models built was equal to the number of datasets; for each datasetx∈X, the meta-level training set contained 64×35 = 2240 instances and the test set 35, corresponding to the application of the 35 workflows tox. With this method, the overall error is estimated by average over the 65 datasets. We denote this error by:

ErrorM M Aalgo = 1 the actual class, and algo is the classification algorithm that was used to construct the meta-mining modelsh which combine dataset and workflow descriptors.

In the meta-mining task 2 (which corresponds to the dataset selection task), we did exactly the reverse procedure to task 1 where we evaluated predictive performance us-ing leave-one-workflow-out, i.e., we removed all meta-instances associated with a given workflow w and placed them in the test set. We built then a predictive model from the remaining instances and applied it to the test instances. Under this task, the total number of models built was equal to the number of workflows; for each workflowwinW, the meta-level training set contained 65×34 = 2210 instances and the test set 65, corresponding to the 65 datasets on which we applied the wworkflow. Now, the overall error is estimated by average over the 35 workflows that we denote by:

ErrorM M Balgo = 1 the actual class, and againalgo is the classification algorithm that was used to construct the meta-mining modelsh.

In the meta-mining task 3, we investigated the possibility of predicting the performance of pairs of (dataset, workflows) that have not been included in the training set of the meta-mining model. The evaluation was done as follows: in addition to leave-one-dataset-out, we also performed leave-one-workflow-out, removing from the training set all instances of a given workflow. In other words, for each training-test set pair, we removed all

meta-mining instances associated with a specific datasetxas well as all instances associated with a specific workfloww. We thus did 65×35 = 2275 iterations of the train-test separation, where the training set consisted of 64×34 = 2176 instances and the test set of the single meta-mining instance (xi,wj, label). We denote the overall error thus estimated by:

ErrorM M Calgo = 1

whereh(xi,wj) is the predicted class of thewj left-out workflow on thexileft-out dataset, y(xi,wj) the actual class, andalgois as before the classification algorithm that was used to construct the meta-mining modelsh.

Table 4.2 gives the estimated errors for the three meta-mining tasks 1, 2 and 3, in which the meta-models were also built usingJ48 with the same parameters as in the meta-learning task, but this time using both dataset and workflow characteristics. McNemar’s test was also used to compare their performance against the default rule. Note that the default error is the same for all the four tasks since it is estimated with the same rule for all class labels regardless of the learning entity. Column 2 shows the error rate using leave-one-dataset-out error estimation (ErrorM M AJ48), which is significantly lower than that of the default rule, but also lower than ErrorM LJ48, the error rate of the meta-learning model built using dataset characteristics alone. Though there is no statistical difference between these two experiments for algorithm selection, this provides some evidence of the discriminatory power of the frequent patterns discovered in the DM workflows.

Column 3 shows the error rate using leave-one-workflow-out error estimation (ErrorM M BJ48), which is by far the lowest error among the four tasks and which is thus significantly better than that of the default rule. Compared to algorithm selection, dataset selection appears thus to be a less difficult task, providing the fact that the performance of a new workflow can be more easily predicted using a combination of dataset and workflow characteris-tics than predicting the performance on a new dataset. In other words, the meta-mining models built in this task were more robust in modeling the performance behavior of new workflows according to their conceptual similarities with experimented past workflows than in the first task where workflow performances had to be predicted for a new dataset with respect to its characteristics. These results provide hence again evidence of the util-ity of the frequent workflow patterns discovered in the DM workflows and illustrate how far the representational power of dmop can uniquely characterize algorithms/workflows in terms of their learning behaviors.

Finally, column 4 shows the error rate of the meta-mining models built using the

4.7. Discussion

combined leave-one-out-dataset plus leave-one-workflow-out error estimation procedure (ErrorM M CJ48), which was meant to test the ability of the meta-mining models to predict the performance of completely new pairs of (dataset, workflow). The estimated error rate is again lower than the baseline error, though not significantly. However, it demonstrates the point that our approach can predict the performance of pairs of (dataset, workflow) never yet encountered in previous DM experiments. As a matter of fact, these new pairs can even contain in their workflow operators that implement algorithms never seen in previous experiments, provided these algorithms have been described indmop.

4.7 Discussion

In this Chapter, we have described the main setting of our meta-mining framework. We have seen how to extend the original Rice model by taking into account relational workflow descriptors that have been extracted from the graph structure of DM workflows with the help of thedmopontology. This makes our framework a better approach to meta-analyze the whole DM process than the meta-learning approach since a meta-mining model will naturally blend together the learning of dataset and workflow characteristics under which a given pair of dataset-workflow will perform well. As a result, we are no more restricted to learn a model over the specific learning entities of algorithms. Instead we can address with the same model the three selection tasks of selecting workflows for a given new dataset, selecting datasets for a given new workflow, and selecting the best new pairs of dataset-workflow, where the latter two tasks can not be addressed by meta-learning.

The experiments conducted on our real-world meta-mining problem in the previous section have also showed that we can achieve a higher classification accuracy by training the same classification algorithm, hereC4.5, using both dataset and workflow descriptors than using only dataset descriptors. In other words, we have been able to discover with the meta-mining model discriminative combinations of dataset and workflow characteristics with which we can describe the learning conditions for which algorithms/workflows will perform well on our biological datasets.

In addition, we should note that a meta-mining model has a better, more intuitive, interpretation than a model solely built on dataset descriptors; that is, with a meta-mining model, we can better understand which type of workflows is the most appropriate for which type of datasets and vice-versa, by looking at the model’s correlations between frequent workflow patterns and dataset characteristics (we discuss this in more details in Section 5.5.3). We think that such understanding is very important in meta-mining since we would like to understand why a particular workflow is better suited for a given mining problem

than one other. A more in-depth discussion of the quality of a meta-mining problem, together with an example of a C4.5 meta-mining model, can be found in (Hilario et al., 2011).

4.7. Discussion

Meta-mining as a Recommendation Problem

5.1 Introduction

In this Chapter, we will present a metric-learning approach (Wang et al., 2012; Weinberger and Saul, 2009) for hybrid recommendation in meta-mining. In Section 5.3, we discuss related works, principally those related to recommender systems. Then we describe in Section 5.4 three metric-learning algorithms to address the three selection tasks proposed in this thesis. We provide experimental results in Section 5.5 where we compare our methods with a number of baselines. Finally we discuss our approach in Section 5.6.

5.2 Motivation

In Chapter 4, we introduced the three meta-mining tasks which we want to address with our framework. These were: i) selecting the best performing DM workflows for a given new dataset, ii) selecting the datasets on which a given new workflow will perform the best, and iii) measuring the performance of a new workflow on a new dataset to select the best never-experimented pairs, all of which were based on building a meta-mining model that combines dataset and workflow characteristics, see Section 4.4.3. We approached these tasks by modeling them as standard classification problems: the class label was determined on the basis of the performance measure estimated by the application of the

5.2. Motivation

workflow on the dataset and was indicating the appropriateness or not of a workflow on a dataset by either bestor rest, see Section 4.5.2. Although binary classification can model the appropriateness of workflows on datasets, it oversimplifies the problem, i.e. the used labels do not give us an appropriate estimate on how good we are in fitting the performance of a workflow on a dataset.

In this chapter, we will take a different approach on the modeling of our meta-mining tasks; we will view them as recommendation problems between datasets and DM work-flows, where the task is to deliver a ranked list of workflows (or datasets for Task 2, or pairs of dataset-workflow for Task 3) based on the estimated scores of the target instance given by the built model. This follows the meta-learning approach to treat algorithm selection as a ranking problem (Brazdil, Soares, and Da Costa, 2003; Kalousis and Theoharis, 1999;

Soares and Brazdil, 2000). But differently from the previous meta-learning approaches which were using mainly non-parametric methods like instance-based learning, we will present here a new metric-learning-based approach to hybrid recommendation for meta-mining, in which we will learn how to match dataset descriptors to workflow descriptors.

More specifically we will learn three different metrics: one on the dataset descriptor space which will reflect the fact that similar datasets will have similar workflow preferences, as these are given by a performance-based preference matrix; one on the workflow descriptor space which will reflect the fact that similar workflows will have similar dataset preferences again as these are given by a performance-based preference matrix; and a last heteroge-neous metric over the two spaces of dataset and workflow descriptors, which will directly give the similarity/appropriateness of a given dataset for a given workflow.

To the best of our knowledge, the metric learning approach that we present here is the first of its kind, not only for meta-mining but also for the general context of hybrid recommendation problems (Adomavicius and Tuzhilin, 2005; Burke, 2002); even though it was developed to address the specific requirements of the meta-mining setting, it is not specific to it and it can be used in any kind of recommendation system that has similar requirements, i.e. preference-based matching of users to items with description on them, side-information, with which we can address the well-known cold-start problem for users and/or items (Schein, Popescul, Ungar, and Pennock, 2002), that is the problem of giving item recommendations to users in the absence of historical data for any of these two.

The main contribution of this Chapter is as follows:

We propose to address meta-mining as a hybrid recommendation problem with the cold-start setting. Now the meta-mining tasks will be to recommend workflows for given a new dataset, to recommend datasets for given a new workflow, and eventually

to recommend a new workflow for a given new dataset, where the recommendation criterion will be a performance measure like classification accuracy of a workflow on a dataset. We will address these three recommendation tasks with metric-learning – an approach that has never been proposed in the meta-learning literature – where we will learn three different metrics in the feature spaces of datasets and workflows to reflect a performance-based preference matrix between these two.

This Chapter is based on the following publication.

Phong Nguyen, Jun Wang, Melanie Hilario, and Alexandros Kalousis. Learning heterogeneous similarity measures for hybrid-recommendations in meta-mining. InIEEE 12th International Conference on Data Mining (ICDM), pages 1026 1031, dec. 2012.

5.3 Related Work

The meta-mining problem formulation we will give in this chapter is closely related with the work of hybrid recommender systems (Agarwal and Chen, 2009, 2010; Burke, 2002; Stern, Herbrich, and Graepel, 2009; Stern, Samulowitz, Herbrich, Graepel, Pulina, and Tacchella, 2010). There the goal is to accurately recommend items to users using information on historical user preferences and descriptors of items and users. Examples of recommender user and item descriptors can be found for instance on the MovieLens dataset1 where we have demographic or activity information on users such as age, gender and occupation, and taxonomic information on movies such as genre and release date.

State of the art recommender methods (Agarwal and Chen, 2009, 2010; Stern et al., 2009) rely on matrix factorization methods to directly approximate a preference matrix made of the rating that users give to items. In Stern et al. (2009), the authors propose MatchBox, a Bayesian approach where a probabilistic bi-linear rating model is inferred by a combination of expectation propagation and variational message passing. Users and items features are modelled with Gaussian priors into two matricesUandVof latent traits, the inner product of which defines user-item similarities. Variational approximation on users and items is then used to regularize the latent factors. Their experiments on the MovieLens dataset showed that including user and item descriptors improves performance.

In Stern et al. (2010), the same authors propose to applyMatchBox on hard combina-torial problems like Quantified Boolean Formulas (QBF) and combinacombina-torial auctions to


5.4. Learning Similarities for Hybrid Recommendations

automatically select for the given problem the best algorithm from a given portfolio. They use descriptions of the problem together with past algorithm performances to train on-line a bi-linear model such that the model can be continuously adapted over time to changing domains. Experiment results show increase in performance against self-adaptive solvers.

Agarwal and Chen (2009, 2010) propose also a generative probabilistic model where the model fitting is done by a Monte Carlo EM algorithm with no variational approximations.

They regularize their model using a regression-based approach on user and item factors, where the latter are determined using topic modelling (Agarwal and Chen, 2010). They also experimented on the MovieLens dataset and showed that a model based on the meta-data had a weak predictive performance while their regression-based approach to the latent factors regularization gave the best performance improvements.

Our metric-learning-based approach to the problem of hybrid recommendation will use a very different regularization approach in learning the projection matrices, we will constraint them so that they will reflect in the original feature spaces the similarities of the respective preference scores, an approach which in our meta-mining experiments had the best performance. As we will see in the next Section, one additional advantage of the use of the two linear projection matrices learned in the dataset and workflow spaces is that we can now naturally handle the out-of-sample problem, i.e. the cold-start problem in recommender systems, which is not the case with the typical matrix factorization models.

5.4 Learning Similarities for Hybrid Recommendations

Before starting to describe in detail how we will address the three meta-mining tasks let us take a step back and give a more abstract picture of the type of learning setting that we want to address. In the followings, we will use the same notation as these are provided in Section 4.3. Recall that, in meta-mining, we have two types of learning instances, x∈ X and w∈ W, and two training matrices, X:n×dand W:m× |P|, respectively.

Additionally, we also have an instance alignment or preference matrixY:n×m, theYij

entry of which gives some measure of appropriateness, preference, or match of thexi and wj instances, see Section 4.5.2. We will denote byyxi theith line of theY matrix which

entry of which gives some measure of appropriateness, preference, or match of thexi and wj instances, see Section 4.5.2. We will denote byyxi theith line of theY matrix which

Dans le document Meta-mining: a meta-learning framework to support the recommendation, planning and optimization of data mining workflows (Page 75-0)