Chapter 7 Meta-mining as a Learning to Rank Problem 101

7.5 Regularization

7.6.4 MovieLens

As we already discussed above the MovieLens recommendation dataset is quite different from the meta-mining. The descriptions of the users and the items are much more limited.

Here users are described only by four features: sex, age, occupation and location, and movies by their genre which takes 19 different values, such as comedy, sci-fi to thriller, etc. We model these descriptors using one-of-N representation.

In MovieLens, side information has been thus mostly used to regularize latent profiles not to predict them, see e.g the works of Abernethy, Bach, Evgeniou, and Vert (2006);

Agarwal and Chen (2010). Solving the cold start problem using such low informative side information is a very difficult problem and typical approaches prefer to rely only the Y preference matrix.

Evaluation setting

We use the same baseline methods as in the meta-mining problem. We train Lamb-daMART (LM) with the following hyper-parameter setting: we fix the maximum number of nodes for the regression trees to 100, the maximum percentage of instances in each leaf

node to 1% and the learning rate, η, of the gradient boosted tree algorithm to 10−2. As before for the construction of the ensemble of regression trees in LM-MF and LM-MF-Reg we use the same setting as in LambdaMART. We now optimize theirµ1 and µ2 regular-ization parameters within the grid [0.1,1,5,10,15,20,25,50,80,100]2 using five-fold inner cross validation. We fix the number of factorsr to 50 and the number of nearest neighbors in the Laplacian matrices to five. As in the meta-mining problem we used the NDCG similarity measure to define output-space similarities with the truncation level k set to the same value as the one at which we evaluate.

We will evaluate and compare the two methods under the two cold start scenarios described in Section 7.6.1. In theUser Cold Start problem, we randomly select 50% of the users for training and we evaluate the recommendations that the learned models generate on the remaining 50% which we use for testing. In the Full Cold Start problem, we still separate the users in the same manner in a training and testing set. However now in addition we also randomly separate the item sets in two sets of size 50% each and we use only one of the sets for training, the other set is left for testing. When we test the recommendations of the learned models we test over the testing users and over the testing items. For the full cold start scenario we report results on the 100K dataset, we expect results to be very similar in the 1M dataset. We give the results on the User Cold start scenario in Table C.4 and in Table C.5 for theFull Cold Start.


In the case of MovieLens,User Cold Startscenario, Table C.4, we can see first that the two methods we propose, LM-MF and LM-MF-Reg, beat in a statistical significant manner the simple UB baseline for both datasets, with the only exception of LM-MF at 100k for k= 5 which is not significantly better. In comparison, LambdaMART and its weighted NDCG variant, LMW, are never able to beat UB where the former is even significantly worse at 1M for k = 5. In addition both LM-MF and its regularized variant beat in a statistically significant manner LambdaMART in both 100K and 1M and for both values of k. LM-MF-Reg has a small advantage over LM-MF for 100K, it achieves a higher average NDCG score, however this advantage disappears when we move to the 1M dataset, which is normal because given the much larger size of available data there is no more need for an additional regularization. Moreover using a low-rank matrix factorization, we haver = 50 for the roughly 6000 provides already a strong regularization which makes unnecessary the need for the input/output space regularizers.

A similar picture arises in theFull Cold Startscenario, Table C.5. Except for 100K,k=

7.7. Discussion

10, our two methods, LM-MF and LM-MF-Reg, are both able to always beat significantly the FB baseline. Also LambdaMART and its weighted NDCG variant are now significantly better than FB in the 1M dataset where the latter is significantly better than the first in two cases: in 100k, k = 5 and in 1M, k = 10. When we compare LambdaMART with our two methods, both beat in a statistically significant manner this baseline in the two dataset with the little exception of LM-MF-Reg which is close to beat LambdaMART in a significant manner at 100K,k= 10,p-value=0.0722. However, the method which performs overall the best is no more LM-MF-Reg as before but LM-MF. There is no clear reason why we get this result for the full cold start scenario but this can be explained by the low description of movies in MovieLens which are only described by their genre, resulting in movie input/output space regularizers which are not as useful as the user regularizers in the user cold start setting.

7.7 Discussion

In this chapter, we have explored the use of the state-of-the-art learning to rank algorithm lambdaMART in a recommendation setting. Since only top items are observable by users in real recommendation systems, rank loss functions that focus heavily on the correctness of the top item predictions are more appropriate for them. We focused on the cold start problem in the presence of side information describing users and items. Motivated by the fact that in recommendation problems users and items can be described by a small number of hidden factors we learn this hidden representation and factorize the preference scores as inner products over it. In addition we constraint the learned representations by forcing them to reflect the user/item similarities as they are given by their descriptors in the input space and their preference vectors over the preference matrix. We experimented with two very different recommendation problems and showed that the methods we proposed beat in a significant manner the original LambdaMART, under a number of evaluation settings.

In what concerns meta-mining, we viewed this problem as a learning to rank problem which is more appropriate than the standard classification and the least-square recommen-dation approaches, see Chapters 4 and 5. In meta-mining, what is of most importance is to provide for a given new dataset top-k workflow from which the DM user can select the ones that she/he wants to experiment. By optimizing directly the NDCG ranking loss function, LambdaMART-MF and its regularized variant deliver both the best top-k rec-ommendations among the different methods and baselines, with significant performance improvement under the three evaluated recommendation tasks ofMatrix Completion,User Cold StartandFull Cold Start. Note that in theUser Cold Startsetting the user

memory-based corresponds to the typical approach used in meta-learning in which workflow recom-mendations are provided by a dataset similarity measure (Kalousis and Theoharis, 1999;

Soares and Brazdil, 2000). As a result, this standard approach which only makes use of the dataset descriptors delivers the worst performance compared to the other methods which additionally make use of the workflow descriptors.

We should note that there are actually few works in meta-learning that are able to build a single model from pairwise algorithm performance comparison. Most of them require to build m(m−1)2 different models, one for each algorithm pair, and to aggregate them in order to deliver a ranked list of algorithms for a given new dataset, see e.g. the works of Kalousis and Hilario (2001); Kalousis and Theoharis (1999). For instance, the recent approach described in Sun and Pfahringer (2013) proposes a two-step approach to build a model which can predict the preference vector for a given new dataset. In this approach, the authors build a random forest ensemble of approximate ranking trees, ART, from a propositionalized version of a set of pairwise meta-rules, each of which describes if algorithmA should be preferred over algorithm B on a given base-level dataset. Their motivation is that ensemble algorithms usually outperform base algorithms in pairwise algorithm performance comparison (Kalousis and Hilario, 2001). Here we also use ensemble learning but we do this in a single step in which we directly optimize a ranking loss function using dataset and workflow descriptors to predict their latent profiles, the inner product of which gives the preference score of a dataset-workflow pair. As a result, our approach is very robust and can address the different cold start settings of providing recommendations for new datasets, new workflows, or even both.

7.7. Discussion

Conclusion and Future Work

In this Thesis, we presented a new framework for the meta-analysis of DM processes, referred asmeta-mining or DM process-oriented meta-learning. The proposed framework extends in several directions the state-of-the-art approaches in automating the DM pro-cess, including meta-learning and ontology-based DM workflow planning systems, two approaches which are not designed to support the selection, recommendation and build-ing of DM workflows in terms of performance optimization, see Chapter 2. By relybuild-ing on a novel DM ontology, dmop, which we used to extract structural characteristics of DM workflows that we associated with dataset characteristics, our framework extends the traditional algorithm selection task in meta-learning to workflow selection, where it is no more restricted to select DM workflows from a fixed pool of base-level workflows but it can eventually select workflows that have never been experimented before. It can also address the task of selecting datasets for a given new workflow, as well as selecting the best pair of dataset-workflow, we describe these three meta-mining tasks in Chapter 4. In addition, we also addressed the task of planning automatically DM workflows in a novel manner by combining meta-mining with an AI-planner, providing thus to our knowledge the first planning system that is able to design DM workflows which best match a given dataset in terms of the relative performance of the former on the latter, see Chapter 6.

It is worth noting that the above meta-mining tasks are all novel tasks that have never been considered in the meta-learning literature, including the task of optimizing the plan-ning of DM workflows. In our framework, we elaborated them as specific recommendation tasks with the overall goal to provide to the DM user a few number but relevant work-flows (datasets) for a given dataset (workflow respectively); i.e. to select a portfolio of

DM workflows that will work well on his/her datasets, and vice-versa. Since meta-mining bears several similarities with recommender systems, such as recommending workflows that have similar performances on a given dataset, we built thus our framework by fol-lowing the recommendation approach where we combined together dataset and workflow descriptors with respect to a performance-based preference matrix that relates the latter with the former, in a similar manner to collaborative and content-based filterings (Su and Khoshgoftaar, 2009).

More specifically, we devised in Chapters 5 and 7 two new recommendation algorithms, one relying on metric-learning and the other one on learning-to-rank, both of which learn dataset and workflow low-rank latent profiles, similarities, over heterogeneous spaces. We followed in both approaches the meta-learning assumption that the similarity of datasets is given by the relative performance of DM workflows when applied on them (Kalousis and Hilario, 2001; Kalousis and Theoharis, 1999). In the same manner, we defined the similarity of workflows by the relative performance they have on datasets. We combined different similarity measures of datasets and workflows, one defined over their descriptors which we called input space similarity, and one defined over their respective performances which we called output space similarity, in order to regularize the low-rank latent profiles of datasets and workflows. As a result, the inner product of the dataset and workflow latent profiles can robustly estimate the performance-based preference score that the latter will deliver on the former. In addition, we designed our recommendation algorithms in such a manner that they can learn a function that maps dataset and workflow descriptors to the respective latent space in order to address the cold start problem, i.e. the problem of providing recommendations to datasets, respectively workflows, with which we do not have any historical data. To predict the profiles of new datasets and workflows with respect to their descriptions, we learned a linear mapping in the metric-learning algorithm, and an additive piece-wise linear mapping in the learning-to-rank algorithm. In the latter case, this gave us remarkable results on the different cold start scenarios.

In what concerns the planning of DM workflows, we proposed a novel planning ap-proach which combined an AI-planner together with our meta-mining framework, see Chapter 6. We viewed our planning task as optimizing a sequence of locally-based cold start recommendation problems; in each planning step, the meta-mining module can se-lect or recommend those valid partial candidate workflows submitted by the AI-planner whose descriptions best match the descriptions of the input dataset for the given mining goal. As a result, our planning system is the first one which can automatically design potentially new DM workflows whose expected performance on a given mining problem are maximized.

We evaluated our framework on a real-world problem under the meta-mining tasks described above with a number of different evaluation metrics such as mean average error, Kendall correlation and NDCG, to assess the quality of the recommendations provided by our algorithms. In most of the cases, we were always able to deliver the best results for the evaluation metric under consideration, with significant performance improvement against the different baselines and other methods including standard meta-learning approaches and state-of-the-art methods for matrix completion and learning-to-rank. We also ex-perimented our learning-to-rank algorithm on a large scale benchmark recommendation dataset, MovieLens, where we delivered very good results in the user cold start setting.

Overall, we investigated in this Thesis a novel framework that can robustly support the DM user in her/his DM process modeling. We believe that it is a promising research avenue which we would like to further explore in future work. As immediate work, we would like to analyze the learned models in order to have a better understanding of the quality of the dataset and workflow descriptors. This could give us feedback on which descriptors should be either refined or removed in order to improve our domain knowledge and thus our meta-mining framework. To do this, a first step is to investigate other specific domains, like physics or other medical/biological problems, than the cancer diagnostic/prognostic problem we experimented here, to see if our framework can also be applied on these domains. We would like also to investigate other recommendation problems with similar requirements, i.e. with descriptors on users and items. For instance, to predict if a user will retweet a tweet or not based on its social network and her/his past tweets in the Twitter blogosphere1. For more longer term work, we will describe in the following the most important research directions we would like to explore.

Multi-Task Learning First, we would like to explore meta-mining with regularized multi-task learning (MTL) (Evgeniou and Pontil, 2004). In MTL, we are givenT different tasks and we want to build for each taskt∈T a modelft(x) such that similar tasks will have similar models. Learning multiple tasks simultaneously generally leads to improved performance. We assume that each task t has nt instances and that they all share some d-dimensional feature space X. We also assume ft(x) to be a linear model, i.e. we have ft(x) = wTtx where we define wt = w0+vt and w0 is a global model learned over the instances of all tasks. The regularized MTL optimization problem is defined by:


whereLis a given loss function such as the square loss. The second term in Eq.(8.1) is the l2,1-regularization or group lasso which is known to introduce group sparsity and the third term thel2-regularization on thew0 global model. By solving Eq.(8.1) with e.g. SVM, we will learn thevt vectors such that their difference,||vj−vk||2, between two tasksj andk will be ”small” when the two tasks are similar to each other (Evgeniou and Pontil, 2004).

The multi-task approach can clearly profit to meta-mining. In meta-mining, we will say that each task t corresponds to a specific application domain, e.g. biology, physics, etc., each of which has a number nt of base-level datasets over which we apply exactly the same set of workflows. Thus, all domains will share the same meta-feature spacesX and W of datasets and workflows. Now, we want to learnt meta-mining models, one for each application domain, such that similar domains will have similar models, plus a global model U0 over all the base-level experiments. We will follow the matrix factorization model defined in this thesis where we will defineUt=U0+Wt whereWtwill be ant×r matrix and we keep intact them×rmatrixV,mis the number of workflows andr is the (low) rank of the matrices. So now Eq.(8.1) becomes:

U0min,Wt,V whereLcan be the factorized LambdaMART loss function, see Eq.(7.13), and|| · ||F is the Froebenius norm. The only difference with a standard MTL approach is that to measure the similarity between the Wt matrices we can not simply use their difference since we will have a different numberntof datasets for each domain. To solve this problem, we can simply use the expectation of the difference, E(||Wj −Wk||2), for two domains j and k by assuming thatWj and Wk follow both ar-multivariate normal distribution.

Multi-Criteria Recommendation System So far, we have been mainly considering the relative predictive performance of workflows on datasets to measure how good are the former on the latter. The reason was that this measure is the one that reflects the best the association between datasets and workflows. Also, we wanted to build recommendations for a given mining problem such that the DM user can obtain good predictive performance results with a number of experiments. There are however additional metrics which can be also relevant for the meta-mining recommendations such as speed and memory consump-tion of the DM workflows when applied on a given dataset. For instance, the DM user can prefer to have recommendations that give good performance but that are also relatively fast in their execution.

To take into account these additional measures, we will extend our meta-mining model,

To take into account these additional measures, we will extend our meta-mining model,

Dans le document Meta-mining: a meta-learning framework to support the recommendation, planning and optimization of data mining workflows (Page 140-0)