Chapter 7: Meta-mining as a Learning to Rank Problem
As already discussed above, the MovieLens recommendation dataset is quite different from the meta-mining one: the descriptions of the users and the items are much more limited. Users are described by only four features, sex, age, occupation and location, and movies by their genre, which takes 19 different values such as comedy, sci-fi or thriller. We model these descriptors using a one-of-N representation.
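As an illustration of the one-of-N representation, the following sketch encodes users and movies along the lines described above. The concrete category lists and age-group count are assumptions for the example, not the actual MovieLens coding:

```python
# Illustrative sketch: one-of-N encoding of MovieLens-style side information.
# The category lists below are truncated/assumed, not the real MovieLens ones.
import numpy as np

OCCUPATIONS = ["student", "engineer", "artist"]   # truncated for the example
GENRES = ["comedy", "sci-fi", "thriller"]         # 3 of the 19 genres

def encode_user(sex, age_group, occupation, n_age_groups=7):
    """Concatenate one-of-N blocks for sex, age group and occupation."""
    v = np.zeros(2 + n_age_groups + len(OCCUPATIONS))
    v[0 if sex == "M" else 1] = 1.0                 # sex block
    v[2 + age_group] = 1.0                          # age-group block
    v[2 + n_age_groups + OCCUPATIONS.index(occupation)] = 1.0
    return v

def encode_movie(genres):
    """A movie may carry several genres, giving a multi-hot genre block."""
    v = np.zeros(len(GENRES))
    for g in genres:
        v[GENRES.index(g)] = 1.0
    return v
```

Each descriptor block is sparse and binary, which is what makes this side information so weakly informative compared to the rich dataset and workflow characteristics of the meta-mining problem.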
In MovieLens, side information has thus mostly been used to regularize latent profiles, not to predict them; see e.g. the works of Abernethy, Bach, Evgeniou, and Vert (2006) and Agarwal and Chen (2010). Solving the cold start problem with such weakly informative side information is very difficult, and typical approaches prefer to rely only on the preference matrix Y.
We use the same baseline methods as in the meta-mining problem. We train LambdaMART (LM) with the following hyper-parameter setting: we fix the maximum number of nodes of the regression trees to 100, the maximum percentage of instances in each leaf node to 1%, and the learning rate, η, of the gradient boosted tree algorithm to 10^-2. As before, for the construction of the ensemble of regression trees in LM-MF and LM-MF-Reg we use the same setting as in LambdaMART. We optimize their µ1 and µ2 regularization parameters within the grid [0.1, 1, 5, 10, 15, 20, 25, 50, 80, 100]^2 using five-fold inner cross validation. We fix the number of factors r to 50 and the number of nearest neighbors in the Laplacian matrices to five. As in the meta-mining problem, we use the NDCG similarity measure to define output-space similarities, with the truncation level k set to the same value as the one at which we evaluate.
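Since the truncated NDCG serves both as the evaluation measure and to define output-space similarities, a minimal sketch of NDCG@k may be useful; this is the standard exponential-gain formulation, which may differ in minor details from the exact variant used in the experiments:

```python
# Minimal NDCG@k sketch (standard 2^rel - 1 gain, log2 discount).
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of a relevance list, truncated at k."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))
    return np.sum((2.0 ** rel - 1.0) / discounts)

def ndcg_at_k(predicted_scores, true_relevances, k):
    """NDCG@k: DCG of the predicted ordering over the ideal DCG."""
    order = np.argsort(predicted_scores)[::-1]
    dcg = dcg_at_k(np.asarray(true_relevances)[order], k)
    ideal = dcg_at_k(sorted(true_relevances, reverse=True), k)
    return dcg / ideal if ideal > 0 else 0.0
```

Because of the logarithmic discount, errors at the top of the ranking dominate the score, which is exactly why this measure suits top-k recommendation.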
We evaluate and compare the two methods under the two cold start scenarios described in Section 7.6.1. In the User Cold Start problem, we randomly select 50% of the users for training and evaluate the recommendations that the learned models generate on the remaining 50%, which we use for testing. In the Full Cold Start problem, we separate the users in the same manner into a training and a testing set; in addition, we also randomly separate the items into two sets of size 50% each, using one for training and leaving the other for testing. We then test the recommendations of the learned models over the testing users and the testing items. For the full cold start scenario we report results on the 100K dataset; we expect the results to be very similar on the 1M dataset. We give the results for the User Cold Start scenario in Table C.4 and for the Full Cold Start scenario in Table C.5.
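The two splitting protocols can be sketched as follows; the function name and interface are illustrative, not from the thesis:

```python
# Sketch of the two cold-start evaluation splits described above.
import numpy as np

def cold_start_splits(n_users, n_items, full=False, seed=0):
    """Split users (and, in the full cold start case, items) 50/50.

    Returns (train_users, test_users, train_items, test_items); in the
    user cold start scenario all items are available at training time.
    """
    rng = np.random.default_rng(seed)
    users = rng.permutation(n_users)
    train_u, test_u = users[: n_users // 2], users[n_users // 2 :]
    if not full:
        items = np.arange(n_items)
        return train_u, test_u, items, items
    items = rng.permutation(n_items)
    return train_u, test_u, items[: n_items // 2], items[n_items // 2 :]
```

In the full cold start case, evaluation is restricted to (test user, test item) pairs, so neither side of any evaluated pair has been seen during training.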
In the MovieLens User Cold Start scenario, Table C.4, we first see that the two methods we propose, LM-MF and LM-MF-Reg, beat the simple UB baseline in a statistically significant manner on both datasets, with the only exception of LM-MF on 100K for k = 5, which is not significantly better. In comparison, LambdaMART and its weighted NDCG variant, LMW, are never able to beat UB; the former is even significantly worse on 1M for k = 5. In addition, both LM-MF and its regularized variant beat LambdaMART in a statistically significant manner on both 100K and 1M and for both values of k. LM-MF-Reg has a small advantage over LM-MF on 100K, where it achieves a higher average NDCG score; however, this advantage disappears when we move to the 1M dataset, which is expected because, given the much larger amount of available data, there is no longer a need for additional regularization. Moreover, the low-rank matrix factorization, with r = 50 for roughly 6,000 users, already provides a strong regularization which makes the input/output space regularizers unnecessary.
A similar picture arises in the Full Cold Start scenario, Table C.5. Except for 100K, k = 10, our two methods, LM-MF and LM-MF-Reg, are always able to beat the FB baseline significantly. LambdaMART and its weighted NDCG variant are now also significantly better than FB on the 1M dataset, where the latter is significantly better than the former in two cases: 100K, k = 5, and 1M, k = 10. When we compare LambdaMART with our two methods, both beat this baseline in a statistically significant manner on the two datasets, with the small exception of LM-MF-Reg, which only comes close to beating LambdaMART significantly at 100K, k = 10 (p-value = 0.0722). However, the method which performs best overall is no longer LM-MF-Reg, as before, but LM-MF. There is no clear reason why we get this result in the full cold start scenario, but it can be explained by the poor description of the movies in MovieLens, which are only described by their genre, resulting in movie input/output space regularizers that are not as useful as the user regularizers are in the user cold start setting.
In this chapter, we have explored the use of the state-of-the-art learning to rank algorithm LambdaMART in a recommendation setting. Since only the top items are observable by users in real recommendation systems, rank loss functions that focus heavily on the correctness of the top item predictions are the most appropriate for them. We focused on the cold start problem in the presence of side information describing users and items. Motivated by the fact that in recommendation problems users and items can be described by a small number of hidden factors, we learn this hidden representation and factorize the preference scores as inner products over it. In addition, we constrain the learned representations by forcing them to reflect the user/item similarities as these are given by their descriptors in the input space and by their preference vectors over the preference matrix. We experimented with two very different recommendation problems and showed that the methods we propose beat the original LambdaMART in a significant manner under a number of evaluation settings.
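The central factorization idea can be sketched in a few lines: descriptors are mapped into a low-rank latent space and the preference score of a pair is the inner product of the two latent profiles. The linear maps A and B below are placeholders for illustration; in the thesis these mappings are learned by optimizing the ranking loss:

```python
# Sketch of inner-product preference scoring over learned latent profiles.
# A and B are hypothetical linear maps standing in for the learned mappings.
import numpy as np

rng = np.random.default_rng(0)
d_user, d_item, r = 12, 19, 5     # descriptor sizes and latent rank (illustrative)

A = rng.normal(size=(d_user, r))  # maps user descriptors to latent profiles
B = rng.normal(size=(d_item, r))  # maps item descriptors to latent profiles

def preference_score(x_user, x_item):
    """Factorized score: inner product of the two r-dimensional profiles."""
    return float((x_user @ A) @ (x_item @ B))
```

Because the score depends on the pair only through descriptor-driven profiles, the same model applies unchanged to users or items never seen during training, which is what makes the factorization suitable for cold start.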
Concerning meta-mining, we viewed this problem as a learning to rank problem, which is more appropriate than the standard classification and least-squares recommendation approaches; see Chapters 4 and 5. In meta-mining, what matters most is to provide, for a given new dataset, the top-k workflows from which the DM user can select the ones she/he wants to experiment with. By directly optimizing the NDCG ranking loss function, LambdaMART-MF and its regularized variant both deliver the best top-k recommendations among the different methods and baselines, with significant performance improvements under the three evaluated recommendation tasks of Matrix Completion, User Cold Start and Full Cold Start. Note that in the User Cold Start setting the user memory-based approach corresponds to the typical approach used in meta-learning, in which workflow recommendations are provided by a dataset similarity measure (Kalousis and Theoharis, 1999; Soares and Brazdil, 2000). This standard approach, which only makes use of the dataset descriptors, delivers the worst performance compared to the other methods, which additionally make use of the workflow descriptors.
We should note that there are actually few works in meta-learning that are able to build a single model from pairwise algorithm performance comparisons. Most of them require building m(m−1)/2 different models, one for each algorithm pair, and aggregating them in order to deliver a ranked list of algorithms for a given new dataset; see e.g. the works of Kalousis and Hilario (2001) and Kalousis and Theoharis (1999). For instance, the recent approach described in Sun and Pfahringer (2013) proposes a two-step method to build a model which can predict the preference vector of a given new dataset. There, the authors build a random forest ensemble of approximate ranking trees, ART, from a propositionalized version of a set of pairwise meta-rules, each of which describes whether algorithm A should be preferred over algorithm B on a given base-level dataset. Their motivation is that ensemble algorithms usually outperform base algorithms in pairwise algorithm performance comparison (Kalousis and Hilario, 2001). Here we also use ensemble learning, but we do so in a single step in which we directly optimize a ranking loss function using dataset and workflow descriptors to predict their latent profiles, the inner product of which gives the preference score of a dataset-workflow pair. As a result, our approach is very robust and can address the different cold start settings of providing recommendations for new datasets, new workflows, or even both.
Conclusion and Future Work
In this thesis, we presented a new framework for the meta-analysis of DM processes, referred to as meta-mining or DM process-oriented meta-learning. The proposed framework extends in several directions the state-of-the-art approaches to automating the DM process, including meta-learning and ontology-based DM workflow planning systems, two approaches which are not designed to support the selection, recommendation and building of DM workflows in terms of performance optimization; see Chapter 2. By relying on a novel DM ontology, dmop, which we used to extract structural characteristics of DM workflows that we associated with dataset characteristics, our framework extends the traditional algorithm selection task of meta-learning to workflow selection, where it is no longer restricted to selecting DM workflows from a fixed pool of base-level workflows but can also select workflows that have never been experimented with before. It can further address the task of selecting datasets for a given new workflow, as well as selecting the best dataset-workflow pair; we describe these three meta-mining tasks in Chapter 4. In addition, we addressed the task of automatically planning DM workflows in a novel manner by combining meta-mining with an AI planner, providing, to our knowledge, the first planning system that is able to design the DM workflows which best match a given dataset in terms of the relative performance of the former on the latter; see Chapter 6.
It is worth noting that the above meta-mining tasks are all novel tasks that have never been considered in the meta-learning literature, including the task of optimizing the planning of DM workflows. In our framework, we formulated them as specific recommendation tasks with the overall goal of providing to the DM user a small number of relevant workflows (respectively datasets) for a given dataset (respectively workflow); i.e. to select a portfolio of DM workflows that will work well on his/her datasets, and vice versa. Since meta-mining bears several similarities with recommender systems, such as recommending workflows that have similar performances on a given dataset, we built our framework by following the recommendation approach, combining dataset and workflow descriptors with a performance-based preference matrix that relates the latter to the former, in a manner similar to collaborative and content-based filtering (Su and Khoshgoftaar, 2009).
More specifically, we devised in Chapters 5 and 7 two new recommendation algorithms, one relying on metric learning and the other on learning to rank, both of which learn dataset and workflow low-rank latent profiles over heterogeneous spaces. In both approaches we followed the meta-learning assumption that the similarity of datasets is given by the relative performance of DM workflows when applied to them (Kalousis and Hilario, 2001; Kalousis and Theoharis, 1999). In the same manner, we defined the similarity of workflows by the relative performance they achieve on datasets. We combined different similarity measures of datasets and workflows, one defined over their descriptors, which we called the input space similarity, and one defined over their respective performances, which we called the output space similarity, in order to regularize the low-rank latent profiles of datasets and workflows. As a result, the inner product of the dataset and workflow latent profiles can robustly estimate the performance-based preference score that the latter will deliver on the former. In addition, we designed our recommendation algorithms so that they can learn a function mapping dataset and workflow descriptors to the respective latent space in order to address the cold start problem, i.e. the problem of providing recommendations for datasets, respectively workflows, for which we do not have any historical data. To predict the profiles of new datasets and workflows from their descriptions, we learned a linear mapping in the metric-learning algorithm and an additive piece-wise linear mapping in the learning-to-rank algorithm. In the latter case, this gave us remarkable results on the different cold start scenarios.
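A common way to turn such input- or output-space similarities into a regularizer on latent profiles is a graph-Laplacian penalty, tr(U^T L U), built from a k-nearest-neighbour similarity graph. The sketch below shows this generic construction; it illustrates the principle rather than reproducing the exact regularizers of Chapters 5 and 7:

```python
# Generic sketch: graph-Laplacian regularization of latent profiles.
import numpy as np

def knn_laplacian(S, k=5):
    """Graph Laplacian of a k-nearest-neighbour graph built from a
    symmetric similarity matrix S (the graph is symmetrized via max)."""
    n = S.shape[0]
    W = np.zeros_like(S, dtype=float)
    for i in range(n):
        nn = np.argsort(S[i])[::-1]
        nn = nn[nn != i][:k]          # k most similar nodes, excluding self
        W[i, nn] = S[i, nn]
    W = np.maximum(W, W.T)
    return np.diag(W.sum(axis=1)) - W

def laplacian_penalty(U, L):
    """tr(U^T L U): small when connected nodes have similar profiles."""
    return float(np.trace(U.T @ L @ U))
```

The penalty equals a weighted sum of squared profile differences over the graph edges, so minimizing it pulls the latent profiles of similar datasets (or workflows) towards each other.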
Concerning the planning of DM workflows, we proposed a novel planning approach which combines an AI planner with our meta-mining framework; see Chapter 6. We viewed the planning task as optimizing a sequence of local cold start recommendation problems: at each planning step, the meta-mining module selects or recommends those valid partial candidate workflows submitted by the AI planner whose descriptions best match the descriptions of the input dataset for the given mining goal. As a result, our planning system is the first one which can automatically design potentially new DM workflows whose expected performance on a given mining problem is maximized.
We evaluated our framework on a real-world problem under the meta-mining tasks described above, with a number of different evaluation metrics, such as mean average error, Kendall correlation and NDCG, to assess the quality of the recommendations provided by our algorithms. In most cases, we were able to deliver the best results for the evaluation metric under consideration, with significant performance improvements over the different baselines and other methods, including standard meta-learning approaches and state-of-the-art methods for matrix completion and learning to rank. We also experimented with our learning-to-rank algorithm on a large-scale benchmark recommendation dataset, MovieLens, where we delivered very good results in the user cold start setting.
Overall, we investigated in this thesis a novel framework that can robustly support the DM user in her/his DM process modeling. We believe that it is a promising research avenue which we would like to explore further in future work. As immediate work, we would like to analyze the learned models in order to gain a better understanding of the quality of the dataset and workflow descriptors. This could give us feedback on which descriptors should be refined or removed in order to improve our domain knowledge and thus our meta-mining framework. To do this, a first step is to investigate other specific domains, such as physics or other medical/biological problems, beyond the cancer diagnosis/prognosis problem we experimented with here, to see whether our framework can also be applied to these domains. We would also like to investigate other recommendation problems with similar requirements, i.e. with descriptors on users and items; for instance, predicting whether a user will retweet a tweet or not based on her/his social network and past tweets in the Twitter blogosphere. For longer-term work, we describe in the following the most important research directions we would like to explore.
Multi-Task Learning First, we would like to explore meta-mining with regularized multi-task learning (MTL) (Evgeniou and Pontil, 2004). In MTL, we are given T different tasks and we want to build for each task t ∈ T a model f_t(x) such that similar tasks will have similar models. Learning multiple tasks simultaneously generally leads to improved performance. We assume that each task t has n_t instances and that all tasks share some d-dimensional feature space X. We also assume f_t(x) to be a linear model, i.e. f_t(x) = w_t^T x, where we define w_t = w_0 + v_t and w_0 is a global model learned over the instances of all tasks. The regularized MTL optimization problem is defined by:

\min_{w_0, v_t} \sum_{t=1}^{T} \sum_{i=1}^{n_t} L\big(y_{ti}, (w_0 + v_t)^T x_{ti}\big) + \lambda_1 \sum_{t=1}^{T} \|v_t\|_2 + \lambda_2 \|w_0\|_2^2 \qquad (8.1)
where L is a given loss function, such as the square loss. The second term in Eq. (8.1) is the l_{2,1}-regularization, or group lasso, which is known to introduce group sparsity, and the third term is the l_2-regularization on the global model w_0. By solving Eq. (8.1) with e.g. an SVM, we will learn the v_t vectors such that the difference ||v_j − v_k||_2 between two tasks j and k will be "small" when the two tasks are similar to each other (Evgeniou and Pontil, 2004).
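The shared-plus-specific decomposition w_t = w_0 + v_t can be sketched with plain gradient descent on the squared loss. For simplicity the sketch uses squared l2 penalties on both the v_t and w_0 (the non-smooth group-lasso term of Eq. (8.1) would require a proximal step); the function name and interface are illustrative:

```python
# Regularized MTL sketch: each task's weights are w_0 + v_t.
# Simplification: squared l2 penalties instead of the group lasso of Eq. (8.1).
import numpy as np

def mtl_fit(tasks, lam1=1.0, lam2=0.1, lr=0.01, epochs=500):
    """Fit T linear tasks sharing a global model w_0 plus task offsets v_t.

    `tasks` is a list of (X_t, y_t) pairs over a common d-dimensional
    feature space; squared loss, batch gradient descent.
    """
    d = tasks[0][0].shape[1]
    w0 = np.zeros(d)
    V = [np.zeros(d) for _ in tasks]
    for _ in range(epochs):
        g0 = lam2 * 2 * w0                       # gradient of the w_0 penalty
        for t, (X, y) in enumerate(tasks):
            resid = X @ (w0 + V[t]) - y
            g = 2 * X.T @ resid / len(y)          # task data gradient
            V[t] -= lr * (g + lam1 * 2 * V[t])    # shrink task offsets
            g0 += g                               # w_0 sees every task
        w0 -= lr * g0
    return w0, V
```

Because every task's data gradient flows into w_0 while lam1 shrinks the v_t, tasks with little data are pulled towards the shared model, which is the behaviour the meta-mining extension aims to exploit across application domains.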
The multi-task approach can clearly benefit meta-mining. In meta-mining, we will say that each task t corresponds to a specific application domain, e.g. biology, physics, etc., each of which has a number n_t of base-level datasets over which we apply exactly the same set of workflows. Thus, all domains share the same meta-feature spaces X and W of datasets and workflows. We now want to learn T meta-mining models, one for each application domain, such that similar domains will have similar models, plus a global model U_0 over all the base-level experiments. We will follow the matrix factorization model defined in this thesis, where we define U_t = U_0 + W_t, with W_t an n_t × r matrix, and we keep intact the m × r matrix V, where m is the number of workflows and r is the (low) rank of the matrices. Eq. (8.1) then becomes:
\min_{U_0, W_t, V} \sum_{t=1}^{T} L\big(Y_t, (U_0 + W_t) V^T\big) + \lambda_1 \sum_{t=1}^{T} \|W_t\|_F + \lambda_2 \|U_0\|_F^2 \qquad (8.2)

where L can be the factorized LambdaMART loss function, see Eq. (7.13), and || · ||_F is the Frobenius norm. The only difference with a standard MTL approach is that, to measure the similarity between the W_t matrices, we cannot simply use their difference, since we will have a different number n_t of datasets for each domain. To solve this problem, we can instead use the expectation of the difference, E(||W_j − W_k||_2), for two domains j and k, by assuming that W_j and W_k both follow an r-variate normal distribution.
Multi-Criteria Recommendation System So far, we have mainly considered the relative predictive performance of workflows on datasets to measure how good the former are on the latter. The reason is that this measure is the one that best reflects the association between datasets and workflows. Also, we wanted to build recommendations for a given mining problem such that the DM user can obtain good predictive performance with a small number of experiments. There are, however, additional metrics which can also be relevant for meta-mining recommendations, such as the speed and memory consumption of the DM workflows when applied on a given dataset. For instance, the DM user may prefer recommendations that give good performance but are also relatively fast to execute.
To take into account these additional measures, we will extend our meta-mining model,