Chapter 5 Meta-mining as a Recommendation Problem 61

6.6 Experimental Evaluation

6.6.6 Result Analysis

In the previous sections we evaluated the two workflow planning strategies in two settings:

planning only over seen operators, and planning with both seen and unseen operators. In the first scenario the P2 planning strategy that makes use of the heterogeneous metric learning model, which directly connects dataset to workflow characteristics, clearly stands out. It outperforms the default strategy in terms of the Kendall Similarity gain, in a statistically significant, or close to statistically significant, manner for 24 values of k ∈ [2, . . . ,35]; in terms of the average accuracy of its top-kworkflows it outperforms it for 20 values of k in a statistically significant, or close to statistically significant, manner. All

6.6. Experimental Evaluation

the other methods, including P1, follow with a large performance difference from P2.

When we allow the planners to include in the workflows operators which have not been used in the baseline experiments, but which are annotated in thedmop, P2’s performance advantage is smaller. In terms of the Kendall similarity gain this is statistically significant, or close to, for k ∈ [10, . . . ,20]. With respect to the average accuracy of its top-k lists this is better than the default only for very small lists, k= 3,4; for k > 23 it is in fact significantly worse. P1 fairs better in the second scenario, however its performance is not different from the default method. Keep in mind though that the baseline used in the second scenario is a quite optimistic one.

In fact what we see is that we are able to generalize and plan well over datasets, as evidenced by the good performance of P2 in the first setting. However when it comes to generalizing both over datasets and operators as it is the case of the second scenario the performance of the planned workflows is not good, with the exception of the few top work-flows. If we take a look at the new operators we added in the second scenario these were a feature selection algorithm,Information Gain Ratio, and four classification algorithms, namelyLDA, a linear discriminant algorithm, theRipperrule induction algorithm, a Neu-ral Netalgorithm, and aRandom Treealgorithm. Out of them only forInformation Gain Ratio we have seen during the base level set of experiments an algorithm, Information Gain, that has a rather similar learning bias to it. TheRipperrule induction is a sequen-tial covering algorithm, the closest operators to which in our set of training operators are the two decision tree algorithms which are recursive partitioning algorithms. With respect to thedmopRippershares a certain number of common characteristics with decision trees, however the meta-mining model contains no information on how the set-covering learning bias performs over different datasets. This might lead to it being selected for a given dataset based on its common features with the decision trees, while its learning bias is not in fact appropriate for that dataset. Similar observations hold for the other algorithms, for exampleLDA shares a number of properties withSVMl and LR however its learning bias, maximizing the between to within class distances ratio, is different from the learning biases of these two, as before the meta-mining model contains no information on how its bias associates with dataset characteristics.

Overall the extent to which the system will be able to plan, over seen and unseen operators, workflows that achieve a good performance depends on the extend to which the properties of the latter have been seen during the training of the meta-mining models within the operators with which we have experimented with, as well as on whether the unseen properties affect critically the final performance. In the case that all operators are well characterized experimentally, as we did in scenario one, then the performance of

the workflows designed by the P2 strategy is very good. Note that it is not necessary that all operators or workflows are applied to all datasets, it is enough to have a sufficient set of experiments for each operator. The heterogeneous metric learning algorithm can handle incomplete preference matrices, using only the available information. Of course it is clear that the more the available information, whether in the form of complete preference matrices or in the form of extensive base-level experiments over large number of datasets, the better the quality of the learned meta-mining model will be. It will be interesting to explore the sensitivity of the heterogeneous metric learning method over different levels of completeness of the preference matrix.

6.7 Discussion

In this chapter, we have presented what is, to the best of our knowledge, the first system that is able to plan data mining workflows, for a given task and a given input dataset, that are expected to optimize a given performance measure. The system relies on the tight interaction of a hierarchical task network planner with a learned meta-mining model to plan the workflows. The meta-mining model, an heterogeneous learned metric, associates datasets characteristics with workflow characteristics which are expected to lead to good performance. The workflow characteristics describe relations between the different compo-nents of the workflows, capturing global interactions of the various operators that appear within them, and incorporate domain knowledge as the latter is given in a data mining on-tology (dmop). We learn the meta-mining model on a collection of past base-level mining experiments, data mining workflows applied on different datasets. We carefully evaluated the system on the task of classification and we showed that it outperforms in a significant manner a number of baselines and the default strategy when it has to plan over operators with which we have experimented with in the base-level experiments. The performance advantage is less pronounced when it has to consider also during planning operators with which we have not experimented with in the base-level experiments, especially when the properties of these operators were not present within other operators with which we have experimented with in the base-level experiments.

The system is directly applicable to other mining tasks e.g. regression, clustering.

The reasons for which we focused on classification were mainly practical: extensive an-notation of the classification task and the related concepts in the data mining ontology, large availability of classification datasets, and extensive relevant work on meta-learning and dataset characterization for classification. The main hurdle in experimenting with a different mining task is the annotation of the necessary operators in the dmop and the

6.7. Discussion

set up of a base-level collection of mining experiments for the specific task. Although the annotation of new algorithms and operators is a quite labor intensive task, many of the concepts currently available in the dmop are directly usable in other mining tasks, e.g. cost functions, optimization problems, feature properties etc. In addition there is a small active community1 maintaining and augmenting the ontology with new tasks and operators, significantly reducing the deployment barrier for a new task.

There are a number of issues that we still need to explore in a finer detail. We would like to gain a deeper understanding and a better characterization of the reduced performance in planning over unseen operators; for example, under what conditions we can be relatively confident on the suitability of an unseen operator within a workflow. We want to experiment with the strategy we suggested for parameter tuning, in which we treat the parameters as yet another property of the operators, in order to see whether it gives better results; we expect it will. We want to study in detail how the level of missing information in the preference matrix affects the performance of the system, as well as whether using ranking based loss functions in the metric learning problem instead of sum of squares would lead to even better performance.


Meta-mining as a Learning to Rank Problem

7.1 Introduction

In this Chapter, we will provide a learning to rank algorithm for cold start recommenda-tion. In Section 7.3, we will present LambdaMART (Burges, 2010), the state-of-the-art learning to rank algorithm. Then we will see in Section 7.4 how to factorize this algorithm in order to learn low-rank user and item profiles. In Section 7.5, we describe a number of methods to regularize the latent profiles including a weighted NDCG to smooth the optimization function. We experiment our methods in Section 7.6 on meta-mining but also on MovieLens, a benchmark movie recommendation problem. Finally we discuss the methods in Section 7.7.

7.2 Motivation

In Chapter 5, we have presented a recommendation algorithm which learns a matrix fac-torization of the preference matrixYover side information, additional features on datasets and workflows, to address the cold start problem and which uses output space similari-ties to robustly regularize the latent profiles. We followed there the standard approach in matrix factorization to minimize a point-wise loss function such as the mean squared error between the inner product of learned low-rank representations and the true pref-erence score of a user for an item, see e.g. (Abernethy, Bach, Evgeniou, and Vert, 2006;

7.2. Motivation

Agarwal and Chen, 2010; Srebro, Rennie, and Jaakkola, 2005). Such a cost function is not really appropriate for recommendation problems since what really matters there is the rank order of the preference scores and not their absolute values. It is only recently that recommendation methods have started using ranking loss functions in their optimiza-tion problems. For instance, Cofirank (Weimer, Karatzoglou, Le, and Smola, 2007) is a maximum-margin matrix factorization algorithm which optimizes an upper bound of the NDCG measure (J¨arvelin and Kek¨al¨ainen, 2000). It is one of the best known approaches in what it is known now as the collaborative ranking approach. However like many rec-ommendation algorithms Cofirank cannot address the cold start problem, i.e. it cannot recommend potentially new items movies to new users.

Preference learning or learning to rank (F¨urnkranz and H¨ullermeier, 2010) on the other hand has investigated the question of learning the preference order of documents for a given query in search engines and IR systems by exploiting ranking loss functions such as NDCG and developing algorithms that optimize them. The best such example is LambdaMART (Burges, 2010), the state-of-the-art learning to rank algorithm. The success of LambdaMART is due to its ability to model preference orders using features describing side-information of the query-document pairs in a very flexible manner. Despite its great success in recent challenges and studies (Burges, Svore, Bennett, Pastusiak, and Wu, 2011; Donmez, Svore, and Burges, 2009; Yue and Burges, 2007), it is however prone to overfitting because it does not come with a rigorous regularization formalism.

In this Chapter, we propose an algorithm to address these limitations. It is customized for the cold start setting in recommendation problems and extends LambdaMART by learning low-rank latent factors to describe users and items. Preference scores are then computed as inner products over the learned latent representations. In addition we also regularize the learned representations to reflect data-based similarities computed over the descriptions of the users and items as well as over their preference vectors. We experi-ment our algorithm on our meta-mining problem as well as on the MovieLens problem, a benchmark dataset for movie recommendation, in a number of cold start settings.

The main contributions of this Chapter are as follows:

(i) Motivated by the fact that very often the user and item preference behavior can be well summarized by a small number of hidden factors, we propose a novel algo-rithm, LambdaMART Matrix Factorization (LM-MF) that learns a low rank hidden representation of users and items using gradient boosted trees (Friedman, 2001), an approach with which we can address the cold start problem of new users and item using side information. The algorithm factorizes LambdaMART by defining

rele-vance scores as the inner product of the learned representations of users and items.

It outperforms significantly LambdaMART and a number of different baselines in the two experimented recommendation problems, meta-mining and MovieLens.

(ii) We propose additional regularizations to smooth the learned latent representations.

First, we regularize the low rank representations with respect to an input-output-space similarity which controls the smoothness of the learned representations. Sec-ond, we propose to use a weighted variant of NDCG to reduce the penalty for sim-ilar items with large rating discrepancy. This second algorithm, regularized Lamb-daMART Matrix Factorization (LM-MF-Reg), gives the best results on our meta-mining problem.

This Chapter is based on the following publication.

Phong Nguyen, Jun Wang, and Alexandros Kalousis. Factorizing LambdaMART for cold start recommendations. InMachine Learning, Special Issue of the ECML-PKDD 2016 Journal Track, Springer, Vol. 105, 2016.

7.3 Learning to Rank

In different preference learning, we are given a (sparse) preference matrix,Y, of sizen×m;

a non-missingyij entry represents either the relevance (or preference) score between theith user and thejth item in recommendation, the performance score between the ith dataset and the jth workflow in meta-mining, or the relevance score between the ith query and the jth document in learning to rank. Note that, to simplify the presentation, we will only refer to user-item relevance in the followings. In addition to the given preference matrix, we also assume that the descriptions of user and item are available. Similarly to the notations introduced in Section 4.3, we denote by xi = (xi1, . . . , xid)T ∈ Rd, the d-dimensional description of the ith user, and by X the n×d user description matrix, the ith row of which is given by the xi. We denote by wj = (wj1, . . . , wjl)T ∈ Rl, the l-dimensional description of thejth item and byWthem×l item description matrix, the jth row of which is given by thewj item (in meta-mining, we havel=|P|, the number of workflow patterns).

Since only the rank order of yij matters for the recommendation of items to a user, an ideal preference learning algorithm does not predict exactly the value ofyij. Instead, it predicts the rank order induced by preference vector y. We will denote by rxi the

7.3. Learning to Rank

target rank vector of the ith user or dataset, given by ordering theith user’s non-missing preference scores over items; its kth entry, rxik, is the rank or position of the kth non-missing relevance score. Note that in the converse problem, i.e. the recommendation of users to an item or workflow, we will denote byrwj the target rank vector of thejth item, given by ordering thejth item’s non-missing preference scores over users.

7.3.1 Evaluation Metric

As only a few top ranked items are finally shown to users in real application of preference learning, e.g. recommendation, information retrieval, etc., the evaluation metric of pref-erence learning emphasizes heavily on the correctness of top items. One very often used metric is the Discounted Cumulative Gain (DCG) (J¨arvelin and Kek¨al¨ainen, 2000). It is defined as follows:

DCG(r,y)@k =





log2(ri+ 1)I(ri ≤k) (7.1) wherekis the truncation level at which DCG is computed andI is the indicator function which returns 1 if its argument holds otherwise 0. The vector y is the m dimensional ground truth relevance andris a rank vector that we will learn. The DCG score measures the match between given rank vector r and the rank vector of relevance score y; the larger the better. It is easy to check that, according to Eq.(7.1), if the rank vector r correctly preserves the order induced byy, the DCG score will achieve its maximum. The most important property of DCG score is that, it will incur larger penalty for misplacing top items than end items due to the log denominator. Such that, it emphasizes on the correctness of top items. Since the DCG score also depends on the length of relevance vector y, it is often normalized with respect to its maximum score, namely Normalized DCG (NDCG), defined as

NDCG(y,y)@kˆ = DCG(r(ˆy),y)@k

DCG(r(y),y)@k (7.2)

wherer(·) is a rank function, its output is the rank position of ·ordering in a decreasing manner. The vector r(y) is the rank of ground truth relevance vector y and r(ˆy) is the rank of predicted relevance ˆy provided by the learned model. With normalization, the value of NDCG ranges from 0 to 1, the larger the better. In this paper, we will also use this metric as our evaluation metric.

The main difficulty in learning preferences is that rank functions are not continuous

and have combinatorial complexity. Thus most often the rank of the preference scores is approximated through the respective pairwise constraints.

7.3.2 LambdaMART

LambdaMART (Burges, 2010) is one of the most popular algorithms for preference learning which follows exactly this idea. Its optimization problem relies on a distance distribution measure, cross entropy, between a learned distribution1 that gives the probability that item j is more relevant than item k from the true distribution which has a probability mass of one if item i is really more relevant than item j and zero otherwise. The final loss function of LambdaMart defined over all usersi and overall the respective pairwise preferences for items j,k, is given by:

L(Y,Y) =ˆ






|∆N DCGijk|log(1 +e−σ(ˆyij−ˆyik)) (7.3)

whereZ is the set of all possible pairwise preference constraints such that in the ground truth relevance vector holdsyij >yik, and ∆N DCGijkis given by:

∆N DCGijk=N DCG(yi,yˆi)−N DCG(yjki ,yˆi)

where yjki is the same as the ground truth relevance vector yi except that the values of yij and yik are swapped. This is also equal to the NDCG difference that we get if we swap the ˆyij, ˆyik, estimates. Thus the overall loss function of LambdaMART, Eq.(7.3), is the sum of the logistic losses on all pairwise preference constraints weighted by the respective NDCG differences. Since the NDCG measure penalizes heavily the error on the top items, the loss function of LambdaMART has also the same property. LambdaMART minimizes its loss function with respect to all ˆyij,yˆik, and its optimization problem is:


L(Y,Y)ˆ (7.4)

Yue and Burges (2007) have shown empiricially that solving this problem also optimizes the NDCG metric of the learned model. The partial derivative of LambdaMART’s loss

1This learned distribution is generated by the sigmoid function Pjki = 1

1+eσ(ˆyij−ˆyik) of the estimated preferences ˆyij,ˆyik.

7.3. Learning to Rank

function with respect to the estimated scores ˆyij is


∂ˆyijij = X


λijk− X


λikj (7.5)

andλijk is given by:

λijk= −σ

1 +eσ(ˆyij−ˆyik)|∆N DCGijk| (7.6) With a slight abuse of notation below we will write ∂L(y∂ˆyij,yˆij)

ij instead of ∂L(Y,∂ˆy Y)ˆ

ij , to make explicit the dependence of the partial derivative only on yij,yˆij due to the linearity of L(Y,Y).ˆ

LambdaMART uses Multiple Additive Regression Trees (MART) (Friedman, 2001) to solve its optimization problem. It does so through a gradient descent in some functional space that generates preference scores from item and user descriptions, i.e. ˆyij =f(xi,wj), where the update of the preference scores at thetstep of the gradient descent is given by:

LambdaMART uses Multiple Additive Regression Trees (MART) (Friedman, 2001) to solve its optimization problem. It does so through a gradient descent in some functional space that generates preference scores from item and user descriptions, i.e. ˆyij =f(xi,wj), where the update of the preference scores at thetstep of the gradient descent is given by:

Dans le document Meta-mining: a meta-learning framework to support the recommendation, planning and optimization of data mining workflows (Page 117-0)