Chapter 7 Meta-mining as a Learning to Rank Problem

7.3 Learning to Rank

In the different preference learning settings, we are given a (sparse) preference matrix, $\mathbf{Y}$, of size $n \times m$; a non-missing entry $y_{ij}$ represents either the relevance (or preference) score between the $i$th user and the $j$th item in recommendation, the performance score between the $i$th dataset and the $j$th workflow in meta-mining, or the relevance score between the $i$th query and the $j$th document in learning to rank. Note that, to simplify the presentation, we will only refer to user-item relevance in the following. In addition to the given preference matrix, we also assume that descriptions of the users and items are available. Similarly to the notation introduced in Section 4.3, we denote by $\mathbf{x}_i = (x_{i1}, \dots, x_{id})^T \in \mathbb{R}^d$ the $d$-dimensional description of the $i$th user, and by $\mathbf{X}$ the $n \times d$ user description matrix, the $i$th row of which is given by $\mathbf{x}_i$. We denote by $\mathbf{w}_j = (w_{j1}, \dots, w_{jl})^T \in \mathbb{R}^l$ the $l$-dimensional description of the $j$th item, and by $\mathbf{W}$ the $m \times l$ item description matrix, the $j$th row of which is given by $\mathbf{w}_j$ (in meta-mining, we have $l = |P|$, the number of workflow patterns).

Since only the rank order of the $y_{ij}$ matters for the recommendation of items to a user, an ideal preference learning algorithm does not need to predict the exact value of $y_{ij}$. Instead, it predicts the rank order induced by the preference vector $\mathbf{y}$. We will denote by $\mathbf{r}_{x_i}$ the target rank vector of the $i$th user or dataset, given by ordering the $i$th user's non-missing preference scores over items; its $k$th entry, $r_{x_i k}$, is the rank or position of the $k$th non-missing relevance score. Note that in the converse problem, i.e. the recommendation of users to an item or workflow, we will denote by $\mathbf{r}_{w_j}$ the target rank vector of the $j$th item, given by ordering the $j$th item's non-missing preference scores over users.
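As an illustration of the target rank vector, the following minimal Python sketch ranks the non-missing entries of one row of the preference matrix. The function name `rank_vector`, the use of `None` for missing entries, and the tie-breaking rule (ties keep their original order) are assumptions made for this example; the text does not specify them.

```python
def rank_vector(scores):
    """Return the 1-based ranks of the non-missing entries of one preference row.

    `scores` is a list in which missing entries are None (an assumed encoding).
    The highest score receives rank 1. The result maps item index -> rank.
    """
    observed = [(j, s) for j, s in enumerate(scores) if s is not None]
    # Sort by decreasing score; Python's sort is stable, so tied scores
    # keep their original index order (an assumed tie-breaking rule).
    ordered = sorted(observed, key=lambda pair: -pair[1])
    return {j: position + 1 for position, (j, _) in enumerate(ordered)}


# One user's row: item 1 has no recorded preference score.
row = [3.0, None, 1.0, 2.0]
ranks = rank_vector(row)
print(ranks)
```

The missing item simply does not appear in the result, which mirrors the fact that the target rank vector is defined over non-missing scores only.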

7.3.1 Evaluation Metric

As only a few top-ranked items are finally shown to users in real applications of preference learning, e.g. recommendation, information retrieval, etc., the evaluation metrics of preference learning place heavy emphasis on the correctness of the top items. One very often used metric is the Discounted Cumulative Gain (DCG) (Järvelin and Kekäläinen, 2000). It is defined as follows:

$$\mathrm{DCG}(\mathbf{r},\mathbf{y})@k = \sum_{i=1}^{m} \frac{2^{y_i}-1}{\log_2(r_i+1)}\, I(r_i \le k) \tag{7.1}$$

where $k$ is the truncation level at which DCG is computed and $I$ is the indicator function, which returns 1 if its argument holds and 0 otherwise. The vector $\mathbf{y}$ is the $m$-dimensional ground truth relevance and $\mathbf{r}$ is a rank vector that we will learn. The DCG score measures the match between the given rank vector $\mathbf{r}$ and the rank vector of the relevance scores $\mathbf{y}$; the larger the better. It is easy to check that, according to Eq.(7.1), if the rank vector $\mathbf{r}$ correctly preserves the order induced by $\mathbf{y}$, the DCG score achieves its maximum. The most important property of the DCG score is that, due to the log denominator, it incurs a larger penalty for misplacing top items than for misplacing end items; it thus emphasizes the correctness of the top items. Since the DCG score also depends on the length of the relevance vector $\mathbf{y}$, it is often normalized with respect to its maximum score, giving the Normalized DCG (NDCG), defined as

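$$\mathrm{NDCG}(\mathbf{y},\hat{\mathbf{y}})@k = \frac{\mathrm{DCG}(r(\hat{\mathbf{y}}),\mathbf{y})@k}{\mathrm{DCG}(r(\mathbf{y}),\mathbf{y})@k} \tag{7.2}$$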

where $r(\cdot)$ is a rank function whose output is the rank position of each entry of its argument when the entries are ordered in decreasing manner. The vector $r(\mathbf{y})$ is the rank of the ground truth relevance vector $\mathbf{y}$ and $r(\hat{\mathbf{y}})$ is the rank of the predicted relevance $\hat{\mathbf{y}}$ provided by the learned model. With normalization, the value of NDCG ranges from 0 to 1, the larger the better. In this paper, we will also use this metric as our evaluation metric.
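To make the two definitions concrete, here is a small Python sketch of Eq.(7.1) and Eq.(7.2). The function names (`rank_positions`, `dcg_at_k`, `ndcg_at_k`), the tie-breaking by original position, and the convention of returning 0 when the ideal DCG is zero are assumptions of this example, not part of the definitions above.

```python
import math


def rank_positions(scores):
    """r(.): 1-based rank of each entry when entries are sorted in decreasing order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, i in enumerate(order):
        ranks[i] = position + 1
    return ranks


def dcg_at_k(ranks, y, k):
    """Eq.(7.1): sum of (2^y_i - 1) / log2(r_i + 1) over the items with r_i <= k."""
    return sum((2 ** y_i - 1) / math.log2(r_i + 1)
               for r_i, y_i in zip(ranks, y) if r_i <= k)


def ndcg_at_k(y, y_hat, k):
    """Eq.(7.2): DCG of the predicted ranking divided by DCG of the ideal ranking."""
    ideal = dcg_at_k(rank_positions(y), y, k)
    if ideal == 0:
        return 0.0  # assumed convention when no item in the top k is relevant
    return dcg_at_k(rank_positions(y_hat), y, k) / ideal


y = [3, 2, 0, 1]                      # ground truth relevance of four items
perfect = [0.9, 0.5, 0.1, 0.3]        # same order as y
top_swapped = [0.5, 0.9, 0.1, 0.3]    # the two most relevant items exchanged
bottom_swapped = [0.9, 0.5, 0.3, 0.1] # the two least relevant items exchanged

for name, y_hat in [("perfect", perfect),
                    ("top swapped", top_swapped),
                    ("bottom swapped", bottom_swapped)]:
    print(name, ndcg_at_k(y, y_hat, k=4))
```

In this toy example, exchanging the two most relevant items lowers NDCG more than exchanging the two least relevant ones, which is the top-heavy behaviour attributed to the log denominator.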

The main difficulty in learning preferences is that rank functions are not continuous and have combinatorial complexity. Thus most often the rank of the preference scores is approximated through the respective pairwise constraints.

7.3.2 LambdaMART

LambdaMART (Burges, 2010) is one of the most popular algorithms for preference learning, and it follows exactly this idea. Its optimization problem relies on a distance measure between distributions, the cross entropy, computed between a learned distribution¹ that gives the probability that item $j$ is more relevant than item $k$ and the true distribution, which has a probability mass of one if item $j$ is really more relevant than item $k$ and zero otherwise. The final loss function of LambdaMART, defined over all users $i$ and over all the respective pairwise preferences for items $j$, $k$, is given by:

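$$L(\mathbf{Y},\hat{\mathbf{Y}}) = \sum_{i=1}^{n} \sum_{\{jk\} \in Z} |\Delta \mathrm{NDCG}^{i}_{jk}| \log\left(1 + e^{-\sigma(\hat{y}_{ij}-\hat{y}_{ik})}\right) \tag{7.3}$$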

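where $Z$ is the set of all possible pairwise preference constraints such that $y_{ij} > y_{ik}$ holds in the ground truth relevance vector, and $\Delta \mathrm{NDCG}^{i}_{jk}$ is given by:

$$\Delta \mathrm{NDCG}^{i}_{jk} = \mathrm{NDCG}(\mathbf{y}_i,\hat{\mathbf{y}}_i) - \mathrm{NDCG}(\mathbf{y}^{jk}_i,\hat{\mathbf{y}}_i)$$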

where $\mathbf{y}^{jk}_i$ is the same as the ground truth relevance vector $\mathbf{y}_i$ except that the values of $y_{ij}$ and $y_{ik}$ are swapped. This is also equal to the NDCG difference that we get if we swap the estimates $\hat{y}_{ij}$ and $\hat{y}_{ik}$. Thus the overall loss function of LambdaMART, Eq.(7.3), is the sum of the logistic losses on all pairwise preference constraints, weighted by the respective NDCG differences. Since the NDCG measure heavily penalizes errors on the top items, the loss function of LambdaMART has the same property. LambdaMART minimizes its loss function with respect to all $\hat{y}_{ij}$, $\hat{y}_{ik}$, and its optimization problem is:

$$\min_{\hat{\mathbf{Y}}} L(\mathbf{Y},\hat{\mathbf{Y}}) \tag{7.4}$$
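The loss of Eq.(7.3) can be evaluated directly for small examples. The sketch below does so under several simplifying assumptions that are not stated in the text: every row of the preference matrix is fully observed, NDCG is computed without truncation (that is, with $k = m$), ties are broken by position, and all function names are hypothetical.

```python
import math


def ranks_desc(scores):
    """1-based rank of each entry when entries are sorted in decreasing order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, i in enumerate(order):
        ranks[i] = position + 1
    return ranks


def dcg(ranks, y):
    """Untruncated DCG, i.e. Eq.(7.1) with k equal to the number of items."""
    return sum((2 ** y_i - 1) / math.log2(r_i + 1) for r_i, y_i in zip(ranks, y))


def ndcg(y, y_hat):
    ideal = dcg(ranks_desc(y), y)
    return dcg(ranks_desc(y_hat), y) / ideal if ideal > 0 else 0.0


def delta_ndcg(y, y_hat, j, k):
    """NDCG(y, y_hat) - NDCG(y^{jk}, y_hat), where y^{jk} swaps entries j and k of y."""
    y_swapped = list(y)
    y_swapped[j], y_swapped[k] = y_swapped[k], y_swapped[j]
    return ndcg(y, y_hat) - ndcg(y_swapped, y_hat)


def lambdamart_loss(Y, Y_hat, sigma=1.0):
    """Eq.(7.3): NDCG-weighted logistic loss over all pairs with y_ij > y_ik."""
    total = 0.0
    for y, y_hat in zip(Y, Y_hat):          # loop over users i
        for j in range(len(y)):
            for k in range(len(y)):
                if y[j] > y[k]:             # pair {jk} belongs to Z
                    weight = abs(delta_ndcg(y, y_hat, j, k))
                    margin = y_hat[j] - y_hat[k]
                    total += weight * math.log(1.0 + math.exp(-sigma * margin))
    return total


Y = [[2, 1, 0]]                 # one user, three items
well_ordered = [[3.0, 2.0, 1.0]]
reversed_order = [[1.0, 2.0, 3.0]]
print(lambdamart_loss(Y, well_ordered), lambdamart_loss(Y, reversed_order))
```

On this toy input the loss is smaller for the predictions that respect the ground truth order than for the reversed ones, as one would expect from a sum of weighted logistic losses on the pairwise constraints.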

¹ This learned distribution is generated by the sigmoid function $P^{i}_{jk} = \frac{1}{1+e^{-\sigma(\hat{y}_{ij}-\hat{y}_{ik})}}$ of the estimated preferences $\hat{y}_{ij}$, $\hat{y}_{ik}$.

Yue and Burges (2007) have shown empirically that solving this problem also optimizes the NDCG metric of the learned model. The partial derivative of LambdaMART's loss function with respect to the estimated scores $\hat{y}_{ij}$ is

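$$\frac{\partial L(\mathbf{Y},\hat{\mathbf{Y}})}{\partial \hat{y}_{ij}} = \sum_{k:\{jk\} \in Z} \lambda^{i}_{jk} - \sum_{k:\{kj\} \in Z} \lambda^{i}_{kj} \tag{7.5}$$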

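and $\lambda^{i}_{jk}$ is given by:

$$\lambda^{i}_{jk} = \frac{-\sigma}{1+e^{\sigma(\hat{y}_{ij}-\hat{y}_{ik})}} |\Delta \mathrm{NDCG}^{i}_{jk}| \tag{7.6}$$

With a slight abuse of notation, below we will write $\frac{\partial L(y_{ij},\hat{y}_{ij})}{\partial \hat{y}_{ij}}$ instead of $\frac{\partial L(\mathbf{Y},\hat{\mathbf{Y}})}{\partial \hat{y}_{ij}}$, to make explicit the dependence of the partial derivative only on $y_{ij}$, $\hat{y}_{ij}$, due to the linearity of $L(\mathbf{Y},\hat{\mathbf{Y}})$.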
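The gradient of Eq.(7.5) and Eq.(7.6) can be checked numerically for a single user. In the sketch below, the $|\Delta \mathrm{NDCG}|$ weights are passed in as a fixed matrix `weight` and treated as constants during differentiation, which is an assumption of this example (as are the function names and the toy numbers); under that assumption the analytic gradient should agree with a finite-difference estimate of the weighted pairwise loss.

```python
import math


def weighted_pairwise_loss(y, y_hat, weight, sigma=1.0):
    """Single-user version of Eq.(7.3) with fixed pair weights weight[j][k]."""
    return sum(abs(weight[j][k]) * math.log(1.0 + math.exp(-sigma * (y_hat[j] - y_hat[k])))
               for j in range(len(y)) for k in range(len(y)) if y[j] > y[k])


def lambda_gradients(y, y_hat, weight, sigma=1.0):
    """Eq.(7.5): for each item j, the sum of lambda_jk minus the sum of lambda_kj."""
    grad = [0.0] * len(y)
    for j in range(len(y)):
        for k in range(len(y)):
            if y[j] > y[k]:  # pair {jk} belongs to Z
                # Eq.(7.6)
                lam = -sigma / (1.0 + math.exp(sigma * (y_hat[j] - y_hat[k]))) * abs(weight[j][k])
                grad[j] += lam  # item j is the preferred item of the pair
                grad[k] -= lam  # item k is the less preferred item of the pair
    return grad


y = [2, 1, 0]
y_hat = [0.3, -0.2, 0.5]
# Hypothetical |delta NDCG| values, held fixed for this check.
weight = [[0.0, 0.4, 0.7],
          [0.4, 0.0, 0.1],
          [0.7, 0.1, 0.0]]

analytic = lambda_gradients(y, y_hat, weight)

eps = 1e-6
numeric = []
for j in range(len(y)):
    up, down = list(y_hat), list(y_hat)
    up[j] += eps
    down[j] -= eps
    numeric.append((weighted_pairwise_loss(y, up, weight)
                    - weighted_pairwise_loss(y, down, weight)) / (2 * eps))

print(analytic)
print(numeric)
```

Because each pair contributes $\lambda$ to one item and $-\lambda$ to the other, the entries of the gradient sum to zero: the loss only depends on score differences.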

LambdaMART uses Multiple Additive Regression Trees (MART) (Friedman, 2001) to solve its optimization problem. It does so through a gradient descent in some functional space that generates preference scores from item and user descriptions, i.e. $\hat{y}_{ij} = f(\mathbf{x}_i,\mathbf{w}_j)$, where the update of the preference scores at step $t$ of the gradient descent is given by:

$$\hat{y}^{(t)}_{ij} = \hat{y}^{(t-1)}_{ij} - \eta \frac{\partial L(y_{ij},\hat{y}^{(t-1)}_{ij})}{\partial \hat{y}^{(t-1)}_{ij}} \tag{7.7}$$

or equivalently:

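$$f^{(t)}(\mathbf{x}_i,\mathbf{w}_j) = f^{(t-1)}(\mathbf{x}_i,\mathbf{w}_j) - \eta \frac{\partial L(y_{ij}, f^{(t-1)}(\mathbf{x}_i,\mathbf{w}_j))}{\partial f^{(t-1)}(\mathbf{x}_i,\mathbf{w}_j)} \tag{7.8}$$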

where $\eta$ is the learning rate. We terminate the gradient descent when we reach a given number of iterations $T$ or when the validation loss, measured in terms of NDCG, starts to increase. We approximate the negative derivative $-\frac{\partial L(y_{ij},\hat{y}^{(t-1)}_{ij})}{\partial \hat{y}^{(t-1)}_{ij}}$ by learning at each step $t$ a regression tree $h^{(t)}(\mathbf{x},\mathbf{w})$ that minimizes the sum of squared errors. Thus at each update step we have

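$$f^{(t)}(\mathbf{x}_i,\mathbf{w}_j) = f^{(t-1)}(\mathbf{x}_i,\mathbf{w}_j) + \eta h^{(t)}(\mathbf{x}_i,\mathbf{w}_j) \tag{7.9}$$

which, if we denote by $\gamma_{tk}$ the prediction of the $k$th terminal node of the $h^{(t)}$ tree and by $h_{tk}$ the respective partition of the input space, we can rewrite as:

$$f^{(t)}(\mathbf{x}_i,\mathbf{w}_j) = f^{(t-1)}(\mathbf{x}_i,\mathbf{w}_j) + \eta \gamma_{tk} I((\mathbf{x}_i,\mathbf{w}_j) \in h_{tk}) \tag{7.10}$$

We can further optimize over the $\gamma_{tk}$ values to minimize the loss function of Eq.(7.3) over the instances of each $h_{tk}$ partition using Newton's approximation. The final preference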

LambdaMART is a very effective algorithm for learning to rank problems, see e.g. the works of Burges et al. (2011); Donmez et al. (2009). It learns non-linear relevance scores, $\hat{y}_{ij}$, using gradient boosted regression trees. The number of parameters it fits is given by the number of available preference scores (this is typically some fraction of $n \times m$); there is no regularization on them to prevent overfitting. The only protection against overfitting comes from rather empirical approaches, such as constraining the size of the regression trees or selecting the learning rate $\eta$.
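The boosting procedure can be sketched in a few lines of Python. This is a deliberately simplified illustration and not LambdaMART itself: it handles a single user, sets all $|\Delta \mathrm{NDCG}|$ weights to 1, uses depth-1 regression trees (stumps) whose leaf values are plain means of the fitted targets, and omits the Newton step. The function names, the toy data, and the values of `n_trees` and `eta` are all assumptions. Each row of `X` stands for the concatenated description of one user-item pair.

```python
import math


def pairwise_loss(y, f, sigma=1.0):
    """Pairwise logistic loss with all pair weights set to 1 (a simplification)."""
    return sum(math.log(1.0 + math.exp(-sigma * (f[j] - f[k])))
               for j in range(len(y)) for k in range(len(y)) if y[j] > y[k])


def negative_gradient(y, f, sigma=1.0):
    """Negative of the derivative of the loss with respect to each score."""
    g = [0.0] * len(y)
    for j in range(len(y)):
        for k in range(len(y)):
            if y[j] > y[k]:
                lam = -sigma / (1.0 + math.exp(sigma * (f[j] - f[k])))
                g[j] -= lam
                g[k] += lam
    return g


def fit_stump(X, target):
    """Least-squares depth-1 tree: (feature, threshold, left value, right value).

    Assumes at least one feature takes two or more distinct values.
    """
    best = None
    for d in range(len(X[0])):
        for t in sorted({row[d] for row in X})[:-1]:  # keeps both sides non-empty
            left = [v for row, v in zip(X, target) if row[d] <= t]
            right = [v for row, v in zip(X, target) if row[d] > t]
            left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((v - left_mean) ** 2 for v in left)
                   + sum((v - right_mean) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, d, t, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    d, t, left_value, right_value = stump
    return left_value if x[d] <= t else right_value


def boost(X, y, n_trees=20, eta=0.1, sigma=1.0):
    """Repeat the update f <- f + eta * h, with h fitted to the negative gradient."""
    f = [0.0] * len(y)
    trees = []
    for _ in range(n_trees):
        stump = fit_stump(X, negative_gradient(y, f, sigma))
        trees.append(stump)
        f = [f_i + eta * predict_stump(stump, x) for f_i, x in zip(f, X)]
    return trees, f


X = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.8], [0.1, 0.2]]  # toy descriptions
y = [3, 2, 1, 0]                                        # toy relevance scores
trees, f = boost(X, y)
print(pairwise_loss(y, [0.0] * len(y)), pairwise_loss(y, f))
```

The two knobs mentioned above appear here as the depth of the trees (fixed to one split) and the learning rate `eta`; in this sketch they are the only controls on how closely the model fits the training preferences.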

From the document: Meta-mining: a meta-learning framework to support the recommendation, planning and optimization of data mining workflows (pages 123-127)