**Chapter 7 Meta-mining as a Learning to Rank Problem 101**

**7.3 Learning to Rank**

In the different preference learning problems, we are given a (sparse) preference matrix, Y, of size n×m; a non-missing y_{ij} entry represents either the relevance (or preference) score between the i^{th} user and the j^{th} item in recommendation, the performance score between the i^{th} dataset and the j^{th} workflow in meta-mining, or the relevance score between the i^{th} query and the j^{th} document in learning to rank. Note that, to simplify the presentation, we will only refer to user-item relevance in what follows. In addition to the given preference matrix, we also assume that the descriptions of the users and the items are available. Similarly to the notation introduced in Section 4.3, we denote by x_{i} = (x_{i1}, . . . , x_{id})^{T} ∈ R^{d} the d-dimensional description of the i^{th} user, and by X the n×d user description matrix, the i^{th} row of which is given by x_{i}. We denote by w_{j} = (w_{j1}, . . . , w_{jl})^{T} ∈ R^{l} the l-dimensional description of the j^{th} item, and by W the m×l item description matrix, the j^{th} row of which is given by w_{j} (in meta-mining, we have l = |P|, the number of workflow patterns).
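To fix the data layout just described, the following minimal sketch builds a sparse preference matrix Y together with the description matrices X and W; the sizes and random values are purely illustrative assumptions, not values from the text.

```python
# Illustrative data layout for the preference learning setting.
# Names (Y, X, W, n, m, d, l) follow the chapter's notation; all
# concrete sizes and values here are assumptions for demonstration.
import numpy as np

n, m = 4, 5      # number of users (datasets) and items (workflows)
d, l = 3, 2      # dimensionality of user and item descriptions

rng = np.random.default_rng(0)

# Sparse preference matrix Y: missing entries are represented as NaN.
Y = rng.random((n, m))
Y[rng.random((n, m)) < 0.4] = np.nan   # drop roughly 40% of the entries

X = rng.random((n, d))   # user (dataset) descriptions, one row per user
W = rng.random((m, l))   # item (workflow) descriptions, one row per item

# Only the observed (non-missing) entries carry preference information.
observed = ~np.isnan(Y)
print(observed.sum(), "observed preference scores out of", n * m)
```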

Since only the rank order of the y_{ij} matters for the recommendation of items to a user, an ideal preference learning algorithm does not predict the exact value of y_{ij}. Instead, it predicts the rank order induced by the preference vector y. We will denote by r_{x_i} the target rank vector of the i^{th} user or dataset, given by ordering the i^{th} user's non-missing preference scores over the items; its k^{th} entry, r_{x_i k}, is the rank or position of the k^{th} non-missing relevance score. Note that in the converse problem, i.e. the recommendation of users to an item or workflow, we will denote by r_{w_j} the target rank vector of the j^{th} item, given by ordering the j^{th} item's non-missing preference scores over the users.
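As a small illustration of this definition, the sketch below computes the target rank vector of a single user from its non-missing preference scores; the concrete score values are made-up examples.

```python
# Computing the target rank vector r_{x_i} of one user from its
# non-missing preference scores (higher score -> better, i.e. rank 1).
# The example scores are illustrative assumptions.
import numpy as np

y_i = np.array([0.3, np.nan, 0.9, 0.5])       # one user's preferences, NaN = missing
scores = y_i[~np.isnan(y_i)]                  # the non-missing scores: [0.3, 0.9, 0.5]

order = np.argsort(-scores)                   # item indices sorted by decreasing score
ranks = np.empty_like(order)
ranks[order] = np.arange(1, len(scores) + 1)  # invert the permutation: rank 1 = best

print(ranks)  # -> [3 1 2]: 0.3 is ranked 3rd, 0.9 is 1st, 0.5 is 2nd
```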

7.3.1 Evaluation Metric

As only a few top-ranked items are finally shown to the users in real applications of preference learning, e.g. recommendation, information retrieval, etc., the evaluation metrics of preference learning heavily emphasize the correctness of the top items. One very often used metric is the Discounted Cumulative Gain (DCG) (Järvelin and Kekäläinen, 2000). It is defined as follows:

DCG(r, y)@k = ∑_{i=1}^{m} (2^{y_i} − 1) / log_2(r_i + 1) · I(r_i ≤ k)   (7.1)
where k is the truncation level at which the DCG is computed and I is the indicator function, which returns 1 if its argument holds and 0 otherwise. The vector y is the m-dimensional ground truth relevance and r is a rank vector that we will learn. The DCG score measures the match between the given rank vector r and the rank vector of the relevance score y; the larger the better. It is easy to check that, according to Eq.(7.1), if the rank vector r correctly preserves the order induced by y, the DCG score achieves its maximum. The most important property of the DCG score is that, due to the log denominator, it incurs a larger penalty for misplacing top items than end items. As a result, it emphasizes the correctness of the top items. Since the DCG score also depends on the length of the relevance vector y, it is often normalized with respect to its maximum score, giving the Normalized DCG (NDCG), defined as

NDCG(y, ŷ)@k = DCG(r(ŷ), y)@k / DCG(r(y), y)@k   (7.2)

where r(·) is a rank function whose output is the vector of rank positions of its argument, ordered in a decreasing manner. The vector r(y) is the rank of the ground truth relevance vector y and r(ŷ) is the rank of the predicted relevance ŷ provided by the learned model. With the normalization, the value of NDCG ranges from 0 to 1, the larger the better. In this chapter, we will also use this metric as our evaluation metric.
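The two definitions above can be transcribed almost literally into code. The sketch below assumes graded relevance scores and ranks starting at 1; the example vectors are made up for illustration.

```python
# A direct transcription of Eqs. (7.1) and (7.2). Ranks start at 1;
# the example relevance and score vectors are illustrative assumptions.
import numpy as np

def dcg_at_k(r, y, k):
    """DCG of rank vector r against relevance y, truncated at level k (Eq. 7.1)."""
    r, y = np.asarray(r), np.asarray(y)
    gains = (2.0 ** y - 1) / np.log2(r + 1)
    return gains[r <= k].sum()          # the indicator I(r_i <= k)

def rank_of(y):
    """Rank positions of y ordered decreasingly (rank 1 = largest entry)."""
    order = np.argsort(-np.asarray(y))
    ranks = np.empty(len(y), dtype=int)
    ranks[order] = np.arange(1, len(y) + 1)
    return ranks

def ndcg_at_k(y, y_hat, k):
    """NDCG of predicted scores y_hat against ground truth y (Eq. 7.2)."""
    return dcg_at_k(rank_of(y_hat), y, k) / dcg_at_k(rank_of(y), y, k)

y = [3, 1, 2]                                # ground-truth graded relevances
print(ndcg_at_k(y, [0.9, 0.1, 0.5], k=3))    # perfect order -> 1.0
print(ndcg_at_k(y, [0.1, 0.9, 0.5], k=3))    # top item misplaced -> below 1.0
```

Note how misplacing the most relevant item (second call) is penalized through the log_2(r_i + 1) denominator, exactly the top-heavy behavior discussed above.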

The main difficulty in learning preferences is that rank functions are not continuous and have combinatorial complexity. Thus, most often, the rank of the preference scores is approximated through the respective pairwise constraints.

7.3.2 LambdaMART

LambdaMART (Burges, 2010) is one of the most popular algorithms for preference learning, and it follows exactly this idea. Its optimization problem relies on a distribution distance measure, the cross entropy, between a learned distribution^{1} that gives the probability that item j is more relevant than item k, and the true distribution, which has a probability mass of one if item j is indeed more relevant than item k and zero otherwise. The final loss function of LambdaMART, defined over all users i and over all the respective pairwise preferences for items j, k, is given by:

L(Y, Ŷ) = ∑_{i=1}^{n} ∑_{{jk}∈Z} |ΔNDCG^{i}_{jk}| log(1 + e^{−σ(ŷ_{ij} − ŷ_{ik})})   (7.3)

where Z is the set of all possible pairwise preference constraints such that y_{ij} > y_{ik} holds in the ground truth relevance vector, and ΔNDCG^{i}_{jk} is given by:

ΔNDCG^{i}_{jk} = NDCG(y_{i}, ŷ_{i}) − NDCG(y^{jk}_{i}, ŷ_{i})

where y^{jk}_{i} is the same as the ground truth relevance vector y_{i} except that the values of y_{ij} and y_{ik} are swapped. This is also equal to the NDCG difference that we get if we swap the ŷ_{ij}, ŷ_{ik} estimates. Thus the overall loss function of LambdaMART, Eq.(7.3), is the sum of the logistic losses on all pairwise preference constraints, weighted by the respective NDCG differences. Since the NDCG measure heavily penalizes errors on the top items, the loss function of LambdaMART has the same property. LambdaMART minimizes its loss function with respect to all the ŷ_{ij}, ŷ_{ik}, and its optimization problem is:

min_{Ŷ} L(Y, Ŷ)   (7.4)

Yue and Burges (2007) have shown empirically that solving this problem also optimizes the NDCG metric of the learned model. The partial derivative of LambdaMART's loss

^{1} This learned distribution is generated by the sigmoid function P^{i}_{jk} = 1 / (1 + e^{−σ(ŷ_{ij} − ŷ_{ik})}) of the estimated preferences ŷ_{ij}, ŷ_{ik}.


function with respect to the estimated scores ŷ_{ij} is

∂L(Y, Ŷ) / ∂ŷ_{ij} = λ^{i}_{j} = ∑_{{k|jk}∈Z} λ^{i}_{jk} − ∑_{{k|kj}∈Z} λ^{i}_{kj}   (7.5)

and λ^{i}_{jk} is given by:

λ^{i}_{jk} = −σ / (1 + e^{σ(ŷ_{ij} − ŷ_{ik})}) · |ΔNDCG^{i}_{jk}|   (7.6)
With a slight abuse of notation, below we will write ∂L(y_{ij}, ŷ_{ij}) / ∂ŷ_{ij} instead of ∂L(Y, Ŷ) / ∂ŷ_{ij}, to make explicit the dependence of the partial derivative only on y_{ij}, ŷ_{ij}, due to the linearity of L(Y, Ŷ).
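The accumulation of the pairwise λ^{i}_{jk} terms into per-item gradients λ^{i}_{j}, as in Eqs. (7.5)-(7.6), can be sketched for a single user as follows. To keep the example short, the |ΔNDCG^{i}_{jk}| weight is fixed to 1 for every pair; in the actual algorithm it would be the NDCG difference of Eq. (7.3). All concrete values are illustrative.

```python
# Lambda-gradients of Eqs. (7.5)-(7.6) for one user i.
# Assumption for brevity: |ΔNDCG^i_{jk}| = 1 for every pair.
import numpy as np

def lambda_jk(y_hat_j, y_hat_k, delta_ndcg, sigma=1.0):
    """lambda^i_{jk} of Eq. (7.6) for one pair with y_ij > y_ik."""
    return -sigma / (1.0 + np.exp(sigma * (y_hat_j - y_hat_k))) * abs(delta_ndcg)

def lambdas(y, y_hat, sigma=1.0):
    """lambda^i_j of Eq. (7.5): accumulate the pair gradients per item."""
    lam = np.zeros(len(y))
    for j in range(len(y)):
        for k in range(len(y)):
            if y[j] > y[k]:                      # the pair {jk} belongs to Z
                l_jk = lambda_jk(y_hat[j], y_hat[k], delta_ndcg=1.0, sigma=sigma)
                lam[j] += l_jk                   # first sum: j is the preferred item
                lam[k] -= l_jk                   # second sum: k gets the opposite push
    return lam

y = np.array([2, 0, 1])            # ground-truth relevances of one user
y_hat = np.array([0.1, 0.4, 0.2])  # current model scores (wrong order)
print(lambdas(y, y_hat))           # negative for items scored too low, positive too high
```

Since each pair contributes l_jk to one item and −l_jk to the other, the per-user gradients always sum to zero; gradient descent therefore pushes misranked items apart rather than shifting all scores.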

LambdaMART uses Multiple Additive Regression Trees (MART) (Friedman, 2001) to solve its optimization problem. It does so through gradient descent in some functional space that generates the preference scores from the item and user descriptions, i.e. ŷ_{ij} = f(x_{i}, w_{j}), where the update of the preference scores at the t^{th} step of the gradient descent is given by:

ŷ^{(t)}_{ij} = ŷ^{(t−1)}_{ij} − η ∂L(y_{ij}, ŷ^{(t−1)}_{ij}) / ∂ŷ^{(t−1)}_{ij}   (7.7)

or equivalently:

f^{(t)}(x_{i}, w_{j}) = f^{(t−1)}(x_{i}, w_{j}) − η ∂L(y_{ij}, f^{(t−1)}(x_{i}, w_{j})) / ∂f^{(t−1)}(x_{i}, w_{j})   (7.8)

where η is the learning rate. We terminate the gradient descent when we reach a given number of iterations T or when the loss on a validation set starts to increase. We approximate the derivative ∂L(y_{ij}, ŷ^{(t−1)}_{ij}) / ∂ŷ^{(t−1)}_{ij} by learning at each step t a regression tree h^{(t)}(x, w) that minimizes the sum of squared errors. Thus at each update step we have

f^{(t)}(x_{i}, w_{j}) = f^{(t−1)}(x_{i}, w_{j}) + η h^{(t)}(x_{i}, w_{j})   (7.9)

which, if we denote by γ_{tk} the prediction of the k^{th} terminal node of the h^{(t)} tree and by h_{tk} the respective partition of the input space, we can rewrite as:

f^{(t)}(x_{i}, w_{j}) = f^{(t−1)}(x_{i}, w_{j}) + η γ_{tk} I((x_{i}, w_{j}) ∈ h_{tk})   (7.10)
We can further optimize over the γ_{tk} values to minimize the loss function of Eq.(7.3) over the instances of each h_{tk} partition using Newton's approximation. The final preference model is the function f^{(T)} obtained after the last boosting step.

LambdaMART is a very effective algorithm for learning to rank problems, see e.g. the works of Burges et al. (2011); Donmez et al. (2009). It learns non-linear relevance scores, ŷ_{ij}, using gradient boosted regression trees. The number of parameters it fits is given by the number of available preference scores (typically some fraction of n×m); there is no regularization on them to prevent overfitting. The only protection against overfitting comes from rather empirical approaches, such as constraining the size of the regression trees or selecting the learning rate η.
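To make the mechanics of the descent concrete, the following minimal sketch runs the score update of Eq. (7.7) for a single user, using the lambda-gradients of Eqs. (7.5)-(7.6) with all |ΔNDCG| weights set to 1 for brevity. In LambdaMART itself each step would instead be taken by a regression tree fit to these gradients (Eqs. 7.8-7.9); the relevance values, initial scores, step count, and learning rate below are illustrative assumptions.

```python
# Eq. (7.7) before the tree approximation: plain gradient descent on the
# scores of one user, using the lambda-gradients of Eqs. (7.5)-(7.6)
# with unit ΔNDCG weights (an assumption made for brevity).
import numpy as np

def lambda_grad(y, y_hat, sigma=1.0):
    """lambda^i_j of Eq. (7.5) for one user, with |ΔNDCG| = 1 per pair."""
    lam = np.zeros(len(y))
    for j in range(len(y)):
        for k in range(len(y)):
            if y[j] > y[k]:                                   # pair {jk} in Z
                l_jk = -sigma / (1.0 + np.exp(sigma * (y_hat[j] - y_hat[k])))
                lam[j] += l_jk
                lam[k] -= l_jk
    return lam

y = np.array([2.0, 0.0, 1.0])       # ground truth: item 0 > item 2 > item 1
y_hat = np.array([0.0, 0.0, 0.0])   # initial scores
eta = 0.1                            # learning rate (illustrative)

for t in range(200):                 # Eq. (7.7): y_hat <- y_hat - eta * dL/dy_hat
    y_hat = y_hat - eta * lambda_grad(y, y_hat)

print(np.argsort(-y_hat))            # induced order after the descent: [0 2 1]
```

After a few steps the scores induce the ground-truth order, which is exactly the point of the pairwise approximation: the descent optimizes the order, not the values of the preferences.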