As described in Section 4.5.2, the preference score that corresponds to each dataset-workflow pair is given by a number of significance tests on the performances of the different workflows that are applied on a given dataset. More concretely, if a workflow is significantly better than another one on the given dataset it gets one point, if there is no significant difference between the two workflows then each gets half a point, and if it is significantly worse it gets zero points. Thus if a workflow outperforms all other workflows in a statistically significant manner it will get m−1 points, where m is the total number of workflows (here 35). In the Matrix Completion and the Full Cold Start settings, we take care to compute the preference scores with respect to the training workflows, since for these two scenarios the total number of workflows is less than 35. In addition, we have rescaled the preference scores from the 0–34 interval to the 0–5 interval, to avoid large preference scores from overwhelming the NDCG, which is exponential with respect to the preference score. That is, we have transformed our meta-mining problem into a five-star recommendation problem.
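As a minimal sketch of this scoring scheme, the snippet below computes the per-workflow preference scores from a hypothetical matrix `better` that encodes the outcomes of the pairwise significance tests (+1 for significantly better, 0 for no significant difference, −1 for significantly worse); the encoding and names are illustrative, not the actual implementation.

```python
import numpy as np

def preference_scores(better, m=35):
    """Per-workflow preference scores on one dataset from pairwise
    significance-test outcomes, rescaled to the 0-5 star scale."""
    scores = np.zeros(m)
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            if better[i, j] > 0:       # significantly better: one point
                scores[i] += 1.0
            elif better[i, j] == 0:    # no significant difference: half a point
                scores[i] += 0.5
            # significantly worse: zero points
    # a workflow that significantly beats all others gets m - 1 = 34 points;
    # rescale from [0, m - 1] to [0, 5]
    return 5.0 * scores / (m - 1)
```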
We have 113 dataset characteristics that give statistical, information-theoretic, geometrical and topological, landmarking and model-based descriptors of the datasets, and 214 workflow characteristics derived from a propositionalization of a set of 214 tree-structured generalized workflow patterns extracted from the ground specifications of DM workflows; see Section 4.5.2 for more details.
We fix the parameters of the LambdaMART baseline for the construction of the regression trees to the following values: the maximum number of nodes for the regression trees is three, the maximum percentage of instances in each leaf node is 10%, and the learning rate, η, of the gradient boosted tree algorithm is 10⁻². To build the ensemble trees of LM-MF and LM-MF-Reg we use the same parameter settings as in LambdaMART. We optimize their input and output regularization parameters µ1 and µ2 over the grid [0.1, 1, 5, 7, 10]², by three-fold inner cross-validation. We fix the number of nearest neighbors for the Laplacian matrices to five. To compute the output-space similarities, we use the NDCG similarity measure defined in Eq. (7.21), where the truncation level k is set to the level at which we report the results each time.
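As an illustration of this model-selection step, the sketch below enumerates the (µ1, µ2) grid with three-fold inner cross-validation; `train_lm_mf_reg` and `validation_ndcg` are hypothetical stand-ins for the actual training routine and NDCG@k evaluation.

```python
import itertools
import numpy as np

GRID = [0.1, 1, 5, 7, 10]

def select_regularizers(inner_folds, k):
    """Pick (mu1, mu2) maximizing the mean validation NDCG@k over
    three inner train/validation folds (illustrative sketch)."""
    best_pair, best_score = None, -np.inf
    for mu1, mu2 in itertools.product(GRID, GRID):
        fold_scores = []
        for train_part, valid_part in inner_folds:
            model = train_lm_mf_reg(train_part, mu1=mu1, mu2=mu2)  # hypothetical trainer
            fold_scores.append(validation_ndcg(model, valid_part, k))  # hypothetical scorer
        if np.mean(fold_scores) > best_score:
            best_pair, best_score = (mu1, mu2), np.mean(fold_scores)
    return best_pair
```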
To build the different recommendation scenarios, we proceed as follows. In Matrix Completion, we randomly select N workflows for each dataset to build the training set. We choose N in {5, 10, 15}, which gives 86%, 71% and 57% of missing values in the target matrix respectively. From these missing values, we keep ten workflows per dataset for validation and we use the rest for testing. In this task, the numbers of users and items are fixed, so we fix the number of hidden factors r to min(n, m) = 35 for the three matrix factorization algorithms, LM-MF, LM-MF-Reg and CofiRank. For the CofiRank baseline we use the default parameters as provided in Weimer et al. (2007). We report the average testing NDCG@5 measure, applied to the test workflows of each dataset.
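A minimal sketch of this splitting protocol, assuming the 65 × 35 dataset-by-workflow dimensions of our collection (the helper names and the fixed seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed for illustration
n_datasets, m = 65, 35

def matrix_completion_split(N):
    """Per dataset: N random workflows for training, ten for validation,
    and the remaining m - N - 10 for testing."""
    splits = []
    for _ in range(n_datasets):
        perm = rng.permutation(m)
        splits.append((perm[:N], perm[N:N + 10], perm[N + 10:]))
    return splits

for N in (5, 10, 15):
    print(N, f"missing values: {1 - N / m:.0%}")   # 86%, 71%, 57%
```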
In the User Cold Start scenario, we evaluate the performance of the system in providing accurate recommendations for new datasets. To do so we use leave-one-dataset-out: we train the different methods on 64 datasets and evaluate their recommendations on the left-out dataset. Since there are no missing workflows, we fix the number of hidden factors r to min(n, m) = 35.
In the Full Cold Start scenario, in addition to leave-one-dataset-out, we also randomly partition the workflows, using 70% for training and the other 30% to define the test workflows for the left-out dataset. That is, the number of workflows in the training set is ⌊0.7 × 35⌋ = 24, which also defines the number of hidden factors r; the remaining 11 workflows form the test set. For the two cold start scenarios, we report the average testing NDCG measure, computed at the truncation levels k = 1, 3, 5.
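The sketch below illustrates the 70/30 workflow partition together with a truncated NDCG evaluation. The exponential gain 2^s − 1 is the standard graded-relevance form, consistent with the exponential behaviour of the NDCG noted above, although Eq. (7.21) gives the exact definition used here.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 35

def full_cold_start_split():
    """70/30 workflow partition for the left-out dataset:
    floor(0.7 * 35) = 24 training workflows, 11 test workflows."""
    perm = rng.permutation(m)
    n_train = int(np.floor(0.7 * m))   # 24, which also sets the number of hidden factors r
    return perm[:n_train], perm[n_train:]

def dcg_at_k(gains, k):
    """DCG with exponential gain 2^s - 1 over 0-5 preference scores."""
    gains = np.asarray(gains, dtype=float)[:k]
    return np.sum((2.0 ** gains - 1.0) / np.log2(np.arange(2, gains.size + 2)))

def ndcg_at_k(true_scores, pred_scores, k):
    """NDCG@k of the true preference scores under the predicted ranking."""
    true_scores = np.asarray(true_scores, dtype=float)
    order = np.argsort(pred_scores)[::-1]            # rank by predicted score
    ideal = dcg_at_k(np.sort(true_scores)[::-1], k)  # best achievable DCG
    return dcg_at_k(true_scores[order], k) / ideal if ideal > 0 else 0.0
```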
For each method applied on a given recommendation task, we report the number of times we have a performance win or loss compared to the performance of the baselines. On these win/loss pairs we perform a McNemar's test of statistical significance and report its results; we set the significance level at p = 0.05. For Matrix Completion, we collectively denote the results of these performance comparisons by δCR and δLM, and give the complete results in Table C.1. For User Cold Start, these are denoted by δUB and δLM, the results of which are given in Table C.2. For Full Cold Start, these are denoted by δFB and δLM, the results of which are given in Table C.3.
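As an illustration, the win/loss comparison can be tested with the McNemar implementation from statsmodels; the counts below are hypothetical.

```python
from statsmodels.stats.contingency_tables import mcnemar

def win_loss_significance(wins, losses, alpha=0.05):
    """McNemar's test on the discordant win/loss counts of a method
    against a baseline; only the off-diagonal counts matter."""
    table = [[0, wins],
             [losses, 0]]
    result = mcnemar(table, exact=True)   # exact binomial test, suited to small counts
    return result.pvalue, result.pvalue < alpha

# hypothetical example: 20 wins against 5 losses over the 65 datasets
p_value, significant = win_loss_significance(20, 5)
```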
We now discuss the results for the three recommendation scenarios. For Matrix Completion, the first observation we can make looking at Table C.1 is that CofiRank is the method that achieves the lowest performance for all values of N. The other methods have similar results and significantly beat CofiRank for N = 10, with the small exception of LM-MF-Reg, which is close to significantly beating this baseline too (p-value = 0.0824). In addition, the two methods LM-MF and LM-MF-Reg achieve the best results for N = 5, where the regularized variant significantly beats CofiRank and LM-MF is close to beating this baseline in a significant manner too (p-value = 0.0824). Thus, since these methods use side information to constrain the learned model, they are able to gain a substantial performance improvement over CofiRank, especially when the target matrix is very sparse.
Looking now at the User Cold Start setting, Table C.2, we can see that the two methods that achieve the best performances for all three values of k are LM-MF and LM-MF-Reg. For k = 1 and 3, these two methods significantly beat the UB baseline, with an NDCG performance drop of 0.12 and 0.7 respectively. They are also close to beating LambdaMART in a significant manner: p-value = 0.0636 for k = 1, and, for k = 3, p-value = 0.0714 with LM-MF and 0.1056 with LM-MF-Reg. For k = 5, they also outperform the two baselines on a majority of datasets, but not in a significant manner. This is consistent with our findings (Chapter 6, Section 6.6.5), where the meta-learning approach Eucl has the highest correlation gain with k = 5; see Figure 6.4a. Finally, the LambdaMART variant that makes use of the weighted NDCG, LMW, does not seem to bring an improvement over the plain vanilla NDCG, LM.
For the Full Cold Start setting, Table C.3, the two methods LM-MF and LM-MF-Reg now achieve very good performances. Both significantly beat LambdaMART for all values of the truncation level k, with a performance drop of 0.7 on average. The two methods are also able to significantly beat FB for k = 1, 3, with the small exception of the regularized variant, which is close to significance for k = 3 (p-value = 0.1041). The performance drop compared to this baseline is now quite high: LM-MF-Reg gives a drop of 0.13 for k = 1 and LM-MF a drop of 0.8 for k = 3. However, for k = 5, FB achieves relatively good performance: it is close to beating LambdaMART in a significant manner and significantly beats its weighted-NDCG variant. Note that our meta-mining problem constitutes a small dataset, so it is rather difficult to beat the FB baseline.
In addition, recall that this recommendation task is the most difficult of the three; we already saw in Chapter 5, Section 5.5.2, that the metric-learning approach was not able to beat the default baseline under the mae evaluation measure.
Figure 7.1: Heatmap of the µ1 and µ2 parameter distribution at the different truncation levels k = 1, 3, 5 of NDCG@k from the User Cold Start meta-mining experiments. On the y-axis we have the µ1 parameter and on the x-axis the µ2 parameter, both over the range [0.1, 1, 5, 7, 10]. Each parameter was validated with three-fold inner cross-validation to find the best value.
Figure 7.2: Heatmap of the µ1 and µ2 parameter distribution at the different truncation levels k = 1, 3, 5 of NDCG@k from the Full Cold Start meta-mining experiments. The figure layout is as in Figure 7.1.
From the results, the overall best method is the regularized variant, LM-MF-Reg. In order to study the effect of regularization in more detail, we give in Figures 7.1 and 7.2 the frequency with which the two input and output space regularization parameters, µ1 and µ2, are set to the different possible values over the 65 repetitions of the leave-one-dataset-out for the two cold start settings. The figures show the heatmaps of these pairwise frequencies; on the y-axis we have the µ1 parameter and on the x-axis the µ2 parameter. A yellow-to-white cell denotes an upper-mid to high frequency, whereas an orange-to-red cell denotes a lower-mid to low frequency.
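For reference, such a pairwise-frequency heatmap can be produced as in the sketch below, where `selected_pairs` stands for the 65 (µ1, µ2) pairs chosen by the inner cross-validation (names and colormap are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

GRID = [0.1, 1, 5, 7, 10]

def plot_frequency_heatmap(selected_pairs):
    """Count how often each (mu1, mu2) pair is selected over the 65
    leave-one-dataset-out repetitions and show it as a heatmap."""
    freq = np.zeros((len(GRID), len(GRID)))
    for mu1, mu2 in selected_pairs:
        freq[GRID.index(mu1), GRID.index(mu2)] += 1
    plt.imshow(freq, cmap="hot", origin="lower")  # red = low, white = high
    plt.xticks(range(len(GRID)), GRID)
    plt.yticks(range(len(GRID)), GRID)
    plt.xlabel("mu2 (output space)")
    plt.ylabel("mu1 (input space)")
    plt.colorbar(label="selection frequency")
    plt.show()
```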
Concerning the User Cold Start experiments, Figure 7.1, the heatmaps show that for k = 1 input space regularization is more important than output space regularization; see the white column of (0.1, ·) cells. This makes sense, since the latter regularizer uses only the top-one user or item preference to compute the pairwise similarities; as such, it does not carry much information with which to regularize the latent factors appropriately. For k = 3 the distribution spreads more over the different pairs of values, which means that we have more combinations of the two regularizers. On the other hand, the most frequent pair is (0.1, 0.1), which means almost no regularization. Then for k = 5 we have two peaks, one at (0.1, 10) and another at (7, 1). Here there is no clear advantage for one or the other regularizer, and both seem important.
Looking now at the Full Cold Start experiments, Figure 7.2, we have a quite different picture than before. For k = 1 there is, as before, an advantage for the input space regularizer, where the (0.1, 5) cell is the most frequent one. However, it also appears that we need to regularize the latent factors more in this cold start scenario; see the peak in the (10, 10) cell. This is confirmed for k = 3, where the distribution now spans well over the different cells with several peaks, which means that the two regularizers are combined together well. For k = 5 we now have three peaks, at (5, 1), (7, 0.1) and (10, 5), which shows that the output space regularizer is more important than the input space regularizer.
Overall, in the Full Cold Start setting the regularization parameters have more importance, i.e. the sum of their different values is higher, than in the User Cold Start setting. This is logical, since in the former scenario we have to predict the preference of a new user-item pair, whose latent profiles have to be regularized appropriately to avoid performance degradation.