7.6.3 Meta-mining

As described in Section 4.5.2, the preference score associated with each dataset-workflow pair is derived from a number of significance tests on the performances of the different workflows applied to the given dataset. More concretely, if a workflow is significantly better than another one on the given dataset it gets one point, if there is no significant difference between the two workflows then each gets half a point, and if it is significantly worse it gets zero points. Thus, if a workflow outperforms all other workflows in a statistically significant manner, it gets m − 1 points, where m is the total number of workflows (here 35). In the Matrix Completion and the Full Cold Start settings, we take care to compute the preference scores with respect to the training workflows only, since in these two scenarios the total number of workflows is less than 35. In addition, we have rescaled the preference scores from the 0–34 interval to the 0–5 interval to avoid large preference scores overwhelming the NDCG, due to the exponential nature of the latter with respect to the preference score. That is, we have transformed our meta-mining problem into a five-star recommendation problem.
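The scoring scheme can be summarized with a short sketch. The snippet below is illustrative only: significantly_better stands in for the pairwise significance test on per-dataset performance estimates described in Section 4.5.2, and all names are placeholders rather than the actual implementation.

```python
import numpy as np

def preference_scores(perfs, significantly_better):
    """Preference score of each workflow on one dataset.

    perfs: one entry per workflow, e.g. the per-fold performance estimates.
    significantly_better(a, b): hypothetical predicate, True if workflow a
        beats workflow b in a statistically significant way (the actual test
        is described in Section 4.5.2).
    """
    m = len(perfs)
    scores = np.zeros(m)
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            if significantly_better(perfs[i], perfs[j]):
                scores[i] += 1.0    # significant win: one point
            elif not significantly_better(perfs[j], perfs[i]):
                scores[i] += 0.5    # no significant difference: half a point
            # significant loss: zero points
    # rescale from [0, m-1] (here [0, 34]) to the five-star [0, 5] scale
    return 5.0 * scores / (m - 1)
```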

We have 113 dataset characteristics that give statistical, information-theoretic, geometrical and topological, landmarking and model-based descriptors of the datasets, and 214 workflow characteristics derived from a propositionalization of a set of 214 tree-structured generalized workflow patterns extracted from the ground specifications of the DM workflows; see Section 4.5.2 for more details.
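As a rough illustration of the propositionalization step, each workflow can be mapped to a binary vector recording which generalized patterns its ground specification supports; the pattern_matches predicate below is purely hypothetical and stands in for the pattern-matching procedure of Section 4.5.2.

```python
import numpy as np

def workflow_features(workflows, patterns, pattern_matches):
    """Binary propositional descriptors: one column per generalized pattern.

    pattern_matches(w, p): hypothetical predicate, True if pattern p occurs
    in the ground specification of workflow w.
    """
    Z = np.zeros((len(workflows), len(patterns)))
    for i, w in enumerate(workflows):
        for j, p in enumerate(patterns):
            Z[i, j] = 1.0 if pattern_matches(w, p) else 0.0
    return Z  # here: a 35 x 214 workflow characteristics matrix
```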

Evaluation Setting

We fix the parameters of the LambdaMART baseline for the construction of the regression trees to the following values: the maximum number of nodes for the regression trees is three, the maximum percentage of instances in each leaf node is 10%, and the learning rate, η, of the gradient boosted tree algorithm is 10⁻². To build the ensemble trees of LM-MF and LM-MF-Reg we use the same parameter settings as in LambdaMART. We optimize their input and output regularization parameters µ1 and µ2 over the grid {0.1, 1, 5, 7, 10}² by three-fold inner cross-validation. We fix the number of nearest neighbors for the Laplacian matrices to five. To compute the output-space similarities, we use the NDCG similarity measure defined in Eq. (7.21), where the truncation level k is set to the truncation level at which we report the results each time.
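For reference, a minimal sketch of a truncated NDCG in the standard exponential-gain form, which is consistent with the exponential dependence on the preference score mentioned earlier; the exact definition used in the thesis is given by Eq. (7.21), so this is only an approximation of it.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain with exponential gain: (2^rel - 1) / log2(rank + 1)."""
    rel = np.asarray(relevances, dtype=float)[:k]
    ranks = np.arange(1, rel.size + 1)
    return np.sum((2.0 ** rel - 1.0) / np.log2(ranks + 1))

def ndcg_at_k(true_scores, predicted_scores, k=5):
    """NDCG@k of a predicted ranking against the true preference scores."""
    order = np.argsort(predicted_scores)[::-1]   # ranking induced by the predictions
    ideal = np.sort(true_scores)[::-1]           # ideal (perfect) ranking
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(np.asarray(true_scores)[order], k) / idcg if idcg > 0 else 0.0
```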

To build the different recommendation scenarios, we proceed as follows. In Matrix Completion, we randomly select N workflows for each dataset to build the training set. We choose N in {5, 10, 15}, which gives 86%, 71% and 57% of missing values in the target matrix respectively. From these missing values, we keep ten workflows per dataset for validation and use the rest for testing. In this task the numbers of users and items are fixed, so we fix the number of hidden factors r to min(n, m) = 35 for the three matrix factorization algorithms, LM-MF, LM-MF-Reg and CofiRank. For the CofiRank baseline we use the default parameters as provided in Weimer et al. (2007). We report the average testing NDCG@5 measure computed on the test workflows of each dataset.
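A minimal sketch of how such a Matrix Completion split can be generated, assuming the preference scores are stored in a 65 × 35 matrix Y; the function and variable names are illustrative, not the actual experimental code.

```python
import numpy as np

def matrix_completion_split(Y, n_train=5, n_val=10, seed=0):
    """Per dataset: keep n_train observed workflows for training,
    n_val of the remaining ones for validation, and the rest for testing."""
    rng = np.random.default_rng(seed)
    n_datasets, n_workflows = Y.shape          # here 65 x 35
    train = np.zeros_like(Y, dtype=bool)
    val = np.zeros_like(Y, dtype=bool)
    test = np.zeros_like(Y, dtype=bool)
    for d in range(n_datasets):
        perm = rng.permutation(n_workflows)
        train[d, perm[:n_train]] = True
        val[d, perm[n_train:n_train + n_val]] = True
        test[d, perm[n_train + n_val:]] = True
    return train, val, test

# n_train = 5, 10, 15 leaves 30/35 = 86%, 25/35 = 71%, 20/35 = 57% missing
```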

In the User Cold Start scenario, we evaluate the performance of the system in providing accurate recommendations for new datasets. To do so we perform leave-one-dataset-out: we train the different methods on 64 datasets and evaluate their recommendations on the left-out dataset. Since there are no missing workflows, we fix the number of hidden factors r to min(n, m) = 35.

In the Full Cold Start scenario, in addition to performing leave-one-dataset-out, we also randomly partition the workflows: we use 70% for training and the other 30% to define the test workflows for the left-out dataset. That is, the number of workflows in the training set is ⌊0.7 × 35⌋ = 24, which also defines the number of hidden factors r, and the 11 remaining workflows form the test set. For the two cold start scenarios, we report the average testing NDCG measure, computed at the truncation levels k = 1, 3, 5.
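The two cold start splits can be sketched as follows; the generator below is illustrative, with the Full Cold Start case additionally holding out a random 30% of the workflows that are only used to evaluate the left-out dataset.

```python
import numpy as np

def cold_start_splits(n_datasets=65, n_workflows=35, full=False,
                      train_frac=0.7, seed=0):
    """Leave-one-dataset-out splits for the two cold start scenarios.

    User Cold Start (full=False): all workflows are available for training.
    Full Cold Start (full=True): a random 30% of the workflows is held out
    and only used as test workflows for the left-out dataset.
    """
    rng = np.random.default_rng(seed)
    for test_dataset in range(n_datasets):
        train_datasets = [d for d in range(n_datasets) if d != test_dataset]
        if full:
            perm = rng.permutation(n_workflows)
            n_train_wf = int(np.floor(train_frac * n_workflows))  # 24 here
            train_wf, test_wf = perm[:n_train_wf], perm[n_train_wf:]
        else:
            train_wf = np.arange(n_workflows)
            test_wf = np.arange(n_workflows)
        yield train_datasets, test_dataset, train_wf, test_wf
```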

For each method applied on a given recommendation task, we report the number of times we have a performance win or loss compared to the performance of the baselines. On these win/loss pairs we perform McNemar's test of statistical significance and report its results; we set the significance level at p = 0.05. For Matrix Completion, we collectively denote the results of these performance comparisons by δCR and δLM, and give the complete results in Table C.1. For User Cold Start, these are denoted by δUB and δLM, the results of which are given in Table C.2. For Full Cold Start, these are denoted by δFB and δLM, the results of which are given in Table C.3.
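A minimal sketch of the significance test on the win/loss counts, here using the continuity-corrected McNemar statistic; an exact binomial variant would be equally valid, and the counts in the usage line are purely illustrative.

```python
from scipy.stats import chi2

def mcnemar_pvalue(wins, losses):
    """Continuity-corrected McNemar test on the discordant win/loss pairs
    of one method against a baseline; ties are ignored."""
    if wins + losses == 0:
        return 1.0
    stat = (abs(wins - losses) - 1) ** 2 / (wins + losses)
    return chi2.sf(stat, df=1)

# a difference is declared significant if the p-value falls below 0.05
p = mcnemar_pvalue(wins=45, losses=20)  # illustrative counts
```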

Results

We now discuss the results for the three recommendation scenarios. For Matrix Completion, the first observation we can make looking at Table C.1 is that CofiRank is the method which achieves the lowest performance for all values of N. The other methods have similar results and significantly beat CofiRank for N = 10, with the exception of LM-MF-Reg, which is only close to significantly beating this baseline, p-value = 0.0824. In addition, LM-MF and LM-MF-Reg achieve the best results for N = 5, where the regularized variant significantly beats CofiRank and LM-MF is close to beating this baseline in a significant manner, p-value = 0.0824. Thus, since these methods use side information to constrain the learned model, they are able to gain a substantial performance improvement over CofiRank, especially when the target matrix is very sparse.

Looking now at the User Cold Start setting, Table C.2, we can see that the two methods that achieve the best performances for all three values of k are LM-MF and LM-MF-Reg. For k = 1 and 3, these two methods significantly beat the UB baseline, with an NDCG performance gap of 0.12 and 0.7 respectively. They are also close to beating LambdaMART in a significant manner, with p-value = 0.0636 for k = 1, and p-value = 0.0714 for LM-MF and 0.1056 for LM-MF-Reg for k = 3. For k = 5, they also outperform the two baselines on a majority of datasets, but not in a significant manner. This is consistent with our findings (Chapter 6, Section 6.6.5), where the meta-learning approach Eucl has the highest correlation gain with k = 5, see Figure 6.4a. Finally, the LambdaMART variant that makes use of the weighted NDCG, LMW, does not seem to bring an improvement over the plain vanilla NDCG, LM.

For the Full Cold Start setting, Table C.3, the two methods LM-MF and LM-MF-Reg now achieve very good performances. Both significantly beat LambdaMART for all values of the truncation level k, with a performance gap of 0.7 on average. The two methods are also able to significantly beat FB for k = 1, 3, with the exception of the regularized variant, which is only close to significance for k = 3, p-value = 0.1041. The performance gap over this baseline is now quite large: LM-MF-Reg leads by 0.13 for k = 1 and LM-MF by 0.8 for k = 3. However, for k = 5, FB achieves relatively good performance: it is close to beating LambdaMART in a significant manner and significantly beats its weighted NDCG variant. Note that our meta-mining problem is a small dataset, so it is rather difficult to beat the FB baseline.

In addition, recall that this recommendation task is the most difficult one among the three: we already saw in Chapter 5, Section 5.5.2, that the metric-learning approach was not able to beat the default baseline under the mae evaluation measure.


Figure 7.1: Heatmap of the µ1 and µ2 parameter distribution at the different truncation levels k = 1, 3, 5 of NDCG@k from the User Cold Start meta-mining experiments. On the y-axis we have the µ1 parameter and on the x-axis the µ2 parameter. We validated each parameter with three-fold inner cross-validation to find the best value in {0.1, 1, 5, 7, 10}.


Figure 7.2: Heatmap of the µ1 and µ2 parameter distribution at the different truncation levels k = 1, 3, 5 of NDCG@k from the Full Cold Start meta-mining experiments. The figure layout is as in Figure 7.1.

Analysis

From the results, the overall best method is the regularized variant, LM-MF-Reg. In order to study the effect of regularization in more detail, we give in Figures 7.1 and 7.2 the frequency with which the two input and output space regularization parameters, µ1 and µ2, are set to the different possible values over the 65 repetitions of the leave-one-dataset-out for the two cold start settings. The figures show the heatmap of these pairwise frequencies; on the y-axis we have the µ1 parameter and on the x-axis the µ2 parameter. A yellow to white cell indicates an upper-mid to high frequency, whereas an orange to red cell indicates a lower-mid to low frequency.
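Heatmaps of this kind can be produced from the selected (µ1, µ2) pairs with a short sketch along the following lines; matplotlib, the "hot" colormap (red for low counts, yellow to white for high counts) and all names are assumptions, not the code that generated Figures 7.1 and 7.2.

```python
import numpy as np
import matplotlib.pyplot as plt

GRID = [0.1, 1, 5, 7, 10]

def plot_mu_heatmap(selected_pairs, title=""):
    """selected_pairs: list of (mu1, mu2) values chosen by the inner
    cross-validation, one pair per leave-one-dataset-out repetition."""
    counts = np.zeros((len(GRID), len(GRID)))
    for mu1, mu2 in selected_pairs:
        counts[GRID.index(mu1), GRID.index(mu2)] += 1
    plt.imshow(counts, cmap="hot", origin="lower")
    plt.xticks(range(len(GRID)), GRID)
    plt.yticks(range(len(GRID)), GRID)
    plt.xlabel("$\\mu_2$ (output space)")
    plt.ylabel("$\\mu_1$ (input space)")
    plt.title(title)
    plt.colorbar(label="frequency")
    plt.show()
```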

In what concerns the User Cold Start experiments, Figure 7.1, the heatmaps show that for k = 1 input space regularization is more important than output space regularization, see the white column in the (0.1, ·) cells. This makes sense since the latter regularizer uses only the top-one user or item preference to compute the pairwise similarities; as such, it does not carry much information to regularize the latent factors appropriately. For k = 3 the distribution spreads more over the different pairs of values, which means that we have more combinations of the two regularizers. On the other hand, the most frequent pair is (0.1, 0.1), which means almost no regularization. Then for k = 5 we have two peaks, one at (0.1, 10) and another at (7, 1). Here there is no clear advantage for one or the other regularizer and both seem important.

Looking now at the Full Cold Start experiments, Figure 7.2, we have quite a different picture than before. For k = 1 there is, as before, an advantage for the input space regularizer, the (0.1, 5) cell being the most frequent one. However, it also appears that we need to regularize the latent factors more in this cold start scenario, see the peak in the (10, 10) cell. This is confirmed for k = 3, where the distribution now spans well over the different cells with several peaks, which means that the two regularizers are well combined together. For k = 5 we now have three peaks, at (5, 1), (7, 0.1) and (10, 5), which shows that the output space regularizer is more important than the input space regularizer.

Overall, in the Full Cold Start setting the regularization parameters have more importance, i.e. the sum of their selected values is higher, than in the User Cold Start setting. This is logical since in the former scenario we have to predict the preference of a new user-item pair, the latent profiles of which have to be regularized appropriately to avoid performance degradation.