Supporting Information
Zhou et al. 10.1073/pnas.1000488107
[Fig. S1 panels: h(20), e_P(20), I(20), and r plotted against λ ∈ [0, 1] for Eq. 6, W′, and W″.]
Fig. S1. Elegant hybrids of the HeatS and ProbS algorithms can be created in several ways besides that given in Eq. 6 of the paper. For example, $W'_{\alpha\beta} = \frac{1}{(1-\lambda)k_\alpha + \lambda k_\beta} \sum_{j=1}^{u} \frac{a_{\alpha j} a_{\beta j}}{k_j}$, or $W''_{\alpha\beta} = \left(\frac{1-\lambda}{k_\alpha} + \frac{\lambda}{k_\beta}\right) \sum_{j=1}^{u} \frac{a_{\alpha j} a_{\beta j}}{k_j}$. While $W'_{\alpha\beta}$ performs well only with respect to I(20), Eq. 6 and $W''_{\alpha\beta}$ both have their advantages. However, Eq. 6 is somewhat easier to tune to different requirements, since it varies more slowly and smoothly with λ. The results shown here are for the RateYourMusic dataset.
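The three hybrid weight matrices differ only in how the object degrees $k_\alpha$ and $k_\beta$ enter the normalization. A minimal NumPy sketch, using a hypothetical toy object-user adjacency matrix (the variable names and data are illustrative, not from the paper):

```python
import numpy as np

# Toy object-user adjacency matrix: a[alpha, j] = 1 if user j collected
# object alpha. Three objects, four users (hypothetical data).
a = np.array([[1, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 1, 1]], dtype=float)
k_obj = a.sum(axis=1)   # object degrees k_alpha
k_usr = a.sum(axis=0)   # user degrees k_j

lam = 0.5  # hybridization parameter lambda

# Common core shared by all three hybrids:
# C[alpha, beta] = sum_j a[alpha, j] * a[beta, j] / k_j
C = (a / k_usr) @ a.T

# Eq. 6: divide by the product k_alpha^(1-lambda) * k_beta^lambda
W_eq6 = C / (k_obj[:, None] ** (1 - lam) * k_obj[None, :] ** lam)

# W': divide by the weighted arithmetic mean of the two degrees
W_prime = C / ((1 - lam) * k_obj[:, None] + lam * k_obj[None, :])

# W'': weight the two reciprocal degrees linearly
W_dprime = ((1 - lam) / k_obj[:, None] + lam / k_obj[None, :]) * C
```

At λ = 0 all three reduce to HeatS (each row of C divided by $k_\alpha$), and at λ = 1 to ProbS (division by $k_\beta$); they differ only at intermediate λ.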
[Fig. S2 panels: P(L) (blue) and R(L) (red) against λ for L = 10, 20, 50, on Delicious and Netflix.]
Fig. S2. Precision P(L) and recall R(L) provide complementary but contrasting measures of accuracy: the former considers what proportion of selected objects (in our case, objects in the top L places of the recommendation list) are relevant, while the latter measures what proportion of relevant objects (deleted links) are selected. Consequently, recall (red) grows with L, whereas precision (blue) decreases. Here we compare precision and recall for the HeatS + ProbS hybrid algorithm on the Delicious and Netflix datasets. While quantitatively different, the qualitative performance is very similar for both measures.
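The two measures can be computed directly from a user's ranked recommendation list and that user's set of deleted (probe) links. A minimal sketch, with hypothetical object IDs:

```python
def precision_recall_at_L(ranked, deleted, L):
    """P(L): fraction of the top-L recommended objects that are deleted links.
       R(L): fraction of the user's deleted links appearing in the top L."""
    top = set(ranked[:L])
    hits = len(top & set(deleted))
    precision = hits / L
    recall = hits / len(deleted) if deleted else 0.0
    return precision, recall

# Hypothetical recommendation list and probe set for one user.
ranked = [7, 3, 9, 1, 4, 8, 2, 6]
deleted = [3, 4, 5]
p, r = precision_recall_at_L(ranked, deleted, 5)
# Top 5 = {7, 3, 9, 1, 4}; two of them (3 and 4) are deleted links,
# so P(5) = 2/5 = 0.4 and R(5) = 2/3.
```

The opposite trends in L follow immediately: the denominator of P(L) is L itself, while the denominator of R(L) is fixed by the probe set.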
[Fig. S3 panels: e_P(L) and e_R(L) against λ for L = 10, 20, 50, on Delicious and Netflix.]
Fig. S3. A more elegant comparison can be obtained by considering precision and recall enhancement, that is, their values relative to those of randomly sorted recommendations: $e_P(L) = P(L) \cdot ou/D$ and $e_R(L) = R(L) \cdot o/L$ (Eqs. 7a and 7b in the paper). Again, qualitative performance is close, and both of these measures decrease with increasing L, reflecting the inherent difficulty of improving either measure given a long recommendation list.
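Reading Eqs. 7a and 7b as normalization by the random baselines (random precision ≈ D/(ou), random recall ≈ L/o), the enhancements are one-line rescalings. A sketch with hypothetical dataset sizes:

```python
def precision_enhancement(P_L, o, u, D):
    # A random pick is a deleted link with probability ~ D / (o * u),
    # so e_P(L) = P(L) * o * u / D  (Eq. 7a, as read from the caption).
    return P_L * o * u / D

def recall_enhancement(R_L, o, L):
    # A random length-L list recovers a fraction ~ L / o of deleted links,
    # so e_R(L) = R(L) * o / L  (Eq. 7b).
    return R_L * o / L

# Hypothetical sizes: o = 1000 objects, u = 500 users, D = 2000 deleted links.
eP = precision_enhancement(0.04, 1000, 500, 2000)  # -> 10.0
eR = recall_enhancement(0.05, 1000, 20)            # -> 2.5
```

The 1/L factor in e_R(L) explains why both enhancements shrink as the list grows: a longer random list already recovers more links by chance.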
[Fig. S4 panels: h(20) and I(20) against λ, averaged over probe users only vs. all users.]
Fig. S4. Comparison of the diversity-related metrics h(20) and I(20) when two different averaging procedures are used: averaging only over users with at least one deleted link (as displayed in the paper) and averaging over all users. The different procedures do not alter the results qualitatively and make little quantitative difference. The results shown are for the RateYourMusic dataset.
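The two averaging procedures differ only in which users enter the mean. A small sketch with hypothetical per-user h(20) values (the numbers are illustrative only):

```python
def mean_metric(values, has_probe, over_all_users=False):
    """Average a per-user metric either over all users or only over
    users with at least one deleted (probe) link."""
    pool = values if over_all_users else [
        v for v, p in zip(values, has_probe) if p
    ]
    return sum(pool) / len(pool)

# Hypothetical per-user h(20) values; the last user has no probe links.
h20 = [0.91, 0.88, 0.95, 0.90]
has_probe = [True, True, True, False]
probe_only = mean_metric(h20, has_probe)             # mean of first three
all_users = mean_metric(h20, has_probe, over_all_users=True)
```

Unlike the accuracy metrics, h(20) and I(20) are defined for every user, so including probe-free users is legitimate and, as the figure shows, changes little.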
[Fig. S5 panels: e_P(L), h(L), and I(L) against λ for Netflix, RYM, and Delicious.]
Fig. S5. Comparison of performance metrics for different lengths L of recommendation lists: L = 10 (red), L = 20 (green), and L = 50 (blue). Strong quantitative differences are observed for precision enhancement e_P(L) and personalization h(L), but their qualitative behavior remains unchanged. Much smaller differences are observed for surprisal I(L).
[Fig. S6 panels: e_P(20) (log scale), h(20), I(20), and r against λ for the original probe and for the degree < 100 and degree < 200 probes.]
Fig. S6. Our accuracy-based metrics all measure in one way or another the recovery of links deleted from the dataset. Purely random deletion will inevitably favor high-degree (popular) objects, with their greater proportion of links, and consequently methods that favor popular items will appear to provide higher accuracy. To study this effect, we created two special probe sets consisting of links only to objects whose degree was less than some threshold (either 100 or 200): links to these objects were deleted with probability 0.5, while links to higher-degree objects were left untouched. The result is a general decrease in accuracy for all algorithms—unsurprisingly, since rarer links are inherently harder to recover—but also a reversal of performance, with the low-degree-favoring HeatS now providing much higher accuracy than the high-degree-oriented ProbS, USim, and GRank. The results shown here are for the Netflix dataset.
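The degree-limited probe construction described above can be sketched directly: links to low-degree objects are moved to the probe set with probability 0.5, all other links stay in the training set. Function and variable names are illustrative:

```python
import random

def make_degree_limited_probe(links, obj_degree, threshold,
                              p_delete=0.5, seed=0):
    """Split (user, object) links into a reduced training set and a probe.
    Links to objects with degree below the threshold are deleted (moved to
    the probe) with probability p_delete; other links are left untouched."""
    rng = random.Random(seed)
    probe, training = [], []
    for (user, obj) in links:
        if obj_degree[obj] < threshold and rng.random() < p_delete:
            probe.append((user, obj))
        else:
            training.append((user, obj))
    return training, probe

# Hypothetical data: object 0 is rare (degree 50), object 1 popular (300).
obj_degree = {0: 50, 1: 300}
links = [(u, 0) for u in range(10)] + [(u, 1) for u in range(10)]
training, probe = make_degree_limited_probe(links, obj_degree, threshold=100)
# All links to the popular object survive; roughly half of the links
# to the rare object land in the probe.
```

Because every probe link now points to a rare object, an algorithm that concentrates on popular items can no longer score well by construction.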
[Fig. S7 panels: e_P(20), h(20), I(20), and r against λ for ProbS, HeatS + ProbS, HeatS + USim, HeatS + GRank, and ProbS + GRank.]
Fig. S7. In addition to HeatS + ProbS, various other hybrids were created and tested using the method of Eq. 5 in the paper, where for hybrid X + Y, λ = 0 corresponds to pure X and λ = 1 to pure Y. The results shown here are for the Netflix dataset. The HeatS + USim hybrid offers similar but weaker performance compared to HeatS + ProbS; combinations of GRank with other methods produce significant improvements in r, the recovery of deleted links, but show little or no improvement in precision enhancement e_P(L) and poor results in the diversity-related metrics. We can conclude that the proposed HeatS + ProbS hybrid is not only computationally convenient but also performs better than combinations of the other methods studied.
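Eq. 5 is not restated in this SI; assuming it denotes a weighted combination of two methods' object-object weight matrices (an assumption made only for illustration), a generic X + Y hybrid is a one-liner:

```python
import numpy as np

def linear_hybrid(W_X, W_Y, lam):
    # Assumed reading of Eq. 5: interpolate between the two methods'
    # weight matrices, so lam = 0 gives pure X and lam = 1 gives pure Y.
    return (1 - lam) * W_X + lam * W_Y

# Hypothetical 2x2 weight matrices for two methods.
W_X = np.array([[0.5, 0.1], [0.2, 0.4]])
W_Y = np.array([[0.3, 0.3], [0.1, 0.6]])
W_mix = linear_hybrid(W_X, W_Y, 0.25)
```

Any pair of the studied methods (HeatS, ProbS, USim, GRank) can be slotted in for W_X and W_Y, which is how the five curves in the figure were produced.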