HAL Id: hal-01256033
https://hal.inria.fr/hal-01256033
Submitted on 14 Jan 2016
Bandits and Recommender Systems
Jérémie Mary, Romaric Gaudel, Philippe Preux
To cite this version:
Jérémie Mary, Romaric Gaudel, Philippe Preux. Bandits and Recommender Systems. First International Workshop on Machine Learning, Optimization, and Big Data (MOD'15), Jul 2015, Taormina, Italy. pp. 325-336, 10.1007/978-3-319-27926-8_29. hal-01256033
Jérémie Mary, Romaric Gaudel, and Philippe Preux
Université de Lille, CRIStAL (UMR CNRS), Villeneuve d'Ascq, France
{jeremie.mary,romaric.gaudel,philippe.preux}@univ-lille3.fr
Abstract. This paper addresses the on-line recommendation problem facing new users and new items; we assume that no information is available about either the users or the items. The only source of information is a set of ratings given by users to some items. By on-line, we mean that the set of users, the set of items, and the set of ratings evolve along time, and that at any moment the recommendation system has to select items to recommend based on the currently available information, that is basically the sequence of past events. We also mean that each user comes with her preferences, which may evolve along short and longer time scales, so we have to continuously update these preferences. When the set of ratings is the only available source of information, the traditional approach is matrix factorization. In a decision-making-under-uncertainty setting, actions should be selected to balance exploration with exploitation; this is best modeled as a bandit problem.
Matrix factors provide a latent representation of users and items. These representations may then be used as contextual information by the bandit algorithm to select items. This last point is exactly the originality of this paper: the combination of matrix factorization and bandit algorithms to solve the on-line recommendation problem. Our work is driven by considering the recommendation problem as a feedback-controlled loop. This leads to interactions between the representation learning and the recommendation policy.
1 Introduction
We consider the online version of the problem of recommending items to users as faced by websites. Items may be ads, news, music, videos, movies, books, diapers, ... Being live, these systems have to cope with users about whom we have no information, and with new items introduced in the catalog whose attractiveness is unknown. The appetence of new users towards available items, and the appeal of new items towards existing users, have to be estimated as fast as possible.
Currently, this situation is handled thanks to side information available on the users and on the items (see [2,21]). In this paper, we consider this problem from a different perspective. Though perfectly aware of the potential utility of side information, we consider the problem without any side information, focusing only on estimating the appetences of new users and the appeal of new items as fast as possible; the use of side information can be mixed with the ideas presented in this paper. Side information being unavailable, we learn a latent representation of each user and each item using the currently available ratings. As already argued by others (e.g. [16]), this problem fits perfectly into the sequential decision making framework, and more specifically, the bandit setting [20,10,9]. A sequential decision making problem under uncertainty faces an exploration vs. exploitation dilemma: exploration is meant to acquire information in order to perform better subsequently by exploiting it; collecting this information has a cost that can not be merely zeroed, or simply left as an unimportant matter.
However, in rather sharp contrast with the traditional bandit setting, here the set of bandits is constantly being renewed; the number of bandits is not small, though not huge either (from a few dozen to hundreds of arms in general, up to dozens of millions in some applications): this makes the problem very different from the 2-armed bandit problem. We look for efficient and effective ways to address this task, since we want the proposed solution to be able to cope with real applications on the web. For obvious practical and economical reasons, the strategy can not merely consist in repeatedly presenting all available items to users until their appetences seem accurately estimated. We have to consider the problem as an exploration vs. exploitation problem in which exploration is a necessary evil to acquire information and eventually improve the performance of the recommendation system (RS for short). To summarize, we learn a latent representation of each user and each item, from which a recommendation policy is deduced, based on the available ratings. This learning process is continuous: the representation and the recommendation policy are updated regularly, as new ratings are observed, new items are introduced into the set of items, new users flow in, and the preferences of already observed users change.
This being said, the question of the objective function to optimize arises. Since the Netflix challenge, at least in the machine learning community, the recommendation problem is often reduced to a matrix factorization problem, performed in batch, learning on a training set, and minimizing the root mean squared error (RMSE) on a testing set. However, the RMSE comes with heavy flaws. Other objective functions have been considered to handle some of these flaws [7,19].
Based on these ideas, our contribution in this paper is the following:
– we propose an original way to handle new users and new items in recommendation systems: we cast this problem as a sequential decision making problem to be played online, which selects items to recommend in order to optimize the exploration/exploitation balance;
– our solution is then to perform the rating matrix factorization driven by the policy of this sequential decision problem, in order to focus on the most useful terms of the factorization. This is the core idea of the contributed algorithm, which we name BeWARE.
The reader familiar with the bandit framework can think of this work as a contextual bandit that learns side information for each user and each item from the observed ratings, assuming the existence of a latent space of dimension k for both users and items. We stress the fact that learning and updating the representation of users and items while recommendations are being made is something very different from the traditional batch matrix factorization approach, or from the traditional bandit setting.
We also introduce a methodology that uses a classical partially filled rating matrix to assess the online performance of a bandit-based recommendation algorithm.
After introducing our notations in the next section, Sec. 3 briefly presents the matrix factorization approach. Sec. 4 introduces the necessary background in bandit theory. In Sec. 5 and Sec. 6, we present BeWARE in the case of new users and of new items respectively. Sec. 7 provides an experimental study on artificial data, and on real data. Finally, we conclude and draw some future lines of work in Sec. 8.
2 Notations and Vocabulary
$U^T$ is the transpose of matrix $U$, and $U_i$ denotes its $i$-th row. For a vector $u$ and a set of integers $S$, $u_S$ is the sub-vector of $u$ composed of the elements of $u$ whose indices belong to $S$. Accordingly, $U$ being a matrix, $U_S$ is the sub-matrix made of the rows of $U$ whose indices belong to $S$. $\#u$ is the number of components (dimension) of $u$, and $\#S$ is the number of elements of $S$.
Now, we introduce a set of notations dedicated to the RS problem. As we consider a time-evolving number of users and items, we note $n$ the current number of users and $m$ the current number of items. These should be indexed by a $t$ to denote time, though often in this paper $t$ is dropped to simplify the notation. Without loss of generality, we assume $n < N$ and $m < M$, that is, $N$ and $M$ are the maximal numbers of ever-seen users and items (those figures may be as large as necessary). $R^*$ represents the ground truth, that is, the matrix of ratings: $r^*_{i,j}$ is the rating given by user $i$ to item $j$. We suppose that there exists an integer $k$ and two matrices $U$ of size $N \times k$ and $V$ of size $M \times k$ such that $R^* = U V^T$. We denote $S$ the set of elements that have been observed, and $R$ the matrix such that $r_{i,j} = r^*_{i,j} + \eta_{i,j}$ if $(i,j) \in S$, where $\eta_{i,j}$ is a noise with zero mean and finite variance; the $\eta_{i,j}$ are i.i.d. In this paper, we assume that $R^*$ is fixed over time; at a given moment, only a submatrix made of $n$ rows and $m$ columns is actually useful. The observed part of $R^*$ increases along time, that is, the set $S$ grows along time. $J(i)$ (resp. $I(j)$) denotes the set of items rated by user $i$ (resp. the set of users who rated item $j$). $\hat{U}$ and $\hat{V}$ denote estimates (in the statistical sense) of the matrices $U$ and $V$ respectively, and $\hat{U}\hat{V}^T$ is denoted $\hat{R}$. We use the term "observation" to mean a triplet $(i, j, r_{i,j})$; the RS receives a stream of observations. We use the term "rating" to mean the value associated by a user to an item. It can be a rating as in the Netflix challenge, or an information meaning click or not, sale or not, etc. For the sake of legibility, in the online setting we omit the $t$ subscript for time dependency: $S$, $\hat{U}$, $\hat{V}$, $n$, $m$ should all be subscripted with $t$.
3 Matrix Factorization
Since the Netflix challenge [4], many works in RS have been using matrix factorization: the matrix of observed ratings is assumed to be the product of two matrices of low rank $k$: $\hat{R} = \hat{U}\hat{V}^T$ [11]. $\hat{U}$ is a latent representation of users, while $\hat{V}$ is a latent representation of items. As most of the values of the rating matrix are unknown, the decomposition can only be done using the set of observations. The classical approach is to solve the regularized minimization problem $(\hat{U}, \hat{V}) \stackrel{\text{def}}{=} \operatorname{argmin}_{U,V} \zeta(U, V)$, where
$$\zeta(U, V) \stackrel{\text{def}}{=} \sum_{(i,j) \in S} \left( r_{i,j} - U_i \cdot V_j^T \right)^2 + \lambda \cdot \Omega(U, V),$$
in which $\lambda \in \mathbb{R}^+$ and $\Omega(U, V)$ is a regularization term. $\zeta$ is not convex. The minimization is usually performed either by stochastic gradient descent (SGD), or by alternating least squares (ALS). Solving for $\hat{U}$ and $\hat{V}$ at once being non-convex, ALS iterates; at each iteration, it alternates an optimization of $\hat{U}$ keeping $\hat{V}$ fixed, and an optimization of $\hat{V}$ keeping $\hat{U}$ fixed.
In this paper we consider ALS-WR [22], whose regularization term
$$\Omega(U, V) \stackrel{\text{def}}{=} \sum_i \#J(i) \|U_i\|^2 + \sum_j \#I(j) \|V_j\|^2$$
depends on the respective importance of users and items in the matrix of ratings. This regularization is known to have a good empirical behavior, that is, limited overfitting, easy tuning of $\lambda$ and $k$, and low RMSE.
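To make the alternating scheme concrete, below is a minimal numpy sketch of ALS-WR under the notations above. It is a sketch under our own simplifying assumptions (dense arrays, a boolean mask for $S$, a fixed iteration count), not the authors' implementation:

```python
import numpy as np

def als_wr(R, mask, k=10, lam=0.1, n_iters=20, seed=0):
    """ALS-WR sketch: R is the n x m rating matrix, mask[i, j] = True iff (i, j) in S."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    U = rng.normal(scale=0.1, size=(n, k))
    V = rng.normal(scale=0.1, size=(m, k))
    for _ in range(n_iters):
        # Optimize U with V fixed: one ridge regression per user,
        # with the ALS-WR weight #J(i) on the regularizer.
        for i in range(n):
            J = np.flatnonzero(mask[i])              # J(i): items rated by user i
            if J.size:
                Vj = V[J]
                A = Vj.T @ Vj + lam * J.size * np.eye(k)
                U[i] = np.linalg.solve(A, Vj.T @ R[i, J])
        # Optimize V with U fixed, symmetrically (weight #I(j)).
        for j in range(m):
            I = np.flatnonzero(mask[:, j])           # I(j): users who rated item j
            if I.size:
                Ui = U[I]
                B = Ui.T @ Ui + lam * I.size * np.eye(k)
                V[j] = np.linalg.solve(B, Ui.T @ R[I, j])
    return U, V
```

Each inner update solves the closed-form ridge system given by the quadratic loss, which is what makes ALS attractive compared to SGD when the per-user and per-item systems are small.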
4 Bandits
Let us consider a bandit machine with $m$ independent arms. When pulling arm $j$, the player receives a reward drawn from $[0, 1]$ which follows a probability distribution $\nu_j$. Let $\mu_j$ denote the mean of $\nu_j$, $j^* \stackrel{\text{def}}{=} \operatorname{argmax}_j \mu_j$ be the best arm, and $\mu^* \stackrel{\text{def}}{=} \max_j \mu_j = \mu_{j^*}$ be the best expected reward (we assume there is only one best arm). $\{\nu_j\}$, $\{\mu_j\}$, $j^*$ and $\mu^*$ are unknown.
A player aims at maximizing the sum of rewards collected along $T$ consecutive pulls. More specifically, denoting $j_t$ the arm pulled at time $t$ and $r_t$ the reward obtained at time $t$, the player wants to maximize the cumulative reward $\text{CumRew}_T = \sum_{t=1}^T r_t$. At each time-step but the last one, the player faces the dilemma:
– either exploit by pulling the arm which seems the best according to the estimated values of the parameters;
– or explore to improve the estimation of the parameters of the probability distribution of an arm by pulling it.
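A standard answer to this dilemma is the UCB1 index [3], which is used later in the experiments as the UCB.on.all.users baseline. A minimal sketch, with a function signature of our own choosing:

```python
import numpy as np

def ucb1_select(counts, sums, t):
    """UCB1: play each arm once, then pick the arm maximizing
    empirical mean + sqrt(2 ln t / counts)."""
    counts = np.asarray(counts, dtype=float)
    if np.any(counts == 0):
        return int(np.argmax(counts == 0))           # an arm never pulled yet
    means = np.asarray(sums, dtype=float) / counts
    bonus = np.sqrt(2.0 * np.log(t) / counts)
    return int(np.argmax(means + bonus))
```

After each pull, `counts` and `sums` of the played arm are updated with the observed reward.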
Li et al. [13] extend the bandit setting to contextual arms. They assume that a vector of real features $v \in \mathbb{R}^k$ is associated to each arm, and that the expectation of the reward associated to an arm is $u^* \cdot v$, where $u^*$ is an unknown vector. The algorithm handling this setting is known as LinUCB. LinUCB consists in playing the arm with the largest upper confidence bound on the expected reward:
$$j_t = \operatorname{argmax}_j \; \hat{u} \cdot v_j^T + \alpha \sqrt{v_j A^{-1} v_j^T},$$
where $\hat{u}$ is an estimate of $u^*$, $\alpha$ is a parameter, and $A = \sum_{t'=1}^{t-1} v_{j_{t'}} \cdot v_{j_{t'}}^T + Id$, where $Id$ is the identity matrix. Note that $\hat{u} \cdot v_j^T$ corresponds to an estimate of the expected reward, while $\sqrt{v_j A^{-1} v_j^T}$ is an optimistic correction of that estimate.
While the objective of LinUCB is to maximize the cumulative reward, theoretical results [13,1] are expressed in terms of cumulative regret (or regret for short) $\text{Regret}_T \stackrel{\text{def}}{=} \sum_{t=1}^T (r_t^* - r_t)$, where $r_t^* = \max_j u^* \cdot v_j^T$ stands for the best expected reward at time $t$. Hence, the regret measures how much the player loses (in expectation) in comparison to playing the optimal strategy. Standard results prove regrets of order $\tilde{O}(\sqrt{T})$ or $O(\ln T)$, depending on the assumptions on the distributions and on the precise analysis ($\tilde{O}$ means $O$ up to a logarithmic term in $T$).
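In code, the LinUCB choice above could be sketched as follows (the function name and the dense representation of the design matrix $A$ are ours):

```python
import numpy as np

def linucb_select(u_hat, A, arm_features, alpha):
    """LinUCB: estimated reward u_hat . v_j plus the optimistic width
    sqrt(v_j A^{-1} v_j^T), maximized over arms j."""
    A_inv = np.linalg.inv(A)
    means = arm_features @ u_hat                     # one estimate per arm
    widths = np.sqrt(np.einsum('jk,kl,jl->j', arm_features, A_inv, arm_features))
    return int(np.argmax(means + alpha * widths))
```

Here `arm_features` is the $m \times k$ matrix whose rows are the vectors $v_j$; the `einsum` computes the quadratic form $v_j A^{-1} v_j^T$ for all arms at once.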
Of course, LinUCB and other contextual bandit algorithms require the context (values of features) to be provided. In real applications this is done using side information about the items and the users [17], i.e. expert knowledge, categorization of items, Facebook profiles of users, implicit feedback, etc. The core idea of this paper is to use matrix factorization techniques to build this context online, using the known ratings. To this end, one assumes that the items and the users can be represented in the same space of dimension $k$, and that the rating of user $u$ for item $v$ is the scalar product of $u$ and $v$.
We study the introduction of new items and/or new users into the RS. This is done without using any side information on users or items.
5 BeWARE of a new user
Let us consider a particular recommendation scenario. At each time-step $t$:
1. a user $i_t$ requests a recommendation from the RS,
2. the RS selects an item $j_t$ among the set of items that have never been recommended to user $i_t$ beforehand,
3. user $i_t$ returns a rating $r_t = r_{i_t, j_t}$ for item $j_t$.
Obviously, the objective of the RS is to maximize the cumulative reward $\text{CumRew}_T = \sum_{t=1}^T r_t$. In the context of such a scenario, the usual matrix factorization approach of RS recommends the item $j_t$ which has the best predicted rating for user $i_t$. This corresponds to a pure exploitation (greedy) strategy, which is well-known to be suboptimal for optimizing $\text{CumRew}_T$: to be optimal, the RS has to balance exploitation and exploration.
Let us now describe the recommendation algorithm we propose at time-step $t$. We aim at recommending to user $i_t$ an item $j_t$ which leads to the best trade-off between exploration and exploitation in order to maximize $\text{CumRew}_\infty$. We assume that the matrix $R$ is factored into $\hat{U}\hat{V}^T$ by ALS-WR, which terminated by optimizing $\hat{U}$ holding $\hat{V}$ fixed. In such a context, the UCB approach is based on a confidence interval on the estimated ratings $\hat{r}_{i_t, j} = \hat{U}_{i_t} \cdot \hat{V}_j^T$ for any allowed item $j$.
We assume that we have already observed a sufficient number of ratings for each item, but only a few ratings (possibly none) from user $i_t$. As a consequence, the uncertainty on $\hat{U}_{i_t}$ is much larger than the uncertainty on any $\hat{V}_j$. In other words, the uncertainty on $\hat{r}_{i_t, j}$ mostly comes from the uncertainty on $\hat{U}_{i_t}$. Let us express this uncertainty.
Let $u^*$ denote the (unknown) true value of $U_{i_t}$ and let us introduce the $k \times k$ matrix
$$A \stackrel{\text{def}}{=} (\hat{V}_{J(i_t)})^T \cdot \hat{V}_{J(i_t)} + \lambda \cdot \#J(i_t) \cdot Id.$$
As $\hat{U}$ and $\hat{V}$ come from ALS-WR (whose last iteration optimized $\hat{U}$),
$$\hat{U}_{i_t} = A^{-1} (\hat{V}_{J(i_t)})^T R^T_{i_t, J(i_t)}.$$
Using Azuma's inequality over the weighted sum of random variables (as introduced by [18] for linear systems), it follows that there exists a value $C \in \mathbb{R}$ such that, with probability $1 - \delta$:
$$(\hat{U}_{i_t} - u^*) A^{-1} (\hat{U}_{i_t} - u^*)^T \leq \frac{C \log(1/\delta)}{t}.$$
This inequality defines the confidence bound around the estimate $\hat{U}_{i_t}$ of $u^*$. Therefore, a UCB strategy selects item $j_t$:
$$j_t \stackrel{\text{def}}{=} \operatorname{argmax}_{1 \leq j \leq m,\; j \notin J(i_t)} \; \hat{U}_{i_t} \cdot \hat{V}_j^T + \alpha \sqrt{\hat{V}_j A^{-1} \hat{V}_j^T},$$
where $\alpha \in \mathbb{R}$ is an exploration parameter to be tuned. Fig. 1(a) provides a graphical illustration of the link between this bound and the choice of item $j_t$.
Our algorithm, named BeWARE.User (where BeWARE stands for "Bandit WARms-up REcommenders"), is described in Alg. 1. The presentation is optimized for clarity rather than for computational efficiency. Of course, if the exploration parameter $\alpha$ is set to 0, BeWARE.User makes a greedy selection of the item to recommend. The estimation of the center of the ellipsoid and of its size can be influenced by the use of another regularization term. BeWARE.User uses a regularization based on ALS-WR. It is possible to replace all $\#J(.)$ by 1; this amounts to the standard regularization, and we call this slightly different algorithm BeWARE.ALS.User. In fact, one can use any regularization as long as $\hat{U}_{i_t}$ is a linear combination of observed rewards.
Figure 1. (a) Illustration of the upper confidence ellipsoid for item selection for the new user $i_t$ who enters the game at time $t$. Items and users are vectors in $\mathbb{R}^k$ (one may suppose that $k = 2$ in this figure to make it fit in the plane). Red dots represent items. The blue ellipse represents the confidence ellipsoid of the vector associated to the new user. The optimistic rating of the user for an item $j$ is the maximum dot product between $\hat{V}_j$ and any point in this ellipsoid. By a simple geometrical argument based on iso-contours of the dot product, this maximum value is equal to the dot product between $\hat{V}_j$ and $\tilde{u}^{(j)}_{i_t}$. Optimism leads to recommending the item maximizing the dot product $\langle \tilde{u}^{(j)}_{i_t}, \hat{V}_j \rangle$.
(b) Illustration of the upper confidence ellipsoid for item selection in the context of a set of new items. The setting is similar to the case of a new user, except that the vector associated to the user is known (represented by a blue dot) while each item now has its own confidence ellipsoid. The optimistic RS recommends the item maximizing the scalar product $\langle \hat{U}_{i_t}, \tilde{v}^{(j)} \rangle$.
Algorithm 1 BeWARE.User: for a user $i_t$, recommends an item to this user.
Input: $i_t$, $\lambda$, $\alpha$
Input/Output: $R$, $S$
1: $(\hat{U}, \hat{V}) \leftarrow$ MatrixFactorization($R$)
2: $A \leftarrow (\hat{V}_{J(i_t)})^T \cdot \hat{V}_{J(i_t)} + \lambda \cdot \#J(i_t) \cdot Id$
3: $j_t \leftarrow \operatorname{argmax}_{j \notin J(i_t)} \hat{U}_{i_t} \cdot \hat{V}_j^T + \alpha \sqrt{\hat{V}_j A^{-1} \hat{V}_j^T}$
4: Recommend item $j_t$ and receive rating $r_t = r_{i_t, j_t}$
5: Update $R$, $S$
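A direct numpy transcription of steps 2-3 of Alg. 1, reusing factors such as those returned by the `als_wr` sketch above, could look like this; the guard against a user with no rating yet is our own addition:

```python
import numpy as np

def beware_user_select(U_hat, V_hat, mask, i_t, lam, alpha):
    """Steps 2-3 of BeWARE.User: UCB over the new-user confidence ellipsoid."""
    k = V_hat.shape[1]
    rated = mask[i_t]                                # J(i_t) as a boolean mask
    V_J = V_hat[rated]
    n_rated = max(V_J.shape[0], 1)                   # guard: keep A invertible
    A = V_J.T @ V_J + lam * n_rated * np.eye(k)
    A_inv = np.linalg.inv(A)
    scores = V_hat @ U_hat[i_t] + alpha * np.sqrt(
        np.einsum('jk,kl,jl->j', V_hat, A_inv, V_hat))
    scores[rated] = -np.inf                          # never recommend twice
    return int(np.argmax(scores))
```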
6 BeWARE of new items
In general, a set of new items is introduced at once, not a single item. In this case, the uncertainty is larger on the items than on the users. We therefore compute a confidence bound around the items instead of the users, assuming ALS terminates by optimizing $\hat{V}$ keeping $\hat{U}$ fixed. With the same criterion and regularization on $\hat{V}$ as above, at time-step $t$:
$$\hat{V}_j = B(j)^{-1} (\hat{U}_{I(j)})^T R_{I(j), j}, \quad \text{with} \quad B(j) \stackrel{\text{def}}{=} (\hat{U}_{I(j)})^T \hat{U}_{I(j)} + \lambda \cdot \#I(j) \cdot Id.$$
So the upper confidence bound of the rating for user $i$ on item $j$ is:
$$\hat{U}_i \cdot \hat{V}_j^T + \alpha \sqrt{\hat{U}_i B(j)^{-1} \hat{U}_i^T}.$$
This leads to the algorithm BeWARE.Items presented in Alg. 2. Again, the presentation is optimized for clarity rather than for computational efficiency. BeWARE.Items can be parallelized and has the complexity of one step of ALS. Fig. 1(b) gives the geometrical intuition leading to BeWARE.Items. Again, setting $\alpha = 0$ leads to a greedy selection. The regularization (in the computation of $B(j)$) can be modified.
Algorithm 2 BeWARE.Items: for a user $i_t$, recommends an item to this user in the case where a set of new items is made available.
Input: $i_t$, $\lambda$, $\alpha$
Input/Output: $R$, $S$
1: $(\hat{U}, \hat{V}) \leftarrow$ MatrixFactorization($R$)
2: $\forall j \notin J(i_t)$, $B(j) \leftarrow (\hat{U}_{I(j)})^T \hat{U}_{I(j)} + \lambda \cdot \#I(j) \cdot Id$
3: $j_t \leftarrow \operatorname{argmax}_{j \notin J(i_t)} \hat{U}_{i_t} \cdot \hat{V}_j^T + \alpha \sqrt{\hat{U}_{i_t} B(j)^{-1} \hat{U}_{i_t}^T}$
4: Recommend item $j_t$ and receive rating $r_t = r_{i_t, j_t}$
5: Update $R$ and $S$
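Similarly, steps 2-3 of Alg. 2 can be sketched in numpy as below; one $B(j)$ system is solved per candidate item, which is what makes the step easy to parallelize:

```python
import numpy as np

def beware_items_select(U_hat, V_hat, mask, i_t, lam, alpha):
    """Steps 2-3 of BeWARE.Items: one confidence ellipsoid per candidate item."""
    k = U_hat.shape[1]
    u = U_hat[i_t]
    best_j, best_score = -1, -np.inf
    for j in np.flatnonzero(~mask[i_t]):             # items not yet rated by i_t
        I = np.flatnonzero(mask[:, j])               # I(j): users who rated j
        U_I = U_hat[I]
        B = U_I.T @ U_I + lam * max(I.size, 1) * np.eye(k)
        width = np.sqrt(u @ np.linalg.solve(B, u))   # sqrt(u B(j)^{-1} u^T)
        score = u @ V_hat[j] + alpha * width
        if score > best_score:
            best_j, best_score = j, score
    return best_j
```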
7 Experimental Investigation
In this section we empirically evaluate BeWARE on artificial data and on real datasets. The BeWARE algorithms are compared to:
– greedy approaches (denoted Greedy.ALS and Greedy.ALS-WR) that always choose the item with the largest current estimated value (respectively given a decomposition obtained by ALS, or by ALS-WR),
– the UCB1 approach [3] (denoted UCB.on.all.users) that considers each reward $r_{i_t, j_t}$ as an independent realization of a distribution $\nu_{j_t}$. In other words, UCB.on.all.users recommends an item without taking into account the information on the user requesting the recommendation.
The comparison to greedy selection highlights the need for exploration to obtain an optimal algorithm in the online context. The comparison to UCB.on.all.users assesses the benefit of personalizing recommendations.
7.1 Experimental Setting
For each dataset, each algorithm starts with an empty $R$ matrix of 100 items and 200 users. Then, the evaluation loop goes as follows (a code sketch of this loop is given after the list):
1. select a user uniformly at random among those who have not yet rated all the items,
2. ask the algorithm for an item among those this user has not yet rated,
3. compute the immediate regret (the difference in rating between the best not-yet-selected item and the one selected by the algorithm),
4. iterate until all users have rated all items.
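As referenced above, here is a sketch of this evaluation loop; `select_item` stands for any of the compared algorithms and is our own abstraction:

```python
import numpy as np

def run_evaluation(R_star, select_item, T, seed=0):
    """Turns a full rating matrix R_star into an online evaluation run
    and returns the trajectory of cumulated regret."""
    rng = np.random.default_rng(seed)
    n, m = R_star.shape
    rated = np.zeros((n, m), dtype=bool)
    cum, trajectory = 0.0, []
    for _ in range(T):
        users_left = np.flatnonzero(~rated.all(axis=1))
        if users_left.size == 0:                     # everyone rated everything
            break
        i = rng.choice(users_left)                   # step 1: pick a user
        j = select_item(i, rated)                    # step 2: recommend an item
        best = R_star[i, ~rated[i]].max()            # step 3: immediate regret
        cum += best - R_star[i, j]
        trajectory.append(cum)
        rated[i, j] = True                           # step 4: iterate
    return np.asarray(trajectory)
```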
The difficulty with real datasets is that the ground truth is unknown; actually, only a very small fraction of the ratings is known, which makes the evaluation of algorithms difficult. To overcome these difficulties, we also provide a comparison of the algorithms on an artificial problem based on a ground-truth matrix $R^*$ with $n$ users and $m$ items. This matrix is generated as in [6]: each item belongs to one of $k$ genres, and each user belongs to one of $l$ types. For each item $j$ of genre $a$ and each user $i$ of type $b$, $r^*_{i,j} = p_{a,b}$ is the ground-truth rating of item $j$ by user $i$, where $p_{a,b}$ is drawn uniformly at random in the set $\{1, 2, 3, 4, 5\}$. The observed rating $r_{i,j}$ is a noisy version of $r^*_{i,j}$: $r_{i,j} = r^*_{i,j} + \mathcal{N}(0, 0.5)$.
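Under the stated assumptions, this artificial problem can be generated with a few lines of numpy (the function name and the seed handling are ours):

```python
import numpy as np

def make_ground_truth(n, m, n_genres, n_types, noise_std=0.5, seed=0):
    """Generates R* as above: r*_{i,j} = p_{a,b} for item genre a, user type b."""
    rng = np.random.default_rng(seed)
    genres = rng.integers(n_genres, size=m)          # genre of each item
    types = rng.integers(n_types, size=n)            # type of each user
    p = rng.integers(1, 6, size=(n_genres, n_types)) # p_{a,b} uniform in {1..5}
    R_star = p[np.ix_(genres, types)].T              # shape (n, m)
    R_obs = R_star + rng.normal(0.0, noise_std, size=(n, m))
    return R_star, R_obs
```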
We also consider real datasets: the Netflix dataset [4] and the Yahoo!Music dataset [8]. Of course, the major issue with real data is that no dataset comes with a complete matrix, which means we no longer have access to the ground truth $R^*$; this makes the evaluation of algorithms more complex. This issue is usually solved in the bandit literature by a method based on rejection sampling [14]. For a well-constructed dataset, this kind of estimator is unbiased and comes with a known bound on the decrease of the error rate [12]. For all the algorithms, we restrict the possible choices for a user at time-step $t$ to the items with a known rating in the dataset. However, a minimum amount of ratings per user is needed to have a meaningful comparison of the algorithms (otherwise, a random strategy is the only reasonable one). As a consequence, with both datasets, we focus on the 5000 heaviest users for the top ∼250 movies/songs. This leads to a matrix $\tilde{R}^*$ with only 10% to 20% of missing ratings. We insist on the fact that this restriction is necessary for the performance evaluation of the algorithms; obviously, it is not required to use the algorithms on a live RS.
We would like to stress that this experimental methodology has a unique feature: it allows us to turn any matrix of ratings into an online problem which can be used to test bandit-based recommendation algorithms. We think that this methodology is another contribution of this paper.
Figure 2. Cumulated regret along time ($t$ from 0 to 4000) for Random, Greedy.ALS, Greedy.ALS-WR, BeWARE.ALS.users, BeWARE.users, BeWARE.ALS.items, BeWARE.items, and UCB on all users: (a) artificial dataset, (b) Netflix dataset, (c) Yahoo!Music dataset.