Bandits and Recommender Systems


HAL Id: hal-01256033

https://hal.inria.fr/hal-01256033

Submitted on 14 Jan 2016


Bandits and Recommender Systems

Jérémie Mary, Romaric Gaudel, Philippe Preux

To cite this version:

Jérémie Mary, Romaric Gaudel, Philippe Preux. Bandits and Recommender Systems. First International Workshop on Machine Learning, Optimization, and Big Data (MOD'15), Jul 2015, Taormina, Italy. pp. 325-336, ⟨10.1007/978-3-319-27926-8_29⟩. ⟨hal-01256033⟩


Jérémie Mary, Romaric Gaudel, and Philippe Preux
Université de Lille, CRIStAL (UMR CNRS), Villeneuve d'Ascq, France

{jeremie.mary,romaric.gaudel,philippe.preux}@univ-lille3.fr

Abstract. This paper addresses the on-line recommendation problem facing new users and new items; we assume that no information is available about either the users or the items. The only source of information is a set of ratings given by users to some items. By on-line, we mean that the set of users, the set of items, and the set of ratings evolve over time and that, at any moment, the recommendation system has to select items to recommend based on the currently available information, that is, essentially the sequence of past events. We also mean that each user comes with her preferences, which may evolve over short and longer time scales; so we have to continuously update these preferences. When the set of ratings is the only available source of information, the traditional approach is matrix factorization. In a decision making under uncertainty setting, actions should be selected to balance exploration with exploitation; this is best modeled as a bandit problem. Matrix factors provide a latent representation of users and items. These representations may then be used as contextual information by the bandit algorithm to select items. This last point is exactly the originality of this paper: the combination of matrix factorization and bandit algorithms to solve the on-line recommendation problem. Our work is driven by considering the recommendation problem as a feedback controlled loop. This leads to interactions between the representation learning and the recommendation policy.

1 Introduction

We consider the online version of the problem of the recommendation of items to users as faced by websites. Items may be ads, news, music, videos, movies, books, diapers, ... Being live, these systems have to cope with users about whom we have no information, and with new items introduced in the catalog whose attractiveness is unknown. Appetence of new users towards available items, and appeal of new items towards existing users, have to be estimated as fast as possible. Currently, this situation is handled thanks to side information available on the users and on the items (see [2,21]). In this paper, we consider this problem from a different perspective. Though perfectly aware of the potential utility of side information, we consider the problem without any side information, focusing only on estimating the appetences of new users and the appeal of new items as fast as possible; the use of side information can be mixed with the ideas presented in this paper.

Side information being unavailable, we learn a latent representation of each user and each item using the currently available ratings. As already argued by others (e.g. [16]), this problem fits perfectly into the sequential decision making framework, and more specifically, the bandit setting [20,10,9]. A sequential decision making problem under uncertainty faces an exploration vs. exploitation dilemma: exploration is meant to acquire information in order to perform better subsequently by exploiting it; collecting this information has a cost that can neither be merely zeroed, nor simply dismissed as unimportant. However, in rather sharp contrast with the traditional bandit setting, here the set of bandits is constantly being renewed; the number of bandits is not small, though not huge either (from a few dozen to hundreds of arms in general, up to dozens of millions in some applications): this makes the problem very different from the 2-armed bandit problem. We look for efficient and effective ways to address this task, since we want the proposed solution to be able to cope with real applications on the web. For obvious practical and economical reasons, the strategy cannot merely consist in repeatedly presenting all available items to users until their appetences seem accurately estimated. We have to consider the problem as an exploration vs. exploitation problem in which exploration is a necessary evil to acquire information and eventually improve the performance of the recommendation system (RS for short). To summarize, we learn a latent representation of each user and each item, from which a recommendation policy is deduced, based on the available ratings. This learning process is continuous: the representation and the recommendation policy are updated regularly, as new ratings are observed, new items are introduced into the set of items, new users flow in, and the preferences of already observed users change.

This being said, there remains the question of the objective function to optimize. Since the Netflix challenge, at least in the machine learning community, the recommendation problem has often been reduced to a matrix factorization problem, performed in batch, learning on a training set and minimizing the root mean squared error (RMSE) on a testing set. However, the RMSE comes with heavy flaws. Other objective functions have been considered to handle some of these flaws [7,19].

Based on these ideas, our contribution in this paper is the following:

We propose an original way to handle new users and new items in recommendation systems: we cast this problem as a sequential decision making problem to be played online, which selects items to recommend in order to optimize the exploration/exploitation balance; our solution is then to perform the rating matrix factorization driven by the policy of this sequential decision problem, in order to focus on the most useful terms of the factorization. This is the core idea of the contributed algorithm, which we name BeWARE. The reader familiar with the bandit framework can think of this work as a contextual bandit that learns side information for each user and each item from the observed ratings, assuming the existence of a latent space of dimension k for both users and items. We stress the fact that learning and updating the representation of users and items at the same time as recommendations are made is very different from the traditional batch matrix factorization approach, or from the traditional bandit setting.

We also introduce a methodology that uses classical partially filled rating matrices to assess the online performance of a bandit-based recommendation algorithm.

After introducing our notations in the next section, Sec. 3 briefly presents the matrix factorization approach. Sec. 4 introduces the necessary background in bandit theory. In Sec. 5 and Sec. 6, we present BeWARE in the case of new users and in the case of new items, respectively. Sec. 7 provides an experimental study on artificial data and on real data. Finally, we conclude and draw some future lines of work in Sec. 8.

2 Notations and Vocabulary

$U^T$ is the transpose of matrix $U$, and $U_i$ denotes its $i$-th row. For a vector $u$ and a set of integers $S$, $u_S$ is the sub-vector of $u$ composed of the elements of $u$ whose indices belong to $S$. Accordingly, $U$ being a matrix, $U_S$ is the sub-matrix made of the rows of $U$ whose indices belong to $S$. $\#u$ is the number of components (dimension) of $u$, and $\#S$ is the number of elements of $S$.

Now, we introduce a set of notations dedicated to the RS problem. As we consider a time-evolving number of users and items, we note $n$ the current number of users and $m$ the current number of items. These should be indexed by a $t$ to denote time, though often in this paper $t$ is dropped to simplify the notation. Without loss of generality, we assume $n < N$ and $m < M$, that is, $N$ and $M$ are the maximal numbers of ever seen users and items (those figures may be as large as necessary). $R^*$ represents the ground truth, that is the matrix of ratings: $r^*_{i,j}$ is the rating given by user $i$ to item $j$. We suppose that there exist an integer $k$ and two matrices $U^*$ of size $N \times k$ and $V^*$ of size $M \times k$ such that $R^* = U^* {V^*}^T$. We denote $S$ the set of elements that have been observed, and $R$ the matrix such that $r_{i,j} = r^*_{i,j} + \eta_{i,j}$ if $(i,j) \in S$, where $\eta_{i,j}$ is a noise with zero mean and finite variance. The $\eta_{i,j}$ are i.i.d. In this paper, we assume that $R^*$ is fixed over time; at a given moment, only a submatrix made of $n$ rows and $m$ columns is actually useful. The observed part of $R$ grows over time; that is, the set $S$ is growing over time. $J(i)$ (resp. $I(j)$) denotes the set of items rated by user $i$ (resp. the set of users who rated item $j$). $\hat{U}$ and $\hat{V}$ denote estimates (in the statistical sense) of the matrices $U^*$ and $V^*$ respectively, and $\hat{U}\hat{V}^T$ is denoted by $\hat{R}$. We use the term "observation" to mean a triplet $(i, j, r_{i,j})$; the RS receives a stream of observations. We use the term "rating" to mean the value associated by a user to an item. It can be a rating as in the Netflix challenge, or an information meaning click or not, sale or not, etc. For the sake of legibility, in the online setting we omit the $t$ subscript for time dependency; $S$, $\hat{U}$, $\hat{V}$, $n$, $m$ should be subscripted with $t$.


3 Matrix Factorization

Since the Netflix challenge [4], many works in RS have been using matrix factorization: the matrix of observed ratings is assumed to be the product of two matrices of low rank $k$: $\hat{R} = \hat{U}\hat{V}^T$ [11]. $\hat{U}$ is a latent representation of users, while $\hat{V}$ is a latent representation of items. As most of the values of the rating matrix are unknown, the decomposition can only be done using the set of observations. The classical approach is to solve the regularized minimization problem $(\hat{U}, \hat{V}) \stackrel{\text{def}}{=} \arg\min_{U,V} \zeta(U,V)$, where

$$\zeta(U,V) \stackrel{\text{def}}{=} \sum_{(i,j)\in S} \left(r_{i,j} - U_i \cdot V_j^T\right)^2 + \lambda \cdot \Omega(U,V),$$

in which $\lambda \in \mathbb{R}^+$ and $\Omega(U,V)$ is a regularization term. $\zeta$ is not convex. The minimization is usually performed either by stochastic gradient descent (SGD), or by alternating least squares (ALS). Solving for $\hat{U}$ and $\hat{V}$ at once being non-convex, ALS alternates, at each iteration, an optimization of $\hat{U}$ keeping $\hat{V}$ fixed and an optimization of $\hat{V}$ keeping $\hat{U}$ fixed.

In this paper we consider ALS-WR [22], whose regularization term

$$\Omega(U,V) \stackrel{\text{def}}{=} \sum_i \#J(i)\,\|U_i\|^2 + \sum_j \#I(j)\,\|V_j\|^2$$

depends on the users' and items' respective importance in the matrix of ratings. This regularization is known to have a good empirical behavior, that is, limited overfitting, easy tuning of $\lambda$ and $k$, and low RMSE.
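For concreteness, here is a minimal NumPy sketch of ALS-WR restricted to the observed entries of the rating matrix. The function name `als_wr`, the dense NaN-masked representation of $R$, and the default hyper-parameters are illustrative choices, not the authors' implementation.

```python
import numpy as np

def als_wr(R, k=5, lam=0.05, n_iters=10, rng=None):
    """Alternating least squares with weighted-lambda regularization (ALS-WR).

    R is a dense (n, m) array with np.nan for unobserved ratings.
    Returns the factors U_hat (n, k) and V_hat (m, k).
    """
    rng = np.random.default_rng(rng)
    n, m = R.shape
    observed = ~np.isnan(R)
    U = rng.normal(scale=0.1, size=(n, k))
    V = rng.normal(scale=0.1, size=(m, k))
    for _ in range(n_iters):
        # Optimize U with V fixed: one ridge regression per user,
        # with regularization weighted by #J(i) (the ALS-WR weighting).
        for i in range(n):
            J = np.flatnonzero(observed[i])
            if J.size == 0:
                continue
            VJ = V[J]
            A = VJ.T @ VJ + lam * J.size * np.eye(k)
            U[i] = np.linalg.solve(A, VJ.T @ R[i, J])
        # Optimize V with U fixed: one ridge regression per item,
        # with regularization weighted by #I(j).
        for j in range(m):
            I = np.flatnonzero(observed[:, j])
            if I.size == 0:
                continue
            UI = U[I]
            B = UI.T @ UI + lam * I.size * np.eye(k)
            V[j] = np.linalg.solve(B, UI.T @ R[I, j])
    return U, V
```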

4 Bandits

Let us consider a bandit machine with $m$ independent arms. When pulling arm $j$, the player receives a reward drawn from $[0,1]$ which follows a probability distribution $\nu_j$. Let $\mu_j$ denote the mean of $\nu_j$, $j^* \stackrel{\text{def}}{=} \arg\max_j \mu_j$ be the best arm, and $\mu^* \stackrel{\text{def}}{=} \max_j \mu_j = \mu_{j^*}$ be the best expected reward (we assume there is only one best arm). $\{\nu_j\}$, $\{\mu_j\}$, $j^*$ and $\mu^*$ are unknown.

A player aims at maximizing the sum of rewards collected along $T$ consecutive pulls. More specifically, by denoting $j_t$ the arm pulled at time $t$ and $r_t$ the reward obtained at time $t$, the player wants to maximize the cumulative reward $CumRew_T = \sum_{t=1}^{T} r_t$. At each time-step but the last one, the player faces the dilemma:

– either exploit by pulling the arm which seems the best according to the estimated values of the parameters;

– or explore to improve the estimation of the parameters of the probability distribution of an arm by pulling it.

Li et al. [13] extend the bandit setting to contextual arms. They assume that a vector of real features $v \in \mathbb{R}^k$ is associated to each arm and that the expectation of the reward associated to an arm is $u^* \cdot v$, where $u^*$ is an unknown vector. The algorithm handling this setting is known as LinUCB. LinUCB consists in playing the arm with the largest upper confidence bound on the expected reward:

$$j_t = \arg\max_j \; \hat{u} \cdot v_j^T + \alpha \sqrt{v_j A^{-1} v_j^T},$$

where $\hat{u}$ is an estimate of $u^*$, $\alpha$ is a parameter, and $A = \sum_{t'=1}^{t-1} v_{j_{t'}} v_{j_{t'}}^T + Id$, where $Id$ is the identity matrix. Note that $\hat{u} \cdot v_j^T$ corresponds to an estimate of the expected reward, while $\sqrt{v_j A^{-1} v_j^T}$ is an optimistic correction of that estimate.

While the objective of LinUCB is to maximize the cumulative reward, theoretical results [13,1] are expressed in terms of cumulative regret (or regret for short) $Regret_T \stackrel{\text{def}}{=} \sum_{t=1}^{T} (r^*_t - r_t)$, where $r^*_t = \max_j u^* \cdot v_j^T$ stands for the best expected reward at time $t$. Hence, the regret measures how much the player loses (in expectation) in comparison to playing the optimal strategy. Standard results prove regrets of order $\tilde{O}(\sqrt{T})$ or $O(\ln T)$, depending on the assumptions on the distributions and on the precise analysis ($\tilde{O}$ means $O$ up to a logarithmic factor in $T$).
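As an illustration, the following sketch implements this selection rule in NumPy. The names `linucb_select` and `linucb_update` are ours, and the caller is assumed to maintain the design matrix $A$ (initialized to the identity), the vector $b$, and the arm features.

```python
import numpy as np

def linucb_select(u_hat, A, arm_features, alpha):
    """Pick the arm maximizing the optimistic score u_hat.v_j + alpha * sqrt(v_j A^-1 v_j^T)."""
    A_inv = np.linalg.inv(A)
    scores = [float(v @ u_hat) + alpha * np.sqrt(float(v @ A_inv @ v))
              for v in arm_features]
    return int(np.argmax(scores))

def linucb_update(A, b, v, reward):
    """Accumulate the played arm's features and reward; the new estimate is u_hat = A^-1 b."""
    A += np.outer(v, v)   # A = Id + sum over past plays of v v^T
    b += reward * v
    return np.linalg.solve(A, b)
```

Starting from $A = Id$ and $b = 0$, alternating `linucb_select` and `linucb_update` reproduces the optimism-in-the-face-of-uncertainty behavior described above.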

Of course, LinUCB and other contextual bandit algorithms require the context (values of the features) to be provided. In real applications this is done using side information about the items and the users [17], i.e. expert knowledge, categorization of items, Facebook profiles of users, implicit feedback, etc. The core idea of this paper is to use matrix factorization techniques to build a context online using the known ratings. To this end, one assumes that the users and the items can be represented in the same space of dimension $k$, and that the rating of user $u$ for item $v$ is the scalar product of $u$ and $v$.

We study the introduction of new items and/or new users into the RS. This is done without using any side information on users or items.

5 BeWARE of a new user

Let us consider a particular recommendation scenario. At each time-step $t$:

1. a user $i_t$ requests a recommendation from the RS,
2. the RS selects an item $j_t$ among the set of items that have never been recommended to user $i_t$ beforehand,
3. user $i_t$ returns a rating $r_t = r_{i_t,j_t}$ for item $j_t$.

Obviously, the objective of the RS is to maximize the cumulative reward $CumRew_T = \sum_{t=1}^{T} r_t$. In the context of such a scenario, the usual matrix factorization approach of RS recommends the item $j_t$ which has the best predicted rating for user $i_t$. This corresponds to a pure exploitation, or greedy, strategy, which is well known to be suboptimal for optimizing $CumRew_T$: to be optimal, the RS has to balance exploitation and exploration.

Let us now describe the recommendation algorithm we propose at time-step $t$. We aim at recommending to user $i_t$ an item $j_t$ which leads to the best trade-off between exploration and exploitation in order to maximize $CumRew_\infty$. We assume that the matrix $R$ is factored into $\hat{U}\hat{V}^T$ by ALS-WR, which terminated by optimizing $\hat{U}$ holding $\hat{V}$ fixed. In such a context, the UCB approach is based on a confidence interval on the estimated ratings $\hat{r}_{i_t,j} = \hat{U}_{i_t} \cdot \hat{V}_j^T$ for any allowed item $j$.

We assume that we have already observed a sufficient number of ratings for each item, but only a few ratings (possibly none) from user $i_t$. As a consequence, the uncertainty on $\hat{U}_{i_t}$ is much more important than on any $\hat{V}_j$. In other words, the uncertainty on $\hat{r}_{i_t,j}$ mostly comes from the uncertainty on $\hat{U}_{i_t}$. Let us express this uncertainty. Let $u^*$ denote the (unknown) true value of $U_{i_t}$ and let us introduce the $k \times k$ matrix

$$A \stackrel{\text{def}}{=} (\hat{V}_{J(i_t)})^T \cdot \hat{V}_{J(i_t)} + \lambda \cdot \#J(i_t) \cdot Id.$$

As $\hat{U}$ and $\hat{V}$ come from ALS-WR (whose last iteration optimized $\hat{U}$),

$$\hat{U}_{i_t} = A^{-1} \hat{V}_{J(i_t)}^T R_{i_t, J(i_t)}^T.$$

Using Azuma's inequality over the weighted sum of random variables (as introduced by [18] for linear systems), it follows that there exists a value $C \in \mathbb{R}$ such that, with probability $1 - \delta$:

$$(\hat{U}_{i_t} - u^*)\, A^{-1} (\hat{U}_{i_t} - u^*)^T \le \frac{C \log(1/\delta)}{t}.$$

This inequality defines the confidence bound around the estimate $\hat{U}_{i_t}$ of $u^*$. Therefore, a UCB strategy selects item $j_t$:

$$j_t \stackrel{\text{def}}{=} \arg\max_{1 \le j \le m,\; j \notin J(i_t)} \hat{U}_{i_t} \cdot \hat{V}_j^T + \alpha \sqrt{\hat{V}_j A^{-1} \hat{V}_j^T},$$

where $\alpha \in \mathbb{R}$ is an exploration parameter to be tuned. Fig. 1(a) provides a graphical illustration of the link between the bound and this choice of item $j_t$.

Our algorithm, named BeWARE.User (BeWARE stands for "Bandit WARms-up REcommenders"), is described in Alg. 1. The presentation is optimized for clarity rather than for computational efficiency. Of course, if the exploration parameter $\alpha$ is set to 0, BeWARE.User makes a greedy selection of the item to recommend. The estimation of the center of the ellipsoid and of its size can be influenced by the use of another regularization term. BeWARE.User uses a regularization based on ALS-WR. It is possible to replace all $\#J(\cdot)$ by 1; this amounts to the standard regularization, and we call this slightly different algorithm BeWARE.ALS.User. In fact, one can use any regularization as long as $\hat{U}_{i_t}$ is a linear combination of observed rewards.


[Figure 1: upper confidence ellipsoids in $\mathbb{R}^k$. (a) New user. (b) New items.]

Figure 1. (a) The left panel illustrates the use of the upper confidence ellipsoid for item selection for the new user $i_t$ who enters the game at time $t$. Items and users are vectors in $\mathbb{R}^k$ (one may suppose that $k = 2$ in this figure to make it fit in the plane). Red dots represent items. The blue ellipse represents the confidence ellipsoid of the vector associated to the new user. The optimistic rating of the user for an item $j$ is the maximum dot product between $\hat{V}_j$ and any point in this ellipsoid. By a simple geometrical argument based on iso-contours of the dot product, this maximum value is equal to the dot product between $\hat{V}_j$ and $\tilde{u}^{(j)}_{i_t}$. Optimism leads to recommending the item maximizing the dot product $\langle \tilde{u}^{(j)}_{i_t}, \hat{V}_j \rangle$. (b) The right panel illustrates the use of the upper confidence ellipsoid for item selection in the context of a set of new items. The setting is similar to the case of a new user, except that the vector associated to the user is known (represented by a blue dot) while each item now has its own confidence ellipsoid. The optimistic RS recommends the item maximizing the scalar product $\langle \hat{U}_{i_t}, \tilde{v}^{(j)} \rangle$.

Algorithm 1 BeWARE.User: for a user $i_t$, recommends an item to this user.

Input: $i_t$, $\lambda$, $\alpha$
Input/Output: $R$, $S$

1: $(\hat{U}, \hat{V}) \leftarrow$ MatrixFactorization($R$)
2: $A \leftarrow (\hat{V}_{J(i_t)})^T \cdot \hat{V}_{J(i_t)} + \lambda \cdot \#J(i_t) \cdot Id$
3: $j_t \leftarrow \arg\max_{j \notin J(i_t)} \hat{U}_{i_t} \cdot \hat{V}_j^T + \alpha \sqrt{\hat{V}_j A^{-1} \hat{V}_j^T}$
4: Recommend item $j_t$ and receive rating $r_t = r_{i_t,j_t}$
5: Update $R$, $S$
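As a complement to Alg. 1, here is a minimal NumPy sketch of the selection step (lines 2-3), assuming the factors come from an ALS-WR pass whose last half-iteration optimized $\hat{U}$ (e.g. the `als_wr` sketch of Sec. 3). The function name `beware_user_select` and the guard used when $J(i_t)$ is empty are our illustrative choices.

```python
import numpy as np

def beware_user_select(U_hat, V_hat, i_t, rated, lam, alpha):
    """Return the item maximizing U_hat[i_t].V_j^T + alpha * sqrt(V_j A^-1 V_j^T).

    `rated` is the set J(i_t) of items already rated by user i_t.
    """
    k = V_hat.shape[1]
    J = np.array(sorted(rated), dtype=int)
    VJ = V_hat[J] if J.size else np.zeros((0, k))
    # A as in Alg. 1, line 2; max(., 1) only guards against an empty J(i_t).
    A = VJ.T @ VJ + lam * max(J.size, 1) * np.eye(k)
    A_inv = np.linalg.inv(A)
    candidates = [j for j in range(V_hat.shape[0]) if j not in rated]
    scores = [float(U_hat[i_t] @ V_hat[j])
              + alpha * np.sqrt(float(V_hat[j] @ A_inv @ V_hat[j]))
              for j in candidates]
    return candidates[int(np.argmax(scores))]
```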


6 BeWARE of new items

In general, a set of new items is introduced at once, not a single item. In this case, the uncertainty is more important on the items. We compute a confidence bound around the items instead of the users, assuming ALS terminates with an optimization of $\hat{V}$ keeping $\hat{U}$ fixed. With the same criterion and regularization on $\hat{V}$ as above, at time-step $t$:

$$\hat{V}_j = B(j)^{-1} (\hat{U}_{I(j)})^T R_{I(j),j}, \quad \text{with } B(j) \stackrel{\text{def}}{=} (\hat{U}_{I(j)})^T \hat{U}_{I(j)} + \lambda \cdot \#I(j) \cdot Id.$$

So the upper confidence bound of the rating for user $i$ on item $j$ is:

$$\hat{U}_i \cdot \hat{V}_j^T + \alpha \sqrt{\hat{U}_i B(j)^{-1} \hat{U}_i^T}.$$

This leads to the algorithm BeWARE.Items presented in Alg. 2. Again, the presentation is optimized for clarity rather than for computational efficiency. BeWARE.Items can be parallelized and has the complexity of one step of ALS. Fig. 1(b) gives the geometrical intuition leading to BeWARE.Items. Again, setting $\alpha = 0$ leads to a greedy selection. The regularization (line 2 of Alg. 2) can be modified.

Algorithm 2 BeWARE.Items: for a user $i_t$, recommends an item to this user in the case where a set of new items is made available.

Input: $i_t$, $\lambda$, $\alpha$
Input/Output: $R$, $S$

1: $(\hat{U}, \hat{V}) \leftarrow$ MatrixFactorization($R$)
2: $\forall j \notin J(i_t)$, $B(j) \leftarrow (\hat{U}_{I(j)})^T \hat{U}_{I(j)} + \lambda \cdot \#I(j) \cdot Id$
3: $j_t \leftarrow \arg\max_{j \notin J(i_t)} \hat{U}_{i_t} \cdot \hat{V}_j^T + \alpha \sqrt{\hat{U}_{i_t} B(j)^{-1} \hat{U}_{i_t}^T}$
4: Recommend item $j_t$ and receive rating $r_t = r_{i_t,j_t}$
5: Update $R$ and $S$
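For symmetry with the previous sketch, here is an equally minimal illustration of the per-item computation of Alg. 2 (lines 2-3); `raters[j]` plays the role of $I(j)$, and the names and guards are again illustrative.

```python
import numpy as np

def beware_items_select(U_hat, V_hat, i_t, rated, raters, lam, alpha):
    """Return the item maximizing U_hat[i_t].V_j^T + alpha * sqrt(u B(j)^-1 u^T)."""
    k = U_hat.shape[1]
    u = U_hat[i_t]
    best_j, best_score = None, -np.inf
    for j in range(V_hat.shape[0]):
        if j in rated:                      # only items not yet rated by i_t
            continue
        I = np.array(sorted(raters[j]), dtype=int)
        UI = U_hat[I] if I.size else np.zeros((0, k))
        # B(j) as in Alg. 2, line 2; max(., 1) guards against an item with no rating.
        B = UI.T @ UI + lam * max(I.size, 1) * np.eye(k)
        score = float(u @ V_hat[j]) + alpha * np.sqrt(float(u @ np.linalg.inv(B) @ u))
        if score > best_score:
            best_j, best_score = j, score
    return best_j
```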

7 Experimental Investigation

In this section, we empirically evaluate BeWARE on artificial data and on real datasets. The BeWARE algorithms are compared to:

– greedy approaches (denoted Greedy.ALS and Greedy.ALS-WR) that always choose the item with the largest current estimated value (respectively given a decomposition obtained by ALS, or by ALS-WR),

– the UCB1 approach [3] (denoted UCB.on.all.users) that considers each reward $r_{i_t,j_t}$ as an independent realization of a distribution $\nu_{j_t}$. In other words, UCB.on.all.users recommends an item without taking into account the information on the user requesting the recommendation.


The comparison to greedy selection highlights the need for exploration to obtain an optimal algorithm in the online context. The comparison to UCB.on.all.users assesses the benefit of personalizing recommendations.

7.1 Experimental Setting

For each dataset, each algorithm starts with an empty matrix $R$ of 100 items and 200 users. Then, the evaluation proceeds as follows:

1. select a user uniformly at random among those who have not yet rated all the items,
2. ask the algorithm for an item to recommend to this user, among those she has not yet rated,
3. compute the immediate regret (the difference of rating between the best not yet selected item and the one selected by the algorithm),
4. iterate until all users have rated all items.
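The sketch below shows how this protocol turns a (partially known) rating matrix into an online evaluation loop with cumulated regret; `recommend` stands for any of the compared policies and `R_truth` holds the known ratings (NaN elsewhere). It is a simplified illustration of the protocol, not the exact evaluation code used for the experiments.

```python
import numpy as np

def evaluate(R_truth, recommend, rng=None):
    """Replay the online protocol on R_truth and return the cumulated regret curve."""
    rng = np.random.default_rng(rng)
    n, m = R_truth.shape
    known = [set(np.flatnonzero(~np.isnan(R_truth[i]))) for i in range(n)]
    rated_by = [set() for _ in range(n)]
    cumulated, curve = 0.0, []
    while True:
        active = [i for i in range(n) if rated_by[i] != known[i]]
        if not active:
            break
        i = rng.choice(active)                      # 1. pick a user at random
        allowed = list(known[i] - rated_by[i])      # restrict to items rated in the dataset
        j = recommend(i, allowed)                   # 2. the policy selects one of them
        best = max(R_truth[i, jj] for jj in allowed)
        cumulated += best - R_truth[i, j]           # 3. immediate regret
        curve.append(cumulated)
        rated_by[i].add(j)                          # 4. iterate
    return curve
```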

The difficulty with real datasets is that the ground truth is unknown and, actually, only a very small fraction of the ratings is known. This makes the evaluation of algorithms uneasy. To overcome these difficulties, we also provide a comparison of the algorithms on an artificial problem based on a ground truth matrix $R^*$ with $n$ users and $m$ items. This matrix is generated as in [6]. Each item belongs to one of $k$ genres, and each user belongs to one of $l$ types. For each item $j$ of genre $a$ and each user $i$ of type $b$, $r^*_{i,j} = p_{a,b}$ is the ground truth rating of item $j$ by user $i$, where $p_{a,b}$ is drawn uniformly at random in the set $\{1, 2, 3, 4, 5\}$. The observed rating $r_{i,j}$ is a noisy version of $r^*_{i,j}$: $r_{i,j} = r^*_{i,j} + \mathcal{N}(0, 0.5)$.
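A small sketch of this generator follows (the function name and signature are ours; only the construction of $p_{a,b}$, of the genre/type assignments, and of the Gaussian noise follows the description above).

```python
import numpy as np

def make_artificial(n_users, n_items, n_types, n_genres, noise=0.5, rng=None):
    """Return (R_star, R_obs): block-constant ground truth and its noisy observation."""
    rng = np.random.default_rng(rng)
    p = rng.integers(1, 6, size=(n_genres, n_types))   # p_{a,b} drawn uniformly in {1,...,5}
    genre = rng.integers(0, n_genres, size=n_items)    # genre of each item
    utype = rng.integers(0, n_types, size=n_users)     # type of each user
    R_star = p[genre][:, utype].T.astype(float)        # r*_{i,j} = p_{genre(j), type(i)}
    R_obs = R_star + rng.normal(0.0, noise, size=R_star.shape)
    return R_star, R_obs
```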

We also consider real datasets: the Netflix dataset [4] and the Yahoo!Music dataset [8]. Of course, the major issue with real data is that no dataset comes with a complete matrix, which means we no longer have access to the ground truth $R^*$; this makes the evaluation of algorithms more complex. This issue is usually solved in the bandit literature by a method based on reject sampling [14]. For a well constructed dataset, this kind of estimator has no bias and a known bound on the decrease of the error rate [12]. For all the algorithms, we restrict the possible choices for a user at time-step $t$ to the items with a known rating in the dataset. However, a minimum amount of ratings per user is needed to have a meaningful comparison of the algorithms (otherwise, a random strategy is the only reasonable one). As a consequence, with both datasets, we focus on the 5000 heaviest users and the top ∼250 movies/songs. This leads to a matrix with only 10% to 20% of missing ratings. We insist on the fact that this filtering is necessary for the performance evaluation of the algorithms; obviously, it is not required to use the algorithms in a live RS.

We would like to emphasize that this experimental methodology has a unique feature: it allows us to turn any matrix of ratings into an online problem which can be used to test bandit-based recommendation algorithms. We think that this methodology is another contribution of this paper.


[Figure 2: cumulated regret vs. t for (a) the artificial dataset, (b) the Netflix dataset, (c) the Yahoo!Music dataset. Compared policies: Random, Greedy.ALS, Greedy.ALS-WR, BeWARE.ALS.users, BeWARE.users, BeWARE.ALS.items, BeWARE.items, UCB on all users.]

Figure 2. Cumulated regret (the lower, the better) for a set of 100 new items and 200 users with no prior information. Figures are averaged over 20 runs (for Netflix and artificial data, k = 5, λ = 0.05, α = 0.12, whereas for Yahoo!Music, k = 8, λ = 0.2, α = 0.05). On the artificial dataset (a), BeWARE.items is better than the other strategies in terms of regret. On the Netflix dataset (b), UCB on all users is the best approach and BeWARE.items is the second best. On the Yahoo!Music dataset (c), BeWARE.items, Greedy.ALS-WR and UCB all three lead to similar performance.

7.2 Experimental Results

Figures 2(a) and 2(b) show that given a fixed factorization method, BeWARE strategies outperform greedy item selection. Looking more closely at the results, BeWARE.items performs better than BeWARE.user, and BeWARE.user is the only BeWARE strategy beaten by its greedy counterpart (Greedy.ALS-WR) on the Netflix dataset. These results demonstrate that an online strategy has to care about exploration to tend towards optimality.

While UCB.on.all.users is almost the worst approach on artificial data (Fig. 2(a)), it surprisingly performs better than all other approaches on the Netflix dataset. We feel that this difference is strongly related to the preprocessing of the Netflix dataset we had to perform in order to follow the experimental protocol (and have an evaluation at all). By focusing on the top ∼250 movies, we only keep blockbusters that everyone enjoys. With that particular subset of movies, there is no need to adapt the recommendation user per user. As a consequence, UCB.on.all.users suffers a smaller regret than the other strategies, as it considers the users as n independent realizations of the same distribution. It is worth noting that the regret of UCB.on.all.users would increase with the number of items, while the regret of BeWARE scales with the dimensionality of the factorization, which makes BeWARE a better candidate for real applications with many more items to deal with.

Last, on the Yahoo!Music dataset (Fig. 2(c)), all algorithms suffer the same regret.


7.3 Discussion

In a real setting, BeWARE.items has a desirable property: it tends to favor new items over older ones, simply because new items have fewer ratings than the others, hence larger confidence bounds. So the algorithm gives them a boost, which is exactly what a webstore wants. Moreover, the RS then makes the best use of the novelty effect associated with new items. This natural attraction of users to new items can be very strong, as was shown during the Exploration & Exploitation challenge at ICML'2012, which was won by a context-free algorithm [15].

The computational cost of BeWARE is the same as that of an additional step of alternating least squares; moreover, some intermediate calculations of the QR factorization can be re-used to speed up the computation. So the total cost of BeWARE.Items is almost the same as that of ALS-WR. Even better, while the online setting requires recomputing the factorization at each time-step, this factorization changes only slightly from one iteration to the next. As a consequence, only a few ALS-WR iterations are needed to update the factorization. Overall, the computational cost remains reasonable even in a real application.

8 Conclusion and Future Work

In this paper, we have bridged matrix factorization with bandits to address in a principled way the balance between exploration and exploitation faced by online recommendation systems when considering new users or new items. We think that this contribution is conceptually rich and opens the way to many different studies. We showed on large, publicly available datasets that this approach is also effective, leading to efficient algorithms able to work online, under the computational constraints expected of such systems. Furthermore, the algorithms are quite easy to implement.

Many extensions are currently under study. First, we work on extending these algorithms to use contextual information about users and items. This will require combining the similarity measure with confidence bounds; this might be translated into a Bayesian prior. We also want to analyze regret bounds for large enough numbers of items and users. This part can be tricky as LinUCB still does not have a full formal analysis, though some insights are available in [1].

Another important point is to work on the recommendation of several items at once while getting feedback only for one of them. There has been some work on this point for non-contextual bandits [5].

Finally, we plan to combine confidence ellipsoids about both users and items. We feel that such a combination has low odds of providing better results for real applications, but it is interesting from a theoretical perspective and should lead to even better results on artificial problems.

Acknowledgements: the authors acknowledge the support of INRIA and the stimulating environment of the research group SequeL.


References

1. Y. Abbasi-Yadkori, D. Pál, and Cs. Szepesvári. Improved algorithms for linear stochastic bandits. In Proc. NIPS, pages 2312–2320, 2011.

2. D. Agarwal, B-Ch. Chen, P. Elango, N. Motgi, S-T. Park, R. Ramakrishnan, S. Roy, and J. Zachariah. Online models for content optimization. In Proc. NIPS, pages 17–24, 2008.

3. P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, May 2002.

4. J. Bennett, S. Lanning, and Netflix. The Netflix prize. In KDD Cup and Workshop, 2007.

5. N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. J. Comput. Syst. Sci., 78(5):1404–1422, 2012.

6. S. Chatterjee. Matrix estimation by universal singular value thresholding. Pre-print, 2012. http://arxiv.org/abs/1212.1247.

7. Ch. Dhanjal, R. Gaudel, and S. Clémençon. Collaborative filtering with localised ranking. In Proc. AAAI, 2015.

8. G. Dror, N. Koenigstein, Y. Koren, and M. Weimer. The Yahoo! music dataset and kdd-cup’11. In Proceedings of KDD Cup, 2011.

9. S. Feldman. Personalization with contextual bandits. http://engineering.richrelevance.com/author/sergey-feldman/.

10. P. Kohli, M. Salek, and G. Stoddard. A fast bandit algorithm for recommendations to users with heterogeneous tastes. In Proc. AAAI, pages 1135–1141, 2013.

11. Y. Koren, R. Bell, and Ch. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, August 2009.

12. J. Langford, A. Strehl, and J. Wortman. Exploration scavenging. In Proc. ICML, pages 528–535. Omnipress, 2008.

13. L. Li, W. Chu, J. Langford, and R.E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proc. WWW, pages 661–670, New York, NY, USA, 2010. ACM.

14. L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proc. WSDM, pages 297–306. ACM, 2011.

15. J. Mary, A. Garivier, L. Li, R. Munos, O. Nicol, R. Ortner, and P. Preux. ICML Exploration and Exploitation 3 – New Challenges, 2012.

16. G. Shani, D. Heckerman, and R. I. Brafman. An MDP-based recommender system. Journal of Machine Learning Research, 6:1265–1295, September 2005.

17. P.K. Shivaswamy and Th. Joachims. Online learning with preference feedback, 2011. NIPS workshop on choice models and preference learning.

18. Th. J. Walsh, I. Szita, C. Diuk, and M. L. Littman. Exploring compact reinforcement-learning representations with linear regression. CoRR, abs/1205.2606, 2012.

19. J. Weston, H. Yee, and R.J. Weiss. Learning to rank recommendations with the k-order statistic loss. In Proc. of RecSys, pages 245–248. ACM, 2013.

20. J. M. White. Bandit algorithms for website optimization. O’Reilly, 2012.

21. Y. Yue, S. A. Hong, and C. Guestrin. Hierarchical exploration for accelerating contextual bandits. In Proc. ICML, pages 1895–1902, 2012.

22. Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix prize. In Proceedings of the 4th International Conference on Algorithmic Aspects in Information and Management (AAIM), pages 337–348, Berlin, Heidelberg, 2008. Springer-Verlag.
