
HAL Id: hal-01400318

https://hal.archives-ouvertes.fr/hal-01400318

Preprint submitted on 21 Nov 2016

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Low-rank Bandits with Latent Mixtures

Aditya Gopalan, Odalric-Ambrym Maillard, Mohammadi Zaki

To cite this version:

Aditya Gopalan, Odalric-Ambrym Maillard, Mohammadi Zaki. Low-rank Bandits with Latent Mixtures. 2016. ⟨hal-01400318⟩


arXiv:1609.01508v1 [cs.LG] 6 Sep 2016

Low-rank Bandits with Latent Mixtures

Aditya Gopalan^1 (ADITYA@ECE.IISC.ERNET.IN)

Odalric-Ambrym Maillard^2 (ODALRICAMBRYM.MAILLARD@INRIA.FR)

Mohammadi Zaki^1 (ZAKI@ECE.IISC.ERNET.IN)

1. Electrical Communication Engineering, Indian Institute of Science, Bangalore 560012, India

2. Inria Saclay – Île-de-France, Laboratoire de Recherche en Informatique, 660 Claude Shannon, 91190 Gif-sur-Yvette, France

Abstract

We study the task of maximizing rewards from recommending items (actions) to users sequentially interacting with a recommender system. Users are modeled as latent mixtures of $C$ representative user classes, where each class specifies a mean reward profile across actions. Both the user features (mixture distribution over classes) and the item features (mean reward vector per class) are unknown a priori. The user identity is the only contextual information available to the learner while interacting. This induces a low-rank structure on the matrix of expected rewards $r_{a,b}$ from recommending item $a$ to user $b$. The problem reduces to the well-known linear bandit when either user- or item-side features are perfectly known. In the setting where each user, with its stochastically sampled taste profile, interacts only for a small number of sessions, we develop a bandit algorithm for the two-sided uncertainty. It combines the Robust Tensor Power Method of Anandkumar et al. (2014b) with the OFUL linear bandit algorithm of Abbasi-Yadkori et al. (2011). We provide the first rigorous regret analysis of this combination, showing that its regret after $T$ user interactions is $\tilde{O}(C\sqrt{BT})$, with $B$ the number of users. An ingredient towards this result is a novel robustness property of OFUL, of independent interest.

Keywords: Multi-armed bandits, online learning, low-rank matrices, recommender systems, reinforcement learning.

1. Introduction

Recommender systems aim to provide targeted, personalized content recommendations to users by learning their responses over time. The underlying goal is to be able to predict which items a user might prefer based on preferences expressed by other related users and items, also known as the principle of collaborative filtering.

A popular approach to model preferences expressed by users in recommender systems is via probabilistic mixture models or latent class models (Hofmann and Puzicha, 1999; Kleinberg and Sandler, 2004). In such a mixture model, we have a set of $A$ items (content) that can be recommended to $B$ users (consumers). Whenever item $a$ is recommended to user $b$, the system gains an expected reward of $r_{a,b}$. The key structural assumption that captures the relationship between users' preferences is that there exists a latent set of $C$ representative user types or typical taste profiles.


Formally, each taste profile $c$ is a unique vector $u_c \equiv (u_{a,c})_a$ of the expected rewards that every item $a$ elicits under the taste profile. Each user $b$ is assumed to sample one of the typical profiles randomly using an individual probability distribution $v_b \equiv (v_{b,c})_c$; its reward distribution across the items subsequently becomes that induced by the sampled profile.
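As a concrete illustration of this generative model, the following sketch simulates one interaction. The toy dimensions, the Dirichlet-distributed mixtures, and the Gaussian noise level are arbitrary choices for illustration, not taken from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = 20, 5, 3                 # items, users, latent classes (toy sizes)
U = rng.uniform(0, 1, (A, C))      # U[a, c]: mean reward of item a under class c
V = rng.dirichlet(np.ones(C), B)   # V[b]: user b's mixture distribution over classes

def interact(b, a, noise=0.1):
    """One interaction: user b re-samples a taste profile c ~ v_b,
    then item a yields reward u_{a,c} plus centered noise."""
    c = rng.choice(C, p=V[b])
    return U[a, c] + noise * rng.standard_normal()

# The matrix of expected rewards r_{a,b} = (U V^T)[a, b] has rank at most C.
R = U @ V.T
assert np.linalg.matrix_rank(R) <= C
```

Only the user index $b$ is revealed to the learner; both $U$ and $V$ are unknown and must be estimated online.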

Our focus is to address the sequential optimization of the net reward gained by the recommender, without any prior knowledge of either the latent user classes or the users' mixture distributions. Assuming that users arrive to the system repeatedly following an unknown stochastic process and re-sample their profiles over time according to their respective unknown mixtures across latent classes, we seek online learning strategies that achieve low regret relative to the best single item that can be recommended to each user. Note that this is qualitatively different from the task of estimating latent classes or user mixtures in a batch fashion, which is by now well studied (Sutskever et al., 2009; Anandkumar et al., 2014a,b); the task of simultaneously optimizing net utility in a bandit fashion in complex mixture models like these has received little or no analytical treatment. Our work takes a step towards filling this void.

An especially challenging aspect of online learning in recommender systems is the relatively meager number of available interactions with the same user, which is offset to an extent by the assumption that users can only have a limited number of taste profiles (classes). Indeed, if one can identify the class to which a certain user belongs and aggregate information from all other users in that class, then one can recommend to the user the best item for the class. In practice, classes are latent and not necessarily known in advance, and several works (Gentile et al., 2014; Lazaric et al., 2013; Maillard and Mannor, 2014) study the restricted situation in which each user always belongs to one specific class (i.e., when all mixture distributions have support size 1). We go two steps further, since in many situations (a) users cannot be assumed to belong to one class only, such as when a user account is shared by several individuals (e.g., a smart TV), and (b) the duration of a user session, that is, the number of consecutive recommendations to the same individual connected to a user account, cannot be assumed to be long.^1

The key challenges that this work addresses are (1) the lack of knowledge of "features" on both the user side and the item side in a linear bandit problem (in this case, both the user mixture weights and the item class reward profiles), and (2) provable regret minimization with very few, i.e., $O(1)$, interactions with every user $b$ having a specific taste profile, as opposed to a large number of interactions such as in transfer learning (Lazaric et al., 2013).

Contributions and overview of results. We consider a setting where users are assumed to come from arbitrary mixtures across classes (they are not assumed to fall perfectly in one class, as was the assumption in the works of Gentile et al. (2014); Maillard and Mannor (2014)). We develop a novel bandit algorithm (Algorithm 3) that combines (a) the Optimization in the Face of Uncertainty Linear bandit (OFUL) algorithm (Abbasi-Yadkori et al., 2011) for bandits with known action features, and (b) a variant of the Robust Tensor Power (RTP) algorithm (Anandkumar et al., 2014b) that uses only bandit (partial) estimates of latent user classes with observations coming from a mixture model. More specifically, we introduce a subroutine (Algorithm 1) that makes use of the RTP method to extract item-side attributes ($U$) and, contributing to its theoretical analysis, show a recovery property (Theorem 1). Note that the RTP method ideally requires (unbiased) estimates of the 2nd- and 3rd-order moments of actions' rewards, but with bandit information the learner can access only partial reward information, i.e., a single reward sample from an action. To overcome this, we devise an importance sampling scheme across 3 successive time instants to build the 2nd- and 3rd-order moment tensor estimates that RTP uses. For the task of issuing recommendations, we develop an algorithm (Section 4), essentially based on OFUL, instantiated per user, using for each $a$ the estimated latent class vectors $\{u_{a,c}\}_c$ (obtained via the RTP subroutine) as arm features, and the uncertain parameter vector to be learned $v_b$.

1. It is also unlikely to be very short, say, less than 3.


We carry out a rigorous analysis of the algorithm and show that it achieves regret $\tilde{O}(\ell C\sqrt{BT})$ in $T$ rounds of interaction (Theorem 4), provided each arriving user interacts with the system for $\ell > 3$ rounds with the same profile. In comparison, the regret of the strategy that completely disregards the latent mixture structure of rewards and employs a standard bandit strategy (e.g. UCB (Auer et al., 2002)) per user scales as $O(B\sqrt{TA/B}) = O(\sqrt{ABT})$ after $T$ rounds^2, which is considerably suboptimal in the practical case with a very large number of items but very few representative user classes ($C \ll A$). It is also worth noting that the regret bound we achieve, order-wise, is what would result from applying OFUL or any optimal linear bandit algorithm assuming a priori knowledge of all latent user classes $\{u_{a,c}\}_{a,c}$, that is, $\tilde{O}(\ell C\sqrt{BT})$. In this sense, our result shows that one can simultaneously estimate features on both sides of a bilinear reward model and achieve regret performance equivalent to that of a one-sided linear model, which is the first result of its kind to the best of our knowledge^3. Our results are presented for finite time horizons with explicit details of the constants arising from the error analysis of RTP, which at this point are large but possibly improvable.

En route to deriving the regret for our algorithm, we also make a novel contribution that advances the theoretical understanding of OFUL, and which is of independent interest. We show that in the standard linear bandit setting, where the expected reward of an arm depends linearly on $d$ features, OFUL yields (sub-linear) $\rho d\sqrt{T}$ regret even when it makes decisions based on perturbed or inexact feature vectors (Theorem 3), where $\rho$ quantifies the distortion. This property holds whenever the perturbation error is small enough, and we explicitly give both (a) a sufficient condition on the size of the perturbation in terms of the set of actual features, and (b) a bound on the (multiplicative) distortion $\rho$ in the regret due to the perturbation (note that $\rho = 1$ in the ideal linear case).

2. Setup and notation

For any positive integer $n$, $[n]$ denotes the set $\{1, 2, \ldots, n\}$.

At each $n \in \mathbb{N}$, nature selects a user $b_n \in [B]$ according to the probability distribution $\beta$ over $[B]$, independent of the past, and $b_n$ is revealed to the learner. A user class $c_n$ is subsequently sampled from the probability distribution $v_{b_n}$ over $[C]$, and $c_n$ (the assumed class of user $b_n$) interacts with the learner for the next $\ell > 3$ consecutive steps. Such an interaction will often be termed a mini-session.

In each step $l \in [\ell]$ of a mini-session, the learner plays an action (issues a recommendation) $a_{n,l} \in [A]$ and subsequently receives the reward $Y_{n,l} = u_{a_{n,l},c_n} + \eta_{n,l}$, where $\eta_{n,l}$ is a (centered) $R$-sub-Gaussian i.i.d. random variable independent of $a_{n,l}, c_n$, representing the noise in the reward. We let $u_a \in \mathbb{R}^C$ represent the vector $(u_{a,c})_{c \in [C]}$ of the mean rewards from action $a$ in each class. Note that $\mathbb{E}[u_{a_{n,l},c_n} \mid a_{n,l}] = \mathbb{E}[u_{a_{n,l}}^\top v_{b_n} \mid a_{n,l}]$. For convenience, we use the index notation $t \equiv (n, l)$ and introduce $T = N\ell$, where $N$ is the total number of mini-sessions and $T$ the total number of interactions of the learner with the system. We likewise denote $Y_t, a_t, c_t, \eta_t$ for $Y_{n,l}, a_{n,l}, c_n, \eta_{n,l}$, and let $u_{\max} \overset{\text{def}}{=} \max_{a \in [A], c \in [C]} |u_{a,c}|$.

We are interested in designing an online recommendation strategy, i.e., one that plays actions depending on past observations, achieving low (cumulative) regret after $T \equiv (N, \ell)$ mini-sessions, defined as $R_T \overset{\text{def}}{=} \sum_{n \in [N], l \in [\ell]} r_{n,l}$, where $r_{n,l} \overset{\text{def}}{=} \max_{a \in [A]} u_a^\top v_{b_n} - u_{a_{n,l}}^\top v_{b_n}$. In other words, we wish to compete against a strategy that plays for every user an action yielding the highest reward in expectation under its mixture distribution over user classes.

2. Roughly, each per-user UCB plays from a pool of $A$ actions for about $T/B$ rounds, thus suffering regret $O(\sqrt{A(T/B)})$.

3. An earlier result of Djolonga et al. (2013) gets $O(T^{4/5})$ regret while moreover assuming perfect control of the sampling process (we cannot assume this due to the user arrivals).


3. Recovering latent user classes: The EstimateFeatures subroutine

In this section, we provide an estimation algorithm for the matrix $U$, using the RTP method.^4

Estimation of tensors. We assume that in mini-session $n$, when interacting with user $b_n$, the triplet $\{a_{n,l}\}_{l \le \ell}$ is chosen from a distribution $p_n(a, a', a'' \mid b_n)$. Letting $X_{a_{n,l}, b_n, n, l} \overset{\text{def}}{=} Y_{n,l} = u_{a_{n,l}, c_n} + \eta_{n,l}$ to explicitly indicate the active user and action chosen at $(n, l)$, we form the importance-weighted estimates

$$\tilde r_{a,a',n} \overset{\text{def}}{=} \frac{1}{n} \sum_{i=1}^n \frac{X_{a_{i,1}, b_i, i, 1}\, X_{a_{i,2}, b_i, i, 2}}{p_i(a, a' \mid b_i)}\, \mathbb{I}\{a_{i,1} = a,\ a_{i,2} = a'\},$$

$$\tilde r_{a,a',a'',n} \overset{\text{def}}{=} \frac{1}{n} \sum_{i=1}^n \frac{X_{a_{i,1}, b_i, i, 1}\, X_{a_{i,2}, b_i, i, 2}\, X_{a_{i,3}, b_i, i, 3}}{p_i(a, a', a'' \mid b_i)}\, \mathbb{I}\{a_{i,1} = a,\ a_{i,2} = a',\ a_{i,3} = a''\}$$

for the second- and third-order tensors.^5
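To make the importance weighting concrete, here is a minimal sketch of the second-order estimate under uniform 3-wise sampling, i.e., $p_i(a, a', a'' \mid b_i) = 1/A^3$, so that any ordered pair has probability $1/A^2$. The toy sizes and the stand-in mean vector are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)
A, n = 4, 20000
mean = rng.uniform(0, 1, A)        # stand-in for the expected reward of each action

M2_hat = np.zeros((A, A))
for i in range(n):
    a1, a2, _ = rng.integers(A, size=3)    # uniform 3-wise proposal
    x1 = mean[a1] + 0.1 * rng.standard_normal()
    x2 = mean[a2] + 0.1 * rng.standard_normal()
    # Only the observed pair contributes, reweighted by 1/p_i(a1, a2) = A^2,
    # which makes each entry an unbiased estimate of E[X_a X_a'].
    M2_hat[a1, a2] += x1 * x2 * A**2
M2_hat /= n

# With independent noise across the two steps, E[X_a X_a'] = mean[a] * mean[a'].
err = np.abs(M2_hat - np.outer(mean, mean)).max()
assert err < 0.25
```

The third-order estimate is built the same way over observed triples, with weight $1/p_i(a, a', a'') = A^3$.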

We introduce the matrices $\hat M_{n,2} \equiv (\tilde r_{a,a',n})_{a,a' \in [A]}$ and $M_2 \equiv (m_{a,a'})_{a,a' \in [A]}$ with $m_{a,a'} \overset{\text{def}}{=} \mathbb{E}[\tilde r_{a,a',n}]$, and the tensors $\hat M_{n,3} \equiv (\tilde r_{a,a',a'',n})_{a,a',a'' \in [A]}$ and $M_3 \equiv (m_{a,a',a''})_{a,a',a'' \in [A]}$ with $m_{a,a',a''} \overset{\text{def}}{=} \mathbb{E}[\tilde r_{a,a',a'',n}]$. The following result decomposes the matrix $M_2$ and the tensor $M_3$ as weighted sums of outer products.

Lemma 1 When the user arrivals are i.i.d. according to the law $\beta$, i.e., $b_i \overset{\text{i.i.d.}}{\sim} \beta$ for all $i \in [n]$, it holds that

$$m_{a,a'} = \sum_{c \in [C]} v_{\beta,c}\, u_{a,c}\, u_{a',c}, \qquad \text{and} \qquad m_{a,a',a''} = \sum_{c \in [C]} v_{\beta,c}\, u_{a,c}\, u_{a',c}\, u_{a'',c}.$$
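The decomposition of Lemma 1 is straightforward to check numerically; the sizes and the weight vector standing in for the arrival-averaged mixture $v_\beta$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
A, C = 6, 2
U = rng.uniform(0, 1, (A, C))
v_beta = np.array([0.3, 0.7])      # stand-in for the class weights under arrivals

# Lemma 1: M2 = sum_c v_c u_c u_c^T and M3 = sum_c v_c u_c (x) u_c (x) u_c.
M2 = sum(v_beta[c] * np.outer(U[:, c], U[:, c]) for c in range(C))
M3 = np.einsum('c,ac,bc,dc->abd', v_beta, U, U, U)

assert np.allclose(M2, np.einsum('c,ac,bc->ab', v_beta, U, U))
assert np.linalg.matrix_rank(M2) <= C      # the rank-C structure exploited below
```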

Having shown the unbiasedness of the empirical 2nd and 3rd moment tensors $\hat M_{n,2}$ and $\hat M_{n,3}$, we next turn to showing concentration around their respective means.

Lemma 2 Assume that $p_i(a, a' \mid b_i) \ge q_{2,i}$ and $p_i(a, a', a'' \mid b_i) \ge q_{3,i}$ for deterministic $q_{2,i}, q_{3,i}$, for all $i \in \mathbb{N}$ and $a, a', a'' \in [A]$. Then for all $n \le N$, with probability higher than $1 - \delta$, it holds simultaneously for all $a, a', a''$ that

$$|\tilde r_{a,a',n} - m_{a,a'}| \le \sqrt{\sum_{i=1}^n \frac{q_{2,i}^{-2} \log(4A^2/\delta)}{2n^2}}, \qquad |\tilde r_{a,a',a'',n} - m_{a,a',a''}| \le \sqrt{\sum_{i=1}^n \frac{q_{3,i}^{-2} \log(4A^3/\delta)}{2n^2}}.$$

An immediate corollary is the following:

Corollary 1 Provided that $q_{2,i} \ge \gamma_i/A^2$ and $q_{3,i} \ge \gamma_i/A^3$ for some $\gamma_i > 0$, then on an event of probability higher than $1 - \delta$, the following hold simultaneously:

$$e_n^{(2)} \overset{\text{def}}{=} \|\hat M_{n,2} - M_2\| \le A^3 \sqrt{\sum_{i=1}^n \frac{\gamma_i^{-2} \log(4A^2/\delta)}{2n^2}}, \qquad e_n^{(3)} \overset{\text{def}}{=} \|\hat M_{n,3} - M_3\| \le A^{9/2} \sqrt{\sum_{i=1}^n \frac{\gamma_i^{-2} \log(4A^3/\delta)}{2n^2}}.$$

4. We consider $\ell = 3$ to describe the algorithm; $\ell > 3$ is easily handled by repeating the 3-wise sampling $p(a, a', a'')$ $\lfloor \ell/3 \rfloor$ times and discarding the remaining ($< 3$) steps in the mini-session during exploration (leading to a negligible regret overhead).

5. An alternative is the implicit exploration method due to Kocák et al. (2014).


Algorithm 1 EstimateFeatures

1: Input: #sessions $n$; #mini-sessions $\ell$; (user, action, reward) tuples $(b_i, a_{i,l}, X_{a_{i,l}, b_i, i, l})_{1 \le i \le n,\ 1 \le l \le \ell}$.
2: Compute the $A \times A$ matrix $\hat M_{n,2} = (\tilde r_{a,a',n})_{a,a' \in [A]}$ and the $A \times A \times A$ tensor $\hat M_{n,3} = (\tilde r_{a,a',a'',n})_{a,a',a'' \in [A]}$.
3: Compute an $A \times C$ whitening matrix $\hat W_n$ of $\hat M_{n,2}$. {Take $\hat W_n = \hat U_n \hat D_n^{-1/2}$, where $\hat D_n$ is the $C \times C$ diagonal matrix with the top $C$ eigenvalues of $\hat M_{n,2}$, and $\hat U_n$ the $A \times C$ matrix of corresponding eigenvectors.}
4: Form the $C \times C \times C$ tensor $\hat T_n = \hat M_{n,3}(\hat W_n, \hat W_n, \hat W_n)$.
5: Apply the RTP algorithm (Anandkumar et al., 2014b) to $\hat T_n$, and compute its robust eigenvalues $(\hat\lambda_{n,c})_{c \in [C]}$ with eigenvectors $(\hat\varphi_{n,c})_{c \in [C]}$. {The paper of Anandkumar et al. (2014b, Sec. 4) defines eigenvalues/eigenvectors of tensors.}
6: Compute, for each $c \in [C]$, $u_{n,c} = \hat\lambda_{n,c} (\hat W_n^\top)^\dagger \hat\varphi_{n,c}$ and $v_{n,c} = \hat\lambda_{n,c}^{-2}$.
7: Output: Estimate of latent classes $U$: the $A \times C$ matrix $U_n$ obtained by stacking the vectors $u_{n,c} \in \mathbb{R}^A$ side by side.
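On noiseless moments, the whitening-and-decomposition pipeline of steps 2-7 can be sketched as below. The inner loop is a plain tensor power iteration with deflation, a simplified stand-in for the actual RTP algorithm of Anandkumar et al. (2014b), which adds random restarts and robustness safeguards; the sizes and mixture weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
A, C = 8, 3
U = rng.uniform(0.1, 1.0, (A, C))         # latent classes to recover
w = np.array([0.5, 0.3, 0.2])             # positive mixture weights

M2 = np.einsum('c,ac,bc->ab', w, U, U)
M3 = np.einsum('c,ac,bc,dc->abd', w, U, U, U)

# Step 3: whitening matrix W = E D^{-1/2}, so that W^T M2 W = I_C.
evals, evecs = np.linalg.eigh(M2)
D, E = evals[-C:], evecs[:, -C:]
W = E / np.sqrt(D)

# Step 4: the C x C x C symmetric tensor T = M3(W, W, W).
T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)

# Steps 5-6: power iterations with deflation, then un-whitening.
U_hat = np.zeros((A, C))
for _ in range(C):
    phi = rng.standard_normal(C)
    phi /= np.linalg.norm(phi)
    for _ in range(100):
        phi = np.einsum('ijk,j,k->i', T, phi, phi)
        phi /= np.linalg.norm(phi)
    lam = np.einsum('ijk,i,j,k->', T, phi, phi, phi)
    col = lam * np.linalg.pinv(W.T) @ phi          # u_c = lambda (W^T)^+ phi
    c = np.argmin(np.linalg.norm(U - col[:, None], axis=0))  # match up to permutation
    U_hat[:, c] = col
    T = T - lam * np.einsum('i,j,k->ijk', phi, phi, phi)     # deflate

assert np.allclose(U_hat, U, atol=1e-5)    # exact recovery from exact moments
```

With the empirical moments $\hat M_{n,2}, \hat M_{n,3}$ in place of $M_2, M_3$, the recovery is only approximate, which is precisely what Theorem 1 quantifies.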

Reconstruction algorithm. The EstimateFeatures algorithm (Algorithm 1) employs a whitening matrix $\hat W_n$ of the empirical estimate $\hat M_{n,2}$ of the matrix $M_2$ to build the empirical tensor $\hat T_n$. This tensor is then used to recover the columns of the matrix $U = (u_{a,c})_{a \in [A], c \in [C]}$ via the RTP algorithm. For the sake of completeness, we also introduce $W$, a whitening matrix of $M_2$ (i.e., $W^\top M_2 W = I$), the corresponding tensor $T = M_3(W, W, W)$, and finally the estimation error $e_n \overset{\text{def}}{=} \|\hat T_n - T\|$.

Reconstruction guarantee. Our next result makes use of the following proposition from Anandkumar et al. (2014b, Theorem 5.1), restated here for completeness.

Proposition 1 (Theorem 5.1 of Anandkumar et al. (2014b)) Let $\hat T = T + E \in \mathbb{R}^{C \times C \times C}$, where $T$ is a symmetric tensor with orthogonal decomposition $T = \sum_{c=1}^C \lambda_c \varphi_c^{\otimes 3}$, where each $\lambda_c > 0$, $\{\varphi_c\}_{c \in [C]}$ is an orthonormal basis, and $E$ is a symmetric tensor with operator norm $\|E\| \le \varepsilon$. Let $\lambda_{\min} = \min\{\lambda_c : c \in [C]\}$ and $\lambda_{\max} = \max\{\lambda_c : c \in [C]\}$. Run the RTP algorithm with input $\hat T$ for $C$ iterations, and let $\{(\hat\lambda_c, \hat\varphi_c)\}_{c \in [C]}$ be the corresponding sequence of estimated eigenvalue/eigenvector pairs returned. Then there exist universal constants $C_1, C_2 > 0$ for which the following is true. Fix $\eta \in (0,1)$ and run RTP with parameters (i.e., numbers of iterations) $L, N$ with $L = \mathrm{poly}(C) \log(1/\eta)$ and $N \ge C_2 \big(\log(C) + \log\log(\lambda_{\max}/\varepsilon)\big)$. If $\varepsilon \le C_1 \lambda_{\min}/C$, then with probability at least $1 - \eta$, there exists a permutation $\pi \in S_C$ such that

$$\forall c \in [C]: \quad |\lambda_c - \hat\lambda_{\pi(c)}| \le 5\varepsilon, \qquad \|\varphi_c - \hat\varphi_{\pi(c)}\| \le 8\varepsilon/\lambda_c, \qquad \text{and} \qquad \Big\|T - \sum_{c=1}^C \hat\lambda_c \hat\varphi_c^{\otimes 3}\Big\| \le 55\varepsilon.$$

Lemma 1 gives a decomposition of the (symmetric) tensor $M_3$, but it may not be orthogonal; a standard transformation (Anandkumar et al., 2014b, Sec. 4.3) gives an orthogonal decomposition for the tensor^6 $M_3(W, W, W)$, with $W$ a matrix that whitens $M_2$. We can thus use Proposition 1 with $T = M_3(W, W, W)$, $\hat T = \hat T_n$, $\varepsilon = e_n$ and $\eta = \delta$ in order to prove the following guarantee (Theorem 1) on the recovery error between the columns of $U$ and their estimates.

6. For a 3rd-order tensor $A \in \mathbb{R}^{a \times a \times a}$ and a 2nd-order tensor (matrix) $B \in \mathbb{R}^{a \times b}$, $A(B, B, B) \in \mathbb{R}^{b \times b \times b}$ is the 3rd-order tensor defined by $[A(B, B, B)]_{i_1,i_2,i_3} \overset{\text{def}}{=} \sum_{j_1,j_2,j_3 \in [a]} A_{j_1,j_2,j_3} B_{j_1,i_1} B_{j_2,i_2} B_{j_3,i_3}$. See Anandkumar et al. (2014b) for more details on notation and results.

We now introduce mild separability conditions on the mixture weights $v_b$ and on the spectrum of the 2nd moment matrix $M_2$, needed for the reconstruction guarantee to hold, similar to those assumed in Lazaric et al. (2013, Theorem 2).

Assumption 1 There exist positive constants $v_{\min}$, $\sigma_{\min}$, $\sigma_{\max}$ and $\Gamma$ such that

$$\min_{b \in [B],\, c \in [C]} v_{b,c} \ge v_{\min}, \qquad \forall c \in [C],\ \sigma_c = \sqrt{\lambda_c(M_2)} \in [\sigma_{\min}, \sigma_{\max}], \qquad \min_{(c, c') \in [C] \times [C],\, c \ne c'} |\sigma_c - \sigma_{c'}| \ge \Gamma,$$

where $\lambda_c(A)$ denotes the $c$-th top eigenvalue of $A$.

Theorem 1 (Recovery guarantee for online estimation of user classes $U$) Let Assumption 1 hold, and let $\delta \in (0,1)$. If the number of mini-sessions satisfies

$$\frac{n^2}{\sum_{i=1}^n \gamma_i^{-2}} \ge \max\left\{ \frac{2A^6 \log(4A^2/\delta)}{\min\{\Gamma, \sigma_{\min}\}^2},\ \frac{A^9 \big(1 + 10(\tfrac{1}{\Gamma} + \tfrac{1}{\sigma_{\min}})(1 + u_{\max}^3)\big)^2 C^5 \log(4A^3/\delta)}{2 C_1^2 \sigma_{\min}^3} \right\},$$

then with probability at least $1 - 2\delta$, there exists some permutation $\pi \in S_C$ such that for all $c \in [C]$, the output $U_n$ of the EstimateFeatures algorithm satisfies

$$\|u_c - u_{n,\pi(c)}\| \le \diamond\, A^3 \sqrt{\sum_{i=1}^n \gamma_i^{-2}\, \frac{C \log(4A^3/\delta)}{2n^2}}, \tag{1}$$

where $u_c = (u_{a,c})_{a \in [A]}$. Here, the constant (we use the "diamond" symbol to denote it) is

$$\diamond = \left(\frac{CA}{\sigma_{\min}}\right)^{3/2} \left( 13\sqrt{\sigma_{\max}} + 4\sqrt{2\min\{\Gamma, \sigma_{\min}\}} + 5\Big(\frac{\sigma_{\max}}{\Gamma} + 1\Big) \frac{u_{\max}}{\min\{\Gamma, \sigma_{\min}\}} \right) \aleph + \Big(\frac{u_{\max}}{\Gamma} + 1\Big) \frac{\sigma_{\max}}{v_{\min}^2} + 5\left(\sqrt{3/8}\,\sqrt{\sigma_{\max}} + \sqrt{\min\{\Gamma, \sigma_{\min}\}/2}\right) \left(\frac{2CA}{\sigma_{\min}}\right)^3 \frac{\aleph^2}{\min\{\Gamma, \sigma_{\min}\}},$$

with the notation $\aleph = 1 + 10(\tfrac{1}{\Gamma} + \tfrac{1}{\sigma_{\min}})(1 + u_{\max}^3)$.

The proof strategy follows that of Lazaric et al. (2013, Theorem 2) and is detailed in the appendix for clarity. It consists in relating, on the one hand, the estimation errors $e_n^{(2)}$ of $M_2$ and $e_n^{(3)}$ of $M_3$ from Corollary 1 to the condition $\varepsilon \le C_1 \lambda_{\min}/C$, and, on the other hand, relating the reconstruction error on the columns of $U$ to the control on the terms $|\lambda_c - \hat\lambda_{\pi(c)}|$ and $\|\varphi_c - \hat\varphi_{\pi(c)}\|$ coming from Proposition 1. We note that the bound appearing in the condition on the number of mini-sessions is potentially large (due to the terms $A^6$, $C^5$, etc.). This is due to the combination of the RTP method with the importance sampling scheme, and it remains unclear whether the bound can be significantly improved within this framework.


4. Recovering latent mixture distributions ($v_b$): robustness of the OFUL algorithm

In order to recover the weight vectors $v_b \in \mathbb{R}^C$, and thus the matrix $V$, it would be tempting to use again an instance of the RTP method, this time aggregating across actions, i.e., forming a $B \times B$ matrix and a $B \times B \times B$ tensor. Unfortunately, aggregation of elements of $U$ fails for two reasons. First, we do not have different views across users $b$, contrary to what we have for actions $a$; it is thus hopeless to form estimates of the 2nd and 3rd moment tensors as before. Second, and rather technically, convex combinations of the $\{u_{a,c}\}_{a \in [A]}$ need not be positive. This prevents the application of the RTP method, which requires positive weights to work.

We thus consider a different strategy that uses an algorithm designed for linear bandits. However, since the feature matrix $U$ is unknown a priori and can only be estimated, we need to work with perturbed features. A first solution is to propagate the additional error resulting from the error on the features through the standard proof of OFUL. However, this leads to a sub-optimal regret that no longer scales as $\tilde O(\sqrt{T})$ with the time horizon. We overcome this hurdle by showing in Theorem 3 a robustness property of OFUL of independent interest, which aids us in controlling the regret of the overall latent class algorithm (Algorithm 3).

Consider OFUL run with perturbed (not necessarily linearly realizable) rewards. Formally, consider a finite action set $\mathcal{A} = \{1, 2, \ldots, A\}$ and distinct feature vectors $\{\bar u_a \in \mathbb{R}^{C \times 1}\}_{a \in \mathcal{A}}$. Let $\bar U := [\bar u_1\ \bar u_2\ \ldots\ \bar u_A] \in \mathbb{R}^{C \times A}$. The expected reward when playing action $A_t = a$ at time $t$ is denoted by $m_a := \mathbb{E}[Y_t \mid A_t = a]$, with $m := (m_a)_{a \in \mathcal{A}}$. Let us assume that there exists a unique optimal action for the expected rewards $m$, i.e., $\arg\max_{a \in \mathcal{A}} m_a = \{a^\star\}$, with the regret at time $n$ being $R_n := \sum_{t=1}^n (m_{a^\star} - m_{A_t})$. The key point here is that $m$ need not be linearly realizable w.r.t. the actions' features: we will not require that $\min_{v \in \mathbb{R}^C} \|m - \bar U^\top v\|$ be $0$.

Algorithm 2 OFUL (Optimism in the Face of Uncertainty for Linear bandits) (Abbasi-Yadkori et al., 2011)

Require: Arms' features $\bar U$, regularization parameter $\lambda$, norm parameter $R_\Theta$.
for all times $t \ge 1$ do
  1. Form the $C \times (t-1)$ matrix $\bar U_{1:t-1} := [\bar u_{A_1}\ \bar u_{A_2}\ \ldots\ \bar u_{A_{t-1}}]$ consisting of all arm features played up to time $t - 1$, and $Y_{1:t-1} := (Y_1, \ldots, Y_{t-1})$. Set $V_{t-1} := \lambda I + \sum_{s=1}^{t-1} \bar u_{A_s} \bar u_{A_s}^\top$.
  2. Choose the action $A_t \in \arg\max_{a \in \mathcal{A}} \max_{v \in \mathcal{C}_{t-1}} \bar u_a^\top v$, where
  $$\mathcal{C}_{t-1} := \{v \in \mathbb{R}^C : \|v - \hat v_{t-1}\|_{V_{t-1}} \le D_{t-1}\}, \qquad D_{t-1} := R\sqrt{2\log\Big(\frac{\det(V_{t-1})^{1/2}\, \lambda^{-C/2}}{\delta}\Big)} + \lambda^{1/2} R_\Theta, \qquad \hat v_{t-1} := V_{t-1}^{-1} \bar U_{1:t-1} Y_{1:t-1}.$$
end for
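A compact sketch of an OFUL run on fixed features: the inner maximization $\max_{v \in \mathcal{C}_{t-1}} \bar u_a^\top v$ has the closed form $\bar u_a^\top \hat v_{t-1} + D_{t-1}\|\bar u_a\|_{V_{t-1}^{-1}}$, which the code uses directly. The features, parameter values, and gaps below are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
lam, R, R_theta, delta, T = 1.0, 0.1, 1.0, 0.05, 2000

# Known arm features (rows) and an unknown parameter v to be learned.
U_bar = np.array([[0.9, 0.9], [0.5, 0.5], [0.8, 0.1], [0.1, 0.8], [0.3, 0.6]])
v = np.array([0.6, 0.4])
C = U_bar.shape[1]
means = U_bar @ v

V = lam * np.eye(C)                # regularized design matrix V_{t-1}
xy = np.zeros(C)                   # running sum of u_{A_s} Y_s
regret = 0.0

for t in range(T):
    v_hat = np.linalg.solve(V, xy)                     # ridge estimate
    D = R * np.sqrt(2 * np.log(np.sqrt(np.linalg.det(V))
                               * lam ** (-C / 2) / delta)) + np.sqrt(lam) * R_theta
    V_inv = np.linalg.inv(V)
    # Optimistic index per arm: u^T v_hat + D * ||u||_{V^{-1}}.
    ucb = U_bar @ v_hat + D * np.sqrt(np.einsum('ac,cd,ad->a', U_bar, V_inv, U_bar))
    a = int(np.argmax(ucb))
    y = means[a] + R * rng.standard_normal()           # noisy reward
    V += np.outer(U_bar[a], U_bar[a])
    xy += U_bar[a] * y
    regret += means.max() - means[a]

assert regret < 0.15 * T           # far below the linear-in-T worst case
```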

OFUL regret with linearly realizable rewards. The OFUL algorithm is stated as Algorithm 2 for the sake of clarity. Before studying the linearly non-realizable case, we record the well-known regret bound for the unperturbed case, that is, when $\forall a \in [A],\ m_a = \bar u_a^\top v$ for some unknown $v$.

Theorem 2 (OFUL regret (Abbasi-Yadkori et al., 2011)) Assume that $\|v\|_2 \le R_\Theta$, and that for all $a \in \mathcal{A}$, $\|\bar u_a\|_2 \le R_X$ and $|\langle \bar u_a, v \rangle| \le 1$. Then with probability at least $1 - \delta$, the regret of OFUL satisfies, for all $n \ge 0$,

$$R_n \le 4\sqrt{nC \log\big(1 + nR_X^2/(\lambda C)\big)} \left( \lambda^{1/2} R_\Theta + R\sqrt{2\log(1/\delta) + C\log\big(1 + nR_X^2/(\lambda C)\big)} \right),$$

provided that the regularization parameter $\lambda$ is chosen such that $\lambda \ge \max\{1, R_X^2, 1/R_\Theta^2\}$.

Regret of OFUL with perturbed features. We make a structural definition to present the result. Let $\alpha(\bar U) := \max_J \|A_J^{-1}\|_2$, where $A = \begin{bmatrix} \bar U^\top \\ I_C \end{bmatrix} \in \mathbb{R}^{(A+C) \times C}$, $A_J$ is the $C \times C$ submatrix of $A$ formed by picking the rows $J$, and $J$ ranges over all size-$C$ subsets of full-rank rows of $A$. We will require for our purposes that $\alpha(\bar U^\top)$ is not too large. For intuition regarding $\alpha$, we refer to Forsgren (1996) (the final 3 paragraphs of p. 770, Corollary 5.4 and Section 7). We remark that the condition that $\alpha(\bar U^\top)$ be small is analogous to a $\gamma$-incoherence type property commonly used in prior work (Bresler et al., 2014, Assumption A2), stating that two distinct feature vectors $u_c$ and $u_{c'}$, $c \ne c'$, must have a minimum angle separation.

Let $v^\star \in \mathbb{R}^C$ be arbitrary with $\ell_2$ norm at most $R_\Theta$ (it helps to think of $\bar U^\top v^\star$ as an approximation of $m$), $\varepsilon_a := m_a - \bar u_a^\top v^\star$, and $\varepsilon := (\varepsilon_a)_{a \in \mathcal{A}} \in \mathbb{R}^A$. We now state a robustness result for OFUL, potentially of independent interest.

Theorem 3 (OFUL robustness property) Suppose $\|v^\star\|_2 \le R_\Theta$, $\lambda \ge \max\{1, R_X^2, 1/(4R_\Theta^2)\}$, and for all $a \in \mathcal{A}$, $\|\bar u_a\|_2 \le R_X$ and $|m_a| \le 1$. If the deviation from linearity satisfies

$$\|\varepsilon\|_2 \equiv \|m - \bar U^\top v^\star\|_2 < \min_{a \ne a^\star} \frac{\bar u_{a^\star}^\top v^\star - \bar u_a^\top v^\star}{2\alpha(\bar U)\, \|\bar u_{a^\star} - \bar u_a\|_2}, \tag{2}$$

then, with probability at least $1 - \delta$, for all $T \ge 0$,

$$R_T \le 8\rho \sqrt{TC \log\Big(1 + \frac{TR_X^2}{\lambda C}\Big)} \left( \lambda^{1/2} R_\Theta + R\sqrt{2\log\frac{1}{\delta} + C\log\Big(1 + \frac{TR_X^2}{\lambda C}\Big)} \right),$$

where $\rho := \max\Big\{1,\ \max_{a \ne a^\star} \dfrac{m_{a^\star} - m_a}{\bar u_{a^\star}^\top v^\star - \bar u_a^\top v^\star}\Big\}$.

Theorem 3 essentially states that when the deviation of the actual mean reward vector from the subspace spanned by the feature vectors is small, the OFUL algorithm continues to enjoy a favorable $O(\sqrt{T})$ regret, up to a factor $\rho \ge 1$. The quantity $\rho$ in the result is a geometric measure of the distortion in the arms' actual rewards $m$ with respect to the (linear) approximation $\bar U^\top v^\star$. We control this quantity in the next paragraph. (Note that $\rho = 1$ in the perfectly linearly realizable case $\varepsilon = 0$, and this gives back the standard OFUL regret up to a universal multiplicative constant.)

Applying the robust analysis of OFUL to the low-rank bandit setup. In this paragraph, we translate Theorem 3 to our Low-Rank Bandit (LRB) setting, in which OFUL uses feature vectors with noisy perturbations (estimated by, say, a Robust Tensor Power (RTP) algorithm). Throughout this section, we fix a user $b$.

We can now translate Theorem 3 thanks to the correspondence with the perturbed OFUL setting: in our low-rank bandit setting, the matrix $\bar U = \bar U_n$ depends on the reconstruction algorithm at mini-session $n$. Moreover, the optimal action $a^\star \equiv a^\star_b$ now depends on the user $b$. We denote for a user $b \in [B]$ the minimum gap across suboptimal actions $g_b \overset{\text{def}}{=} \min_{a \ne a^\star_b} (u_{a^\star_b} - u_a)^\top v_b$. Likewise, the error vector $\varepsilon$ depends on $b, n$. Its norm $\|\varepsilon\|_2$ appears in condition (2) and in the definition of $\rho$, and is controlled by the reconstruction error of Theorem 1. It decays with the number of mini-sessions $n$.

We define $\alpha_n \overset{\text{def}}{=} \alpha(\bar U_n)$, $\alpha \overset{\text{def}}{=} \alpha(U)$, and use $\max_b \|v_b\|$ for $R_\Theta$. Using these notations, and adapting the proof of Theorem 3 to handle a varying $U_n$, we can now translate the result of the perturbed OFUL to our LRB setting:


OFUL | LRB
$v$ | $v_b \equiv (v_{b,c})_{c \in [C]} \in \mathbb{R}^C$
$\bar U$ | $\bar U_n \in \mathbb{R}^{A \times C}$
$m$ | $m_b \equiv U v_b \in \mathbb{R}^A$
$a^\star$ | $a^\star_b := \arg\max_{a \in [A]} u_a^\top v_b$
$\varepsilon_a$ | $(u_a - \bar u_{n,a})^\top v_b$
$\varepsilon \equiv (\varepsilon_a)_{a \in \mathcal{A}}$ | $(U - \bar U_n) v_b$

Table 1: Correspondences between OFUL and Low-Rank Bandit (LRB) quantities at time $n$ and for user $b$.

Lemma 3 Let $0 < \delta \le 1$ and $b \in [B]$. Provided that the number of mini-sessions $n_0$ satisfies $\frac{n_0^2}{\sum_{i=1}^{n_0} \gamma_i^{-2}} \ge \daleth_{b,\delta}$, where we introduce the notation

$$\daleth_{b,\delta} = \max\left\{ \frac{2A^6 \log(4A^2/\delta)}{\min\{\Gamma, \sigma_{\min}\}^2},\ \frac{A^9 \big(1 + 10(\tfrac{1}{\Gamma} + \tfrac{1}{\sigma_{\min}})(1 + u_{\max}^3)\big)^2 C^5 \log(4A^3/\delta)}{2 C_1^2 \sigma_{\min}^3},\ 32 A^6 C^2 \log(4A^3/\delta) \times \max\Big\{2,\ \frac{8A \|v_b\|_2^2}{g_b^2},\ \frac{27 \alpha^2 C u_{\max}^2 \|v_b\|_2^2}{g_b^2} + \frac{1}{2}\Big\} \right\},$$

then with probability at least $1 - 2\delta$, $\|\varepsilon\|_2 = \|(U - \bar U_n) v_b\|_2$ is small enough that for any $n \ge n_0$, condition (2) is satisfied. Consequently, Theorem 3 applies with

$$R_\Theta = \max_b \|v_b\|_2, \qquad R_X = \max_{a \in \mathcal{A}} \|u_a\|_2 + \frac{\sqrt{A}}{2\alpha}, \qquad \text{and} \qquad \rho \equiv \rho_{n,b} \le 2.$$

Thus, provided that the total number of mini-sessions of interaction (not necessarily corresponding to interactions with user $b$) is large enough, the OFUL algorithm run during interactions with user $b$ will achieve a controlled regret. We warn, however, that the constant $\daleth_{b,\delta}$ resulting from the RTP method, especially the second term of the max, may be large, although it is a constant.

5. Putting it together: Online Recommendation algorithm

This section details our main contributions for recommendations in the context of mini-sessions of interactions with unknown mixtures of latent profiles: first Algorithm 3, which combines RTP with OFUL, and then a regret analysis in Theorem 4.

The recommendation algorithm we propose (Algorithm 3) uses the RTP method to estimate the matrix $U$ and then applies OFUL to determine an optimistic action. Importantly, it finally outputs a distribution that mixes the optimistic action with uniform exploration. The mixture coefficient goes to $0$ with the number of rounds, thus converging to playing OFUL only. It ensures that the importance sampling weights are bounded away from $0$ in the beginning.

Main analytical result: Regret bound


Theorem 4 (Regret of Algorithm 3) With Assumption 1 holding, let $\delta \in (0,1)$, $\daleth_\delta = \max_{b \in [B]} \daleth_{b,\delta}$ (from Lemma 3), and let $n_0$ be the first mini-session at which $\frac{n_0^2}{\sum_{i=1}^{n_0} \gamma_i^{-2}} \ge \daleth_\delta$. The regret of Algorithm 3 at time $T = N\ell$ (acting for $N$ mini-sessions of length $\ell$), using internal instances of OFUL parameterized by $\delta > 0$, satisfies

$$\mathbb{E}[R_T] \le 16\sqrt{BTC \log\Big(1 + \frac{TR_X^2}{\lambda C}\Big)} \left( \lambda^{1/2} R_\Theta + R\sqrt{2\log\frac{1}{\delta} + C\log\Big(1 + \frac{TR_X^2}{\lambda C}\Big)} \right) + \ell\Big(n_0 - 1 + \sum_{n=n_0}^N \gamma_n\Big) + 3\delta T,$$

provided that $\lambda \ge \max\{1, R_X^2, 1/R_\Theta^2\}$, with $R_\Theta \ge \max_b \|v_b\|_2$ and $R_X \ge \max_{a \in \mathcal{A}} \|u_a\|_2 + \frac{\sqrt{A}}{2\alpha}$. Consequently, choosing $\delta = 1/T$ and $\gamma_n = \sqrt{\log(n+1)/n}$, $n \in \mathbb{N}$, say, yields the order $\mathbb{E}[R_T] = O\big(C\sqrt{BT\log T}\big)$.

BTlogT . Discussion. (1) The regret of Algorithm3scales withTsimilar to that of an OFUL algorithm

run with perfect knowledge of the feature matrixU:O(C˜ √

BT). This is a non-trivial result asU is not assumed to be known a priori and is estimated by Algorithm3using tensor methods.

Algorithm 3 Per-user OFUL with exploration

Require: Parameters $\lambda$, $R_\Theta$ for OFUL, exploration rate parameters $\gamma_n$, $n \ge 1$.
1: for mini-session $n = 1, \ldots, N$ do
2:   Get user $b_n$.
3:   Draw $p_n \in \{0, 1\}$ with $\mathbb{P}(p_n = 0) = \gamma_n$.
4:   if $p_n = 0$ then
5:     {Carry out an ESTIMATE mini-session}
6:     for step $k = 1, 2, \ldots, \ell$ do
7:       Output $a_{n,k} \sim \text{Uniform}([A])$.
8:     end for
9:     Let $U_n = $ EstimateFeatures (Algorithm 1) with input $(b_i, a_{i,l}, X_{a_{i,l}, b_i, i, l})_{1 \le i \le n,\ 1 \le l \le \ell,\ p_i = 0}$. {Update feature estimates using samples from previous ESTIMATE mini-sessions}
10:  else
11:    {Carry out an OFUL mini-session}
12:    for step $k = 1, 2, \ldots, \ell$ do
13:      Run one iteration of OFUL (Algorithm 2) with features $U_n$, parameters $\lambda$ and $R_\Theta$, and past actions and rewards $(a_{i,l}, X_{a_{i,l}, b_i, i, l})$, $1 \le i < n$, $1 \le l \le \ell$, for which $p_i = 1$ and $b_i = b_n$. {An instance of OFUL for each user, using current feature estimates and observed actions and rewards from previous OFUL mini-sessions}
14:      Output the action $a_{n,k}$ returned by OFUL.
15:    end for
16:  end if
17: end for

(2) One can also compare the result with the regret of ignoring the mixture (low-rank) structure and simply running an instance of UCB per user, which would scale as $O(\sqrt{ABT})$. This becomes highly suboptimal when the number of actions/items $A$ is much larger than the number of user types $C$, demonstrating the gain from leveraging the mixed linear structure of the problem. Note also that we do not need a specific user to interact for a long time, but only for as few as $\ell \ge 3$ consecutive steps, contrary for instance to the transfer method (Lazaric et al., 2013), where a large number of consecutive interaction steps with the same user is required.

(3) It is worthwhile to contrast the result and approach with that of Djolonga et al. (2013): the authors there incur an additional regret term due to the error in approximately estimating the low-rank matrix, which requires additional tuning, ending up with a regret of $O(T^{4/5})$. On the other hand, we avoid this approximation error by showing and exploiting the robustness property of OFUL, which guarantees $\sqrt{T}$ regret as soon as the estimated features $\tilde U$ are within a small radius of the actual ones.

The result (and analysis) does come with a caveat: the model-dependent term $\daleth_\delta$, although independent of the time horizon $T$, is potentially large. With $\gamma_n$ set as in Theorem 4, it appears as an additive exponential constant term in the regret.^7 This arises from the RTP method, and it is currently unclear whether this term can be significantly reduced with the current line of analysis. Numerical evidence, however, indicates that no such large additive constant enters the regret (Section 5). Also, on the bright side, note that $\daleth_\delta$ does not need to be known by the algorithm.

Numerical Results. The performance of the low-rank bandit strategy (Algorithm 3) is shown in Figure 1, simulated for 20 users arriving uniformly at random, 3 user classes, and 200 actions. Both the latent class matrix $U_{200 \times 3}$ and the mixture matrix $V_{20 \times 3}$ are random one-shot instantiations. The proposed algorithm (Algorithm 3), with two different exploration rate schedules $\tilde O(n^{-1/2})$ and $\tilde O(n^{-1/3})$ ('RTP+OFUL(sqrt)' and 'RTP+OFUL(cuberoot)' in the figure), is compared with (a) basic UCB ('UCB' in the figure) ignoring the linear structure of the problem (i.e., UCB per user with 200 actions), (b) OFUL per user with complete knowledge of the user classes and $p_n = 1$ always, i.e., no exploration mini-sessions, and (c) an implementation of the Alternating Least Squares estimator (Takács and Tikk, 2012; Mary et al., 2014) for the matrix $U$, along with OFUL per user. The proposed algorithm, with the theoretically suggested exploration $\tilde O(n^{-1/2})$, is observed to exploit the latent structure considerably better than simple UCB, and is not too far from the unrealistic OFUL strategy, which enjoys the luxury of latent class information. It is also competitive with Alternating Least Squares, which does not come with analytically sound performance guarantees in the bandit learning setting. Also, the large additive constants in the theoretical bounds for Algorithm 3 do not manifest here.

Related work. The popular low-rank matrix completion problem studies the recovery of U and V given a small number of entries sampled at random from UVᵀ, with both U and V being tall matrices; see for instance Jain et al. (2013) and citations therein. However, its setting is different from ours for several reasons. It typically deals with batch data arising from a sampling process that is not active but uniform across entries of UVᵀ. Further, it requires sensing operators with strong properties (such as the RIP property), and, most importantly, the performance metric is not regret but reconstruction error (Frobenius or 2-norm).

In the linear bandit literature (Abbasi-Yadkori et al., 2011; Rusmevichientong and Tsitsiklis, 2010; Dani et al., 2008), the key constraining assumption is that either user-side (V) or item-side (U) features are precisely and completely known a priori. In contrast, the problem of low-regret recommendation across users with latent mixtures does not afford us the luxury of knowing either U or V, and so they must be learnt “on the fly”. Another related work in the context of bandit-type schemes for latent mixture model recommender systems is that of Bresler et al. (2014), in which, under the very specific uniform mixture model for all users, they exhibit strategies with good regret.
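To illustrate the one-sided setting that linear bandits address, here is a hedged sketch of a LinUCB/OFUL-style loop for a single user when the item features U are known, so only the user's mixture vector must be learnt. The regularization and confidence-width constants are illustrative, not the exact radii from the papers cited above:

```python
import numpy as np

rng = np.random.default_rng(2)

# One-sided uncertainty: the item-side features U are known, the user's
# mixture vector theta_star is not. Constants lam and beta are illustrative.
C, A = 3, 15
U = rng.uniform(size=(A, C))            # known item features (rows = actions)
theta_star = rng.dirichlet(np.ones(C))  # unknown user mixture over classes
lam, beta = 1.0, 1.0

Vmat = lam * np.eye(C)                  # regularized Gram matrix (not the mixture matrix V)
bvec = np.zeros(C)
regret = 0.0
for t in range(500):
    theta_hat = np.linalg.solve(Vmat, bvec)   # ridge estimate of the mixture
    Vinv = np.linalg.inv(Vmat)
    # Optimistic index: estimated reward plus an ellipsoidal confidence width
    # sqrt(x^T Vmat^{-1} x) for each action's feature row x.
    width = np.sqrt(np.einsum('ac,cd,ad->a', U, Vinv, U))
    a = int(np.argmax(U @ theta_hat + beta * width))
    r = U[a] @ theta_star + rng.normal(0, 0.1)
    Vmat += np.outer(U[a], U[a])
    bvec += r * U[a]
    regret += (U @ theta_star).max() - U[a] @ theta_star
print(regret)
```

The two-sided problem studied in this paper removes the assumption that U is known, which is why the estimate from the RTP method must be fed into such a loop.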

Nguyen et al. (2014) consider an alternating minimization type scheme in linear bandit models with two-sided uncertainty (an alternative model involving latent “factors”). However, no rigorous guarantees are given for the bandit schemes they present; moreover, it is not known if alternating minimization finds global minima in general. Another related work is in the transfer learning setting of Lazaric et al. (2013).

7. With additional prior knowledge of γn, the dependence of the additive term can be made polynomial in 9δ: choosing γn = min{1, √(9δ/n)}, it holds that ℓ(n0 − 1 + Σ_{n=n0}^{N} γn) ≤ 2√(9δ ℓ T) + ℓ.
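The footnote's bound follows from a standard sum comparison. The sketch below is an assumption-laden reconstruction: it writes c for the model-dependent constant (printed as 9δ above) and assumes the N mini-sessions each cost at most ℓ rounds, with T ≥ ℓN:

```latex
% Assumptions: \gamma_n = \min\{1, \sqrt{c/n}\}, n_0 is the first index with
% \gamma_n < 1 (so \gamma_n = 1 for all n < n_0), and T \ge \ell N.
\ell\Bigl(n_0 - 1 + \sum_{n=n_0}^{N} \gamma_n\Bigr)
  = \ell \sum_{n=1}^{N} \gamma_n
  \le \ell \sum_{n=1}^{N} \sqrt{\frac{c}{n}}
  \le 2\ell\sqrt{cN}
  = 2\sqrt{c\,\ell\,(\ell N)}
  \le 2\sqrt{c\,\ell\,T},
% using \sum_{n=1}^{N} n^{-1/2} \le 2\sqrt{N}; the additive \ell in the stated
% bound absorbs the rounding of session boundaries.
```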

[Figure 1: plot of expected regret versus number of iterations (3 × number of mini-slots), with curves for OFUL, ALS+OFUL, UCB, RTP+OFUL(cuberoot), and RTP+OFUL(sqrt).]

Figure 1: Regret of the proposed algorithm (‘RTP+OFUL’ or Algorithm 3) for two different exploration rate schedules, compared with (a) independent UCB per-user, (b) OFUL per-user with perfect knowledge of latent classes U, and (c) Alternating Least Squares estimation for the matrix U, along with OFUL per-user. Here, B = 20 users, C = 3 classes, and A = 200, with randomly generated U and V. Plots show the sample mean of cumulative regret with time, with 1 standard-deviation error bars over 10 sample experiments.

The transfer-learning method of Lazaric et al. (2013) essentially combines the RTP method (Anandkumar et al., 2014b, 2012) with a standard UCB (Auer et al., 2002), but works in the setting of a large number of interactions with the same user, without assuming access to “user ids”. As a result, the regret bound in this setting scales linearly with the number of rounds. Our result in this paper shows that with additional access to just user identifiers, we can reduce the regret rate to be sublinear in time.

The RTP method has been used as a preprocessing step for the EM algorithm in crowdsourcing (Zhang et al., 2014), but only convergence properties are considered there, which is not enough to provide regret guarantees.

On the theoretical side, our contribution generalizes the setting of clustered bandits (Maillard and Mannor, 2014; Gentile et al., 2014), in which a hard clustering model is assumed (one user is assigned to one class, or equivalently mixture distributions can only have support size 1). In particular, Maillard and Mannor (2014) specifically highlight the benefit of a collaborative gain across users against using a vanilla UCB for each user. However, their setting is less general than assuming a soft clustering of users (one user corresponds to a mixture of classes) across various “representative” taste profiles, as we study here.

The Alternating Least-Squares (ALS) method (Takács and Tikk, 2012; Mary et al., 2014) has been shown to yield promising experimental results in similar settings where both U and V are unknown. However, no theoretical guarantees are known for this algorithm, which may converge to a local optimum in general.
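For concreteness, here is a minimal ALS sketch on a fully observed low-rank matrix. The sizes, regularization, and iteration count are illustrative; the bandit setting above additionally involves partial and non-uniform observations, which is part of why no guarantees are known there:

```python
import numpy as np

rng = np.random.default_rng(1)

# Minimal Alternating Least Squares sketch: factor an observed reward matrix
# R ~ U V^T of rank C by alternately solving two ridge-regularized least squares.
A, B, C = 30, 10, 3                     # illustrative sizes, not the paper's
U_true = rng.normal(size=(A, C))
V_true = rng.normal(size=(B, C))
R = U_true @ V_true.T                   # exactly rank-C target

lam = 1e-3
U = rng.normal(size=(A, C))             # random initialization
V = rng.normal(size=(B, C))
for _ in range(50):
    # Fix V, solve min_U ||R - U V^T||_F^2 + lam ||U||_F^2 in closed form:
    U = R @ V @ np.linalg.inv(V.T @ V + lam * np.eye(C))
    # Fix U, solve the symmetric problem for V:
    V = R.T @ U @ np.linalg.inv(U.T @ U + lam * np.eye(C))

err = np.linalg.norm(R - U @ V.T) / np.linalg.norm(R)
print(err)
```

On a fully observed matrix each half-step has a closed-form solution; with the masked, arrival-driven observations of the bandit setting the subproblems become weighted least squares and the local-optimum caveat discussed above applies.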

The work of Valko et al. (2014) studies stochastic bandits with a linear model over a low-rank (graph Laplacian) structure. However, they assume complete knowledge of the graph, and hence of the eigenvectors of the Laplacian, converting it into a bilinear problem with only one-sided uncertainty. This is in contrast to our setup where both U and V are completely uncertain.


Perhaps the closest work to ours is that of Djolonga et al. (2013), where the authors develop a flexible approach for bandit problems in high dimension but with low-dimensional reward dependence. They use a two-phase algorithm: first, a low-rank matrix completion technique (the Dantzig selector) estimates the feature-reward map; then, a Gaussian Process-UCB (GP-UCB) bandit algorithm controls the regret. They show that if, after n iterations, the approximation error between the feature matrix and its estimate is less than η, the final regret is given by the sum of the regret of GP-UCB when given perfect knowledge of the features and of n + η(T − n) (due to the learning phase and approximation error). This results in an overall regret scaling as O(T^{4/5}). We depart from their results in two fundamental ways. Firstly, they have the possibility of uniformly sampling the entries (a common assumption in low-rank matrix completion techniques); we do not have this luxury in our setting, as we do not control the process of user arrivals, which is not constrained to be uniform. Secondly, we prove and exploit a novel robustness property (see Theorem 8) of the bandit subroutine we use (OFUL in our case instead of GP-UCB), which allows us to effectively eliminate the approximation error in their work and obtain a O(√T) regret bound (see Theorem 4).

6. Conclusion & Directions

We consider a full-blown latent class mixture model in which users are described by unknown mixtures across unknown user classes, more general and challenging than when users are assumed to fall perfectly in one class (Gentile et al., 2014; Maillard and Mannor, 2014).

We provide the first provable sublinear regret guarantees in this setting, when both the canonical classes and the user mixture weights are completely unknown, which we believe is striking when compared to existing work in the setting: alternating minimization, for instance, typically gets stuck in local minima. We currently use a combination of noisy tensor factorization and linear bandit techniques, and control the uncertainty in the estimates resulting from each of these techniques. This enables us to effectively recover the latent class structure.

Future directions include reducing the numerical constant (e.g., using an alternative to RTP), and studying how to combine our work with the aggregation of user parameters suggested in Maillard and Mannor (2014).

References

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Proc. NIPS, pages 2312–2320, 2011.

Animashree Anandkumar, Daniel Hsu, and Sham M. Kakade. A method of moments for mixture models and hidden Markov models. arXiv preprint arXiv:1203.0683, 2012.

Animashree Anandkumar, Rong Ge, Daniel Hsu, and Sham M Kakade. A tensor approach to learning mixed membership community models. Journal of Machine Learning Research, 15(1):2239–2312, 2014a.

Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. J. Mach. Learn. Res., 15(1):2773–2832, January 2014b.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.

Guy Bresler, George H Chen, and Devavrat Shah. A latent source model for online collaborative filtering. In Proc. NIPS 27, pages 3347–3355. Curran Associates, Inc., 2014.

Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Proc. COLT, 2008.


Josip Djolonga, Andreas Krause, and Volkan Cevher. High-dimensional Gaussian process bandits. In Proc. NIPS, pages 1025–1033, 2013.

Anders Forsgren. On linear least-squares problems with diagonally dominant weight matrices. SIAM Journal on Matrix Analysis and Applications, 17(4):763–788, 1996.

Claudio Gentile, Shuai Li, and Giovanni Zappella. Online clustering of bandits. In Proc. ICML, pages 757–765, 2014.


Thomas Hofmann and Jan Puzicha. Latent class models for collaborative filtering. In IJCAI, volume 99, pages 688–693, 1999.

Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating mini- mization. In Proc. ACM Symposium on Theory Of computing (STOC), pages 665–674. ACM, 2013.

Jon Kleinberg and Mark Sandler. Using mixture models for collaborative filtering. In Proc. ACM Symposium on Theory Of Computing (STOC), pages 569–578. ACM, 2004.

Tomáš Kocák, Gergely Neu, Michal Valko, and Rémi Munos. Efficient learning by implicit exploration in bandit problems with side observations. In Proc. NIPS, pages 613–621, 2014.

Alessandro Lazaric, Emma Brunskill, et al. Sequential transfer in multi-armed bandit with finite set of models. In Proc. NIPS, pages 2220–2228, 2013.

Odalric-Ambrym Maillard and Shie Mannor. Latent bandits. In Proc. ICML, pages 136–144, 2014.

Jérémie Mary, Romaric Gaudel, and Philippe Preux. Bandits warm-up cold recommender systems. arXiv preprint arXiv:1407.2806, 2014.

Hai Thanh Nguyen, Jérémie Mary, and Philippe Preux. Cold-start problems in recommendation systems via contextual-bandit algorithms. arXiv preprint arXiv:1405.7544, 2014.

Paat Rusmevichientong and John N Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.

Gilbert W. Stewart and Ji-guang Sun. Matrix Perturbation Theory, volume 175. Academic Press, New York, 1990.

Ilya Sutskever, Joshua B. Tenenbaum, and Ruslan R Salakhutdinov. Modelling relational data using Bayesian clustered tensor factorization. In Proc. NIPS, pages 1821–1828. Curran Associates, Inc., 2009.

Gábor Takács and Domonkos Tikk. Alternating least squares for personalized ranking. In ACM Conference on Recommender Systems, 2012.

Michal Valko, Rémi Munos, Branislav Kveton, and Tomáš Kocák. Spectral bandits for smooth graph functions. In Proc. ICML, 2014.

Yuchen Zhang, Xi Chen, Dengyong Zhou, and Michael I Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In Proc. NIPS, pages 1260–1268, 2014.
