
HAL Id: hal-01400318

https://hal.archives-ouvertes.fr/hal-01400318

Preprint submitted on 21 Nov 2016

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Low-rank Bandits with Latent Mixtures

Aditya Gopalan, Odalric-Ambrym Maillard, Mohammadi Zaki

To cite this version:

Aditya Gopalan, Odalric-Ambrym Maillard, Mohammadi Zaki. Low-rank Bandits with Latent Mixtures. 2016. ⟨hal-01400318⟩


arXiv:1609.01508v1 [cs.LG] 6 Sep 2016

Low-rank Bandits with Latent Mixtures

Aditya Gopalan^1 (ADITYA@ECE.IISC.ERNET.IN)

Odalric-Ambrym Maillard^2 (ODALRICAMBRYM.MAILLARD@INRIA.FR)

Mohammadi Zaki^1 (ZAKI@ECE.IISC.ERNET.IN)

1. Electrical Communication Engineering, Indian Institute of Science, Bangalore 560012, India

2. Inria Saclay – Île-de-France, Laboratoire de Recherche en Informatique, 660 Claude Shannon, 91190 Gif-sur-Yvette, France

Abstract

We study the task of maximizing rewards from recommending items (actions) to users sequentially interacting with a recommender system. Users are modeled as latent mixtures of $C$ representative user classes, where each class specifies a mean reward profile across actions. Both the user features (mixture distribution over classes) and the item features (mean reward vector per class) are unknown a priori. The user identity is the only contextual information available to the learner while interacting. This induces a low-rank structure on the matrix of expected rewards $r_{a,b}$ from recommending item $a$ to user $b$. The problem reduces to the well-known linear bandit when either user- or item-side features are perfectly known. In the setting where each user, with its stochastically sampled taste profile, interacts only for a small number of sessions, we develop a bandit algorithm for the two-sided uncertainty. It combines the Robust Tensor Power Method of Anandkumar et al. (2014b) with the OFUL linear bandit algorithm of Abbasi-Yadkori et al. (2011). We provide the first rigorous regret analysis of this combination, showing that its regret after $T$ user interactions is $\tilde{O}(C\sqrt{BT})$, with $B$ the number of users. An ingredient towards this result is a novel robustness property of OFUL, of independent interest.

Keywords: Multi-armed bandits, online learning, low-rank matrices, recommender systems, reinforcement learning.

1. Introduction

Recommender systems aim to provide targeted, personalized content recommendations to users by learning their responses over time. The underlying goal is to be able to predict which items a user might prefer based on preferences expressed by other related users and items, also known as the principle of collaborative filtering.

A popular approach to model preferences expressed by users in recommender systems is via probabilistic mixture models or latent class models (Hofmann and Puzicha, 1999; Kleinberg and Sandler, 2004). In such a mixture model, we have a set of $A$ items (content) that can be recommended to $B$ users (consumers). Whenever item $a$ is recommended to user $b$, the system gains an expected reward of $r_{a,b}$. The key structural assumption that captures the relationship between users' preferences is that there exists a latent set of $C$ representative user types or typical taste profiles.


Formally, each taste profile $c$ is a unique vector $u_c \equiv (u_{a,c})_a$ of the expected rewards that every item $a$ elicits under the taste profile. Each user $b$ is assumed to sample one of the typical profiles randomly using an individual probability distribution $v_b \equiv (v_{b,c})_c$; its reward distribution across the items subsequently becomes that induced by the sampled profile.
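As a concrete illustration of this generative model, the following sketch simulates one interaction. The toy dimensions, the Dirichlet-distributed mixtures, and the Gaussian noise level are arbitrary choices for illustration, not taken from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = 20, 5, 3                 # items, users, latent classes (toy sizes)
U = rng.uniform(0, 1, (A, C))      # U[a, c]: mean reward of item a under class c
V = rng.dirichlet(np.ones(C), B)   # V[b]: user b's mixture distribution over classes

def interact(b, a, noise=0.1):
    """One interaction: user b re-samples a taste profile c ~ v_b,
    then item a yields reward u_{a,c} plus centered noise."""
    c = rng.choice(C, p=V[b])
    return U[a, c] + noise * rng.standard_normal()

# The matrix of expected rewards r_{a,b} = (U V^T)[a, b] has rank at most C.
R = U @ V.T
assert np.linalg.matrix_rank(R) <= C
```

Only the user index $b$ is revealed to the learner; both $U$ and $V$ are unknown and must be estimated online.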

Our focus is to address the sequential optimization of the net reward gained by the recommender, without any prior knowledge of either the latent user classes or the users' mixture distributions. Assuming that users arrive to the system repeatedly following an unknown stochastic process and re-sample their profiles over time according to their respective unknown mixtures across latent classes, we seek online learning strategies that achieve low regret relative to the best single item that can be recommended to each user. Note that this is qualitatively different from the task of estimating latent classes or user mixtures in a batch fashion, which is by now well studied (Sutskever et al., 2009; Anandkumar et al., 2014a,b); the task of simultaneously optimizing net utility in a bandit fashion in complex mixture models like these has received little or no analytical treatment. Our work takes a step towards filling this void.

An especially challenging aspect of online learning in recommender systems is the relatively meager number of available interactions with the same user, which is offset to an extent by the assumption that users can only have a limited number of taste profiles (classes). Indeed, if one can identify the class to which a certain user belongs and aggregate information from all other users in that class, then one can recommend to the user the best item for the class. In practice, classes are latent and not necessarily known in advance, and several works (Gentile et al., 2014; Lazaric et al., 2013; Maillard and Mannor, 2014) study the restricted situation in which each user always belongs to one specific class (i.e., when all mixture distributions have support size 1). We go two steps further, since in many situations (a) users cannot be assumed to belong to one class only, such as when a user account is shared by several individuals (e.g., a smart TV), and (b) the duration of a user session, that is, the number of consecutive recommendations to the same individual connected to a user account, cannot be assumed to be long.^1

The key challenges that this work addresses are (1) the lack of knowledge of "features" on both the user side and the item side in a linear bandit problem (in this case, both the user mixture weights and the item class reward profiles), and (2) provable regret minimization with very few, i.e., $O(1)$, interactions with every user $b$ having a specific taste profile, as opposed to a large number of interactions such as in transfer learning (Lazaric et al., 2013).

Contributions and overview of results. We consider a setting where users are assumed to come from arbitrary mixtures across classes (they are not assumed to fall perfectly in one class, as was the assumption in the works of Gentile et al. (2014); Maillard and Mannor (2014)). We develop a novel bandit algorithm (Algorithm 3) that combines (a) the Optimization in the Face of Uncertainty Linear bandit (OFUL) algorithm (Abbasi-Yadkori et al., 2011) for bandits with known action features, and (b) a variant of the Robust Tensor Power (RTP) algorithm (Anandkumar et al., 2014b) that uses only bandit (partial) estimates of latent user classes with observations coming from a mixture model. More specifically, we introduce a subroutine (Algorithm 1) that makes use of the RTP method to extract item-side attributes ($U$) and, contributing to its theoretical analysis, show a recovery property (Theorem 1). Note that the RTP method ideally requires (unbiased) estimates of the 2nd- and 3rd-order moments of actions' rewards, but with bandit information the learner can access only partial reward information, i.e., a single reward sample from an action. To overcome this, we devise an importance sampling scheme across 3 successive time instants to build the 2nd- and 3rd-order moment tensor estimates that RTP uses. For the task of issuing recommendations, we develop an algorithm (Section 4), essentially based on OFUL, instantiated per user, using for each $a$ the estimated latent class vectors $\{u_{a,c}\}_c$ (obtained via the RTP subroutine) as arm features, and the uncertain parameter vector to be learned $v_b$.

1. It is also unlikely to be very short, say, less than 3.


We carry out a rigorous analysis of the algorithm and show that it achieves regret $\tilde{O}(\ell C\sqrt{BT})$ in $T$ rounds of interaction (Theorem 4), provided each arriving user interacts with the system for $\ell > 3$ rounds with the same profile. In comparison, the regret of the strategy that completely disregards the latent mixture structure of rewards and employs a standard bandit strategy (e.g. UCB (Auer et al., 2002)) per user scales as $O(B\sqrt{TA/B}) = O(\sqrt{ABT})$ after $T$ rounds^2, which is considerably suboptimal in the practical case with a very large number of items but very few representative user classes ($C \ll A$). It is also worth noting that the regret bound we achieve, order-wise, is what would result from applying OFUL or any optimal linear bandit algorithm assuming a priori knowledge of all latent user classes $\{u_{a,c}\}_{a,c}$, that is, $\tilde{O}(\ell C\sqrt{BT})$. In this sense, our result shows that one can simultaneously estimate features on both sides of a bilinear reward model and achieve regret performance equivalent to that of a one-sided linear model, which is the first result of its kind to the best of our knowledge^3. Our results are presented for finite time horizons with explicit details of the constants arising from the error analysis of RTP, which at this point are large but possibly improvable.

En route to deriving the regret for our algorithm, we also make a novel contribution that advances the theoretical understanding of OFUL, and which is of independent interest. We show that in the standard linear bandit setting, where the expected reward of an arm depends linearly on $d$ features, OFUL yields (sub-linear) $\rho d\sqrt{T}$ regret even when it makes decisions based on perturbed or inexact feature vectors (Theorem 3), where $\rho$ quantifies the distortion. This property holds whenever the perturbation error is small enough, and we explicitly give both (a) a sufficient condition on the size of the perturbation in terms of the set of actual features, and (b) a bound on the (multiplicative) distortion $\rho$ in the regret due to the perturbation (note that $\rho = 1$ in the ideal linear case).

2. Setup and notation

For any positive integer $n$, $[n]$ denotes the set $\{1, 2, \ldots, n\}$.

At each $n \in \mathbb{N}$, nature selects a user $b_n \in [B]$ according to the probability distribution $\beta$ over $[B]$, independent of the past, and $b_n$ is revealed to the learner. A user class $c_n$ is subsequently sampled from the probability distribution $v_{b_n}$ over $[C]$, and $c_n$ (the assumed class of user $b_n$) interacts with the learner for the next $\ell > 3$ consecutive steps. Such an interaction will often be termed a mini-session.

In each step $l \in [\ell]$ of a mini-session, the learner plays an action (issues a recommendation) $a_{n,l} \in [A]$ and subsequently receives the reward $Y_{n,l} = u_{a_{n,l},c_n} + \eta_{n,l}$, where $\eta_{n,l}$ is a (centered) $R$-sub-Gaussian i.i.d. random variable independent of $a_{n,l}, c_n$, representing the noise in the reward. We let $u_a \in \mathbb{R}^C$ represent the vector $(u_{a,c})_{c \in [C]}$ of the mean rewards from action $a$ in each class. Note that $\mathbb{E}[u_{a_{n,l},c_n} \mid a_{n,l}] = \mathbb{E}[u_{a_{n,l}}^\top v_{b_n} \mid a_{n,l}]$. For convenience, we use the index notation $t \equiv (n, l)$ and introduce $T = N\ell$, where $N$ is the total number of mini-sessions and $T$ the total number of interactions of the learner with the system. We likewise denote $Y_t, a_t, c_t, \eta_t$ for $Y_{n,l}, a_{n,l}, c_n, \eta_{n,l}$, and let $u_{\max} \overset{\text{def}}{=} \max_{a \in [A], c \in [C]} |u_{a,c}|$.

We are interested in designing an online recommendation strategy, i.e., one that plays actions depending on past observations, achieving low (cumulative) regret after $T \equiv (N, \ell)$ mini-sessions, defined as $R_T \overset{\text{def}}{=} \sum_{n \in [N], l \in [\ell]} r_{n,l}$, where $r_{n,l} \overset{\text{def}}{=} \max_{a \in [A]} u_a^\top v_{b_n} - u_{a_{n,l}}^\top v_{b_n}$. In other words, we wish to compete against a strategy that plays for every user an action yielding the highest reward in expectation under its mixture distribution over user classes.

2. Roughly, each per-user UCB plays from a pool of $A$ actions for about $T/B$ rounds, thus suffering regret $O(\sqrt{A(T/B)})$.

3. An earlier result of Djolonga et al. (2013) gets $O(T^{4/5})$ regret while moreover assuming perfect control of the sampling process (we cannot assume this due to the user arrivals).


3. Recovering latent user classes: The EstimateFeatures subroutine

In this section, we provide an estimation algorithm for the matrix $U$, using the RTP method.^4

Estimation of tensors. We assume that in mini-session $n$, when interacting with user $b_n$, the triplet $\{a_{n,l}\}_{l \le \ell}$ is chosen from a distribution $p_n(a, a', a'' \mid b_n)$. Letting $X_{a_{n,l}, b_n, n, l} \overset{\text{def}}{=} Y_{n,l} = u_{a_{n,l}, c_n} + \eta_{n,l}$ to explicitly indicate the active user and action chosen at $(n, l)$, we form the importance-weighted estimates

$$\tilde r_{a,a',n} \overset{\text{def}}{=} \frac{1}{n} \sum_{i=1}^n \frac{X_{a_{i,1}, b_i, i, 1}\, X_{a_{i,2}, b_i, i, 2}}{p_i(a, a' \mid b_i)}\, \mathbb{I}\{a_{i,1} = a,\ a_{i,2} = a'\},$$

$$\tilde r_{a,a',a'',n} \overset{\text{def}}{=} \frac{1}{n} \sum_{i=1}^n \frac{X_{a_{i,1}, b_i, i, 1}\, X_{a_{i,2}, b_i, i, 2}\, X_{a_{i,3}, b_i, i, 3}}{p_i(a, a', a'' \mid b_i)}\, \mathbb{I}\{a_{i,1} = a,\ a_{i,2} = a',\ a_{i,3} = a''\}$$

for the second- and third-order tensors.^5
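To make the importance weighting concrete, here is a minimal sketch of the second-order estimate under uniform 3-wise sampling, i.e., $p_i(a, a', a'' \mid b_i) = 1/A^3$, so that any ordered pair has probability $1/A^2$. The toy sizes and the stand-in mean vector are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)
A, n = 4, 20000
mean = rng.uniform(0, 1, A)        # stand-in for the expected reward of each action

M2_hat = np.zeros((A, A))
for i in range(n):
    a1, a2, _ = rng.integers(A, size=3)    # uniform 3-wise proposal
    x1 = mean[a1] + 0.1 * rng.standard_normal()
    x2 = mean[a2] + 0.1 * rng.standard_normal()
    # Only the observed pair contributes, reweighted by 1/p_i(a1, a2) = A^2,
    # which makes each entry an unbiased estimate of E[X_a X_a'].
    M2_hat[a1, a2] += x1 * x2 * A**2
M2_hat /= n

# With independent noise across the two steps, E[X_a X_a'] = mean[a] * mean[a'].
err = np.abs(M2_hat - np.outer(mean, mean)).max()
assert err < 0.25
```

The third-order estimate is built the same way over observed triples, with weight $1/p_i(a, a', a'') = A^3$.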

We introduce the matrices $\hat M_{n,2} \equiv (\tilde r_{a,a',n})_{a,a' \in [A]}$ and $M_2 \equiv (m_{a,a'})_{a,a' \in [A]}$ with $m_{a,a'} \overset{\text{def}}{=} \mathbb{E}[\tilde r_{a,a',n}]$, and the tensors $\hat M_{n,3} \equiv (\tilde r_{a,a',a'',n})_{a,a',a'' \in [A]}$ and $M_3 \equiv (m_{a,a',a''})_{a,a',a'' \in [A]}$ with $m_{a,a',a''} \overset{\text{def}}{=} \mathbb{E}[\tilde r_{a,a',a'',n}]$. The following result decomposes the matrix $M_2$ and the tensor $M_3$ as weighted sums of outer products.

Lemma 1 When the user arrivals are i.i.d. according to the law $\beta$, i.e., $b_i \overset{\text{i.i.d.}}{\sim} \beta$ for all $i \in [n]$, it holds that

$$m_{a,a'} = \sum_{c \in [C]} v_{\beta,c}\, u_{a,c}\, u_{a',c}, \qquad \text{and} \qquad m_{a,a',a''} = \sum_{c \in [C]} v_{\beta,c}\, u_{a,c}\, u_{a',c}\, u_{a'',c}.$$
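The decomposition of Lemma 1 is straightforward to check numerically; the sizes and the weight vector standing in for the arrival-averaged mixture $v_\beta$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
A, C = 6, 2
U = rng.uniform(0, 1, (A, C))
v_beta = np.array([0.3, 0.7])      # stand-in for the class weights under arrivals

# Lemma 1: M2 = sum_c v_c u_c u_c^T and M3 = sum_c v_c u_c (x) u_c (x) u_c.
M2 = sum(v_beta[c] * np.outer(U[:, c], U[:, c]) for c in range(C))
M3 = np.einsum('c,ac,bc,dc->abd', v_beta, U, U, U)

assert np.allclose(M2, np.einsum('c,ac,bc->ab', v_beta, U, U))
assert np.linalg.matrix_rank(M2) <= C      # the rank-C structure exploited below
```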

Having shown the unbiasedness of the empirical 2nd and 3rd moment tensors $\hat M_{n,2}$ and $\hat M_{n,3}$, we next turn to showing concentration around their respective means.

Lemma 2 Assume that $p_i(a, a' \mid b_i) \ge q_{2,i}$ and $p_i(a, a', a'' \mid b_i) \ge q_{3,i}$ for deterministic $q_{2,i}, q_{3,i}$, for all $i \in \mathbb{N}$ and $a, a', a'' \in [A]$. Then for all $n \le N$, with probability higher than $1 - \delta$, it holds simultaneously for all $a, a', a''$ that

$$|\tilde r_{a,a',n} - m_{a,a'}| \le \sqrt{\sum_{i=1}^n \frac{q_{2,i}^{-2} \log(4A^2/\delta)}{2n^2}}, \qquad |\tilde r_{a,a',a'',n} - m_{a,a',a''}| \le \sqrt{\sum_{i=1}^n \frac{q_{3,i}^{-2} \log(4A^3/\delta)}{2n^2}}.$$

An immediate corollary is the following:

Corollary 1 Provided that $q_{2,i} \ge \gamma_i/A^2$ and $q_{3,i} \ge \gamma_i/A^3$ for some $\gamma_i > 0$, then on an event of probability higher than $1 - \delta$, the following hold simultaneously:

$$e_n^{(2)} \overset{\text{def}}{=} \|\hat M_{n,2} - M_2\| \le A^3 \sqrt{\sum_{i=1}^n \frac{\gamma_i^{-2} \log(4A^2/\delta)}{2n^2}}, \qquad e_n^{(3)} \overset{\text{def}}{=} \|\hat M_{n,3} - M_3\| \le A^{9/2} \sqrt{\sum_{i=1}^n \frac{\gamma_i^{-2} \log(4A^3/\delta)}{2n^2}}.$$

4. We consider $\ell = 3$ to describe the algorithm; $\ell > 3$ is easily handled by repeating the 3-wise sampling $p(a, a', a'')$ $\lfloor \ell/3 \rfloor$ times and discarding the remaining ($< 3$) steps in the mini-session during exploration (leading to a negligible regret overhead).

5. An alternative is the implicit exploration method due to Kocák et al. (2014).


Algorithm 1 EstimateFeatures

1: Input: #sessions $n$; #mini-sessions $\ell$; (user, action, reward) tuples $(b_i, a_{i,l}, X_{a_{i,l}, b_i, i, l})_{1 \le i \le n,\ 1 \le l \le \ell}$.
2: Compute the $A \times A$ matrix $\hat M_{n,2} = (\tilde r_{a,a',n})_{a,a' \in [A]}$ and the $A \times A \times A$ tensor $\hat M_{n,3} = (\tilde r_{a,a',a'',n})_{a,a',a'' \in [A]}$.
3: Compute an $A \times C$ whitening matrix $\hat W_n$ of $\hat M_{n,2}$. {Take $\hat W_n = \hat U_n \hat D_n^{-1/2}$, where $\hat D_n$ is the $C \times C$ diagonal matrix with the top $C$ eigenvalues of $\hat M_{n,2}$, and $\hat U_n$ the $A \times C$ matrix of corresponding eigenvectors.}
4: Form the $C \times C \times C$ tensor $\hat T_n = \hat M_{n,3}(\hat W_n, \hat W_n, \hat W_n)$.
5: Apply the RTP algorithm (Anandkumar et al., 2014b) to $\hat T_n$, and compute its robust eigenvalues $(\hat\lambda_{n,c})_{c \in [C]}$ with eigenvectors $(\hat\varphi_{n,c})_{c \in [C]}$. {The paper of Anandkumar et al. (2014b, Sec. 4) defines eigenvalues/eigenvectors of tensors.}
6: Compute, for each $c \in [C]$, $u_{n,c} = \hat\lambda_{n,c} (\hat W_n^\top)^\dagger \hat\varphi_{n,c}$ and $v_{n,c} = \hat\lambda_{n,c}^{-2}$.
7: Output: Estimate of latent classes $U$: the $A \times C$ matrix $U_n$ obtained by stacking the vectors $u_{n,c} \in \mathbb{R}^A$ side by side.
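On noiseless moments, the whitening-and-decomposition pipeline of steps 2-7 can be sketched as below. The inner loop is a plain tensor power iteration with deflation, a simplified stand-in for the actual RTP algorithm of Anandkumar et al. (2014b), which adds random restarts and robustness safeguards; the sizes and mixture weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
A, C = 8, 3
U = rng.uniform(0.1, 1.0, (A, C))         # latent classes to recover
w = np.array([0.5, 0.3, 0.2])             # positive mixture weights

M2 = np.einsum('c,ac,bc->ab', w, U, U)
M3 = np.einsum('c,ac,bc,dc->abd', w, U, U, U)

# Step 3: whitening matrix W = E D^{-1/2}, so that W^T M2 W = I_C.
evals, evecs = np.linalg.eigh(M2)
D, E = evals[-C:], evecs[:, -C:]
W = E / np.sqrt(D)

# Step 4: the C x C x C symmetric tensor T = M3(W, W, W).
T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)

# Steps 5-6: power iterations with deflation, then un-whitening.
U_hat = np.zeros((A, C))
for _ in range(C):
    phi = rng.standard_normal(C)
    phi /= np.linalg.norm(phi)
    for _ in range(100):
        phi = np.einsum('ijk,j,k->i', T, phi, phi)
        phi /= np.linalg.norm(phi)
    lam = np.einsum('ijk,i,j,k->', T, phi, phi, phi)
    col = lam * np.linalg.pinv(W.T) @ phi          # u_c = lambda (W^T)^+ phi
    c = np.argmin(np.linalg.norm(U - col[:, None], axis=0))  # match up to permutation
    U_hat[:, c] = col
    T = T - lam * np.einsum('i,j,k->ijk', phi, phi, phi)     # deflate

assert np.allclose(U_hat, U, atol=1e-5)    # exact recovery from exact moments
```

With the empirical moments $\hat M_{n,2}, \hat M_{n,3}$ in place of $M_2, M_3$, the recovery is only approximate, which is precisely what Theorem 1 quantifies.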

Reconstruction algorithm. The EstimateFeatures algorithm (Algorithm 1) employs a whitening matrix $\hat W_n$ of the empirical estimate $\hat M_{n,2}$ of the matrix $M_2$ to build the empirical tensor $\hat T_n$. This tensor is then used to recover the columns of the matrix $U = (u_{a,c})_{a \in [A], c \in [C]}$ via the RTP algorithm. For the sake of completeness, we also introduce $W$, a whitening matrix of $M_2$ (i.e., $W^\top M_2 W = I$), the corresponding tensor $T = M_3(W, W, W)$, and finally the estimation error $e_n \overset{\text{def}}{=} \|\hat T_n - T\|$.

Reconstruction guarantee. Our next result makes use of the following proposition from Anandkumar et al. (2014b, Theorem 5.1), restated here for completeness.

Proposition 1 (Theorem 5.1 of Anandkumar et al. (2014b)) Let $\hat T = T + E \in \mathbb{R}^{C \times C \times C}$, where $T$ is a symmetric tensor with orthogonal decomposition $T = \sum_{c=1}^C \lambda_c \varphi_c^{\otimes 3}$, where each $\lambda_c > 0$, $\{\varphi_c\}_{c \in [C]}$ is an orthonormal basis, and $E$ is a symmetric tensor with operator norm $\|E\| \le \varepsilon$. Let $\lambda_{\min} = \min\{\lambda_c : c \in [C]\}$ and $\lambda_{\max} = \max\{\lambda_c : c \in [C]\}$. Run the RTP algorithm with input $\hat T$ for $C$ iterations, and let $\{(\hat\lambda_c, \hat\varphi_c)\}_{c \in [C]}$ be the corresponding sequence of estimated eigenvalue/eigenvector pairs returned. Then there exist universal constants $C_1, C_2 > 0$ for which the following is true. Fix $\eta \in (0,1)$ and run RTP with parameters (i.e., numbers of iterations) $L, N$ with $L = \mathrm{poly}(C) \log(1/\eta)$ and $N \ge C_2 \big(\log(C) + \log\log(\lambda_{\max}/\varepsilon)\big)$. If $\varepsilon \le C_1 \lambda_{\min}/C$, then with probability at least $1 - \eta$, there exists a permutation $\pi \in S_C$ such that

$$\forall c \in [C]: \quad |\lambda_c - \hat\lambda_{\pi(c)}| \le 5\varepsilon, \qquad \|\varphi_c - \hat\varphi_{\pi(c)}\| \le 8\varepsilon/\lambda_c, \qquad \text{and} \qquad \Big\|T - \sum_{c=1}^C \hat\lambda_c \hat\varphi_c^{\otimes 3}\Big\| \le 55\varepsilon.$$

Lemma 1 gives a decomposition of the (symmetric) tensor $M_3$, but it may not be orthogonal; a standard transformation (Anandkumar et al., 2014b, Sec. 4.3) gives an orthogonal decomposition for the tensor^6 $M_3(W, W, W)$, with $W$ a matrix that whitens $M_2$. We can thus use Proposition 1 with $T = M_3(W, W, W)$, $\hat T = \hat T_n$, $\varepsilon = e_n$ and $\eta = \delta$ in order to prove the following guarantee (Theorem 1) on the recovery error between the columns of $U$ and their estimates.

6. For a 3rd-order tensor $A \in \mathbb{R}^{a \times a \times a}$ and a 2nd-order tensor (matrix) $B \in \mathbb{R}^{a \times b}$, $A(B, B, B) \in \mathbb{R}^{b \times b \times b}$ is the 3rd-order tensor defined by $[A(B, B, B)]_{i_1,i_2,i_3} \overset{\text{def}}{=} \sum_{j_1,j_2,j_3 \in [a]} A_{j_1,j_2,j_3} B_{j_1,i_1} B_{j_2,i_2} B_{j_3,i_3}$. See Anandkumar et al. (2014b) for more details on notation and results.

We now introduce mild separability conditions on the mixture weights $v_b$ and on the spectrum of the 2nd moment matrix $M_2$, needed for the reconstruction guarantee to hold, similar to those assumed in Lazaric et al. (2013, Theorem 2).

Assumption 1 There exist positive constants $v_{\min}$, $\sigma_{\min}$, $\sigma_{\max}$ and $\Gamma$ such that

$$\min_{b \in [B],\, c \in [C]} v_{b,c} \ge v_{\min}, \qquad \forall c \in [C],\ \sigma_c = \sqrt{\lambda_c(M_2)} \in [\sigma_{\min}, \sigma_{\max}], \qquad \min_{(c, c') \in [C] \times [C],\, c \ne c'} |\sigma_c - \sigma_{c'}| \ge \Gamma,$$

where $\lambda_c(A)$ denotes the $c$-th top eigenvalue of $A$.

Theorem 1 (Recovery guarantee for online estimation of user classes $U$) Let Assumption 1 hold, and let $\delta \in (0,1)$. If the number of mini-sessions satisfies

$$\frac{n^2}{\sum_{i=1}^n \gamma_i^{-2}} \ge \max\left\{ \frac{2A^6 \log(4A^2/\delta)}{\min\{\Gamma, \sigma_{\min}\}^2},\ \frac{A^9 \big(1 + 10(\tfrac{1}{\Gamma} + \tfrac{1}{\sigma_{\min}})(1 + u_{\max}^3)\big)^2 C^5 \log(4A^3/\delta)}{2 C_1^2 \sigma_{\min}^3} \right\},$$

then with probability at least $1 - 2\delta$, there exists some permutation $\pi \in S_C$ such that for all $c \in [C]$, the output $U_n$ of the EstimateFeatures algorithm satisfies

$$\|u_c - u_{n,\pi(c)}\| \le \diamond\, A^3 \sqrt{\sum_{i=1}^n \gamma_i^{-2}\, \frac{C \log(4A^3/\delta)}{2n^2}}, \tag{1}$$

where $u_c = (u_{a,c})_{a \in [A]}$. Here, the constant (we use the "diamond" symbol to denote it) is

$$\diamond = \left(\frac{CA}{\sigma_{\min}}\right)^{3/2} \left( 13\sqrt{\sigma_{\max}} + 4\sqrt{2\min\{\Gamma, \sigma_{\min}\}} + 5\Big(\frac{\sigma_{\max}}{\Gamma} + 1\Big) \frac{u_{\max}}{\min\{\Gamma, \sigma_{\min}\}} \right) \aleph + \Big(\frac{u_{\max}}{\Gamma} + 1\Big) \frac{\sigma_{\max}}{v_{\min}^2} + 5\left(\sqrt{3/8}\,\sqrt{\sigma_{\max}} + \sqrt{\min\{\Gamma, \sigma_{\min}\}/2}\right) \left(\frac{2CA}{\sigma_{\min}}\right)^3 \frac{\aleph^2}{\min\{\Gamma, \sigma_{\min}\}},$$

with the notation $\aleph = 1 + 10(\tfrac{1}{\Gamma} + \tfrac{1}{\sigma_{\min}})(1 + u_{\max}^3)$.

The proof strategy follows that of Lazaric et al. (2013, Theorem 2) and is detailed in the appendix for clarity. It consists in relating, on the one hand, the estimation errors $e_n^{(2)}$ of $M_2$ and $e_n^{(3)}$ of $M_3$ from Corollary 1 to the condition $\varepsilon \le C_1 \lambda_{\min}/C$, and, on the other hand, relating the reconstruction error on the columns of $U$ to the control on the terms $|\lambda_c - \hat\lambda_{\pi(c)}|$ and $\|\varphi_c - \hat\varphi_{\pi(c)}\|$ coming from Proposition 1. We note that the bound appearing in the condition on the number of mini-sessions is potentially large (due to the terms $A^6$, $C^5$, etc.). This is due to the combination of the RTP method with the importance sampling scheme, and it remains unclear whether the bound can be significantly improved within this framework.


4. Recovering latent mixture distributions ($v_b$): robustness of the OFUL algorithm

In order to recover the weight vectors $v_b \in \mathbb{R}^C$, and thus the matrix $V$, it would be tempting to use again an instance of the RTP method, this time aggregating across actions, i.e., forming a $B \times B$ matrix and a $B \times B \times B$ tensor. Unfortunately, aggregation of elements of $U$ fails for two reasons. First, we do not have different views across users $b$, contrary to what we have for actions $a$; it is thus hopeless to form estimates of the 2nd and 3rd moment tensors as before. Second, and rather technically, convex combinations of the $\{u_{a,c}\}_{a \in [A]}$ need not be positive. This prevents the application of the RTP method, which requires positive weights to work.

We thus consider a different strategy that uses an algorithm designed for linear bandits. However, since the feature matrix $U$ is unknown a priori and can only be estimated, we need to work with perturbed features. A first solution is to propagate the additional error resulting from the error on the features through the standard proof of OFUL. However, this leads to a sub-optimal regret that no longer scales as $\tilde O(\sqrt{T})$ with the time horizon. We overcome this hurdle by showing in Theorem 3 a robustness property of OFUL of independent interest, which aids us in controlling the regret of the overall latent class algorithm (Algorithm 3).

Consider OFUL run with perturbed (not necessarily linearly realizable) rewards. Formally, consider a finite action set $\mathcal{A} = \{1, 2, \ldots, A\}$ and distinct feature vectors $\{\bar u_a \in \mathbb{R}^{C \times 1}\}_{a \in \mathcal{A}}$. Let $\bar U := [\bar u_1\ \bar u_2\ \ldots\ \bar u_A] \in \mathbb{R}^{C \times A}$. The expected reward when playing action $A_t = a$ at time $t$ is denoted by $m_a := \mathbb{E}[Y_t \mid A_t = a]$, with $m := (m_a)_{a \in \mathcal{A}}$. Let us assume that there exists a unique optimal action for the expected rewards $m$, i.e., $\arg\max_{a \in \mathcal{A}} m_a = \{a^\star\}$, with the regret at time $n$ being $R_n := \sum_{t=1}^n (m_{a^\star} - m_{A_t})$. The key point here is that $m$ need not be linearly realizable w.r.t. the actions' features: we will not require that $\min_{v \in \mathbb{R}^C} \|m - \bar U^\top v\|$ be $0$.

Algorithm 2 OFUL (Optimism in the Face of Uncertainty for Linear bandits) (Abbasi-Yadkori et al., 2011)

Require: Arms' features $\bar U$, regularization parameter $\lambda$, norm parameter $R_\Theta$.
for all times $t \ge 1$ do
  1. Form the $C \times (t-1)$ matrix $\bar U_{1:t-1} := [\bar u_{A_1}\ \bar u_{A_2}\ \ldots\ \bar u_{A_{t-1}}]$ consisting of all arm features played up to time $t - 1$, and $Y_{1:t-1} := (Y_1, \ldots, Y_{t-1})$. Set $V_{t-1} := \lambda I + \sum_{s=1}^{t-1} \bar u_{A_s} \bar u_{A_s}^\top$.
  2. Choose the action $A_t \in \arg\max_{a \in \mathcal{A}} \max_{v \in \mathcal{C}_{t-1}} \bar u_a^\top v$, where
  $$\mathcal{C}_{t-1} := \{v \in \mathbb{R}^C : \|v - \hat v_{t-1}\|_{V_{t-1}} \le D_{t-1}\}, \qquad D_{t-1} := R\sqrt{2\log\Big(\frac{\det(V_{t-1})^{1/2}\, \lambda^{-C/2}}{\delta}\Big)} + \lambda^{1/2} R_\Theta, \qquad \hat v_{t-1} := V_{t-1}^{-1} \bar U_{1:t-1} Y_{1:t-1}.$$
end for
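A compact sketch of an OFUL run on fixed features: the inner maximization $\max_{v \in \mathcal{C}_{t-1}} \bar u_a^\top v$ has the closed form $\bar u_a^\top \hat v_{t-1} + D_{t-1}\|\bar u_a\|_{V_{t-1}^{-1}}$, which the code uses directly. The features, parameter values, and gaps below are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
lam, R, R_theta, delta, T = 1.0, 0.1, 1.0, 0.05, 2000

# Known arm features (rows) and an unknown parameter v to be learned.
U_bar = np.array([[0.9, 0.9], [0.5, 0.5], [0.8, 0.1], [0.1, 0.8], [0.3, 0.6]])
v = np.array([0.6, 0.4])
C = U_bar.shape[1]
means = U_bar @ v

V = lam * np.eye(C)                # regularized design matrix V_{t-1}
xy = np.zeros(C)                   # running sum of u_{A_s} Y_s
regret = 0.0

for t in range(T):
    v_hat = np.linalg.solve(V, xy)                     # ridge estimate
    D = R * np.sqrt(2 * np.log(np.sqrt(np.linalg.det(V))
                               * lam ** (-C / 2) / delta)) + np.sqrt(lam) * R_theta
    V_inv = np.linalg.inv(V)
    # Optimistic index per arm: u^T v_hat + D * ||u||_{V^{-1}}.
    ucb = U_bar @ v_hat + D * np.sqrt(np.einsum('ac,cd,ad->a', U_bar, V_inv, U_bar))
    a = int(np.argmax(ucb))
    y = means[a] + R * rng.standard_normal()           # noisy reward
    V += np.outer(U_bar[a], U_bar[a])
    xy += U_bar[a] * y
    regret += means.max() - means[a]

assert regret < 0.15 * T           # far below the linear-in-T worst case
```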

OFUL regret with linearly realizable rewards. The OFUL algorithm is stated as Algorithm 2 for the sake of clarity. Before studying the linearly non-realizable case, we record the well-known regret bound for the unperturbed case, that is, when $\forall a \in [A],\ m_a = \bar u_a^\top v$ for some unknown $v$.

Theorem 2 (OFUL regret (Abbasi-Yadkori et al., 2011)) Assume that $\|v\|_2 \le R_\Theta$, and that for all $a \in \mathcal{A}$, $\|\bar u_a\|_2 \le R_X$ and $|\langle \bar u_a, v \rangle| \le 1$. Then with probability at least $1 - \delta$, the regret of OFUL satisfies, for all $n \ge 0$,

$$R_n \le 4\sqrt{nC \log\big(1 + nR_X^2/(\lambda C)\big)} \left( \lambda^{1/2} R_\Theta + R\sqrt{2\log(1/\delta) + C\log\big(1 + nR_X^2/(\lambda C)\big)} \right),$$

provided that the regularization parameter $\lambda$ is chosen such that $\lambda \ge \max\{1, R_X^2, 1/R_\Theta^2\}$.

Regret of OFUL with perturbed features. We make a structural definition to present the result. Let $\alpha(\bar U) := \max_J \|A_J^{-1}\|_2$, where $A = \begin{bmatrix} \bar U^\top \\ I_C \end{bmatrix} \in \mathbb{R}^{(A+C) \times C}$, $A_J$ is the $C \times C$ submatrix of $A$ formed by picking the rows $J$, and $J$ ranges over all size-$C$ subsets of full-rank rows of $A$. We will require for our purposes that $\alpha(\bar U^\top)$ is not too large. For intuition regarding $\alpha$, we refer to Forsgren (1996) (the final 3 paragraphs of p. 770, Corollary 5.4 and Section 7). We remark that the condition that $\alpha(\bar U^\top)$ be small is analogous to a $\gamma$-incoherence type property commonly used in prior work (Bresler et al., 2014, Assumption A2), stating that two distinct feature vectors $u_c$ and $u_{c'}$, $c \ne c'$, must have a minimum angle separation.

Let $v^\star \in \mathbb{R}^C$ be arbitrary with $\ell_2$ norm at most $R_\Theta$ (it helps to think of $\bar U^\top v^\star$ as an approximation of $m$), $\varepsilon_a := m_a - \bar u_a^\top v^\star$, and $\varepsilon := (\varepsilon_a)_{a \in \mathcal{A}} \in \mathbb{R}^A$. We now state a robustness result for OFUL, potentially of independent interest.

Theorem 3 (OFUL robustness property) Suppose $\|v^\star\|_2 \le R_\Theta$, $\lambda \ge \max\{1, R_X^2, 1/(4R_\Theta^2)\}$, and for all $a \in \mathcal{A}$, $\|\bar u_a\|_2 \le R_X$ and $|m_a| \le 1$. If the deviation from linearity satisfies

$$\|\varepsilon\|_2 \equiv \|m - \bar U^\top v^\star\|_2 < \min_{a \ne a^\star} \frac{\bar u_{a^\star}^\top v^\star - \bar u_a^\top v^\star}{2\alpha(\bar U)\, \|\bar u_{a^\star} - \bar u_a\|_2}, \tag{2}$$

then, with probability at least $1 - \delta$, for all $T \ge 0$,

$$R_T \le 8\rho \sqrt{TC \log\Big(1 + \frac{TR_X^2}{\lambda C}\Big)} \left( \lambda^{1/2} R_\Theta + R\sqrt{2\log\frac{1}{\delta} + C\log\Big(1 + \frac{TR_X^2}{\lambda C}\Big)} \right),$$

where $\rho := \max\Big\{1,\ \max_{a \ne a^\star} \dfrac{m_{a^\star} - m_a}{\bar u_{a^\star}^\top v^\star - \bar u_a^\top v^\star}\Big\}$.

Theorem 3 essentially states that when the deviation of the actual mean reward vector from the subspace spanned by the feature vectors is small, the OFUL algorithm continues to enjoy a favorable $O(\sqrt{T})$ regret, up to a factor $\rho \ge 1$. The quantity $\rho$ in the result is a geometric measure of the distortion in the arms' actual rewards $m$ with respect to the (linear) approximation $\bar U^\top v^\star$. We control this quantity in the next paragraph. (Note that $\rho = 1$ in the perfectly linearly realizable case $\varepsilon = 0$, and this gives back the standard OFUL regret up to a universal multiplicative constant.)

Applying the robust analysis of OFUL to the low-rank bandit setup. In this paragraph, we translate Theorem 3 to our Low-Rank Bandit (LRB) setting, in which OFUL uses feature vectors with noisy perturbations (estimated by, say, a Robust Tensor Power (RTP) algorithm). Throughout this section, we fix a user $b$.

We can now translate Theorem 3 thanks to the correspondence with the perturbed OFUL setting: in our low-rank bandit setting, the matrix $\bar U = \bar U_n$ depends on the reconstruction algorithm at mini-session $n$. Moreover, the optimal action $a^\star \equiv a^\star_b$ now depends on the user $b$. We denote for a user $b \in [B]$ the minimum gap across suboptimal actions $g_b \overset{\text{def}}{=} \min_{a \ne a^\star_b} (u_{a^\star_b} - u_a)^\top v_b$. Likewise, the error vector $\varepsilon$ depends on $b, n$. Its norm $\|\varepsilon\|_2$ appears in condition (2) and in the definition of $\rho$, and is controlled by the reconstruction error of Theorem 1. It decays with the number of mini-sessions $n$.

We define $\alpha_n \overset{\text{def}}{=} \alpha(\bar U_n)$, $\alpha \overset{\text{def}}{=} \alpha(U)$, and use $\max_b \|v_b\|$ for $R_\Theta$. Using these notations, and adapting the proof of Theorem 3 to handle a varying $U_n$, we can now translate the result of the perturbed OFUL to our LRB setting:


OFUL | LRB
$v$ | $v_b \equiv (v_{b,c})_{c \in [C]} \in \mathbb{R}^C$
$\bar U$ | $\bar U_n \in \mathbb{R}^{A \times C}$
$m$ | $m_b \equiv U v_b \in \mathbb{R}^A$
$a^\star$ | $a^\star_b := \arg\max_{a \in [A]} u_a^\top v_b$
$\varepsilon_a$ | $(u_a - \bar u_{n,a})^\top v_b$
$\varepsilon \equiv (\varepsilon_a)_{a \in \mathcal{A}}$ | $(U - \bar U_n) v_b$

Table 1: Correspondences between OFUL and Low-Rank Bandit (LRB) quantities at time $n$ and for user $b$.

Lemma 3 Let $0 < \delta \le 1$ and $b \in [B]$. Provided that the number of mini-sessions $n_0$ satisfies $\frac{n_0^2}{\sum_{i=1}^{n_0} \gamma_i^{-2}} \ge \daleth_{b,\delta}$, where we introduce the notation

$$\daleth_{b,\delta} = \max\left\{ \frac{2A^6 \log(4A^2/\delta)}{\min\{\Gamma, \sigma_{\min}\}^2},\ \frac{A^9 \big(1 + 10(\tfrac{1}{\Gamma} + \tfrac{1}{\sigma_{\min}})(1 + u_{\max}^3)\big)^2 C^5 \log(4A^3/\delta)}{2 C_1^2 \sigma_{\min}^3},\ 32 A^6 C^2 \log(4A^3/\delta) \times \max\Big\{2,\ \frac{8A \|v_b\|_2^2}{g_b^2},\ \frac{27 \alpha^2 C u_{\max}^2 \|v_b\|_2^2}{g_b^2} + \frac{1}{2}\Big\} \right\},$$

then with probability at least $1 - 2\delta$, $\|\varepsilon\|_2 = \|(U - \bar U_n) v_b\|_2$ is small enough that for any $n \ge n_0$, condition (2) is satisfied. Consequently, Theorem 3 applies with

$$R_\Theta = \max_b \|v_b\|_2, \qquad R_X = \max_{a \in \mathcal{A}} \|u_a\|_2 + \frac{\sqrt{A}}{2\alpha}, \qquad \text{and} \qquad \rho \equiv \rho_{n,b} \le 2.$$

Thus, provided that the total number of mini-sessions of interaction (not necessarily corresponding to interactions with user $b$) is large enough, the OFUL algorithm run during interactions with user $b$ will achieve a controlled regret. We warn, however, that the constant $\daleth_{b,\delta}$ resulting from the RTP method, especially the second term of the max, may be large, although it is a constant.

5. Putting it together: Online Recommendation algorithm

This section details our main contributions for recommendations in the context of mini-sessions of interactions with unknown mixtures of latent profiles: first Algorithm 3, which combines RTP with OFUL, and then a regret analysis in Theorem 4.

The recommendation algorithm we propose (Algorithm 3) uses the RTP method to estimate the matrix $U$ and then applies OFUL to determine an optimistic action. Importantly, it finally outputs a distribution that mixes the optimistic action with uniform exploration. The mixture coefficient goes to $0$ with the number of rounds, thus converging to playing OFUL only. It ensures that the importance sampling weights are bounded away from $0$ in the beginning.

Main analytical result: Regret bound


Theorem 4 (Regret of Algorithm 3) With Assumption 1 holding, let $\delta \in (0,1)$, $\daleth_\delta = \max_{b \in [B]} \daleth_{b,\delta}$ (from Lemma 3), and let $n_0$ be the first mini-session at which $\frac{n_0^2}{\sum_{i=1}^{n_0} \gamma_i^{-2}} \ge \daleth_\delta$. The regret of Algorithm 3 at time $T = N\ell$ (acting for $N$ mini-sessions of length $\ell$), using internal instances of OFUL parameterized by $\delta > 0$, satisfies

$$\mathbb{E}[R_T] \le 16\sqrt{BTC \log\Big(1 + \frac{TR_X^2}{\lambda C}\Big)} \left( \lambda^{1/2} R_\Theta + R\sqrt{2\log\frac{1}{\delta} + C\log\Big(1 + \frac{TR_X^2}{\lambda C}\Big)} \right) + \ell\Big(n_0 - 1 + \sum_{n=n_0}^N \gamma_n\Big) + 3\delta T,$$

provided that $\lambda \ge \max\{1, R_X^2, 1/R_\Theta^2\}$, with $R_\Theta \ge \max_b \|v_b\|_2$ and $R_X \ge \max_{a \in \mathcal{A}} \|u_a\|_2 + \frac{\sqrt{A}}{2\alpha}$. Consequently, choosing $\delta = 1/T$ and $\gamma_n = \sqrt{\log(n+1)/n}$, $n \in \mathbb{N}$, say, yields the order $\mathbb{E}[R_T] = O\big(C\sqrt{BT\log T}\big)$.

BTlogT . Discussion. (1) The regret of Algorithm3scales withTsimilar to that of an OFUL algorithm

run with perfect knowledge of the feature matrixU:O(C˜ √

BT). This is a non-trivial result asU is not assumed to be known a priori and is estimated by Algorithm3using tensor methods.

Algorithm 3 Per-user OFUL with exploration

Require: Parameters $\lambda$, $R_\Theta$ for OFUL, exploration rate parameters $\gamma_n$, $n \ge 1$.
1: for mini-session $n = 1, \ldots, N$ do
2:   Get user $b_n$.
3:   Draw $p_n \in \{0, 1\}$ with $\mathbb{P}(p_n = 0) = \gamma_n$.
4:   if $p_n = 0$ then
5:     {Carry out an ESTIMATE mini-session}
6:     for step $k = 1, 2, \ldots, \ell$ do
7:       Output $a_{n,k} \sim \text{Uniform}([A])$.
8:     end for
9:     Let $U_n = $ EstimateFeatures (Algorithm 1) with input $(b_i, a_{i,l}, X_{a_{i,l}, b_i, i, l})_{1 \le i \le n,\ 1 \le l \le \ell,\ p_i = 0}$. {Update feature estimates using samples from previous ESTIMATE mini-sessions}
10:  else
11:    {Carry out an OFUL mini-session}
12:    for step $k = 1, 2, \ldots, \ell$ do
13:      Run one iteration of OFUL (Algorithm 2) with features $U_n$, parameters $\lambda$ and $R_\Theta$, and past actions and rewards $(a_{i,l}, X_{a_{i,l}, b_i, i, l})$, $1 \le i < n$, $1 \le l \le \ell$, for which $p_i = 1$ and $b_i = b_n$. {An instance of OFUL for each user, using current feature estimates and observed actions and rewards from previous OFUL mini-sessions}
14:      Output the action $a_{n,k}$ returned by OFUL.
15:    end for
16:  end if
17: end for

(2) One can also compare the result with the regret of ignoring the mixture (low-rank) structure and simply running an instance of UCB per user, which would scale as $O(\sqrt{ABT})$. This becomes highly suboptimal when the number of actions/items $A$ is much larger than the number of user types $C$, demonstrating the gain from leveraging the mixed linear structure of the problem. Note also that we do not need a specific user to interact for a long time, but only for as few as $\ell \ge 3$ consecutive steps, contrary for instance to the transfer method (Lazaric et al., 2013), where a large number of consecutive interaction steps with the same user is required.

(3) It is worthwhile to contrast the result and approach with that of Djolonga et al. (2013): the authors there incur an additional regret term due to the error in approximately estimating the low-rank matrix, which requires additional tuning, ending up with a regret of $O(T^{4/5})$. On the other hand, we avoid this approximation error by showing and exploiting the robustness property of OFUL, which guarantees $\sqrt{T}$ regret as soon as the estimated features $\tilde U$ are within a small radius of the actual ones.

The result (and analysis) does come with a caveat: the model-dependent term $\daleth_\delta$, although independent of the time horizon $T$, is potentially large. With $\gamma_n$ set as in Theorem 4, it appears as an additive exponential constant term in the regret.^7 This arises from the RTP method, and it is currently unclear whether this term can be significantly reduced with the current line of analysis. Numerical evidence, however, indicates that no such large additive constant enters the regret (Section 5). Also, on the bright side, note that $\daleth_\delta$ does not need to be known by the algorithm.

Numerical Results. The performance of the low-rank bandit strategy (Algorithm 3) is shown in Figure 1, simulated for 20 users arriving uniformly at random, 3 user classes, and 200 actions. Both the latent class matrix $U_{200 \times 3}$ and the mixture matrix $V_{20 \times 3}$ are random one-shot instantiations. The proposed algorithm (Algorithm 3), with two different exploration rate schedules $\tilde O(n^{-1/2})$ and $\tilde O(n^{-1/3})$ ('RTP+OFUL(sqrt)' and 'RTP+OFUL(cuberoot)' in the figure), is compared with (a) basic UCB ('UCB' in the figure) ignoring the linear structure of the problem (i.e., UCB per user with 200 actions), (b) OFUL per user with complete knowledge of the user classes and $p_n = 1$ always, i.e., no exploration mini-sessions, and (c) an implementation of the Alternating Least Squares estimator (Takács and Tikk, 2012; Mary et al., 2014) for the matrix $U$, along with OFUL per user. The proposed algorithm, with the theoretically suggested exploration $\tilde O(n^{-1/2})$, is observed to exploit the latent structure considerably better than simple UCB, and is not too far from the unrealistic OFUL strategy, which enjoys the luxury of latent class information. It is also competitive with Alternating Least Squares, which does not come with analytically sound performance guarantees in the bandit learning setting. Also, the large additive constants in the theoretical bounds for Algorithm 3 do not manifest here.

Related work. The popular low-rank matrix completion problem studies the recovery of U and V given a small number of entries sampled at random from UVᵀ, with both U and V being tall matrices; see for instance Jain et al. (2013) and citations therein. However, its setting is different from ours for several reasons. It typically deals with batch data arising from a sampling process that is not active but uniform across entries of UVᵀ. Further, it requires sensing operators with strong properties (such as the RIP property), and, most importantly, the performance metric is not regret but reconstruction error (Frobenius or 2-norm).

In the linear bandit literature (Abbasi-Yadkori et al., 2011; Rusmevichientong and Tsitsiklis, 2010; Dani et al., 2008), the key constraining assumption is that either user-side (V) or item-side (U) features are precisely and completely known a priori. In contrast, the problem of low-regret recommendation across users with latent mixtures does not afford us the luxury of knowing either U or V, and so they must be learnt “on the fly”. Another related work in the context of bandit-type schemes for latent mixture model recommender systems is that of Bresler et al. (2014), in which, under the very specific uniform mixture model for all users, they exhibit strategies with good regret.
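To illustrate the one-sided setting that linear bandits address, here is a hedged sketch of a LinUCB/OFUL-style loop for a single user when the item features U are known, so only the user's mixture vector must be learnt. The regularization and confidence-width constants are illustrative, not the exact radii from the papers cited above:

```python
import numpy as np

rng = np.random.default_rng(2)

# One-sided uncertainty: the item-side features U are known, the user's
# mixture vector theta_star is not. Constants lam and beta are illustrative.
C, A = 3, 15
U = rng.uniform(size=(A, C))            # known item features (rows = actions)
theta_star = rng.dirichlet(np.ones(C))  # unknown user mixture over classes
lam, beta = 1.0, 1.0

Vmat = lam * np.eye(C)                  # regularized Gram matrix (not the mixture matrix V)
bvec = np.zeros(C)
regret = 0.0
for t in range(500):
    theta_hat = np.linalg.solve(Vmat, bvec)   # ridge estimate of the mixture
    Vinv = np.linalg.inv(Vmat)
    # Optimistic index: estimated reward plus an ellipsoidal confidence width
    # sqrt(x^T Vmat^{-1} x) for each action's feature row x.
    width = np.sqrt(np.einsum('ac,cd,ad->a', U, Vinv, U))
    a = int(np.argmax(U @ theta_hat + beta * width))
    r = U[a] @ theta_star + rng.normal(0, 0.1)
    Vmat += np.outer(U[a], U[a])
    bvec += r * U[a]
    regret += (U @ theta_star).max() - U[a] @ theta_star
print(regret)
```

The two-sided problem studied in this paper removes the assumption that U is known, which is why the estimate from the RTP method must be fed into such a loop.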

Nguyen et al. (2014) consider an alternating minimization type scheme in linear bandit models with two-sided uncertainty (an alternative model involving latent “factors”). However, no rigorous guarantees are given for the bandit schemes they present; moreover, it is not known if alternating minimization finds global minima in general. Another related work is in the transfer learning setting of Lazaric et al. (2013).

7. With additional prior knowledge of γn, the dependence of the additive term can be made polynomial in 9δ: choosing γn = min{1, √(9δ/n)}, it holds that ℓ(n0 − 1 + Σ_{n=n0}^{N} γn) ≤ 2√(9δ ℓ T) + ℓ.
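The footnote's bound follows from a standard sum comparison. The sketch below is an assumption-laden reconstruction: it writes c for the model-dependent constant (printed as 9δ above) and assumes the N mini-sessions each cost at most ℓ rounds, with T ≥ ℓN:

```latex
% Assumptions: \gamma_n = \min\{1, \sqrt{c/n}\}, n_0 is the first index with
% \gamma_n < 1 (so \gamma_n = 1 for all n < n_0), and T \ge \ell N.
\ell\Bigl(n_0 - 1 + \sum_{n=n_0}^{N} \gamma_n\Bigr)
  = \ell \sum_{n=1}^{N} \gamma_n
  \le \ell \sum_{n=1}^{N} \sqrt{\frac{c}{n}}
  \le 2\ell\sqrt{cN}
  = 2\sqrt{c\,\ell\,(\ell N)}
  \le 2\sqrt{c\,\ell\,T},
% using \sum_{n=1}^{N} n^{-1/2} \le 2\sqrt{N}; the additive \ell in the stated
% bound absorbs the rounding of session boundaries.
```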

[Figure 1: plot of expected regret versus number of iterations (3 × number of mini-slots), with curves for OFUL, ALS+OFUL, UCB, RTP+OFUL(cuberoot), and RTP+OFUL(sqrt).]

Figure 1: Regret of the proposed algorithm (‘RTP+OFUL’ or Algorithm 3) for two different exploration rate schedules, compared with (a) independent UCB per-user, (b) OFUL per-user with perfect knowledge of latent classes U, and (c) Alternating Least Squares estimation for the matrix U, along with OFUL per-user. Here, B = 20 users, C = 3 classes, and A = 200, with randomly generated U and V. Plots show the sample mean of cumulative regret with time, with 1 standard-deviation error bars over 10 sample experiments.

The transfer-learning method of Lazaric et al. (2013) essentially combines the RTP method (Anandkumar et al., 2014b, 2012) with a standard UCB (Auer et al., 2002), but works in the setting of a large number of interactions with the same user, without assuming access to “user ids”. As a result, the regret bound in this setting scales linearly with the number of rounds. Our result in this paper shows that with additional access to just user identifiers, we can reduce the regret rate to be sublinear in time.

The RTP method has been used as a preprocessing step for the EM algorithm in crowdsourcing (Zhang et al., 2014), but only convergence properties are considered there, which is not enough to provide regret guarantees.

On the theoretical side, our contribution generalizes the setting of clustered bandits (Maillard and Mannor, 2014; Gentile et al., 2014), in which a hard clustering model is assumed (one user is assigned to one class, or equivalently mixture distributions can only have support size 1). In particular, Maillard and Mannor (2014) specifically highlight the benefit of a collaborative gain across users against using a vanilla UCB for each user. However, their setting is less general than assuming a soft clustering of users (one user corresponds to a mixture of classes) across various “representative” taste profiles, as we study here.

The Alternating Least-Squares (ALS) method (Takács and Tikk, 2012; Mary et al., 2014) has been shown to yield promising experimental results in similar settings where both U and V are unknown. However, no theoretical guarantees are known for this algorithm, which may converge to a local optimum in general.
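For concreteness, here is a minimal ALS sketch on a fully observed low-rank matrix. The sizes, regularization, and iteration count are illustrative; the bandit setting above additionally involves partial and non-uniform observations, which is part of why no guarantees are known there:

```python
import numpy as np

rng = np.random.default_rng(1)

# Minimal Alternating Least Squares sketch: factor an observed reward matrix
# R ~ U V^T of rank C by alternately solving two ridge-regularized least squares.
A, B, C = 30, 10, 3                     # illustrative sizes, not the paper's
U_true = rng.normal(size=(A, C))
V_true = rng.normal(size=(B, C))
R = U_true @ V_true.T                   # exactly rank-C target

lam = 1e-3
U = rng.normal(size=(A, C))             # random initialization
V = rng.normal(size=(B, C))
for _ in range(50):
    # Fix V, solve min_U ||R - U V^T||_F^2 + lam ||U||_F^2 in closed form:
    U = R @ V @ np.linalg.inv(V.T @ V + lam * np.eye(C))
    # Fix U, solve the symmetric problem for V:
    V = R.T @ U @ np.linalg.inv(U.T @ U + lam * np.eye(C))

err = np.linalg.norm(R - U @ V.T) / np.linalg.norm(R)
print(err)
```

On a fully observed matrix each half-step has a closed-form solution; with the masked, arrival-driven observations of the bandit setting the subproblems become weighted least squares and the local-optimum caveat discussed above applies.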

The work of Valko et al. (2014) studies stochastic bandits with a linear model over a low-rank (graph Laplacian) structure. However, they assume complete knowledge of the graph, and hence of the eigenvectors of the Laplacian, converting it into a bilinear problem with only one-sided uncertainty. This is in contrast to our setup where both U and V are completely uncertain.


Perhaps the closest work to ours is that of Djolonga et al. (2013), where the authors develop a flexible approach for bandit problems in high dimension but with low-dimensional reward dependence. They use a two-phase algorithm: first, a low-rank matrix completion technique (the Dantzig selector) estimates the feature-reward map; then, a Gaussian Process-UCB (GP-UCB) bandit algorithm controls the regret. They show that if, after n iterations, the approximation error between the feature matrix and its estimate is less than η, the final regret is given by the sum of the regret of GP-UCB when given perfect knowledge of the features and of n + η(T − n) (due to the learning phase and approximation error). This results in an overall regret scaling as O(T^{4/5}). We depart from their results in two fundamental ways. Firstly, they have the possibility of uniformly sampling the entries (a common assumption in low-rank matrix completion techniques); we do not have this luxury in our setting, as we do not control the process of user arrivals, which is not constrained to be uniform. Secondly, we prove and exploit a novel robustness property (see Theorem 8) of the bandit subroutine we use (OFUL in our case instead of GP-UCB), which allows us to effectively eliminate the approximation error in their work and obtain a O(√T) regret bound (see Theorem 4).

6. Conclusion & Directions

We consider a full-blown latent class mixture model in which users are described by unknown mixtures across unknown user classes, more general and challenging than when users are assumed to fall perfectly in one class (Gentile et al., 2014; Maillard and Mannor, 2014).

We provide the first provable sublinear regret guarantees in this setting, when both the canonical classes and the user mixture weights are completely unknown, which we believe is striking when compared to existing work in the setting: alternating minimization, for instance, typically gets stuck in local minima. We currently use a combination of noisy tensor factorization and linear bandit techniques, and control the uncertainty in the estimates resulting from each of these techniques. This enables us to effectively recover the latent class structure.

Future directions include reducing the numerical constant (e.g., using an alternative to RTP), and studying how to combine our work with the aggregation of user parameters suggested in Maillard and Mannor (2014).

References

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Proc. NIPS, pages 2312–2320, 2011.

Animashree Anandkumar, Daniel Hsu, and Sham M. Kakade. A method of moments for mixture models and hidden Markov models. arXiv preprint arXiv:1203.0683, 2012.

Animashree Anandkumar, Rong Ge, Daniel Hsu, and Sham M Kakade. A tensor approach to learning mixed membership community models. Journal of Machine Learning Research, 15(1):2239–2312, 2014a.

Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. J. Mach. Learn. Res., 15(1):2773–2832, January 2014b.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.

Guy Bresler, George H Chen, and Devavrat Shah. A latent source model for online collaborative filtering. In Proc. NIPS 27, pages 3347–3355. Curran Associates, Inc., 2014.

Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Proc. COLT, 2008.


Josip Djolonga, Andreas Krause, and Volkan Cevher. High-dimensional Gaussian process bandits. In Proc. NIPS, pages 1025–1033, 2013.

Anders Forsgren. On linear least-squares problems with diagonally dominant weight matrices. SIAM Journal on Matrix Analysis and Applications, 17(4):763–788, 1996.

Claudio Gentile, Shuai Li, and Giovanni Zappella. Online clustering of bandits. In Proc. ICML, pages 757–765, 2014.


Thomas Hofmann and Jan Puzicha. Latent class models for collaborative filtering. In IJCAI, volume 99, pages 688–693, 1999.

Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating mini- mization. In Proc. ACM Symposium on Theory Of computing (STOC), pages 665–674. ACM, 2013.

Jon Kleinberg and Mark Sandler. Using mixture models for collaborative filtering. In Proc. ACM Symposium on Theory Of Computing (STOC), pages 569–578. ACM, 2004.

Tomáš Kocák, Gergely Neu, Michal Valko, and Rémi Munos. Efficient learning by implicit exploration in bandit problems with side observations. In Proc. NIPS, pages 613–621, 2014.

Alessandro Lazaric, Emma Brunskill, et al. Sequential transfer in multi-armed bandit with finite set of models. In Proc. NIPS, pages 2220–2228, 2013.

Odalric-Ambrym Maillard and Shie Mannor. Latent bandits. In Proc. ICML, pages 136–144, 2014.

Jérémie Mary, Romaric Gaudel, and Philippe Preux. Bandits warm-up cold recommender systems. arXiv preprint arXiv:1407.2806, 2014.

Hai Thanh Nguyen, Jérémie Mary, and Philippe Preux. Cold-start problems in recommendation systems via contextual-bandit algorithms. arXiv preprint arXiv:1405.7544, 2014.

Paat Rusmevichientong and John N Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.

Gilbert W. Stewart and Ji-guang Sun. Matrix Perturbation Theory, volume 175. Academic Press, New York, 1990.

Ilya Sutskever, Joshua B. Tenenbaum, and Ruslan R Salakhutdinov. Modelling relational data using Bayesian clustered tensor factorization. In Proc. NIPS, pages 1821–1828. Curran Associates, Inc., 2009.

Gábor Takács and Domonkos Tikk. Alternating least squares for personalized ranking. In ACM Conference on Recommender Systems, 2012.

Michal Valko, Rémi Munos, Branislav Kveton, and Tomáš Kocák. Spectral bandits for smooth graph functions. In Proc. ICML, 2014.

Yuchen Zhang, Xi Chen, Dengyong Zhou, and Michael I Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In Proc. NIPS, pages 1260–1268, 2014.
