
Supervised classification is widely used to solve many real-world problems such as spam filtering or medical diagnosis (see Chapter 5). The main reason is the following: it can be easily expressed as a well-defined problem with a loss function. This loss function, chosen by the scientist, makes the problem clearly defined. On the other hand, clustering is "unsupervised" classification, that is, assigning classes that are not defined a priori. The goal is to learn the underlying structure of a dataset. This unsupervised problem remains a hard issue since this representation is not necessarily unique. Even worse, a huge number of different structures may coexist in any non-trivial set of data. In von Luxburg, Williamson, and Guyon [2009], two distinct purposes for clustering are distinguished: data pre-processing, where clustering is considered as a step in a whole data processing chain, and exploratory data analysis, to discover a new structure in a dataset. Earlier in the dissertation, exploratory data clustering was studied in the statistical and online learning settings. Clustering was carried out with the k-means loss function as a quantization problem: how to summarize a distribution P in terms of Euclidean distance? In this section, a contrario, data pre-processing clustering is studied, where the clustering task is an abstraction of the ultimate end-use problem.

4.2.1 Introduction

Bi-clustering or co-clustering is a popular method to analyse huge data matrices and build recommender systems. In this field, we mainly observe a random matrix, where rows correspond to a population and columns to variables (or products). This matrix is usually sparse, i.e. with many hidden entries (such as ratings). The goal is to reconstruct the matrix by clustering simultaneously the rows and columns of the matrix. This scenario has been applied to many real-world problems such as text mining (see Slonim and Tishby [2000]), gene expression (Cheng and Church [2000]), social networks (see Gnatyshak, Ignatov, Semenov, and Poelmans [2012]) or collaborative filtering (see Seldin [2009]). In Seldin and Tishby [2010], generalization bounds in terms of Kullback-Leibler divergence are proposed for this problem.

Assuming the existence of a probability distribution $p(x_1, x_2, y)$ over the triplet of rows, columns and ratings, a discriminative predictor $q(y \mid x_1, x_2)$ is constructed via a PAC-Bayesian approach. The matrix is supposed to have i.i.d. entries and the number of clusters is known in advance. In this section, we want to investigate a deterministic and sequential version of the bi-clustering problem. As before, we are interested in sparsity regret bounds, where the sparsity is associated with the structure of the data points.

We consider an individual sequence $(x_t, y_t)$, $t = 1, \ldots, T$, where $T$ is the known horizon, and for any $t = 1, \ldots, T$:

— the input variable $x_t = (x_{t,1}, \ldots, x_{t,d}) \in \mathcal{X}_1 \times \cdots \times \mathcal{X}_d =: \mathcal{X}$,

— the output² $y_t \in \mathcal{Y} \subseteq [0, M]$.

A seminal example is the construction of recommender systems. In this case, $d = 2$ and $x_t = (x_{t,1}, x_{t,2})$ corresponds to a couple customer × movie, whereas $y_t$ is the associated rating (such as $\{\star, \star\star, \star\star\star\}$ for instance). Note also that our analysis is not limited to the bi-clustering problem where $d = 2$ above, since we can consider $d > 2$ tensors as well. At each time $t$, an input $x_t \in \mathcal{X}$ is observed and we design a prediction $\hat{y}_t$. Then, $y_t$ is revealed and we incur the loss $\ell(\hat{y}_t, y_t) = (y_t - \hat{y}_t)^2$ for simplicity. This particular loss enjoys the useful property of being $\lambda$-exp-concave, which means that $\hat{y} \mapsto e^{-\lambda \ell(\hat{y}, y)}$ is concave. Unlike the previous purely unsupervised study, it allows us to reach fast regret bounds (see also the discussion in Chapter 1).
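To see why the threshold $1/(2M^2)$ appears, here is a short check of the exp-concavity claim (a standard computation, supplied for completeness and not part of the original text). Writing $f(\hat{y}) = e^{-\lambda(\hat{y} - y)^2}$, a direct differentiation gives

$$f''(\hat{y}) = \left(4\lambda^2 (\hat{y} - y)^2 - 2\lambda\right) e^{-\lambda(\hat{y} - y)^2} \leq 0 \iff 2\lambda (\hat{y} - y)^2 \leq 1,$$

and since $(\hat{y} - y)^2 \leq M^2$ for $\hat{y}, y \in [0, M]$, concavity holds as soon as $\lambda \leq 1/(2M^2)$.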

With this loss function, we mix decision functions according to:

$$\hat{y}_t = \mathbb{E}_{\vec{c} \sim \hat{p}_t}\, g_{\vec{c}}(x_t). \qquad (4.18)$$

In (4.18), decision functions $g_{\vec{c}}(\cdot)$ depend on a set of $d$-tensor codebooks $\vec{c} = (\mathbf{c}_1, \ldots, \mathbf{c}_d) \in \prod_{j=1}^d \mathcal{X}_j^{p_j}$. A $d$-tensor codebook assigns to each component $x_j$ of $x \in \mathcal{X}$ the nearest center of $\mathbf{c}_j := (c_{j,1}, \ldots, c_{j,p_j})$ for a given $p_j \in \mathbb{N}$. The associated $d$-tensor Voronoï cell is denoted $V_{\vec{c}}(x)$ and corresponds to the product of the Voronoï cells $V_{\mathbf{c}_j}(x_j) = \{z_j \in \mathcal{X}_j : \arg\min_{i_j = 1, \ldots, p_j} |z_j - c_{j,i_j}|^2 = \arg\min_{i_j = 1, \ldots, p_j} |x_j - c_{j,i_j}|^2\}$. In what follows (see for instance Theorem 20), we consider two different mappings $\vec{c} \mapsto g_{\vec{c}}(\cdot)$. The first one consists in computing the mean value of the sequence of past outputs in a given Voronoï cell. In this case, $g_{\vec{c}}(x_t)$ is defined for any $t$ as:

$$g_{\vec{c}}^{\mathrm{mean}}(x_t) = \frac{\sum_{u=1}^{t-1} y_u \mathbf{1}_{x_u \in V_{\vec{c}}(x_t)}}{\mathrm{card}\{\{x_1, \ldots, x_{t-1}\} \cap V_{\vec{c}}(x_t)\}}. \qquad (4.19)$$

In (4.19), we can initialize $g_{\vec{c}}^{\mathrm{mean}}(x_1) = M/2$ without loss of generality. In (4.19) (and also in (4.20)), $g_{\vec{c}}(x_t)$ depends on the past observations $(x_1, y_1), \ldots, (x_{t-1}, y_{t-1})$. We omit this dependence for simplicity.
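To make the quantizer concrete, the following minimal Python sketch implements the $d$-tensor nearest-center assignment and the mean decision rule (4.19). The function names (`cell_index`, `g_mean`) and the list-of-arrays encoding of a point $x = (x_1, \ldots, x_d)$ are illustrative choices, not from the original text.

```python
import numpy as np

def cell_index(x, codebook):
    """Assign each component x_j to its nearest center in c_j.

    x: list of d vectors, x[j] of shape (m_j,)
    codebook: list of d arrays, codebook[j] of shape (p_j, m_j)
    Returns a tuple of d center indices identifying the d-tensor Voronoi cell.
    """
    return tuple(
        int(np.argmin(((c - xj) ** 2).sum(axis=1)))
        for xj, c in zip(x, codebook)
    )

def g_mean(x_t, codebook, past_x, past_y, default=0.5):
    """Mean of past outputs whose inputs fall in the same cell as x_t -- rule (4.19).

    default plays the role of the M/2 initialization when the cell is empty.
    """
    cell = cell_index(x_t, codebook)
    in_cell = [y for xu, y in zip(past_x, past_y)
               if cell_index(xu, codebook) == cell]
    return float(np.mean(in_cell)) if in_cell else default
```

With $d = 2$ and scalar components, a point sharing its (row, column) cell with two past observations is predicted by their average rating.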

Furthermore, when $\mathcal{Y} = \{1, \ldots, M\}$, we can also use a majority vote for $g_{\vec{c}}(x_t)$, where the majority vote at time $t$ is taken in the Voronoï cell $V_{\vec{c}}(x_t)$ as follows:

$$g_{\vec{c}}^{\mathrm{vote}}(x_t) = \arg\max_{k \in \mathcal{Y}} \mathrm{card}\{u = 1, \ldots, t-1 : y_u = k \text{ and } x_u \in V_{\vec{c}}(x_t)\}. \qquad (4.20)$$
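The vote rule (4.20) can be sketched in the same spirit; the helper `cell_index` and the `default` label for an empty cell are again illustrative conventions, not taken from the original text.

```python
from collections import Counter
import numpy as np

def cell_index(x, codebook):
    # Nearest-center index for each of the d components (hypothetical helper).
    return tuple(int(np.argmin(((c - xj) ** 2).sum(axis=1)))
                 for xj, c in zip(x, codebook))

def g_vote(x_t, codebook, past_x, past_y, default=1):
    """Majority label among past points sharing x_t's d-tensor Voronoi cell -- rule (4.20)."""
    cell = cell_index(x_t, codebook)
    votes = Counter(y for xu, y in zip(past_x, past_y)
                    if cell_index(xu, codebook) == cell)
    return votes.most_common(1)[0][0] if votes else default
```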

Equipped with these decision functions, we want to promote a sparse representation. Here, the sparsity is associated with the set of $d$-tensor codebooks. Given some vector of integers $m = (m_1, \ldots, m_d) \in \mathbb{N}^d$, we restrict the study to the Euclidean space by considering $\mathcal{X}_j = \mathbb{R}^{m_j}$ for any $j = 1, \ldots, d$, $\mathcal{X} = \mathbb{R}^{\sum_{j=1}^d m_j}$ and $\vec{c} = (\mathbf{c}_1, \ldots, \mathbf{c}_d) \in \prod_{j=1}^d \mathbb{R}^{m_j p_j}$. Then, we wish that $y_t \approx g_{\vec{c}^\star}(x_t)$, where $\vec{c}^\star = (\mathbf{c}_1^\star, \ldots, \mathbf{c}_d^\star)$ is such that $\mathbf{c}_j^\star$ has a small $\ell_0$-norm for any $j = 1, \ldots, d$, where $|\mathbf{c}_j^\star|_0$ is defined as:

$$|\mathbf{c}_j^\star|_0 = \mathrm{card}\{i_j = 1, \ldots, p_j : c_{j,i_j}^\star \neq (0, \ldots, 0)^\top \in \mathbb{R}^{m_j}\}. \qquad (4.21)$$

2. In the sequel, two cases are considered: $\mathcal{Y} = [0, M]$ (online regression) and $\mathcal{Y} = \{1, \ldots, M\}$ (online classification).
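The group $\ell_0$-norm (4.21) simply counts non-zero centers; a one-line sketch (with an illustrative function name):

```python
import numpy as np

def group_l0(c_j):
    """Number of non-zero centers in codebook c_j of shape (p_j, m_j) -- eq. (4.21)."""
    return int(np.sum(np.any(c_j != 0, axis=1)))
```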

Consequently, we are looking for $d$ distinct group-sparse codebooks $\mathbf{c}_1, \ldots, \mathbf{c}_d$. For this purpose, we will use in our algorithm a product of $d$ group-sparsity priors, as introduced in Lemma 10 for online clustering. This prior is defined in Lemma 11 below.

As mentioned in the introduction, a comparable approach can be found in Seldin and Tishby [2010] in a classical statistical learning context. Considering a random generator $P$ with unknown probability distribution on the set $\mathcal{X} \times \mathcal{Y}$, Seldin and Tishby [2010] suggest the following form of discriminative predictors:

$$h(y \mid x_1, \ldots, x_d) = \sum_{(i_1, \ldots, i_d)} h(y \mid i_1, \ldots, i_d) \prod_{j=1}^d h(i_j \mid x_j).$$

In this stochastic setting, the hidden variables $(i_1, \ldots, i_d)$ represent the clustering of the input $X = (X_1, \ldots, X_d)$. Using a PAC-Bayesian analysis and bounds as in McAllester [1998], generalization error bounds in terms of Kullback-Leibler divergence are proposed. The randomized strategy is based on a density estimation of the law $P$, where the number of groups for each $i_j$, $j = 1, \ldots, d$, is known.
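The factorized predictor above is just a tensor contraction; a small NumPy sketch for $d = 2$ (all sizes and the random conditionals are hypothetical, chosen only to exercise the formula):

```python
import numpy as np

# Hypothetical small instance: p1 row clusters, p2 column clusters, K rating values.
rng = np.random.default_rng(0)
p1, p2, K = 3, 4, 5
h_y_given_cells = rng.dirichlet(np.ones(K), size=(p1, p2))  # h(y | i1, i2)
h_i1_given_x1 = rng.dirichlet(np.ones(p1))                  # h(i1 | x1) for one x1
h_i2_given_x2 = rng.dirichlet(np.ones(p2))                  # h(i2 | x2) for one x2

# h(y | x1, x2) = sum_{i1,i2} h(y | i1, i2) h(i1 | x1) h(i2 | x2)
h_y = np.einsum('abk,a,b->k', h_y_given_cells, h_i1_given_x1, h_i2_given_x2)
assert np.isclose(h_y.sum(), 1.0)  # a valid conditional distribution over ratings
```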

In this section, the framework is essentially different since we propose to use the results of Section 4.1 in order to get sparsity regret bounds according to:

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 \leq \inf_{\vec{c} \in \prod_{j=1}^d \mathbb{R}^{m_j p_j}} \left\{ \sum_{t=1}^T (y_t - g_{\vec{c}}(x_t))^2 + \mathrm{pen}_0(\vec{c}) \right\}, \qquad (4.22)$$

where $\mathrm{pen}_0(\vec{c})$ is a penalty function proportional to the sum of the $\ell_0$-norms (4.21) of the codebooks $\mathbf{c}_1, \ldots, \mathbf{c}_d$. The infimum in the RHS of (4.22) can be seen as a compromise between fitting the data and structural complexity, where the complexity is related to the number of clusters in each subspace $\mathcal{X}_j$, $j = 1, \ldots, d$. In recommender systems, it means that we can propose a simple representation of ratings with a block matrix with a small number of blocks. Note that this fact is strongly related to the usual sparsity assumption in low rank matrix completion (see for instance Candès and Recht [2009], Koltchinskii, Lounici, and Tsybakov [2011]).
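The link with low rank can be illustrated numerically: a ratings matrix that is constant on the blocks induced by row and column clusters has rank bounded by the number of clusters. A minimal sketch (the cluster assignments and block values are invented for illustration):

```python
import numpy as np

# A ratings matrix constant on blocks (2 row clusters x 3 column clusters):
row_cluster = np.array([0, 0, 1, 1, 1])       # cluster of each of 5 customers
col_cluster = np.array([0, 1, 1, 2])          # cluster of each of 4 movies
block_value = np.array([[1., 5., 3.],
                        [2., 2., 4.]])        # one rating level per block
ratings = block_value[np.ix_(row_cluster, col_cluster)]

# Such a block-constant matrix has rank at most min(#row clusters, #col clusters):
assert np.linalg.matrix_rank(ratings) <= 2
```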

4.2.2 General algorithm and associated PAC-Bayesian inequality

Before describing the algorithm, let us introduce some notations. We denote by $\mathcal{C} := \prod_{j=1}^d \mathbb{R}^{m_j p_j}$ the space of $d$-tensor codebooks, whereas a decision function at time $t$ is denoted $g_{\vec{c}}(\cdot)$ (see (4.19) or (4.20) for instance). We introduce a prior $\pi \in \Delta(\mathcal{C})$, where $\Delta(\mathcal{C})$ is the set of probability measures on $\mathcal{C}$, and a temperature parameter $\lambda > 0$. We can now describe the general algorithm and its associated PAC-Bayesian inequality.

The principle of the algorithm is to predict $y_t$ according to a mixture of decision functions $g_{\vec{c}}$, where the mixture is updated at each iteration to favor the codebooks giving the best predictions of $y_t$. At the beginning of the game, $\hat{p}_1 := \pi$. We observe $x_1$ and predict according to $\hat{y}_1 := \mathbb{E}_{\vec{c} \sim \hat{p}_1} g_{\vec{c}}(x_1)$, where $g_{\vec{c}}(x_1)$ is defined above.

Then, learning proceeds as the following sequence of trials $t = 1, \ldots, T-1$:

1. Get $y_t$ and compute:

$$\hat{p}_{t+1}(d\vec{c}) = \frac{e^{-\lambda \sum_{u=1}^t (y_u - g_{\vec{c}}(x_u))^2}}{W_t}\, d\pi(\vec{c}), \qquad (4.23)$$

where $W_t := \mathbb{E}_\pi e^{-\lambda \sum_{u=1}^t (y_u - g_{\vec{c}}(x_u))^2}$ is the normalizing constant.

2. Get $x_{t+1}$ and predict $\hat{y}_{t+1} := \mathbb{E}_{\vec{c} \sim \hat{p}_{t+1}} g_{\vec{c}}(x_{t+1})$.
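For intuition, here is a runnable sketch of this exponentially weighted scheme in which the continuous prior $\pi$ is replaced by a uniform prior over a finite sample of candidate codebooks; this discretization, the function names, and the signature of `g` are simplifications for illustration, not the algorithm as analysed in the text.

```python
import numpy as np

def run_ewa(xs, ys, candidates, g, lam):
    """Exponentially weighted average forecaster over a finite codebook family.

    candidates: list of candidate codebooks (uniform prior over this finite set).
    g(c, x_t, past_x, past_y): decision function such as (4.19); must handle
    an empty past.
    lam: temperature parameter lambda > 0.
    """
    w = np.ones(len(candidates)) / len(candidates)   # p-hat_1 = uniform prior
    preds, past_x, past_y = [], [], []
    for x_t, y_t in zip(xs, ys):
        experts = np.array([g(c, x_t, past_x, past_y) for c in candidates])
        preds.append(float(w @ experts))             # y-hat_t = E_{c ~ p-hat_t} g_c(x_t)
        w *= np.exp(-lam * (y_t - experts) ** 2)     # multiplicative update (4.23)
        w /= w.sum()                                 # normalize by W_t
        past_x.append(x_t); past_y.append(y_t)
    return preds
```

With two constant experts and a constant target, the weights concentrate on the better expert and the mixture prediction converges to it.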

Then, we have constructed a sequence of predictions $(\hat{y}_t)_{t=1,\ldots,T}$ which satisfies the following PAC-Bayesian bound.

Proposition 2. For any deterministic sequence $(x_t, y_t)_{t=1}^T$, for any $p \in \mathbb{N}^d$, any $\lambda \leq 1/(2M^2)$ and any $\rho \in \Delta(\mathcal{C})$:

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 \leq \mathbb{E}_{\vec{c} \sim \rho} \sum_{t=1}^T (y_t - g_{\vec{c}}(x_t))^2 + \frac{\mathcal{K}(\rho, \pi)}{\lambda},$$

where $g_{\vec{c}}(\cdot)$ satisfies (4.19) or (4.20), whereas $\mathcal{K}(\rho, \pi)$ denotes the Kullback-Leibler divergence between $\rho$ and the prior $\pi$.

The bound of Proposition 2 gives a control of the cumulative loss of the sequential procedure described above. Thanks to the square loss used in the algorithm, this bound is tighter than Proposition 1, where a variance term appears in the RHS. In the framework of this section, the loss function is $\lambda$-exp-concave for values of $\lambda$ up to $1/(2M^2)$. Moreover, Proposition 2 holds for any choice of prior $\pi$. This allows us in the sequel to choose a particular sparsity prior in order to state a sparsity regret bound of the form (4.22).

4.2.3 Sparsity regret bounds

The main motivation for introducing our prior is to promote sparsity in the following sense. In $g_{\vec{c}}(\cdot)$, we want a codebook $\vec{c} \in \mathbb{R}^{\sum_{j=1}^d m_j p_j}$, where $\vec{c} = (\mathbf{c}_1, \ldots, \mathbf{c}_d)$ is such that $|\mathbf{c}_j|_0$ is small for any $j = 1, \ldots, d$, where:

$$|\mathbf{c}_j|_0 = \mathrm{card}\{i_j = 1, \ldots, p_j : c_{j,i_j} \neq (0, \ldots, 0)^\top \in \mathbb{R}^{m_j}\}.$$

To deal with this issue, we propose a product of $d$ group-sparsity priors according to:

$$\pi_{S,d}(\vec{c}) := \prod_{j=1}^d \prod_{i_j=1}^{p_j} a_\tau \left(1 + \frac{|c_{j,i_j}|_2^2}{6\tau^2}\right)^{-(m_j+3)/2}, \qquad (4.25)$$

for some normalizing constant $a_\tau > 0$. This prior is a product of $d$ products of $p_j$ multivariate Student distributions $\sqrt{2}\tau\, T_{m_j}(3)$, where $\tau > 0$ is a scaling parameter and $T_{m_j}(3)$ is the $m_j$-multivariate Student with three degrees of freedom. It can be viewed as a generalization of the group-sparsity prior defined in Section 4.1, where $d = 1$.
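A scaled multivariate Student draw can be simulated as a Gaussian divided by an independent chi-square factor; the sketch below samples one codebook $\mathbf{c}_j$ from such a prior (the function name and the exact $\sqrt{2}\tau$ scaling convention are assumptions for illustration; the point is the heavy tails that make small groups likely while allowing a few large ones).

```python
import numpy as np

def sample_student_prior(p, m, tau, rng):
    """Draw one codebook c_j: p independent centers in R^m, each distributed
    as sqrt(2)*tau times a multivariate Student with 3 degrees of freedom."""
    z = rng.standard_normal((p, m))                  # Gaussian numerator
    chi2 = rng.chisquare(df=3, size=(p, 1))          # independent chi-square, 3 df
    return np.sqrt(2) * tau * z / np.sqrt(chi2 / 3)  # classical t-representation
```

With a small $\tau$, most sampled centers are close to zero, mimicking group sparsity.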

It is important to stress that in (4.25), we do not need to threshold the prior at a given radius $R > 0$ as in Section 4.1 (see the definition of the prior in (4.12)). This is due to the presence of the square loss with bounded outputs $y \in \mathcal{Y} \subseteq [0, M]$. A straightforward application of Lemma 10 in the bi-clustering framework gives the following lemma:

Lemma 11. Let $p \in \mathbb{N}^d$, $\tau > 0$. Consider the prior $\pi_{S,d}$ defined in (4.25). Let $\vec{c} = (\mathbf{c}_1, \ldots, \mathbf{c}_d) \in$

In this paragraph, we state the main results of this section, i.e. sparsity regret bounds of the form (4.22) for the algorithm described in Section 4.2.2. The first result is a direct consequence of Proposition 2 and the introduction of the sparsity prior (4.25).

Theorem 20. For any deterministic sequence $(x_t, y_t)_{t=1}^T$, let us consider the algorithm of Section 4.2.2 using the prior $\pi_{S,d}$ defined in (4.25) with $\tau = \delta\{\sqrt{ }$

where $C > 0$ is a constant independent of $T$.

This result gives a sparsity regret bound where the penalty term is proportional to the sum of the $\ell_0$-norms of the set of codebooks $\vec{c} = (\mathbf{c}_1, \ldots, \mathbf{c}_d)$. The algorithm performs as well as the best compromise between fitting the data and complexity. It is important to highlight that the RHS does not depend on the sequence $(p_1, \ldots, p_d)$. Then, as in Section 4.1, we can consider large values of $p_j$ to learn the number of clusters.

Unfortunately, this result is essentially asymptotic since it holds for large values of $T$. This is due to the control of the deviation of the random variable $g_{\vec{c}'}(x_t)$ from $g_{\vec{c}}(x_t)$ for any $t = 1, \ldots, T$, where $\vec{c}' \sim p_{0,d}$ with $p_{0,d}$ defined in Lemma 11. This problem is specific to the context of bi-clustering, where the map $\vec{c} \mapsto g_{\vec{c}}(x)$ defined in (4.19) or (4.20) is not continuous.

As in Section 4.1, this algorithm is not adaptive since it depends on unknown quantities such as the constant $\delta > 0$ (see the proof in [L11] for a precise definition) and the horizon $T$. Adaptive algorithms could be derived as in Section 4.1. We can also stress that, as in Gerchinovitz [2013], we can avoid the boundedness assumption $\mathcal{Y} \subseteq [0, M]$. In this case, the choice of $\lambda > 0$ in the algorithm will depend on the sequence, and an adaptive choice could be investigated following Gerchinovitz [2013]. We omit these considerations for concision, but they could be the core of a more advanced contribution.

Corollary 20 holds for a family $\{g_{\vec{c}}, \vec{c} \in \mathcal{C}\}$ satisfying (4.19) or (4.20). An inspection of the proof shows that a sufficient condition for the family $\{g_{\vec{c}}, \vec{c} \in \mathcal{C}\}$ is:

$\{1, \ldots, p_j\}$ is the nearest neighbor quantizer associated with the $d$-tensor codebook $\vec{c}$. This inequality holds in particular for the families (4.19) and (4.20). Corollary 7 proposes to generalize the previous regret bound to a richer class of base functions:

$$\{g_{\vec{c}}^k,\ \vec{c} \in \mathcal{C},\ k \in \{1, \ldots, N\}\},$$

where for any value of $k = 1, \ldots, N$, (4.26) holds for $g_{\vec{c}}^k$. The functions $g_{\vec{c}}^k$ include the previous cases (4.19) and (4.20), but any other labelizer $g_{\vec{c}}$ constructed from the set of past observations in the cell associated with $\vec{c}$ could be considered. Given this family of $N$ labelizers, we can use the following prior in the algorithm described above:

$$\pi_{S,d,N}(\vec{c}, k) := \frac{1}{N}\, \pi_{S,d}(\vec{c}). \qquad (4.27)$$

The introduction of (4.27) allows us to enlarge the family of decision functions. It leads to a sparsity regret bound over this richer class, with an extra $\log N$ term due to the number of base labelizers:

Corollary 7. For any deterministic sequence $(x_t, y_t)_{t=1}^T$, consider the algorithm of Section 4.2.2 using the prior $\pi_{S,d,N}$ defined in (4.27) with $\tau = \delta\{\sqrt{ }$

This result improves Corollary 20 since the infimum in the RHS can involve different indexes $k$. The price to pay is an extra $M^2 \log N$ term due to the introduction of the parameter $k$ in the algorithm. For instance, consider the case $\mathcal{Y} = [0, M]$. If $N = 2$ in Corollary 7 and the family $\{g_{\vec{c}}^k, \vec{c} \in \mathcal{C}, k \in \{1, 2\}\}$ is made of the forecasters (4.19) and a median estimator, the algorithm performs as well as the best strategy between the mean and the median.