pdf

(1)

SUPERVISED CLUSTERING IN THE DATA CUBE

VINCENT ROULET, FAJWEL FOGEL, ALEXANDRE D’ASPREMONT, AND FRANCIS BACH

ABSTRACT. We study a supervised clustering problem seeking to cluster either features, tasks or sample points using losses extracted from supervised learning problems. We formulate a unified optimization problem han- dling these three settings and derive algorithms whose core iteration complexity is concentrated in a k-means clustering step, which can be approximated efficiently. We test our methods on both artificial and realistic data sets extracted from movie reviews and 20NewsGroup.

1. INTRODUCTION

Using information from a supervised problem like regression or classification,supervised clusteringseeks to discover hidden clusters of features, tasks or samples that both help inference, and provide additional structural insights on the data.

In the classical multi-task setting, clustering predictors of related tasks often helps prediction. In computer vision for instance, classifiers associated with different species of cats should be quite similar, but well separated from classifiers associated with cars, and incorporating this structural information at the training stage should improve performance. This problem has been well studied in the multi-task learning literature [1,2,3].

Similarly, when there exists some groups of highly correlated features, reducing dimensionality by as- signing the same weights to some groups of features can be beneficial both in terms of prediction and interpretation [4]. This often occurs in text classification, where it is natural to group together words having the same meaning for a given task [5,6].

Finally, in some settings, it can be valuable to cluster sample points, with each cluster having its own distinct prediction function [7]. This last problem may have applications in the context of privacy learning, where the group information of an individual may not be revealed because of confidentiality issues.

Here, we present a unified and flexible framework for supervised clustering over the “data cube” of either tasks, features or sample points (a representation introduced by [8]). We directly formulate supervised clustering as an optimization problem on the clustered predictors, where clustering is either a hard constraint or a soft regularization penalty, and losses are adapted to the application at hand (classification or regression, clustering tasks, features or sample points). We propose several optimization schemes to solve these problems efficiently. While the original optimization problem is non-convex, we show that the core non- convexity is concentrated in a subproblem similar to k-means, which we solve using classical approximation techniques [9]. In the particular case of feature clustering for regression, the k-means steps are performed in dimension one, and can therefore be solved exactly by dynamic programming [10,11]. Our formulation is then an explicit convex relaxation which can be solved efficiently using the conditional gradient algorithm [12,13]. We describe experiments on both synthetic and real datasets involving large corpora of text where our method compares favorably with standard benchmarks.

Date: May 11, 2019.

Key words and phrases. Clustering, Multitask, Dimensionality Reduction, Multi Outpout.

The two first authors contributed equally.

1

arXiv:1506.04908v1 [cs.LG] 16 Jun 2015

(2)

2. SUPERVISED CLUSTERING OF FEATURES,TASKS OR SAMPLES

We write supervised clustering as a clustering problem penalized by a supervised learning loss. We let nbe the number of training examples in the dataset,dthe number of features (i.e.the ambient dimension) andKthe number of tasks. For regression or binary classificationK = 1, while for multi-classificationK corresponds to the number of classes (using one-versus-all majority vote, training one binary classifier per class vs all others). For simplicity, we consider only square or logistic losses for linear predictors.

Our main variable is amatrix of predictorswith one column per task, writtenW = [w1, ...,wK]∈R^d×K. We letQbe the desired number of clusters andmthe number of items to cluster. When clusteringtasks, m = K and we are grouping the predictors associated with each task, i.e. the columns of W. When clusteringfeatures,m = dand we are clustering the predictors associated with each feature,i.e.the rows ofW. Finally, when clustering sample points,m = nand we introduce individual predictor vectorsW⁽ⁱ⁾ for each sample pointi, which we cluster together to obtainQdistinct predictors associated with a partition of the points into Qgroups. Hence, the term “predictors” either refers to columns of W, rows of W, or individual predictor vectorsW⁽ⁱ⁾, depending on which dimension of the data cube (tasks, features or sample points) the clustering is performed. A summary of these settings is given in Table1. In the following we use the generic notationsU andV to designate predictors.

Aclusteringcan be seen as a partitionC1∪ C2 ∪. . .∪ CQ = [1, m]of predictors, to which corresponds an assignment matrixZ ∈ {0,1}^m×Qsuch thatZij = 1if itemiis in clusterCj. As ink-means, we define a matrix of centroidsC = [c₁, . . . ,c_Q], with each centroidc_j equal to the mean of predictors in cluster j. This information is summarized in the matrixV = CZ^T, which we callmatrix of individual centroids (MIC), whose columnsvi are such thatvi =cjif predictoriis in clusterC^j.

Given a supervised learning loss L(·) and a regularizing penaltyΩ(·) on predictors, we formulate the supervised clustering problem as an optimization problem over the set of MIC matrix V. Enforcing the hard constraint that predictors are exactly confounded with centroids, we get the followinghard supervised clusteringproblem (HSC)

minimize L(V) + Ω(V)

subject to V =CZ^T, Z ∈ {0,1}^m×Q, Z1=1. (HSC) If instead we want to allow predictors wi to deviate from their assigned centroid vi, we can relax the hard clustering constraint using a regularizerΩ_SC(U, V) described below. This yields thesoft supervised clusteringproblem

minimize L(U) + Ω_SC(U, V) + Ω(V)

subject to V =CZ^T, Z ∈ {0,1}^m×Q, Z1=1. (SSC) Given Q clusters Cq, we let sq = |Cq|be the size of the q^th cluster, with s = Z^T1 the corresponding vector. We writeΠ =I− _m¹11^T the centering matrix. As in [2], the clustering penaltyΩ_SC(W, V)can be decomposed into three separable terms as follows (see Figure1for an illustration).

• A measure of how large the barycenter of the centers¯c= _m¹ PQ

q=1sqcqis Ω_mean(V) =λ_mm

2 ||¯c||²2 = λm

2 Tr(V(I−Π)V^T).

• A measure of the variance between clusters Ω_between(V) = λ_b

2 XQ q=1

sq||cq−c¯||²2 = λ_b

2 Tr(VΠV^T).

• A measure of the variance within clusters Ω_within(U, V) = λw

2 XQ q=1

X

i∈Jq

||ui−cq||²2 = λw

2 ||U−V||²F.

2

(3)

Ω^within

Ω^between Ω^mean

0

FIGURE1. Decomposed clustering penalty onmitems.

We then simply add a penalty with parameterµon the Frobenius norm ofV,i.e.the norm of the centroids weighted by the number of items in each clustersq

Ω(V) = µ 2

XQ q=1

s_qkc_qk²2 = µ

2 Tr(V^TV).

We now detail the losses associated with each dimension of the data cube: tasks, features or sample points. Input samples are given by the matrixX = [x₁, . . . ,x_n]^T ∈R^n×d, labels by(y₁, . . . , y_n), andx^(j) refers to thej^thcoordinate ofx.

2.1. Clustering features. Given a regression or classification task, we want to reduce dimensionality by grouping together features which have a similar influence on the output. We present the linear regression case [4], which can be extended to logistic regression and classification. Imposing that all predictor coefficients within a cluster with (scalar) centroidcqare identical means the prediction function can be written f :x→PQ

q=1c_qP

j∈C_qx^(j), which leads to the following loss L(V) = 1

n Xn

i=1

l



y_i, XQ q=1

c_q X

j∈Cq

x^(j)_i



= 1 n

Xn i=1

l



y_i, Xd j=1

XQ q=1

Z_jqc_qx^(j)_i



, in the variableV =PQ

q=1Z_jqc_q =CZ^T ∈ R^dof predictor coefficients, quantized overQvalues. In this case, if we relax the hard clustering by imposing a soft clustering penalty, we lose the benefit of dimensionality reduction.

2.2. Clustering tasks. Given a set ofK supervised tasks like regression or binary classification, multitask learning aims at jointly solving these tasks, hoping that each task can benefit from the information given by other tasks [1,2,3]. For simplicity, we illustrate the case of multi-classification, which can be extended to the general multitask setting. When performing classification with one-versus-all majority vote, we train one binary classifier for each class vs all others. We define the total empirical loss as

L(U) = 1 n

XK k=1

Xn i=1

l(y_i,u^T_kx_i).

in the matrix variableU ∈R^d×K of classifier vectors (one per task).

2.3. Clustering sample points. Imagine for instance that we are interested in measuring the effect of two treatments, e.g. a medicine and a placebo, on uniformly distributed patients. Clearly, the regression function varies a lot between the groups of patients that have a different treatment. Now suppose that we do not know to which patients different treatments were given. Since patients are uniformly distributed, the groups cannot be predicted without supplementary information. We will use the effect of treatments to simultaneously

3

(4)

retrieve the groups of patients and the associated regression functions. Note that this setting is different from a mixture of experts model in the sense that the latent cluster assignment variable Z can only be estimated onceyis known and cannot be deduced from the input featuresX(cf.figure2).

162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215

patients different treatments were given. Since patients are uniformly distributed, the groups cannot be predicted without supplementary information. We will use the effect of treatments to simultaneously retrieve the groups of patients and the associated regression functions. Note that this setting is different from a mixture of experts model in the sense that the latent cluster assignment variableZ can only be estimated onceyis known and cannot be deduced from the input featuresX.

Z

X Y

Z

X Y

Figure 2: Learning multiple diverse predictors (left), mixture of experts model (right).

Given a regression or multi-classification task, we want to findQgroups of sample points to maximize the within-group prediction performance using a group specific predictor. This amount to producingQdiverse answers per sample point, considering only the best one. We thus learnQpre- dictors, each predictor having low error rate on some cluster of points. For simplicity, we illustrate the case of regression, which can be extended to multi-classification. We minimize the loss incurred by the best linear predictorfq :x!c^T_q xfor each point,i.e.

L(V) =L(CZ^T) = 1 n

Xn

i=1

q2{min1,...,Q}l(yi,c^T_q xi) = 1 n

Xn

i=1

l yi, XQ

q=1

Ziqc^T_q xi

! ,

in the matrix variableV 2R^d^⇥ⁿof predictor vectors (one per sample point), whereC2R^d^⇥^Qand Z2 {0,1}^n⇥Qare the centroid and assignment matrices defined above.

Loss Dim.U, V Predictors Goal

Features _n¹Pn

i=1l⇣PQ q=1

Pd

j=1Zjqcqxij

⌘ 1⇥d rows ofW regression

Tasks _n¹PK

k=1

Pn

i=1l(w^T_k xi) d⇥K columns ofW classification Samples _n¹Pn

i=1l⇣PQ

q=1Ziqc^T_q xi

⌘ d⇥n W⁽ⁱ⁾ regression

Table 1: Summary of the presented supervised clustering settings.

3 Algorithms

We now present optimization strategies to solve these supervised clustering problems. We begin by simple greedy procedures, then propose a non-convex projected gradient descent scheme and finally a more refined convex relaxation solved using conditional gradient and approximations to k-means.

3.1 Simple strategies

A straightforward strategy is to first minimize on predictors, as in a classical supervised learning problem, and then cluster predictors together using k-means. The procedure can be repeated in the case of a soft clustering penalty. In the same spirit, when clustering sample points, one can alternate minimization on the predictors of each group and assignment of each point to the group where its loss is smallest. These methods are fast but very dependent on the initialization. Alternating minimization can optionally be used to refine the solution of the more robust algorithms proposed below.

3.2 Projected gradient descent

A natural strategy is to do a projected gradient descent on the non-convex problems (HSC) or (SSC).

The projection of a matrixV is made by finding argmin

Z,C kV CZ^Tk²F = argmin XQ

q=1

X

i2Cq

kvi cqk²2,

FIGURE2. Learning multiple diverse predictors (left), mixture of experts model (right).

Given a regression or multi-classification task, we want to findQgroups of sample points to maximize the within-group prediction performance using a group specific predictor. This amount to producingQdiverse answers per sample point, considering only the best one. We thus learnQpredictors, each predictor having low error rate on some cluster of points. For simplicity, we illustrate the case of regression, which can be extended to multi-classification. We minimize the loss incurred by the best linear predictorfq :x→ c^T_q x for each point,i.e.

L(V) =L(CZ^T) = 1 n

Xn i=1

q∈{1,...,Q}min l(yi,c^T_q xi) = 1 n

Xn i=1

l



y_i, XQ q=1

Ziqc^T_q xi



,

in the matrix variable V ∈ R^d×n of predictor vectors (one per sample point), where C ∈ R^d×Q and Z ∈ {0,1}^n×Qare the centroid and assignment matrices defined above.

Loss Dim.U, V Predictors Goal

Features _n¹ P_n

i=1lPQ q=1

P_d

j=1Z_jqc_qx_ij

1×d rows ofW regression Tasks _n¹ PK

k=1

Pn

i=1l(w^T_k xi) d×K columns ofW classification Samples ¹_nPn

i=1lPQ

q=1Z_iqc^T_q x_i

d×n W⁽ⁱ⁾ regression

TABLE1. Summary of the presented supervised clustering settings.

3. ALGORITHMS

We now present optimization strategies to solve these supervised clustering problems. We begin by simple greedy procedures, then propose a non-convex projected gradient descent scheme and finally a more refined convex relaxation solved using conditional gradient and approximations to k-means.

3.1. Simple strategies. A straightforward strategy is to first minimize on predictors, as in a classical supervised learning problem, and then cluster predictors together using k-means. The procedure can be repeated in the case of a soft clustering penalty. In the same spirit, when clustering sample points, one can alternate minimization on the predictors of each group and assignment of each point to the group where its loss is smallest. These methods are fast but very dependent on the initialization. Alternating minimization can optionally be used to refine the solution of the more robust algorithms proposed below.

4

(5)

3.2. Projected gradient descent. A natural strategy is to do a projected gradient descent on the non-convex problems (HSC) or (SSC). The projection of a matrixV is made by finding

argmin

Z,C kV −CZ^Tk²F = argmin XQ q=1

X

i∈Cq

kv_i−c_qk²2,

where the minimum is taken over centroidsci and partitions (C¹, . . . ,C^Q). This can be solved with the k- means++ algorithm which performs alternate minimization on the assignments and the centroids. Although it is a non-convex problem, k-means++ gives general approximation bounds on its solution [9].

Writing k-means++(V, Q)the approximate solution of the projection. Writingφthe objective function and using a backtracking line search for the stepsizeαt, the full procedure is summarized in Algorithms1 and2. Details of gradient computations for each setting are given in the appendix.

Algorithm 1Proj. Gradient Descent (SSC) Input: X, y, Q, , λb, λw, λm, µ

InitializeW₀ =V₀ = 0

while|φ(Wt, Vt)−φ(Wt−1, Vt−1)| ≥do Wt+1=Wt−αt(∇L(Wt) +∇ΩS,C(Vt, Wt)) V_t+1

2 =V_t−α_t∇Ω_S,C(V_t, W_t) [Zt+1, Ct+1] =k-means++(V_t+¹

2, Q) V_t+1=C_t+1Z_t+1^T

end while

Z^∗andC^∗ are given through Kmeans++

Output: W^∗, V^∗, Z^∗, C^∗

Algorithm 2Proj. Gradient Descent (HSC) Input: X, y, Q, , µ

InitializeV0 = 0

while|φ(V_t)−φ(Vt−1)| ≥do V_t+¹

2 =Vt−αt(∇L(V) +∇Ω(V)) [Z_t+1, C_t+1] =k-means++(V_t+¹

2, Q) Vt+1 =Ct+1Z_t+1^T

end while

Z^∗andC^∗are given through Kmeans++

Output: V^∗, Z^∗, C^∗

3.3. Convex relaxation using approximate conditional gradient . Another algorithmic approach to the supervised clustering problem is to minimize with respect to the assignment matrixZusing the Frank-Wolfe algorithm (a.k.a. conditional gradient, [12,13]). Considering the squared lossl(f(x), y) = ¹₂(y−f(x))², we use the analytic form of the minimization in(W, V)orV to rewrite the problem. The clustering is then captured in terms of the equivalence matrixM = Z(Z^TZ)⁻¹Z^T, which satisfies Mij = 1/|Cq|if itemi andjare in the same clusterqandM_ij = 0otherwise.

We describe here the simple case of supervised clustering of features in a regression task, introduced in§2.1. Detailed computations and explicit procedures for all settings are given in the appendix. When clustering features, we optimize over the coefficients associated with each feature and the k-means step can be performed exactly using dynamic programming [10,11].

The lossLcan be written here

L(V) =L(CZ^T) = 1 2n

Xn i=1



y_i− Xd j=1

XQ q=1

Zjqcqx^(j)_i





2

,

whereC= [c1, . . . , cQ]∈R^1×Q, hence the regularized loss becomes φ(C, Z) = 1

2n Xn i=1

y_i−CZ^Tx_i2

+µ

2kCZ^Tk²2

= 1

2nTr(y^T y) + 1

2nTr(CZ^TX^TXZC^T)− 1

nTr(CZ^TX^T y) +µ

2 Tr(CZ^TZC^T).

5

(6)

Minimizing inCand using the Sherman-Woodbury-Morrison formula we get G(M) := min

C L(CZ^T) + µ

2kCZ^Tk²2

= 1

2n y y^T I−XZ(Z^TX^TXZ+µnZ^TZ)⁻¹Z^TX^T y

= 1

2nTr y y^T

I+ 1

nµXM X^T −1!

.

DefiningMas the set of equivalence matrices of the formM =Z(Z^TZ)⁻¹Z^T, forZan assignment matrix, each iteration of the conditional gradient method requires solving an affine minimization subproblem over hull(M). The setMbeing discrete, we haveargmin_M∈hull(M)Tr(M∇G(M)) = argmin_M_∈MTr(M∇G(M)).

WritingP =−∇G(M), we get P = 1

2n²µX^T

I+ 1

nµXM X^T −1

y y^T

I+ 1

nµXM X^T −1

X,

which is always semidefinite positive. WritingP¹² the matrix square root ofP we have argmin

M∈M

Tr(M∇G(M)) = argmin

M∈M −Tr(M P¹²P¹²^T)

= argmin

M∈M

Tr((I−M)P¹²P¹²^T))

= argmin

M∈M kP¹² −M P¹²k²F

= argmin

Z

minC kP¹² −ZC^Tk²F,

where we recognize the k-means problem defined above. In fact, in this particular case, the k-means subproblem is one-dimensional and can be solved exactly using dynamic programming [10,11]. We use the classical stepsize for conditional gradientα_k= _k+2² and Frank-Wolfe rounding. The procedure is summarized in Algorithm3.

Algorithm 3Conditional gradient on the equivalence matrix Input: X, y, Q,

InitializeM0∈ M

while|G(M_k)−G(Mk−1)| ≥do

Compute the matrix square rootP¹² of−∇G(M_k) Get oracle∆_k=k-means(P¹², Q)

M_k+1=M_k+α_k(∆_k−M_k) end while

Use Frank Wolfe rounding to get a solutionM^∗∈ M Z^∗is given by k-means

C^∗is given by the analytic solution of the minimization forZ^∗fixed Output: C^∗, Z^∗, M^∗

3.4. Complexity. The core complexity of Algorithms1and2is concentrated in the inner k-means subproblem, which standard alternating minimization approximates inO(tQS), wheretis the number of alternating steps,Qis the number of clusters, andSis the product of the dimensions ofV (see Table1). Using a proper conditioning of the gradient, the number of iterations before convergence is typically below 100, which make Algorithms1and2both fast and scalable. For Algorithm3, we also need to compute a matrix square

6

(7)

root of the gradient at each iteration, which can slow down computations for large datasets. The choice of the number of clusters can be done given an a priori on the problem (e.g.if we know the hierarchical structure of classes in a classification problem), or cross-validation, idem for the other regularization parameters.

4. NUMERICALEXPERIMENTS

4.1. Synthetic dataset.

Supervised clustering of sample points. We generatendata points(xi, yi)fori = 1, . . . , nwithxi ∈ R^d andy_i ∈R, divided in two clusters corresponding to regression tasks with weightsw₁andw₂. Regression labels for pointsxiin clusterqare given byyi =w^T_q xi+η, whereη ∼ N(0, σ²). We test the robustness of the algorithms to noise dimensions,i.e.we completexi withdndimensions of noiseη_d ∼ N(0, σ_d). The results are reported in Table2.

Here, the intrinsic dimension is 10. “Oracle” refers to the least square fit given the true assignments. It can be seen as the best error rate that can be achieved. PGK refers to projected gradient, OM refers to conditional gradient onM, AM refers to alternate minimization. PGK and OM were followed by AM refinement. 200 points were used for training, 200 for testing. The regularization parameter was set to λ = 10⁻² for all experiments. Noise on labels and added dimensionsσy =σd= 2×10⁻¹. Results were averaged over 100 experiments, figures after the sign±correspond to one standard deviation.

dim-noise=0 dim-noise=100 dim-noise=300 dim-noise=900 Oracle 4.2±0.5 29.5±7.0 17.3±6.2 81.6±30.0

PGK 5.8±15.1 251.0±154.4 187.5±98.7 275.2±167.2 OM 4.3±0.5 254.4±144.6 209.2±93.5 220.9±93.0 AM 50.4±121.1 897.0±456.1 445.3±207.1 455.4±192.3

TABLE2. Supervised clustering of sample points.

It appears that that PGK and OM give very similar results, significantly improving on the naive alternating minimization scheme. In view of standard deviations, OM seems more robust.

Supervised clustering of features. We generatendata points (xi, yi) for i = 1, . . . , n with xi ∈ R^dand y_i ∈ R. Regression weights have only 10 different valuesw_q for q = 1, . . . ,10, uniformly distributed around 0. Regression labels are given byyi = w^T xi+η, whereη ∼ N(0, σ²). We test the robustness of the algorithms to the number of learning examples, i.e.the size of the training set n. We measure thel₂ norm of the difference between the true vector of weights and the estimated ones.

We compare in Table 3 the proposed algorithms to OSCAR [4], Ridge, Lasso and Ridge followed by k-means on the weights (using associated centroids as predictors). “Oracle” refers to the mean square fit given the true assignments of features. It can be seen as the best error rate that can be achieved. PGK refers to projected gradient and was intitialized with the solution of Ridge followed by k-means, OM refers to conditional gradient onM and was followed by PGK. Noise on labels is set toσ = 10⁻¹. Algorithm parameters were all cross-validated using a logarithmic grid. Results were averaged over 30 experiments and figures after the sign±correspond to one standard deviations. Results were multiplied by 100 to shorten the table.

PGK and OM appear to give significantly better results than other methods and even reach the performance of the Oracle forn > d, while forn≤dresults are in the same range. Note that contrary to OSCAR, Lasso and Ridge, reduction of dimensionality is guaranteed (the number of clusters is fixed).

4.2. Real data.

7

(8)

n=50 n=75 n=100 n=150 n=200 Oracle 3.3±1.2 2.6±0.9 2.2±0.5 1.8±0.4 1.5±0.4

PGK 72.9±2.7 58.0±2.7 41.6±12.1 1.9±1.0 1.5±0.4

OM+PGK 73.1±2.5 55.9±3.5 43.5±7.9 4.6±7.4 1.5±0.4 Ridge+Kmeans 72.1±1.6 58.6±2.7 46.9±20.6 7.7±4.1 2.4±1.0 OSCAR 80.3±7.5 65.9±1.4 32.8±4.2 14.0±1.9 9.9±0.9 Lasso 91.2±0.6 81.9±4.2 56.1±30.5 13.9±1.7 10.0±1.0 Ridge 69.9±0.5 58.0±0.8 33.0±5.8 14.0±1.6 10.1±0.9

TABLE3. Supervised clustering of features.

4.2.1. 20NewsGroup classification. We perform classification on 2800 documents extracted from the publicly available 20NewsGroup dataset, which contains 20 different newsgroups, each one corresponding to a different topic. Some of the newsgroups are very closely related to each other (e.g.comp.sys.ibm.pc.hardware and comp.sys.mac.hardware), while others are highly unrelated (e.g.misc.forsale and soc.religion.christian).

Our goal is to retrieve clusters of related newsgroups, while performing competitive classification. We used a dictionary of 5000 words chosen by tf-idf and took the empirical distribution over words as covariates for each document.

In Table4, we compare our approach to other classical regularizations such as the Frobenius norm and the trace norm, as implemented in [3], using either a ridge or a logistic loss. As the projected gradient descent scheme is not convex we initialize it by the solution given by the logistic loss non regularized. All algorithms were 5-fold cross-validated on 80% of the data then tested on the remaining 20%. The number of clusters was set to 5, as suggested by the names of newsgroups. Figures after the sign±correspond to one standard deviation when varying the training and test sets. As reported in Table4, it appears that the clustering regularization helps prediction on topics.

Frobenius Logistic Trace Ridge OM Ridge PGK Logistic

5.8±2.0 4.3±0.8 4.9±0.8 2.6±1.5

TABLE4. 100×mean absolute errors for predicting topics on 20NewsGroup dataset.

4.2.2. Predicting ratings associated with reviews using groups of words. We perform “sentiment” analysis of newspaper movie reviews. We use the publicly available dataset introduced in [14] which contains movie reviews paired with star ratings. We treat it as a regression problem, taking log-responses fory in (0,1) and the empirical distribution over words as covariates. With a 5000 term vocabulary chosen by tf-idf, the corpus contains 5006 documents and comprises 1.6M words. We evaluate our algorithms for regression with clustered features against standard regression approaches: Lasso, Ridge, and Ridge followed by k- means on predictors. All algorithms were 5-fold cross-validated on 80% of the dataset and then tested on the remaining 20%. The number of clusters was arbitrarily set to 10, though we did not notice significant changes when varying it. Results are reported in Table5, figures after the sign±correspond to one standard deviation when varying the training and test sets. While all methods including ours give 10% mean absolute errors on the test set (up to 0.5% accuracy), our approach has the additional benefit of providing clusters of words which have a similar influence. Moreover we noticed that the clusters with highest absolute weights are also the ones with smallest number of words, which confirms the intuition that only a few words are very discriminative. We illustrate this in Table6, listing words of the first and last clusters.

8

(9)

PGK OM+PGK Ridge+Kmeans Lasso Ridge 9.7±0.1 9.8±0.1 10.0±0.1 9.8±0.1 9.9±0.1

TABLE5. 100×mean absolute errors for predicting movie ratings associated with reviews.

Cluster 1 stupidity,shred,inept,sophomoric,sit,drivel,uninteresting,clue,ludicrous,tedious,single, (negative) remotely,crass,tolerable,bland,predictable,devoid,waste,fail,embarrassing,mess,idea,inane, size 47 lifeless,abysmal,badly,disgusting,bearable,poorly,care,leaden,annoying,turkey,attempt,

interesting,problem,lame,lack,worse, unfunny,watchable,worst,ridiculous,flat,dull,boring, awful,suppose,bad

Cluster 10 perfect,fascinating,great,powerful,hilarious,fine,brilliant,delightful,strongly,easy,rare, (positive) wonderfully, recommendation,mature,nomination,simple,perfectly,goodspirited, size 87 masterpiece,strong,potent,marvelous,excellent,fortunately,world,intelligent,memorable,

unique,send,small,funniest,unexpected,carefully,academy,superbly,important,life,poignancy, speak,wonderful,outstanding,mesmerizing,man,imaginative,simplicity,confidence,delightfully,

delight,touching, nuanced,study,melville,brilliance,work,picture,honest,oscar, classification,masterfully,favorite,effective,masterful,intense,bind,unrated,surprisingly,superb,

triumph,flawless,move,lyrical,gem,devastate,intricate,highly,subtle,power,accomplishment, keen,uplifting,empathize,element,quibble,minor,love,finest,crisp

TABLE6. Supervised clustering of words on movie reviews.

5. DISCUSSION

We have developed a unified framework for supervised clustering over tasks, features or samples and provided two robust algorithms for solving the associated optimization problems. Results on synthetic and realistic text datasets suggest that our method is competitive against standard regression and classification methods, while having the additional benefit of providing clusters of tasks, features or samples. Similarly as in compressed sensing, optimization is made on a union of subspaces, hence in analogy with RIP conditions for iterative hard thresholding [15], it would be interesting to see under which assumptions our algorithms can recover the optimal solution to the supervised clustering problem.

6. APPENDIX

We give here detailed computations of the function Gcorresponding to the convex relaxation defined in3.3in all settings.

We always supposeZ^TZ inversiblei.e.∀q, sq 6= 0 (there is no empty cluster). For any integerp, we let1p ∈ R^p be the vector whose coordinates are all ones, Ip the identity matrix inR^p×p, Γp = ¹^p_p¹^T^p and Πp =Ip−Γpthe corresponding centering matrix.

For all settings input samples are represented by the matrixX= (x₁, ...,x_n)^T ∈R^n×dand their respec- tive labels byy = (y1, ....yn) ∈ Y. For regression problemsY = R. Classification problems are casted into the multitask setting, each class corresponding to a task. We denote byy^k = (y₁^k, ..., y^k_n)the vector of binary labels corresponding to the classk∈[[1, K]]andY = (y¹, ...,y^K).

6.1. Minimization in C for Soft Clustering Problems . For soft supervised clustering problems, the minimization in the variableCforW, Z fixed can be made without assumptions on the specific lossL. We let (δ, m)be the dimensions of the predictors of interest, such thatW ∈R^δ×m,C ∈R^δ×QandZ ∈ {0,1}^m×Q. We denote byλ_W, λ_B, λ_M the weights associated respectively to the within, between and mean penalties.

Minimization inCis then given by

9

(10)

H(W, Z) := min

C∈R^δ×Q

λW

2 kW −CZ^Tk²F +λB

2 Tr(CZ^TΠmZC^T) +λM

2 Tr(CZ^TΓmZC^T)

= min

C∈R^δ×Q

λ_W

2 kWk²F +1

2Tr C (λ_W +λ_B)Z^TZ+ (λ_M −λ_B)Z^TΓ_mZ C^T

−λ_WTr(C^TW Z)

= λ_W

2 Tr(W W^T)− λ²_W 2 Tr

W Z (λ_W +λ_B)ZZ^T + (λ_M −λ_B)Z^TΓ_mZ⁻¹

Z^TW . Lets = Z^T1_m ∈ R^Q be the vector whose coordinatess_q are the sizes of the clusters, denoting bys¹² the vector whose coordinates are√

s_q, and bys⁻¹² the vector whose coordinates are ^√¹_s

q, we can derive the inversion in the precedent formula,

J⁻¹= (λ_W +λ_B)ZZ^T + (λ_M −λ_B)Z^TΓ_mZ⁻¹

=

(λ_W +λ_B)diag(s) + (λ_M −λ_B)ss^T m

⁻¹

= 1

λW +λB

diag(s⁻¹²)



I_Q+ λM −λB

λW +λB

s¹²s¹²^T m





−1

diag(s⁻¹²)

= 1

λ_W +λ_Bdiag(s⁻¹²)



I_Q− λM −λB

λ_W +λ_M s¹²s¹²^T

m



diag(s⁻¹²)

= 1

λW +λB

(Z^TZ)⁻¹− λ_M −λ_B (λW +λM)(λW +λB)

1Q1^T_Q m

= 1

λW +λB

((Z^TZ)⁻¹−1_Q1^T_Q

m ) + 1 λW +λM

1_Q1^T_Q m , where we used thatp= ^s

1 2s¹²

T

m = ^s

1 2s¹²

T

ks¹²k²₂ is a projector and therefore(I+αp)⁻¹ =I−_α+1^α p, added to the fact thatdiag(s) =Z^TZ.

Now introducing the equivalence matrix M = Z(Z^TZ)⁻¹Z^T, and using thatZ¹^Q¹

T Q

m Z^T = Γm, we finally obtain

H(W, Z) = λW

2 Tr(W W^T)−λ²_W 2 Tr

W

1

λ_W +λ_B(M −Γm) + 1

λ_W +λ_MΓm

W^T

= λ_W

2 Tr W (I_m−α(M−Γ_m)−βΓ_m)W^T , whereα= _λ ^λ^W

W+λB andβ= _λ ^λ^W

W+λM.

DenotingP =Im−α(M−Γm)−βΓm, one can also use Kronecker’s formula : H(W, Z) = λW

2 Vec(W)^T(P⊗Iδ)Vec(W).

6.2. Clustered Multitask Learning . We derive here the computation of G when clustering tasks. We restrict ourselves to the case of multiclass setting cast as a multitask problem such that each task shares the same input data. GivenKclasses, we denote byW = (w1, ...,w_K)∈R^d×Kthe matrix of linear predictors.

Using a squared lossl(f(x), y) = ¹₂(y−f(x))²the empirical loss is given by :

10

(11)

L(W) = 1 2n

Xn i=1

XK k=1

(y_i^k−w^T_k xi)². Using Kronecker’s product formula, we get

L(W) = 1 2n

XK k=1

w^T_kX^TXwk−1 n

XK k=1

w^T_k X^Ty^k+ 1 2nkyk²2

= 1 2n

XK k=1

e^T_kW^TX^TXWe_k−1

nTr(W X^TY) + 1 2nkyk²2

= 1

2nVec(W)^T(IK⊗X^TX)Vec(W)− 1

nVec(W)^T Vec(X^TY) + 1

2nkyk²2.

Using the expression found by minimizing inCthe regularization penalty, we get an expression forG:

G(M) = min

W L(W) +H(W, Z)

= min

W

1

2nVec(W)^T(IK⊗X^TX+P ⊗Id)Vec(W)− 1

nVec(W)^TVec(X^TY) + 1 2nkyk²2

=− 1

2nVec(X^TY)^T I_K⊗X^TX+λ_WnP ⊗I_d−1

Vec(X^TY) + 1 2nkyk²2

Denote(v1, ...,vd) ∈ R^d×d,(λ1, ..., λd) ∈ R^dand(u1, ...,uS) ∈ R^S×S,(µ1, ..., µS) ∈ R^S the eigenvectors and corresponding eigenvalues respectively of matricesX^TXandP = (I_K−α(M−Γ_K) +βΓ_K).

The eigenvectors and corresponding eigenvalues ofI_K ⊗X^TX+λ_WnP ⊗I_d are (ui⊗vj)_i∈[[1,n]]

j∈[[1,d]]

and (λWnµi+λj)i∈[[1,n]]

j∈[[1,d]]

. The inversion in the expression ofGis then given by

J⁻¹= I_K⊗X^TX+λ_WnP ⊗I_d−1

= Xn i=1

Xd j=1

1 λWnµi+λj

u_iu^T_i ⊗v_jv^T_j.

Then we note that the set of eigenvectors ofP can be decomposed into three sets. Indeed the matrices I_K −M,M −Γ_K andΓ_K are orthogonal projectors on orthogonal subspaces spanning the entire space.

Denote byIW,IB,IM the sets of eigenvectors corresponding respectively toIK−M,M −ΓKandΓK, their corresponding eigenvalues inP can easily be computed and we obtain

P =I_K−M + (1−α)(M−Γ_K) + (1−β)Γ_K

= X

i∈I_W

u_iu^T_i + (1−α) X

i∈I_W

u_iu^T_i + (1−β) X

i∈I_M

u_iu^T_i .

This decomposition can be used for the inversion :

11

(12)

J⁻¹ = X

i∈IW

Xd j=1

1 λWn+λj

u_iu^T_i ⊗v_jv_j^T

+ X

i∈I_B

Xd j=1

1

λ_Wn(1−α) +λj

uiu^T_i ⊗vjv^T_j

+ X

i∈I_W

Xd j=1

1

λ_Wn(1−β) +λ_ju_iu^T_i ⊗v_jv_j^T

= (I_K−M)⊗(X^TX+nλ_WI_d)⁻¹+ (M−Γ_K)⊗(X^TX+nλ_W(1−α)I_d)⁻¹ + ΓK⊗(X^TX+nλW(1−β)I_d)⁻¹.

FinallyGcan be simplified using properties of the Kronecker product

G(M) =− 1

2nVec(X^TY)^T (IK−M)⊗(X^TX+nλWId)⁻¹

Vec(X^TY)

− 1

2nVec(X^TY)^T (M−ΓK)⊗(X^TX+nλW(1−α)I_d)⁻¹

Vec(X^TY)

− 1

2nVec(X^TY)^T Γ_K⊗(X^TX+nλ_W(1−β)I_d)⁻¹

Vec(X^TY) + 1 2nkYk²2

=− 1

2nTr(Y^TX(X^TX+nλ_WI_d)⁻¹X^TY(I_K−M))

− 1

2nTr(Y^TX(X^TX+nλ_W(1−α)I_d)⁻¹X^TY(M −Γ_K))

− 1

2nTr(Y^TX(X^TX+nλW(1−β)Id)⁻¹X^TYΓK) + 1

2nkYk²F. TherforeGis a linear function ofM whose gradient is given by

∇G(M) = 1

2nY^TX (X^TX+nλ_WI_d)⁻¹−(X^TX+nλ_W(1−α)I_d)⁻¹ X^TY.

As1≥α≥0, we get that−∇G(M)0.

For a fixedM,W is given using precedent computations and Kronecker’s formula by

WM = (X^TX+nλWId)⁻¹X^TY(IK−M)

+ (X^TX+nλW(1−α)Id)⁻¹X^TY(M−ΓK) + (X^TX+nλW(1−β)Id)⁻¹X^TYΓK.

6.3. Learning with clustered features. For more generality we cast the problem of learning clustered features in the multiclass learning setting. We restrict here to the soft supervised clustering problem as the hard one has already be done. We keep notations introduced in section 6.2 though we now have W =

12

(13)

(w1, ...,wK)^T ∈R^K×d. The empirical loss is given by L(W) = 1

2n Xn i=1

XK k=1

(y^k_i −w^T_k xi)².

= 1 2n

XK k=1

w^T_kX^TXw_k−1 n

XK k=1

w^T_k X^T y^k+ 1 2nkYk²F

= 1 2n

XK k=1

Tr(W^TW X^TX)− 1

nTr(W X^TY) + 1 2nkYk²F

Adding the regularization penalty and minimizing inCwe obtain

G(M) = min

W

1 2n

XK k=1

Tr W^TW(X^TX+λ_WnP)

− 1

nTr(W X^TY) + 1 2nkYk²2

= 1

2nkYk²2− 1

2n Y^TX(X^TX+λ_WnP)⁻¹X^TY

= 1

2nTr Y Y^T(I_n−X(X^TX+λ_WnP)⁻¹X^T

= 1 2nTr

Y Y^T(In+ 1

λWnXP⁻¹X^T)⁻¹

As detailed before, the inverse ofP can be found analytically by observing that it is composed of orthogonal projectors on orthogonal subspaces spanning the entire space. Hence we haveP⁻¹ =I_d−M+_1−α¹ (M− Γd) +_1−β¹ Γd. We now get

G(M) = 1 2nTr

Y Y^T(Id+ 1 nλB

X(M −Γd)X^T + 1 nλM

XΓdX^T)

Note that we still have−∇G(M)0. For a fixedM,W is given analitically by W_M =Y^TX X^TX+λ_Wn(I_d−α(M−Γ_d)−βΓ_d)⁻¹

.

6.4. Learning multiple predictors. In the setting of learning multiple predictors, we denote by C^q the set of points whose best linear predictor isw_q, having thereforeC1, ...,CQ a partition of[[1, n]]. We let as aboves_q be the cardinal of the setCq,X_q ∈R^s^q^×dis the matrix whose columns are the points belonging to the cluster q, and y_q is the column vector of labels corresponding to clusterq. We use a squared loss l(f(x), y) = ¹₂(y−f(x))².

ForC1, ...,CQfixed (or equivalentlyZfixed), we can computeGusing the Sherman-Woodbury-Morrison formula as

G(M) = min

w1,...,wQ

1 2n

XQ q=1

X

i∈C_q

(y_i−w^T_q x_i)²+µ 2

XQ q=1

s_qkw_qk²2

= XQ q=1

1 2ny^T_q

1

sqµnX_qX_q^T +I_n −1

y_q.

13

(14)

We defineE∈R^n×nthe permutation matrix permuting order of points such that

Ey=



 y_C₁

...

y_C_Q



 EX =



 X1

...

XQ



. We denote for q ∈ [[1, Q]], Rq = P

Pq−1

p=1sp≤i≤Pq

p=1speie^T_i ortohgonal projectors on the ordered sets of points belonging to clusterqsuch that





X₁X₁^T 0 0

0 ... 0

0 0 XQX_Q^T



=diag(EXX^TE^T) = XQ q=1

RqEXX^TE^TRq.

Thus we get

G(M) = 1

2ny^TE^T



 XQ q=1

1

sqµnR_qEXX^TE^TR_q+I_n





−1

Ey

= 1 2ny^T



 XQ q=1

1

sqµnE^TRqEXX^TE^TRqE+In





−1

y.

Then we notice thatE^TR_qE = diag(Z_q)i.e.the projector on the set of points belonging to clusterq, and thatPQ

q=1 1

sqdiag(ZQ)XX^T diag(Zq) =M◦XX^T, where◦denotes the Hadamard product. Hence we finally get

G(M) = 1 2ny^T

1

µnXX^T ◦M+I_n −1

y. Its gradient is given by

2n∇G(M) =− 1

µnXX^T ◦

(I_n+ 1

µnXX^T ◦M)⁻¹y y^T(I_n+ 1

µnXX^T ◦M)⁻¹

, for which we have−∇G(M)0.

For a fixedZ, denoting byX_q =Z_q^TXthe set of points belonging to clusterq, the linear predictorsw_q for each cluster of points are given by

w_q= (nµs_qI_d+X_q^TX_q)⁻¹X_q^T y_q.

Acknowledgements. AA is at CNRS, at the Département d’Informatique at École Normale Supérieure, 23 avenue d’Italie, 75013 Paris, France. INRIA, Sierra project-team, PSL Research University. The authors would like to acknowledge support from a starting grant from the European Research Council (ERC project SIPA), an AMX fellowship, the MSR-Inria Joint Centre, as well as support from the chaire Économie des nouvelles données, thedata sciencejoint research initiative with thefonds AXA pour la rechercheand a gift from Société Générale Cross Asset Quantitative Research.

14