SUPERVISED CLUSTERING IN THE DATA CUBE
VINCENT ROULET, FAJWEL FOGEL, ALEXANDRE D’ASPREMONT, AND FRANCIS BACH
ABSTRACT. We study a supervised clustering problem seeking to cluster either features, tasks or sample points using losses extracted from supervised learning problems. We formulate a unified optimization problem handling these three settings and derive algorithms whose core iteration complexity is concentrated in a k-means clustering step, which can be approximated efficiently. We test our methods on both artificial and realistic data sets extracted from movie reviews and 20NewsGroup.
1. INTRODUCTION
Using information from a supervised problem like regression or classification, supervised clustering seeks to discover hidden clusters of features, tasks or samples that both help inference and provide additional structural insights on the data.
In the classical multi-task setting, clustering predictors of related tasks often helps prediction. In computer vision for instance, classifiers associated with different species of cats should be quite similar, but well separated from classifiers associated with cars, and incorporating this structural information at the training stage should improve performance. This problem has been well studied in the multi-task learning literature [1,2,3].
Similarly, when there exist groups of highly correlated features, reducing dimensionality by assigning the same weights to some groups of features can be beneficial both in terms of prediction and interpretation [4]. This often occurs in text classification, where it is natural to group together words having the same meaning for a given task [5,6].
Finally, in some settings, it can be valuable to cluster sample points, with each cluster having its own distinct prediction function [7]. This last problem may have applications in the context of privacy learning, where the group information of an individual may not be revealed because of confidentiality issues.
Here, we present a unified and flexible framework for supervised clustering over the “data cube” of either tasks, features or sample points (a representation introduced by [8]). We directly formulate supervised clustering as an optimization problem on the clustered predictors, where clustering is either a hard constraint or a soft regularization penalty, and losses are adapted to the application at hand (classification or regression, clustering tasks, features or sample points). We propose several optimization schemes to solve these problems efficiently. While the original optimization problem is non-convex, we show that the core non-convexity is concentrated in a subproblem similar to k-means, which we solve using classical approximation techniques [9]. In the particular case of feature clustering for regression, the k-means steps are performed in dimension one, and can therefore be solved exactly by dynamic programming [10,11]. Our formulation is then an explicit convex relaxation which can be solved efficiently using the conditional gradient algorithm [12,13]. We describe experiments on both synthetic and real datasets involving large corpora of text where our method compares favorably with standard benchmarks.
Date: May 11, 2019.
Key words and phrases. Clustering, Multitask, Dimensionality Reduction, Multi Output.
The first two authors contributed equally.
arXiv:1506.04908v1 [cs.LG] 16 Jun 2015
2. SUPERVISED CLUSTERING OF FEATURES, TASKS OR SAMPLES
We write supervised clustering as a clustering problem penalized by a supervised learning loss. We let $n$ be the number of training examples in the dataset, $d$ the number of features (i.e. the ambient dimension) and $K$ the number of tasks. For regression or binary classification $K = 1$, while for multi-classification $K$ corresponds to the number of classes (using one-versus-all majority vote, training one binary classifier per class vs all others). For simplicity, we consider only square or logistic losses for linear predictors.
Our main variable is a matrix of predictors with one column per task, written $W = [w_1, \dots, w_K] \in \mathbb{R}^{d\times K}$. We let $Q$ be the desired number of clusters and $m$ the number of items to cluster. When clustering tasks, $m = K$ and we are grouping the predictors associated with each task, i.e. the columns of $W$. When clustering features, $m = d$ and we are clustering the predictors associated with each feature, i.e. the rows of $W$. Finally, when clustering sample points, $m = n$ and we introduce individual predictor vectors $W^{(i)}$ for each sample point $i$, which we cluster together to obtain $Q$ distinct predictors associated with a partition of the points into $Q$ groups. Hence, the term “predictors” either refers to columns of $W$, rows of $W$, or individual predictor vectors $W^{(i)}$, depending on which dimension of the data cube (tasks, features or sample points) the clustering is performed. A summary of these settings is given in Table 1. In the following we use the generic notations $U$ and $V$ to designate predictors.
A clustering can be seen as a partition $\mathcal{C}_1 \cup \mathcal{C}_2 \cup \dots \cup \mathcal{C}_Q = [1, m]$ of predictors, to which corresponds an assignment matrix $Z \in \{0,1\}^{m\times Q}$ such that $Z_{ij} = 1$ if item $i$ is in cluster $\mathcal{C}_j$. As in k-means, we define a matrix of centroids $C = [c_1, \dots, c_Q]$, with each centroid $c_j$ equal to the mean of predictors in cluster $j$. This information is summarized in the matrix $V = CZ^T$, which we call matrix of individual centroids (MIC), whose columns $v_i$ are such that $v_i = c_j$ if predictor $i$ is in cluster $\mathcal{C}_j$.
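The objects just defined can be made concrete with a small sketch. The helper names (`assignment_matrix`, `centroids`, `individual_centroids`) and the toy data are assumptions introduced for illustration only; predictors are stored as a list of vectors rather than as matrix columns.

```python
def assignment_matrix(labels, Q):
    """Return Z as an m x Q list of lists with Z[i][q] = 1 iff item i is in cluster q."""
    return [[1 if lab == q else 0 for q in range(Q)] for lab in labels]


def centroids(U, labels, Q):
    """U is a list of m predictors (lists of floats); return the Q cluster means."""
    dim = len(U[0])
    sums = [[0.0] * dim for _ in range(Q)]
    sizes = [0] * Q
    for u, lab in zip(U, labels):
        sizes[lab] += 1
        for k in range(dim):
            sums[lab][k] += u[k]
    # Assumes no empty cluster, as in the text.
    return [[x / sizes[q] for x in sums[q]] for q in range(Q)]


def individual_centroids(C, Z):
    """Columns of V = C Z^T as a list: V[i] is the centroid assigned to item i."""
    return [C[row.index(1)] for row in Z]


U = [[0.0, 1.0], [0.2, 0.8], [4.0, 4.0], [4.2, 3.8], [0.1, 0.9]]
labels = [0, 0, 1, 1, 0]          # hypothetical partition into Q = 2 clusters
Z = assignment_matrix(labels, 2)
C = centroids(U, labels, 2)
V = individual_centroids(C, Z)
```

Every row of `Z` contains exactly one 1, which is the constraint $Z\mathbf{1} = \mathbf{1}$, and items in the same cluster share the same entry of `V`.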
Given a supervised learning loss $L(\cdot)$ and a regularizing penalty $\Omega(\cdot)$ on predictors, we formulate the supervised clustering problem as an optimization problem over the set of MIC matrices $V$. Enforcing the hard constraint that predictors coincide exactly with their centroids, we get the following hard supervised clustering problem (HSC)
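\[
\begin{array}{ll}
\text{minimize} & L(V) + \Omega(V) \\
\text{subject to} & V = CZ^T,\quad Z \in \{0,1\}^{m\times Q},\quad Z\mathbf{1} = \mathbf{1}.
\end{array}
\tag{HSC}
\]
If instead we want to allow predictors $u_i$ to deviate from their assigned centroid $v_i$, we can relax the hard clustering constraint using a regularizer $\Omega_{SC}(U, V)$ described below. This yields the soft supervised clustering problem
\[
\begin{array}{ll}
\text{minimize} & L(U) + \Omega_{SC}(U, V) + \Omega(V) \\
\text{subject to} & V = CZ^T,\quad Z \in \{0,1\}^{m\times Q},\quad Z\mathbf{1} = \mathbf{1}.
\end{array}
\tag{SSC}
\]
Given $Q$ clusters $\mathcal{C}_q$, we let $s_q = |\mathcal{C}_q|$ be the size of the $q$th cluster, with $s = Z^T\mathbf{1}$ the corresponding vector. We write $\Pi = I - \frac{1}{m}\mathbf{1}\mathbf{1}^T$ the centering matrix. As in [2], the clustering penalty $\Omega_{SC}(U, V)$ can be decomposed into three separable terms as follows (see Figure 1 for an illustration).
• A measure of how large the barycenter of the centers $\bar{c} = \frac{1}{m}\sum_{q=1}^Q s_q c_q$ is:
\[ \Omega_{\mathrm{mean}}(V) = \frac{\lambda_m m}{2}\,\|\bar{c}\|_2^2 = \frac{\lambda_m}{2}\operatorname{Tr}\big(V(I-\Pi)V^T\big). \]
• A measure of the variance between clusters:
\[ \Omega_{\mathrm{between}}(V) = \frac{\lambda_b}{2}\sum_{q=1}^Q s_q\,\|c_q-\bar{c}\|_2^2 = \frac{\lambda_b}{2}\operatorname{Tr}\big(V\Pi V^T\big). \]
• A measure of the variance within clusters:
\[ \Omega_{\mathrm{within}}(U, V) = \frac{\lambda_w}{2}\sum_{q=1}^Q\sum_{i\in\mathcal{C}_q}\|u_i-c_q\|_2^2 = \frac{\lambda_w}{2}\,\|U-V\|_F^2. \]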
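The two trace identities above follow from the definitions of $V = CZ^T$ and $\Pi$. A short verification, using only the notation already introduced, may help the reader:

```latex
\begin{align*}
V\mathbf{1} &= \sum_{i=1}^m v_i = \sum_{q=1}^Q s_q c_q = m\,\bar{c}, \\
\operatorname{Tr}\big(V(I-\Pi)V^T\big)
  &= \tfrac{1}{m}\operatorname{Tr}\big(V\mathbf{1}\mathbf{1}^TV^T\big)
   = \tfrac{1}{m}\,\|V\mathbf{1}\|_2^2 = m\,\|\bar{c}\|_2^2, \\
\operatorname{Tr}\big(V\Pi V^T\big)
  &= \|V\Pi\|_F^2 = \sum_{i=1}^m \|v_i-\bar{c}\|_2^2
   = \sum_{q=1}^Q s_q\,\|c_q-\bar{c}\|_2^2 .
\end{align*}
```

The last line uses that $\Pi$ is symmetric and idempotent, and that column $i$ of $V\Pi$ is $v_i - \bar{c}$.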
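[Figure 1: diagram with labels $\Omega_{\mathrm{within}}$, $\Omega_{\mathrm{between}}$, $\Omega_{\mathrm{mean}}$ and the origin 0.]
FIGURE 1. Decomposed clustering penalty on $m$ items.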
We then simply add a penalty with parameter $\mu$ on the Frobenius norm of $V$, i.e. the norm of the centroids weighted by the number of items $s_q$ in each cluster:
\[ \Omega(V) = \frac{\mu}{2}\sum_{q=1}^Q s_q\,\|c_q\|_2^2 = \frac{\mu}{2}\operatorname{Tr}(V^TV). \]
We now detail the losses associated with each dimension of the data cube: tasks, features or sample points. Input samples are given by the matrix $X = [x_1, \dots, x_n]^T \in \mathbb{R}^{n\times d}$, labels by $(y_1, \dots, y_n)$, and $x^{(j)}$ refers to the $j$th coordinate of $x$.
2.1. Clustering features. Given a regression or classification task, we want to reduce dimensionality by grouping together features which have a similar influence on the output. We present the linear regression case [4], which can be extended to logistic regression and classification. Imposing that all predictor coefficients within a cluster with (scalar) centroid $c_q$ are identical means the prediction function can be written $f : x \mapsto \sum_{q=1}^Q c_q \sum_{j\in\mathcal{C}_q} x^{(j)}$, which leads to the following loss
\[
L(V) = \frac{1}{n}\sum_{i=1}^n l\Big(y_i,\ \sum_{q=1}^Q c_q \sum_{j\in\mathcal{C}_q} x_i^{(j)}\Big)
     = \frac{1}{n}\sum_{i=1}^n l\Big(y_i,\ \sum_{j=1}^d\sum_{q=1}^Q Z_{jq}\,c_q\,x_i^{(j)}\Big),
\]
in the variable $V = CZ^T \in \mathbb{R}^d$ of predictor coefficients, with entries $v_j = \sum_{q=1}^Q Z_{jq}c_q$ quantized over $Q$ values. In this case, if we relax the hard clustering by imposing a soft clustering penalty, we lose the benefit of dimensionality reduction.
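The dimensionality reduction can be seen directly: with hard clustering, predicting from the $d$ original features is the same as predicting from $Q$ aggregated features. The following sketch checks this on one sample; the function names and the toy values are hypothetical.

```python
def predict_clustered(x, labels, c):
    """Prediction sum_q c[q] * sum_{j in C_q} x[j] for one sample x."""
    Q = len(c)
    aggregated = [0.0] * Q           # one aggregated feature per cluster
    for xj, q in zip(x, labels):
        aggregated[q] += xj
    return sum(cq * aq for cq, aq in zip(c, aggregated))


def predict_full(x, labels, c):
    """Same prediction with the d-dimensional quantized vector v_j = c[labels[j]]."""
    v = [c[q] for q in labels]
    return sum(vj * xj for vj, xj in zip(v, x))


x = [1.0, 2.0, 3.0, 4.0]             # one sample with d = 4 features
labels = [0, 1, 0, 1]                # hypothetical feature-to-cluster assignment
c = [0.5, -1.0]                      # one scalar centroid per cluster
```

Both functions return the same value, so once the assignment is fixed only the $Q$ scalars in `c` remain to be fitted.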
2.2. Clustering tasks. Given a set of $K$ supervised tasks like regression or binary classification, multitask learning aims at jointly solving these tasks, hoping that each task can benefit from the information given by other tasks [1,2,3]. For simplicity, we illustrate the case of multi-classification, which can be extended to the general multitask setting. When performing classification with one-versus-all majority vote, we train one binary classifier for each class vs all others. We define the total empirical loss as
\[ L(U) = \frac{1}{n}\sum_{k=1}^K\sum_{i=1}^n l(y_i, u_k^Tx_i), \]
in the matrix variable $U \in \mathbb{R}^{d\times K}$ of classifier vectors (one per task).
2.3. Clustering sample points. Imagine for instance that we are interested in measuring the effect of two treatments, e.g. a medicine and a placebo, on uniformly distributed patients. Clearly, the regression function varies a lot between the groups of patients that have a different treatment. Now suppose that we do not know to which patients different treatments were given. Since patients are uniformly distributed, the groups cannot be predicted without supplementary information. We will use the effect of treatments to simultaneously retrieve the groups of patients and the associated regression functions. Note that this setting is different from a mixture of experts model in the sense that the latent cluster assignment variable $Z$ can only be estimated once $y$ is known and cannot be deduced from the input features $X$ (cf. Figure 2).
FIGURE 2. Learning multiple diverse predictors (left), mixture of experts model (right).
Given a regression or multi-classification task, we want to find $Q$ groups of sample points to maximize the within-group prediction performance using a group specific predictor. This amounts to producing $Q$ diverse answers per sample point, considering only the best one. We thus learn $Q$ predictors, each predictor having low error rate on some cluster of points. For simplicity, we illustrate the case of regression, which can be extended to multi-classification. We minimize the loss incurred by the best linear predictor $f_q : x \mapsto c_q^Tx$ for each point, i.e.
\[
L(V) = L(CZ^T) = \frac{1}{n}\sum_{i=1}^n \min_{q\in\{1,\dots,Q\}} l(y_i, c_q^Tx_i)
     = \frac{1}{n}\sum_{i=1}^n l\Big(y_i,\ \sum_{q=1}^Q Z_{iq}\,c_q^Tx_i\Big),
\]
in the matrix variable $V \in \mathbb{R}^{d\times n}$ of predictor vectors (one per sample point), where $C \in \mathbb{R}^{d\times Q}$ and $Z \in \{0,1\}^{n\times Q}$ are the centroid and assignment matrices defined above.
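The equality between the min form and the assignment form of this loss can be illustrated as follows, under the assumption of a squared loss $l(y, f) = \frac{1}{2}(y - f)^2$. The function name and the toy data are hypothetical.

```python
def best_predictor_loss(X, y, C):
    """Mean over samples of min_q 0.5 * (y_i - c_q^T x_i)^2, plus the argmin labels."""
    total, labels = 0.0, []
    for xi, yi in zip(X, y):
        losses = [0.5 * (yi - sum(ck * xk for ck, xk in zip(cq, xi))) ** 2 for cq in C]
        q_best = min(range(len(C)), key=losses.__getitem__)
        labels.append(q_best)        # row i of Z has its 1 in column q_best
        total += losses[q_best]
    return total / len(X), labels


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
y = [1.0, -1.0, 2.0, 2.0]
C = [[1.0, 1.0], [0.0, -1.0]]        # two hypothetical predictors c_1, c_2
loss, labels = best_predictor_loss(X, y, C)
```

Note that the assignment of each point uses its label $y_i$, which is why $Z$ cannot be deduced from $X$ alone.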
Setting    Loss                                                                          Dim. of U, V   Predictors     Goal
Features   $\frac{1}{n}\sum_{i=1}^n l\big(\sum_{q=1}^Q\sum_{j=1}^d Z_{jq}c_qx_{ij}\big)$   1 × d          rows of W      regression
Tasks      $\frac{1}{n}\sum_{k=1}^K\sum_{i=1}^n l(w_k^Tx_i)$                              d × K          columns of W   classification
Samples    $\frac{1}{n}\sum_{i=1}^n l\big(\sum_{q=1}^Q Z_{iq}c_q^Tx_i\big)$               d × n          W(i)           regression
TABLE 1. Summary of the presented supervised clustering settings.
3. ALGORITHMS
We now present optimization strategies to solve these supervised clustering problems. We begin by simple greedy procedures, then propose a non-convex projected gradient descent scheme and finally a more refined convex relaxation solved using conditional gradient and approximations to k-means.
3.1. Simple strategies. A straightforward strategy is to first minimize on predictors, as in a classical supervised learning problem, and then cluster predictors together using k-means. The procedure can be repeated in the case of a soft clustering penalty. In the same spirit, when clustering sample points, one can alternate minimization on the predictors of each group and assignment of each point to the group where its loss is smallest. These methods are fast but very dependent on the initialization. Alternating minimization can optionally be used to refine the solution of the more robust algorithms proposed below.
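A minimal sketch of the alternating scheme for sample points is given here, restricted to scalar inputs and a squared loss so that each refit is a one-line closed form. This restriction, the function name and the toy data are assumptions made for brevity, not the setting of the experiments.

```python
def alternate_minimization(x, y, c_init, n_iter=20):
    """Alternate between assigning points to the best slope and refitting each slope.

    x, y: lists of floats (scalar inputs, for illustration only).
    c_init: initial slopes, one per group; the result depends on this choice.
    """
    c = list(c_init)
    labels = [0] * len(x)
    for _ in range(n_iter):
        # Assignment step: each point goes to the group with the smallest squared error.
        labels = [min(range(len(c)), key=lambda q: (yi - c[q] * xi) ** 2)
                  for xi, yi in zip(x, y)]
        # Refit step: closed-form least squares slope on each non-empty group.
        for q in range(len(c)):
            sxx = sum(xi * xi for xi, lab in zip(x, labels) if lab == q)
            sxy = sum(xi * yi for xi, yi, lab in zip(x, y, labels) if lab == q)
            if sxx > 0:
                c[q] = sxy / sxx
    return c, labels


x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0, -1.0, -2.0, -3.0]   # two noiseless groups with slopes 2 and -1
c, labels = alternate_minimization(x, y, c_init=[1.0, 0.0])
```

On this noiseless toy problem the scheme recovers both slopes from the given initialization; with a different `c_init` it can stop at a poorer partition, which is the sensitivity mentioned above.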
3.2. Projected gradient descent. A natural strategy is to do a projected gradient descent on the non-convex problems (HSC) or (SSC). The projection of a matrix $V$ is made by finding
\[
\operatorname*{argmin}_{Z,\,C}\ \|V - CZ^T\|_F^2 = \operatorname*{argmin}\ \sum_{q=1}^Q\sum_{i\in\mathcal{C}_q}\|v_i - c_q\|_2^2,
\]
where the minimum is taken over centroids $c_q$ and partitions $(\mathcal{C}_1, \dots, \mathcal{C}_Q)$. This can be solved approximately with the k-means++ algorithm, which performs alternating minimization over the assignments and the centroids. Although the problem is non-convex, k-means++ gives general approximation bounds on its solution [9].
We write k-means++$(V, Q)$ for the approximate solution of this projection and $\phi$ for the objective function. Using a backtracking line search for the stepsize $\alpha_t$, the full procedure is summarized in Algorithms 1 and 2. Details of gradient computations for each setting are given in the appendix.
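Algorithm 1 Proj. Gradient Descent (SSC)
Input: $X, y, Q, \epsilon, \lambda_b, \lambda_w, \lambda_m, \mu$
Initialize $W_0 = V_0 = 0$
while $|\phi(W_t, V_t) - \phi(W_{t-1}, V_{t-1})| \geq \epsilon$ do
    $W_{t+1} = W_t - \alpha_t\big(\nabla L(W_t) + \nabla\Omega_{SC}(V_t, W_t)\big)$
    $V_{t+1/2} = V_t - \alpha_t\nabla\Omega_{SC}(V_t, W_t)$
    $[Z_{t+1}, C_{t+1}] = \text{k-means++}(V_{t+1/2}, Q)$
    $V_{t+1} = C_{t+1}Z_{t+1}^T$
end while
$Z^*$ and $C^*$ are given through k-means++
Output: $W^*, V^*, Z^*, C^*$

Algorithm 2 Proj. Gradient Descent (HSC)
Input: $X, y, Q, \epsilon, \mu$
Initialize $V_0 = 0$
while $|\phi(V_t) - \phi(V_{t-1})| \geq \epsilon$ do
    $V_{t+1/2} = V_t - \alpha_t\big(\nabla L(V_t) + \nabla\Omega(V_t)\big)$
    $[Z_{t+1}, C_{t+1}] = \text{k-means++}(V_{t+1/2}, Q)$
    $V_{t+1} = C_{t+1}Z_{t+1}^T$
end while
$Z^*$ and $C^*$ are given through k-means++
Output: $V^*, Z^*, C^*$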
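The structure of Algorithm 2 can be sketched for feature clustering with a squared loss. This is only an illustration under simplifying assumptions: a fixed stepsize and a fixed number of steps replace the backtracking line search and the stopping test, and plain Lloyd iterations with a deterministic quantile initialization stand in for k-means++. All function names are hypothetical.

```python
def kmeans_1d(values, Q, n_iter=50):
    """Lloyd iterations on scalars with a deterministic quantile initialization.

    Stand-in for the k-means++ projection of the text (no random seeding here).
    """
    srt = sorted(values)
    centers = [srt[(2 * q + 1) * len(srt) // (2 * Q)] for q in range(Q)]
    labels = [0] * len(values)
    for _ in range(n_iter):
        labels = [min(range(Q), key=lambda q: (v - centers[q]) ** 2) for v in values]
        for q in range(Q):
            members = [v for v, lab in zip(values, labels) if lab == q]
            if members:
                centers[q] = sum(members) / len(members)
    return labels, centers


def projected_gradient_hsc(X, y, Q, mu=1e-3, step=0.1, n_steps=200):
    """Gradient step on 1/(2n)||y - Xv||^2 + mu/2 ||v||^2, then projection onto Q values."""
    n, d = len(X), len(X[0])
    v = [0.0] * d
    labels, centers = [0] * d, [0.0] * Q
    for _ in range(n_steps):
        residual = [sum(X[i][j] * v[j] for j in range(d)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * residual[i] for i in range(n)) / n + mu * v[j]
                for j in range(d)]
        half = [v[j] - step * grad[j] for j in range(d)]      # V_{t+1/2}
        labels, centers = kmeans_1d(half, Q)                  # projection step
        v = [centers[lab] for lab in labels]                  # V_{t+1} = C Z^T
    return v, labels, centers


# Toy problem: orthogonal design, true weights take two values.
X = [[2.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
w_true = [1.0, 1.0, -1.0, -1.0]
y = [2.0 * w for w in w_true]
v, labels, centers = projected_gradient_hsc(X, y, Q=2)
```

Every iterate `v` takes at most `Q` distinct values by construction, since it is rebuilt from the centroids after each projection.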
3.3. Convex relaxation using approximate conditional gradient. Another algorithmic approach to the supervised clustering problem is to minimize with respect to the assignment matrix $Z$ using the Frank-Wolfe algorithm (a.k.a. conditional gradient, [12,13]). Considering the squared loss $l(f(x), y) = \frac{1}{2}(y - f(x))^2$, we use the analytic form of the minimization in $(W, V)$ or $V$ to rewrite the problem. The clustering is then captured in terms of the equivalence matrix $M = Z(Z^TZ)^{-1}Z^T$, which satisfies $M_{ij} = 1/|\mathcal{C}_q|$ if items $i$ and $j$ are in the same cluster $q$ and $M_{ij} = 0$ otherwise.
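The entrywise description of the equivalence matrix gives a direct way to build it from cluster labels, without forming $Z$ or inverting $Z^TZ$. A small sketch follows; the function name and the example labels are hypothetical.

```python
def equivalence_matrix(labels):
    """M = Z (Z^T Z)^{-1} Z^T computed entrywise from cluster labels."""
    sizes = {}
    for lab in labels:
        sizes[lab] = sizes.get(lab, 0) + 1
    return [[1.0 / sizes[a] if a == b else 0.0 for b in labels] for a in labels]


labels = [0, 1, 0, 1, 1]             # hypothetical assignment of m = 5 items
M = equivalence_matrix(labels)
```

The resulting matrix is an orthogonal projector: it is symmetric, satisfies $M^2 = M$, has rows summing to one, and its trace equals the number of clusters.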
We describe here the simple case of supervised clustering of features in a regression task, introduced in §2.1. Detailed computations and explicit procedures for all settings are given in the appendix. When clustering features, we optimize over the coefficients associated with each feature and the k-means step can be performed exactly using dynamic programming [10,11].
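The loss $L$ can be written here
\[
L(V) = L(CZ^T) = \frac{1}{2n}\sum_{i=1}^n\Big(y_i - \sum_{j=1}^d\sum_{q=1}^Q Z_{jq}\,c_q\,x_i^{(j)}\Big)^2,
\]
where $C = [c_1, \dots, c_Q] \in \mathbb{R}^{1\times Q}$, hence the regularized loss becomes
\begin{align*}
\phi(C, Z) &= \frac{1}{2n}\sum_{i=1}^n\big(y_i - CZ^Tx_i\big)^2 + \frac{\mu}{2}\,\|CZ^T\|_2^2 \\
&= \frac{1}{2n}\operatorname{Tr}(y^Ty) + \frac{1}{2n}\operatorname{Tr}(CZ^TX^TXZC^T) - \frac{1}{n}\operatorname{Tr}(CZ^TX^Ty) + \frac{\mu}{2}\operatorname{Tr}(CZ^TZC^T).
\end{align*}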
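Minimizing in $C$ and using the Sherman-Woodbury-Morrison formula we get
\begin{align*}
G(M) &:= \min_C\ L(CZ^T) + \frac{\mu}{2}\,\|CZ^T\|_2^2 \\
&= \frac{1}{2n}\,y^T\Big(I - XZ\big(Z^TX^TXZ + \mu n Z^TZ\big)^{-1}Z^TX^T\Big)y \\
&= \frac{1}{2n}\operatorname{Tr}\Big(yy^T\Big(I + \frac{1}{n\mu}XMX^T\Big)^{-1}\Big).
\end{align*}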
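For readers who want the intermediate step, the passage from the form in $Z$ to the form in $M$ is an instance of the Woodbury identity. The symbols $U$ and $B$ below are introduced only for this verification:

```latex
\begin{align*}
(I + UBU^T)^{-1} &= I - U\,(B^{-1} + U^TU)^{-1}U^T,
\qquad U = XZ,\quad B = \tfrac{1}{n\mu}(Z^TZ)^{-1}, \\
UBU^T &= \tfrac{1}{n\mu}\,XZ(Z^TZ)^{-1}Z^TX^T = \tfrac{1}{n\mu}\,XMX^T, \\
B^{-1} + U^TU &= n\mu\,Z^TZ + Z^TX^TXZ .
\end{align*}
```

Substituting the last two lines into the first gives $(I + \frac{1}{n\mu}XMX^T)^{-1} = I - XZ(Z^TX^TXZ + \mu nZ^TZ)^{-1}Z^TX^T$, assuming $\mu > 0$ and no empty cluster so that $B$ is well defined.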
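Defining $\mathcal{M}$ as the set of equivalence matrices of the form $M = Z(Z^TZ)^{-1}Z^T$, for $Z$ an assignment matrix, each iteration of the conditional gradient method requires solving an affine minimization subproblem over $\operatorname{hull}(\mathcal{M})$. The set $\mathcal{M}$ being discrete, we have $\operatorname*{argmin}_{M\in\operatorname{hull}(\mathcal{M})}\operatorname{Tr}(M\nabla G(M)) = \operatorname*{argmin}_{M\in\mathcal{M}}\operatorname{Tr}(M\nabla G(M))$.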
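Writing $P = -\nabla G(M)$, we get
\[
P = \frac{1}{2n^2\mu}\,X^T\Big(I + \frac{1}{n\mu}XMX^T\Big)^{-1}yy^T\Big(I + \frac{1}{n\mu}XMX^T\Big)^{-1}X,
\]
which is always positive semidefinite. Writing $P^{1/2}$ the matrix square root of $P$ we have
\begin{align*}
\operatorname*{argmin}_{M\in\mathcal{M}}\operatorname{Tr}(M\nabla G(M))
&= \operatorname*{argmin}_{M\in\mathcal{M}}\ -\operatorname{Tr}\big(MP^{1/2}P^{1/2\,T}\big) \\
&= \operatorname*{argmin}_{M\in\mathcal{M}}\ \operatorname{Tr}\big((I-M)P^{1/2}P^{1/2\,T}\big) \\
&= \operatorname*{argmin}_{M\in\mathcal{M}}\ \|P^{1/2} - MP^{1/2}\|_F^2 \\
&= \operatorname*{argmin}_{Z}\ \min_C\ \|P^{1/2} - ZC^T\|_F^2,
\end{align*}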
where we recognize the k-means problem defined above. In fact, in this particular case, the k-means subproblem is one-dimensional and can be solved exactly using dynamic programming [10,11]. We use the classical stepsize for conditional gradient $\alpha_k = \frac{2}{k+2}$ and Frank-Wolfe rounding. The procedure is summarized in Algorithm 3.
Algorithm 3 Conditional gradient on the equivalence matrix
Input: $X, y, Q, \epsilon$
Initialize $M_0 \in \mathcal{M}$
while $|G(M_k) - G(M_{k-1})| \geq \epsilon$ do
    Compute the matrix square root $P^{1/2}$ of $-\nabla G(M_k)$
    Get oracle $\Delta_k = \text{k-means}(P^{1/2}, Q)$
    $M_{k+1} = M_k + \alpha_k(\Delta_k - M_k)$
end while
Use Frank-Wolfe rounding to get a solution $M^* \in \mathcal{M}$
$Z^*$ is given by k-means
$C^*$ is given by the analytic solution of the minimization for $Z^*$ fixed
Output: $C^*, Z^*, M^*$
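The exact one-dimensional k-means step mentioned above can be sketched with a standard dynamic program over sorted values, relying on the fact that optimal clusters of scalars are contiguous in sorted order. This is a plain $O(Qn^2)$ illustration, not necessarily the variant of [10,11]; the function name is hypothetical.

```python
def kmeans_1d_exact(values, Q):
    """Exact 1D k-means by dynamic programming over sorted values, in O(Q n^2) time.

    Returns (cost, labels) where labels[i] is the cluster index of values[i].
    Assumes len(values) >= Q so that every cluster is non-empty.
    """
    order = sorted(range(len(values)), key=values.__getitem__)
    x = [values[i] for i in order]
    n = len(x)
    # Prefix sums give the within-cluster cost of any contiguous block in O(1).
    s1, s2 = [0.0], [0.0]
    for v in x:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def block_cost(a, b):            # cost of x[a:b], with a < b
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    inf = float("inf")
    # best[q][b]: optimal cost of splitting x[:b] into q non-empty contiguous blocks.
    best = [[inf] * (n + 1) for _ in range(Q + 1)]
    cut = [[0] * (n + 1) for _ in range(Q + 1)]
    best[0][0] = 0.0
    for q in range(1, Q + 1):
        for b in range(q, n + 1):
            for a in range(q - 1, b):
                cand = best[q - 1][a] + block_cost(a, b)
                if cand < best[q][b]:
                    best[q][b], cut[q][b] = cand, a
    # Backtrack the optimal cut points to recover labels in the original order.
    labels, b = [0] * n, n
    for q in range(Q, 0, -1):
        a = cut[q][b]
        for pos in range(a, b):
            labels[order[pos]] = q - 1
        b = a
    return best[Q][n], labels


cost, labels = kmeans_1d_exact([0.9, 0.1, 5.0, 1.1, 5.2, 0.0], Q=2)
```

Unlike Lloyd iterations, this returns a global minimizer of the one-dimensional subproblem, so the linear minimization oracle is exact in this setting.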
3.4. Complexity. The core complexity of Algorithms 1 and 2 is concentrated in the inner k-means subproblem, which standard alternating minimization approximates in $O(tQS)$, where $t$ is the number of alternating steps, $Q$ is the number of clusters, and $S$ is the product of the dimensions of $V$ (see Table 1). Using a proper conditioning of the gradient, the number of iterations before convergence is typically below 100, which makes Algorithms 1 and 2 both fast and scalable. For Algorithm 3, we also need to compute a matrix square root of the gradient at each iteration, which can slow down computations for large datasets. The number of clusters can be chosen from prior knowledge of the problem (e.g. if we know the hierarchical structure of classes in a classification problem) or by cross-validation; the same holds for the other regularization parameters.
4. NUMERICAL EXPERIMENTS
4.1. Synthetic dataset.
Supervised clustering of sample points. We generate $n$ data points $(x_i, y_i)$ for $i = 1, \dots, n$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$, divided into two clusters corresponding to regression tasks with weights $w_1$ and $w_2$. Regression labels for points $x_i$ in cluster $q$ are given by $y_i = w_q^Tx_i + \eta$, where $\eta \sim \mathcal{N}(0, \sigma^2)$. We test the robustness of the algorithms to noise dimensions, i.e. we complete $x_i$ with $d_n$ dimensions of noise $\eta_d \sim \mathcal{N}(0, \sigma_d)$. The results are reported in Table 2.
Here, the intrinsic dimension is 10. “Oracle” refers to the least squares fit given the true assignments. It can be seen as the best error rate that can be achieved. PGK refers to projected gradient, OM refers to conditional gradient on $M$, AM refers to alternate minimization. PGK and OM were followed by AM refinement. 200 points were used for training, 200 for testing. The regularization parameter was set to $\lambda = 10^{-2}$ for all experiments. The noise on labels and on added dimensions was set to $\sigma_y = \sigma_d = 2\times10^{-1}$. Results were averaged over 100 experiments; figures after the sign ± correspond to one standard deviation.
          dim-noise=0    dim-noise=100   dim-noise=300   dim-noise=900
Oracle    4.2±0.5        29.5±7.0        17.3±6.2        81.6±30.0
PGK       5.8±15.1       251.0±154.4     187.5±98.7      275.2±167.2
OM        4.3±0.5        254.4±144.6     209.2±93.5      220.9±93.0
AM        50.4±121.1     897.0±456.1     445.3±207.1     455.4±192.3
TABLE 2. Supervised clustering of sample points.
It appears that PGK and OM give very similar results, significantly improving on the naive alternating minimization scheme. In view of standard deviations, OM seems more robust.
Supervised clustering of features. We generate $n$ data points $(x_i, y_i)$ for $i = 1, \dots, n$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. Regression weights have only 10 different values $w_q$ for $q = 1, \dots, 10$, uniformly distributed around 0. Regression labels are given by $y_i = w^Tx_i + \eta$, where $\eta \sim \mathcal{N}(0, \sigma^2)$. We test the robustness of the algorithms to the number of learning examples, i.e. the size of the training set $n$. We measure the $\ell_2$ norm of the difference between the true vector of weights and the estimated ones.
We compare in Table 3 the proposed algorithms to OSCAR [4], Ridge, Lasso and Ridge followed by k-means on the weights (using associated centroids as predictors). “Oracle” refers to the mean square fit given the true assignments of features. It can be seen as the best error rate that can be achieved. PGK refers to projected gradient and was initialized with the solution of Ridge followed by k-means, OM refers to conditional gradient on $M$ and was followed by PGK. Noise on labels is set to $\sigma = 10^{-1}$. Algorithm parameters were all cross-validated using a logarithmic grid. Results were averaged over 30 experiments and figures after the sign ± correspond to one standard deviation. Results were multiplied by 100 to shorten the table.
PGK and OM appear to give significantly better results than other methods and even reach the performance of the Oracle for $n > d$, while for $n \leq d$ results are in the same range. Note that contrary to OSCAR, Lasso and Ridge, reduction of dimensionality is guaranteed (the number of clusters is fixed).
4.2. Real data.
                n=50        n=75        n=100        n=150       n=200
Oracle          3.3±1.2     2.6±0.9     2.2±0.5      1.8±0.4     1.5±0.4
PGK             72.9±2.7    58.0±2.7    41.6±12.1    1.9±1.0     1.5±0.4
OM+PGK          73.1±2.5    55.9±3.5    43.5±7.9     4.6±7.4     1.5±0.4
Ridge+Kmeans    72.1±1.6    58.6±2.7    46.9±20.6    7.7±4.1     2.4±1.0
OSCAR           80.3±7.5    65.9±1.4    32.8±4.2     14.0±1.9    9.9±0.9
Lasso           91.2±0.6    81.9±4.2    56.1±30.5    13.9±1.7    10.0±1.0
Ridge           69.9±0.5    58.0±0.8    33.0±5.8     14.0±1.6    10.1±0.9
TABLE 3. Supervised clustering of features.
4.2.1. 20NewsGroup classification. We perform classification on 2800 documents extracted from the publicly available 20NewsGroup dataset, which contains 20 different newsgroups, each one corresponding to a different topic. Some of the newsgroups are very closely related to each other (e.g. comp.sys.ibm.pc.hardware and comp.sys.mac.hardware), while others are highly unrelated (e.g. misc.forsale and soc.religion.christian).
Our goal is to retrieve clusters of related newsgroups, while performing competitive classification. We used a dictionary of 5000 words chosen by tf-idf and took the empirical distribution over words as covariates for each document.
In Table 4, we compare our approach to other classical regularizations such as the Frobenius norm and the trace norm, as implemented in [3], using either a ridge or a logistic loss. As the problem solved by projected gradient descent is non-convex, we initialize it with the solution of the unregularized logistic loss. All algorithms were 5-fold cross-validated on 80% of the data then tested on the remaining 20%. The number of clusters was set to 5, as suggested by the names of newsgroups. Figures after the sign ± correspond to one standard deviation when varying the training and test sets. As reported in Table 4, it appears that the clustering regularization helps prediction on topics.
Regularization    Frobenius    Trace      OM         PGK
Loss              Logistic     Ridge      Ridge      Logistic
Error             5.8±2.0      4.3±0.8    4.9±0.8    2.6±1.5
TABLE 4. 100 × mean absolute errors for predicting topics on 20NewsGroup dataset.
4.2.2. Predicting ratings associated with reviews using groups of words. We perform “sentiment” analysis of newspaper movie reviews. We use the publicly available dataset introduced in [14] which contains movie reviews paired with star ratings. We treat it as a regression problem, taking log-responses for $y$ in $(0,1)$ and the empirical distribution over words as covariates. With a 5000 term vocabulary chosen by tf-idf, the corpus contains 5006 documents and comprises 1.6M words. We evaluate our algorithms for regression with clustered features against standard regression approaches: Lasso, Ridge, and Ridge followed by k-means on predictors. All algorithms were 5-fold cross-validated on 80% of the dataset and then tested on the remaining 20%. The number of clusters was arbitrarily set to 10, though we did not notice significant changes when varying it. Results are reported in Table 5; figures after the sign ± correspond to one standard deviation when varying the training and test sets. While all methods including ours give 10% mean absolute errors on the test set (up to 0.5% accuracy), our approach has the additional benefit of providing clusters of words which have a similar influence. Moreover we noticed that the clusters with highest absolute weights are also the ones with smallest number of words, which confirms the intuition that only a few words are very discriminative. We illustrate this in Table 6, listing words of the first and last clusters.
PGK        OM+PGK     Ridge+Kmeans    Lasso      Ridge
9.7±0.1    9.8±0.1    10.0±0.1        9.8±0.1    9.9±0.1
TABLE 5. 100 × mean absolute errors for predicting movie ratings associated with reviews.
Cluster 1 (negative), size 47:
stupidity, shred, inept, sophomoric, sit, drivel, uninteresting, clue, ludicrous, tedious, single, remotely, crass, tolerable, bland, predictable, devoid, waste, fail, embarrassing, mess, idea, inane, lifeless, abysmal, badly, disgusting, bearable, poorly, care, leaden, annoying, turkey, attempt, interesting, problem, lame, lack, worse, unfunny, watchable, worst, ridiculous, flat, dull, boring, awful, suppose, bad

Cluster 10 (positive), size 87:
perfect, fascinating, great, powerful, hilarious, fine, brilliant, delightful, strongly, easy, rare, wonderfully, recommendation, mature, nomination, simple, perfectly, goodspirited, masterpiece, strong, potent, marvelous, excellent, fortunately, world, intelligent, memorable, unique, send, small, funniest, unexpected, carefully, academy, superbly, important, life, poignancy, speak, wonderful, outstanding, mesmerizing, man, imaginative, simplicity, confidence, delightfully, delight, touching, nuanced, study, melville, brilliance, work, picture, honest, oscar, classification, masterfully, favorite, effective, masterful, intense, bind, unrated, surprisingly, superb, triumph, flawless, move, lyrical, gem, devastate, intricate, highly, subtle, power, accomplishment, keen, uplifting, empathize, element, quibble, minor, love, finest, crisp

TABLE 6. Supervised clustering of words on movie reviews.
5. DISCUSSION
We have developed a unified framework for supervised clustering over tasks, features or samples and provided two robust algorithms for solving the associated optimization problems. Results on synthetic and realistic text datasets suggest that our method is competitive against standard regression and classification methods, while having the additional benefit of providing clusters of tasks, features or samples. As in compressed sensing, optimization is performed over a union of subspaces; hence, by analogy with RIP conditions for iterative hard thresholding [15], it would be interesting to see under which assumptions our algorithms can recover the optimal solution to the supervised clustering problem.
6. APPENDIX
We give here detailed computations of the function $G$ corresponding to the convex relaxation defined in 3.3 in all settings.
We always suppose $Z^TZ$ invertible, i.e. $\forall q,\ s_q \neq 0$ (there is no empty cluster). For any integer $p$, we let $\mathbf{1}_p \in \mathbb{R}^p$ be the vector whose coordinates are all ones, $I_p$ the identity matrix in $\mathbb{R}^{p\times p}$, $\Gamma_p = \frac{\mathbf{1}_p\mathbf{1}_p^T}{p}$ and $\Pi_p = I_p - \Gamma_p$ the corresponding centering matrix.
For all settings input samples are represented by the matrix $X = (x_1, \dots, x_n)^T \in \mathbb{R}^{n\times d}$ and their respective labels by $y = (y_1, \dots, y_n) \in \mathcal{Y}$. For regression problems $\mathcal{Y} = \mathbb{R}$. Classification problems are cast into the multitask setting, each class corresponding to a task. We denote by $y^k = (y_1^k, \dots, y_n^k)$ the vector of binary labels corresponding to the class $k \in [[1, K]]$ and $Y = (y^1, \dots, y^K)$.
6.1. Minimization in C for Soft Clustering Problems. For soft supervised clustering problems, the minimization in the variable $C$ for $W, Z$ fixed can be made without assumptions on the specific loss $L$. We let $(\delta, m)$ be the dimensions of the predictors of interest, such that $W \in \mathbb{R}^{\delta\times m}$, $C \in \mathbb{R}^{\delta\times Q}$ and $Z \in \{0,1\}^{m\times Q}$. We denote by $\lambda_W, \lambda_B, \lambda_M$ the weights associated respectively to the within, between and mean penalties.
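Minimization in $C$ is then given by
\begin{align*}
H(W, Z) &:= \min_{C\in\mathbb{R}^{\delta\times Q}}\ \frac{\lambda_W}{2}\,\|W - CZ^T\|_F^2 + \frac{\lambda_B}{2}\operatorname{Tr}(CZ^T\Pi_mZC^T) + \frac{\lambda_M}{2}\operatorname{Tr}(CZ^T\Gamma_mZC^T) \\
&= \min_{C\in\mathbb{R}^{\delta\times Q}}\ \frac{\lambda_W}{2}\,\|W\|_F^2 + \frac{1}{2}\operatorname{Tr}\Big(C\big((\lambda_W+\lambda_B)Z^TZ + (\lambda_M-\lambda_B)Z^T\Gamma_mZ\big)C^T\Big) - \lambda_W\operatorname{Tr}(C^TWZ) \\
&= \frac{\lambda_W}{2}\operatorname{Tr}(WW^T) - \frac{\lambda_W^2}{2}\operatorname{Tr}\Big(WZ\big((\lambda_W+\lambda_B)Z^TZ + (\lambda_M-\lambda_B)Z^T\Gamma_mZ\big)^{-1}Z^TW^T\Big).
\end{align*}
Let $s = Z^T\mathbf{1}_m \in \mathbb{R}^Q$ be the vector whose coordinates $s_q$ are the sizes of the clusters. Denoting by $s^{1/2}$ the vector whose coordinates are $\sqrt{s_q}$, and by $s^{-1/2}$ the vector whose coordinates are $\frac{1}{\sqrt{s_q}}$, we can derive the inversion in the preceding formula,
\begin{align*}
J^{-1} &= \big((\lambda_W+\lambda_B)Z^TZ + (\lambda_M-\lambda_B)Z^T\Gamma_mZ\big)^{-1} \\
&= \Big((\lambda_W+\lambda_B)\operatorname{diag}(s) + (\lambda_M-\lambda_B)\frac{ss^T}{m}\Big)^{-1} \\
&= \frac{1}{\lambda_W+\lambda_B}\operatorname{diag}(s^{-1/2})\Big(I_Q + \frac{\lambda_M-\lambda_B}{\lambda_W+\lambda_B}\,\frac{s^{1/2}s^{1/2\,T}}{m}\Big)^{-1}\operatorname{diag}(s^{-1/2}) \\
&= \frac{1}{\lambda_W+\lambda_B}\operatorname{diag}(s^{-1/2})\Big(I_Q - \frac{\lambda_M-\lambda_B}{\lambda_W+\lambda_M}\,\frac{s^{1/2}s^{1/2\,T}}{m}\Big)\operatorname{diag}(s^{-1/2}) \\
&= \frac{1}{\lambda_W+\lambda_B}(Z^TZ)^{-1} - \frac{\lambda_M-\lambda_B}{(\lambda_W+\lambda_M)(\lambda_W+\lambda_B)}\,\frac{\mathbf{1}_Q\mathbf{1}_Q^T}{m} \\
&= \frac{1}{\lambda_W+\lambda_B}\Big((Z^TZ)^{-1} - \frac{\mathbf{1}_Q\mathbf{1}_Q^T}{m}\Big) + \frac{1}{\lambda_W+\lambda_M}\,\frac{\mathbf{1}_Q\mathbf{1}_Q^T}{m},
\end{align*}
where we used that $p = \frac{s^{1/2}s^{1/2\,T}}{m} = \frac{s^{1/2}s^{1/2\,T}}{\|s^{1/2}\|_2^2}$ is a projector and therefore $(I + \alpha p)^{-1} = I - \frac{\alpha}{\alpha+1}\,p$, added to the fact that $\operatorname{diag}(s) = Z^TZ$.
Now introducing the equivalence matrix $M = Z(Z^TZ)^{-1}Z^T$, and using that $Z\frac{\mathbf{1}_Q\mathbf{1}_Q^T}{m}Z^T = \Gamma_m$, we finally obtain
\begin{align*}
H(W, Z) &= \frac{\lambda_W}{2}\operatorname{Tr}(WW^T) - \frac{\lambda_W^2}{2}\operatorname{Tr}\Big(W\Big(\frac{1}{\lambda_W+\lambda_B}(M-\Gamma_m) + \frac{1}{\lambda_W+\lambda_M}\Gamma_m\Big)W^T\Big) \\
&= \frac{\lambda_W}{2}\operatorname{Tr}\Big(W\big(I_m - \alpha(M-\Gamma_m) - \beta\Gamma_m\big)W^T\Big),
\end{align*}
where $\alpha = \frac{\lambda_W}{\lambda_W+\lambda_B}$ and $\beta = \frac{\lambda_W}{\lambda_W+\lambda_M}$.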
Denoting $P = I_m - \alpha(M-\Gamma_m) - \beta\Gamma_m$, one can also use Kronecker's formula:
\[ H(W, Z) = \frac{\lambda_W}{2}\operatorname{Vec}(W)^T(P\otimes I_\delta)\operatorname{Vec}(W). \]
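The inversion of $J$ in this subsection relies on the inverse of the identity plus a multiple of a projector. It can be checked by direct multiplication; the scalar is written $t$ here, a symbol introduced only to avoid confusion with the constant $\alpha = \frac{\lambda_W}{\lambda_W+\lambda_B}$:

```latex
\begin{align*}
(I + t\,p)\Big(I - \tfrac{t}{1+t}\,p\Big)
 &= I + t\,p - \tfrac{t}{1+t}\,p - \tfrac{t^2}{1+t}\,p^2 \\
 &= I + \Big(t - \tfrac{t}{1+t} - \tfrac{t^2}{1+t}\Big)p = I,
 \qquad \text{using } p^2 = p .
\end{align*}
```

With $t = \frac{\lambda_M-\lambda_B}{\lambda_W+\lambda_B}$ one has $1 + t = \frac{\lambda_W+\lambda_M}{\lambda_W+\lambda_B} > 0$ for positive weights, and $\frac{t}{1+t} = \frac{\lambda_M-\lambda_B}{\lambda_W+\lambda_M}$, which is the coefficient appearing in the derivation.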
6.2. Clustered Multitask Learning. We derive here the computation of $G$ when clustering tasks. We restrict ourselves to the case of multiclass setting cast as a multitask problem such that each task shares the same input data. Given $K$ classes, we denote by $W = (w_1, \dots, w_K) \in \mathbb{R}^{d\times K}$ the matrix of linear predictors.
Using a squared loss $l(f(x), y) = \frac{1}{2}(y - f(x))^2$ the empirical loss is given by:
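\[ L(W) = \frac{1}{2n}\sum_{i=1}^n\sum_{k=1}^K\big(y_i^k - w_k^Tx_i\big)^2. \]
Using Kronecker's product formula, we get
\begin{align*}
L(W) &= \frac{1}{2n}\sum_{k=1}^K w_k^TX^TXw_k - \frac{1}{n}\sum_{k=1}^K w_k^TX^Ty^k + \frac{1}{2n}\,\|Y\|_F^2 \\
&= \frac{1}{2n}\sum_{k=1}^K e_k^TW^TX^TXWe_k - \frac{1}{n}\operatorname{Tr}(W^TX^TY) + \frac{1}{2n}\,\|Y\|_F^2 \\
&= \frac{1}{2n}\operatorname{Vec}(W)^T(I_K\otimes X^TX)\operatorname{Vec}(W) - \frac{1}{n}\operatorname{Vec}(W)^T\operatorname{Vec}(X^TY) + \frac{1}{2n}\,\|Y\|_F^2.
\end{align*}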
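Using the expression found by minimizing the regularization penalty in $C$, we get an expression for $G$:
\begin{align*}
G(M) &= \min_W\ L(W) + H(W, Z) \\
&= \min_W\ \frac{1}{2n}\operatorname{Vec}(W)^T\big(I_K\otimes X^TX + \lambda_WnP\otimes I_d\big)\operatorname{Vec}(W) - \frac{1}{n}\operatorname{Vec}(W)^T\operatorname{Vec}(X^TY) + \frac{1}{2n}\,\|Y\|_F^2 \\
&= -\frac{1}{2n}\operatorname{Vec}(X^TY)^T\big(I_K\otimes X^TX + \lambda_WnP\otimes I_d\big)^{-1}\operatorname{Vec}(X^TY) + \frac{1}{2n}\,\|Y\|_F^2.
\end{align*}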
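Denote by $(v_1, \dots, v_d)$ and $(\lambda_1, \dots, \lambda_d) \in \mathbb{R}^d$ the eigenvectors and corresponding eigenvalues of $X^TX$, and by $(u_1, \dots, u_K)$ and $(\mu_1, \dots, \mu_K) \in \mathbb{R}^K$ those of $P = I_K - \alpha(M-\Gamma_K) - \beta\Gamma_K$. The eigenvectors and corresponding eigenvalues of $I_K\otimes X^TX + \lambda_WnP\otimes I_d$ are $(u_i\otimes v_j)$ and $(\lambda_Wn\mu_i + \lambda_j)$ for $i \in [[1, K]]$, $j \in [[1, d]]$. The inversion in the expression of $G$ is then given by
\[
J^{-1} = \big(I_K\otimes X^TX + \lambda_WnP\otimes I_d\big)^{-1} = \sum_{i=1}^K\sum_{j=1}^d\frac{1}{\lambda_Wn\mu_i + \lambda_j}\,u_iu_i^T\otimes v_jv_j^T.
\]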
Then we note that the set of eigenvectors of $P$ can be decomposed into three sets. Indeed the matrices $I_K - M$, $M - \Gamma_K$ and $\Gamma_K$ are orthogonal projectors on orthogonal subspaces spanning the entire space.
Denote by $\mathcal{I}_W, \mathcal{I}_B, \mathcal{I}_M$ the sets of eigenvectors corresponding respectively to $I_K - M$, $M - \Gamma_K$ and $\Gamma_K$. Their corresponding eigenvalues in $P$ can easily be computed and we obtain
\begin{align*}
P &= I_K - M + (1-\alpha)(M-\Gamma_K) + (1-\beta)\Gamma_K \\
&= \sum_{i\in\mathcal{I}_W}u_iu_i^T + (1-\alpha)\sum_{i\in\mathcal{I}_B}u_iu_i^T + (1-\beta)\sum_{i\in\mathcal{I}_M}u_iu_i^T.
\end{align*}
This decomposition can be used for the inversion :
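\[
\begin{aligned}
J^{-1} &= \sum_{i\in\mathcal{I}_W}\sum_{j=1}^d \frac{1}{\lambda_Wn+\lambda_j}\,u_iu_i^T\otimes v_jv_j^T\\
&\quad + \sum_{i\in\mathcal{I}_B}\sum_{j=1}^d \frac{1}{\lambda_Wn(1-\alpha)+\lambda_j}\,u_iu_i^T\otimes v_jv_j^T\\
&\quad + \sum_{i\in\mathcal{I}_M}\sum_{j=1}^d \frac{1}{\lambda_Wn(1-\beta)+\lambda_j}\,u_iu_i^T\otimes v_jv_j^T\\
&= (I_K-M)\otimes(X^TX+n\lambda_WI_d)^{-1} + (M-\Gamma_K)\otimes(X^TX+n\lambda_W(1-\alpha)I_d)^{-1}\\
&\quad + \Gamma_K\otimes(X^TX+n\lambda_W(1-\beta)I_d)^{-1}.
\end{aligned}
\]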
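The block form of $J^{-1}$ can be compared with a brute-force inverse on a small instance. The sketch below is illustrative only: the task clustering `labels`, the sizes and the regularization weights are assumptions, and the helper `ridge_inv` is a hypothetical name for the shifted inverse $(X^TX + n\lambda_W c\,I_d)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical toy problem: K = 5 tasks grouped into Q = 2 clusters.
n, d = 10, 3
labels = np.array([0, 0, 1, 1, 1])
K, Q = labels.size, 2
Z = np.zeros((K, Q))
Z[np.arange(K), labels] = 1.0
M = Z @ np.linalg.inv(Z.T @ Z) @ Z.T        # equivalence matrix of the tasks
Gamma = np.ones((K, K)) / K                 # Gamma_K = 1 1^T / K
I_K = np.eye(K)

lam_W, lam_B, lam_M = 0.5, 2.0, 0.1         # arbitrary positive weights
alpha = lam_W / (lam_W + lam_B)
beta = lam_W / (lam_W + lam_M)
P = I_K - alpha * (M - Gamma) - beta * Gamma

X = rng.standard_normal((n, d))
XtX = X.T @ X
J = np.kron(I_K, XtX) + lam_W * n * np.kron(P, np.eye(d))


def ridge_inv(c):
    """Return (X^T X + n * lam_W * c * I_d)^{-1}."""
    return np.linalg.inv(XtX + n * lam_W * c * np.eye(d))


# Sum of three Kronecker products, one per projector.
J_inv_blocks = (
    np.kron(I_K - M, ridge_inv(1.0))
    + np.kron(M - Gamma, ridge_inv(1.0 - alpha))
    + np.kron(Gamma, ridge_inv(1.0 - beta))
)
J_inv_direct = np.linalg.inv(J)
```

The point of the decomposition is that only three $d\times d$ inverses are needed, instead of one $dK\times dK$ inverse.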
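Finally $G$ can be simplified using properties of the Kronecker product:
\[
\begin{aligned}
G(M) &= -\frac{1}{2n}\mathrm{Vec}(X^TY)^T\left[(I_K-M)\otimes(X^TX+n\lambda_WI_d)^{-1}\right]\mathrm{Vec}(X^TY)\\
&\quad -\frac{1}{2n}\mathrm{Vec}(X^TY)^T\left[(M-\Gamma_K)\otimes(X^TX+n\lambda_W(1-\alpha)I_d)^{-1}\right]\mathrm{Vec}(X^TY)\\
&\quad -\frac{1}{2n}\mathrm{Vec}(X^TY)^T\left[\Gamma_K\otimes(X^TX+n\lambda_W(1-\beta)I_d)^{-1}\right]\mathrm{Vec}(X^TY) + \frac{1}{2n}\|Y\|_F^2\\
&= -\frac{1}{2n}\mathrm{Tr}\left(Y^TX(X^TX+n\lambda_WI_d)^{-1}X^TY(I_K-M)\right)\\
&\quad -\frac{1}{2n}\mathrm{Tr}\left(Y^TX(X^TX+n\lambda_W(1-\alpha)I_d)^{-1}X^TY(M-\Gamma_K)\right)\\
&\quad -\frac{1}{2n}\mathrm{Tr}\left(Y^TX(X^TX+n\lambda_W(1-\beta)I_d)^{-1}X^TY\,\Gamma_K\right) + \frac{1}{2n}\|Y\|_F^2.
\end{aligned}
\]
Therefore $G$ is a linear function of $M$ whose gradient is given by
\[
\nabla G(M) = \frac{1}{2n}Y^TX\left[(X^TX+n\lambda_WI_d)^{-1} - (X^TX+n\lambda_W(1-\alpha)I_d)^{-1}\right]X^TY.
\]
As $1\ge\alpha\ge0$, we get that $-\nabla G(M)\succeq0$.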
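For a fixed $M$, $W$ is given, using the preceding computations and Kronecker's formula, by
\[
\begin{aligned}
W_M &= (X^TX+n\lambda_WI_d)^{-1}X^TY(I_K-M) + (X^TX+n\lambda_W(1-\alpha)I_d)^{-1}X^TY(M-\Gamma_K)\\
&\quad + (X^TX+n\lambda_W(1-\beta)I_d)^{-1}X^TY\,\Gamma_K.
\end{aligned}
\]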
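The closed forms for $W_M$, $G(M)$ and $\nabla G(M)$ in the task-clustering case can be verified on a toy instance. This is a sketch under stated assumptions, not the authors' implementation: the data are random, the clustering `labels` and the weights are arbitrary, and `ridge_inv`, `objective` and `stationarity` are hypothetical names introduced here.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical toy problem: K = 6 tasks grouped into Q = 3 clusters.
n, d = 12, 4
labels = np.array([0, 1, 1, 2, 2, 2])
K, Q = labels.size, 3
Z = np.zeros((K, Q))
Z[np.arange(K), labels] = 1.0
M = Z @ np.linalg.inv(Z.T @ Z) @ Z.T
Gamma = np.ones((K, K)) / K
I_K = np.eye(K)

lam_W, lam_B, lam_M = 0.5, 2.0, 0.1
alpha = lam_W / (lam_W + lam_B)
beta = lam_W / (lam_W + lam_M)
P = I_K - alpha * (M - Gamma) - beta * Gamma

X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, K))
XtY = X.T @ Y


def ridge_inv(c):
    """Return (X^T X + n * lam_W * c * I_d)^{-1}."""
    return np.linalg.inv(X.T @ X + n * lam_W * c * np.eye(d))


A_W, A_B, A_M = ridge_inv(1.0), ridge_inv(1.0 - alpha), ridge_inv(1.0 - beta)

# Closed-form minimizer W_M (d x K) and optimal value G(M).
W_M = A_W @ XtY @ (I_K - M) + A_B @ XtY @ (M - Gamma) + A_M @ XtY @ Gamma
G_closed = (
    -np.trace(XtY.T @ A_W @ XtY @ (I_K - M))
    - np.trace(XtY.T @ A_B @ XtY @ (M - Gamma))
    - np.trace(XtY.T @ A_M @ XtY @ Gamma)
    + np.sum(Y ** 2)
) / (2 * n)


def objective(W):
    """L(W) + H(W, Z) for W in R^{d x K}."""
    loss = np.sum((Y - X @ W) ** 2) / (2 * n)
    return loss + 0.5 * lam_W * np.trace(W @ P @ W.T)


# Gradient of the objective in W, which should vanish at W_M.
stationarity = X.T @ (X @ W_M - Y) / n + lam_W * W_M @ P

# Gradient of G in M, which should be negative semidefinite.
grad_G = XtY.T @ (A_W - A_B) @ XtY / (2 * n)
```

On this instance `objective(W_M)` matches `G_closed`, `stationarity` is numerically zero, and the largest eigenvalue of `grad_G` is non-positive up to rounding.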
6.3. Learning with clustered features. For more generality we cast the problem of learning clustered features in the multiclass learning setting. We restrict ourselves here to the soft supervised clustering problem, as the hard one has already been treated. We keep the notations introduced in section 6.2, though we now have $W = (w_1, \ldots, w_K)^T \in \mathbb{R}^{K\times d}$. The empirical loss is given by
\[
\begin{aligned}
L(W) &= \frac{1}{2n}\sum_{i=1}^n\sum_{k=1}^K \left(y_i^k - w_k^Tx_i\right)^2\\
&= \frac{1}{2n}\sum_{k=1}^K w_k^TX^TXw_k - \frac{1}{n}\sum_{k=1}^K w_k^TX^Ty_k + \frac{1}{2n}\|Y\|_F^2\\
&= \frac{1}{2n}\mathrm{Tr}(W^TWX^TX) - \frac{1}{n}\mathrm{Tr}(WX^TY) + \frac{1}{2n}\|Y\|_F^2.
\end{aligned}
\]
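Adding the regularization penalty and minimizing in $C$, we obtain
\[
\begin{aligned}
G(M) &= \min_W\; \frac{1}{2n}\mathrm{Tr}\left(W^TW(X^TX+\lambda_WnP)\right) - \frac{1}{n}\mathrm{Tr}(WX^TY) + \frac{1}{2n}\|Y\|_F^2\\
&= \frac{1}{2n}\|Y\|_F^2 - \frac{1}{2n}\mathrm{Tr}\left(Y^TX(X^TX+\lambda_WnP)^{-1}X^TY\right)\\
&= \frac{1}{2n}\mathrm{Tr}\left(YY^T\left(I_n - X(X^TX+\lambda_WnP)^{-1}X^T\right)\right)\\
&= \frac{1}{2n}\mathrm{Tr}\left(YY^T\left(I_n + \frac{1}{\lambda_Wn}XP^{-1}X^T\right)^{-1}\right).
\end{aligned}
\]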
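As detailed before, the inverse of $P$ can be found analytically by observing that it is composed of orthogonal projectors on orthogonal subspaces spanning the entire space. Hence we have $P^{-1} = I_d - M + \frac{1}{1-\alpha}(M-\Gamma_d) + \frac{1}{1-\beta}\Gamma_d$. Since $\frac{1}{1-\alpha} = 1 + \frac{\lambda_W}{\lambda_B}$ and $\frac{1}{1-\beta} = 1 + \frac{\lambda_W}{\lambda_M}$, this reads $P^{-1} = I_d + \frac{\lambda_W}{\lambda_B}(M-\Gamma_d) + \frac{\lambda_W}{\lambda_M}\Gamma_d$, and we now get
\[
G(M) = \frac{1}{2n}\mathrm{Tr}\left(YY^T\left(I_n + \frac{1}{n\lambda_W}XX^T + \frac{1}{n\lambda_B}X(M-\Gamma_d)X^T + \frac{1}{n\lambda_M}X\Gamma_dX^T\right)^{-1}\right).
\]
Note that we still have $-\nabla G(M)\succeq0$. For a fixed $M$, $W$ is given analytically by
\[
W_M = Y^TX\left(X^TX + \lambda_Wn\left(I_d - \alpha(M-\Gamma_d) - \beta\Gamma_d\right)\right)^{-1}.
\]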
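The feature-clustering formulas can be checked in the same spirit. The sketch below uses random data and an arbitrary feature clustering, all of which are assumptions. It compares the primal optimum with the Woodbury form $\frac{1}{2n}\mathrm{Tr}\big(YY^T(I_n + \frac{1}{\lambda_Wn}XP^{-1}X^T)^{-1}\big)$ and with the expanded form obtained from the identity $P^{-1} = I_d + \frac{\lambda_W}{\lambda_B}(M-\Gamma_d) + \frac{\lambda_W}{\lambda_M}\Gamma_d$, which follows from $\frac{1}{1-\alpha} = 1+\frac{\lambda_W}{\lambda_B}$ and $\frac{1}{1-\beta} = 1+\frac{\lambda_W}{\lambda_M}$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical toy problem: d = 5 features in Q = 2 clusters, K = 3 classes.
n, K = 10, 3
labels = np.array([0, 0, 1, 1, 1])
d, Q = labels.size, 2
Z = np.zeros((d, Q))
Z[np.arange(d), labels] = 1.0
M = Z @ np.linalg.inv(Z.T @ Z) @ Z.T        # equivalence matrix of the features
Gamma = np.ones((d, d)) / d
I_d, I_n = np.eye(d), np.eye(n)

lam_W, lam_B, lam_M = 0.5, 2.0, 0.1
alpha = lam_W / (lam_W + lam_B)
beta = lam_W / (lam_W + lam_M)
P = I_d - alpha * (M - Gamma) - beta * Gamma

X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, K))

# Analytic inverse of P from the three orthogonal projectors.
P_inv = I_d - M + (M - Gamma) / (1.0 - alpha) + Gamma / (1.0 - beta)

# Closed-form minimizer, here a K x d matrix whose rows are the predictors.
W_M = Y.T @ X @ np.linalg.inv(X.T @ X + lam_W * n * P)


def objective(W):
    """L(W) + H(W, Z) for W in R^{K x d}."""
    loss = np.sum((Y - X @ W.T) ** 2) / (2 * n)
    return loss + 0.5 * lam_W * np.trace(W @ P @ W.T)


# Woodbury form of the optimal value.
G_woodbury = np.trace(
    Y @ Y.T @ np.linalg.inv(I_n + X @ P_inv @ X.T / (lam_W * n))
) / (2 * n)

# Expanded form, with one term per regularization weight.
S = (
    I_n
    + X @ X.T / (n * lam_W)
    + X @ (M - Gamma) @ X.T / (n * lam_B)
    + X @ Gamma @ X.T / (n * lam_M)
)
G_expanded = np.trace(Y @ Y.T @ np.linalg.inv(S)) / (2 * n)
```

All three values of $G(M)$ agree up to rounding on this instance, and `P_inv @ P` is the identity.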
6.4. Learning multiple predictors. In the setting of learning multiple predictors, we denote by $C_q$ the set of points whose best linear predictor is $w_q$, so that $C_1, \ldots, C_Q$ form a partition of $[\![1,n]\!]$. As above, we let $s_q$ be the cardinality of the set $C_q$, $X_q\in\mathbb{R}^{s_q\times d}$ be the matrix whose rows are the points belonging to cluster $q$, and $y_q$ be the column vector of labels corresponding to cluster $q$. We use a squared loss $l(f(x), y) = \frac{1}{2}(y-f(x))^2$.
For $C_1,\ldots,C_Q$ fixed (or equivalently $Z$ fixed), we can compute $G$ using the Sherman-Woodbury-Morrison formula as
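\[
\begin{aligned}
G(M) &= \min_{w_1,\ldots,w_Q}\; \frac{1}{2n}\sum_{q=1}^Q\sum_{i\in C_q}\left(y_i - w_q^Tx_i\right)^2 + \frac{\mu}{2}\sum_{q=1}^Q s_q\|w_q\|_2^2\\
&= \sum_{q=1}^Q \frac{1}{2n}\,y_q^T\left(\frac{1}{s_q\mu n}X_qX_q^T + I_{s_q}\right)^{-1}y_q.
\end{aligned}
\]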
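The per-cluster closed form can be checked by solving each ridge problem directly. The following NumPy sketch is illustrative: the partition `labels`, the sizes and the value of `mu` are assumptions, and each $w_q$ is obtained from the normal equations of its own cluster.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical toy problem: n = 9 points split into Q = 3 clusters.
n, d, Q = 9, 3, 3
labels = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
mu = 0.3                                    # arbitrary positive weight

G_primal = 0.0   # objective evaluated at the per-cluster minimizers
G_dual = 0.0     # Sherman-Woodbury-Morrison form
for q in range(Q):
    idx = labels == q
    X_q, y_q, s_q = X[idx], y[idx], int(idx.sum())

    # Ridge solution for cluster q from the normal equations.
    w_q = np.linalg.solve(n * mu * s_q * np.eye(d) + X_q.T @ X_q, X_q.T @ y_q)
    G_primal += np.sum((y_q - X_q @ w_q) ** 2) / (2 * n)
    G_primal += 0.5 * mu * s_q * (w_q @ w_q)

    # Same value through the s_q x s_q kernel matrix of the cluster.
    kernel_q = X_q @ X_q.T / (s_q * mu * n) + np.eye(s_q)
    G_dual += y_q @ np.linalg.solve(kernel_q, y_q) / (2 * n)
```

The two accumulated values agree up to rounding. Note that the identity matrix inside each inverse has the size of the cluster, $s_q\times s_q$.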
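We define $E\in\mathbb{R}^{n\times n}$ as the permutation matrix reordering the points such that
\[
Ey = \begin{pmatrix} y_{C_1}\\ \vdots\\ y_{C_Q}\end{pmatrix}, \qquad EX = \begin{pmatrix} X_1\\ \vdots\\ X_Q\end{pmatrix}.
\]
For $q\in[\![1,Q]\!]$, we denote by
\[
R_q = \sum_{\sum_{p=1}^{q-1}s_p \,<\, i \,\le\, \sum_{p=1}^{q}s_p} e_ie_i^T
\]
the orthogonal projectors on the ordered sets of points belonging to cluster $q$, such that
\[
\begin{pmatrix} X_1X_1^T & 0 & 0\\ 0 & \ddots & 0\\ 0 & 0 & X_QX_Q^T\end{pmatrix} = \mathrm{diag}(EXX^TE^T) = \sum_{q=1}^Q R_qEXX^TE^TR_q.
\]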
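Thus we get
\[
\begin{aligned}
G(M) &= \frac{1}{2n}\,y^TE^T\left(\sum_{q=1}^Q\frac{1}{s_q\mu n}R_qEXX^TE^TR_q + I_n\right)^{-1}Ey\\
&= \frac{1}{2n}\,y^T\left(\sum_{q=1}^Q\frac{1}{s_q\mu n}E^TR_qE\,XX^T\,E^TR_qE + I_n\right)^{-1}y.
\end{aligned}
\]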
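Then we notice that $E^TR_qE = \mathrm{diag}(Z_q)$, i.e. the projector on the set of points belonging to cluster $q$, and that $\sum_{q=1}^Q\frac{1}{s_q}\mathrm{diag}(Z_q)XX^T\mathrm{diag}(Z_q) = M\circ XX^T$, where $\circ$ denotes the Hadamard product. Hence we finally get
\[
G(M) = \frac{1}{2n}\,y^T\left(\frac{1}{\mu n}XX^T\circ M + I_n\right)^{-1}y.
\]
Its gradient is given by
\[
2n\nabla G(M) = -\frac{1}{\mu n}XX^T\circ\left[\left(I_n + \frac{1}{\mu n}XX^T\circ M\right)^{-1}yy^T\left(I_n + \frac{1}{\mu n}XX^T\circ M\right)^{-1}\right],
\]
for which we have $-\nabla G(M)\succeq0$.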
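The Hadamard-product form and its gradient can be checked with a finite-difference test. This is a minimal sketch on random data, with an assumed partition and an assumed value of `mu`; the functions `G` and `grad_G` are hypothetical names, and the finite difference is taken along a random symmetric direction so that the matrix being inverted stays symmetric.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical toy problem: n = 8 points split into Q = 3 clusters.
n, d, Q = 8, 3, 3
labels = np.array([0, 0, 0, 1, 1, 2, 2, 2])
Z = np.zeros((n, Q))
Z[np.arange(n), labels] = 1.0
s = Z.sum(axis=0)                            # cluster sizes s_q
M = Z @ np.diag(1.0 / s) @ Z.T               # equals Z (Z^T Z)^{-1} Z^T
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
mu = 0.3
gram = X @ X.T

# Sum of masked Gram matrices, to compare with the Hadamard product M o XX^T.
blocks = sum(
    np.diag(Z[:, q]) @ gram @ np.diag(Z[:, q]) / s[q] for q in range(Q)
)


def G(M_):
    """G as a function of a symmetric matrix M_."""
    S = np.eye(n) + gram * M_ / (mu * n)
    return y @ np.linalg.solve(S, y) / (2 * n)


def grad_G(M_):
    """Gradient of G, using S^{-1} y y^T S^{-1} = v v^T for symmetric S."""
    S = np.eye(n) + gram * M_ / (mu * n)
    v = np.linalg.solve(S, y)
    return -(gram * np.outer(v, v)) / (mu * n) / (2 * n)


# Central finite difference along a random symmetric direction D.
D = rng.standard_normal((n, n))
D = (D + D.T) / 2
t = 1e-6
fd = (G(M + t * D) - G(M - t * D)) / (2 * t)
analytic = np.sum(grad_G(M) * D)
```

On this instance `blocks` equals `gram * M`, the finite difference `fd` matches `analytic`, and the eigenvalues of `-grad_G(M)` are non-negative up to rounding, as expected from the Schur product theorem.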
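For a fixed $Z$, with $X_q$ the matrix of points belonging to cluster $q$ (the rows of $X$ selected by $Z_q$) and $y_q$ the corresponding labels, the linear predictors $w_q$ for each cluster of points are given by
\[
w_q = \left(n\mu s_qI_d + X_q^TX_q\right)^{-1}X_q^Ty_q.
\]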
Acknowledgements. AA is at CNRS, at the Département d'Informatique at École Normale Supérieure, 23 avenue d'Italie, 75013 Paris, France. INRIA, Sierra project-team, PSL Research University. The authors would like to acknowledge support from a starting grant from the European Research Council (ERC project SIPA), an AMX fellowship, the MSR-Inria Joint Centre, as well as support from the chaire Économie des nouvelles données, the data science joint research initiative with the fonds AXA pour la recherche and a gift from Société Générale Cross Asset Quantitative Research.