Adaptive Clustering through Semidefinite Programming

(1)

HAL Id: hal-01524677

https://hal.archives-ouvertes.fr/hal-01524677

Preprint submitted on 18 May 2017

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Distributed under a Creative Commons Attribution - ShareAlike| 4.0 International

Adaptive Clustering through Semidefinite Programming

Martin Royer

To cite this version:

Martin Royer. Adaptive Clustering through Semidefinite Programming. 2017. �hal-01524677�

(2)

Adaptive Clustering through Semidefinite Programming

Martin Royer

Laboratoire de Math´ ematiques d’Orsay, Univ. Paris-Sud, CNRS, Universit´ e Paris-Saclay, 91405 Orsay, France.

Abstract

We analyze the clustering problem through a flexible probabilistic model that aims to identify an optimal partition on the sample X ₁ , ..., X _n . We perform exact clustering with high probability using a convex semidefinite estimator that interprets as a corrected, relaxed version of K-means.

The estimator is analyzed through a non-asymptotic framework and showed to be optimal or near-optimal in recovering the partition. Furthermore, its performances are shown to be adaptive to the problem’s effective dimension, as well as to K the unknown number of groups in this partition. We illustrate the method’s performances in comparison to other classical clustering algorithms with numerical experiments on simulated data.

1 Introduction

Clustering, a form of unsupervised learning, is the classical problem of assembling n observations X ₁ , ..., X _n from a p-dimensional space into K groups. Applied fields are craving for robust clus- tering techniques, such as computational biology with genome classification, data mining or image segmentation from computer vision. But the clustering problem has proven notoriously hard when the embedding dimension is large compared to the number of observations (see for instance the recent discussions from [2, 21]).

A famous early approach to clustering is to solve for the geometrical estimator K-means [12, 13, 19].

The intuition behind its objective is that groups are to be determined in a way to minimize the total intra-group variance. It can be interpreted as an attempt to ”best” represent the observations by K points, a form of vector quantization. Although the method shows great performances when observations are homoscedastic, K-means is a NP-hard, ad-hoc method. Clustering with probabilistic frameworks are usually based on maximum likelihood approaches paired with a variant of the EM algorithm for model estimation, see for instance the works of Fraley & Raftery [10] and Dasgupta & Schulman [9]. These methods are widespread and popular, but they tend to be very sensitive to initialization and model misspecifications.

Several recent developments establish a link between clustering and semidefinite programming.

Peng & Wei [16] show that the K-means objective can be relaxed into a convex, semidefinite

program, leading Mixon et al. [15] to use this relaxation under a subgaussian mixture model to

estimate the cluster centers. Chr´ etien et al. [8] use a slightly different form of a semidefinite

program, inspired by work on community detection by Gu´ edon & Vershynin [11], to recover the

adjacency matrix of the cluster graph with high probability. Lastly in the different context of

variable clustering, Bunea et al. [5] present a semidefinite program with a correction step to

produce non-asymptotic exact recovery results.

(3)

In this work, we introduce a semidefinite, penalized estimator for point clustering inspired by [16]

and adapted from the work and context of [5]. We analyze it through a flexible probabilistic model inducing an optimal partition that we aim to recover. We investigate the optimal conditions of exact clustering recovery with high probability and show optimal performances – including in high dimensions, improving on [15], as well as adaptability to the effective dimension of the problem.

We also show that our results continue to hold without knowledge of the number of groups K.

Lastly we provide evidence of our method’s efficiency from simulated data and suggest a coherent alternative in case our estimator is too costly to compute.

Notation. Throughout this work we use the convention 0/0 := 0 and [n] = {1, ..., n}. We take a _n . b _n to mean that a _n is smaller than b _n up to an absolute constant factor. Let S _d−1 denote the unit sphere in R ^d . For q ∈ N ^∗ ∪ {+∞}, ν ∈ R ^d , |ν| _q is the l _q -norm and for M ∈ R ^d×d

0

, |M | _q ,

|M| _F , |M | ∗ and |M| _op are respectively the entry-wise l q -norm, the Frobenius norm associated with scalar product h., .i, the nuclear norm and the operator norm. |D| _V is the variation semi-norm for a diagonal matrix D, the difference between its maximum and minimum element. Let A < B mean that A − B is symmetric, positive semidefinite.

2 Probabilistic modeling of point clustering

Consider X 1 , ..., X n and let ν a = E [X a ]. The variable X a can be decomposed into

X a = ν a + E a , a = 1, ..., n, (2.1)

with E a stochastic centered variables in R ^p .

Definition 1. For K > 1, µ = (µ ₁ , ..., µ _K ) ∈ ( R ^p ) ^K , δ > 0 and G = {G ₁ , ..., G _K } a partition of [n], we say X ₁ , ..., X _n are (G, µ, δ)-clustered if ∀k ∈ [K], ∀a ∈ G _k , |ν _a − µ _k | ₂ 6 δ. We then call

∆(µ) := min

k<l |µ _k − µ l | ₂ (2.2)

the separation between the cluster means, and

ρ(G, µ, δ) := ∆(µ)/δ (2.3) the discriminating capacity of (G, µ, δ).

In this work we assume that X ₁ , ..., X _n are (G, µ, δ)-clustered. Notice that this definition does not impose any constraint on the data: for any given G, there exists a choice of µ, means and radius δ important enough so that X 1 , ..., X n are (G, µ, δ)-clustered. But we are interested in partitions with greater discriminating capacity, i.e. that make more sense in terms of group separation.

Indeed remark that if ρ(G, µ, δ) < 2, the population clusters {ν _a } _a∈G

₁

, ..., {ν _a } _a∈G

_K

are not linearly separable, but a high ρ(G, µ, δ) implies that they are well-separated from each other. Furthermore, we have the following result.

Proposition 1. Let (G _K ^∗ , µ ^∗ , δ ^∗ ) ∈ arg max ρ(G, µ, δ) for (G, µ, δ) such that X ₁ , ..., X _n are (G, µ, δ)- clustered, and |G| = K. If ρ(G _K ^∗ , µ ^∗ , δ ^∗ ) > 4 then G _K ^∗ is the unique maximizer of ρ(G, µ, δ).

So G _K ^∗ is the partition maximizing the discriminating capacity over partitions of size K . Therefore in this work, we will assume that there is a K > 1 such that X ₁ , ..., X _n is (G, µ, δ)-clustered with

|G| = K and ρ(G, µ, δ) > 4. By Proposition 1, G is then identifiable. It is the partition we aim to recover.

We also assume that X ₁ , ..., X _n are independent observations with subgaussian behavior. Instead

of the classical isotropic definition of a subgaussian random vector (see for example [20]), we use a

more flexible definition that can account for anisotropy.

(4)

Definition 2. Let Y be a random vector in R ^d , Y has a subgaussian distribution if there exist Σ ∈ R ^d×d such that ∀x ∈ R ^d ,

E h

e ^x

^T

^(Y ⁻ ^E ^Y ⁾ i

6 e ^x

^T

^Σx/2 . (2.4)

We then call Σ a variance-bounding matrix of random vector Y , and write shorthand Y ∼ subg(Σ).

Note that Y ∼ subg(Σ) implies Cov(Y ) 4 Σ in the semidefinite sense of the inequality. To sum-up our modeling assumptions in this work:

Hypothesis 1. Let X 1 , ..., X n be independent, subgaussian, (G, µ, δ)-clustered with ρ(G, µ, δ) > 4.

Remark that the modelization of Hypothesis 1 can be connected to another popular probabilistic model: if we further ask that X 1 , ..., X n are identically-distributed within a group (and hence δ = 0), the model becomes a realization of a mixture model.

3 Exact partition recovery with high probability

Let G = {G ₁ , ..., G _K } and m := min _k∈[K] |G _k | denote the minimum cluster size. G can be repre- sented by its caracteristic matrix B ^∗ ∈ R ^n×n defined as ∀k, l ∈ [K] ² , ∀(a, b) ∈ G _k × G _l ,

B _ab ^∗ :=

1/|G _k | if k = l

0 otherwise.

In what follows, we will demonstrate the recovery of G through recovering its caracteristic matrix B ^∗ . We introduce the sets of square matrices

C _K ^{0,1} := {B ∈ R ^n×n + : B ^T = B, tr(B) = K, B1 n = 1 n , B ² = B} (3.1) C _K := {B ∈ R ^n×n + : B ^T = B, tr(B) = K, B1 n = 1 n , B < 0} (3.2) C := [

K∈ N

C _K . (3.3)

We have: C _K ^{0,1} ⊂ C _K ⊂ C and C _K is convex. Notice that B ^∗ ∈ C _K ^{0,1} . A result by Peng, Wei (2007) [16] shows that the K-means estimator ¯ B can be expressed as

B ¯ = arg max

B∈C

_K^{0,1}

h Λ, Bi b (3.4)

for Λ := (hX b _a , X _b i) _(a,b)∈[n]

2

∈ R ^n×n , the observed Gram matrix. Therefore a natural relaxation is to consider the following estimator:

B b := arg max

B∈C

_K

h Λ, Bi. b (3.5)

Notice that E Λ = Λ + Γ for Λ := (hν b _a , ν b i) _(a,b)∈[n]

2

∈ R ^n×n , and Γ := E [hE _a , E b i] _(a,b)∈[n]

2

= diag (| Var(E _a )| _∗ ) _16a6n ∈ R ^n×n . The following two results demonstrate that Λ is the signal structure that lead the optimizations of (3.4) and (3.5) to recover B ^∗ , whereas Γ is a bias term that can hurt the process of recovery.

Proposition 2. There exist c 0 > 1 absolute constant such that if ρ ² (G, µ, δ) > c 0 (6 + √

n/m) and m∆ ² (µ) > 8|Γ| _V , then we have

arg max

B∈C

_K^{0,1}

hΛ + Γ, Bi = B ^∗ = arg max

B∈C

K

hΛ + Γ, Bi. (3.6)

(5)

This proposition shows that the B b estimator, as well as the K-means estimator, would recover partition G on the population Gram matrix if the variation semi-norm of Γ were sufficiently small compared to the cluster separation. Notice that to recover the partition on the population version, we require the discriminating capacity to grow as fast as 1 + ( √

n/m) ^1/2 instead of simply 1 from Hypothesis 1. The following proposition demonstrates that if the condition on the variation semi- norm of Γ is not met, G may not even be recovered on the population version.

Proposition 3. There exist G, µ, δ and Γ such that ρ ² (G, µ, δ) = +∞ but we have m∆ ² (µ) < 2|Γ| _V and

B ^∗ ∈ / arg max

B∈C

_K^{0,1}

hΛ + Γ, Bi and B ^∗ ∈ / arg max

B∈C

K

hΛ + Γ, Bi. (3.7) So Proposition 3 shows that even if the population clusters are perfectly discriminated, there is a configuration for the variances of the noise that makes it impossible to recover the right cluster- ing by K-means. This shows that K-means may fail when the random variable homoscedasticity assumption is violated, and that it is important to correct for Γ.

The estimator from [5] can be adapted to our context. We introduce the following estimator, for (a, b) ∈ [n] ² let V (a, b) := max (c,d)∈([n]\{a,b})

²

hX _a − X _b , _|X ^X

^c

^−X

^d

c

−X

_d

|

₂

i

, b 1 := arg min _b∈[n]\{a} V (a, b) and b 2 := arg min _b∈[p]\{a,b

₁

_} V (a, b). Then for a ∈ [n], let

Γ b ^corr := diag hX _a − X b

1

, X a − X b

2

i _a∈[n]

. (3.8)

Computing Γ b ^corr can be interpreted as a correcting term to de-bias Λ as an estimator of Λ. The b result from Proposition 2 demonstrates the interest of studying the following semi-definite estimator of the projection matrix B ^∗ , let

B b ^corr := arg max

B∈C

_K

h Λ b − Γ b ^corr , Bi. (3.9) In order to demonstrate the recovery of B ^∗ by this estimator, we introduce different quantitative measures of the ”spread” of our stochastic variables, that affect the quality of the recovery. By Hypothesis 1 there exist Σ 1 , ..., Σ n such that ∀a ∈ [n], X a ∼ subg(Σ a ). Let

σ ² := max

a∈[n] |Σ _a | _op , V ² := max

a∈[n] |Σ _a | _F , γ ² := max

a∈[n] |Σ _a | ∗ . (3.10) We are now ready to introduce this paper’s main result: a condition on the separation between the cluster means sufficient for ensuring recovery of B ^∗ with high probability.

Theorem 1. Assume that m > 2. For c 1 , c 2 > 0 absolute constants, if m∆ ² (µ) > c 2 σ ² (n + m log n) + V ² ( p

n + m log n) + γ(σ p

log n + δ) + δ ² ( √

n + m)

, (3.11) then with probability larger than 1 − c 1 /n we have B b ^corr = B ^∗ , and therefore G b ^corr = G.

We call the right hand side of (3.11) the separating rate. Notice that we can read two kinds of requirements coming from the separating rate: requirements on the radius δ, and requirements on σ ² , V ² , γ dependent on the distributions of observations. It appears as if δ + σ √

log n can be interpreted as a geometrical width of our problem. If we ask that δ is of the same order as σ √

log n, a maximum gaussian deviation for n variables, then all conditions on δ from (3.11) can be removed.

Thus for convenience of the following discussion we will now assume δ . σ √

log n.

(6)

How optimal is the result from Theorem 1? Notice that our result is adapted to anisotropy in the noise, but to discuss optimality it is easier to look at the isotropic scenario: V ² = √

pσ ² and γ ² = pσ ² . Therefore ∆ ² (µ)/σ ² represents a signal-to-noise ratio. For simplicity let us also assume that all groups have equal size, that is |G ₁ | = ... = |G _K | = m so that n = mK and the sufficient condition (3.11) becomes

∆ ² (µ)

σ ² & K + log n +

r

(K + log n) pK

n . (3.12)

Optimality. To discuss optimality, we distinguish between low and high dimensional setups.

In the low-dimensional setup n ∨ m log n & p, we obtain the following condition:

∆ ² (µ)

σ ² & K + log n

. (3.13)

Discriminating with high probability between n observations from two gaussians in dimension 1 would require a separating rate of at least σ ² log n. This implies that when K . log n, our result is minimax. Otherwise, to our knowledge the best clustering result on approximating mixture center is from [15], and on the condition that ∆ ² (µ)/σ ² & K ² . Furthermore, the K & log n regime is known in the stochastic-block-model community as a hard regime where a gap is surmised to exist between the minimal information-theoretic rate and the minimal achievable computational rate (see for example [7]).

In the high-dimensional setup n ∨ m log n . p, condition (3.12) becomes:

∆ ² (µ) σ ² &

r

(K + log n) pK

n . (3.14)

There are few information-theoretic bounds for high-dimension clustering. Recently, Banks, Moore, Vershynin, Verzelen and Xu (2017) [3] proved a lower bound for Gaussian mixture clustering de- tection, namely they require a separation of order p

K(log K)p/n. When K . log n, our condition is only different in that it replaces log(K) by log(n), a price to pay for going from detecting the clusters to exactly recovering the clusters. Otherwise when K grows faster than log n there might exist a gap between the minimal possible rate and the achievable, as discussed previously.

Adaptation to effective dimension. We can analyse further the condition (3.11) by introduc- ing an effective dimension r ∗ , measuring the largest volume repartition for our variance-bounding matrices Σ 1 , ..., Σ n . Let

r ∗ := γ ²

σ ² = max _a∈[n] |Σ _a | _∗

max _a∈[n] |Σ _a | _op , (3.15)

r ∗ can also be interpreted as a form of global effective rank of matrices Σ a . Indeed, define Re(Σ) :=

|Σ| _∗ /|Σ| _op , then we have r ∗ 6 max _a∈[n] Re(Σ a ) 6 max _a∈[n] rank(Σ a ) 6 p.

Now using V ² 6 √

r ∗ σ ² and γ = √

r ∗ σ, condition (3.11) can be written as

∆ ² (µ)

σ ² & K + log n +

r

(K + log n) r ∗ K

n . (3.16)

By comparing this equation to (3.12), notice that r ∗ is in place of p, indeed playing the role of an

effective dimension for the problem. This also shows that our estimator adapts to this effective

dimension, without any dimension reduction step. In consequence, equation (3.16) distinguishes

between an actual high-dimensional setup: n ∨ m log n . r ∗ and a ”low” dimensional setup r ∗ .

(7)

n ∨ m log n under which, regardless of the actual value of p, our estimators recovers under the near-minimax condition of (3.13).

This informs on the effect of correcting term b Γ ^corr in the theorem above when n+m log n . r ∗ . The un-corrected version of the semi-definite program (3.5) has a leading separating rate of γ ² /m = σ ² r ∗ /m, but with the Γ b ^corr correction on the other hand, (3.16) has leading separating factor smaller than σ ² p

(K + log n)r ∗ /m = σ ² √

n + m log n × √

r ∗ /m. This proves that in a high-dimensional setup, our correction enhances the separating rate of at least a factor p

(n + m log n)/r ∗ .

4 Adaptation to the unknown number of group K

It is rarely the case that K is known, but we can proceed without it. We produce an estimator adaptive to the number of groups K: let b κ ∈ R + , we now study the following adaptive estimator:

B e ^corr := arg max

B∈C

hb Λ − Γ b ^corr , Bi − b κ tr(B ). (4.1) Theorem 2. Suppose that m > 2 and (3.11) is satisfied. For c 3 , c 4 , c 5 > 0 absolute constants suppose that the following condition on b κ is satisfied

c 4

V ² √

n + σ ² n + γ(σ p

log n + δ) + δ ² √ n

< c 5 b κ < m∆ ² (µ), (4.2) then we have B e ^corr = B ^∗ with probability larger than 1 − c 3 /n

Notice that condition (4.2) essentially requires κ b to be seated between m∆ ² (µ) and some compo- nents of the right-hand side of (3.11). So under (4.2), the results from the previous section apply to the adaptive estimator B e ^corr as well and this shows that it is not necessary to know K in order to perform well for recovering G. Finding an optimized, data-driven parameter b κ using some form of cross-validation is outside of the scope of this paper.

5 Numerical experiments

We illustrate our method on simulated Gaussian data in two challenging, high-dimensional setup experiments for comparing clustering estimators. Our sample are drawn from K = 3 identically- sized, identically distributed and perfectly discriminated clusters of non-isovolumic Gaussians. The distributions are chosen to be isotropic, and the ratio between the lowest and the highest standard deviation is of 1 to 10. We draw points of a R ^p space in two different scenarii. In (S ₁ ), for a given dimension space p = 2000 and a fixed isotropic noise level, we report the algorithms’ compared performances as the signal-to-noise ratio ∆ ² (µ)/σ ² is increased from 1 to 20. In (S ₂ ) we impose a fixed signal to noise ratio, and observe the algorithm’s decay in performance as the space dimension p is increased from 100 to 400 000. All points of the simulated space are reported as a median value with asymmetric standard deviations in the form of errorbars over a hundred simulations.

Solving for estimator B b ^corr is a hard problem as n grows. For this task we implemented an ADMM solver from the work of Boyd et al. [4] with multiple stopping criterions including a fixed number of iterations of T = 3000. The results we report use n = 30 samples. For reference, we compare the recovering capacities of G b ^corr , labeled ’pecok’ in Figure 1 with other classical clustering algorithm.

We chose three different but standard clustering procedures: Lloyd’s K-means algorithm [12] with

K-means++ initialization [1] (although in scenario (S ₂ ), it is too slow to converge as p grows so

we do not report it), Ward’s method for Hierarchical Clustering [22] and the low-rank clustering

(8)

615

∆

²

( µ ) /σ

²

VSOLWMRLQG

,

cG

NPHDQV SHFRN ORZUDQNVSHFWUDO KLHUDUFKLFDO FRUG

Scenario (S 1 )

GLPHQVLRQS

VSOLWMRLQG

,

cG

SHFRN ORZUDQNVSHFWUDO KLHUDUFKLFDO FRUG

Scenario (S 2 )

Figure 1: Performance comparison for classical clustering estimators and ours G b ^corr , labeled ’pecok’

in reference to [5]. The lower split-join, the better the clustering performance and split-join(G, G) = b 0 implies G b = G.

algorithm applied to the Gram matrix, a spectral method appearing in McSherry [14]. Lastly we include the CORD algorithm from Bunea et al. [6].

We measure the performances of estimators by computing the split-join metric on the cluster graphs, counting the number of edges to remove or add to go from one graph to the other. In the two experiments, the results of G b ^corr are markedly better than that of other methods. Scenario (S ₁ ) shows it can achieve exact recovery with a lesser signal to noise ratio than its competitors, whereas scenario (S ₂ ) shows its performances are decaying at a much lower rate than the others when the space dimension is increased.

Because of the slow convergence of ADMM, G b ^corr comes with important computation times. Of course all of the compared methods have a very hard time reaching high sample sizes n in the high dimensional context and to that regard, the low-rank clustering method is by far the most promising.

6 Conclusion

In this paper we analyzed a new semidefinite positive algorithm for clustering within the context of a flexible probabilistic model and exhibit the key quantities that guarantee non-asymptotic exact recovery. It implies an essential bias-removing correction that significanty improves the recovering rate in the high-dimensional setup. Hence we showed the estimator to be near-minimax, adapted to an effective dimension of the problem. We demonstrated that our estimator can in theory be optimally adapted to a data-driven choice of K. Lastly we illustrated on high-dimensional experiments that our approach is empirically stronger than other classical clustering methods.

Our method is computationally intensive even though it is of polynomial order. As the Γ b ^corr

(9)

correction step of the algorithm can be interpreted as an independent, denoising step for the Gram matrix, we suggest using it as such for other notably faster algorithm such as the spectral algorithms.

Acknowledgements

This work is supported by a public grant overseen by the French National research Agency (ANR) as part of the “Investissement d’Avenir” program, through the “IDI 2015” project funded by the IDEX Paris-Saclay, ANR-11-IDEX-0003-02. It is also supported by the CNRS PICS funding HighClust.

We thank Christophe Giraud for a shrewd, unwavering thesis direction.

References

[1] David Arthur and Sergei Vassilvitskii. K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA

’07, pages 1027–1035, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Math- ematics.

[2] Martin Azizyan, Aarti Singh, and Larry Wasserman. Efficient Sparse Clustering of High- Dimensional Non-spherical Gaussian Mixtures. In Guy Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 37–45, San Diego, California, USA, 09–12 May 2015. PMLR.

[3] J. Banks, C. Moore, N. Verzelen, R. Vershynin, and J. Xu. Information-theoretic bounds and phase transitions in clustering, sparse PCA, and submatrix localization. ArXiv e-prints, July 2016.

[4] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found.

Trends Mach. Learn., 3(1):1–122, January 2011.

[5] F. Bunea, C. Giraud, M. Royer, and N. Verzelen. PECOK: a convex optimization approach to variable clustering. ArXiv e-prints, June 2016.

[6] Florentina Bunea, Christophe Giraud, and Xi Luo. Minimax optimal variable clustering in g-models via cord. arXiv preprint arXiv:1508.01939, 2015.

[7] Yudong Chen and Jiaming Xu. Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. J. Mach. Learn.

Res., 17(1):882–938, January 2016.

[8] St´ ephane Chr´ etien, Cl´ ement Dombry, and Adrien Faivre. A semi-definite programming ap- proach to low dimensional embedding for unsupervised clustering. CoRR, abs/1606.09190, 2016.

[9] Sanjoy Dasgupta and Leonard Schulman. A probabilistic analysis of em for mixtures of sepa- rated, spherical gaussians. J. Mach. Learn. Res., 8:203–226, May 2007.

[10] Chris Fraley and Adrian E. Raftery. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458):611–631, 2002.

[11] Olivier Gu´ edon and Roman Vershynin. Community detection in sparse networks via grothendieck’s inequality. CoRR, abs/1411.4686, 2014.

[12] S. Lloyd. Least squares quantization in pcm. IEEE Trans. Inf. Theor., 28(2):129–137, Septem- ber 1982.

[13] J. MacQueen. Some methods for classification and analysis of multivariate observations. In

Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol-

ume 1: Statistics, pages 281–297, Berkeley, Calif., 1967. University of California Press.

(10)

[14] F. McSherry. Spectral partitioning of random graphs. In Proceedings of the 42Nd IEEE Symposium on Foundations of Computer Science, FOCS ’01, pages 529–, Washington, DC, USA, 2001. IEEE Computer Society.

[15] D. G. Mixon, S. Villar, and R. Ward. Clustering subgaussian mixtures with k-means. In 2016 IEEE Information Theory Workshop (ITW), pages 211–215, Sept 2016.

[16] Jiming Peng and Yu Wei. Approximating k-means-type clustering via semidefinite program- ming. SIAM J. on Optimization, 18(1):186–205, February 2007.

[17] Philippe Rigollet. High-Dimensional Statistics. Massachusetts Institute of Technology: MIT OpenCourseWare, Spring 2015.

[18] Mark Rudelson and Roman Vershynin. Hanson-wright inequality and sub-gaussian concentra- tion. Electron. Commun. Probab, 18(82):1–9, 2013.

[19] H. Steinhaus. Sur la division des corp materiels en parties. Bull. Acad. Polon. Sci, 1:801–804, 1956.

[20] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. Chapter 5 of:

Compressed Sensing, Theory and Applications. Cambridge University Press, 2012.

[21] N. Verzelen and E. Arias-Castro. Detection and Feature Selection in Sparse Mixture Models.

ArXiv e-prints, May 2014.

[22] Joe H. Ward. Hierarchical grouping to optimize an objective function. Journal of the American

Statistical Association, 58(301):236–244, 1963.

(11)

Appendix

A Intermediate results

A.1 Generic controls for exact recovery

Let Γ be any estimator of Γ and let b B b := arg max _B∈C

_K

h Λ b − Γ, Bi. b

Theorem 3. For c 1 , c 2 > 0 absolute constants suppose that |b Γ −Γ| _V 6 γ ¯ _n ² with probability 1 − c 1 /n, and that

m∆ ² (µ) > c 2

σ ² (n + m log n) + V ² ( p

n + m log n) + ¯ γ _n ² + δ ² ( √ n + m)

, (A.1)

then we have B b = B ^∗ with probability larger than 1 − c 1 /n

In the case where the number of groups is unknown we study B e := arg max _B∈C hb Λ − Γ, Bi − b b κ tr(B ) for b κ ∈ R .

Theorem 4. For c 3 , c 4 , c 5 > 0 absolute constants suppose that |b Γ − Γ| _∞ 6 γ ¯ _n ² with probability 1 − c ₃ /n. Suppose that (A.1) is satisfied and that the following condition on b κ is satisfied

c ₄ V ² √

n + σ ² n + ¯ γ _n ² + δ ² √ n

< c ₅ κ < m∆ b ² (µ), (A.2) then we have B e = B ^∗ with probability larger than 1 − c ₃ /n

A.2 On estimating Γ

In the general case we have b Γ = 0 hence a deterministic perturbation term ¯ γ _n ² = |Γ| ∞ weighing on the separation requirements. For b Γ ^corr , we have the following result.

Proposition 4. Assume that m > 2. For c 6 , c 7 > 0 absolute constants, with probability larger than 1 − c ₆ /n we have

|b Γ ^corr − Γ| _∞ 6 c ₇

σ ² log n + (δ + σ p

log n)γ + δ ²

. (A.3)

A.3 Concentration of random subgaussian Gram matrices

A key result in our proof is the following concentration bound on the Gram matrix of centered, subgaussian, independent random variables.

Lemma 1. For some absolute constant c ∗ > 0, for a ∈ [n] let E a be centered, independent random vectors in R ^d , E a ∼ subg(Σ a ). Let E :=

h ...

E

^T_a

...

i

∈ R ^n×d then ∀t > 0

P

|EE ^T − E EE ^T

| _op > 2 max

a∈[n] |Σ _a | _F √

t + 2 max

a∈[n] |Σ _a | _op t

6 9 ⁿ 2e ^−c

^∗

^t . (A.4)

(12)

B Main proofs

B.1 Proof of Proposition 1: identifiability

Suppose that X ₁ , ..., X _n are (G, µ, δ)-clustered with |G| = K, and ρ(G, µ, δ) > 4. Then we remark that for (a, b) ∈ [n] ² , a ∼ ^G b is equivalent to |ν _a − ν _b | ₂ 6 2δ because:

• if a ∼ ^G b then there exist k ∈ [K] such that |ν _a − ν b | ₂ 6 |ν _a − µ k | ₂ + |µ _k − ν b | ₂ 6 2δ

• if a

G

6∼ b then there exist (k, l) ∈ [K ] ² such that |ν _a − ν b | ₂ > |µ _k − µ l | ₂ − |ν _a − µ k | ₂ − |ν _b − µ l | ₂ >

4δ − 2δ > 2δ.

Now suppose there exist G ⁰ such that X ₁ , ..., X _n are (G ⁰ , µ ⁰ , δ ⁰ )-clustered with |G ⁰ | = K and ρ(G ⁰ , µ ⁰ , δ ⁰ ) > 4. By symmetry we can assume δ ⁰ 6 δ, and the previous remark shows that G ⁰ is a sub-partition of G, ie G preserves the structure of G ⁰ . But since |G| = |G ⁰ | this implies G = G ⁰ . B.2 Exact recovery with high probability

The proof for Theorem 1 (respectively Theorem 2) is a composition of Theorem 3 (respectively Theorem 4) and Proposition 4.

In this section, under Hypothesis 1, we have ∀k ∈ [K], ∀a ∈ G _k : X a ∼ subg(Σ a ). For k ∈ [K ], we define σ ² _k := max a∈G

_k

|Σ _a | _op 6 σ ² , V _k ² := max a∈G

_k

|Σ _a | _F 6 V ² , γ _k ² := max a∈G

_k

|Σ _a | _∗ 6 γ ² .

A number of proofs in this section are adapted from the proof ensemble of [5]. In it the authors use a latent model for variable clustering. A comparable model in this work would require to impose the following conditions on X ₁ , ..., X _n : identically distributed variables within a group (implying δ = 0) and isovolumic, Gaussian distributions.

B.2.1 Proof of Theorem 3

In this theorem we only need to consider B ∈ C _K , but the proof of Theorem 4 is similar to this one, hence we will start by considering the more general B ∈ C and use B ∈ C _K at a later stage of the proof. Thus we want to prove that under some conditions, with high probability:

h Λ b − Γ, B b ^∗ − Bi > 0 for all B ∈ C \ {B ^∗ } (B.1) For (a, b) ∈ G _k × G _l for (k, l) ∈ [K] ² , let:

(S 1 ) ab := −|µ _k − µ l | ² ₂ /2 (B.2)

(W 1 ) ab := hν _a − µ k , ν b − µ l i

(W ₂ ) _ab := hµ _k − ν _a + ν _b − µ _l + E _b − E _a , µ _k − µ _l i (W ₃ ) _ab := hE _b − E _a , ν _a − µ _k + µ _l − ν _b i

(W ₄ ) _ab := (hE _a , E _b i − Γ _ab ) (W ₅ ) _ab := (Γ − Γ) b _ab Lemma 2. Proving (B.1) reduces to proving

hS ₁ + W 1 + W 2 + W 3 + W 4 + W 5 , B ^∗ − Bi > 0 for all B ∈ C \ {B ^∗ }. (B.3)

(13)

The proof for Lemma 2 is found in section B.2.3. So we need only concern ourselves with the quan- tities S ₁ , W ₁ , W ₂ , W ₃ , W ₄ , W ₅ . The term S ₁ contains our uncorrupted signal and since hS ₁ , B ^∗ i = 0 it writes:

hS ₁ , B ^∗ − Bi = X

16k6=l6K

1 2 |µ _k − µ l | ² ₂ |B _G

_k

_G

_l

| ₁ (B.4) The other parts are noisy and must be controlled. The term W ₂ is a simple subgaussian form controlled through the following lemma, proved in section B.2.4:

Lemma 3. For c ⁰ ₂ > 0 absolute constant, with probability greater than 1 − 1/n:

∀B ∈ C, |hW ₂ , B ^∗ − Bi| 6 X

16k6=l6K

2δ +

q

c ⁰ ₂ (log n)(σ _k ² + σ _l ² )

|µ _k − µ l | ₂ |B _G

_k

_G

_l

| ₁ . (B.5) To control the other noisy terms we now introduce a deterministic result:

Lemma 4. For any symmetric matrix W ∈ R ^n×n we have:

∀B ∈ C, |hW, B ^∗ − B i| 6 6|B ^∗ W | ∞

X

16k6=l6K

|B _G

_k

_G

_l

| ₁ + |W | _op h X

16k6=l6K

|B _G

_k

_G

_l

| ₁ /m + (tr(B ) − K) i

.

(B.6) The proof for Lemma 4 will be found in [5], p.21-22 until eq. (58).

As B ^∗ 1 = 1 and B ^∗ > 0, |B ^∗ W | _∞ 6 |W | _∞ so we use the lemma on terms W ₁ and W ₃ by bounding

|W | ∞ and |W | _op : for the term W ₁ we use |W ₁ | ∞ 6 δ ² so |W ₁ | _op 6 δ ² √

n. To control the term W ₃ , we use the subgaussian tail bound of (B.25) with |ν _a − µ k + µ l − ν b | ₂ 6 2δ and a union bound over (a, b) ∈ [n] ² . We get that for c ⁰ ₃ > 0 absolute constant, with probability greater than 1 − 1/n,

|W ₃ | ∞ 6 p

c ⁰ ₃ (log n)σ ² δ ² and |W ₃ | _op 6 p

c ⁰ ₃ (log n)σ ² δ ² × √

n therefore with probability greater than 1 − 1/n, ∀B ∈ C:

|hW ₁ , B ^∗ − Bi| 6 δ ² h X

16k6=l6K

|B _G

_k

_G

_l

| ₁ (6 +

√ n m ) + √

n(tr(B) − K) ₊ i

(B.7)

|hW ₃ , B ^∗ − Bi| 6 q

c ⁰ ₃ (log n)σ ² δ ² h X

1 6 k6=l 6 K

|B _G

_k

_G

_l

| ₁ (6 +

√ n m ) + √

n(tr(B) − K) ₊ i

(B.8) For the term W ₄ we introduce the following lemma, proved in section B.2.5:

Lemma 5. For c ⁰ ₄ , c ⁰⁰ ₄ > 0 absolute constants, with probability larger than 1 − 2/n:

∀B ∈ C, |hW ₄ , B ^∗ − Bi| 6 h

6c ⁰ ₄ (V ² p

log n + σ ² log n)/ √

m + c ⁰⁰ ₄ (V ² √

n + σ ² n)/m i X

16k6=l6K

|B _G

_k

_G

_l

| ₁ + (tr(B ) − K) ₊ c ⁰⁰ ₄ (V ² √

n + σ ² n). (B.9)

Lastly as the term W ₅ is diagonal we have |W ₅ | _op = |W ₅ | _∞ and |B ^∗ W ₅ | _∞ 6 |W ₅ | _∞ /m therefore:

∀B ∈ C, |hW ₅ , B ^∗ − B i| 6 |W ₅ | ∞

h 7 m

X

16k6=l6K

|B _G

_k

_G

_l

| ₁ + (tr(B) − K) +

i

(B.10)

(14)

Using those controls of W ₁ , W ₂ , W ₃ , W ₄ , W ₅ , in combination in a union bound in (B.3) we get for c ⁰ ₁ > 0 absolute constant, with probability greater than 1 − c ⁰ ₁ /n: ∀B ∈ C,

hS ₁ + W ₁ + W ₂ + W ₃ + W ₄ + W ₅ , B ^∗ − Bi > X

16k6=l6K

h 1

2 |µ _k − µ _l | ² ₂ − 2δ +

q

2c ⁰ ₂ (log n)σ ²

|µ _k − µ _l | ₂

− (6c ⁰ ₄ V ² √

log n + σ ² log n

√ m + c ⁰⁰ ₄ V ² √

n + σ ² n

m ) − 7

m |W ₅ | _∞ − (6 +

√ n m )(δ ² +

q

c ⁰ ₃ (log n)σ ² δ ² ) i

|B _G

_k

_G

_l

| ₁

− (tr(B ) − K) + [c ⁰⁰ ₄ (V ² √

n + σ ² n) + (δ ² + q

c ⁰ ₃ (log n)σ ² δ ² ) √

n + |W ₅ | _∞ ] (B.11)

We now use the fact that for this theorem we are only considering B ∈ C _K , ie matrices such that tr(B) = K so we can discard the last line of (B.11). In this particular context we can improve the control provided by Lemma 4 for W ₅ : as tr(B ^∗ ) = K, we have for α ∈ R : |hW ₅ , B ^∗ − B i| 6

|hW ₅ − αI _n , B ^∗ − Bi| + |α(tr(B ) − K)|. So by choosing α = (max _a (W ₅ ) _aa + min _a (W ₅ ) _aa )/2, we have |W ₅ − αI n | _op = |W ₅ − αI n | _∞ = |W ₅ | _V /2 and therefore:

∀B ∈ C _K |hW ₅ , B ^∗ − B i| 6 |W ₅ | _V 7 2m

X

16k6=l6K

|B _G

_k

_G

_l

| ₁ (B.12) In consequence we can replace |W ₅ | _∞ by |W ₅ | _V /2 in the second line of (B.11), and with another union bound, by assumption we replace |W ₅ | _V /2 by ¯ γ _n ² /2.

Lastly Lemma 3 p. 17 from [5] shows the only matrix in C _K whose support is included in supp(B ^∗ ) is B ^∗ , therefore B ∈ C _K \{B ^∗ } implies P

16k6=l6K |B _G

_k

_G

_l

| ₁ > 0. Hence for c 2 > 0 absolute constant, the following condition on ∆(µ) is sufficient to ensure exact recovery with probability larger than 1 − c 1 /n:

∆ ² (µ) > c ₂

σ ² m log n + V ² p

m log n + V ² √

n + σ ² n + ¯ γ _n ² + δ ² ( √

n + m)

× 1

m (B.13)

This concludes the proof for Theorem 3.

B.2.2 Proof of Theorem 4: adaptive exact recovery

In this Theorem we need to take into account the additional penalization term b κ tr(B). Notice it is equivalent to a correction by κI b n of our estimator Λ− b Γ, therefore for b B ∈ C, h Λ− b Γ b − κI b n , B ^∗ −Bi = h Λ b − Γ, B b ^∗ − Bi + b κ × (tr(B) − K). Therefore for Theorem 4 we can follow the same proof as in Theorem 3 until establishing (B.11), at which point we can use a union bound to use the assumption

|W ₅ | _∞ 6 γ ¯ _n ² . Consequently we have with probability greater than 1 − c ⁰ ₁ /n: ∀B ∈ C, hS ₁ + W ₁ + W ₂ + W ₃ + W ₄ + W ₅ , B ^∗ − Bi > X

16k6=l6K

h 1

2 |µ _k − µ _l | ² ₂ − 2δ +

q

2c ⁰ ₂ (log n)σ ²

|µ _k − µ _l | ₂

− (6c ⁰ ₄ V ² √

log n + σ ² log n

√ m + c ⁰⁰ ₄ V ² √

n + σ ² n

m ) − 7

m γ ¯ _n ² − (6 +

√ n m )(δ ² +

q

c ⁰ ₃ (log n)σ ² δ ² ) i

|B _G

_k

_G

_l

| ₁

− (tr(B ) − K) ₊ [c ⁰⁰ ₄ (V ² √

n + σ ² n) + (δ ² + q

c ⁰ ₃ (log n)σ ² δ ² ) √

n + ¯ γ ² _n ] + b κ(tr(B ) − K) (B.14) Using the assumption (A.1) of Theorem 4 there exist c ⁰ ₂ > 0 such that with probability greater than 1 − c ⁰ ₁ /n: ∀B ∈ C,

hS ₁ + W 1 + W 2 + W 3 + W 4 , B ^∗ − Bi > c ⁰ ₂ ∆ ² (µ) X

16k6=l6K

|B _G

_k

_G

_l

| ₁

− (tr(B ) − K) ₊ [c ⁰⁰ ₄ (V ² √

n + σ ² n) + (δ ² + q

c ⁰ ₃ (log n)σ ² δ ² ) √

n + ¯ γ _n ² ] + κ(tr(B b ) − K) (B.15)

(15)

From here, when tr(B) > K, the left-hand side of (A.2) is sufficient to ensure recovery. When tr(B) = K, we already established that P

16k6=l6K |B _G

_k

_G

_l

| ₁ > 0 for all matrices B ∈ C _K \ {B ^∗ } so (A.1) is sufficient in that case. Lastly note that K − tr(B) 6 _m ¹ P

16k6=l6K |B _G

_k

_G

_l

| ₁ (see [5] eq.

(57) p.21) so the right-hand side of (A.2) is sufficient condition for recovery when tr(B) − K < 0.

This concludes the proof of Theorem 4.

B.2.3 Proof of Lemma 2

( Λ b − Γ) b _ab = hX _a , X _b i − Γ b _ab = hν _a , ν _b i + hν _a , E _b i + hν _b , E _a i + hE _a , E _b i − b Γ _ab (B.16)

= hν _a , ν _b i + hν _a − ν _b , E _b − E a i + hν _a , E a i + hν _b , E _b i + (W 4 + W 5 ) _ab (B.17)

= hν _a , ν b i + hµ _k − µ l , E b − E a i + (W 3 ) ab + hν _a , E a i + hν _b , E b i + (W 4 + W 5 ) ab (B.18)

= −hµ _k , µ l i + hν _a − µ k , ν b − µ l i + hν _a , µ l i + hµ _k , ν b i

+ hµ _k − µ _l , E _b − E _a i + (W ₃ ) _ab + hν _a , E _a i + hν _b , E _b i + (W ₄ + W ₅ ) _ab (B.19)

= −(S ₁ ) _ab − 1

2 (|µ _k | ² ₂ + |µ _l | ² ₂ ) + (W ₁ ) _ab + hν _a , µ _l i + hµ _k , ν _b i

+ hµ _k − µ _l , E _b − E _a i + (W ₃ ) _ab + hν _a , E _a i + hν _b , E _b i + (W ₄ + W ₅ ) _ab (B.20)

= −(S ₁ ) _ab − 1

2 (|µ _k | ² ₂ + |µ _l | ² ₂ ) + (W ₁ ) _ab + hν _a , µ _k i + hµ _l , ν _b i

+ hµ _k − µ l , ν b − ν a + E b − E a i + (W 3 ) ab + hν _a , E a i + hν _b , E b i + (W 4 + W 5 ) ab (B.21)

= −(S ₁ ) _ab − 1

2 (|µ _k | ² ₂ + |µ _l | ² ₂ ) + (W ₁ ) _ab + hν _a , µ _k i + hµ _l , ν _b i

+ 2(S 1 ) ab + (W 2 ) ab + (W 3 ) ab + hν _a , E a i + hν _b , E b i + (W 4 + W 5 ) ab (B.22) Now since (hν _a , µ k i) _(a,b)∈[n]

2

= (hν _a , µ k i) _a∈[n] ×1 ^T _n , (|µ _k | ² ₂ ) _(a,b)∈[n]

²

= (|µ _k | ² ₂ ) a∈[n] ×1 ^T _n , (hν _b , µ l i) _(a,b)∈[n]

2

= 1 n × (hν _b , µ l i) _b∈[n] , (|µ _l | ² ₂ ) _(a,b)∈[n]

²

= 1 n × (|µ _l | ² ₂ ) b∈[n] , (hν _a , E a i) _(a,b)∈[n]

2

= (hν _a , E a i) _a∈[n] × 1 ^T _n , (hν _b , E _b i) _(a,b)∈[n]

2

= 1 n × (hν _b , E _b i) _b∈[n] and since B1 n = B ^∗ 1 n = (1 ^T _n B) ^T = (1 ^T _n B ^∗ ) ^T = 1 n , we have:

hb Λ − b Γ, B ^∗ − B i = hS ₁ + W ₁ + W ₂ + W ₃ + W ₄ + W ₅ , B ^∗ − Bi (B.23)

B.2.4 Proof of Lemma 3: control of |hW ₂ , B ^∗ − Bi|

By definition, (W 2 ) ab = 0 when k = l and (B ^∗ ) ab = 0 when k 6= l so we have hW ₂ , B ^∗ i = 0. Let hA, Bi _G

_k

_G

_l

= P

(a,b)∈G

_k

×G

_l

A _ab B _ab , we have:

hW ₂ , B ^∗ − Bi = −hW ₂ , Bi = − X

16k6=l6K

hW ₂ , Bi _G

_k

_G

_l

6 X

16k6=l6K

|W _2|G

k

G

l

| _∞ |B _G

_k

_G

_l

| ₁ (B.24) Let (a, b) ∈ G k × G l , we look at (W 2 ) ab = hE _b − E a − (ν a − µ k ) + (ν b − µ l ), µ k − µ l i = hE _a − E _b , µ _k − µ _l i + h−(ν _a − µ _k ) + (ν _b − µ _l ), µ _k − µ _l i. The term on the right is a constant offset bounded by 2δ|µ _k − µ _l | ₂ . Let z := µ _k − µ _l , by Lemma 7 hE _a − E _b , zi is a subgaussian variable with variance bounded by (σ ² _k + σ ² _l )|z| ² ₂ therefore its tails are characteristically bounded (see for example [20]), there exist c ∗ > 0 absolute constant such that ∀t > 0:

P

|hE _b − E _a , zi| > |z| ₂ q

σ _k ² + σ _l ² × t

6 e ^1−c

^∗

^t

²

(B.25)

(16)

This implies that ∀t > 0, P h

|(W ₂ ) _ab | > |µ _k − µ _l | ₂ (2δ + q

σ _k ² + σ _l ² × t) i

6 e ^1−c

^∗

^t

²

. We conclude with a union bound over all (a, b) ∈ G k × G l , a union bound over all (k, l) ∈ [K] ² , k 6= l and by taking t = p

(1 + 3 log n)/c ∗ .

B.2.5 Proof of Lemma 5: control of |hW ₄ , B ^∗ − Bi|

Recall (W ₄ ) _ab = hE _a , E _b i − Γ _ab . We will prove Lemma 5 by using the derivation of (B.6) combined with Lemma 1 for control of the operator norm and the following lemma for the remaining part.

Lemma 6. For c ⁰ ₄ > 0 absolute constant, with probability greater than 1 − 1/n:

|B ^∗ W ₄ | _∞ 6 c ⁰ ₄ × (V ² p

log n + σ ² log n)/ √

m. (B.26)

Proof. Let (a, b) ∈ G _k × G _l , we rewrite (B ^∗ W ₄ ) _ab as the sum of the following two terms:

(B ^∗ W ₄ ) _ab = u _b

|G _k | × 1 _k=l + h E e _k , E _b i with

( u b := |E _b | ² ₂ − Γ bb

E e k := _|G ¹

k

|

P

c∈G

_k

,c6=b E c (B.27) The bound for u _b uses Lemma 9: ∀t > 0 P

||E _b | ² ₂ − E |E _b | ² ₂ | > V _l ² √

t + σ ² _l t

6 2e ^−c

^∗

^t so only the scalar product remains to be controlled. Notice that by Lemma 7, p

|G _k | E e _k is a centered subgaussian with variance-bounding matrix Σ = e _|G ¹

k

|

P

c∈G

_k

,c6=b Σ c , therefore |e Σ| _F 6 V _k ² and |e Σ| _op 6 σ _k ² . So using Lemma 9 again we find ∀t > 0:

P h

2| p

|G _k |h E e _k , E _b i| > √

2h Σ, e Σ _b i ^1/2 √

t + |e Σ ^1/2 Σ ^1/2 _b | _op t i

6 2e ^−c

^∗

^t (B.28) Therefore using a union bound, then he Σ, Σ b i ^1/2 6 V _k V _l 6 V ² (Cauchy-Schwarz) and applying another union bound over all (a, b) ∈ [n] ² with t = (log 4 + 3 log n)/c ∗ yields the result.

We are ready to wrap-up the proof. From Lemma 1 applied to W 4 , taking t = (log 2 + n log 9 + log n)/c ∗ there exists c ⁰⁰ ₄ > 0 absolute constant such that we have with probability greater than 1 − 1/n: |W ₄ | _op 6 c ⁰⁰ ₄ (V ² √

n + σ ² n). Now applying Lemma 4 to W 4 :

|hW ₄ , B ^∗ − B i| 6 6|B ^∗ W ₄ | _∞ X

16k6=l6K

|B _G

_k

_G

_l

| ₁ + |W ₄ | _op h X

16k6=l6K

|B _G

_k

_G

_l

| ₁ /m + (tr(B) − K) i (B.29) Therefore combining the lemma with the derivations above and a union bound, we get with prob- ability greater than 1 − 2/n:

|hW ₄ , B ^∗ − Bi| 6 h

6c ⁰ ₄ (V ² p

log n + σ ² log n)/ √

m + c ⁰⁰ ₄ (V ² √

n + σ ² n)/m i X

16k6=l6K

|B _G

_k

_G

_l

| ₁ + (tr(B) − K) + c ⁰⁰ ₄ (V ² √

n + σ ² n) (B.30)

This concludes the proof for Lemma 5.

(17)

B.3 Proof of Proposition 4, Gamma estimator Γ b ^corr

Let a ∈ G _k , b ₁ ∈ G _l

₁

, b ₂ ∈ G _l

₂

, using (2.1) and 2|xy| 6 x ² + y ² we have for a ∈ [n]:

|b Γ aa − Γ aa | = |hX _a − X _b

₁

, X a − X _b

₂

i − Γ aa | 6 U 1 + 3

2 U 2 + 2U 3 + 3U 4 (B.31) where: U 1 := ||E _a | ² ₂ − Γ aa |

U ₂ := |ν _a − ν _b

₁

| ² ₂ + |ν _a − ν _b

₂

| ² ₂ U 3 := sup

(b,c)∈[n]

²

h ν a − ν c

|ν _a − ν _c | ₂ , E _b i ² U 4 := sup

(b,c)∈[n]

²

,b6=c

|hE _b , E c i|

Control of U ₁ = ||E _a | ² ₂ − Γ _aa |: by using the first inequality from Lemma 9 with t = (2 log n + log 2)/c ∗ there exists c ⁰ ₁ > 0 such that with probability greater than 1 − 1/n ² :

U ₁ 6 c ⁰ ₁ × (V _k ² p

log n + σ _k ² log n) (B.32)

Control of U ₃ = sup _(b,c)∈[n]

2

h _|ν ^ν

^a

^−ν

^c

a

−ν

_c

|

₂

, E _b i ² : write z = (ν _a − ν _c )/|ν _a − ν _c | ₂ and Y = Σ ^−1/2 _b E _b ∼ subg(I p ) and A = Σ ^1/2 _b ^T (zz ^T )Σ ^1/2 _b , so that: hz, E _b i ² = E _b ^T zz ^T E b = Y ^T AY . Because |z| ₂ = 1 and zz ^T is symmetric of rank 1 we have |A| _F = |A| _op = tr(A) 6 σ ² therefore we use Lemma 8 with t = (4 log n + log 2)/c ∗ and then a union bound over all (b, c) ∈ [n] ² so that with probability greater than 1 − 1/n ² :

U ₃ 6 c ⁰ ₃ × σ ² log n (B.33)

Control of U ₄ = sup _(b,c)∈[n]

2

,b6=c |hE _b , E _c i|: using the fact that E _b and E _c are independent and the second inequality of Lemma 9 with t = (4 log n + log 2)/c ∗ , a union bound over all (b, c) ∈ [n] ² , there exists c ⁰ ₄ > 0 such that we have with probability greater than 1 − 1/n ² :

U ₄ 6 c ⁰ ₄ × (σ ² log n + V ² p

log n) (B.34)

Control of U ₂ = |ν _a − ν _b

₁

| ² ₂ + |ν _a − ν _b

₂

| ² ₂ : here we use the requirement that all groups are of length at least m > 3, there exist (a 1 , a 2 ) ∈ G _k \ {a}, (c, d) ∈ ([n] \ {a, a ₁ , a 2 }) ² , let Z = (X c − X _d )/|X _c − X _d | ₂ . For a _u ∈ {a ₁ , a ₂ } we have hX _a − X _a

_u

, Z i = hν _a − ν _a

_u

, Z i + hE _a − E _a

_u

, Z i. By independence and Lemma 7, hE _a − E a

u

, Z i is subgaussian with variance bounded by 2σ ² . Therefore using the subgaussian tail bounds of (B.25) and a union bound, there exists c ⁰ ₂ > 0 absolute constant such that with probability over 1 − 1/n ² : V (a, a ₁ ) ∨ V (a, a ₂ ) 6 2δ + c ⁰ ₂ σ √

log n. Hence for b _u ∈ {b ₁ , b ₂ } with probability over 1 − 1/n ² :

|hX _a − X _b

_u

, X _c − X _d i| 6 (2δ + c ⁰ ₂ σ p

log n)|X _c − X _d | ₂ (B.35) Now suppose l ₁ 6= k, choose c ∈ G _k \ {a}, d ∈ G _l

₁

\ {b ₁ }. We have |X _c − X _d | ₂ 6 |µ _k − µ _l

₁

| ₂ + 2δ +

|E _c −E _d | ₂ . We also have hX _a −X _b

₁

, X c − X d i = hν _a − ν b

1

+ E a − E b

1

, ν c − ν d + E c −E _d i = hµ _k − µ l

1

+ δ _ab + E a −E _b

₁

, µ _k −µ _l

₁

+δ _cd +E c −E _d i for δ _ab = (ν a −ν _b

₁

)−(µ _k −µ _l

₁

) and δ _cd = (ν c −ν _d )−(µ _k −µ _l

₁

).

Therefore:

|hX _a − X _b

₁

, X c − X _d i| > |µ _k − µ _l

₁

| ² ₂ /2 − 4δ|µ _k − µ _l

₁

| ₂ (B.36)

− 1

2 h µ _k − µ _l

₁

|µ _k − µ l

1

| ₂ , E _c + E _a − E _d − E _b

₁

i ² − 2 sup

(b,c,d)∈[n]

³

h δ _cd

|δ _cd | ₂ , E _b i ² − 4U ₄ − 12δ ²

> |µ _k − µ l

1

| ² ₂ /2 − 4δ|µ _k − µ l

1

| ₂ − 8U ₃ ⁰ − 2U ₃ ⁰⁰ − 4U 4 − 12δ ² (B.37)

(18)

where U ₃ ⁰ = sup (b,l)∈[n]×[K] h _|µ ^µ

^k

^−µ

^l

k

−µ

_l

|

₂

, E _b i ² , U ₃ ⁰⁰ = sup (b,c,d)∈[n]

³

h _|δ ^δ

^cd

cd

|

₂

, E _b i ² . So combining the last derivations:

|µ _k − µ _l

₁

| ² ₂ /2 − 4δ|µ _k − µ _l

₁

| ₂ 6 (2δ + c ⁰ ₂ σ p

log n)(|µ _k − µ _l

₁

| ₂ + 2δ + |E _c − E _d | ₂ )

+8U ₃ ⁰ + 2U ₃ ⁰⁰ + 4U 4 + 12δ ² (B.38) Notice that U ₃ ⁰ , U ₃ ⁰⁰ can be controlled exactly as U ₃ was, and simultaneously: for c ⁰⁰ ₃ > 0 absolute constant, with probability greater than 1 − 1/n ² : 8U ₃ ⁰ + 2U ₃ ⁰⁰ 6 c ⁰⁰ ₃ σ ² log n.

We now control |E _c − E _d | ₂ : notice that by Lemma 7, E c − E _d is subg(Σ c + Σ _d ). We have E

|E _c − E _d | ² ₂

6 |Σ _c + Σ _d | _∗ 6 2γ ² , |Σ _c + Σ _d | _F 6 2V ² 6 2σγ and |Σ _c + Σ _d | _op 6 2σ ² . There- fore by the first inequality of Lemma 9 with t = (4 log n + log 2)/c ∗ and a union bound over all (c, d) ∈ [n] ² , there exists c ⁰⁰ ₂ > 0 absolute constant such that we have simultaneously with probability greater than 1 − 1/n ² :

sup

(c,d)∈[n]

²

|E _c − E d | ₂ 6 c ⁰⁰ ₂ q

γ ² + σγ p

log n + σ ² log n 6 c ⁰⁰ ₂ (γ + σ p

log n) (B.39)

Therefore with a union bound, with probability greater than 1 − 4/n ² :

|µ _k − µ _l

₁

| ² ₂ /2 − (c ⁰ ₂ σ p

log n + 6δ)|µ _k − µ _l

₁

| ₂ 6 (2δ + c ⁰ ₂ σ p

log n) 2δ + (γ + σ p

log n)(c ⁰⁰ ₂ + c ⁰⁰ ₃ c ⁰ ₂ + 4c ⁰ ₄

c ⁰ ₂ )

+ 12δ ² (B.40)

Hence for c ⁰ ₅ > 0 absolute constant we have with probability greater than 1 − 4/n ² : |µ _k − µ _l

₁

| ² ₂ 6 c ⁰ ₅ (δ + σ √

log n)(δ + σ √

log n + γ). The same control can be derived simultaneously for |µ _k − µ _l

₂

| ² ₂ by replacing d ∈ G l

1

\ {b ₁ } by d ⁰ ∈ G l

2

\ {b ₁ , b 2 }. We conclude that for c ⁰⁰ ₅ > 0 absolute constant, we have with probability greater than 1 − 4/n ² :

U ₂ 6 2|µ _k − µ _l

₁

| ² ₂ + 2|µ _k − µ _l

₂

| ² ₂ + 16δ ² 6 c ⁰⁰ ₅ (δ + σ p

log n)(δ + σ p

log n + γ ) (B.41) Therefore with a union bound over all four terms U 1 , U 2 , U 3 , U 4 and a ∈ [n], for c 6 , c 7 > 0 absolute constants we have with probability greater than 1−c ₆ /n: |b Γ−Γ| _∞ 6 c ₇ (δ+σ √

log n)(δ +σ √

log n+γ).

This concludes the proof of Proposition 4

B.4 Proof of Proposition 2

For this proof we rely heavily on the proof of Theorem 3: let Γ = 0 so that b W 5 = Γ, notice that W ₃ and W ₄ are centered. We take expectation of (B.3), therefore proving hΛ + Γ, B ^∗ − Bi >

Adaptive Clustering through Semidefinite Programming

HAL Id: hal-01524677

https://hal.archives-ouvertes.fr/hal-01524677

Preprint submitted on 18 May 2017

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Distributed under a Creative Commons Attribution - ShareAlike| 4.0 International

Adaptive Clustering through Semidefinite Programming

Martin Royer

To cite this version:

Martin Royer. Adaptive Clustering through Semidefinite Programming. 2017. �hal-01524677�

Adaptive Clustering through Semidefinite Programming

Martin Royer

Laboratoire de Math´ ematiques d’Orsay, Univ. Paris-Sud, CNRS, Universit´ e Paris-Saclay, 91405 Orsay, France.

Abstract

We analyze the clustering problem through a flexible probabilistic model that aims to identify an optimal partition on the sample X 1 , ..., X n . We perform exact clustering with high probability using a convex semidefinite estimator that interprets as a corrected, relaxed version of K-means.

1 Introduction

A famous early approach to clustering is to solve for the geometrical estimator K-means [12, 13, 19].

Several recent developments establish a link between clustering and semidefinite programming.

Peng & Wei [16] show that the K-means objective can be relaxed into a convex, semidefinite

program, leading Mixon et al. [15] to use this relaxation under a subgaussian mixture model to

estimate the cluster centers. Chr´ etien et al. [8] use a slightly different form of a semidefinite

program, inspired by work on community detection by Gu´ edon & Vershynin [11], to recover the

adjacency matrix of the cluster graph with high probability. Lastly in the different context of

variable clustering, Bunea et al. [5] present a semidefinite program with a correction step to

produce non-asymptotic exact recovery results.

In this work, we introduce a semidefinite, penalized estimator for point clustering inspired by [16]

We also show that our results continue to hold without knowledge of the number of groups K.

Lastly we provide evidence of our method’s efficiency from simulated data and suggest a coherent alternative in case our estimator is too costly to compute.

, |M | q ,

2 Probabilistic modeling of point clustering

Consider X 1 , ..., X n and let ν a = E [X a ]. The variable X a can be decomposed into

X a = ν a + E a , a = 1, ..., n, (2.1)

with E a stochastic centered variables in R p .

Definition 1. For K > 1, µ = (µ 1 , ..., µ K ) ∈ ( R p ) K , δ > 0 and G = {G 1 , ..., G K } a partition of [n], we say X 1 , ..., X n are (G, µ, δ)-clustered if ∀k ∈ [K], ∀a ∈ G k , |ν a − µ k | 2 6 δ. We then call

∆(µ) := min

k<l |µ k − µ l | 2 (2.2)

the separation between the cluster means, and

ρ(G, µ, δ) := ∆(µ)/δ (2.3) the discriminating capacity of (G, µ, δ).

Indeed remark that if ρ(G, µ, δ) < 2, the population clusters {ν a } a∈G

, ..., {ν a } a∈G

are not linearly separable, but a high ρ(G, µ, δ) implies that they are well-separated from each other. Furthermore, we have the following result.

Proposition 1. Let (G K ∗ , µ ∗ , δ ∗ ) ∈ arg max ρ(G, µ, δ) for (G, µ, δ) such that X 1 , ..., X n are (G, µ, δ)- clustered, and |G| = K. If ρ(G K ∗ , µ ∗ , δ ∗ ) > 4 then G K ∗ is the unique maximizer of ρ(G, µ, δ).

So G K ∗ is the partition maximizing the discriminating capacity over partitions of size K . Therefore in this work, we will assume that there is a K > 1 such that X 1 , ..., X n is (G, µ, δ)-clustered with

|G| = K and ρ(G, µ, δ) > 4. By Proposition 1, G is then identifiable. It is the partition we aim to recover.

We also assume that X 1 , ..., X n are independent observations with subgaussian behavior. Instead

of the classical isotropic definition of a subgaussian random vector (see for example [20]), we use a

more flexible definition that can account for anisotropy.

Definition 2. Let Y be a random vector in R d , Y has a subgaussian distribution if there exist Σ ∈ R d×d such that ∀x ∈ R d ,

E h

e x

(Y − E Y ) i

6 e x

Σx/2 . (2.4)

We then call Σ a variance-bounding matrix of random vector Y , and write shorthand Y ∼ subg(Σ).

Note that Y ∼ subg(Σ) implies Cov(Y ) 4 Σ in the semidefinite sense of the inequality. To sum-up our modeling assumptions in this work:

Hypothesis 1. Let X 1 , ..., X n be independent, subgaussian, (G, µ, δ)-clustered with ρ(G, µ, δ) > 4.

Remark that the modelization of Hypothesis 1 can be connected to another popular probabilistic model: if we further ask that X 1 , ..., X n are identically-distributed within a group (and hence δ = 0), the model becomes a realization of a mixture model.

3 Exact partition recovery with high probability

Let G = {G 1 , ..., G K } and m := min k∈[K] |G k | denote the minimum cluster size. G can be repre- sented by its caracteristic matrix B ∗ ∈ R n×n defined as ∀k, l ∈ [K] 2 , ∀(a, b) ∈ G k × G l ,

B ab ∗ :=

1/|G k | if k = l

0 otherwise.

In what follows, we will demonstrate the recovery of G through recovering its caracteristic matrix B ∗ . We introduce the sets of square matrices

C K {0,1} := {B ∈ R n×n + : B T = B, tr(B) = K, B1 n = 1 n , B 2 = B} (3.1) C K := {B ∈ R n×n + : B T = B, tr(B) = K, B1 n = 1 n , B < 0} (3.2) C := [

K∈ N

C K . (3.3)

We have: C K {0,1} ⊂ C K ⊂ C and C K is convex. Notice that B ∗ ∈ C K {0,1} . A result by Peng, Wei (2007) [16] shows that the K-means estimator ¯ B can be expressed as

B ¯ = arg max

B∈C

h Λ, Bi b (3.4)

for Λ := (hX b a , X b i) (a,b)∈[n]

∈ R n×n , the observed Gram matrix. Therefore a natural relaxation is to consider the following estimator:

B b := arg max

B∈C

h Λ, Bi. b (3.5)

Notice that E Λ = Λ + Γ for Λ := (hν b a , ν b i) (a,b)∈[n]

∈ R n×n , and Γ := E [hE a , E b i] (a,b)∈[n]

= diag (| Var(E a )| ∗ ) 16a6n ∈ R n×n . The following two results demonstrate that Λ is the signal structure that lead the optimizations of (3.4) and (3.5) to recover B ∗ , whereas Γ is a bias term that can hurt the process of recovery.

Proposition 2. There exist c 0 > 1 absolute constant such that if ρ 2 (G, µ, δ) > c 0 (6 + √

We analyze the clustering problem through a flexible probabilistic model that aims to identify an optimal partition on the sample X ₁ , ..., X _n . We perform exact clustering with high probability using a convex semidefinite estimator that interprets as a corrected, relaxed version of K-means.

, |M | _q ,

with E a stochastic centered variables in R ^p .

Definition 1. For K > 1, µ = (µ ₁ , ..., µ _K ) ∈ ( R ^p ) ^K , δ > 0 and G = {G ₁ , ..., G _K } a partition of [n], we say X ₁ , ..., X _n are (G, µ, δ)-clustered if ∀k ∈ [K], ∀a ∈ G _k , |ν _a − µ _k | ₂ 6 δ. We then call

k<l |µ _k − µ l | ₂ (2.2)

Indeed remark that if ρ(G, µ, δ) < 2, the population clusters {ν _a } _a∈G

, ..., {ν _a } _a∈G

Proposition 1. Let (G _K ^∗ , µ ^∗ , δ ^∗ ) ∈ arg max ρ(G, µ, δ) for (G, µ, δ) such that X ₁ , ..., X _n are (G, µ, δ)- clustered, and |G| = K. If ρ(G _K ^∗ , µ ^∗ , δ ^∗ ) > 4 then G _K ^∗ is the unique maximizer of ρ(G, µ, δ).

So G _K ^∗ is the partition maximizing the discriminating capacity over partitions of size K . Therefore in this work, we will assume that there is a K > 1 such that X ₁ , ..., X _n is (G, µ, δ)-clustered with

We also assume that X ₁ , ..., X _n are independent observations with subgaussian behavior. Instead

Definition 2. Let Y be a random vector in R ^d , Y has a subgaussian distribution if there exist Σ ∈ R ^d×d such that ∀x ∈ R ^d ,

e ^x

^(Y ⁻ ^E ^Y ⁾ i

6 e ^x

^Σx/2 . (2.4)

Let G = {G ₁ , ..., G _K } and m := min _k∈[K] |G _k | denote the minimum cluster size. G can be repre- sented by its caracteristic matrix B ^∗ ∈ R ^n×n defined as ∀k, l ∈ [K] ² , ∀(a, b) ∈ G _k × G _l ,

B _ab ^∗ :=

1/|G _k | if k = l

In what follows, we will demonstrate the recovery of G through recovering its caracteristic matrix B ^∗ . We introduce the sets of square matrices

C _K ^{0,1} := {B ∈ R ^n×n + : B ^T = B, tr(B) = K, B1 n = 1 n , B ² = B} (3.1) C _K := {B ∈ R ^n×n + : B ^T = B, tr(B) = K, B1 n = 1 n , B < 0} (3.2) C := [

C _K . (3.3)

We have: C _K ^{0,1} ⊂ C _K ⊂ C and C _K is convex. Notice that B ^∗ ∈ C _K ^{0,1} . A result by Peng, Wei (2007) [16] shows that the K-means estimator ¯ B can be expressed as

for Λ := (hX b _a , X _b i) _(a,b)∈[n]

∈ R ^n×n , the observed Gram matrix. Therefore a natural relaxation is to consider the following estimator:

Notice that E Λ = Λ + Γ for Λ := (hν b _a , ν b i) _(a,b)∈[n]

∈ R ^n×n , and Γ := E [hE _a , E b i] _(a,b)∈[n]

= diag (| Var(E _a )| _∗ ) _16a6n ∈ R ^n×n . The following two results demonstrate that Λ is the signal structure that lead the optimizations of (3.4) and (3.5) to recover B ^∗ , whereas Γ is a bias term that can hurt the process of recovery.

Proposition 2. There exist c 0 > 1 absolute constant such that if ρ ² (G, µ, δ) > c 0 (6 + √

n/m) and m∆ ² (µ) > 8|Γ| _V , then we have

hΛ + Γ, Bi = B ^∗ = arg max

n/m) ^1/2 instead of simply 1 from Hypothesis 1. The following proposition demonstrates that if the condition on the variation semi- norm of Γ is not met, G may not even be recovered on the population version.

Proposition 3. There exist G, µ, δ and Γ such that ρ ² (G, µ, δ) = +∞ but we have m∆ ² (µ) < 2|Γ| _V and

B ^∗ ∈ / arg max

hΛ + Γ, Bi and B ^∗ ∈ / arg max

The estimator from [5] can be adapted to our context. We introduce the following estimator, for (a, b) ∈ [n] ² let V (a, b) := max (c,d)∈([n]\{a,b})

hX _a − X _b , _|X ^X

^−X

, b 1 := arg min _b∈[n]\{a} V (a, b) and b 2 := arg min _b∈[p]\{a,b

_} V (a, b). Then for a ∈ [n], let

Γ b ^corr := diag hX _a − X b

i _a∈[n]

Computing Γ b ^corr can be interpreted as a correcting term to de-bias Λ as an estimator of Λ. The b result from Proposition 2 demonstrates the interest of studying the following semi-definite estimator of the projection matrix B ^∗ , let

B b ^corr := arg max

σ ² := max

a∈[n] |Σ _a | _op , V ² := max

a∈[n] |Σ _a | _F , γ ² := max

a∈[n] |Σ _a | ∗ . (3.10) We are now ready to introduce this paper’s main result: a condition on the separation between the cluster means sufficient for ensuring recovery of B ^∗ with high probability.

Theorem 1. Assume that m > 2. For c 1 , c 2 > 0 absolute constants, if m∆ ² (µ) > c 2 σ ² (n + m log n) + V ² ( p

log n + δ) + δ ² ( √

, (3.11) then with probability larger than 1 − c 1 /n we have B b ^corr = B ^∗ , and therefore G b ^corr = G.

We call the right hand side of (3.11) the separating rate. Notice that we can read two kinds of requirements coming from the separating rate: requirements on the radius δ, and requirements on σ ² , V ² , γ dependent on the distributions of observations. It appears as if δ + σ √

How optimal is the result from Theorem 1? Notice that our result is adapted to anisotropy in the noise, but to discuss optimality it is easier to look at the isotropic scenario: V ² = √

pσ ² and γ ² = pσ ² . Therefore ∆ ² (µ)/σ ² represents a signal-to-noise ratio. For simplicity let us also assume that all groups have equal size, that is |G ₁ | = ... = |G _K | = m so that n = mK and the sufficient condition (3.11) becomes

∆ ² (µ)

σ ² & K + log n +

∆ ² (µ)

σ ² & K + log n

∆ ² (µ) σ ² &