• Aucun résultat trouvé

Fast and Consistent Algorithm for the Latent Block Model

N/A
N/A
Protected

Academic year: 2021

Partager "Fast and Consistent Algorithm for the Latent Block Model"

Copied!
23
0
0

Texte intégral

(1)

HAL Id: hal-01455682

https://hal.archives-ouvertes.fr/hal-01455682

Preprint submitted on 3 Feb 2017

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Model

Vincent Brault, Antoine Channarond

To cite this version:

Vincent Brault, Antoine Channarond. Fast and Consistent Algorithm for the Latent Block Model.

2016. �hal-01455682�

(2)

arXiv: arXiv:0000.0000

Fast and Consistent Algorithm for the Latent Block Model

Vincent Brault

Univ. Grenoble Alpes, LJK, F-38000 Grenoble, France CNRS, LJK, F-38000 Grenoble, France

e-mail: [email protected]

Antoine Channarond

UMR6085 CNRS, Laboratoire de Mathématiques Raphaël Salem, Université de Rouen Normandie, 76800 Saint-Étienne-du-Rouvray, France

e-mail: [email protected]

Abstract: In this paper, the algorithm Largest Gaps is introduced, for simultaneously clustering both rows and columns of a matrix to form ho- mogeneous blocks. The denition of clustering is model-based: clusters and data are generated under the Latent Block Model. In comparison with al- gorithms designed for this model, the major advantage of the Largest Gaps algorithm is to cluster using only some marginals of the matrix, the size of which is much smaller than the whole matrix. The procedure is linear with respect to the number of entries and thus much faster than the classical algorithms. It simultaneously selects the number of classes as well, and the estimation of the parameters is then made very easily once the classica- tion is obtained. Moreover, the paper proves the procedure to be consistent under the LBM, and it illustrates the statistical performance with some numerical experiments.

MSC 2010 subject classications: Primary 62H30, 62-07.

Keywords and phrases: Latent Block Model, Largest Gaps Algorithm, Model Selection, Data analysis.

Contents

1 Introduction . . . . 2

2 Notations and model . . . . 2

3 Algorithm Largest Gaps . . . . 3

3.1 Concept . . . . 4

3.2 Algorithm . . . . 4

4 Consistency . . . . 7

4.1 Distance on the parameters and the label switching issue . . . . . 7

4.2 Assumptions . . . . 8

4.3 Consistency of the method with xed thresholds . . . . 9

4.4 Main result: consistency of the method . . . . 10

5 Simulations . . . . 10

6 Conclusion . . . . 12

A Main theoretical results . . . . 15

1

(3)

A.1 Proof of Theorem 4.1 . . . . 15

A.2 Proof of Proposition A.1 . . . . 15

A.3 Proof of Proposition A.2 . . . . 18

B Proof of Theorem 4.2: consistency . . . . 20

Acknowledgements . . . . 21

References . . . . 21 1. Introduction

Block clustering methods aim at clustering rows and columns of a matrix si- multaneously to form homogeneous blocks. There are a lot of applications of this method: genomics [8, 9], recommendation system [1, 13], archeology [5] or sociology [7, 11, 14] for example. Among the methods proposed to solve this question, the Latent Block Model or LBM [6] provides a chessboard structure induced by the classication of the rows and the classication of the columns.

In this model, we suppose that a population of n observations described with d binary variables of the same nature is available. Saying that the binary variables are of the same nature means that it is possible to code them in the same (and natural) way. This assumption is needed to ensure that decomposing the dataset in a block structure makes sense.

Given the number of blocks and in order to estimate the parameters, Govaert and Nadif [6] suggest to use a variational algorithm, Keribin et al. [10] propose an adaptation of the Stochastic Expectation Maximisation introduced by Celeux et al. [2] in the mixture case, Keribin et al. [11] studied a bayesian version of these two algorithms and Wyse and Friel [14] propose a bayesian algorithm including the estimation of the number of blocks. However, these algorithms have a complexity in O ndN Block 2 N Algo

with N Block is the maximal supposed number of blocks and N Algo is the number of iterations for each algorithm.

Moreover, the asymptotic behavior of the estimators is not well understood yet (although there exist some results under stronger conditions, see Celisse et al.

[3], Mariadassou and Matias [12]).

In this article, we propose an adaptation of the Largest Gaps algorithm introduced by Channarond et al. [4] in the Stochastic Block M odel with a complexity in O(nd) (Section 3) and prove that the estimators of each parameter are consistent (Section 4) and we illustrate these results on simulated data (Section 5). For ease of reading, the proofs are made available in the appendices.

2. Notations and model

The Latent Block Model (LBM) is as follows. Let x = (x ij ) i=1,...,n;j=1,...,d be the data matrix where x ij ∈ {0, 1} .

It is assumed that there exists a partition into g row clusters

z = (z ik ) i=1,...,n;k=1,...,g and a partition into m column clusters

w = (w j` ) j=1,...,d;`=1,...,m . The z ik s (resp. w j` s ) are binary indicators of row

(4)

i (resp. column j ) belonging to row cluster k (resp. column cluster ` ), such that the random variables x ij are independent conditionally on z and w with parametric density ϕ(x ij ; α k` ) z

ik

w

j`

, where α k` is the parameter of the condi- tional density of the data given z ik = 1 and w j` = 1 . Thus, the density of x conditionally on z and w is

f(x|z, w; α) =

n

Y

i=1 d

Y

j=1 g

Y

k=1 m

Y

`=1

ϕ(x ij ; α k` ) z

ik

w

j`

=: Y

i,j,k,`

ϕ(x ij ; α k` ) z

ik

w

j`

where α = (α k` ) k=1,...,g;`=1,...,m . Moreover, it is assumed that the row and column labels are independent: p(z, w) = p(z)p(w) with p(z) = Q

i,k π z k

ik

and p(w) = Q

j,` ρ w `

j`

, where (π k = P (z ik = 1), k = 1, . . . , g) and (ρ ` = P (w j` = 1), ` = 1, . . . , m) are the mixing proportions. Hence, the density of x is

f (x; θ) = X

(z,w)∈Z×W

p(z; π)p(w; ρ)f (x|z, w; α),

where Z and W denoting the sets of all possible row labels z and column labels w , and θ = (π, ρ, α) , with π = (π 1 , . . . , π g ) and ρ = (ρ 1 , . . . , ρ m ) . The density of x can be written as

f (x; θ) = X

z,w

Y

i,k

π z k

ik

Y

j,`

ρ w `

j`

Y

i,j,k,`

ϕ(x ij ; α k` ) z

ik

w

j`

(2.1)

= X

z,w

Y

k

π z k

+,k

Y

`

ρ w `

+,`

Y

i,j,k,`

ϕ(x ij ; α k` ) z

ik

w

j`

where z +,k = P n

i=1 z ik ( resp. w +,` = P d

j=1 w j` ) represent the number of rows (resp. columns) in the class k (resp. ` ).

The LBM involves a double missing data structure, namely z and w , which makes the statistical inference more dicult than for standard mixture models.

Finally, as we study the binary case, we have ϕ(x ij ; α) = x α ij (1 − x ij ) 1−α .

To estimate the parameters, many algorithms exist (for example [6], [11] or [14]) but these algorithms have a complexity larger than O (ndgmN algo ) where N algo is the number of iterations associated to each algorithm. This makes their use on large matrices dicult.

In the Stochastic Block Model (SBM), rows and columns are associated with the same individuals, which allows to represent a graph, whereas LBM allows to represent digraphs. Channarond et al. [4] suggested a fast algorithm, called LG , based on a marginal of the matrix x , the degrees.

3. Algorithm Largest Gaps

Before the introduction of the algorithm Largest Gaps ( LG ), let us recall the

concept.

(5)

3.1. Concept

Assume that the class of the row i is known (for example, k ). In this case, we have for every j ∈ {1, . . . , d}

P (X ij = 1|z ik = 1) =

m

X

`=1

P (X ij = 1|z ik = 1, w j` = 1) P (z ik = 1|w j` = 1)

=

m

X

`=1

α k` π k =: τ k . (3.1)

This equation implies that the sum of the cells of row i , denoted by X i,+ , is binomially distributed Bin (d, τ k ) conditionally on z ik = 1 . Therefore by condi- tional independences, the distribution of X i,+ is a mixture of binomial distribu- tions. It appears that the mixture can be identied if and only if the components of the vector τ = (τ 1 , . . . , τ g ) are distinct. Under this assumption, variables X i,+

fastly concentrate around the mean associated with their class, and asymptoti- cally form groups separated by large gaps. The idea consists in identifying those large gaps and thus the classes.

In their article, Channarond et al. [4] assume that the number Q of classes is known and partition the population into Q clusters by nding the Q − 1 largest gaps. In order to choose Q , a model selection procedure could be made separately and before the classication. Here our alternative algorithm directly yields both the clusters and the numbers of classes. Instead of selecting the g −1 (resp. m − 1 ) largest gaps for some g (resp. m ), it selects the gaps larger than a properly chosen threshold the paper provides.

On the middle right picture of Figure 1, an example of histogram of X i,+ for a simulated matrix is displayed; the ve classes can be clearly seen. The middle left picture of Figure 1 display the corresponding values sorted in ascending order and the bottom left picture of Figure 1, the jumps between all successive sorted values.

3.2. Algorithm

The algorithm Largest Gaps is given in Table 1 and a illustration is provided in Figure 1. In the sequel, the estimators provided by the algorithm are denoted by b z , w b and θ b .

Estimator of θ . In the algorithm 1, the estimator b θ of θ ? is based on b z and w b . c π k (resp. ρ b ` ) is the proportion of class k (resp. ` ) in the partition b z (resp.

w b ). And the estimator of α b is for all (k, `) ∈ {1, . . . , b g } × {1, . . . , m} b : d α k` =

P n i=1

P d

j=1 z c ik w d j` x ij z d +k w d +`

.

(6)

Input: data matrix x , threshold for row S

g

and for column S

m

. // Computation of jumps

for i ∈ {1, . . . , n} do

Computation of X

i

=

xi+d

.

// O(nd) Ascending sort of X

(1)

, . . . , X

(n)

. // O(n log n)

for i ∈ {2, . . . , n} do

Computation of the jumps G

i

= X

(i)

− X

(i−1)

.

// O(n) // Computation of b g

Selection of i

1

< . . . < i

gb−1

such that (G

i1

, . . . , G

i

bg−1

) are every greater than S

g

. // O(n)

// Computation of b z for i ∈ {(1), . . . , (n)} do

Denition of z b

(i)k

= 1 if and only if (i

k−1

) < (i) ≤ (i

k

) with i

0

= 0 and i

gb

= n . // O(n) // Computation of m b and w b

Do the same on the columns. // O(dn + dlog d)

// Computation of b θ for k ∈ {1, . . . , b g} do

Computation of c π

k

=

zb+kn

.

// O(b g n) for ` ∈ {1, . . . , m} b do

Computation of ρ b

`

=

wb+`d

.

// O( md) b Computation of α b = ( b z)

T

x w/ b

π c

k

( ρ b

`

)

T

× nd . // O (nd [ b g + m]) b Output: Numbers of classes b g and m b , matrices b z and w b and parameter b θ .

Algorithm 1: Algorithm Largest Gaps .

(7)

Figure 1. Top-left: Initial matrix. Top-right: Example of a vector X

(1)

, . . . , X

(d)

. Middle-left: representation of the vector X

(1)

, . . . , X

(d)

sorted in increasing order. Middle- right: Histograms of X

(1)

, . . . , X

(d)

. Bottom-left: representation of the vector of jumps (G

2

, . . . , G

d

) where for all j ∈ {2, . . . , d} , G

j

= X

(j)

− X

(j−1)

. Bottom-right: reorganized matrix.

Remark 3.1. Complexity of the algorithm

As we will see in the section 4, log n is required to be much smaller than d and log d much smaller than n . In this case, the complexity is O (nd [ g b + m]) b . Moreover, we know that P n

i=2 G i = 1 and for all k ∈ {1, . . . , g b − 1} , G i

k

> S g

then, in the worst case, we have b g < 1/S g + 1 .

(8)

Conclusion, the complexity is O (nd [1/S g + 1/S m ]) and, if only the classication is wanted, the complexity is O (nd) .

4. Consistency

This section presents the main result (Theorem 4.2), that is the consistency of the method. Before stating this theorem, some notations are introduced, in particular related to the label switching problem, and assumptions are done on the model parameters and on the algorithm thresholds (S g , S m ) , in order to ensure consistency of the method.

4.1. Distance on the parameters and the label switching issue For any two parameters θ = (π, ρ, α) with (g, m) classes and θ

0

= (π

0

, ρ

0

, α

0

) , with (g

0

, m

0

) classes, we dene their distance as follows:

d

(y, y

0

) =

max {kπ − π

0

k

, kρ − ρ

0

k

, kα − α

0

k

} if g = g

0

, m = m

0

+∞ otherwise,

where k·k

denotes the norm dened for any y ∈ R g by kyk

= max 1≤k≤g |y k | . We assume that two matrices z, z

0

∈ M n×g ({0, 1}) are equivalent, denoted z ≡

Z

z

0

, if there exists a permutation s ∈ S ({1, . . . , g}) such that for all (i, k) ∈ {1, . . . , n} × {1, . . . , g} , z i,s(k) = z ik . By convention, we assume that two matrices with dierent numbers of columns are not equivalent. We introduce the similar notation ≡

W

for the matrix w .

For all parameter θ = (π, ρ, α) with (g, m) classes and for all permutions (s, t) ∈ S ({1, . . . , g}) × S ({1, . . . , m}) , we denote θ s,t = (π s , ρ t , α s,t ) , by:

π s = π s(1) , . . . , π s(g)

, ρ t = ρ t(1) , . . . , ρ t(m)

and α s,t = α s(1),t(1) , α s(1),t(2) , . . . , α s(1),t(m) , α s(2),t(1) , . . . , α s(g),t(m) .

As classes are dened up to a permutation (known as label switching issue), the distance between two parameters must be calculated after permuting their coordinates, from the actual label allocation done by the classication algorithm to the original label allocation of the model. Moreover such a permutation exists and is unique when the classication is right, that is, when b z ≡

Z

z ? (respec- tively w b ≡

W

w ? ). This permutation will be thus denoted by s

Z

(resp. t

W

) on the event { b z ≡

Z

z ? } (resp. { w b ≡

W

w ? } ). Thus the consistency of the pa- rameter estimators amounts to proving that the following quantity vanishes in probability when (n, d) tends to innity:

d

b θ s

Z

,t

W

, θ ? .

Outside of the event { b z ≡

Z

z ? } (resp. w b ≡

W

w ? ), s

Z

(resp. t

W

) will be de-

ned as any arbitrary permutation in S ({1, . . . , b g }) (resp. S ({1, . . . , m}) b ), the

identity for instance.

(9)

4.2. Assumptions Assumptions on the model Notation 4.1. Key parameters

Let us dene π min and ρ min the minimal probabilities of being member of a class:

π min = min

1≤k≤g

?

π k ? and ρ min = min

1≤`≤m

?

ρ ? ` .

and the minimal distance between any two conditional expectations of the nor- malized degrees:

δ

π

= min

1≤k6=k

0≤g?

k ? − τ k ?

0

| and δ

ρ

= min

1≤`6=`

0≤m?

? ` − ξ ` ?

0

|

where τ ? = α ? ρ ? and ξ ? = π ?T α ? are the proportions of the binomial distribu- tions dened in Equation (3.1).

Some assumptions on the model are needed to obtain the consistency:

Assumption M.1 Each row class (respectively column class) has a positive probability to have at least one member:

π min > 0 and ρ min > 0. (M.1) Assumption M.2 Conditional expected degrees are all distinct:

δ

π

> 0 and δ

ρ

> 0. (M.2) The rst assumption is classical in mixture models: proportions of all classes are positive. Otherwise, classes with proportion zero would be actually nonexis- tent. The second one is more original: it ensures the separability of the classes in the degree distribution. Otherwise, the conditional distributions of the degrees of at least two classes would be equal and these classes would be concentrated around the same expected value. Note that the set of parameters such that two conditional expected degrees are equal has zero-measure. These two assumptions are another formulation of the sucient conditions of Keribin et al. [11].

Assumptions on the algorithm

The algorithm has two threshold parameters, (S g , S m ) which must be properly chosen to obtain consistency. Two assumption sets will be considered in this paragraph: both parameters and thresholds xed (Assumption (AL.1)) or van- ishing thresholds and xed parameters (Assumption (AL.2)). They both ensure consistency but play distinct roles.

Assumption AL.1

(S g , S m ) xed and S g ∈]0, δ

π

[ and S m ∈]0, δ

ρ

[. (AL.1)

(10)

Assumption AL.2 S n,d g −→

n,d→+∞ 0, S m n,d −→

n,d→+∞ 0, lim

n,d→+∞

S n,d g q n

log d > √

2 and lim

n,d→+∞

S m n,d q

d log n > √

2. (AL.2)

The rst one is only theoretical: in practice, it cannot be checked that it is satised because it would require unknown key parameters of the model δ

π

and δ

ρ

. This assumption is used essentially to establish intermediate results like non- asymptotic bounds (Proposition A.1 and Theorem 4.1). On the contrary, the second one is designed for practical cases (Theorem 4.2). Instead of being xed, thresholds are assumed to be vanishing, in order to be small enough asymptot- ically. More precisely, the assumption provides the admissible convergence rate of the thresholds to guarantee consistency.

Assumptions on admissible convergence rates when parameters vary

Finally, we also consider varying model parameters, and provide admissible con- vergence rates in this case for both parameters and thresholds. It thus tells how robust the consistency is. For example, δ

π

and δ

ρ

are allowed to vanish when (n, d) tends to innity, which makes the classication even harder. Assumption (MA) gives a range of convergence rates such that the classication is neverthe- less consistent (stated in Theorem 4.2).

Assumption MA.

Condition on δ

π

n,d (resp. δ

ρ

n,d ):

lim

n,d→+∞

δ

n,dπ

S

gn,d

> 2, and lim

n,d→+∞

δ

n,dρ

S

n,dm

> 2.

Conditions on g ?n,d , π n,d min , m ?n,d and ρ n,d min :

π n,d min ρ n,d min 2

min(n, d) −→

n,d→+∞ +∞ and lim

n,d→+∞

( π

minn,d

ρ

n,dmin

)

2

min(n,d)

log(g

? n,d

m

? n,d

) > 1.

(MA)

4.3. Consistency of the method with xed thresholds

This paragraph presents the main theoretical result: a non-asymptotic upper bound when thresholds (S g , S m ) are xed (Assumption (AL.1)), which directly implies the strong consistency of the method in that case.

Theorem 4.1. Concentration inequality

Under Assumption (AL.1), we have for all t > 0 : P

g b 6= g ? or m b 6= m ? or b z 6≡

Z

z ? or w b 6≡

W

w ? or d

b θ s

Z

,t

W

, θ ?

> t

(11)

≤ 4n exp

− d

2 min(δ

π

− S g , S g ) 2

+ 2g ? (1 − π min ) n +4d exp

− n

2 min(δ

ρ

− S m , S m ) 2

+ 2m ? (1 − ρ min ) d +2g ? m ?

e

−πmin

ρ

min

ndt

2

+ 2e

minρmin )

2n

8

+ 2e

minρmin )

2d 8

+2g ? e

−2nt2

+ 2m ? e

−2dt2

.

The proof (in Appendix A.1) is made in two steps, emphasizing the original- ity of the method in comparison with EM-like algorithms: here the classication is completely done rst, and parameters are then estimated afterwards. Thus an upper bound on classications and selection of class numbers will be rst established (Proposition A.1), and secondly an upper bound on the parameter estimators, given that both classications and class numbers are right (Propo- sition A.2).

4.4. Main result: consistency of the method

Theorem 4.1 cannot be used in practice: since δ

π

and δ

ρ

are unknown, the thresholds (S g , S m ) cannot be chosen properly. Theorem 4.2 provides a proce- dure to choose the thresholds as functions of (n, d) only. Two assumption sets are proposed: in the rst one, model parameters are xed, and in the second one, they are allowed to vary with respect to (n, d) in the manner described in Assumption (MA). See Subsection 4.2 for further comments and details.

Theorem 4.2. Consistency of the method Under these assumption sets:

• θ is xed with respect to (n, d) and (M.1), (M.2), (AL.2);

• θ depends on (n, d) and (M.1), (M.2), (AL.2) and (MA);

classications, model selection and estimators are consistent, that is, for all t > 0 :

P

b g 6= g ? or m b 6= m ? or b z 6≡

Z

z ? or w b 6≡

W

w ? or d

b θ s

Z

,t

W

, θ ?

> t

n,d→+∞ −→ 0.

Remark 4.1. The assumption (AL.2) of the theorem implies that n/ log d and d/ log n tend to +∞ . Therefore, x is allowed to have an oblong shape.

The proof is available in Appendix B.

5. Simulations

We use an experimental design to illustrate the results of Theorem 4.2. As the

number of row classes (resp. column classes) is the basis of the other estimations,

(12)

this is the only parameter studied in this section. The experimental design is dened with g ? = 5 and m ? = 4 and the following parameters

α ? =

ε ε ε ε

1 − ε ε ε ε

1 − ε 1 − ε ε ε 1 − ε 1 − ε 1 − ε ε 1 − ε 1 − ε 1 − ε 1 − ε

with ε ∈ {0.05, 0.1, 0.15, 0.2, 0.25} . For the class proportions, we suppose two possibilities

• Balanced proportions:

π ? =

 0.2 0.2 0.2 0.2 0.2

and ρ ? =

 0.25 0.25 0.25 0.25

 with the following parameters

π min = 0.2 and δ

π

= 0.25 − 0.5ε.

• Arithmetic proportions:

π ? =

 0.1 0.15

0.2 0.25

0.3

and ρ ? =

 0.1 0.2 0.3 0.4

 with the following parameters

π min = 0.1 and δ

π

= 0.1 − 0.2ε.

The number of rows n and the number of columns d uctuate between 20 and 4000 by step 20 and for each conguration, 1000 matrices were simulated. For the choice of the thresholds, we studied four cases:

1. Constant threshold: S 1 = δ

π

/2 . 2. Lower limit threshold: S 2 n,d = p

2 log n/d + 10

−10

. 3. Middle limit threshold: S 3 n,d = 2 p

2 log n/d . 4. Upper limit threshold: S 4 n,d = (log n/d) 1/4 .

Figures 2 and 3 display the proportions of true estimations of g ? following the parameter ε , the number of rows n , the numbers of columns d and the thresholds used. It appears that the best threshold is S 1 = δ

π

/2 but this threshold can not be used in practice because of δ

π

is unknown. For the scalable thresholds, S 2 n,d = p

2 log n/d + 10

−10

is the best.

(13)

We can see that the larger the number of rows n , the worse the estimation and the larger the number of columns d , the better the estimation. In the case of n = d (case of Channarond et al. [4]), the quality of the estimation increases with n . π min has a weak eect because it is rare to have an empty class but the eect of δ

π

is greater.

6. Conclusion

The Largest Gaps algorithm gives a consistent estimation of each parameter of the Latent Block Model with a complexity much lower than the other existing algorithms. Moreover, it appears that the substantial part of the complexity is the computation of the vector (X (1) , . . . , X (n) ) .

However, it appears in the simulations that the estimation of the number of

classes is underestimated and it would be interesting to estimate the class in

row with a mixture model on the variables (X (1)

, . . . , X (n)

) ; this will be the

subject of a future work. The tricky part will be to deal with the dependences

between these variables.

(14)

π

min

= 0 . 2

S

1

S

n,d2

S

3n,d

S

4n,d

δ

π

= 0 . 225 δ

π

= 0 . 2 δ

π

= 0 . 175 δ

π

= 0 . 15 δ

π

= 0 . 125

Figure 2. Proportions of true estimations of g

?

following the parameter ε (rows) and the

thresholds used (columns) for the balanced case: for each graphic, the number of rows n and

the number of columns d uctuate between 20 and 4000 by step 20.

(15)

π

min

= 0 . 1

S

1

S

n,d2

S

3n,d

S

4n,d

δ

π

= 0 . 09 δ

π

= 0 . 08 δ

π

= 0 . 07 δ

π

= 0 . 06 δ

π

= 0 . 05

Figure 3. Proportions of true estimations of g

?

following the parameter ε (rows) and the

thresholds used (columns) for the arithmetic case: for each graphic, the number of rows n and

the number of columns d uctuate between 20 and 4000 by step 20.

(16)

Appendix A: Main theoretical results A.1. Proof of Theorem 4.1

First of all, note that { b z ≡

Z

z ? } ⊂ { b g = g ? } and { w b ≡

W

w ? } ⊂ { m b = m ? } , hence :

P

b g 6= g ? or m b 6= m ? or b z 6≡

Z

z ? or w b 6≡

W

w ? or d

b θ s

Z

,t

W

, θ ?

> t

= P

b z 6≡

Z

z ? or w b 6≡

W

w ? or d

b θ s

Z

,t

W

, θ ?

> t

= P ( b z 6≡

Z

z ? or w b 6≡

W

w ? ) + P

n d

θ b s

Z

,t

W

, θ ?

> t o

\ { b z 6≡

Z

z ? or w b 6≡

W

w ? }

= P ( b z 6≡

Z

z ? or w b 6≡

W

w ? ) + P

d

b θ s

Z

,t

W

, θ ?

> t, b z ≡

Z

z ? , w b ≡

W

w ?

≤ P ( b z 6≡

Z

z ? ) + P ( w b 6≡

W

w ? ) + P

d

b θ s

Z

,t

W

, θ ?

> t, b z ≡

Z

z ? , w b ≡

W

w ?

To complete the proof, we then need to bound from above the terms of this inequality. The two rst terms are bounded using Proposition A.1, proved in Appendix A.2, and the last term is bounded with Proposition A.2, proved in Appendix A.3.

Proposition A.1. Under Assumptions (M.1), (M.2) and (AL.1):

P ( b g 6= g ? or b z 6≡

Z

z ? ) ≤ 2n exp

− d

2 min(δ

π

− S g , S g ) 2

+ g ? (1 − π min ) n .

P ( m b 6= m ? or w b 6≡

W

w ? ) ≤ 2d exp

− n

2 min(δ

ρ

− S m , S m ) 2

+m ? (1 − ρ min ) d . Proposition A.2. For all t > 0 , we have:

P

d

b θ s

Z

,t

W

, θ ?

> t, b z ≡

Z

z ? , w b ≡

W

w ?

≤ 2g ? m ?

e

−πmin

ρ

min

ndt

2

+ 2e

minρmin )

2n

8

+ 2e

minρmin )

2d 8

+2g ? e

−2nt2

+ 2m ? e

−2dt2

A.2. Proof of Proposition A.1

Let us rst dene the following events.

(17)

• There is at least one individual in each row class, denoted by A g

?

=

g

?

\

k=1

z ? +k 6= 0 .

• Denoting D the maximal distance between X i and the center of the class of row i :

D = max

1≤k≤g

?

sup

1≤i≤n withz?i,k=1

X i − τ k

, we also dene:

A S

g

= {2D < S g < δ

π

− 2D} and A id = A g

?

∩ A S

g

.

Then Proposition A.1 will be a consequence of the two following lemmas:

Lemma A.1.

A id ⊂ { b g = g ? } ∩ { b z ≡

Z

z ? } Lemma A.2.

P A id

≤ 2n exp

− d

2 min(δ

π

− S g , S g ) 2

+ g ? (1 − π min ) n

Lemma A.1 tells that whenever the event A id is satised, then both true number of row classes and their true classication are obtained. Lemma A.2 provides an upper bound of P A id

. From these lemmas, it is directly deduced that:

P ({ b g 6= g ? } ∪ { b z 6≡

Z

z ? }) ≤ P A id

≤ 2n exp

− d

2 min(δ

π

− S g , S g ) 2

+ g ? (1 − π min ) n ,

which is Proposition A.1. Now, let us move on to the proofs of the lemmas.

Proof of Lemma A.1 On the event A S

g

, for any two rows i 6= i

0

∈ {1, . . . , n} , we have two possibilities:

• Either the rows i and i

0

are in the same class k , and then on A S

g

, we have:

X i

− X i

0

X i

− τ k +

X i

0

− τ k

≤ 2D < S g .

• Or row i is in the class k and row i

0

in the class k

0

6= k , and on the event A S

g

, we have:

X i − X i

0

=

X i − τ k

0

− X i

0

− τ k

0

X i

− τ k

0

X i

0

− τ k

0

(18)

X i

− τ k

0

− D

≥ |τ k − τ k

0

| −

X i − τ k

− D

≥ δ

π

− 2D

> S g .

Therefore, G i = X (i)

− X (i−1)

is less than S g if and only if both rows (i − 1) and (i) are in the same class. On A S

g

, the algorithm hence nds the true clas- sication. Moreover, on A g

?

, there is at least one row in each class, then the algorithm nds the true number of classes. As a conclusion, on A id , both b g = g ? and b z ≡

Z

z ? are satised.

Proof of Lemma A.2 Using an union bound, we rst obtain:

P A id

≤ P A g

?

+ P A S

g

Now we bound from above each of these terms. Again with an union bound:

P A g

?

= P

g

?

[

k=1

z +k ? 6= 0

g

?

X

k=1

P

z ? +k 6= 0

=

g

?

X

k=1

P z +k ? = 0

=

g

?

X

k=1 n

Y

i=1

P z i,k ? = 0

=

g

?

X

k=1 n

Y

i=1

(1 − π k )

g

?

X

k=1 n

Y

i=1

(1 − π min )

≤ g ? (1 − π min ) n ,

which gives the upper bound of the rst term. Secondly:

A S

g

= {2D < S g < δ

π

−2D} = {2D < S g , 2D < δ

π

−S g } =

D < 1

2 min(δ

π

− S g , S g )

. Denoting t = min(δ

π

− S g , S g ) ,

P A S

g

= P

D ≥ t 2

= E

P

D ≥ t 2

z ?

= E

 P

g

?

[

k=1

[

i|z

ik

=1

X i − τ k

≥ t

2

z ?

≤ E

g

?

X

k=1

X

i|z

ik

=1

P

X i − τ k

≥ t

2

z ?

 .

(19)

Moreover for all i ∈ {1, . . . , n} , given z i,k ? = 1 , X i,+ has a binomial distri- bution Bin (d, τ k ) . The concentration properties of this distribution are then exploited through the Hoeding inequality:

P

X i

− τ k ≥ t

2

z ?

= P

|X i,+ − dτ k | ≥ dt 2

z ?

≤ 2e

12

dt

2

. And as a conclusion, the bound of the second term is:

P A S

g

≤ E

g

?

X

k=1

X

i|z

ik

=1

2e

12

dt

2

 = 2ne

12

dt

2

.

A.3. Proof of Proposition A.2

The proof consists in obtaining three bounds: one for each parameter. The in- equalities on π and ρ are an application of the Hoeding inequality and are similar to Channarond et al. [4] for the row class proportions. To obtain the inequality for α , it is necessary to study the conditional probability, given the true partition (z ? , w ? ) . Apart from the problem of two asymptotic behaviors, the proof is similar to Channarond et al. [4].

In the sequel, and for ease of reading, we remove the superscripts s

Z

and t

W

. Therefore, for all t > 0 :

P

d

θ, b θ ?

> t, b z ≡

Z

z ? , w b ≡

W

w ?

= P (max (k π b − π ? k

, k b ρ − ρ ? k

, k α b − α ? k

) > t, b z ≡

Z

z ? , w b ≡

W

w ? )

≤ P (k π b − π ? k

> t, b z ≡

Z

z ? , w b ≡

W

w ? ) + P (k ρ b − ρ ? k

> t, b z ≡

Z

z ? , w b ≡

W

w ? ) + P (k α b − α ? k

> t, b z ≡

Z

z ? , w b ≡

W

w ? )

g

?

X

k=1

P (| c π k − π k ? | > t, b z ≡

Z

z ? , w b ≡

W

w ? )

+

m

?

X

`=1

P (| ρ b ` − ρ ? ` | > t, b z ≡

Z

z ? , w b ≡

W

w ? )

+

g

?

X

k=1 m

?

X

`=1

P (| d α k` − α ? k` | > t, b z ≡

Z

z ? , w b ≡

W

w ? ) .

The upper bounds of the rst and second terms are the same as Channarond et al. [4]; only the last term is dierent. For d α k` , rst note that when b z ≡

Z

z ? and w b ≡

W

w ?

α d k` = g α k` = 1 z ? +k w ? +`

X

(i,j) | z

?i,k

w

?j,`

=1

X ij

(20)

and given (z ? , w ? ) , the Hoeding inequality gives for all t > 0 :

P (| d α k` − α ? k` | > t, b z ≡

Z

z ? , w b ≡

W

w ? ) = P (| g α k` − α ? k` | > t, b z ≡

Z

z ? , w b ≡

W

w ? )

≤ P (| g α k` − α ? k` | > t)

≤ E [ P (| g α k` − α ? k` | > t| z ? , w ? )]

≤ E

h 2e

−2z+k?

w

+`?

t

2

i .

For every sequence r n,d > 0 , we have:

E h

2e

−2z?+k

w

+`?

t

2

i

= E h

2e

−2z?+k

w

?+`

t

2

1 {| z

+k?

w

?+`−πk?

ρ

?`

nd |

≤rn,d

} +2 e

−2z?+k

w

+`?

t

2

| {z }

≤1

1 {| z

+k?

w

?+`−πk?

ρ

?`

nd | >r

n,d

} i

≤ E h

2e

−2z?+k

w

?+`

t

2

1 {

−rn,d≤z?+k

w

+`? −π?k

ρ

?`

nd≤r

n,d

} i +2 P

z ? +k w ? +` − π k ? ρ ? ` nd

> r n,d

≤ E h

2e

−2t2

k?

ρ

?`

nd−r

n,d

) i + 2 P

z ? +k w ? +`

nd − π k ? ρ ? `

> r n,d

nd

≤ 2e

−2t

2

r

n,d

π

minρminnd rn,d −1

+ 2 P

z +k ? w +` ?

nd − π ? k ρ ? `

> r n,d

nd

.

For the second term, a new decomposition is necessary:

P

z ? +k w ? +`

nd − π ? k ρ ? `

> r n,d

nd

= P

z ? +k n − π ? k

w +` ?

d +

w ? +`

d − ρ ? `

π k ?

> r n,d

nd

≤ P

z ? +k n − π ? k

w +` ? d > r n,d

2nd

+ P

w +` ? d − ρ ? `

π k ? > r n,d

2nd

≤ P

z ? +k n − π ? k

> r n,d

2nd

+ P

w ? +`

d − ρ ? `

> r n,d

2nd

≤ 2 exp

−2n r n,d

2nd 2

+ 2 exp

−2d r n,d

2nd 2

≤ 2 exp

"

− r 2 n,d 2nd 2

#

+ 2 exp

"

− r 2 n,d 2n 2 d

# .

Finally, for every sequence r n,d > 0 , we have:

P (| g α k` − α ? k` | > t) ≤ 2e

−2t

2

r

n,dπ

minρminnd rn,d −1

+ 4e

r2 n,d 2nd2

+ 4e

r2 n,d 2n2d

.

(21)

As we want the bound to tend to 0 when n and d tend to innity, we have the following condition:

lim

n,d→+∞

π min ρ min nd r n,d

> 1, r 2 n,d

nd 2 −→

n,d→+∞ +∞ and r 2 n,d n 2 d −→

n,d→+∞ +∞.

For example, we can take

r n,d = π min ρ min nd

2 .

Remark A.1. In fact, every sequence r n,d = Cπ min ρ min nd with C ∈]0, 1[ can be used and the other results remain equally true but the optimal constant C has not a closed form ; to do this we take C = 1/2 . However, we see that for each C > 0 ,

2e

−2t

2

min

ρ

min

nd

π

minρminnd rn,d −1

= 2e

−2t2

π

min

ρ

min

nd(1−C)

= o

2e

−(Cπmin

ρ

min

)

2

n + 2e

−(Cπmin

ρ

min

)

2

d ,

the strongest term is 2e

−(Cπmin

ρ

min

)

2

n + 2e

−(Cπmin

ρ

min

)

2

d . Therefore, the optimal constant C n,d tends to 1 with n and d .

Appendix B: Proof of Theorem 4.2: consistency

The proof is based on Theorem 4.1, as n → +∞ and d → +∞ and by the Assumption (M.1), we have on the one hand

g ? (1 − π min ) n + m ? (1 − ρ min ) d −→

n,d→+∞ 0

and on the other hand g ? m ? h

e

−πmin

ρ

min

ndt

2

+ 2e

18

min

ρ

min

)

2

n + 2e

18

min

ρ

min

)

2

d i

−→

n,d→+∞ 0, g ? e

−2nt2

+ m ? e

−2dt2

−→

n,d→+∞ 0.

By the assumption (M.2), we also have:

ne

18

π2

+ de

18

2ρ

−→

n,d→+∞ 0.

For the last terms, we use Assumption (AL.2): there exists a positive constant C > √

2 such that for n and d large enough S g n,d

s d

log n > C = ⇒ S g n,d

√ 2 s

d log n > C

√ 2 > 1

(22)

ne

−d

Sn,d g 2

2

= exp

"

log n − d S n,d g 2 2

#

= exp

log n

1 −

s d log n

S g n,d

√ 2

! 2

≤ exp

 log n

1 − C

√ 2

| {z }

<0

−→

n,d→+∞ 0.

With the same reasoning and by the remark 4.1, we obtain ne

−d

(

δπ−Sgn,d

)

2

2

−→

n,d→+∞ 0.

That concludes the proof.

Acknowledgements

Thanks to Stéphane Robin for his suggestions.

References

[1] J. Bennett and S. Lanning. The netix prize. In Proceedings of KDD cup and workshop, volume 2007, page 35, 2007.

[2] G. Celeux, D. Chauveau, and J. Diebolt. On Stochastic Versions of the EM Algorithm. Rapport de recherche RR-2514, INRIA, 1995. URL http:

//hal.inria.fr/inria-00074164.

[3] A. Celisse, J.-J. Daudin, L. Pierre, et al. Consistency of maximum- likelihood and variational estimators in the stochastic block model. Elec- tronic Journal of Statistics, 6:18471899, 2012.

[4] A. Channarond, J.-J. Daudin, S. Robin, et al. Classication and estimation in the stochastic blockmodel based on the empirical degrees. Electronic Journal of Statistics, 6:25742601, 2012.

[5] G. Govaert. Classication croisée. Thèse d'état, Université Pierre et Marie Curie, 1983.

[6] G. Govaert and M. Nadif. Clustering with block mixture models. Pattern Recognition, 36:463473, 2003.

[7] J. A. Hartigan. Clustering Algorithms. John Wiley & Sons, Inc., New York,

NY, USA, 99th edition, 1975. ISBN 047135645X.

(23)

[8] I. Hedenfalk, D. Duggan, Y. Chen, M. Radmacher, M. Bitter, R. Simon, P. Meltzer, B. Gusterson, M. Esteller, and M. Raeld. Gene-expression proles in hereditary breast cancer. New Eng. J. Med., 344:539548, 2001.

[9] M. Jagalur, C. Pal, E. Learned-Miller, R. T. Zoeller, and D. Kulp. Ana- lyzing in situ gene expression in the mouse brain with image registration, feature extraction and block clustering. BMC Bioinformatics, 8(Suppl 10):

S5, 2007. ISSN 14712105.

[10] C. Keribin, V. Brault, G. Celeux, and G. Govaert. Model selection for the binary latent block model. In 20th International Conference on Compu- tational Statistics, Limassol, Chypre, 2012. URL http://hal.inria.fr/

hal-00778145.

[11] C. Keribin, V. Brault, G. Celeux, and G. Govaert. Estimation and selection for the latent block model on categorical data. Statistics and Computing, pages 116, 2014. ISSN 0960-3174. . URL http://dx.doi.org/10.1007/

s11222-014-9472-2.

[12] M. Mariadassou and C. Matias. Convergence of the groups posterior distri- bution in latent or stochastic block models. arXiv preprint arXiv:1206.7101, 2012.

[13] H. Shan and A. Banerjee. Bayesian co-clustering. In Eighth IEEE In- ternational Conference on Data Mining, 2008. ICDM'08, pages 530539, 2008.

[14] J. Wyse and N. Friel. Block clustering with collapsed latent block models.

Statistics and Computing, pages 114, 2010.

Références

Documents relatifs

In this paper we describe how the half-gcd algorithm can be adapted in order to speed up the sequential stage of Copper- smith’s block Wiedemann algorithm for solving large

regression problems and compare it to alternative methods for linear regression, namely, ridge regression, Lasso, bagging of Lasso estimates (Breiman, 1996a), and a soft version of

We tackle the community detection problem in the Stochastic Block Model (SBM) when the communities of the nodes of the graph are assigned with a Markovian dynamic.. To recover

data generating process is necessary for the processing of missing values, which requires identifying the type of missingness [Rubin, 1976]: Missing Completely At Random (MCAR)

Consequently, we assume that the observed data matrix is generated according a mixture where we can decompose the density function into a product of two terms: the first one

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des

/ La version de cette publication peut être l’une des suivantes : la version prépublication de l’auteur, la version acceptée du manuscrit ou la version de l’éditeur. For

i,NCBr @φ 2 i,HCCBr (see Figure 3, second column), we can indeed observe a lower contraction of the orbitals, manifesting in: i) the formation of a positive region along the