HAL Id: hal-00743259
https://hal.archives-ouvertes.fr/hal-00743259
Submitted on 18 Oct 2012
Clément Grimal, Gilles Bisson
To cite this version:
Clément Grimal, Gilles Bisson. Amélioration de la co-similarité pour la classification de documents.
CAp 2011 - Conférence Francophone d’Apprentissage, May 2011, Chambéry, France. pp.199-215.
⟨hal-00743259⟩
Improving Co-Similarity for Document Classification

Clément Grimal¹, Gilles Bisson¹

¹Laboratoire d'Informatique de Grenoble, UMR 5217
Clement.Grimal@imag.fr, Gilles.Bisson@imag.fr

Abstract: The joint clustering of objects and their features – for instance, of documents and the words composing them – also known as co-clustering, has been widely studied in recent years, since it allows more relevant clusters to be extracted, whether the co-clustering is explicit or latent. In previous work (Bisson & Hussain, 2008), we proposed a method for simultaneously computing the similarity matrices between objects and between features, each being built from the other. Here, we propose a generalization of this approach by introducing a pseudo-norm and a pruning algorithm. Our experiments show a significant improvement of the clustering quality, notably with respect to other methods.

Keywords: co-clustering, similarity measure, text mining
1. Introduction
The clustering task is used to organize data coming from databases. Classically, these data are described as a set of instances characterized by a set of features.
In some cases, these features are homogeneous enough to allow us to cluster
them, in the same way as we do for the instances. For example, when using the
Vector Space Model introduced by Salton (1971), text corpora are represented
by a matrix whose rows represent document vectors and whose columns rep-
resent the word vectors. Thus, the similarity between two documents depends
on the similarity between the words they contain and vice-versa. In the classi-
cal clustering methods, such dependencies are not exploited. The purpose of
co-clustering is to take into account this duality between rows and columns to
identify the relevant clusters. Co-clustering has been largely studied in recent
years both in Document clustering (Dhillon et al., 2003; Long et al., 2005;
Rege et al., 2008; Liu et al., 2004) and Bioinformatics (Madeira & Oliveira, 2004; Speer et al., 2004; Cheng & Church, 2000).
In text analysis, the advantage of co-clustering is related to the well-known problem that document and word vectors tend to be highly sparse and to suffer from the curse of dimensionality (Slonim & Tishby, 2001). Thus, traditional metrics such as the Euclidean distance or the Cosine similarity do not always make much sense (Beyer et al., 1999). Several methods have been proposed to overcome these limitations by exploiting the dual relationship between documents and words to extract semantic knowledge from the data. Consequently, the concept of higher-order co-occurrences has been investigated in (Livezay & Burgess, 1998; Lemaire & Denhière, 2008), among others, as a measure of semantic relationship between words; one of the best known approaches to acquire such knowledge is Latent Semantic Analysis (Deerwester et al., 1990). The underlying analogy is that humans do not necessarily use the same vocabulary when writing about the same topic. For example, let us consider a corpus in which a subset of documents contains a significant number of co-occurrences between the words sea and waves, and another subset in which the words ocean and waves co-occur. A human could infer that the words ocean and sea are conceptually related even if they do not directly co-occur in any document. Such a relationship between waves and ocean (or sea and waves) is termed a first-order co-occurrence, and the conceptual association between sea and ocean is called a second-order relationship. This concept can be generalized to higher-order (3rd, 4th, 5th, etc.) co-occurrences.
In this context, we recently introduced an algorithm, called χ-Sim (Bisson & Hussain, 2008), exploiting the duality between words and documents in a corpus, as well as their respective higher-order co-occurrences. While most authors have focused on directly co-clustering the data, in χ-Sim we simply build two similarity matrices, one for the rows and one for the columns, each being built iteratively on the basis of the other. We call this process the co-similarity measure. Hence, once the two similarity matrices are built, each of them contains all the information needed to do a 'separate' co-clustering of the data (documents and words) using any classical clustering algorithm.
In this paper, we further analyze the behavior of χ-Sim and we propose several ideas that dramatically improve the quality of the co-similarity
measures. First, we introduce a new normalization schema for this measure
that is more consistent with the framework of the algorithm and that offers
new perspectives of research. Second, we propose an efficient way to deal
with noise in the data and thus to improve the accuracy of the clustering.
2. The χ-Sim Similarity Measure
Throughout this paper, we will use the classical notations: matrices (in capital letters) and vectors (in small letters) are in bold and variables are in italic.
Data matrix: let M be the data matrix representing a corpus having r rows (documents) and c columns (words); m_{ij} corresponds to the 'intensity' of the link between the i-th row and the j-th column (for a document-word matrix, it can be the frequency of the j-th word in the i-th document); m_{i:} = [m_{i1} ⋯ m_{ic}] is the row vector representing the document i and m_{:j} = [m_{1j} ⋯ m_{rj}] is the column vector corresponding to word j. We will refer to a document as d_i when talking about documents casually, and refer to it as m_{i:} when specifying its (row) vector in the matrix M. Similarly, we will casually refer to a word as w_j and use the notation m_{:j} when emphasizing the vector.

Similarity matrices: SR and SC represent the square and symmetrical row-similarity and column-similarity matrices, of size r × r and c × c respectively, with ∀i, j = 1..r, sr_{ij} ∈ [0, 1] and ∀i, j = 1..c, sc_{ij} ∈ [0, 1].

Similarity function: the generic function F_s(·, ·) takes two elements m_{il} and m_{jn} of M and returns a measure of the similarity F_s(m_{il}, m_{jn}) between them.
2.1. Similarity measures
χ-Sim is a co-similarity based approach which builds on the idea of simultaneously generating the similarity matrices SR (documents) and SC (words), each of them built on the basis of the other. Similar ideas have also been used for supervised learning in (Liu et al., 2004) or for image retrieval in (Wang et al., 2004). First, we present how to compute the matrix SR. Usually, the similarity (or distance) measure between two documents m_{i:} and m_{j:} is defined as a function – denoted here as Sim(m_{i:}, m_{j:}) – that is more or less the sum of the similarities between words occurring in both m_{i:} and m_{j:}:

Sim(m_{i:}, m_{j:}) = F_s(m_{i1}, m_{j1}) + \cdots + F_s(m_{ic}, m_{jc})    (1)

Now let us suppose we already know a matrix SC whose entries provide a measure of similarity between the columns (words) of the corpus. In parallel, let us introduce, by analogy to the L_k norm (Minkowski distance), the notion of a pseudo-norm k. Then, Equation (1) can be rewritten as follows without changing its meaning, provided that sc_{ll} = 1 and k = 1:

Sim(m_{i:}, m_{j:}) = \sqrt[k]{\sum_{l=1}^{c} (F_s(m_{il}, m_{jl}))^k \times sc_{ll}}    (2)
Now the idea is to generalize (2) in order to take into account all the possible pairs of features (words) occurring in documents m_{i:} and m_{j:}. In this way, we "capture" not only the similarity between their common words but also the similarity coming from words that are not directly shared by the two documents. Of course, for each pair of words not directly shared by the documents, we weight their contribution to the document similarity sr_{ij} by their own similarity sc_{ln}. Thus, the overall similarity between documents m_{i:} and m_{j:} is defined in (3), in which the terms for l = n are those occurring in (2):

Sim_k(m_{i:}, m_{j:}) = \sqrt[k]{\sum_{l=1}^{c} \sum_{n=1}^{c} (F_s(m_{il}, m_{jn}))^k \times sc_{ln}}    (3)
Assuming that F_s(m_{il}, m_{jn}) is defined as a product (see (Bisson & Hussain, 2008) for further details) of the elements m_{il} and m_{jn}, i.e. F_s(m_{il}, m_{jn}) = m_{il} × m_{jn} (as with the cosine similarity), we can rewrite Equation (3) as:

Sim_k(m_{i:}, m_{j:}) = \sqrt[k]{(m_{i:})^k \times SC \times m_{j:}^T}    (4)

where (m_{i:})^k = [(m_{i1})^k ⋯ (m_{ic})^k] and m_{j:}^T denotes the transpose of the vector m_{j:}. Finally, let us introduce the term N(m_{i:}, m_{j:}), a normalization function allowing us to map the similarity to [0, 1]. We obtain the following equation, in which sr_{ij} denotes an element of the SR matrix:

sr_{ij} = \frac{\sqrt[k]{(m_{i:})^k \times SC \times m_{j:}^T}}{N(m_{i:}, m_{j:})}    (5)

Equation (5) is a classic generalization of several well-known similarity measures. For example, with k = 1, the Jaccard index can be obtained by setting SC to I and N(m_{i:}, m_{j:}) to ‖m_{i:}‖_1 + ‖m_{j:}‖_1 − m_{i:} m_{j:}^T, while the Dice coefficient can be obtained by setting SC to 2I and N(m_{i:}, m_{j:}) to ‖m_{i:}‖_1 + ‖m_{j:}‖_1. Furthermore, if SC is set to a positive semi-definite matrix A, one can define the inner product ⟨m_{i:}, m_{j:}⟩_A = m_{i:} × A × m_{j:}^T, along with the associated norm ‖m_{i:}‖_A = \sqrt{⟨m_{i:}, m_{i:}⟩_A}. Then, by setting N(m_{i:}, m_{j:}) to ‖m_{i:}‖_A × ‖m_{j:}‖_A, we obtain the Generalized Cosine similarity (Qamar & Gaussier, 2009), as it corresponds to the Cosine measure in the underlying inner product space. Of course, by binding A to I, this similarity becomes the standard Cosine measure between documents m_{i:} and m_{j:}.
2.2. The χ-Sim Co-Similarity Measure
Of course, the χ-Sim co-similarity, as defined in (Bisson & Hussain, 2008), can also be reformulated with (5). We set k to 1, and since the maximum value taken by sc_{ij} is 1, it follows from (3) that the upper bound of Sim(m_{i:}, m_{j:}) for 1 ≤ i, j ≤ r is given by the product of the sums of the elements of m_{i:} and m_{j:}, denoted by |m_{i:}| × |m_{j:}| (product of L_1 norms). This normalization seems well suited for textual datasets since it allows us to take into consideration pairs of document (or word) vectors of uneven length, which is common in text corpora. Therefore, we can rewrite (5) as:

∀i, j ∈ 1..r,  sr_{ij} = \frac{m_{i:} \times SC \times m_{j:}^T}{|m_{i:}| \times |m_{j:}|}    (6a)

Symmetrically, the elements sc_{ij} of the SC matrix are defined as:

∀i, j ∈ 1..c,  sc_{ij} = \frac{m_{:i}^T \times SR \times m_{:j}}{|m_{:i}| \times |m_{:j}|}    (6b)

Equations (6a) and (6b) define a system of linear equations whose solutions correspond to the (co-)similarities between two documents and two words. Thus, the χ-Sim algorithm is based on an iterative approach: we alternately compute the values sc_{ij} and sr_{ij} (a minimal sketch of this iteration is given below). However, before detailing this algorithm for a more generic case in Section 3.3, we explain the meaning of these iterations by considering the associated bipartite graph.
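As a sketch of this iterative scheme (illustrative only; it assumes a non-negative NumPy matrix M with no empty row or column, and the function name chi_sim is ours):

import numpy as np

def chi_sim(M, iterations=4):
    """Alternate updates of SR and SC following Eqs. (6a) and (6b)."""
    r, c = M.shape
    SR, SC = np.eye(r), np.eye(c)
    NR = np.outer(np.abs(M).sum(axis=1), np.abs(M).sum(axis=1))  # |m_i:| |m_j:|
    NC = np.outer(np.abs(M).sum(axis=0), np.abs(M).sum(axis=0))  # |m_:i| |m_:j|
    for _ in range(iterations):
        SR_new = (M @ SC @ M.T) / NR   # Eq. (6a), uses SC from step t-1
        SC_new = (M.T @ SR @ M) / NC   # Eq. (6b), uses SR from step t-1
        SR, SC = SR_new, SC_new
    return SR, SC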
2.3. Graph Theoretical Interpretation
The graphical interpretation of the method helps to understand the working of the algorithm and provides some intuition on how to improve it. Let’s consider the bipartite graph representation of a sample data matrix in Fig. 1.
Documents and words are represented by square and circle nodes respectively, and an edge between a document d_i and a word w_j corresponds to a non-zero entry m_{ij} in the document-word matrix.

[Figure 1: A bipartite graph view of a sample document-word matrix (document nodes d_1 ... d_4, word nodes w_1 ... w_6).]

There is only one order-1 path between documents d_1 and d_2, given by d_1 →(m_{12}) w_2 →(m_{22}) d_2. If we consider that the SC matrix is initialized with the identity matrix I, then at the first iteration Sim(m_{1:}, m_{2:}) corresponds to the inner product between m_{1:} and m_{2:} as given by (6a) and equals m_{12} × m_{22}. Omitting the normalization for the sake of clarity, the matrix SR^{(1)} = M × M^T thus represents all order-1 paths between all the possible pairs of documents d_i and d_j. Similarly, each element of SC^{(1)} = M^T × M represents all order-1 paths between all the possible pairs of words w_i and w_j.

Now, documents d_1 and d_4 do not have an order-1 path but are linked together through d_2 (bold paths in Fig. 1) and d_3 (dotted paths in Fig. 1). Such paths with one intermediate vertex are called order-2 paths, and they appear during the second iteration. The similarity value contributed via the document d_2 can be explicitly represented as d_1 →(m_{12}) w_2 →(m_{22}) d_2 →(m_{24}) w_4 →(m_{44}) d_4. The sub-sequence w_2 →(m_{22}) d_2 →(m_{24}) w_4 represents an order-1 path between the words w_2 and w_4, which is exactly sc^{(1)}_{24}. The contribution of d_2 to the similarity sr^{(2)}_{14} can thus be rewritten as m_{12} × sc^{(1)}_{24} × m_{44}. This is a partial similarity measure, since d_2 is not the only document that provides a link between d_1 and d_4. The similarity via d_3 is equal to m_{13} × sc^{(1)}_{35} × m_{45}. To find the overall similarity between documents d_1 and d_4, we need to add these partial similarity values: m_{12} × sc^{(1)}_{24} × m_{44} + m_{13} × sc^{(1)}_{35} × m_{45}. Hence, the similarity matrix SR^{(2)} at the second iteration corresponds to paths of order 2 between documents. It can be shown similarly that the matrices SR^{(t)} and SC^{(t)} represent order-t paths between documents and between words respectively.
Consequently, at each iteration t, when we compute the values of equations (6a) and (6b), one or more new links may be found between previously disjoint objects (documents or words), corresponding to paths of order t, and existing similarity measures may be strengthened. It has been shown that "in the long run" the ending point of a random walk does not depend on its starting point (Seneta, 2006); hence it is possible to find a path (and therefore a similarity) between any pair of nodes in a connected graph (Zelikovitz & Hirsh, 2001) by iterating a sufficiently large number of times.
However, co-occurrences beyond the 3rd and 4th order have little semantic relevance (Bisson & Hussain, 2008; Lemaire & Denhière, 2008). Therefore, the number of iterations is usually limited to 4 or less.
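The order-t path counting can be checked with plain matrix products. The following toy example uses a hypothetical document-word matrix consistent with the paths discussed above (the actual entries of Figure 1 are not given in the text):

import numpy as np

# Rows d1..d4, columns w1..w6; binary entries for readability.
M = np.array([[0, 1, 1, 0, 0, 0],   # d1: w2, w3
              [0, 1, 0, 1, 0, 0],   # d2: w2, w4
              [0, 0, 1, 0, 1, 0],   # d3: w3, w5
              [0, 0, 0, 1, 1, 1]])  # d4: w4, w5, w6

SR1 = M @ M.T          # order-1 paths between documents
SC1 = M.T @ M          # order-1 paths between words
SR2 = M @ SC1 @ M.T    # order-2 paths between documents (= SR1 @ SR1)

print(SR1[0, 3])       # 0: d1 and d4 share no word
print(SR2[0, 3])       # 2: one path via d2 (w2-w4), one via d3 (w3-w5)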
3. Discussion and Improvements of χ-Sim
In this section, we first discuss a new normalization schema for χ-Sim that (partially) satisfies the maximality property of a similarity measure (Sim(a, a) = 1); then we propose a pruning method allowing us to remove the 'noisy' similarity values created during the iterations.
3.1. Normalization
In this paper, we investigate extensions of the Generalized Cosine measure by relaxing the positive semi-definiteness of the matrix and by adding a pseudo-norm parameter k. Henceforth, using equation (4), we define the elements of the matrices SR and SC with the two new equations (7a) and (7b):

∀i, j ∈ 1..r,  sr_{ij} = \frac{Sim_k(m_{i:}, m_{j:})}{\sqrt{Sim_k(m_{i:}, m_{i:})} \times \sqrt{Sim_k(m_{j:}, m_{j:})}}    (7a)

∀i, j ∈ 1..c,  sc_{ij} = \frac{Sim_k(m_{:i}, m_{:j})}{\sqrt{Sim_k(m_{:i}, m_{:i})} \times \sqrt{Sim_k(m_{:j}, m_{:j})}}    (7b)
However, this normalization is what we will call a pseudo-normalization: while it guarantees that sr_{ii} = 1, it does not guarantee that ∀i, j ∈ 1..r, sr_{ij} ∈ [0, 1] (and the same holds for sc_{ij}). Consider for example a corpus having, among many other documents, a document d_1 containing the word orange (w_1) and a document d_2 containing the words red (w_2) and banana (w_3), along with SC – the similarity matrix of all the words of the corpus – indicating that the similarity between orange and red is 1, the similarity between orange and banana is 1, and the similarity between red and banana is 0. Thus, Sim_1(d_1, d_1) = 1, Sim_1(d_2, d_2) = 2 and Sim_1(d_1, d_2) = 2. Consequently, sr_{12} = 2/√(1 × 2) = √2 > 1. One can notice that this problem arises from the polysemic nature of the word orange. Indeed, the similarity between these two documents is overemphasized because of the double analogy between orange (the color) and red, and between orange (the fruit) and banana. It is possible to correct this problem by setting k = +∞, since the pseudo-norm k then becomes max_{1≤l,n≤c} {m_{il} × sc_{ln} × m_{jn}}; thus Sim_∞(d_1, d_1) = Sim_∞(d_2, d_2) = Sim_∞(d_1, d_2) = 1, implying sr_{12} = 1. Of course, k = +∞ is not necessarily a good setting for real tasks, and experimentally we observed that the values of sr_{ij} and sc_{ij} generally remain smaller than or equal to 1.
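A quick numeric check of this example (an illustrative sketch; the vocabulary is reduced to the three words orange, red and banana):

import numpy as np

SC = np.array([[1.0, 1.0, 1.0],    # orange ~ red, orange ~ banana
               [1.0, 1.0, 0.0],
               [1.0, 0.0, 1.0]])
d1 = np.array([1.0, 0.0, 0.0])     # d1 contains only 'orange'
d2 = np.array([0.0, 1.0, 1.0])     # d2 contains 'red' and 'banana'

def sim1(a, b):                    # Eq. (4) with k = 1
    return a @ SC @ b

sr12 = sim1(d1, d2) / np.sqrt(sim1(d1, d1) * sim1(d2, d2))
print(sr12)                        # 1.414... = sqrt(2) > 1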
In this framework, it is nevertheless very interesting to investigate the different results one can obtain by varying k, including values lower than 1, as suggested in (Aggarwal et al., 2001) for the L_k norm, to deal with high-dimensional spaces. The resulting χ-Sim algorithm will be denoted by χ-Sim^k. However, the situation is different from the L_k norm in the sense that our method does not define a proper normed vector space. To understand the problem, it is worth looking closely at the simple case k = 1, where Sim_1(m_{i:}, m_{j:}) = m_{i:} × SC × m_{j:}^T is the general form of an inner product, with the condition that SC is symmetric positive semi-definite (PSD). Unfortunately, in our case, due to the normalization steps, SC is not necessarily PSD, as the condition ∀i, j ∈ 1..c, |sc_{ij}| ≤ √(sc_{ii} × sc_{jj}) = 1 is not verified (cf. the previous example). Thus, our similarity measure is just a bilinear form in a degenerate inner product space (as the conjugate symmetry and linearity axioms are trivially satisfied), in which it corresponds to the 'cosine'.
A straightforward solution would be to project SC (and SR) after each iteration onto the set of PSD matrices (Qamar & Gaussier, 2009). By constraining the similarity matrices to be PSD, we would ensure that the new space remains a proper inner product space. However, we experimentally verified that such an additional step does not improve the results: when testing on real datasets, the similarity matrices are already very close to the set of PSD matrices. In addition, the projection step is very time-consuming. For these reasons, we will not use it in the remainder of this paper.
3.2. Dealing with ‘noise’ in SC and SR matrices
As explained in Section 2.3, the elements of the SR matrix after the first iteration are the weighted order-1 paths in the graph: the diagonal elements sr_{ii} correspond to the paths from each document to itself, while the non-diagonal terms sr_{ij} count the number of order-1 paths between a document i and a neighbour j, which is based on the number of words they have in common. SR^{(1)} is thus the adjacency matrix of the document graph, and iteration t amounts to counting the number of order-t paths between nodes. However, in a corpus, we can observe that there are many words with few occurrences that are not really relevant to the topic of the document or, to be more precise, that are not specific to any family of semantically related documents. These words act as 'noise' in the dataset. Thus, during the iterations, these noisy words allow the algorithm to create new paths between the different families of documents; of course, these paths have a very small similarity value, but they are numerous, and we make the assumption that they blur the similarity values between the classes of documents (and likewise for the words). Based on this observation, we introduce in the χ-Sim algorithm a parameter, termed the pruning threshold and denoted by p, allowing us to set to zero the lowest p% of the similarity values in the matrices SR and SC at each iteration.
In the following, we will refer to this algorithm as χ-Sim_p when using the previous normalization factor described in (6a) and (6b), and as χ-Sim^k_p when using the new pseudo-normalization factor described in (7a) and (7b).
3.3. A Generic χ-Sim Algorithm for χ-Sim^k_p

Equations (7a) and (7b) allow us to compute the similarities between two rows and between two columns. The extension over all pairs of rows and all pairs of columns can be expressed as a simple matrix multiplication. We need to introduce a new notation here: M^{∘k} = (m_{ij}^k)_{i,j}, the element-wise exponentiation of M to the power k. The algorithm follows:
1. We initialize the similarity matrices SR (documents) and SC (words) with the identity matrix I, since, at the first iteration, only the similarity between a row (resp. column) and itself equals 1, and it is zero for all other rows (resp. columns). We denote these matrices by SR^{(0)} and SC^{(0)}.

2. At each iteration t, we calculate the new similarity matrix between documents SR^{(t)} by using the similarity matrix between words SC^{(t−1)}:

SR^{(t)} = M^{∘k} \times SC^{(t−1)} \times (M^{∘k})^T  and  sr^{(t)}_{ij} ← \frac{\sqrt[k]{sr^{(t)}_{ij}}}{\sqrt[2k]{sr^{(t)}_{ii} \times sr^{(t)}_{jj}}}    (8)

We do the same thing for the column similarity matrix SC^{(t)}:

SC^{(t)} = (M^{∘k})^T \times SR^{(t−1)} \times M^{∘k}  and  sc^{(t)}_{ij} ← \frac{\sqrt[k]{sc^{(t)}_{ij}}}{\sqrt[2k]{sc^{(t)}_{ii} \times sc^{(t)}_{jj}}}    (9)

3. We set to 0 the lowest p% of the similarity values in the similarity matrices SR and SC.

4. Steps 2 and 3 are repeated t times (typically, as we saw in Section 2.3, the value t = 4 is enough) to iteratively update SR^{(t)} and SC^{(t)}; a sketch of the whole procedure is given below.
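Putting the four steps together, here is a minimal NumPy sketch of χ-Sim^k_p (illustrative, not the authors' implementation; it assumes a non-negative matrix M with no empty row or column, and the same percentile-based pruning as in Section 3.2):

import numpy as np

def pseudo_normalize(S, k):
    """Eqs. (8)-(9): s_ij <- s_ij^(1/k) / (s_ii * s_jj)^(1/(2k))."""
    d = np.diag(S) ** (1.0 / (2.0 * k))
    return (S ** (1.0 / k)) / np.outer(d, d)

def chi_sim_kp(M, k=1.0, p=0.0, iterations=4):
    r, c = M.shape
    SR, SC = np.eye(r), np.eye(c)                 # step 1: SR^(0), SC^(0)
    Mk = M ** k                                   # element-wise M^{ok}
    for _ in range(iterations):                   # step 4: t iterations
        SR_raw = Mk @ SC @ Mk.T                   # step 2, Eq. (8)
        SC_raw = Mk.T @ SR @ Mk                   # step 2, Eq. (9), uses SR^(t-1)
        SR = pseudo_normalize(SR_raw, k)
        SC = pseudo_normalize(SC_raw, k)
        if p > 0.0:                               # step 3: pruning
            SR[SR < np.percentile(SR, 100.0 * p)] = 0.0
            SC[SC < np.percentile(SC, 100.0 * p)] = 0.0
    return SR, SC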
It is worth noting that, even though χ-Sim^k_p computes the similarity between each pair of documents using all pairs of words, the complexity of the algorithm remains comparable to that of classical similarity measures like the Cosine. Given that – for a square matrix of size n by n – the complexity of matrix multiplication is in O(n³) and the complexity of computing M^{∘k} is in O(n²), the overall complexity of χ-Sim is given by O(tn³).
4. Experiments
Here, to evaluate our system, we cluster the documents coming from the well-known 20-Newsgroup dataset (NG20) by using the document similarity matrices SR generated by χ-Sim. We chose this dataset since it has been widely used as a benchmark for document classification and co-clustering (Dhillon et al., 2003; Long et al., 2005; Zhang et al., 2007; Long et al., 2006), thus allowing us to compare our results with those reported in the literature.
4.1. Preprocessing and Methodology
Test dataset. We replicate the experimental procedures used by Dhillon et al. (2003) and Long et al. (2005, 2006): 10 different samples of each of the 6 subsets described in Table 1 are generated; we ignored the subject lines, removed stop words, and selected the top 2,000 words based on supervised mutual information (Yang & Pedersen, 1997). We will discuss this last preprocessing step further in Section 5. With these six benchmarks, we compared our co-similarity measures based on χ-Sim with four similarity measures – Cosine, LSA (Deerwester et al., 1990), SNOS (Liu et al., 2004) and CTK (Yen et al., 2009) – as well as with three co-clustering methods: ITCC (Dhillon et al., 2003), BVD (Long et al., 2005) and RSN (Long et al., 2006).
Table 1: Description of the subsets of the NG20 dataset used. We provide the newsgroups included, the number of clusters, and the number of documents for every subset.

Subset  Newsgroups included                                               #clust.  #docs.
M2      talk.politics.mideast, talk.politics.misc                            2       500
M5      comp.graphics, rec.motorcycles, rec.sport.baseball, sci.space,
        talk.politics.mideast                                                5       500
M10     alt.atheism, comp.sys.mac.hardware, misc.forsale, rec.autos,
        rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space,
        talk.politics.guns                                                  10       500
NG1     rec.sport.baseball, rec.sport.hockey                                 2       400
NG2     comp.os.ms-windows.misc, comp.windows.x, rec.motorcycles,
        sci.crypt, sci.space                                                 5      1000
NG3     comp.os.ms-windows.misc, comp.windows.x, misc.forsale,
        rec.motorcycles, sci.crypt, sci.space, talk.politics.mideast,
        talk.religion.misc                                                   8      1600
Creation of the clusters. For the 'similarity based' algorithms – χ-Sim, Cosine, LSA, SNOS and CTK – the clusters are generated by applying an Agglomerative Hierarchical Clustering (AHC) method with Ward's linkage to the similarity matrices. We then cut the clustering tree at the level corresponding to the number of document clusters we are expecting (two for subset M2, etc.); a sketch of this step is given below.
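A possible sketch of this step with SciPy (illustrative; Ward's linkage formally assumes Euclidean distances, so turning similarities into dissimilarities via 1 − sr is an approximation, but a common one):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def clusters_from_similarity(SR, n_clusters):
    """AHC with Ward's linkage on a symmetric similarity matrix SR."""
    D = 1.0 - SR                          # dissimilarities
    np.fill_diagonal(D, 0.0)              # squareform expects a zero diagonal
    Z = linkage(squareform(D, checks=False), method='ward')
    return fcluster(Z, t=n_clusters, criterion='maxclust')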
Implementations. The χ-Sim algorithms, as well as Cosine, SNOS and AHC, have been implemented in Python, and LSA has been implemented in MatLab. For CTK, we used the MatLab implementation kindly provided by the authors. For ITCC, we used the implementation provided by the authors and the parameters reported in (Dhillon et al., 2003). For BVD and RSN, as we do not have a running implementation, we directly quote the best values from (Long et al., 2005) and (Long et al., 2006) respectively.
4.2. Experimental Measures
We used the classical micro-averaged precision (Pr) (Dhillon et al., 2003) to compare the accuracy of the document classification; the Normalized Mutual Information (NMI) (Banerjee & Ghosh, 2002) is also used to compare χ-Sim with RSN. For SNOS, we perform four iterations and set the λ parameter to the value proposed by the authors (Liu et al., 2004). For LSA, we tested the algorithm iteratively, keeping the h highest singular values for h = 10..200 by steps of 10; we use the value of h providing, on average, the highest micro-averaged precision. For ITCC, we ran the algorithm three times using the different numbers of word clusters suggested in (Dhillon et al., 2003) for each dataset. For χ-Sim_p, we performed the pruning step as described in Section 3.3, varying the value of p from 0 to 0.9 by steps of 0.1. For each subset, we report the best micro-averaged precision obtained over p. The two measures are sketched below.
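For reference, here is a hedged sketch of these two measures (assuming scikit-learn and SciPy; micro-averaged precision is computed here by matching predicted clusters to true classes with the Hungarian algorithm, which is one common convention among several):

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def micro_averaged_precision(y_true, y_pred):
    """Accuracy under the best one-to-one cluster/class matching."""
    n = max(y_true.max(), y_pred.max()) + 1
    C = np.zeros((n, n), dtype=int)           # contingency table
    for t, q in zip(y_true, y_pred):
        C[t, q] += 1
    rows, cols = linear_sum_assignment(-C)    # maximize matched counts
    return C[rows, cols].sum() / len(y_true)

nmi = normalized_mutual_info_score            # NMI as provided by scikit-learn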
The experimental results are summarized in Table 2. In all its versions, χ-Sim performs better than all the other tested algorithms. Moreover, the new normalization schema proposed in Section 3 clearly improves the results of our algorithm over the previous normalization based on the length of the documents. The SNOS algorithm performs poorly in spite of the fact that it is very close to χ-Sim, probably because it uses a different normalization. It is interesting to notice that the gain obtained with the pruning when using the previous version of χ-Sim on M10 and NG3 (the two hardest problems) is reduced to almost negligible levels with the new algorithm. Finally, the impact of the parameter k is small for all the subsets but M10 and NG3. On these more complex datasets, we observe that setting k to a value lower than 1 slightly improves the clustering, but not in a significant way. This seems to show that the results provided by (Aggarwal et al., 2001), suggesting the use of a value of k lower than 1 with the L_k norm when dealing with high-dimensional spaces, could be relevant in our framework (see next section).
Table 2: Micro-averaged precision (and NMI for χ-Sim based algorithms and RSN) and standard deviation for the various subsets of NG20.

                       M2         M5         M10        NG1        NG2        NG3
Cosine       Pr    0.60±0.00  0.63±0.07  0.49±0.06  0.90±0.11  0.60±0.10  0.59±0.04
LSA          Pr    0.92±0.02  0.87±0.06  0.59±0.07  0.96±0.01  0.82±0.03  0.74±0.03
ITCC         Pr    0.79±0.06  0.49±0.10  0.29±0.02  0.69±0.09  0.63±0.06  0.59±0.05
BVD          Pr    best:0.95  best:0.93  best:0.67      -          -          -
RSN          NMI       -          -          -      0.64±0.16  0.75±0.07  0.70±0.04
SNOS         Pr    0.55±0.02  0.25±0.02  0.24±0.06  0.51±0.01  0.24±0.02  0.22±0.05
CTK          Pr    0.94±0.01  0.95±0.01  0.71±0.03  0.96±0.01  0.90±0.01  0.87±0.02
χ-Sim        Pr    0.91±0.09  0.96±0.00  0.69±0.05  0.96±0.01  0.92±0.01  0.79±0.06
             NMI       -          -          -      0.76±0.06  0.79±0.02  0.72±0.03
χ-Sim_p      Pr    0.94±0.01  0.96±0.00  0.73±0.03  0.97±0.01  0.92±0.01  0.84±0.05
             NMI       -          -          -      0.78±0.05  0.79±0.02  0.73±0.02
χ-Sim^1      Pr    0.95±0.00  0.96±0.02  0.78±0.03  0.97±0.02  0.94±0.01  0.86±0.05
             NMI       -          -          -      0.85±0.07  0.83±0.03  0.79±0.03
χ-Sim^1_p    Pr    0.95±0.00  0.97±0.01  0.78±0.03  0.98±0.01  0.94±0.01  0.87±0.05
             NMI       -          -          -      0.86±0.04  0.83±0.03  0.80±0.02
χ-Sim^0.8    Pr    0.95±0.00  0.97±0.01  0.79±0.02  0.98±0.01  0.94±0.01  0.90±0.01
             NMI       -          -          -      0.87±0.05  0.84±0.02  0.81±0.02
χ-Sim^0.8_p  Pr    0.95±0.00  0.97±0.01  0.80±0.04  0.98±0.00  0.94±0.01  0.90±0.02
             NMI       -          -          -      0.88±0.03  0.85±0.02  0.81±0.03
5. Discussion about the Preprocessing
The feature selection step aims at improving the results by removing words that are not useful to separate the different clusters of documents. Moreover, this step is also clearly needed due to the spatial and time complexity of the algorithms, in O(n³). Nevertheless, we are performing an unsupervised learning task; thus, using a supervised feature selection method – i.e. selecting the top 2,000 words based on how much information they bring to one class of documents or another – introduces some bias, since it makes the problem easier by building well-separated clusters. In real applications, it is impossible to use this kind of preprocessing for unsupervised learning. Thus, to explore the potential effects of this bias, we propose to generate similar subsets of the NG20 dataset using an unsupervised feature selection method.
5.1. Unsupervised Feature Selection
To reduce the number of words in the learning set, we used an approach consisting in selecting a representative subset (a sampling) of the words with the help of the k-medoids algorithm. The procedure is the following: first, we remove from the corpus the words appearing in just one document, as they do not provide information to build the clusters; then, we run k-medoids to get 2,000 classes, corresponding to a selection of 2,000 words (the medoids). We used the implementation of the algorithm provided in the Pycluster package (de Hoon et al., 2004) with the Euclidean distance; a sketch of this procedure is given below.
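A hedged sketch of this selection step (illustrative; it assumes Pycluster's kmedoids interface, which takes a pairwise distance matrix and returns, for each item, the index of its cluster's medoid — if the installed version differs, adapt accordingly):

import numpy as np
from scipy.spatial.distance import pdist, squareform
import Pycluster

def select_words(M, n_words=2000):
    """Drop hapax words, then keep the k-medoid representatives."""
    df = (M > 0).sum(axis=0)                  # document frequency of each word
    kept = np.where(df > 1)[0]                # remove words in a single document
    D = squareform(pdist(M[:, kept].T))       # Euclidean distances between words
    clusterid, error, nfound = Pycluster.kmedoids(D, nclusters=n_words, npass=5)
    medoids = np.unique(clusterid)            # each cluster is labeled by its medoid
    return kept[medoids]                      # column indices of the selected words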
5.2. Results with k-medoids
Here, we use the same methodology as described in Section 4.1, except for the feature selection step, which is now done with k-medoids instead of the supervised mutual information. The results are summarized in Table 3.
Table 3: Micro-averaged precision and standard deviation for the various subsets of NG20, pre-processed using the k-medoids feature selection.

                M2         M5         M10        NG1        NG2        NG3
Cosine      0.61±0.04  0.54±0.08  0.39±0.03  0.52±0.01  0.60±0.05  0.49±0.02
LSA         0.79±0.09  0.66±0.05  0.44±0.04  0.56±0.05  0.61±0.06  0.52±0.03
ITCC        0.70±0.05  0.54±0.05  0.29±0.05  0.61±0.06  0.44±0.08  0.49±0.07
SNOS        0.51±0.01  0.26±0.04  0.20±0.02  0.51±0.00  0.24±0.01  0.22±0.02
CTK         0.67±0.10  0.76±0.04  0.54±0.05  0.69±0.14  0.64±0.06  0.54±0.02
χ-Sim       0.58±0.07  0.62±0.12  0.43±0.04  0.54±0.03  0.60±0.12  0.47±0.05
χ-Sim_p     0.65±0.09  0.68±0.06  0.47±0.04  0.62±0.12  0.63±0.14  0.57±0.04
χ-Sim^1     0.54±0.06  0.62±0.13  0.36±0.04  0.53±0.02  0.35±0.09  0.30±0.05
χ-Sim^1_p   0.80±0.13  0.77±0.08  0.53±0.05  0.75±0.07  0.73±0.06  0.61±0.03
χ-Sim^0.8   0.54±0.05  0.66±0.07  0.37±0.06  0.52±0.02  0.38±0.08  0.36±0.04
χ-Sim^0.8_p 0.81±0.10  0.79±0.05  0.55±0.04  0.81±0.02  0.72±0.02  0.64±0.04