HAL Id: hal-00743259
https://hal.archives-ouvertes.fr/hal-00743259
Submitted on 18 Oct 2012
Clément Grimal, Gilles Bisson
To cite this version:
Clément Grimal, Gilles Bisson. Amélioration de la co-similarité pour la classification de documents.
CAp 2011 - Conférence Francophone d’Apprentissage, May 2011, Chambéry, France. pp.199-215.
⟨hal-00743259⟩
Improving Co-Similarity for Document Classification

Clément Grimal¹, Gilles Bisson¹

¹Laboratoire d'Informatique de Grenoble, UMR 5217
Clement.Grimal@imag.fr, Gilles.Bisson@imag.fr

Abstract: The joint clustering of objects and their features – for instance, of documents and the words composing them – also known as co-clustering, has been widely studied in recent years, since it allows more relevant clusters to be extracted, whether the co-clustering is explicit or latent. In previous work (Bisson & Hussain, 2008), we proposed a method for simultaneously computing the similarity matrices between objects and between features, each being built from the other. Here, we propose a generalization of this approach by introducing a pseudo-norm and a pruning algorithm. Our experiments show a significant improvement of the clustering quality, notably with respect to other methods.

Keywords: co-clustering, similarity measure, text mining
1. Introduction
The clustering task is used to organize data coming from databases. Classically, these data are described as a set of instances characterized by a set of features.
In some cases, these features are homogeneous enough to allow us to cluster
them, in the same way as we do for the instances. For example, when using the
Vector Space Model introduced by Salton (1971), text corpora are represented
by a matrix whose rows represent document vectors and whose columns rep-
resent the word vectors. Thus, the similarity between two documents depends
on the similarity between the words they contain and vice-versa. In the classi-
cal clustering methods, such dependencies are not exploited. The purpose of
co-clustering is to take into account this duality between rows and columns to
identify the relevant clusters. Co-clustering has been largely studied in recent
years both in Document clustering (Dhillon et al., 2003; Long et al., 2005;
Rege et al., 2008; Liu et al., 2004) and Bioinformatics (Madeira & Oliveira, 2004; Speer et al., 2004; Cheng & Church, 2000).
In text analysis, the advantage of co-clustering is related to the well-known problem that document and word vectors tend to be highly sparse and to suffer from the curse of dimensionality (Slonim & Tishby, 2001). Thus, traditional metrics such as the Euclidean distance or the Cosine similarity do not always make much sense (Beyer et al., 1999). Several methods have been proposed to overcome these limitations by exploiting the dual relationship between documents and words to extract semantic knowledge from the data. Consequently, the concept of higher-order co-occurrences has been investigated in (Livezay & Burgess, 1998; Lemaire & Denhière, 2008), among others, as a measure of semantic relationship between words; one of the best known approaches to acquire such knowledge is Latent Semantic Analysis (Deerwester et al., 1990). The underlying analogy is that humans do not necessarily use the same vocabulary when writing about the same topic. For example, let us consider a corpus in which a subset of documents contains a significant number of co-occurrences between the words sea and waves, and another subset in which the words ocean and waves co-occur. A human could infer that the words ocean and sea are conceptually related even if they do not directly co-occur in any document. Such a relationship between waves and ocean (or sea and waves) is termed a first-order co-occurrence, and the conceptual association between sea and ocean is called a second-order relationship. This concept can be generalized to higher-order (3rd, 4th, 5th, etc.) co-occurrences.
In this context, we recently introduced an algorithm, called χ-Sim (Bisson & Hussain, 2008), exploiting the duality between words and documents in a corpus, as well as their respective higher-order co-occurrences. While most authors have focused on directly co-clustering the data, in χ-Sim we simply build two similarity matrices, one for the rows and one for the columns, each being built iteratively on the basis of the other. We call this process the co-similarity measure. Hence, once the two similarity matrices are built, each of them contains all the information needed to do a 'separate' co-clustering of the data (documents and words) using any classical clustering algorithm.
In this paper, we further analyze the behavior of χ-Sim and we propose several ideas that dramatically improve the quality of the co-similarity
measures. First, we introduce a new normalization schema for this measure
that is more consistent with the framework of the algorithm and that offers
new perspectives of research. Second, we propose an efficient way to deal
with noise in the data and thus to improve the accuracy of the clustering.
2. The χ-Sim Similarity Measure
Throughout this paper, we will use the classical notations: matrices (in capital letters) and vectors (in small letters) are in bold and variables are in italic.
Data matrix: let M be the data matrix representing a corpus having r rows (documents) and c columns (words); m_{ij} corresponds to the 'intensity' of the link between the i-th row and the j-th column (for a document-word matrix, it can be the frequency of the j-th word in the i-th document); m_{i:} = [m_{i1} ⋯ m_{ic}] is the row vector representing the document i and m_{:j} = [m_{1j} ⋯ m_{rj}] is the column vector corresponding to word j. We will refer to a document as d_i when talking about documents casually, and refer to it as m_{i:} when specifying its (row) vector in the matrix M. Similarly, we will casually refer to a word as w_j and use the notation m_{:j} when emphasizing the vector.

Similarity matrices: SR and SC represent the square and symmetrical row-similarity and column-similarity matrices, of size r × r and c × c respectively, with ∀i, j = 1..r, sr_{ij} ∈ [0, 1] and ∀i, j = 1..c, sc_{ij} ∈ [0, 1].

Similarity function: the generic function F_s(·, ·) takes two elements m_{il} and m_{jn} of M and returns a measure of the similarity F_s(m_{il}, m_{jn}) between them.
2.1. Similarity measures
χ-Sim is a co-similarity based approach which builds on the idea of simultaneously generating the similarity matrices SR (documents) and SC (words), each of them built on the basis of the other. Similar ideas have also been used for supervised learning in (Liu et al., 2004) or for image retrieval in (Wang et al., 2004). First, we present how to compute the matrix SR. Usually, the similarity (or distance) measure between two documents m_{i:} and m_{j:} is defined as a function – denoted here as Sim(m_{i:}, m_{j:}) – that is more or less the sum of the similarities between words occurring in both m_{i:} and m_{j:}:

Sim(m_{i:}, m_{j:}) = F_s(m_{i1}, m_{j1}) + \cdots + F_s(m_{ic}, m_{jc})    (1)

Now let us suppose we already know a matrix SC whose entries provide a measure of similarity between the columns (words) of the corpus. In parallel, let us introduce, by analogy to the L_k norm (Minkowski distance), the notion of a pseudo-norm k. Then, Equation (1) can be rewritten as follows without changing its meaning, provided that sc_{ll} = 1 and k = 1:

Sim(m_{i:}, m_{j:}) = \sqrt[k]{\sum_{l=1}^{c} (F_s(m_{il}, m_{jl}))^k \times sc_{ll}}    (2)
Now the idea is to generalize (2) in order to take into account all the possible pairs of features (words) occurring in documents m_{i:} and m_{j:}. In this way, we "capture" not only the similarity between their common words but also the similarity coming from words that are not directly shared by the two documents. Of course, for each pair of words not directly shared by the documents, we weight their contribution to the document similarity sr_{ij} by their own similarity sc_{ln}. Thus, the overall similarity between documents m_{i:} and m_{j:} is defined in (3), in which the terms for l = n are those occurring in (2):

Sim_k(m_{i:}, m_{j:}) = \sqrt[k]{\sum_{l=1}^{c} \sum_{n=1}^{c} (F_s(m_{il}, m_{jn}))^k \times sc_{ln}}    (3)
Assuming that F_s(m_{il}, m_{jn}) is defined as a product (see (Bisson & Hussain, 2008) for further details) of the elements m_{il} and m_{jn}, i.e. F_s(m_{il}, m_{jn}) = m_{il} × m_{jn} (as with the cosine similarity), we can rewrite Equation (3) as:

Sim_k(m_{i:}, m_{j:}) = \sqrt[k]{(m_{i:})^k \times SC \times m_{j:}^T}    (4)

where (m_{i:})^k = [(m_{i1})^k ⋯ (m_{ic})^k] and m_{j:}^T denotes the transpose of the vector m_{j:}. Finally, let us introduce the term N(m_{i:}, m_{j:}), a normalization function allowing us to map the similarity to [0, 1]. We obtain the following equation, in which sr_{ij} denotes an element of the SR matrix:

sr_{ij} = \frac{\sqrt[k]{(m_{i:})^k \times SC \times m_{j:}^T}}{N(m_{i:}, m_{j:})}    (5)

Equation (5) is a classic generalization of several well-known similarity measures. For example, with k = 1, the Jaccard index can be obtained by setting SC to I and N(m_{i:}, m_{j:}) to ‖m_{i:}‖_1 + ‖m_{j:}‖_1 − m_{i:} m_{j:}^T, while the Dice coefficient can be obtained by setting SC to 2I and N(m_{i:}, m_{j:}) to ‖m_{i:}‖_1 + ‖m_{j:}‖_1. Furthermore, if SC is set to a positive semi-definite matrix A, one can define the inner product ⟨m_{i:}, m_{j:}⟩_A = m_{i:} × A × m_{j:}^T, along with the associated norm ‖m_{i:}‖_A = \sqrt{⟨m_{i:}, m_{i:}⟩_A}. Then, by setting N(m_{i:}, m_{j:}) to ‖m_{i:}‖_A × ‖m_{j:}‖_A, we obtain the Generalized Cosine similarity (Qamar & Gaussier, 2009), as it corresponds to the Cosine measure in the underlying inner product space. Of course, by binding A to I, this similarity becomes the standard Cosine measure between documents m_{i:} and m_{j:}.
2.2. The χ-Sim Co-Similarity Measure
Of course, the χ-Sim co-similarity, as defined in (Bisson & Hussain, 2008), can also be reformulated with (5). We set k to 1, and since the maximum value taken by sc_{ij} is 1, it follows from (3) that the upper bound of Sim(m_{i:}, m_{j:}) for 1 ≤ i, j ≤ r is given by the product of the sums of the elements of m_{i:} and m_{j:}, denoted by |m_{i:}| × |m_{j:}| (product of L_1 norms). This normalization seems well suited for textual datasets since it allows us to take into consideration pairs of document (or word) vectors of uneven length, which is common in text corpora. Therefore, we can rewrite (5) as:

∀i, j ∈ 1..r,  sr_{ij} = \frac{m_{i:} \times SC \times m_{j:}^T}{|m_{i:}| \times |m_{j:}|}    (6a)

Symmetrically, the elements sc_{ij} of the SC matrix are defined as:

∀i, j ∈ 1..c,  sc_{ij} = \frac{m_{:i}^T \times SR \times m_{:j}}{|m_{:i}| \times |m_{:j}|}    (6b)

Equations (6a) and (6b) define a system of linear equations whose solutions correspond to the (co-)similarities between two documents and two words. Thus, the χ-Sim algorithm is based on an iterative approach: we alternately compute the values sc_{ij} and sr_{ij} (a minimal sketch of this iteration is given below). However, before detailing this algorithm for a more generic case in Section 3.3, we explain the meaning of these iterations by considering the associated bipartite graph.
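As a sketch of this iterative scheme (illustrative only; it assumes a non-negative NumPy matrix M with no empty row or column, and the function name chi_sim is ours):

import numpy as np

def chi_sim(M, iterations=4):
    """Alternate updates of SR and SC following Eqs. (6a) and (6b)."""
    r, c = M.shape
    SR, SC = np.eye(r), np.eye(c)
    NR = np.outer(np.abs(M).sum(axis=1), np.abs(M).sum(axis=1))  # |m_i:| |m_j:|
    NC = np.outer(np.abs(M).sum(axis=0), np.abs(M).sum(axis=0))  # |m_:i| |m_:j|
    for _ in range(iterations):
        SR_new = (M @ SC @ M.T) / NR   # Eq. (6a), uses SC from step t-1
        SC_new = (M.T @ SR @ M) / NC   # Eq. (6b), uses SR from step t-1
        SR, SC = SR_new, SC_new
    return SR, SC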
2.3. Graph Theoretical Interpretation
The graphical interpretation of the method helps to understand the working of the algorithm and provides some intuition on how to improve it. Let’s consider the bipartite graph representation of a sample data matrix in Fig. 1.
Documents and words are represented by square and circle nodes respectively, and an edge between a document d_i and a word w_j corresponds to a non-zero entry m_{ij} in the document-word matrix.

[Figure 1: A bipartite graph view of a sample document-word matrix (document nodes d_1 ... d_4, word nodes w_1 ... w_6).]

There is only one order-1 path between documents d_1 and d_2, given by d_1 →(m_{12}) w_2 →(m_{22}) d_2. If we consider that the SC matrix is initialized with the identity matrix I, then at the first iteration Sim(m_{1:}, m_{2:}) corresponds to the inner product between m_{1:} and m_{2:} as given by (6a) and equals m_{12} × m_{22}. Omitting the normalization for the sake of clarity, the matrix SR^{(1)} = M × M^T thus represents all order-1 paths between all the possible pairs of documents d_i and d_j. Similarly, each element of SC^{(1)} = M^T × M represents all order-1 paths between all the possible pairs of words w_i and w_j.

Now, documents d_1 and d_4 do not have an order-1 path but are linked together through d_2 (bold paths in Fig. 1) and d_3 (dotted paths in Fig. 1). Such paths with one intermediate vertex are called order-2 paths, and they appear during the second iteration. The similarity value contributed via the document d_2 can be explicitly represented as d_1 →(m_{12}) w_2 →(m_{22}) d_2 →(m_{24}) w_4 →(m_{44}) d_4. The sub-sequence w_2 →(m_{22}) d_2 →(m_{24}) w_4 represents an order-1 path between the words w_2 and w_4, which is exactly sc^{(1)}_{24}. The contribution of d_2 to the similarity sr^{(2)}_{14} can thus be rewritten as m_{12} × sc^{(1)}_{24} × m_{44}. This is a partial similarity measure, since d_2 is not the only document that provides a link between d_1 and d_4. The similarity via d_3 is equal to m_{13} × sc^{(1)}_{35} × m_{45}. To find the overall similarity between documents d_1 and d_4, we need to add these partial similarity values: m_{12} × sc^{(1)}_{24} × m_{44} + m_{13} × sc^{(1)}_{35} × m_{45}. Hence, the similarity matrix SR^{(2)} at the second iteration corresponds to paths of order 2 between documents. It can be shown similarly that the matrices SR^{(t)} and SC^{(t)} represent order-t paths between documents and between words respectively.
Consequently, at each iteration t, when we compute the values of equations (6a) and (6b), one or more new links may be found between previously disjoint objects (documents or words), corresponding to paths of order t, and existing similarity measures may be strengthened. It has been shown that "in the long run" the ending point of a random walk does not depend on its starting point (Seneta, 2006); hence it is possible to find a path (and therefore a similarity) between any pair of nodes in a connected graph (Zelikovitz & Hirsh, 2001) by iterating a sufficiently large number of times.
However, co-occurrences beyond the 3rd and 4th order have little semantic relevance (Bisson & Hussain, 2008; Lemaire & Denhière, 2008). Therefore, the number of iterations is usually limited to 4 or less.
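The order-t path counting can be checked with plain matrix products. The following toy example uses a hypothetical document-word matrix consistent with the paths discussed above (the actual entries of Figure 1 are not given in the text):

import numpy as np

# Rows d1..d4, columns w1..w6; binary entries for readability.
M = np.array([[0, 1, 1, 0, 0, 0],   # d1: w2, w3
              [0, 1, 0, 1, 0, 0],   # d2: w2, w4
              [0, 0, 1, 0, 1, 0],   # d3: w3, w5
              [0, 0, 0, 1, 1, 1]])  # d4: w4, w5, w6

SR1 = M @ M.T          # order-1 paths between documents
SC1 = M.T @ M          # order-1 paths between words
SR2 = M @ SC1 @ M.T    # order-2 paths between documents (= SR1 @ SR1)

print(SR1[0, 3])       # 0: d1 and d4 share no word
print(SR2[0, 3])       # 2: one path via d2 (w2-w4), one via d3 (w3-w5)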
3. Discussion and Improvements of χ-Sim
In this section, we first discuss a new normalization schema for χ-Sim that (partially) satisfies the maximality property of a similarity measure (Sim(a, a) = 1); then we propose a pruning method allowing us to remove the 'noisy' similarity values created during the iterations.
3.1. Normalization
In this paper, we investigate extensions of the Generalized Cosine measure by relaxing the positive semi-definiteness of the matrix and by adding a pseudo-norm parameter k. Henceforth, using equation (4), we define the elements of the matrices SR and SC with the two new equations (7a) and (7b):

∀i, j ∈ 1..r,  sr_{ij} = \frac{Sim_k(m_{i:}, m_{j:})}{\sqrt{Sim_k(m_{i:}, m_{i:})} \times \sqrt{Sim_k(m_{j:}, m_{j:})}}    (7a)

∀i, j ∈ 1..c,  sc_{ij} = \frac{Sim_k(m_{:i}, m_{:j})}{\sqrt{Sim_k(m_{:i}, m_{:i})} \times \sqrt{Sim_k(m_{:j}, m_{:j})}}    (7b)
However, this normalization is what we will call a pseudo-normalization: while it guarantees that sr_{ii} = 1, it does not guarantee that ∀i, j ∈ 1..r, sr_{ij} ∈ [0, 1] (and the same holds for sc_{ij}). Consider for example a corpus having, among many other documents, a document d_1 containing the word orange (w_1) and a document d_2 containing the words red (w_2) and banana (w_3), along with SC – the similarity matrix of all the words of the corpus – indicating that the similarity between orange and red is 1, the similarity between orange and banana is 1, and the similarity between red and banana is 0. Thus, Sim_1(d_1, d_1) = 1, Sim_1(d_2, d_2) = 2 and Sim_1(d_1, d_2) = 2. Consequently, sr_{12} = 2/√(1 × 2) = √2 > 1. One can notice that this problem arises from the polysemic nature of the word orange. Indeed, the similarity between these two documents is overemphasized because of the double analogy between orange (the color) and red, and between orange (the fruit) and banana. It is possible to correct this problem by setting k = +∞, since the pseudo-norm k then becomes max_{1≤l,n≤c} {m_{il} × sc_{ln} × m_{jn}}; thus Sim_∞(d_1, d_1) = Sim_∞(d_2, d_2) = Sim_∞(d_1, d_2) = 1, implying sr_{12} = 1. Of course, k = +∞ is not necessarily a good setting for real tasks, and experimentally we observed that the values of sr_{ij} and sc_{ij} generally remain smaller than or equal to 1.
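A quick numeric check of this example (an illustrative sketch; the vocabulary is reduced to the three words orange, red and banana):

import numpy as np

SC = np.array([[1.0, 1.0, 1.0],    # orange ~ red, orange ~ banana
               [1.0, 1.0, 0.0],
               [1.0, 0.0, 1.0]])
d1 = np.array([1.0, 0.0, 0.0])     # d1 contains only 'orange'
d2 = np.array([0.0, 1.0, 1.0])     # d2 contains 'red' and 'banana'

def sim1(a, b):                    # Eq. (4) with k = 1
    return a @ SC @ b

sr12 = sim1(d1, d2) / np.sqrt(sim1(d1, d1) * sim1(d2, d2))
print(sr12)                        # 1.414... = sqrt(2) > 1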
In this framework, it is nevertheless very interesting to investigate the different results one can obtain by varying k, including values lower than 1, as suggested in (Aggarwal et al., 2001) for the L_k norm, to deal with high-dimensional spaces. The resulting χ-Sim algorithm will be denoted by χ-Sim^k. However, the situation is different from the L_k norm in the sense that our method does not define a proper normed vector space. To understand the problem, it is worth looking closely at the simple case k = 1, where Sim_1(m_{i:}, m_{j:}) = m_{i:} × SC × m_{j:}^T is the general form of an inner product, with the condition that SC is symmetric positive semi-definite (PSD). Unfortunately, in our case, due to the normalization steps, SC is not necessarily PSD, as the condition ∀i, j ∈ 1..c, |sc_{ij}| ≤ √(sc_{ii} × sc_{jj}) = 1 is not verified (cf. the previous example). Thus, our similarity measure is just a bilinear form in a degenerate inner product space (as the conjugate symmetry and linearity axioms are trivially satisfied), in which it corresponds to the 'cosine'.
A straightforward solution would be to project SC (and SR) after each iteration onto the set of PSD matrices (Qamar & Gaussier, 2009). By constraining the similarity matrices to be PSD, we would ensure that the new space remains a proper inner product space. However, we experimentally verified that such an additional step does not improve the results: when testing on real datasets, the similarity matrices are already very close to the set of PSD matrices. In addition, the projection step is very time-consuming. For these reasons, we will not use it in the remainder of this paper.
3.2. Dealing with ‘noise’ in SC and SR matrices
As explained in Section 2.3, the elements of the SR matrix after the first iteration are the weighted order-1 paths in the graph: the diagonal elements sr_{ii} correspond to the paths from each document to itself, while the non-diagonal terms sr_{ij} count the number of order-1 paths between a document i and a neighbour j, which is based on the number of words they have in common. SR^{(1)} is thus the adjacency matrix of the document graph, and iteration t amounts to counting the number of order-t paths between nodes. However, in a corpus, we can observe that there are many words with few occurrences that are not really relevant to the topic of the document or, to be more precise, that are not specific to any family of semantically related documents. These words act as 'noise' in the dataset. Thus, during the iterations, these noisy words allow the algorithm to create new paths between the different families of documents; of course, these paths have a very small similarity value, but they are numerous, and we make the assumption that they blur the similarity values between the classes of documents (and likewise for the words). Based on this observation, we introduce in the χ-Sim algorithm a parameter, termed the pruning threshold and denoted by p, allowing us to set to zero the lowest p% of the similarity values in the matrices SR and SC at each iteration.
In the following, we will refer to this algorithm as χ-Sim_p when using the previous normalization factor described in (6a) and (6b), and as χ-Sim^k_p when using the new pseudo-normalization factor described in (7a) and (7b).
3.3. A Generic χ-Sim Algorithm for χ-Sim^k_p

Equations (7a) and (7b) allow us to compute the similarities between two rows and between two columns. The extension over all pairs of rows and all pairs of columns can be expressed as a simple matrix multiplication. We need to introduce a new notation here: M^{∘k} = (m_{ij}^k)_{i,j}, the element-wise exponentiation of M to the power k. The algorithm follows:
1. We initialize the similarity matrices SR (documents) and SC (words) with the identity matrix I, since, at the first iteration, only the similarity between a row (resp. column) and itself equals 1, and it is zero for all other rows (resp. columns). We denote these matrices by SR^{(0)} and SC^{(0)}.

2. At each iteration t, we calculate the new similarity matrix between documents SR^{(t)} by using the similarity matrix between words SC^{(t−1)}:

SR^{(t)} = M^{∘k} \times SC^{(t−1)} \times (M^{∘k})^T  and  sr^{(t)}_{ij} ← \frac{\sqrt[k]{sr^{(t)}_{ij}}}{\sqrt[2k]{sr^{(t)}_{ii} \times sr^{(t)}_{jj}}}    (8)

We do the same thing for the column similarity matrix SC^{(t)}:

SC^{(t)} = (M^{∘k})^T \times SR^{(t−1)} \times M^{∘k}  and  sc^{(t)}_{ij} ← \frac{\sqrt[k]{sc^{(t)}_{ij}}}{\sqrt[2k]{sc^{(t)}_{ii} \times sc^{(t)}_{jj}}}    (9)

3. We set to 0 the lowest p% of the similarity values in the similarity matrices SR and SC.

4. Steps 2 and 3 are repeated t times (typically, as we saw in Section 2.3, the value t = 4 is enough) to iteratively update SR^{(t)} and SC^{(t)}; a sketch of the whole procedure is given below.
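Putting the four steps together, here is a minimal NumPy sketch of χ-Sim^k_p (illustrative, not the authors' implementation; it assumes a non-negative matrix M with no empty row or column, and the same percentile-based pruning as in Section 3.2):

import numpy as np

def pseudo_normalize(S, k):
    """Eqs. (8)-(9): s_ij <- s_ij^(1/k) / (s_ii * s_jj)^(1/(2k))."""
    d = np.diag(S) ** (1.0 / (2.0 * k))
    return (S ** (1.0 / k)) / np.outer(d, d)

def chi_sim_kp(M, k=1.0, p=0.0, iterations=4):
    r, c = M.shape
    SR, SC = np.eye(r), np.eye(c)                 # step 1: SR^(0), SC^(0)
    Mk = M ** k                                   # element-wise M^{ok}
    for _ in range(iterations):                   # step 4: t iterations
        SR_raw = Mk @ SC @ Mk.T                   # step 2, Eq. (8)
        SC_raw = Mk.T @ SR @ Mk                   # step 2, Eq. (9), uses SR^(t-1)
        SR = pseudo_normalize(SR_raw, k)
        SC = pseudo_normalize(SC_raw, k)
        if p > 0.0:                               # step 3: pruning
            SR[SR < np.percentile(SR, 100.0 * p)] = 0.0
            SC[SC < np.percentile(SC, 100.0 * p)] = 0.0
    return SR, SC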
It is worth noting that, even though χ-Sim^k_p computes the similarity between each pair of documents using all pairs of words, the complexity of the algorithm remains comparable to that of classical similarity measures like the Cosine. Given that – for a square matrix of size n by n – the complexity of matrix multiplication is in O(n³) and the complexity of computing M^{∘k} is in O(n²), the overall complexity of χ-Sim is given by O(tn³).
4. Experiments
Here, to evaluate our system, we cluster the documents coming from the well-known 20-Newsgroup dataset (NG20) by using the document similarity matrices SR generated by χ-Sim. We chose this dataset since it has been widely used as a benchmark for document classification and co-clustering (Dhillon et al., 2003; Long et al., 2005; Zhang et al., 2007; Long et al., 2006), thus allowing us to compare our results with those reported in the literature.
4.1. Preprocessing and Methodology
Test dataset. We replicate the experimental procedures used by Dhillon et al. (2003) and Long et al. (2005, 2006): 10 different samples of each of the 6 subsets described in Table 1 are generated; we ignored the subject lines, removed stop words, and selected the top 2,000 words based on supervised mutual information (Yang & Pedersen, 1997). We will discuss this last preprocessing step further in Section 5. With these six benchmarks, we compared our co-similarity measures based on χ-Sim with four similarity measures – Cosine, LSA (Deerwester et al., 1990), SNOS (Liu et al., 2004) and CTK (Yen et al., 2009) – as well as with three co-clustering methods: ITCC (Dhillon et al., 2003), BVD (Long et al., 2005) and RSN (Long et al., 2006).
Table 1: Description of the subsets of the NG20 dataset used. We provide the newsgroups included, the number of clusters, and the number of documents for every subset.

Subset  Newsgroups included                                               #clust.  #docs.
M2      talk.politics.mideast, talk.politics.misc                            2       500
M5      comp.graphics, rec.motorcycles, rec.sport.baseball, sci.space,
        talk.politics.mideast                                                5       500
M10     alt.atheism, comp.sys.mac.hardware, misc.forsale, rec.autos,
        rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space,
        talk.politics.guns                                                  10       500
NG1     rec.sport.baseball, rec.sport.hockey                                 2       400
NG2     comp.os.ms-windows.misc, comp.windows.x, rec.motorcycles,
        sci.crypt, sci.space                                                 5      1000
NG3     comp.os.ms-windows.misc, comp.windows.x, misc.forsale,
        rec.motorcycles, sci.crypt, sci.space, talk.politics.mideast,
        talk.religion.misc                                                   8      1600
Creation of the clusters. For the 'similarity based' algorithms – χ-Sim, Cosine, LSA, SNOS and CTK – the clusters are generated by applying an Agglomerative Hierarchical Clustering (AHC) method with Ward's linkage to the similarity matrices. We then cut the clustering tree at the level corresponding to the number of document clusters we are expecting (two for subset M2, etc.); a sketch of this step is given below.
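A possible sketch of this step with SciPy (illustrative; Ward's linkage formally assumes Euclidean distances, so turning similarities into dissimilarities via 1 − sr is an approximation, but a common one):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def clusters_from_similarity(SR, n_clusters):
    """AHC with Ward's linkage on a symmetric similarity matrix SR."""
    D = 1.0 - SR                          # dissimilarities
    np.fill_diagonal(D, 0.0)              # squareform expects a zero diagonal
    Z = linkage(squareform(D, checks=False), method='ward')
    return fcluster(Z, t=n_clusters, criterion='maxclust')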
Implementations. The χ-Sim algorithms, as well as Cosine, SNOS and AHC, have been implemented in Python, and LSA has been implemented in MatLab. For CTK, we used the MatLab implementation kindly provided by the authors. For ITCC, we used the implementation provided by the authors and the parameters reported in (Dhillon et al., 2003). For BVD and RSN, as we do not have a running implementation, we directly quote the best values from (Long et al., 2005) and (Long et al., 2006) respectively.
4.2. Experimental Measures
We used the classical micro-averaged precision (Pr) (Dhillon et al., 2003) to compare the accuracy of the document classification; the Normalized Mutual Information (NMI) (Banerjee & Ghosh, 2002) is also used to compare χ-Sim with RSN. For SNOS, we perform four iterations and set the λ parameter to the value proposed by the authors (Liu et al., 2004). For LSA, we tested the algorithm iteratively, keeping the h highest singular values for h = 10..200 by steps of 10; we use the value of h providing, on average, the highest micro-averaged precision. For ITCC, we ran the algorithm three times using the different numbers of word clusters suggested in (Dhillon et al., 2003) for each dataset. For χ-Sim_p, we performed the pruning step as described in Section 3.3, varying the value of p from 0 to 0.9 by steps of 0.1. For each subset, we report the best micro-averaged precision obtained over p. The two measures are sketched below.
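For reference, here is a hedged sketch of these two measures (assuming scikit-learn and SciPy; micro-averaged precision is computed here by matching predicted clusters to true classes with the Hungarian algorithm, which is one common convention among several):

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def micro_averaged_precision(y_true, y_pred):
    """Accuracy under the best one-to-one cluster/class matching."""
    n = max(y_true.max(), y_pred.max()) + 1
    C = np.zeros((n, n), dtype=int)           # contingency table
    for t, q in zip(y_true, y_pred):
        C[t, q] += 1
    rows, cols = linear_sum_assignment(-C)    # maximize matched counts
    return C[rows, cols].sum() / len(y_true)

nmi = normalized_mutual_info_score            # NMI as provided by scikit-learn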
The experimental results are summarized in Table 2. In all its versions, χ-Sim performs better than all the other tested algorithms. Moreover, the new normalization schema proposed in Section 3 clearly improves the results of our algorithm over the previous normalization based on the length of the documents. The SNOS algorithm performs poorly in spite of the fact that it is very close to χ-Sim, probably because it uses a different normalization. It is interesting to notice that the gain obtained with the pruning when using the previous version of χ-Sim on M10 and NG3 (the two hardest problems) is reduced to almost negligible levels with the new algorithm. Finally, the impact of the parameter k is small for all the subsets but M10 and NG3. On these more complex datasets, we observe that setting k to a value lower than 1 slightly improves the clustering, but not in a significant way. This seems to show that the results provided by (Aggarwal et al., 2001), suggesting the use of a value of k lower than 1 with the L_k norm when dealing with high-dimensional spaces, could be relevant in our framework (see next section).
Table 2: Micro-averaged precision (and NMI for χ-Sim based algorithms and RSN) and standard deviation for the various subsets of NG20.

                       M2         M5         M10        NG1        NG2        NG3
Cosine       Pr    0.60±0.00  0.63±0.07  0.49±0.06  0.90±0.11  0.60±0.10  0.59±0.04
LSA          Pr    0.92±0.02  0.87±0.06  0.59±0.07  0.96±0.01  0.82±0.03  0.74±0.03
ITCC         Pr    0.79±0.06  0.49±0.10  0.29±0.02  0.69±0.09  0.63±0.06  0.59±0.05
BVD          Pr    best:0.95  best:0.93  best:0.67      -          -          -
RSN          NMI       -          -          -      0.64±0.16  0.75±0.07  0.70±0.04
SNOS         Pr    0.55±0.02  0.25±0.02  0.24±0.06  0.51±0.01  0.24±0.02  0.22±0.05
CTK          Pr    0.94±0.01  0.95±0.01  0.71±0.03  0.96±0.01  0.90±0.01  0.87±0.02
χ-Sim        Pr    0.91±0.09  0.96±0.00  0.69±0.05  0.96±0.01  0.92±0.01  0.79±0.06
             NMI       -          -          -      0.76±0.06  0.79±0.02  0.72±0.03
χ-Sim_p      Pr    0.94±0.01  0.96±0.00  0.73±0.03  0.97±0.01  0.92±0.01  0.84±0.05
             NMI       -          -          -      0.78±0.05  0.79±0.02  0.73±0.02
χ-Sim^1      Pr    0.95±0.00  0.96±0.02  0.78±0.03  0.97±0.02  0.94±0.01  0.86±0.05
             NMI       -          -          -      0.85±0.07  0.83±0.03  0.79±0.03
χ-Sim^1_p    Pr    0.95±0.00  0.97±0.01  0.78±0.03  0.98±0.01  0.94±0.01  0.87±0.05
             NMI       -          -          -      0.86±0.04  0.83±0.03  0.80±0.02
χ-Sim^0.8    Pr    0.95±0.00  0.97±0.01  0.79±0.02  0.98±0.01  0.94±0.01  0.90±0.01
             NMI       -          -          -      0.87±0.05  0.84±0.02  0.81±0.02
χ-Sim^0.8_p  Pr    0.95±0.00  0.97±0.01  0.80±0.04  0.98±0.00  0.94±0.01  0.90±0.02
             NMI       -          -          -      0.88±0.03  0.85±0.02  0.81±0.03
5. Discussion about the Preprocessing
The feature selection step aims at improving the results by removing words that are not useful to separate the different clusters of documents. Moreover, this step is also clearly needed due to the spatial and time complexity of the algorithms, in O(n³). Nevertheless, we are performing an unsupervised learning task; thus, using a supervised feature selection method – i.e. selecting the top 2,000 words based on how much information they bring to one class of documents or another – introduces some bias, since it makes the problem easier by building well-separated clusters. In real applications, it is impossible to use this kind of preprocessing for unsupervised learning. Thus, to explore the potential effects of this bias, we propose to generate similar subsets of the NG20 dataset using an unsupervised feature selection method.
5.1. Unsupervised Feature Selection
To reduce the number of words in the learning set, we used an approach consisting in selecting a representative subset (a sampling) of the words with the help of the k-medoids algorithm. The procedure is the following: first, we remove from the corpus the words appearing in just one document, as they do not provide information to build the clusters; then, we run k-medoids to get 2,000 classes, corresponding to a selection of 2,000 words (the medoids). We used the implementation of the algorithm provided in the Pycluster package (de Hoon et al., 2004) with the Euclidean distance; a sketch of this procedure is given below.
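A hedged sketch of this selection step (illustrative; it assumes Pycluster's kmedoids interface, which takes a pairwise distance matrix and returns, for each item, the index of its cluster's medoid — if the installed version differs, adapt accordingly):

import numpy as np
from scipy.spatial.distance import pdist, squareform
import Pycluster

def select_words(M, n_words=2000):
    """Drop hapax words, then keep the k-medoid representatives."""
    df = (M > 0).sum(axis=0)                  # document frequency of each word
    kept = np.where(df > 1)[0]                # remove words in a single document
    D = squareform(pdist(M[:, kept].T))       # Euclidean distances between words
    clusterid, error, nfound = Pycluster.kmedoids(D, nclusters=n_words, npass=5)
    medoids = np.unique(clusterid)            # each cluster is labeled by its medoid
    return kept[medoids]                      # column indices of the selected words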
5.2. Results with k-medoids
Here, we use the same methodology as described in Section 4.1, except for the feature selection step, which is now done with k-medoids instead of the supervised mutual information. The results are summarized in Table 3.
Table 3: Micro-averaged precision and standard deviation for the various subsets of NG20, pre-processed using the k-medoids feature selection.

                M2         M5         M10        NG1        NG2        NG3
Cosine      0.61±0.04  0.54±0.08  0.39±0.03  0.52±0.01  0.60±0.05  0.49±0.02
LSA         0.79±0.09  0.66±0.05  0.44±0.04  0.56±0.05  0.61±0.06  0.52±0.03
ITCC        0.70±0.05  0.54±0.05  0.29±0.05  0.61±0.06  0.44±0.08  0.49±0.07
SNOS        0.51±0.01  0.26±0.04  0.20±0.02  0.51±0.00  0.24±0.01  0.22±0.02
CTK         0.67±0.10  0.76±0.04  0.54±0.05  0.69±0.14  0.64±0.06  0.54±0.02
χ-Sim       0.58±0.07  0.62±0.12  0.43±0.04  0.54±0.03  0.60±0.12  0.47±0.05
χ-Sim_p     0.65±0.09  0.68±0.06  0.47±0.04  0.62±0.12  0.63±0.14  0.57±0.04
χ-Sim^1     0.54±0.06  0.62±0.13  0.36±0.04  0.53±0.02  0.35±0.09  0.30±0.05
χ-Sim^1_p   0.80±0.13  0.77±0.08  0.53±0.05  0.75±0.07  0.73±0.06  0.61±0.03
χ-Sim^0.8   0.54±0.05  0.66±0.07  0.37±0.06  0.52±0.02  0.38±0.08  0.36±0.04
χ-Sim^0.8_p 0.81±0.10  0.79±0.05  0.55±0.04  0.81±0.02  0.72±0.02  0.64±0.04