Measures of the Mapping Quality - Graph-Based Clustering and Data Visualization Algorithms

During the mapping process, algorithms try to approximate the high-dimensional spatial distribution of the objects in a low-dimensional vector space. Different algorithms may produce different low-dimensional visualisations by emphasising different characteristics of the relationships among the objects. Some of them try to preserve distances, while others try to preserve neighbourhood relationships. From another perspective, some algorithms emphasise the local structure of the data points, while others focus on the global structure. According to these approaches, the evaluation criteria of mapping can be summarised as follows:

• Distance versus neighbourhood preservation: Mapping methods try to preserve either the distances or the neighbourhood relations among the data points. While dimensionality reduction methods based on distance preservation try to preserve the pairwise distances between the samples, the mapping methods based on neighbourhood preservation attempt to preserve the global ordering relation of the data.

Several numerical measures have been proposed to express how well the distances are preserved, for example the classical metric MDS [27,28] and the Sammon stress [18] functions. The degree to which the neighbourhood relations are preserved can be measured by the functions of trustworthiness and continuity [29,30].

• Local versus global methods: The analysis of the considered mapping methods can also be based on evaluating the mapping quality in the local and the global environment of the objects. Local approaches attempt to preserve the local geometry of the data, that is, they try to map nearby points in the input space to nearby points in the output space. Global approaches attempt to preserve geometry at all scales by mapping nearby points in the input space to nearby points in the output space, and faraway points in the input space to faraway points in the output space.

To measure the distance preservation of the mapping methods, the Sammon stress function, the classical MDS stress function and the residual variance are most commonly used.

Sammon stress and classical MDS stress functions are similar to each other.

Both functions calculate the pairwise distances in the original space and in the reduced low-dimensional space. Both measures can be interpreted as an error between the original distances in the high-dimensional vector space and the distances in the mapped low-dimensional vector space. The difference between them is that the Sammon stress contains a normalising factor: in the Sammon stress, each error is normalised by the corresponding distance between the input data objects. The Sammon stress is calculated as shown in Eq. 3.1, and Eq. 3.2 gives the classical MDS stress function.

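$$E_{SM} = \frac{1}{\sum_{i<j}^{N} d^{*}_{i,j}} \sum_{i<j}^{N} \frac{\left(d^{*}_{i,j} - d_{i,j}\right)^{2}}{d^{*}_{i,j}}, \tag{3.1}$$

$$E_{metric\_MDS} = \frac{1}{\sum_{i<j}^{N} d^{*2}_{i,j}} \sum_{i<j}^{N} \left(d^{*}_{i,j} - d_{i,j}\right)^{2}, \tag{3.2}$$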

In both equations $d^{*}_{i,j}$ denotes the distance between the i-th and j-th original objects, and $d_{i,j}$ denotes the distance between the corresponding mapped data points in the reduced vector space.

The variable N denotes the number of objects to be mapped.
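The two stress functions of Eqs. 3.1 and 3.2 can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names, the helper `pairwise_distances` and the toy data are assumptions made here, and the code assumes that no two input objects coincide (otherwise the Sammon stress would divide by zero).

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix between the rows of X (hypothetical helper)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def sammon_stress(D_orig, D_map):
    """Sammon stress (Eq. 3.1) from two square distance matrices."""
    iu = np.triu_indices_from(D_orig, k=1)      # all pairs with i < j
    d_star, d = D_orig[iu], D_map[iu]
    # each squared error is normalised by the original distance
    return float(((d_star - d) ** 2 / d_star).sum() / d_star.sum())

def metric_mds_stress(D_orig, D_map):
    """Classical metric MDS stress (Eq. 3.2) from two square distance matrices."""
    iu = np.triu_indices_from(D_orig, k=1)      # all pairs with i < j
    d_star, d = D_orig[iu], D_map[iu]
    return float(((d_star - d) ** 2).sum() / (d_star ** 2).sum())

# Toy data (assumption): 30 objects in 5 dimensions; the "mapping" simply
# keeps the first two coordinates, which stands in for a real method.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
Y = X[:, :2]
D_orig = pairwise_distances(X)
D_map = pairwise_distances(Y)
print("Sammon stress:    ", sammon_stress(D_orig, D_map))
print("metric MDS stress:", metric_mds_stress(D_orig, D_map))
```

Both functions return 0 when every mapped distance equals its original counterpart, and larger values indicate poorer distance preservation.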

A further error measure is the residual variance, defined as:

$$1 - R^{2}(D^{*}_{X}, D_{Y}), \tag{3.3}$$

where $D_{Y}$ denotes the matrix of Euclidean distances in the low-dimensional output space ($D_{Y} = [d_{i,j}]$), and $D^{*}_{X} = [d^{*}_{i,j}]$ is the best estimate of the distances of the data to be projected. The pairwise dissimilarities of the objects in the input space may be given by the Euclidean distances or may be estimated by the graph distances of the objects. R is the standard linear correlation coefficient, taken over all entries of $D^{*}_{X}$ and $D_{Y}$.

In the remainder of this book, when the examined methods use geodesic or graph distances to calculate the pairwise dissimilarities of the objects in the high-dimensional space, the dissimilarities of these objects ($d^{*}_{i,j}$) are also estimated on this basis.
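The residual variance of Eq. 3.3 can be computed directly from the two distance matrices. The sketch below assumes both matrices are square NumPy arrays of the same shape; the function name and the small example matrix are illustrative choices made here. Following the definition above, the correlation is taken over all entries of the matrices.

```python
import numpy as np

def residual_variance(D_orig, D_map):
    """Residual variance 1 - R^2 between two distance matrices (Eq. 3.3)."""
    # standard linear (Pearson) correlation over all matrix entries
    r = np.corrcoef(D_orig.ravel(), D_map.ravel())[0, 1]
    return float(1.0 - r ** 2)

# Illustrative input (assumption): distances among three points on a line.
D = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])

# A mapping that only rescales all distances is perfectly correlated with
# the original, so its residual variance is zero up to rounding error.
print(residual_variance(D, 3.0 * D))
```

Because the correlation coefficient is invariant to rescaling, this measure, unlike the stress functions, does not penalise a mapping that uniformly stretches or shrinks all distances.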

The neighbourhood preservation of the mappings and the local and global mapping qualities can be measured by the functions of trustworthiness and continuity. Kaski and Venna pointed out that every visualisation method has to make a tradeoff between gaining good trustworthiness and preserving the continuity of the mapping [30,31].

A projection is said to be trustworthy [29,30] when the nearest neighbours of a point in the reduced space are also close in the original vector space. Let N be the number of objects to be mapped, and let $U_{k}(i)$ be the set of points that are in the k-size neighbourhood of sample i in the visualisation display but not in the original data space. The trustworthiness of a visualisation can be calculated in the following way:

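$$M_{1}(k) = 1 - \frac{2}{Nk(2N - 3k - 1)} \sum_{i=1}^{N} \sum_{j \in U_{k}(i)} \left(r(i,j) - k\right),$$

where $r(i,j)$ denotes the rank of object j when the objects are ordered by their distance from object i in the input space.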

The projection onto a lower dimensional output space is said to be continuous [29,30] when points near to each other in the original space are also nearby in the output space. Continuity of visualisation is calculated by the following equation:

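$$M_{2}(k) = 1 - \frac{2}{Nk(2N - 3k - 1)} \sum_{i=1}^{N} \sum_{j \in V_{k}(i)} \left(\hat{r}(i,j) - k\right),$$

where $\hat{r}(i,j)$ is the rank of data sample j with respect to sample i in the output space, and $V_{k}(i)$ denotes the set of those data points that belong to the k-neighbours of data sample i in the original space, but not in the mapped space used for visualisation.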

In this book when mappings are based on geodesic distances, the ranking values of the objects in both cases (trustworthiness and continuity) are calculated based on the geodesic distances.

The mapping quality of the applied methods in the local and in the global sense can be expressed by trustworthiness and continuity. Both measures are functions of the number of neighbours k. Usually, trustworthiness and continuity are calculated for k = 1, 2, ..., k_max, where k_max denotes the maximum number of objects to be taken into account. Small values of k test the local reconstruction performance of the model, while larger values of k measure the global reconstruction.
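Trustworthiness and continuity can be sketched as follows for k = 1, ..., k_max. The code assumes the usual normalisation 2 / (Nk(2N − 3k − 1)), which is meaningful for k < N/2, and it assumes that ranks start at 1 for the nearest neighbour. The function names, the rank helper and the toy data are assumptions made for this illustration. When the mapping is based on geodesic distances, `D_orig` would hold those distances instead of Euclidean ones.

```python
import numpy as np

def rank_matrix(D):
    """ranks[i, j] = rank of object j by distance from object i.

    Object i itself receives rank 0, its nearest neighbour rank 1, and so on.
    """
    D = np.array(D, dtype=float)          # work on a copy
    np.fill_diagonal(D, -np.inf)          # force each object to sort first
    order = np.argsort(D, axis=1, kind="stable")
    n = D.shape[0]
    ranks = np.empty((n, n), dtype=int)
    ranks[np.arange(n)[:, None], order] = np.arange(n)[None, :]
    return ranks

def trustworthiness_continuity(D_orig, D_map, k):
    """Return (M1(k), M2(k)) from input-space and output-space distances."""
    n = D_orig.shape[0]
    r_orig = rank_matrix(D_orig)          # r(i, j): ranks in the input space
    r_map = rank_matrix(D_map)            # r^(i, j): ranks in the output space
    in_orig = (r_orig >= 1) & (r_orig <= k)
    in_map = (r_map >= 1) & (r_map <= k)
    U = in_map & ~in_orig                 # neighbours only in the output space
    V = in_orig & ~in_map                 # neighbours only in the input space
    norm = 2.0 / (n * k * (2 * n - 3 * k - 1))
    m1 = 1.0 - norm * (r_orig[U] - k).sum()
    m2 = 1.0 - norm * (r_map[V] - k).sum()
    return float(m1), float(m2)

def euclidean(X):
    """Euclidean distance matrix between the rows of X (hypothetical helper)."""
    return np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))

# Toy data (assumption): 40 objects in 5 dimensions, "mapped" by keeping
# the first two coordinates.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
D_orig, D_map = euclidean(X), euclidean(X[:, :2])

k_max = 10
for k in range(1, k_max + 1):
    m1, m2 = trustworthiness_continuity(D_orig, D_map, k)
    print(f"k={k:2d}  trustworthiness={m1:.3f}  continuity={m2:.3f}")
```

Plotting the two values against k gives a profile of the mapping: the left end of the curve reflects local quality and the right end reflects more global quality.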

Topographic error and topographic product quality measures may also be used to give information about the neighbourhood preservation of mapping algorithms.

Topographic error [32] takes only the first and second neighbours of each data point into account and it analyzes whether the nearest and the second nearest neighbours remain neighbours of the object in the mapped space or not. If these data points are not adjacent in the mapped graph the quality measure considers this a mapping error.

The sum of errors is normalised to the range from 0 to 1, where 0 means perfect topology preservation.
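A minimal sketch of the topographic error is given below. It assumes the formulation commonly used for SOM-like maps: for each data point the two closest map units are found, and an error is counted when these two units are not adjacent in the map graph. The codebook array `W`, the Boolean adjacency matrix `A` and the one-dimensional toy example are all assumptions made here for illustration.

```python
import numpy as np

def topographic_error(X, W, A):
    """Fraction of data points whose two closest map units are not adjacent.

    X : (n, d) data points
    W : (m, d) positions of the map units in the data space (assumed codebook)
    A : (m, m) Boolean adjacency matrix of the map graph (assumed given)
    """
    d = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)
    nearest_two = np.argsort(d, axis=1)[:, :2]
    first, second = nearest_two[:, 0], nearest_two[:, 1]
    errors = ~A[first, second]            # True where the pair is not adjacent
    return float(errors.mean())           # normalised to the range [0, 1]

# Map graph: a chain of four units, 0 - 1 - 2 - 3.
A = np.zeros((4, 4), dtype=bool)
for a, b in [(0, 1), (1, 2), (2, 3)]:
    A[a, b] = A[b, a] = True

X = np.array([[0.1], [1.4], [2.6]])

W_ordered = np.array([[0.0], [1.0], [2.0], [3.0]])   # units follow the data
W_folded = np.array([[0.0], [2.0], [1.0], [3.0]])    # units 1 and 2 swapped

print(topographic_error(X, W_ordered, A))   # no violations in the ordered map
print(topographic_error(X, W_folded, A))    # two of the three points violate
```

In the folded map, the points 0.1 and 2.6 each have a closest and a second-closest unit that are not linked in the chain, so two of the three data points are counted as errors.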

The topographic product, introduced by Bauer in 1992 [33], was developed for qualifying the mapping result of the SOM. This measure has an input parameter k, and it takes more than the two nearest neighbours into account. The topographic product compares the neighbourhood relationship between each pair of data points with respect to both their position in the resulting map and their original reference vectors in the observation space. As a result, it indicates whether the dimensionality of the output space is too small or too large.
