

4.1 Measures of distance

In this chapter we will often discuss methods suitable for classifying and grouping observations into homogeneous groups. In other words, we will consider the relationships between the rows of the data matrix, which correspond to observations. In order to compare observations, we need to introduce the idea of a distance measure, or proximity, between them. The indexes of proximity between pairs of observations furnish indispensable preliminary information for identifying homogeneous groups. More precisely, an index of proximity between any two observations $x_i$ and $x_j$ can be defined as a function of the corresponding row vectors in the data matrix:

$$IP_{ij} = f(x_i, x_j), \qquad i, j = 1, 2, \ldots, n.$$

We will use data from Chapter 6 as a running example throughout this section.

We have $n = 32711$ visitors to a website and $p = 35$ dichotomous variables that describe the behaviour of each visitor. In this case, a proximity index will be a function of two 35-dimensional row vectors. Knowledge of the indexes of proximity for every pair of visitors allows us to select those among them who are more similar, or at least less different, with the purpose of identifying groups that are as homogeneous as possible.

When the variables of interest are quantitative, the indexes of proximity typically used are called distances. If the variables are qualitative, the distance between observations can be measured by indexes of similarity. If the data are contained in a contingency table, the chi-squared distance can also be employed.

There are also indexes of proximity that are used on a mixture of qualitative and quantitative variables. We will examine the Euclidean distance for quantitative variables, and some indexes of similarity for qualitative variables.

4.1.1 Euclidean distance

Consider a data matrix containing only quantitative (or binary) variables. If $x$ and $y$ are rows from the data matrix, then a function $d(x, y)$ is said to be a distance between the two observations if it satisfies the following properties:

Non-negativity. $d(x, y) \geq 0$, for all $x$ and $y$.

Identity. $d(x, y) = 0 \Leftrightarrow x = y$, for all $x$ and $y$.

Symmetry. $d(x, y) = d(y, x)$, for all $x$ and $y$.

Triangle inequality. $d(x, y) \leq d(x, z) + d(y, z)$, for all $x$, $y$ and $z$.

To achieve a grouping of all the observations, the distance is usually considered between every pair of observations in the data matrix. All such distances can be collected in a distance matrix, which can be represented in the following way:

$$\Delta = \begin{pmatrix} 0 & d_{12} & \cdots & d_{1n} \\ d_{21} & 0 & \cdots & d_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ d_{n1} & d_{n2} & \cdots & 0 \end{pmatrix},$$

where the generic element $d_{ij}$ is a measure of distance between the row vectors $x_i$ and $x_j$. The Euclidean distance is the most commonly used distance measure.

It is defined, for any two units indexed by $i$ and $j$, as the square root of the sum of the squared differences between the corresponding elements of the two row vectors, in the $p$-dimensional Euclidean space:

$${}_2 d_{ij} = {}_2 d(x_i, x_j) = \sqrt{\sum_{s=1}^{p} (x_{is} - x_{js})^2}.$$
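As a minimal illustrative sketch (ours, not from the text; the function name and data are invented), the matrix of Euclidean distances between the rows of a small data matrix can be computed with NumPy:

```python
import numpy as np

def euclidean_distance_matrix(X):
    """Return the n x n matrix of Euclidean distances between the rows of X."""
    X = np.asarray(X, dtype=float)
    # Differences between every pair of rows, squared and summed over the p columns.
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy data matrix: n = 3 observations measured on p = 2 quantitative variables.
X = [[1.0, 2.0],
     [4.0, 6.0],
     [1.0, 2.0]]
print(euclidean_distance_matrix(X))   # symmetric, zero diagonal, d(0, 1) = 5.0
```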

The Euclidean distance can be strongly influenced by a single large difference in one of the dimensions, because the square greatly magnifies that difference. Dimensions with different scales (e.g. some values measured in centimetres, others in metres) are often the source of such overstated differences.

To overcome this limitation, the Euclidean distance is often calculated not on the original variables but on suitable transformations of them. The most common choice is to standardise the variables. After standardisation, every transformed variable contributes to the calculation of the distance with equal weight. When the variables are standardised, they have zero mean and unit variance; furthermore, it can be shown that, for $i, j = 1, \ldots, p$:

$${}_2 d_{ij}^2 = 2(1 - r_{ij}), \qquad r_{ij} = 1 - \frac{{}_2 d_{ij}^2}{2},$$

where $r_{ij}$ is the correlation coefficient between the observations $x_i$ and $x_j$. Thus the Euclidean distance between two observations is a function of the correlation coefficient between them.
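A small sketch of this idea (again ours, with invented data): standardising each variable to zero mean and unit variance before computing the Euclidean distances prevents any variable from dominating merely because of its scale.

```python
import numpy as np

def standardise(X):
    """Transform each column (variable) of X to zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Heights in centimetres and weights in kilograms: very different scales.
X = np.array([[170.0, 65.0],
              [180.0, 80.0],
              [160.0, 55.0]])

Z = standardise(X)
diff = Z[:, None, :] - Z[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))   # Euclidean distances on the standardised variables
print(D)
```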

4.1.2 Similarity measures

Given a finite set of observations $u_i \in U$, a function $S(u_i, u_j) = S_{ij}$ from $U \times U$ to $\mathbb{R}$ is called an index of similarity if it satisfies the following properties:

Non-negativity. $S_{ij} \geq 0$, for all $u_i, u_j \in U$.

Normalisation. $S_{ii} = 1$, for all $u_i \in U$.

Symmetry. $S_{ij} = S_{ji}$, for all $u_i, u_j \in U$.

Unlike distances, the indexes of similarity can be applied to all kinds of variables, including qualitative variables. They are defined with reference to the observation indexes, rather than to the corresponding row vectors, and they assume values in the closed interval [0, 1], making them easy to interpret.

The complement of an index of similarity is called an index of dissimilarity, and dissimilarities form a class of proximity indexes wider than that of the distances. Like a distance, a dissimilarity index satisfies the properties of non-negativity and symmetry. However, the normalisation property is not equivalent to the identity property of distances. Finally, dissimilarities do not have to satisfy the triangle inequality.

As we have observed, indexes of similarity can in principle be calculated for quantitative variables. However, they would be of limited use, since they would tell us only whether two observations have equal or different observed values on each variable, without saying anything about the size of the difference. From an operational viewpoint, the principal indexes of similarity refer to data matrices containing binary variables. More general cases, with variables having more than two levels, can be reduced to this setting through the technique of binarisation.
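For example, a qualitative variable with more than two levels can be replaced by one binary indicator variable per level; the following sketch (our own, with a hypothetical browser variable) illustrates the idea:

```python
def binarise(values):
    """Replace a qualitative variable with one 0/1 indicator variable per level."""
    levels = sorted(set(values))
    return [[1 if v == level else 0 for level in levels] for v in values], levels

# Hypothetical qualitative variable with three levels.
browser = ["Chrome", "Firefox", "Chrome", "Safari"]
indicators, levels = binarise(browser)
print(levels)        # ['Chrome', 'Firefox', 'Safari']
print(indicators)    # [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```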

Consider data on $n$ visitors to a website, which has $P$ pages. Correspondingly, there are $P$ binary variables, which take the value 1 if the specific page has been visited and the value 0 otherwise. To demonstrate the application of similarity indexes, we now analyse only the data concerning the behaviour of the first two visitors (2 of the $n$ observations) to the website described in Chapter 6, among the $P = 28$ web pages that they can visit. Table 4.1 summarises the behaviour of the two visitors, treating each page as a binary variable.

Table 4.1 Classification of the visited web pages.

                            Visitor A
                         1          0        Total
Visitor B      1      CP = 2     AP = 1          3
               0      PA = 4     CA = 21        25
           Total         6         22        P = 28

Note that, of the 28 pages considered, two have been visited by both visitors. In other words, 2 is the absolute frequency of joint occurrences (CP, for co-presence, or positive matches) for the two observations. In the lower right-hand cell of the inner 2 × 2 table there is a frequency of 21, equal to the number of pages visited by neither A nor B. This frequency corresponds to joint absences in the two observations (CA, for co-absence or negative matches). Finally, the frequencies 4 and 1 indicate the number of pages that only one of the two visitors has visited (PA for presence–absence and AP for absence–presence, where the first letter refers to visitor A and the second to visitor B).

The latter two frequencies express the differences between the two visitors and must therefore be treated in the same way, being symmetrical. The co-presences indicate similarity between the two visitors, a fundamental condition for them to belong to the same group. The co-absences are less important, perhaps of negligible importance, for determining the similarity between the two units. In fact, the indexes of similarity developed in the statistical literature differ in how they treat the co-absences, as we now describe.
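As a small illustration (our own sketch; the two visit vectors below are hypothetical, constructed only so as to reproduce the frequencies of Table 4.1), the four counts can be obtained directly from two binary vectors:

```python
def match_counts(a, b):
    """Count co-presences, co-absences and the two kinds of mismatch for 0/1 vectors."""
    cp = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)   # CP: both visited
    ca = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)   # CA: neither visited
    pa = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)   # PA: only A visited
    ap = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)   # AP: only B visited
    return cp, ca, pa, ap

# Hypothetical visit vectors over P = 28 pages, arranged to match Table 4.1.
visitor_a = [1, 1] + [1] * 4 + [0] + [0] * 21    # visits 6 pages in total
visitor_b = [1, 1] + [0] * 4 + [1] + [0] * 21    # visits 3 pages in total

print(match_counts(visitor_a, visitor_b))        # (2, 21, 4, 1)
```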

Russel–Rao similarity index

The Russel–Rao similarity index is a function of the co-presences and is equal to the ratio between the number of co-presences and the total number of binary variables considered, $P$:

$$S_{ij} = \frac{CP}{P}.$$

From Table 4.1 we have

$$S_{ij} = \frac{2}{28} \approx 0.07.$$

Jaccard similarity index

This index is the ratio between the number of co-presences and the total number of variables, excluding those that manifest co-absences:

$$S_{ij} = \frac{CP}{CP + PA + AP}.$$

Note that this index is not defined if the two visitors or, more generally, the two observations, manifest only co-absences ($CA = P$). In the example above we have

$$S_{ij} = \frac{2}{7} \approx 0.29.$$

Sokal–Michener similarity index

This is the ratio between the number of co-presences or co-absences and the total number of the variables:

$$S_{ij} = \frac{CP + CA}{P}.$$

In our example

$$S_{ij} = \frac{23}{28} \approx 0.82.$$

For the Sokal–Michener index (also called the simple matching coefficient or, with a slight abuse of terminology, the binary distance), it is simple to show that its complement to one (a dissimilarity index) corresponds to the average of the squared Euclidean distance between the two vectors of binary variables associated with the observations:

$$1 - S_{ij} = \frac{1}{P}\, {}_2 d_{ij}^2.$$

It is one of the commonly used indexes of similarity.
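Putting the three indexes together (again our own sketch, reusing the counts of Table 4.1), and checking the stated relation between the Sokal–Michener index and the squared Euclidean distance:

```python
# Counts from Table 4.1: co-presences, co-absences and the two mismatch types.
CP, CA, PA, AP = 2, 21, 4, 1
P = CP + CA + PA + AP                       # 28 binary variables (web pages)

russel_rao     = CP / P                     # 2/28  ~ 0.07
jaccard        = CP / (CP + PA + AP)        # 2/7   ~ 0.29
sokal_michener = (CP + CA) / P              # 23/28 ~ 0.82

# The squared Euclidean distance between the two 0/1 vectors equals the number
# of mismatches (PA + AP), so 1 - S (Sokal-Michener) equals that distance over P.
squared_euclidean = PA + AP
assert abs((1 - sokal_michener) - squared_euclidean / P) < 1e-12

print(round(russel_rao, 2), round(jaccard, 2), round(sokal_michener, 2))   # 0.07 0.29 0.82
```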

4.1.3 Multidimensional scaling

In the previous subsections we have seen how to calculate proximities between observations, on the basis of a given data matrix, or a table derived from it.

Sometimes, only the proximities between observations are available, for instance in terms of a distance matrix, and it is desired to reconstruct the values of the observations. In other cases, the proximities are calculated using a dissimilarity measure, and it is desired to reproduce them in terms of a Euclidean distance, to obtain a representation of the observations in a two-dimensional plane. Multidimensional scaling methods are aimed at representing observations whose observed values are unknown (or not expressed numerically) in a low-dimensional Euclidean space (usually $\mathbb{R}^2$). The representation is achieved by preserving the original distances as far as possible.

Section 3.5 explained how to use the method of principal components on a quantitative data matrix in a Euclidean space. It turns the data matrix into a lower-dimensional Euclidean projection by minimising the Euclidean distance between the original observations and the projected ones. Similarly, multidimensional scaling methods look for low-dimensional Euclidean representations of the observations, representations which minimise an appropriate distance between the original distances and the new Euclidean distances. Multidimensional scaling methods differ in how this distance is defined. The most common choice is the stress function, defined by

$$\sum_{i=1}^{n} \sum_{j=1}^{n} (\delta_{ij} - d_{ij})^2,$$

where the $\delta_{ij}$ are the original distances (or dissimilarities) between each pair of observations, and the $d_{ij}$ are the corresponding distances between the reproduced coordinates.

Metric multidimensional scaling methods look for $k$ real-valued $n$-dimensional vectors, each representing one coordinate measurement of the $n$ observations, such that the $n \times n$ distance matrix between the observations, expressed by the $d_{ij}$, minimises the squared stress function. Typically $k = 2$, so that the results of the procedure can be conveniently represented in a scatterplot. This solution is also known as least squares scaling. A variant of least squares scaling is Sammon mapping, which minimises

$$\sum_{i=1}^{n} \sum_{j=1}^{n} \frac{(\delta_{ij} - d_{ij})^2}{\delta_{ij}},$$

thereby preserving the smaller distances.
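The following sketch (ours, not from the text) minimises the stress function numerically for a small, invented distance matrix; setting sammon=True switches to the Sammon weighting shown above. It is only an illustration of the idea, not a production implementation:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist, squareform

def mds(delta, k=2, sammon=False, seed=0):
    """Find n points in R^k whose pairwise distances approximate the matrix delta."""
    delta = np.asarray(delta, dtype=float)
    n = delta.shape[0]
    # Sammon mapping weights each squared residual by 1/delta_ij (zero entries are harmless,
    # since their residuals vanish on the diagonal).
    w = 1.0 / np.where(delta > 0, delta, 1.0) if sammon else np.ones_like(delta)

    def stress(flat):
        d = squareform(pdist(flat.reshape(n, k)))   # distances of the current configuration
        return np.sum(w * (delta - d) ** 2)

    start = np.random.default_rng(seed).standard_normal(n * k)
    return minimize(stress, start).x.reshape(n, k)

# Invented example: distances between the corners of a unit square.
delta = squareform(pdist(np.array([[0.0, 0.0], [1, 0], [0, 1], [1, 1]])))
coords = mds(delta, k=2)                             # 2-D configuration for a scatterplot
print(squareform(pdist(coords)).round(2))            # should be close to delta
```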

When the proximities between objects are expressed by a Euclidean distance, it can be shown that the solution of the previous problem corresponds to the principal component scores that would be obtained if the data matrix were available.

It is possible to define non-metric multidimensional scaling methods, in which the relationship preserved between the original and the reproduced distances is not necessarily Euclidean. For further details, see Mardia et al. (1979).