2 Literature Review - Algorithms from and for Nature

The agreement metrics described in this paper are based upon the Rand index for clustering agreement. The Rand index (Rand 1971) was devised to compare clustering configurations and to help test the reliability of cluster analysis techniques.

A version of the Rand index, adjusted for random agreement, is described in Hubert and Arabie (1985). Versions of the Rand index to calculate agreement between solutions were developed independently in Akkucuk (2004), Akkucuk and Carroll (2006), Chen (2006), and Chen and Buja (2009). The basic agreement rate metric (AR) is described below.

Properties of a General Measure of Configuration Agreement 157

Consider two solution configurations A and B. A is an $n \times m_1$ matrix and B is an $n \times m_2$ matrix. Given a distance metric function $f$ that conforms to the distance axioms, let $f(A) = D_A$ and $f(B) = D_B$. For each item $i$, item $j$ is one of the $k$ nearest neighbors of $i$ if $d(i,j)$ is one of the $k$ smallest values of $d(i,l)$, where $l = 1, \ldots, n$ and $l \neq i$. An $n \times k$ matrix of nearest neighbor indexes can be created for each configuration. Let $N_A$ be the matrix of nearest neighbors for configuration A and $N_B$ be the matrix of nearest neighbors for configuration B. Let $a_i$ be the number of indexes in both row $i$ of $N_A$ and row $i$ of $N_B$. The overall agreement is given in (1).
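The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: it assumes Euclidean distance as the metric function $f$, breaks distance ties by item index, and assumes that the overall agreement is the mean of $a_i / k$ over all items, that is $\mathrm{AR}(k) = \frac{1}{nk}\sum_{i=1}^{n} a_i$. The function names `knn_indexes` and `agreement_rate` are hypothetical.

```python
from math import dist

def knn_indexes(X, k):
    """Return an n x k list of lists; row i holds the indexes of the
    k nearest neighbors of item i (Euclidean distance is an assumption)."""
    n = len(X)
    rows = []
    for i in range(n):
        others = [l for l in range(n) if l != i]          # exclude l = i
        others.sort(key=lambda l: dist(X[i], X[l]))       # stable sort: ties keep index order
        rows.append(others[:k])
    return rows

def agreement_rate(A, B, k):
    """AR(k), assumed here to be sum_i a_i / (n * k)."""
    NA, NB = knn_indexes(A, k), knn_indexes(B, k)
    # a_i = number of indexes present in both row i of N_A and row i of N_B
    a = [len(set(ra) & set(rb)) for ra, rb in zip(NA, NB)]
    return sum(a) / (len(a) * k)

# Two small configurations of the same four items (B swaps the positions of items 1 and 3).
A = [(0.0,), (1.0,), (3.0,), (7.0,)]
B = [(0.0,), (7.0,), (3.0,), (1.0,)]
print(agreement_rate(A, A, 2), agreement_rate(A, B, 2))
```

The two configurations only need the same number of items; their dimensionalities $m_1$ and $m_2$ may differ, because only the neighbor indexes are compared.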

An adjusted agreement metric (A) is described in Akkucuk (2004) and Akkucuk and Carroll (2006). Random agreement is subtracted from the agreement metric by averaging the agreement from multiple empirically generated random samples. In Chen (2006) and Chen and Buja (2009) an adjusted agreement metric is created by assuming a hypergeometric distribution. The expected agreement is given in (2) and the adjusted agreement is given in (3).
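The hypergeometric adjustment can be checked numerically. The sketch below assumes that, under random agreement, the $k$ neighbors of item $i$ in one configuration are a random $k$-subset of the other $n - 1$ items, so that $a_i$ follows a hypergeometric distribution, and that the adjusted agreement is the agreement rate minus its expectation. The function names are hypothetical.

```python
from math import comb

def expected_overlap(n, k):
    """Mean of a_i when a random k-subset of the n - 1 other items
    is compared with a fixed k-subset (hypergeometric assumption)."""
    N = n - 1
    total = comb(N, k)
    return sum(a * comb(k, a) * comb(N - k, k - a) for a in range(k + 1)) / total

def expected_agreement(n, k):
    """Expected AR(k) under random agreement; equals k / (n - 1)."""
    return expected_overlap(n, k) / k

def adjusted_agreement(ar, n, k):
    """Agreement rate with the expected random agreement subtracted."""
    return ar - expected_agreement(n, k)

print(expected_agreement(11, 2))
```

Summing the hypergeometric probabilities gives a mean overlap of $k^2 / (n - 1)$, so the expected agreement rate reduces to $k / (n - 1)$.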

$$E[\mathrm{AR}(k)] = \frac{1}{kn} \sum_{i=1}^{n} E[a_i] = \frac{k}{n-1} \tag{2}$$

A generalized agreement metric (France and Carroll 2007) is given in (4). This agreement metric is denoted as $\psi$. It is calculated across all $k$ and takes account of random agreement.

In (4), equal weights are given for all values of $k$. A further generalization of $\psi$, given in (5), allows for a weighting function $f(k)$.

158 S.L. France

The function can be restricted to certain values of $k$. For example, setting $f(k) = 1$ for $k = 1, \ldots, 4$ averages evenly over the 4 nearest neighbors. A weighting that is even for values of $k$ from 1 to $n/4$ and then declines linearly to 0 at $n/2$ is given in (6). The weighting scheme used depends on the application. For example, a cellphone company marketing manager may wish to examine a perceptual map of brand positions. The manager may want the answer to the question, “Am I competing more closely with Samsung or Nokia?” Thus, a good quality solution would have strong recovery of nearest neighbors. However, for visualization of a large scale nonlinear manifold, the overall global recovery of the manifold shape may be more important than the recovery of nearest neighbors.

$$f(k) = \begin{cases} 1 & 0 < k \le n/4 \\ 1 - \dfrac{k - n/4}{n/4} & n/4 < k \le n/2 \end{cases} \tag{6}$$
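As a small illustration, the weighting in (6) can be written as a function of $k$ and $n$. The name `weight` is hypothetical, and returning zero outside the two stated ranges is an assumption, since (6) only defines the function up to $n/2$.

```python
def weight(k, n):
    """Weighting function of (6): flat up to n/4, then linear decline to 0 at n/2."""
    if 0 < k <= n / 4:
        return 1.0
    if n / 4 < k <= n / 2:
        return 1.0 - (k - n / 4) / (n / 4)
    return 0.0  # assumption: no weight outside the ranges stated in (6)

# With n = 40 the weight is 1 up to k = 10 and reaches 0 at k = 20.
print([weight(k, 40) for k in (5, 10, 15, 20)])
```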

Several properties of $\psi$ are given in France and Carroll (2007). These properties are listed below; the proofs are given in the same paper.

1. AR is not monotonic with respect to k.

2. sup{ …

The $\psi$ metric can be thought of as analogous to a discrete Gini coefficient (Gini 1921), but with the “inequality” curve above rather than below the line of equality (or random agreement). Given a line of random agreement over $k$ and the value of AR plotted across $k$, an unweighted $\psi$ coefficient measures the total proportion of the area above the random agreement line that is below the AR line.
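The area interpretation can be made concrete with a short sketch. It is an assumption-laden reading of the description above rather than a transcription of the published formula: it takes the random agreement line to be $k / (n - 1)$, treats the "area" as a discrete sum over $k$, and applies an optional weight to each value of $k$. The function name `psi` and its arguments are hypothetical.

```python
def psi(ar_values, n, weights=None):
    """Proportion of the area above the random agreement line that lies
    below the AR line. ar_values[k - 1] holds AR(k) for k = 1..K."""
    K = len(ar_values)
    if weights is None:
        weights = [1.0] * K  # unweighted case: every k counts equally
    # Area between the AR line and the assumed random line k / (n - 1)
    num = sum(w * (ar - k / (n - 1))
              for k, (ar, w) in enumerate(zip(ar_values, weights), start=1))
    # Total area between perfect agreement (1) and the random line
    den = sum(w * (1 - k / (n - 1))
              for k, w in enumerate(weights, start=1))
    return num / den

n = 11
print(psi([1.0] * 5, n))                            # perfect agreement
print(psi([k / (n - 1) for k in range(1, 6)], n))   # agreement at the random level
```

Under these assumptions, perfect agreement at every $k$ gives a value of 1 and agreement exactly at the random level gives 0.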

The AR and $\psi$ metrics described in this section measure the proportion of items that are in both configurations. These metrics are symmetric and they are not affected by the order of the configurations. If one were to define $a_i$ as the number of indexes in row $i$ of $N_A$ but not in row $i$ of $N_B$, or vice versa, then the metrics would be asymmetric. Asymmetric agreement metrics are described in Kaski et al. (2003).

A framework for both symmetric and asymmetric agreement metrics is given in Lee and Verleysen (2009). The framework assumes a source high dimensional configuration and a derived low dimensional configuration. The nearest neighbor ranking of item $j$ for item $i$ is $\hat{r}_{ij}$ for the input configuration and $r_{ij}$ for the output configuration. Hard and soft deviations from agreement are listed below.

1. Hard Intrusion: $r_{ij} \le k < \hat{r}_{ij}$

2. Soft Intrusion: $r_{ij} < \hat{r}_{ij} \le k$

3. Hard Extrusion: $\hat{r}_{ij} \le k < r_{ij}$

4. Soft Extrusion: $\hat{r}_{ij} < r_{ij} \le k$
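The four cases can be expressed as a small classifier. This sketch assumes that extrusions mirror intrusions, so that a hard extrusion occurs when the input rank is within $k$ and the output rank is beyond $k$. The function name `classify` and its argument names are hypothetical; `r_in` stands for the input-configuration rank and `r_out` for the output-configuration rank.

```python
def classify(r_in, r_out, k):
    """Classify the deviation for one (i, j) pair at neighborhood size k."""
    if r_out <= k < r_in:
        return "hard intrusion"   # inside the output neighborhood only
    if r_out < r_in <= k:
        return "soft intrusion"   # inside both, ranked closer in the output
    if r_in <= k < r_out:
        return "hard extrusion"   # inside the input neighborhood only
    if r_in < r_out <= k:
        return "soft extrusion"   # inside both, ranked farther in the output
    return "none"                 # equal ranks, or outside both neighborhoods

# An item ranked 2nd in the input and 4th in the output configuration
for k in range(1, 6):
    print(k, classify(2, 4, k))
```

For ranks 2 and 4, the pair is a hard deviation only while $k$ separates the two ranks, and becomes a soft deviation once $k$ is large enough to include both.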


The intrusions and extrusions can be summarized in a “co-ranking” matrix. The AR and $\psi$ agreement metrics measure cases where there is a hard intrusion or extrusion. A method of using a diagonal subset of the co-ranking matrix is described in Lueks et al. (2011). What may be a hard intrusion or extrusion at one value of $k$ may be a soft intrusion or extrusion at a larger value of $k$. For example, consider a situation where item $l$ is the second nearest neighbor of item $i$ in configuration A and the fourth nearest neighbor of item $i$ in configuration B. Item $l$ gives a hard intrusion/extrusion for $k = 2$ and $k = 3$, but a soft intrusion/extrusion for $k \ge 4$.

For $\psi$, item $l$ affects the agreement rate for $k = 2$ and $k = 3$, creating a range of “non-agreement”.

3 Extension

We extend previous work by describing a partial agreement metric. The partial agreement metric is analogous to the partial correlation coefficient (Fisher 1924).

The rationale behind the partial agreement metric is to discount some configuration Z when calculating the agreement between configurations A and B. A marketing example could be the calculation of the agreement between perceptual product maps for a consumer after two different promotions A and B. The configuration for the consumer’s previous perceptual map Z would be discounted from the equation in order to emphasize the differences between configuration A and configuration B.

The equation for partial agreement is given in (8).

$$\psi_{AB \cdot Z} = \frac{\psi_{AB} - \psi_{AZ}\,\psi_{BZ}}{\sqrt{1 - \psi_{AZ}^{2}}\,\sqrt{1 - \psi_{BZ}^{2}}} \tag{8}$$

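As per the properties of the partial correlation coefficient, if $-1 \le \psi_{AB} \le 1$, $-1 \le \psi_{AZ} \le 1$, and $-1 \le \psi_{BZ} \le 1$ then $-1 \le \psi_{AB \cdot Z} \le 1$. If $\psi_{AZ} = 1$, $\psi_{AZ} = -1$, $\psi_{BZ} = 1$, or $\psi_{BZ} = -1$ then the partial agreement metric is undefined.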
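A minimal sketch of the partial agreement calculation follows. It assumes that (8) has the same form as the partial correlation coefficient, with the three pairwise agreement values already computed. The function name `partial_agreement` is hypothetical, and raising an error is one possible way to signal the undefined cases.

```python
from math import sqrt

def partial_agreement(psi_ab, psi_az, psi_bz):
    """Agreement between A and B after discounting configuration Z."""
    if abs(psi_az) == 1 or abs(psi_bz) == 1:
        # The denominator is zero, so the metric is undefined
        raise ValueError("partial agreement is undefined when |psi_AZ| or |psi_BZ| equals 1")
    return (psi_ab - psi_az * psi_bz) / (sqrt(1 - psi_az ** 2) * sqrt(1 - psi_bz ** 2))

# If Z agrees with neither A nor B, the partial agreement equals the plain agreement.
print(partial_agreement(0.5, 0.0, 0.0))
print(partial_agreement(0.8, 0.6, 0.6))
```

When both A and B agree strongly with Z, part of their mutual agreement is attributed to Z and the partial value falls below the raw agreement.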

From the document Algorithms from and for Nature (pages 165-168)