
G. Regression models as prototypes

Another family of objective functions that use non-point prototypes was introduced in Hathaway and Bezdek (1993). They called this family fuzzy c-regression models (FCRM). Minimization of a particular objective function in the family yields simultaneous estimates for the parameters of c regression models, together with a fuzzy c-partition of the data.

Let S = {(x_1, y_1), …, (x_n, y_n)} be a set of data where each independent observation x_k ∈ ℜ^s has a corresponding dependent observation y_k ∈ ℜ^t. In the simplest case we assume that a single functional relationship between x and y holds for all the data in S. In many cases a statistical framework is imposed on this problem to account for measurement errors in the data, and a corresponding optimal solution is sought. Usually, the search for a "best" function is partially constrained by choosing the functional form of f in the assumed relationship

y = f(x; β) + ε ,   (2.61)

where β ∈ Ω ⊂ ℜ^k is the vector of parameters defining f that is to be determined, and ε is a random vector with mean vector μ = 0 ∈ ℜ^t and covariance matrix Σ. The definition of an optimal estimate of β depends on the distributional assumptions made about ε and on the set Ω of feasible values of β. This type of model is well known and can be found in most texts on multivariate statistics.

The model considered by Hathaway and Bezdek (1993) is known as a switching regression model (Hosmer, 1974; Kiefer, 1978; Quandt and Ramsey, 1978; De Veaux, 1989). We assume S to be drawn from c models

y = f_i(x; β_i) + ε_i ,   1 ≤ i ≤ c ,   (2.62)

where β_i ∈ Ω_i ⊂ ℜ^{k_i}, and ε_i is a random vector with mean vector μ_i = 0 ∈ ℜ^t and covariance matrix Σ_i. Good estimates for the parameters B = {β_1, …, β_c} are desired as in the single-model case. Here, as in (2.24), β_i is the set of parameters for the i-th prototype, which in this case is the regression function f_i. However, we have the added difficulty that S is unlabeled: for a given datum (x_k, y_k), it is not known which model from (2.62) applies.
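To make the setup concrete, here is a minimal numpy sketch that draws unlabeled data from a c = 2 switching regression model in the sense of (2.62). The two linear models, the noise level, and all variable names are illustrative assumptions, not values from the text.

```python
# Sketch: generate unlabeled data from a c = 2 switching regression model (2.62).
# The two regimes and the noise scale below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0.0, 10.0, size=n)

labels = rng.integers(0, 2, size=n)           # true regime, hidden from the clusterer
betas = np.array([[2.0, 0.5], [-1.0, 1.5]])   # rows are (intercept, slope) per model
X = np.column_stack([np.ones(n), x])          # design matrix with intercept column
y = np.einsum('nj,nj->n', X, betas[labels]) + rng.normal(0.0, 0.3, n)

S = np.column_stack([x, y])   # the unlabeled data set S = {(x_k, y_k)}
```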

One approach for estimating the parameters {β_i} is to use the GMD-AO algorithm (Table 2.4). The approach taken here is more akin to fuzzy cluster analysis than statistics. The main problem is that the data in S are unlabeled, so numerical methods for estimation of the parameters almost always lead to equations which are coupled across classes. If S were partitioned into c crisp subsets corresponding to the regimes represented by the models in (2.62), then estimates for {β_1, …, β_c} could be obtained by simpler methods.

One alternative to using GMD-AO is to first find a crisp c-partition of S using an algorithm such as HCM, and then solve c separate single-model problems, using each crisp subset S_i with (2.61). This is usually not done because it may not explain the data structure properly: the effectiveness of using (2.61) for each crisp cluster depends on how accurate the crisp clusters are.

Hathaway and Bezdek formulated the two problems (partitioning S and estimating {β_1, …, β_c}, the parameters of the prototype functions {f_i(x; β_i)}) so that a simultaneous solution could be attempted. A clustering criterion is needed that explicitly accounts for both the form of the regression models and the need to partition the unlabeled data so that each cluster of S is well fit by a single model from (2.62). For the switching regression problem we interpret u_ik as the importance or weight attached to the extent to which the model value f_i(x_k; β_i) matches y_k. Crisp memberships (0's and 1's) in this context would place all of the weight in the approximation of y_k by f_i(x_k; β_i) on one class for each k. But fuzzy partitions enable us to represent situations where a data point fits several models equally well, or more generally, may fit all c models to varying degrees.

The measure of similarity in (2.24c) for the FCRM models is some measure of the quality of the approximation of y_k by each f_i: for 1 ≤ i ≤ c and 1 ≤ k ≤ n, define

E_ik(β_i) = measure of the error in the approximation f_i(x_k; β_i) ≈ y_k .   (2.63)

The most common example of such a measure is the vector norm E_ik(β_i) = ||f_i(x_k; β_i) − y_k||. The precise nature of (2.63) can be left unspecified to allow a very general framework. However, all choices for E_ik are required to satisfy the following minimizer property. Let a = (a_1, a_2, …, a_n)^T with a_k ≥ 0 ∀ k, and let E_i(β_i) = (E_i1(β_i), …, E_in(β_i))^T for 1 ≤ i ≤ c. We require that each of the c Euclidean dot products

⟨a, E_i(β_i)⟩ ,   1 ≤ i ≤ c ,   (2.64)

have a global minimum over Ω_i, the set of feasible values of β_i.

The general family of FCRM objective functions is defined, for U ∈ M_fcn and B = (β_1, …, β_c) ∈ Ω_1 × Ω_2 × ⋯ × Ω_c, by the fuzzy instance of (2.24a) with errors (2.63) that satisfy (2.64) inserted into (2.24c), that is, D_ik² = E_ik(β_i), so that

J_m(U, B) = Σ_{k=1}^{n} Σ_{i=1}^{c} (u_ik)^m E_ik(β_i) .

The basis for this approach is the belief that minimizers (U, B) of J_m(U, B) are such that U is a reasonable fuzzy partitioning of S and that {β_1, …, β_c} determine a good switching regression model.
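As a concrete reading of this definition, the following sketch evaluates J_m(U, B) for linear-in-parameters models with squared residuals as the error measure E_ik. The function name and array shapes are my own conventions.

```python
# Sketch: evaluate J_m(U, B) = sum_k sum_i u_ik^m * E_ik(beta_i) with
# E_ik(beta_i) = (y_k - x_k^T beta_i)^2. U is c x n, betas is c x s, X is n x s.
import numpy as np

def fcrm_objective(U, betas, X, y, m=2.0):
    R = betas @ X.T - y        # residuals f_i(x_k; beta_i) - y_k, shape (c, n)
    E = R ** 2                 # E_ik(beta_i), one row per model
    return float(np.sum((U ** m) * E))
```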

Minimization of (2.24a) under the assumptions of FCRM can be done with the usual AO approach whenever grouped coordinate minimization with analytic formulae is possible. Specifically, given data S, set m > 1, choose c parametric regression models {f_i(x; β_i)}, and choose a measure of error E = {E_ik} so that E_ik(β_i) ≥ 0 for all i and k, and for which the minimizer property defined by (2.64) holds. Pick a termination threshold ε > 0 and an initial partition U_0 ∈ M_fcn. Then for r = 0, 1, 2, …: calculate values for the c model parameters β_i^(r+1) that globally minimize (over Ω_1 × Ω_2 × ⋯ × Ω_c) the restricted objective function J_m(U_r, β_1, …, β_c). Update U_r → U_{r+1} ∈ M_fcn with the usual FCM update formula (2.7a). Finally, compare either ||U_{r+1} − U_r|| or ||B_{r+1} − B_r|| in some convenient matrix norm to the termination threshold ε. If the change between successive estimates is less than ε, stop; otherwise set r = r + 1 and continue.
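A minimal sketch of this AO loop, assuming linear-in-parameters models and squared-residual errors so that the prototype half-step has the closed form (2.65) derived below. The random initialization, the array shapes, and the small guard against zero errors are my additions.

```python
# Sketch of the FCRM-AO loop: alternate a closed-form weighted least-squares
# update of the model parameters with the FCM membership update (2.7a).
import numpy as np

def fcrm_ao(X, y, c=2, m=2.0, eps=1e-5, max_iter=100, seed=0):
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    U = rng.random((c, n))
    U /= U.sum(axis=0)                        # U_0 in M_fcn: columns sum to 1
    betas = np.zeros((c, X.shape[1]))
    for _ in range(max_iter):
        # Half-step 1: globally minimize over each beta_i (eq. 2.65).
        for i in range(c):
            D = U[i] ** m                     # diagonal of D_i
            betas[i] = np.linalg.solve((X.T * D) @ X, (X.T * D) @ y)
        # Half-step 2: FCM update (2.7a) with D_ik^2 = E_ik(beta_i).
        E = (betas @ X.T - y) ** 2 + 1e-12    # guard against exact zeros
        U_new = 1.0 / np.sum((E[:, None, :] / E[None, :, :]) ** (1.0 / (m - 1.0)), axis=1)
        if np.max(np.abs(U_new - U)) <= eps:  # ||U_{r+1} - U_r|| termination test
            return U_new, betas
        U = U_new
    return U, betas
```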

Solution of the switching regression problem with mixture decomposition using the GMD-AO algorithm can be regarded as the same optimization approach applied to the objective function

L(U, γ_1, …, γ_c) = Σ_{k=1}^{n} Σ_{i=1}^{c} u_ik (E_ik(γ_i) + ln(u_ik)) ,

see equation (11) of Bezdek et al. (1987a). In this case, the {γ_i} are the regression model parameters (the {β_i}), plus additional parameters such as the means, covariance matrices and mixing proportions associated with the c components of the mixture. Minimization with respect to B is possible since the measure of error satisfies the minimizer property, and L can be rewritten to look like a sum of functions of the form in (2.64).

For a specific example, suppose that t = 1, and for 1 ≤ i ≤ c: k_i = s, Ω_i = ℜ^s, f_i(x_k; β_i) = ⟨x_k, β_i⟩ = x_k^T β_i, and E_ik(β_i) = (y_k − x_k^T β_i)². Then J_m(U, B) is a fuzzy, multi-model extension of the least squares criterion for model fitting, and any existing software for solving weighted least squares problems can be used to accomplish the minimization. The explicit formulae for the new iterates β_i^(r+1), 1 ≤ i ≤ c, can easily be derived using calculus. Let X denote the matrix in ℜ^{n×s} having x_k as its k-th row; Y denote the vector in ℜ^n having y_k as its k-th component; and D_i denote the diagonal matrix in ℜ^{n×n} having (u_ik)^m as its k-th diagonal element. If the columns of X are linearly independent and u_ik > 0 for 1 ≤ k ≤ n, then

β_i^(r+1) = [X^T D_i X]^{−1} X^T D_i Y .   (2.65)


If the columns of X are not linearly independent, β_i^(r+1) can still be calculated directly, but techniques based on orthogonal factorizations of X should be used. Though it rarely occurs in practice, u_ik can equal 0 for some values of k, but this will cause singularity of [X^T D_i X] only in degenerate (and extremely unusual) cases. As a practical matter, β_i^(r+1) in (2.65) will be defined throughout the iteration if the columns of X are linearly independent.
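One way to follow this advice in practice is to avoid forming [X^T D_i X]^{−1} at all: scaling the rows of X and Y by √((u_ik)^m) reduces (2.65) to an ordinary least-squares problem, which an SVD-based solver handles even when X is rank deficient. A sketch, assuming numpy:

```python
# Sketch: compute the update (2.65) as a scaled ordinary least-squares problem.
# numpy's lstsq uses an orthogonal (SVD) factorization, so rank-deficient X
# is handled as recommended above.
import numpy as np

def fcrm_beta_update(X, y, u_i, m=2.0):
    w = np.sqrt(u_i ** m)                 # square roots of the diagonal of D_i
    beta, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
    return beta
```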

Global convergence theory from Zangwill (1969) can be applied for reasonable choices of E_ik(β_i) to show that any limit point of an iteration sequence will be a minimizer, or at worst a saddle point, of J_m(U, β_1, …, β_c). The local convergence result in Bezdek et al. (1987a) states that if the error measures {E_ik(β_i)} are sufficiently smooth and a standard convexity property holds at a minimizer (U*, B*) of J_m, then any iteration sequence started with U_0 sufficiently close to U* will converge to (U*, B*). Furthermore, the rate of convergence of the sequence will be q-linear.

The level of computational difficulty in the minimization of J_m with respect to B is a major consideration in choosing the particular measure of error E_ik(β_i). The best situation is when a closed-form solution for the new iterate β_i^(r+1) exists, such as in the example at (2.65). Fortunately, in cases where the minimization must be done iteratively, the convergence theory in Hathaway and Bezdek (1991) shows that a single step of Newton's method, rather than exact minimization, is sufficient to preserve the local convergence results. The case of inexact minimization in each half step is further discussed and exemplified in Bezdek and Hathaway (1992) in connection with the FCS algorithm of Davé.
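A sketch of what such an inexact half-step might look like: a single Newton step on β_i for a smooth error measure, with grad_J and hess_J as hypothetical callables supplying the gradient and Hessian of the restricted objective J_m(U_r, ·) with respect to β_i. This is an illustration under those assumptions, not the authors' implementation.

```python
# Sketch: one Newton step on beta_i in place of exact minimization.
# grad_J and hess_J are hypothetical callables for the restricted objective.
import numpy as np

def newton_half_step(beta_i, grad_J, hess_J):
    step = np.linalg.solve(hess_J(beta_i), grad_J(beta_i))
    return beta_i - step   # a single step suffices for local convergence
```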


Example 2.11 This example illustrates the use of FCRM to fit c = 2 quadratic regression models. The quadratic models are of the form

y = β_11 + β_12 x + β_13 x² , and   (2.66a)
y = β_21 + β_22 x + β_23 x² .   (2.66b)

The four data sets A, B, C and D specified in Table 2.6 were generated by computing y from (2.66a) or (2.66b) at n/2 fixed, equally spaced x-values across the interval given in Column 3 of Table 2.6. This resulted in sets of n points (which we pretend are unlabeled), half of which were generated from each of the two quadratics specified by the parameters in Columns 4 and 5 of Table 2.6. These four data sets are scatterplotted in Figure 2.14.

Table 2.6 Data from the quadratic models y = β_i1 + β_i2 x + β_i3 x²

  Set   n    x-interval       first quadratic               second quadratic
  A     46   [5, 27.5]        β_1A = (21, −2, 0.0625)       β_2A = (−5, 2, −0.0625)
  B     28   [9, 22.5]        β_1B = (21, −2, 0.0625)       β_2B = (−5, 2, −0.0625)
  C     30   [9, 23.5]        β_1C = (18, −1, 0.03125)      β_2C = (−2, 1, −0.03125)
  D     46   [10.5, 21.75]    β_1D = (172, −26, 1)          β_2D = (364, −38, 1)

FCRM iterations seeking two quadratic models were initialized at a pair of quadratics with parameters β_{1,0} = (−19, 2, 0) and β_{2,0} = (−31, 2, 0). Since the coefficients of the x² terms are zero, the initializing models are the dashed lines shown in Figure 2.14. FCRM run parameters were c = m = 2 and E_ik(β_i) = (y_k − β_i1 − β_i2 x_k − β_i3 x_k²)². Iteration was stopped as soon as the maximum change in absolute value between successive estimates of the six parameter values was less than or equal to ε = 0.00001, that is, ||B_{r+1} − B_r||_∞ ≤ 0.00001.
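For readers who want to reproduce the setup, here is a sketch that builds the (noise-free) Data Set A from Table 2.6 together with the quadratic design matrix [1, x, x²], ready to feed to an FCRM loop such as the one sketched earlier; variable names are mine.

```python
# Sketch: generate Data Set A of Table 2.6 (n = 46, x in [5, 27.5], half of
# the points from each quadratic) and its quadratic design matrix.
import numpy as np

n = 46
x = np.concatenate([np.linspace(5.0, 27.5, n // 2)] * 2)
beta_1A = np.array([21.0, -2.0, 0.0625])
beta_2A = np.array([-5.0, 2.0, -0.0625])
X = np.column_stack([np.ones(n), x, x ** 2])    # rows are (1, x_k, x_k^2)
y = np.concatenate([X[: n // 2] @ beta_1A, X[n // 2 :] @ beta_2A])
# Initialization from the text: beta_(1,0) = (-19, 2, 0), beta_(2,0) = (-31, 2, 0),
# i.e. two lines, since the x^2 coefficients are zero.
```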

[Four scatterplot panels: Data Set A, Data Set B, Data Set C and Data Set D]

Figure 2.14 Initial (dashed) and terminal (solid) models

Figure 2.14 shows the initial (dashed lines) and terminal regression models FCRM found when started at the given initialization. The initializing lines were neither horizontal nor vertical; they were inclined to the axes of symmetry of the data in every case. This initialization led to successful termination at the true values of the generating quadratics very rapidly (6-10 iterations) for all four data sets. The terminal fits to the data are good in all four cases (accurate to essentially machine precision). In the source paper for this example FCRM detected and characterized the quadratic models generating these four data sets correctly in 9 of 12 attempts over three different pairs of initializing lines.

FCRM differs from quadric c-shells most importantly in the sense that the regression functions, which are the FCRM prototypes, need not be recognizable geometric entities. Thus, data whose functional dependency is much more complicated than hyperquadric can (in principle at least) be accommodated by FCRM. Finally, FCRM explicitly recognizes functional dependency between grouped subsets of independent and dependent variables in the data, whereas none of the previous methods do. These are the major differences between FCRM and all the other non-point prototype clustering methods discussed in this section. In the terminology of Section 4.6, FCRM is really more aptly described as a "system identification" method, the system being the mixed c-regression models.