CONCLUSION AND FUTURE TRENDS - Data Mining: A Heuristic Approach

In this work, we have put forward a case for approximation of objective criteria and of proximity information as keys to the development of generic, scalable and robust methods for clustering for data mining applications. In particular, random-ized sampling and partition techniques seem to hold the greatest promise for pushing back the barrier of scalability for these important problems.

Although some of the solutions we have presented are specific to certain problem settings, others can be applied in a wide variety of settings, both inside and beyond the field of data mining clustering. The idea of using approximations to full proximity information is a very general one, and there is much potential for the application of these ideas in other settings. The underlying spirit of these solutions is the same: obtain the highest quality possible subject to a rigid observation of time constraints.

There are other settings in which approximation to the full proximity informa-tion has been used to good effect. In physics, researchers applied simulated annealing to predict the motion of particles in 2D and 3D used a hierarchical structure to aggregate distance computations (Carrier, Greengard & Rokhlin, 1988;

Barnes & Hut, 1986), reducing the cost of an iteration of the simulation from Θ(n²) to O(n log n). These structures have been taken up and extended to the area of graph drawing, where graphs are treated as linkages of stretchable springs between nodes with repulsive charges. Layouts of these graphs are generated by simulating the effect of forces along springs and between nodes, also using simulated annealing (Quigley & Eades, 2000). Similar structures have also been used in facility location (Belbin, 1987). Although scalability with respect to dimension is not an issue in these fields, the techniques presented here provide yet more opportunities for the development of fast and robust algorithms.

Our techniques contribute to the area of robust statistics. We showed (Estivill-Castro & Houle, 2000) that robust estimators of location could be computed in subquadratic time using random sampling and partitioning. Even though the statistic itself is a random variable, its robustness can be proved according to several standards.

Finally, there is great potential for our techniques to be hybridized with other search strategies, such as genetic algorithms (Estivill-Castro, 2000; Estivill-Castro

& Torres-Velázquez, 1998; Estivill-Castro & Murray, 2000; and references) and

simulated annealing [Murray & Church, 1996]. Genetic algorithms and simulated annealing have extended the power of local search hill-climbers due to their ability to escape from and improve over local optima, at the cost of an increase in the computation time. These areas can also benefit from a stricter management of the distance computations performed.

REFERENCES

Aldenderfer, M.S. and R.K. Blashfield (1973.) Cluster Analysis. Sage, Beverly Hills, 1984.

M.R. Anderberg. Cluster Analysis with Applications. Academic Press, NY.

Aurenhammer, F. (1991). Voronoi diagrams: A survey of a fundamental geometric data structure. ACM Comput. Surv., 23(3):345-405.

Barnes, J. and P. Hut (1986). A hierarchical O(n log n) force-calculation algorithm. Nature, 324(4), 446-449.

Beckmann, N., H.-P. Kriegel, R. Schneider, and B. Seeger (1990). The R*-tree: An efficient and robust access method for points and rectangles. In Proc. ACM SIGMOD Conf. on Management of Data, 322-331.

Belbin, L. (1987). The use of non-hierarchical allocation methods for clustering large sets of data. The Australian Computer Journal, 19(1), 32-41.

Bentley, J.L. (1975). A survey of techniques for fixed radius near neighbor searching.

Report STAN-CS-78-513, Dept. Comput. Sci., Stanford Univ., Stanford, CA.

Bentley, J.L. (1979). Decomposable searching problems. Information Processing Letters, 8, 244-251.

Berchtold, S. D.A. Keim, and H.-P. Kriegel(1996).. The X-tree: An index structure for higher dimensional data. In Proc. 22nd VLDB Conference, 28-39.

Berry, M.J.A. and G. Linoff (1997). Data Mining Techniques - for Marketing, Sales and Customer Support. John Wiley & Sons, NY, USA.

Berson, A. and S.J. Smith (1998). Data Warehousing, Data Mining, & OLAP. Series on Data Warehousing and Data Management. McGraw-Hill, NY, USA.

Bradley, P.S. and U. Fayyad (1998). Refining the initial points in k-means clustering. In Proc. of the 15th Int. Con.e on Machine Learning,Morgan Kaufmann, 91-99.

Bradley, P.S., U. Fayyad, and C. Reina (1998). Scaling clustering algorithms to large databases. In R. Agrawal and P. Stolorz, eds., Proc. of the 4th Int. Conference on Knowledge Discovery and Data Mining, AAAI Press, 9-15.

Bradley, P.S., O.L. Mangasarian, and W.N. Street (1997). Clustering via concave minimi-zation. Advances in neural information processing systems, 9:368.

Brucker, P. (1978). On the complexity of clustering problems. In R. Henn, B.H.B. Korte, and W.W. Oetti, eds., Optimization and Operations Research: Proceedings of the workshop held at the University of Bonn, Berlin. Springer Verlag Lecture Notes in Economics and Mathematical Systems.

Carrier, J., L. Greengard, and V. Rokhlin (1988). A fast adaptive multipode algorithm for particle simulation. SIAM J. of Science and Statistical Computing, 9:669-686.

Chávez, E., G. Navarro, R. Baeza-Yates, and J. Marroquín (1999). Searching in metric spaces. Report TR/DCC-99-3, Dept. of Comp. Science, U. of Chile, Santiago.

Cheeseman, P., M. Self, J. Kelly, W. Taylor, D. Freeman, and J. Stutz (1988). Bayesian classification. In Proc. Seventh National Conference on Artificial Intelligence, 607-611, Palo Alto, CA, AAAI, Morgan Kaufmann.

Cherkassky, V., and F. Muller (1998). Learning from Data - Concept, Theory and Methods.

John Wiley & Sons, NY, USA.

Chiu, D.K.Y., A.K.C. Wong, and B. Cheung (1991). Information discovery through hierarchical maximum entropy discretization and synthesis. In G. Piatetsky-Shapiro and W.J. Frawley, eds., Knowledge Discovery in Databases, pages 125-140, Menlo Park, CA. AAAI, AAAI Press.

Ciaccia, P., M. Patella, and P. Zezula (1997). M-tree: an efficient access method for similarity search in metric spaces. In Proc. 23rd VLDB Conference, pages 426-435.

Dempster, A.P., N.M. Laird, and D.B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1-38.

Densham, P. and G. Rushton (1992). A more efficient heuristic for solving large p-median problems. Papers in Regional Science, 71:307-329.

Dickerson, M.T., R.L.S. Drysdale, and J.-R. Sack (1992). Simple algorithms for enumer-ating interpoint distances and finding k nearest neighbours. International Journal of Computational Geometry & Applications, 2(3):221-239, 1992.

Duda, R.O. and P.E. Hart (1973). Pattern Classification and Scene Analysis. John Wiley

& Sons, NY, USA.

Ester, M., H.P. Kriegel, S. Sander, and X. Xu (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. E. Simoudis, J. Han, and U.

Fayyad, eds., Proc. 2nd Int. Conf. Knowledge Discovery and Data Mining 226-231, Menlo Park, CA, AAAI, AAAI Press.

Estivill-Castro, V. (2000). Hybrid genetic algorithms are better for spatial clustering. In R.

Mizoguchi and J. Slaney, eds., Proc. 6th Pacific Rim Int. Conf. on Artificial Intelligence, 424-434, Melbourne, Australia,Springer-Verlag LNAI 1886.

Estivill-Castro, V. and M.E. Houle (forthcoming). Robust distance-based clustering with applications to spatial data mining. Algorithmica. In press - Special Issue on Algorithms for Geographic Information.

Estivill-Castro, V. and M.E. Houle (1999). Robust clustering of large data sets with categorical attributes. In J. Roddick, editor, Database Systems - Australian Computer Science Communications, 21(2), 165-176, Springer Verlag, Singapore.

Estivill-Castro, V. and M.E. Houle (2000). Fast randomized algorithms for robust estima-tion of locaestima-tion. In J. Roddick and K. Hornsby, eds., Proc. Int. Workshop on Temporal, Spatial and Spatio-Temporal Data Mining, in conjunction with the 4th European Conf.

on Principles and Practices of Knowledge Discovery and Databases, 74-85, Lyon, France, 2000. Springer-Verlag LNAI 2007.

Estivill-Castro, V. and M.E. Houle (2001). Fast minimization of total within-group distance. M. Ng, ed., Proc. Int. Workshop Spatio-Temporal Data Mining in conjunction with 5th Pacific-Asia Conf. Knowledge Discovery and Data Mining, Hong Kong 2001.

Estivill-Castro, V. and A.T. Murray (1998). Discovering associations in spatial data - an efficient medoid based approach. In X. Wu, R. Kotagiri, and K.K. Korb, eds., Proc. 2nd Pacific-Asia Conf. on Knowledge Discovery and Data Mining, 110-121, Melbourne, Australia, Springer-Verlag LNAI 1394.

Estivill-Castro, V. and A.T. Murray (2000). Weighted facility location and clustering via hybrid optimization. In F. Naghdy, F. Kurfess, H. Ogata, E. Szczerbicki, and H. Bothe, eds., Proc. Int. Conf. on Intelligent Systems and Applications (ISA-2000), Paper 1514-079, Canada, 2000. ICSC, ICSC Academic Press. CD-ROM version.

Estivill-Castro, V. and R. Torres-Velázquez (1999). Hybrid genetic algorithm for solving the p-median problem. In A Yao, R.I. McKay, C.S. Newton, J.-H Kim, and T. Furuhashi,

eds., Proc. of Second Asia Pacific Conference On Simulated Evolution and Learning SEAL-98, 18-25. Springer Verlag LNAI 1585.

Estivill-Castro, V. and J. Yang (2000).. A fast and robust general purpose clustering algorithm. R. Mizoguchi & J. Slaney, eds, Proc. 6th Pacific Rim Int. Conf. Artificial Intelligence, 208-218, Melbourne, Australia, 2000. Springer-Verlag LNAI 1886.

Everitt,B. (1980). Cluster Analysis. Halsted Press, New York, USA, 2nd. edition, 1980.

Fayyad, U., C. Reina, and P.S. Bradley (1998). Initialization of iterative refinement clustering algorithms. R. Agrawal and P. Stolorz, eds., Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining, 194-198. AAAI Press.

Fisher, D.H.(1987). Knowledge acquisition via incremental conceptual clustering. Ma-chine Learning, 2(2):139-172.

Glover, F. (1986). Future paths for integer programming and links to artificial intelligence.

Computers and Operations Research, 5:533-549.

Gonnet, G.H. and R. Baeza-Yates (1991). Handbook of Algorithms and Data Structures, 2nd edition. Addison-Wesley Publishing Co., Don Mills, Ontario.

Guttmann, A. (1984). R-trees: a dynamic index structure for spatial searching. In Proc. ACM SIGMOD International Conference on Management of Data, pages 47-57, 1984.

Hall, I.B. (1999). L.O. Özyurt and J.C. Bezdek. Clustering with a genetically optimized approach. IEEE Transactions on Evolutionary Computation, 3(2):103-112, July 1999.

Han, J. and M. Kamber (2000). Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, San Mateo, CA.

Hartigan, J.A. (1975). Clustering Algorithms. Wiley, NY.

Horn, M. (1996). Analysis and computation schemes for p-median heuristics. Environment and Planning A, 28:1699-1708.

Jain, A.K. and R.C. Dubes (1998). Algorithms for Clustering Data. Prentice-Hall, Inc., Englewood Cliffs, NJ, Advanced Reference Series: Computer Science.

Kaufman, L. and P.J. Rousseuw (1990).. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, NY, USA.

Krivánek, M. (1986). Hexagonal unit network — a tool for proving the NP-completeness results of geometric problems. Information Processing Letters, 22:37-41.

Krznaric, D. and C. Levcopoulos (1998). Fast algorithms for complete linkage clustering.

Discrete & Computational Geometry, 19:131-145.

MacQueen, J.(1967). Some methods for classification and analysis of multivariate obser-vations. L. Le Cam and J. Neyman, eds., 5th Berkley Symposium on Mathematical Statistics and Probability, volume 1, 281-297.

Michalski, R.S. and R.E. Stepp (1983). Automated construction of classifications: cluster-ing versus numerical taxonomy. IEEE Tran. on Pattern Analysis and Machine Intelli-gence, 5:683-690.

Mitchell, T.M. (1997). Machine Learning. McGraw-Hill, Boston, MA, 1997.

Murray, A.T. (2000). Spatial characteristics and comparisons of interaction and median clustering models. Geographical Analysis, 32(1):1-.

Murray, A.T. and R.L. Church (1996). Applying simulated annealing to location-planning models. Journal of Heuristics, 2:31-53, 1996.

Murra, A.T. and V. Estivill-Castro (1998). Cluster discovery techniques for exploratory spatial data analysis. Int. J. of Geographic Information Systems, 12(5):431-443.

Ng, R.T. and J. Han (1994). Efficient and effective clustering methods for spatial data mining. In J. Bocca, M. Jarke, and C. Zaniolo, eds., Proc. 20th VLDB Conference, 144-155, San Francisco, CA, Santiago, Chile, Morgan Kaufmann.

Okabe,A., B. Boots, and K. Sugihara (1992). Spatial Tesselations - Concepts and applica-tions of Voronoi diagrams. John Wiley & Sons, NY, USA.

O’Rourke, J. (1994). Computational Geometry in C. Cambridge University Press, UK.

Quigley, A.J. and P. Eades (1984). FADE: Graph drawing, clustering and visual abstraction.

In J. Marks, editor, Proc. 8ht Int. Symposium on Graph Drawing, Williamsburg Virginia, USA, 2000. Springer Verlag Lecture Notes in Computer Science.

Rao, M. (1971). Cluster analysis and mathematical programming. Journal of the American Statistical Association, 66:622-626.

Rolland, D., E. Schilling and J. Current (1996). An efficient Tabu search procedure for the p-median problem. European Journal of Operations Research, 96:329-342.

Rosing, K. and C. ReVelle (1986). Optimal clustering. Environment and Planning A, 18:1463-1476.

Rosing, K.E., C.S. ReVelle, and H. Rosing-Voyelaar (1979). The p-median and its linear programming relaxation: An approach to large problems. Journal of the Operational Research Society, 30:815-823.

Rousseeuw, P.J. and A.M. Leroy (1987). Robust regression and outlier detection. John Wiley & Sons, NY, USA.

Samet, H. (1989). The Design and Analysis of Spatial Data Structures. Addison-Wesley Publishing Co., Reading, MA.

Schikuta, E. (1996). Grid-clustering: an efficient hierarchical clustering method for very large data sets. In Proc. 13th Int. Conf. on Pattern Recognition, vol. 2, 101-105.

Späth, H. (1980). Cluster Analysis Algorithms for data reduction and classification of objects. Ellis Horwood Limited, Chinchester, UK.

Teitz, M.B. and P. Bart (1968). Heuristic methods for estimating the generalized vertex median of a weighted graph. Operations Research, 16, 955-961.

Titterington, D.M., A.F.M. Smith, and U.E. Makov (1985). Statistical Analysis of Finite Mixture Distributions. John Wiley & Sons, UK.

Vinod, H.(1969). Integer programming and the theory of grouping. Journal of the American Statistical Association, 64, 506-517.

Wallace, C.S. and P.R. Freeman (1987). Estimation and inference by compact coding.

Journal of the Royal Statistical Society, Series B, 49(3), 240-265, 1987.

Wang, W., J. Yang, and R. Muntz (1997). STING: a statistical information grid approach to spatial data mining. In M. Jarke, editor, Proc. 23rd VLDB Conference, pages 186-195, Athens, Greece. VLDB, Morgan Kaufmann Publishers.

Zhang, T., R. Ramakrishnan, and M. Livny (1996). BIRCH: an efficient data clustering method for very large databases. SIGMOD Record, 25(2),103-114.

Dans le document Data Mining: A Heuristic Approach (Page 52-57)