APPLICATIONS AND CHALLENGES

Some of the important issues in data mining include the identification of appli-cations for existing techniques, and developing new techniques for traditional as well as new application domains, like the Web, E-commerce, and Bioinfor-matics. Some of the existing practical uses of data mining exist in (i) tracking fraud, (ii) tracking game strategy, (iii) target marketing, (iv) holding on to good customers, and (v) weeding out bad customers, to name a few. There are many other areas we can envisage, where data mining can be applied.

Some of these areas are as follows.

• Medicine: Determine disease outcome and effectiveness of treatments, by analyzing patient disease history to find some relationship between diseases.

26 INTRODUCTION TO DATA MINING

• Molecular or pharmaceutical: Identify new drugs.

• Security: Face recognition, identification, biometrics, etc.

• Judiciary: Search and access of historical data on judgement of similar cases.

• Biometrics: Positive identification of a person from a large image, fin-gerprint or voice database.

• Multimedia retrieval: Search and identification of image, video, voice, and text from multimedia database, which may be compressed.

• Scientific data analysis: Identify new galaxies by searching for subclus-ters.

• Web site or Web store design, and promotion: Find affinity of visitors to Web pages, followed by subsequent layout modification.

• Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing pro-grams.

• Land use: Identify areas of similar land use in an earth observation database.

• Insurance: Identify groups of motor insurance policy holders with a high average claim cost.

• City-planning: Identify groups of houses according to their house type, value, and geographical location.

• Geological studies: Infer that observed earthquake epicenters are likely to be clustered along continental faults.

The first generation of data mining algorithms has been demonstrated to be of significant value across a variety of real-world applications. But these work best for problems involving a large set of data collected into a single database, where the data are described by numeric or symbolic features. Here the data invariably do not contain text and image features interleaved with these features, and they are carefully and cleanly collected with a particular decision-making task in mind.

Development of new generation algorithms is expected to encompass more diverse sources and types of data that will support mixed-initiative data min-ing, where human experts collaborate with the computer to form hypotheses and test them. The main challenges to the data mining procedure, to be considered for future research, involve the following.

1. Massive datasets and high dimensionality. Huge datasets create combi-natorially explosive search space for model induction, and they increase

the chances that a data mining algorithm will find spurious patterns that are not generally valid. Possible solutions include robust and efficient algorithms, sampling approximation methods, and parallel processing.

Scaling up of existing techniques is needed - for example, in the cases of classification, clustering, and rule mining.

2. User interaction and prior knowledge. Data mining is inherently an interactive and iterative process. Users may interact at various stages, and domain knowledge may be used either in the form of a high-level specification of the model or at a more detailed level. Visualization of the extracted model is also desirable for better user interaction at different levels.

3. Over-fitting and assessing the statistical significance. Datasets used for mining are usually huge and available from distributed sources. As a result, often the presence of spurious data points leads to over-fitting of the models. Regularization and re-sampling methodologies need to be emphasized for model design.

4. Understandability of patterns. It is necessary to make the discoveries more understandable to humans. Possible solutions include rule struc-turing, natural language representation, and the visualization of data and knowledge.

5. Nonstandard and incomplete data. The data can be missing and/or noisy. These need to be handled appropriately.

6. Mixed media data. Learning from data that are represented by a com-bination of various media, like (say) numeric, symbolic, images, and text.

7. Management of changing data and knowledge. Rapidly changing data, in a database that is modified or deleted or augmented, may make previ-ously discovered patterns invalid. Possible solutions include incremental methods for updating the patterns.

8. Integration. Data mining tools are often only a part of the entire decision-making system. It is desirable that they integrate smoothly, both with the database and the final decision-making procedure.

9. Compression. Storage of large multimedia databases is often required to be in compressed form. Hence the development of compression tech-nology, particularly suitable for data mining, is required. It would be even more beneficial if data can be accessed in the compressed domain [24].

10. Human Perceptual aspects for data mining. Many multimedia data min-ing systems are intended to be used by humans. So it is a pragmatic

28 INTRODUCTION TO DATA MINING

approach to design multimedia systems and underlying data mining techniques based on the needs and capabilities of the human percep-tual system. The ultimate consumer of most perceppercep-tual information is the 'Human Perceptual System?. Primarily, the Human Perceptual Sys-tem consists of the 'Human Visual SysSys-tem¹ and the 'Human Auditory System'. How these systems work synergistically is still not completely understood and is a subject of ongoing research. We also need to focus some attention in this direction so that their underlying principles can be adopted while developing data mining techniques, in order to make these more amenable and natural to the human customer.

11. Distributed database. Interest in the development of data mining sys-tems in a distributed environment will continue to grow. In today's networked society, data are not stored or archived in a single storage system unit. Problems arise while handling extremely large heteroge-neous databases spread over multiple files, possibly in different disks or across the Web in different geographical locations. Often combining such data in a single very large file may be infeasible. Development of algorithms for mining data from distributed databases will open up newer areas of applications in the near future.

Dans le document Data Mining (Page 44-47)