Data Integration and Data Cleaning Techniques: A data analysis task generally includes data integration, which combines data from multiple sources into a coherent data store. These sources may include multiple databases or flat files. A number of problems can arise during data integration. Real-world entities can be given different names in different data sources: how does an analyst know that employee-id in one database is the same as employee-number in another? We plan to use metadata to solve this problem. Data coming from input sources tends to be incomplete, noisy and inconsistent. If such data is loaded directly into the DW, it can cause errors during the analysis phase and lead to incorrect results. Data cleaning methods attempt to smooth out the noise, identify outliers, and correct inconsistencies in the data. We are investigating the following techniques for noise reduction and data smoothing.

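a) Binning: These methods smooth a sorted data value by consulting its neighborhood, that is, the values around it. For example, the sorted values are partitioned into bins and each value is replaced by the mean of its bin.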

b) Clustering: Outliers may be detected by clustering, where similar values are organized into groups or clusters. Intuitively, values that fall outside of the set of clusters may be considered outliers.

c) Regression: Data can be smoothed by fitting the data to a function, such as with regression. Using regression to find a mathematical equation to fit the data helps smooth out the noise.
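The three techniques listed above can be illustrated with a minimal Python sketch. It is not the implementation used in our system: the function names, the equal-frequency binning, the gap-based one-dimensional clustering, and the parameters `bin_size`, `max_gap` and `min_cluster_size` are all assumptions chosen for illustration.

```python
from statistics import mean


def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of `bin_size` items,
    and replace every value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


def gap_cluster_outliers(values, max_gap, min_cluster_size=2):
    """Group sorted 1-D values into clusters separated by gaps larger
    than `max_gap` (an assumed, very simple clustering rule) and return
    the values that fall in clusters smaller than `min_cluster_size`."""
    ordered = sorted(values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for value in ordered[1:]:
        if value - clusters[-1][-1] <= max_gap:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return [v for c in clusters if len(c) < min_cluster_size for v in c]


def smooth_by_linear_regression(xs, ys):
    """Fit y = intercept + slope * x by least squares and return the
    fitted y values. Assumes xs contains at least two distinct values."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [intercept + slope * x for x in xs]


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
binned = smooth_by_bin_means(prices, bin_size=3)
outliers = gap_cluster_outliers(prices + [95], max_gap=10)
fitted = smooth_by_linear_regression([1, 2, 3, 4], [2, 4, 6, 8])
```

In practice the bin size and the gap threshold would have to be tuned to the data, and a multi-dimensional clustering method would replace the gap rule.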

Data pre-processing is an important step for data analysis. Detecting data integration problems, rectifying them and reducing the amount of data to be analyzed can result in great benefits during the data analysis phase.
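The metadata-based approach to naming conflicts mentioned above (employee-id versus employee-number) might be sketched as follows. The source names `hr_db` and `payroll_db`, the canonical attribute names, and both helper functions are hypothetical and serve only to show the idea of a per-source attribute mapping.

```python
# Hypothetical metadata: for each source, map its local attribute names
# to the canonical attribute names used in the warehouse.
ATTRIBUTE_METADATA = {
    "hr_db": {"employee-id": "employee_key", "name": "employee_name"},
    "payroll_db": {"employee-number": "employee_key", "salary": "annual_salary"},
}


def to_canonical(source, record, metadata=ATTRIBUTE_METADATA):
    """Rename the attributes of one source record using the metadata."""
    mapping = metadata[source]
    unknown = set(record) - set(mapping)
    if unknown:
        raise KeyError(
            f"no metadata for attributes {sorted(unknown)} in source {source!r}"
        )
    return {mapping[attr]: value for attr, value in record.items()}


def integrate(records_by_source, key="employee_key"):
    """Merge records from all sources that share the same canonical key."""
    merged = {}
    for source, records in records_by_source.items():
        for record in records:
            canonical = to_canonical(source, record)
            merged.setdefault(canonical[key], {}).update(canonical)
    return merged


integrated = integrate({
    "hr_db": [{"employee-id": 7, "name": "Ada"}],
    "payroll_db": [{"employee-number": 7, "salary": 50000}],
})
```

Rejecting attributes that have no metadata entry, rather than passing them through silently, is a design choice that makes missing mappings visible at load time.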

7.2 Current Research in the area of Data Warehouse Maintenance

A number of techniques for view maintenance and propagation of changes from the source databases to the data warehouse (DW) have been discussed in the literature. [5] and [14] describe techniques for view maintenance and for refreshing the data in a DW. [15] also describes techniques for maintaining data cubes and summary tables in a DW environment. However, the problem of propagating changes in a DW environment is more complicated for the following reasons:

a) In a DW, data is not refreshed after every modification to the base data. Rather, large batch updates to the base data must be considered, which requires new algorithms and techniques.

b) In a DW environment, it is necessary to transform the data before it is deposited into the DW. These transformations may include aggregating or summarizing the data.

c) The requirements of a data source may change during its life cycle, which may force schema changes in that source. Techniques are therefore required that can deal with both source data changes and schema changes. [19] describes some techniques for dealing with schema changes in the data sources.
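Points (a) and (b) can be made concrete with a small sketch of batch maintenance for a summary table. This is not one of the algorithms from the cited papers: the function `apply_batch`, the `(group, amount)` row layout and the `(count, total)` summary layout are assumptions. The sketch relies only on the fact that COUNT and SUM can be updated from inserted and deleted rows without rescanning the base data.

```python
def apply_batch(summary, inserted=(), deleted=()):
    """Apply one batch of base-data changes to a summary table.

    `summary` maps a group to a (count, total) pair. `inserted` and
    `deleted` are iterables of (group, amount) base rows. The sketch
    assumes every deleted row was previously inserted.
    """
    for sign, rows in ((1, inserted), (-1, deleted)):
        for group, amount in rows:
            count, total = summary.get(group, (0, 0))
            summary[group] = (count + sign, total + sign * amount)
    # Remove groups that no longer have any base rows.
    for group in [g for g, (count, _) in summary.items() if count == 0]:
        del summary[group]
    return summary


# Initial load, followed by one batch refresh.
summary = apply_batch({}, inserted=[("east", 10), ("east", 5), ("west", 7)])
summary = apply_batch(summary, inserted=[("west", 3)], deleted=[("east", 10)])
```

Aggregates such as MIN and MAX cannot be maintained this way under deletions, which is one reason batch maintenance needs more than this simple scheme.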

[6] and [13] describe techniques for practical lineage tracing of data in a DW environment. Lineage tracing enables users to "drill through" from the views in the DW all the way to the source data that was used to create the data in the DW. However, these methods lack techniques to deal with historical source data or data from previous source versions.
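The idea of drilling through from a view row to its source rows can be illustrated with the following sketch. It is not the lineage tracing algorithm of [6]; it simply records source row identifiers eagerly while the view is built. The `(row_id, group, amount)` row layout and both function names are hypothetical.

```python
from collections import defaultdict


def build_view_with_lineage(source_rows):
    """Aggregate (row_id, group, amount) rows into per-group totals and
    remember which source rows contributed to each group."""
    totals = defaultdict(int)
    lineage = defaultdict(list)
    for row_id, group, amount in source_rows:
        totals[group] += amount
        lineage[group].append(row_id)
    return dict(totals), dict(lineage)


def drill_through(group, lineage, source_index):
    """Return the source rows behind one row of the view."""
    return [source_index[row_id] for row_id in lineage.get(group, [])]


rows = [(1, "east", 10), (2, "west", 7), (3, "east", 5)]
view, lineage = build_view_with_lineage(rows)
source_index = {row[0]: row for row in rows}
east_sources = drill_through("east", lineage, source_index)
```

Because the lineage refers to current source row identifiers, a scheme like this would share the limitation noted above: it cannot recover source data that has since been changed or removed.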

22 Anoop Singhal

8. CONCLUSIONS

A data warehouse is a subject-oriented collection of data that is used for decision support systems. Data warehouses typically use a multidimensional data model to facilitate data analysis and are implemented using a three-tier architecture. The bottom tier is a database server, which is typically an RDBMS. The middle tier is an OLAP server, and the top tier is a client containing query and reporting tools. Data mining is the task of discovering interesting patterns from large amounts of data, where the data can be stored in multiple repositories. Efficient data warehousing and data mining techniques are challenging to design and implement for large data sets.

In this chapter, we have given a summary of data warehousing, OLAP and data mining technology. We have also described our experience in using this technology to build a data analysis application for network/web services, and some open research problems that need to be solved in order to efficiently extract data from distributed information repositories. Although some commercial tools are available in the market, our experience in building a decision support system for network/web services has shown that they are inadequate. We believe that there are several important research problems that need to be solved to build flexible, powerful and efficient data analysis applications using data warehousing and data mining techniques.

References

1. S. Chaudhuri, U. Dayal: An Overview of Data Warehousing and OLAP Technology, SIGMOD Record, March 1997.

2. W.H. Inmon: Building the Data Warehouse (2nd Edition), John Wiley, 1996.

3. R. Kimball: The Data Warehouse Toolkit, John Wiley, 1996.

4. D. Pyle: Data Preparation for Data Mining, San Francisco, Morgan Kaufmann, 1999.

5. Prabhu Ram and Lyman Do: Extracting Delta for Incremental Warehouse, Proceedings of IEEE 16th Int. Conference on Data Engineering, 2000.

6. Y. Cui and J. Widom: Practical Lineage Tracing in Data Warehouses, Proceedings of IEEE 16th Int. Conference on Data Engineering, 2000.

7. S. Chaudhuri, G. Das and V. Narasayya: A Robust, Optimization Based Approach for Approximate Answering of Aggregate Queries, Proceedings of ACM SIGMOD Conference, 2001, pp 295-306.

8. Anoop Singhal, "Design of a Data Warehouse for Network/Web Services", Proceedings of Conference on Information and Knowledge Management, CIKM 2004.

9. Anoop Singhal, "Design of GEMS Data Warehouse for AT&T Business Services", Proceedings of AT&T Software Architecture Symposium, Somerset, NJ, March 2000

10. "ANSWER: Network Monitoring using Object Oriented Rules" (with G. Weiss and J. Ros), Proceedings of the Tenth Conference on Innovative Application of Artificial Intelligence, Madison, Wisconsin, July 1998.

11. "A Model Based Approach to Network Monitoring", Proceedings of ACM Workshop on Databases: Active and Real Time, Rockville, Maryland Nov. '96 pages 41-45.

12. Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, August 2000

13. Jennifer Widom, "Research Problems in Data Warehousing", Proc. of 4th Int'l Conference on Information and Knowledge Management, Nov. 1995

14. Hector Garcia Molina, "Distributed and Parallel Computing Issues in Data Warehousing", Proc. of ACM Conference on Distributed Computing, 1999.

15. A. Gupta and I.S. Mumick, "Maintenance of Materialized Views", IEEE Data Engineering Bulletin, June 1995

16. Vipin Kumar et al. Data Mining for Scientific and Engineering Applications, Kluwer Publishing 2000

17. Bernstein P., Principles of Transaction Processing, Morgan Kaufmann, San Mateo, CA 1997

18. Miller H and Han J, Geographic Data Mining and Knowledge Discovery, UK 2001

19. Liu, Bin, Chen, Songting and Rundensteiner, E. A. Batch Data Warehouse Maintenance in Dynamic Environment. In Proceedings of CIKM 2002, McLean, VA, Nov. 2002

20. Hector Garcia Molina, J.D. Ullman, J. Widom, Database Systems: The Complete Book, Prentice Hall, 2002

