

William K. Cheung

7.2 A service-oriented solution

To address the data analysis challenges explained in the previous section, a data analysis prototype has recently been developed to realize a new data analysis approach called learning-by-abstraction (Cheung et al., 2006). The system is characterized by the following features.

A scalable and privacy preserving data mining paradigm In this approach, data analysis is performed on abstracted data, computed by first pre-grouping data items and then retaining only the first and second order statistics of each group. It is privacy preserving in the sense that individual data items, once grouped, can no longer be identified. It is scalable and distributed, as data mining can now be applied directly to a compact representation of the data instead of the raw data. The feasibility of such a paradigm has been demonstrated for applications such as clustering and manifold discovery.
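As a rough sketch of this idea (not the authors' actual implementation), the pre-grouping could be done with a simple k-means pass, after which each group is reduced to its size, mean and covariance; the grouping method and parameters below are illustrative assumptions:

```python
import numpy as np

def abstract_data(X, n_groups=10, seed=0):
    """Replace raw data with per-group first/second order statistics.

    A sketch of the learning-by-abstraction idea: items are pre-grouped
    (here with a few Lloyd/k-means iterations) and each group is kept
    only as (size, mean, covariance), so individual items are no longer
    exposed and downstream mining sees a compact representation.
    """
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_groups, replace=False)]
    for _ in range(20):  # a few Lloyd iterations to pre-group the items
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for k in range(n_groups):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    summaries = []
    for k in range(n_groups):
        G = X[labels == k]
        if len(G) > 1:
            summaries.append((len(G), G.mean(axis=0), np.cov(G, rowvar=False)))
    return summaries  # compact, privacy-preserving representation

X = np.vstack([np.random.randn(200, 2), np.random.randn(200, 2) + 5])
groups = abstract_data(X, n_groups=4)
print(len(groups), "group summaries instead of", len(X), "raw items")
```

Any analysis that can consume Gaussian-style sufficient statistics (e.g. model-based clustering) can then run on the summaries alone.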

The service-oriented architecture This approach provides seamless integration of remote self-contained data analysis modules implemented as Web services, so that they can be coordinated to realize distributed data analysis algorithms. With the recent advent of Web-service-related standards and technologies, including the Web Service Description Language (WSDL), Universal Description, Discovery and Integration (UDDI) and the Simple Object Access Protocol (SOAP), which take care of the basic interfacing and messaging details, developing data analysis systems essentially becomes an exercise of discovering and composing relevant data analysis components.

2. http://grants.nih.gov/grants/policy/data sharing/data sharing workbook.pdf

A BPEL system BPEL is employed for creating and executing distributed data mining processes as composite services. BPEL3 stands for Business Process Execution Language, a mark-up language originally proposed for describing service-oriented business processes. With the help of the graphical user interface provided by the BPEL system, a distributed data mining process can be schematically created and then described in BPEL. After the described process is deployed onto a BPEL execution engine (middleware), the execution of the data mining process is then readily taken care of by the engine.
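The essence of the composite-service idea can be sketched as follows: the process is declared as data (here a toy XML document, whose element names are illustrative only and not real BPEL syntax), and a miniature "engine" executes it by invoking each declared operation in sequence:

```python
import xml.etree.ElementTree as ET

# A toy stand-in for a BPEL document: a sequence of named invocations.
# The element names are illustrative only, not actual BPEL syntax.
PROCESS_XML = """
<process name="mining-flow">
  <invoke operation="clean"/>
  <invoke operation="preprocess"/>
  <invoke operation="mine"/>
</process>
"""

# Hypothetical local stand-ins for what would be remote Web services.
OPERATIONS = {
    "clean":      lambda d: [x for x in d if x is not None],
    "preprocess": lambda d: [float(x) for x in d],
    "mine":       lambda d: {"mean": sum(d) / len(d)},
}

def run_process(xml_text, data):
    """Act as a miniature execution engine for the declared flow."""
    for invoke in ET.fromstring(xml_text).findall("invoke"):
        data = OPERATIONS[invoke.get("operation")](data)
    return data

print(run_process(PROCESS_XML, [1, None, 2, 3]))  # {'mean': 2.0}
```

A real BPEL engine additionally handles SOAP messaging, fault handling and parallel flows; the point here is only that the process description is separate from the components it composes.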

7.3 Background

7.3.1 Types of distributed data analysis

The literature discusses a wide collection of distributed data analysis approaches. There are many ways to decompose and distribute a particular data analysis task over different machines.

Decomposition with global analysis The data analysis process is first decomposed into steps, each taken care of by a distributed resource. Data are aggregated in a server and all of them flow through all the steps to compute the final analysis result. For example, a data mining process may involve a data cleansing step, followed by a data pre-processing step, a data mining step and so on (Cannataro and Talia, 2003). Specific distributed resources can be delegated to the corresponding mining steps for executing the complete data mining process.

For this type of data analysis, no special attention is paid to addressing the data scalability and privacy issues. The complete set of data is accessed and processed by all the involved computational resources.

Decomposition with local and global analysis The data analysis process involves data that are distributed and intermediate analysis steps are applied to each set of local data at their own source. Only these intermediate results are then combined for further global analysis to obtain the final analysis result. In the literature, most of the work on distributed data mining refers to processes of this type. As an example, a global classifier can take input from the outputs of a set of classifiers, each being created based on a local data set (Prodromidis and Chan, 2000).

The scalability of this paradigm rests on the fact that some intermediate steps are performed only on local data. The global analysis part also gains computational savings, as the intermediate analysis results are normally more compact than the raw data and thus take less time to process.
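A heavily simplified sketch of this local/global decomposition: each site trains a trivial "classifier" (here just the majority label of its local data) and only these local outputs leave the sites; a global combiner then votes over them. Real meta-learning (Prodromidis and Chan, 2000) combines learned models, not majority labels; the rules below are illustrative only:

```python
from collections import Counter

def train_local(labels):
    """Hypothetical local analysis: a site's majority label."""
    return Counter(labels).most_common(1)[0][0]

def global_combine(local_predictions):
    """Global analysis: majority vote over the sites' compact outputs."""
    return Counter(local_predictions).most_common(1)[0][0]

# Three sites, each holding a horizontal partition of the data.
sites = [["spam", "spam", "ham"], ["ham", "spam"], ["spam"]]

local_models = [train_local(s) for s in sites]  # only these leave the sites
print(global_combine(local_models))             # 'spam'
```

Note the scalability argument in miniature: the global step sees three labels instead of six raw records.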

Decomposition with peer-to-peer analysis The data analysis process involves data that are distributed and only local analysis steps are applied to local data. Without a global mediator,

3. http://docs.oasis-open.org/wsbpel/2.0/OS/wsbpel-v2.0-OS.html

global analysis is achieved via the exchange of intermediate local analysis results with the peers of each local data source. As an example, it has been shown that frequent counts (the basis of association rule mining) over distributed data sources can be computed when the distributed sources follow a certain peer-to-peer messaging protocol (Wolff and Schuster, 2004). The advantages of such a peer-to-peer paradigm include fault tolerance and better privacy protection. So far, however, only some restricted types of analysis have been shown to be computable in a peer-to-peer manner (Datta et al., 2006).
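The flavour of such peer-to-peer aggregation can be illustrated with a simple gossip-averaging scheme (a generic technique, not the actual protocol of Wolff and Schuster, 2004): peers repeatedly average their local counts with a random neighbour, every peer's value converges to the global average, and any peer can then recover the global count without a central mediator:

```python
import random

# Gossip averaging over hypothetical local itemset counts. Pairwise
# averaging conserves the total, so all peers converge to the global
# mean and global count = local estimate * number of peers.
random.seed(0)
counts = [12.0, 3.0, 7.0, 30.0]   # each peer's local frequent-itemset count
true_total = sum(counts)

for _ in range(2000):             # gossip rounds: average with a random peer
    i, j = random.sample(range(len(counts)), 2)
    counts[i] = counts[j] = (counts[i] + counts[j]) / 2

estimate = counts[0] * len(counts)  # any single peer can now answer
print(round(estimate, 3), "vs true total", true_total)
```

Fault tolerance follows from the lack of any distinguished node; privacy benefits because peers only ever reveal running averages, not raw records.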

7.3.2 A brief review of distributed data analysis

Regarding the type of distributed data analysis that involves both local and global analysis, a meta-learning process has been proposed for combining locally learned decision tree classifiers based on horizontally partitioned data (Prodromidis and Chan, 2000). Besides, collective data mining is another approach, one that works on vertically partitioned data and combines intermediate results obtained from local data sources as if they were orthogonal bases (Kargupta et al., 2000). This method was later applied to learning Bayesian networks for Web log analysis (Chen and Sivakumar, 2002). Also, a number of research groups have been working on distributed association rule mining with privacy preserving capability (Agrawal and Aggarwal, 2001; Kantarcioglu and Clifton, 2004; Gilburd, Schuster and Wolff, 2004). Given the heterogeneity in privacy concerns across local data sources, as previously mentioned, there also exist works studying the trade-off between the two conflicting requirements – data privacy and mining accuracy (Merugu and Ghosh, 2003). As uncontrolled local analysis could lose information that is salient to the subsequent global analysis, the sharing and fusion of intermediate local analysis results to enhance the global analysis accuracy has also been studied (Zhang, Lam and Cheung, 2004).

7.3.3 Data mining services and data analysis management systems

In this section, we describe a number of systems implemented to support data analysis service provisioning and management. The viability of remote data analysis service provisioning has been demonstrated in a number of projects, e.g. DAMS4 and XMLA,5 where data mining requests described in XML format can be sent to the services for fulfillment. Also, there are systems with data mining services built on top of grid middleware, e.g. Weka4WS6 and GridMiner.7 The use of grid middleware can free data analysts from taking care of management issues related to distributed resources, which is especially important when the complexity and scalability demands of the analysis are high.
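To make the request style concrete, one might build an XML-described mining request like this (the element names and parameters below are hypothetical, not the actual XMLA or DAMS request formats):

```python
import xml.etree.ElementTree as ET

def build_request(algorithm, dataset, **params):
    """Assemble a hypothetical XML mining request to ship to a service."""
    req = ET.Element("MiningRequest")
    ET.SubElement(req, "Algorithm").text = algorithm
    ET.SubElement(req, "Dataset").text = dataset
    p = ET.SubElement(req, "Parameters")
    for name, value in params.items():
        ET.SubElement(p, name).text = str(value)
    return ET.tostring(req, encoding="unicode")

xml_request = build_request("kmeans", "sales2006", k=3, max_iter=50)
print(xml_request)
```

In a real deployment, such a document would be wrapped in a SOAP envelope and posted to the service endpoint for fulfillment.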

The recent advent of e-science has also triggered the implementation of management systems for supporting data analysis processes (also called scientific workflows). Related systems were designed with the requirement of hiding details from users, i.e. the scientists, who care more about the data analysis processes than about the system handling them. Most of them contain three major parts: (i) a workbench environment for the user to create

4. http://www.csse.monash.edu.au/projects/MobileComponents/projects/dame/

5. http://www.xmlforanalysis.com/

6. http://grid.deis.unical.it/weka4ws/

7. http://www.gridminer.org/

and modify workflows, (ii) an engine to orchestrate the execution of the deployed workflows and (iii) a repository of computational/data components whose application interfaces and communication protocols are properly designed so that they can be effectively composed and orchestrated. Examples include Wings (Gil et al., 2007b), Pegasus (Deelman et al., 2005), Kepler (Ludascher et al., 2006), Taverna (Oinn et al., 2006), Sedna (Emmerich et al., 2005) and ActiveBPEL.8 Among them, some have followed Web service standards for better reusability and extensibility. For example, WSDL has been used for specifying the component interfaces and BPEL for specifying the analysis processes.
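The three-part design can be sketched in miniature: a component repository, a user-declared workflow with dependencies, and an engine that orchestrates execution in dependency order. The component names are illustrative; real systems such as Taverna or Kepler are far richer:

```python
REPOSITORY = {  # (iii) repository of computational/data components
    "load":      lambda inputs: [3, 1, 2],
    "sort":      lambda inputs: sorted(inputs["load"]),
    "summarize": lambda inputs: {"min": inputs["sort"][0],
                                 "max": inputs["sort"][-1]},
}

WORKFLOW = {    # (i) user-created workflow: component -> its dependencies
    "load": [],
    "sort": ["load"],
    "summarize": ["sort"],
}

def orchestrate(workflow, repository):  # (ii) the execution engine
    """Run each step once all of its dependencies have completed."""
    done, results = set(), {}
    while len(done) < len(workflow):
        for step, deps in workflow.items():
            if step not in done and all(d in done for d in deps):
                results[step] = repository[step]({d: results[d] for d in deps})
                done.add(step)
    return results

print(orchestrate(WORKFLOW, REPOSITORY)["summarize"])  # {'min': 1, 'max': 3}
```

The separation mirrors the design requirement above: the scientist edits only the workflow declaration, while the engine and repository hide the execution details.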

7.4 Model-based scalable, privacy preserving, distributed data analysis