Further research directions - William K. Cheung


7.7 Further research directions

7.7.1 Optimizing BPEL4WS process execution

Most existing BPEL4WS process execution engines are designed with the primary objective of providing a reliable execution environment for processes described in BPEL4WS. As the scale of BPEL4WS applications increases, the performance of the execution engine will soon become an issue. Optimizing the engine’s system architecture at the operating system level (say, with improved memory management or a more effective job queuing strategy) will no doubt become more important as BPEL4WS becomes more widely used, e.g. in products such as ActiveBPEL, Oracle BPEL Manager and IBM WebSphere Process Server.

Another direction for optimizing BPEL4WS process execution is to model BPEL4WS processes as workflows and then perform workflow analysis before submitting jobs to the BPEL execution engine, with the aim of improving resource allocation and thus achieving a higher throughput rate (Sion and Tatemura, 2005). In large-scale, grid-enabled data mining applications, it has been shown that significant performance improvement can be achieved by properly controlling the complexity of the workflow analysis performed at the middleware to ensure a high enough job submission rate (Singh, Kesselman and Deelman, 2005). Also, the complexity of a data analysis workflow could be reduced if some intermediate analysis results were cached (Deelman et al., 2005). In addition, dynamically allocating resources (binding services) during BPEL process execution is another feature discussed by Deelman et al. (2005), but without a proof-of-concept demonstration so far. We envision that the need for executing a large number of data analysis workflows in a highly distributed environment will eventually grow, and such a dynamic property will then be essential.
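To make the caching idea concrete, the following is a minimal sketch, not the mechanism of any cited system. It assumes a workflow is given as a mapping from each task to the tasks it depends on, and that a set of tasks with cached results is known in advance. The function name `tasks_to_run` and all task names are hypothetical and chosen only for illustration.

```python
def tasks_to_run(deps, targets, cached):
    """Return the tasks that still need to run, in a valid execution order.

    deps    -- dict mapping a task name to the list of tasks it depends on
    targets -- tasks whose results are ultimately wanted
    cached  -- set of tasks whose results are already available
    """
    order = []
    seen = set()

    def visit(task):
        # A cached task is not re-run, and its ancestors are not explored.
        if task in seen or task in cached:
            return
        seen.add(task)
        for parent in deps.get(task, []):
            visit(parent)
        order.append(task)  # all prerequisites are already in `order`

    for target in targets:
        visit(target)
    return order


# Hypothetical two-source clustering workflow (names are illustrative only).
deps = {
    "abstract_A": ["load_A"],
    "abstract_B": ["load_B"],
    "merge": ["abstract_A", "abstract_B"],
    "global_cluster": ["merge"],
}

full = tasks_to_run(deps, ["global_cluster"], cached=set())
reduced = tasks_to_run(deps, ["global_cluster"], cached={"abstract_A"})
```

With the result of `abstract_A` cached, both `abstract_A` and its upstream `load_A` drop out of the plan, so the reduced workflow has four tasks instead of six.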

7.7.2 Improved support of data analysis process management

Other than ensuring reliable and efficient execution, another important concern in managing large-scale distributed data analysis processes is how the user can be assisted in creating workflows and validating them as correct descriptions of the intended analysis processes before execution takes place. Most existing BPEL process creation tools, such as Oracle BPEL designer or ActiveBPEL, help users create BPEL processes visually, but only at a level of abstraction where all the execution details of a workflow have to be specified for deployment. Creating a large-scale data analysis application therefore normally demands much implementation- and execution-related knowledge and experience from the user, let alone the manual effort required for creating a workflow with, say, thousands of nodes.

Simple as the distributed data clustering described in this chapter is, the corresponding process management cost could still be too high for an ordinary user. In our case, the issue becomes important when the number of data sources involved is large.

Fortunately, many distributed data analysis processes can be modelled as workflows with some repeated structures. For example, for the distributed data clustering we described, the local abstraction steps taken at the local data sources are in fact functionally identical and thus can be described in a much more compact manner if the iteration structure can be specified in the workflow description. The idea of providing a higher level of process abstraction has been proposed in the literature (Gil et al., 2007b). A workflow can first be represented as a workflow template, and then bound with data sources to form a workflow instance, and finally optimized based on some pre-computed intermediate results to give an executable workflow.
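The three stages just described (template, instance, executable workflow) can be sketched as follows. This is an illustrative toy under stated assumptions and does not reproduce the representation or API of any existing workflow tool: the `Template` class, the functions `instantiate` and `make_executable`, and the data source names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    per_source_step: str  # step repeated once for every data source
    global_step: str      # step that consumes all per-source outputs


def instantiate(template, sources):
    """Bind a template to data sources, giving a task -> dependencies map."""
    local_tasks = [f"{template.per_source_step}[{s}]" for s in sources]
    instance = {task: [] for task in local_tasks}
    instance[template.global_step] = local_tasks
    return instance


def make_executable(instance, precomputed):
    """Drop tasks (and dependencies) whose results are already available."""
    return {
        task: [d for d in deps if d not in precomputed]
        for task, deps in instance.items()
        if task not in precomputed
    }


# The user writes only the compact template and the list of sources.
template = Template("local_abstraction", "global_clustering")
instance = instantiate(template, ["site_1", "site_2", "site_3"])
executable = make_executable(
    instance, precomputed={"local_abstraction[site_2]"}
)
```

The template stays the same size however many sources are bound to it, which is the point of the higher-level abstraction: the repeated local step is written once and expanded mechanically.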

The user needs only to take care of the workflow template creation step and the data binding step with the help of the workflow creation tool called Wings. Another feature of the tool is that it is enabled for the Semantic Web. Given the semantics of the data sources and the workflow components, various types of constraint for workflow validation can be specified and checked.

For example, it is common for a user to expect that the workflow system can ensure reliability of the output results of the workflow (e.g. file consistency constraints (Kim, Gil and Ratnakar, 2006)) or can check the compliance of the workflow against some process related policies (e.g. data privacy policies (Gil et al., 2007a)).

We believe that further development along this direction can help hide BPEL details from users, so that they need only focus on the conceptual design of the processes that support their research or applications.

7.7.3 Improved support of data privacy preservation

In many distributed systems, data privacy refers to whether the data transmitted via an open network are properly encrypted so that only the data recipients can read the data. This sense of data privacy is important and a number of related protection mechanisms have been included in the design of some SOA and grid computing environments. However, in many cases, data privacy in distributed data analysis also refers to whether the data being released are properly ‘anonymized’ so that no one, including the data recipients, can read the data, while some meaningful data analysis should still be possible (as explained in Section 7.3.2). The proposed learning-from-abstraction approach for scalable and privacy preserving data analysis is essentially addressing the privacy issue.

The distributed data analysis processes being considered in this chapter can be modelled as directed acyclic graphs, and they fit especially well with most workflow and BPEL systems. However, to provide better data privacy preservation support, one may want to explore more privacy preserving data analysis approaches and to incorporate some autonomy in the data analysis services to gain further adaptation and robustness in privacy protection. Then, supporting only simple directed acyclic graph structures will be insufficient. For example, secure multiparty computation is another common approach adopted for privacy preserving data analysis (Clifton et al., 2002), where a special secure computation protocol is needed to circulate some privacy protected intermediate computing results through a set of involved hosts. Also, for the learning-from-abstraction approach described in this chapter, a negotiation protocol can be incorporated so that the trade-off between the global analysis accuracy and the degree of local data abstraction can be dynamically computed on a need-to-know basis via negotiation between the services under a game theoretic framework (Zhang and Cheung, 2007).
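As an illustration of why such protocols do not fit an acyclic structure, the sketch below simulates a ring-based secure sum, a standard building block in the secure multiparty computation literature. It is a single-process simulation under simplifying assumptions (semi-honest hosts that follow the protocol and do not collude, and a true total smaller than the modulus), not a networked or hardened implementation. The function name `secure_sum` and the example values are hypothetical.

```python
import secrets


def secure_sum(local_values, modulus):
    """Simulate a ring-based secure sum over the hosts' private values.

    Assumes the true total is smaller than `modulus`, and that hosts are
    semi-honest and do not collude (illustrative assumptions only).
    """
    # The initiating host draws a random mask that only it knows.
    mask = secrets.randbelow(modulus)
    running = (mask + local_values[0]) % modulus

    # The masked running total is passed from host to host around the ring.
    # Each host adds its private value and sees only a masked number.
    for value in local_values[1:]:
        running = (running + value) % modulus

    # The total returns to the initiating host, which removes its mask.
    return (running - mask) % modulus


# Three hosts with private counts 12, 7 and 30.
total = secure_sum([12, 7, 30], modulus=1000)  # total == 49
```

The intermediate result leaves the initiating host and must come back to it before the answer is known, so the message flow forms a cycle rather than a directed acyclic graph.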

Unlike some existing workflow systems (Gil et al., 2007b), BPEL, and thus the related execution systems, allows processes that cannot be represented as directed acyclic graphs to be modelled. However, the extent to which issues such as confidentiality, reliability and flexibility can be supported by existing BPEL platforms remains to be investigated.

7.8 Conclusions

Learning-from-abstraction is a recently proposed approach for scalable and privacy preserving distributed data analysis applications (Cheung et al., 2006). We discussed the issues related to the design of a service-oriented implementation of the approach. We also presented evaluation results on the performance of a distributed data clustering process and the lessons learned. We believe that modelling distributed data analysis processes using BPEL is an effective means of enabling researchers and data analysts to perform large-scale experimental studies without the need to develop customized analysis software systems.

For future work, we believe that incorporating more intelligent execution optimization, providing better process creation and validation support, and supporting workflows with fewer structural constraints are the important research problems to be addressed before large-scale privacy preserving distributed data analysis can eventually be widely applied.

Acknowledgements

The author would like to thank Yolanda Gil for her comments and suggestions on the section on future research directions. This work is partially supported by RGC Central Allocation HKBU 2/03C, RGC Grant HKBU 2102/06E and HKBU Faculty Research Grant FRG/05-06/I-16.


References

Agrawal, D. and Aggarwal, C. (2001), On the design and quantification of privacy preserving data mining algorithms, in ‘Proceedings of the Twentieth ACM–SIGACT–SIGMOD–SIGART Symposium on Principles of Database Systems’, Santa Barbara, CA, pp. 247–255.

Bishop, C. (1999), Latent variable models, in ‘Learning in Graphical Models’, MIT Press, pp. 371–403.

Cannataro, M. and Talia, D. (2003), ‘The knowledge grid’, Communications of the ACM 46 (1), 89–93.

Chen, R. and Sivakumar, K. (2002), A new algorithm for learning parameters of a Bayesian network from distributed data, in ‘Proceedings of the 2002 IEEE International Conference on Data Mining’, Maebashi City, Japan, pp. 585–588.

Cheung, W., Zhang, X., Wong, H., Liu, J., Luo, Z. and Tong, F. (2006), ‘Service-oriented distributed data mining’, IEEE Internet Computing 10 (4), 44–54.

Clifton, C., Kantarcioglu, M., Vaidya, J., Lin, X. and Zhu, M. (2002), ‘Tools for privacy-preserving distributed data mining’, ACM SIGKDD Explorations Newsletter 4 (2), 28–34.

Datta, S., Bhaduri, K., Giannella, C., Wolff, R. and Kargupta, H. (2006), ‘Distributed data mining in peer-to-peer networks’, IEEE Internet Computing 10, 18–26.

Deelman, E., Singh, G., Su, M., Blythe, J., Gil, Y., Kesselman, C., Mehta, G., Vahi, K., Berriman, G., Good, J., Laity, A., Jacob, J. and Katz, D. (2005), ‘Pegasus: a framework for mapping complex scientific workflows onto distributed systems’, Scientific Programming Journal 13 (3), 219–237.

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977), ‘Maximum likelihood from incomplete data via the EM algorithm’, Journal of the Royal Statistical Society. Series B (Methodological) 39 (1), 1–38.

Emmerich, W., Butchart, B., Chen, L., Wassermann, B. and Price, S. L. (2005), ‘Grid service orchestration using the business process execution language (BPEL)’, Journal of Grid Computing 3 (3–4), 283–304.

Gil, Y., Cheung, W., Ratnakar, V. and Chan, K. (2007a), Privacy enforcement through workflow system in e-Science and beyond, in ‘Proceedings of Workshop on Privacy Enforcement and Accountability with Semantics, Held in Conjunction with the Sixth International Semantic Web Conference’, Busan, Korea.

Gil, Y., Ratnakar, V., Deelman, E., Mehta, G. and Kim, J. (2007b), Wings for Pegasus: creating large-scale scientific applications using semantic representations of computational workflows, in ‘Proceedings of the 19th Annual Conference on Innovative Applications of Artificial Intelligence (IAAI)’, Vancouver.

Gilburd, B., Schuster, A. and Wolff, R. (2004), k-TTP: a new privacy model for large-scale distributed environments, in ‘Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining’, Seattle, WA, pp. 563–568.

Kantarcioglu, M. and Clifton, C. (2004), ‘Privacy-preserving distributed mining of association rules on horizontally partitioned data’, IEEE Transactions on Knowledge and Data Engineering 16 (9), 1026–1037.

Kargupta, H., Park, B., Hershberger, D. and Johnson, E. (2000), Collective data mining: a new perspective towards distributed data mining, in ‘Advances in Distributed and Parallel Knowledge Discovery’, MIT/AAAI Press, pp. 133–184.

Kim, J., Gil, Y. and Ratnakar, V. (2006), Semantic metadata generation for large scientific workflows, in ‘Proceedings of the 5th International Semantic Web Conference’, Athens, GA.

Kobsa, A. (2007), ‘Privacy-enhanced personalization’, Communications of the ACM 50 (8), 24–33.

Ludascher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger, E., Jones, M., Lee, E., Tao, J. and Zhao, Y. (2006), ‘Scientific workflow management and the Kepler system’, Concurrency and Computation: Practice and Experience 18 (10), 1039–1065.

McLachlan, G. J. and Basford, K. E. (1988), Mixture Models – Inference and Applications to Clustering, Dekker, New York.

Merugu, S. and Ghosh, J. (2003), Privacy-preserving distributed clustering using generative models, in ‘Proceedings of the Third IEEE International Conference on Data Mining’, Melbourne, FL, pp. 211–218.

Oinn, T., Greenwood, M., Addis, M., Alpdemir, M., Ferris, J., Glover, K., Goble, C., Goderis, A., Hull, D., Marvin, D., Li, P., Lord, P., Pocock, M., Senger, M., Stevens, R., Wipat, A. and Wroe, C. (2006), ‘Taverna: lessons in creating a workflow environment for the life sciences’, Concurrency and Computation: Practice and Experience 18 (10), 1067–1100.

Organization for Economic Co-Operation and Development (1980), ‘Guidelines on the protection of privacy and transborder flow of personal data’, http://www.oecd.org/document/18/0,2340,en_2649_34255_1815186_1_1_1_1,00.html

Prodromidis, A. and Chan, P. (2000), Meta-learning in distributed data mining systems: issues and approaches, in ‘Advances of Distributed Data Mining’, MIT/AAAI Press.

Singh, G., Kesselman, C. and Deelman, E. (2005), ‘Optimizing grid-based workflow execution’, Journal of Grid Computing 3 (3–4), 201–219.

Sion, R. and Tatemura, J. (2005), Dynamic stochastic models for workflow response optimization, in ‘Proceedings of the IEEE International Conference on Web Services’, IEEE Computer Society, Washington, DC, pp. 657–664.

Weitzner, D., Abelson, H., Berners-Lee, T., Hanson, C., Hendler, J., Kagal, L., McGuinness, D., Sussman, G. and Waterman, K. (2006), Transparent accountable data mining: new strategies for privacy protection, Technical Report MIT-CSAIL-TR-2006-007, MIT.

Weitzner, D., Hendler, J., Berners-Lee, T. and Connolly, D. (2005), Creating a policy-aware Web: discretionary, rule-based access for the World Wide Web, in E. Ferrari and B. Thuraisingham, eds, ‘Web and Information Security’, IRM Press.

Wolff, R. and Schuster, A. (2004), ‘Association rule mining in peer-to-peer systems’, IEEE Transactions on Systems, Man, and Cybernetics, Part B 34 (6), 2426–2438.

Zhang, X. and Cheung, W. (2005a), Learning global models based on distributed data abstractions, in ‘Proceedings of the International Joint Conference on Artificial Intelligence’, Edinburgh, pp. 1645–1646.

Zhang, X. and Cheung, W. (2005b), Visualizing global manifold based on distributed local data abstraction, in ‘Proceedings of the 5th IEEE International Conference on Data Mining’, Houston, pp. 821–824.

Zhang, X. and Cheung, W. (2007), A game theoretic approach to active distributed data mining, in ‘Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology’, IEEE Press, Silicon Valley, USA.

Zhang, X., Cheung, W. and Li, C. H. (2006), Graph-based abstraction for privacy-preserving manifold visualization, in ‘Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology’, Hong Kong, pp. 94–97.

Zhang, X., Lam, C. and Cheung, W. (2004), ‘Mining local data sources for learning global cluster models via local model exchange’, IEEE Intelligent Informatics Bulletin 4 (2), 16–22.


From the document Data Mining Techniques in Grid Computing Environments (pages 137-142)