
Moustafa Ghanem, Vasa Curcin, Patrick Wendel and Yike Guo

8.6 Future directions

Our vision has always been to enable end users, i.e. non-programmers, to build and use their own grid-based data mining applications and to deliver them easily as executable applications to other end users. The overarching theme is to enable a wide range of users to reap the benefits of grid computing technologies while shielding them from the complexities of the underlying protocols. Our experience so far indicates that achieving this balance is indeed possible.

The analytical workflow paradigm is intuitive in this respect since it provides a high-level abstract view of the data analysis steps. Although the paradigm is simple to use, designing and implementing an analytical workflow system raises a number of challenges. The first is deciding the execution semantics of the workflow language used. The second is enabling extensibility, i.e. enabling users to easily integrate access to new local or remote data sources and data mining tools. The third relates to data management. All of these issues have been addressed in the Discovery Net system and its architecture over the years.

The system and architecture presented here constitute a solid foundation for addressing forthcoming research challenges in the area. Here we shall briefly discuss the issues we foresee as being relevant over the next several years.

We have introduced decorators as dynamic modifications to individual components that can manipulate global resources or serve as a way of customizing a predefined workflow. Further use of decorators will ultimately lead to the separation of activities into core definitions and applicators, sets of predefined decorators that can adapt an activity to a particular data type or computational resource and are shared among several activities. This separation will, in part, be motivated by the increased component exchange between workflow systems and the need to move it beyond mutual Web service invocation and into the realm of dynamic code exchange.

These ideas found their first incarnation in the EU project SIMDAT.
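The separation of an activity into a core definition and a shared applicator can be illustrated with a minimal sketch. This is not the Discovery Net or SIMDAT implementation; the names `Applicator`, `accept_csv_rows` and `log_calls` are hypothetical, and the sketch only assumes that a decorator can be modelled as a function that wraps an activity's core.

```python
import functools
from dataclasses import dataclass, field
from typing import Callable, List

# An activity's core definition: a plain function from input to output.
Core = Callable[[object], object]
# A decorator (in the chapter's sense) wraps a core and returns a modified core.
Decorator = Callable[[Core], Core]


@dataclass
class Applicator:
    """Hypothetical: a named, reusable set of decorators shared by several activities."""
    name: str
    decorators: List[Decorator] = field(default_factory=list)

    def apply(self, core: Core) -> Core:
        # Decorators are applied in order; the last one becomes the outermost wrapper.
        for decorate in self.decorators:
            core = decorate(core)
        return core


def accept_csv_rows(core: Core) -> Core:
    """Hypothetical data-type adapter: turn a CSV string into a list of floats."""
    @functools.wraps(core)
    def wrapped(data):
        if isinstance(data, str):
            data = [float(x) for x in data.split(",")]
        return core(data)
    return wrapped


def log_calls(log: List[str]) -> Decorator:
    """Hypothetical decorator that manipulates a shared resource (here, a log)."""
    def decorate(core: Core) -> Core:
        @functools.wraps(core)
        def wrapped(data):
            log.append(f"calling {core.__name__}")
            return core(data)
        return wrapped
    return decorate


# Two core definitions that know nothing about CSV input or logging.
def mean(values):
    return sum(values) / len(values)


def maximum(values):
    return max(values)


# One applicator adapts both activities to the same data type and shared log.
call_log: List[str] = []
csv_applicator = Applicator("csv-input", [accept_csv_rows, log_calls(call_log)])
csv_mean = csv_applicator.apply(mean)
csv_max = csv_applicator.apply(maximum)
```

In this sketch the core functions remain usable on their own, while the applicator carries everything that is specific to a data type or a shared resource, which is the part that could be exchanged between workflow systems.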

Security and credential management is also an issue that needs to be handled more consistently and safely in all workflow systems. Currently it is up to each activity to deal with the credentials to access a secured service and to provide the necessary options. Furthermore, once the workflow is disseminated, it is no longer possible to control who will be using it.

Hence, restricting both modifications and access levels is difficult once the workflow leaves a particular hosting environment, which limits the commercial applicability of workflows as a medium for application creation. The definition of policies to manage remote retrieval of credentials needs to be separated from the workflow definition itself to support a full range of licensing options.
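One way to picture a credential policy kept apart from the workflow definition is sketched below. All names (`Step`, `CredentialPolicy`, `run`) are hypothetical, and the in-memory dictionary of secrets merely stands in for whatever remote credential store a hosting environment might use. The sketch assumes that workflow steps carry only a logical credential reference, which the hosting environment resolves at execution time.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class Step:
    """A workflow step that names the credential it needs, never the secret itself."""
    activity: str
    credential_ref: Optional[str] = None


@dataclass
class CredentialPolicy:
    """Hypothetical hosting-environment policy, defined outside the workflow."""
    grants: Dict[str, FrozenSet[str]]  # credential reference -> users allowed to use it
    secrets: Dict[str, str]            # credential reference -> secret (stand-in store)

    def resolve(self, user: str, credential_ref: str) -> str:
        # Refuse to hand out a credential the policy does not grant to this user.
        if user not in self.grants.get(credential_ref, frozenset()):
            raise PermissionError(f"{user} may not use {credential_ref}")
        return self.secrets[credential_ref]


def run(workflow: List[Step], user: str, policy: CredentialPolicy) -> List[str]:
    """Resolve credentials at execution time and return a trace of the steps run."""
    trace = []
    for step in workflow:
        if step.credential_ref is not None:
            policy.resolve(user, step.credential_ref)  # raises if not permitted
            trace.append(f"{step.activity} (authenticated)")
        else:
            trace.append(step.activity)
    return trace


# The same workflow definition can be disseminated under different policies.
workflow = [Step("load_public_data"), Step("query_licensed_db", "licensed-db")]
policy = CredentialPolicy(
    grants={"licensed-db": frozenset({"alice"})},
    secrets={"licensed-db": "example-secret"},
)
```

Because the workflow holds no secrets, changing who may run it is a change to the policy alone, which is what a range of licensing options would require.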

Finally, the increased modularization of workflow servers, resulting in a range of hosting environments, such as embedded workflow-based processing in devices or specialized applications (e.g. visualizer tools or desktop applications), will lead to the next generation of deployment, in which workflows are published together with a custom-made hosting environment created at the deployment stage. Such an environment contains only the resources and the functionality necessary for the application execution, thereby eliminating code bloat and removing dependence on external servers for standalone applications.
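As a rough illustration of how a deployment stage might select only the resources a workflow needs, the following sketch computes the transitive closure of a dependency map. The function `required_resources` and the resource names are hypothetical; the sketch assumes that each activity or library declares what it depends on.

```python
from typing import Dict, Iterable, Set


def required_resources(activities: Iterable[str],
                       depends_on: Dict[str, Set[str]]) -> Set[str]:
    """Return the activities plus everything they depend on, directly or indirectly."""
    needed: Set[str] = set()
    stack = list(activities)
    while stack:
        item = stack.pop()
        if item in needed:
            continue  # already visited; this also guards against dependency cycles
        needed.add(item)
        stack.extend(depends_on.get(item, ()))
    return needed


# Hypothetical dependency declarations available on a full workflow server.
depends_on = {
    "cluster": {"numeric-lib"},
    "plot": {"render-lib"},
    "render-lib": {"font-pack"},
    "text-mine": {"nlp-lib"},
}

# A workflow that only clusters and plots does not need the text mining stack.
bundle = required_resources(["cluster", "plot"], depends_on)
```

Anything outside the computed set, such as the text mining libraries in this example, would be left out of the custom hosting environment.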

The accessibility of data and computational resources to data mining users enables them to conduct more complex analyses over larger data sets in less time, and so to become more efficient in discovering new knowledge. The availability of end-user-oriented frameworks that provide these scientists with full access to the benefits of grid computing technologies, while shielding them from the complexities of the underlying protocols, is essential. Our experience indicates that achieving this balance is indeed possible by providing users with tools at a level of abstraction that suits both their problem-solving methods and their modes of knowledge discovery.

Acknowledgements

The work described in this chapter has been funded, in part, under different grants including the Discovery Net Project (2001–2005) funded by the EPSRC under the UK e-Science Programme and the SIMDAT project funded by the European Union under the FP6 IST programme. The authors would also like to acknowledge the work conducted by their many colleagues at Imperial College London and InforSense Ltd. in implementing various parts of the Discovery Net system, and Dr. Mariam Molokhia from the London School of Hygiene and Tropical Medicine for the ADR case-control workflow.

9

Building workflows that traverse