• Aucun résultat trouvé

Future developments

Peter Brezany, Ivan Janciak and A. Min Tjoa

3.6 Future developments

From the previous paragraphs, we can see that we are already witnessing various grid-based data mining research and development activities. There are many potential extensions of this work towards comprehensive, productive and high-performance analysis of scientific data sets.

Below we outline three promising future research avenues.

3.6.1 High-level data mining model

The most popular database technologies, like the relational one, have precise and clear founda-tions in the form of uniform models and algebraic and logical frameworks on which significant theoretical and system research has been conducted. The emergence of the relational query language SQL and its query processors is an example of this development. A similar state has to be achieved in data mining in order for it to be leveraged as a successful technology and science. The research in this field will include three main steps: (i) identifying the properties required of such a model in the context of large-scale data mining, (ii) examining the currently available models that might contribute (e.g. CRISP-DM) and (iii) developing a formal notation that exposes the model features.

3.6.2 Data mining query language

From a user’s point of view, the execution of a data mining process and the discovery of a set of patterns can be considered as either the result of an execution of a data mining workflow or an answer to a sophisticated database query. The first is called the procedural approach, while the second is the descriptive approach. GridMiner and other grid-based data mining developments are based on this model. To support the descriptive approach several languages have been developed. The Data Mining Query Language (DMQL) (Han, 2005) and OLE DB for Data Mining (Netz, et al., 2001) are good examples. A limitation of these languages is their poor support for data preparation, transformation and post-processing. We believe that the design of such a language has to be based on an appropriate data mining model; its features will be reflected by the syntactic and semantic structure of the language. Moreover, implementation of queries expressed in the language on the grid, which represents a highly volatile distributed environment, will require investigation of dynamic approaches able to sense the query execution status and grid status in specified time intervals and appropriately adapt the query execution plan.

3.6.3 Distributed mining of data streams

In the past five years with advances in data collection and generation technologies, a new class of application has emerged that requires managing data streams, i.e. data composed of continuous, real time sequence of items. A significant research effort has already been devoted to stream data management (Chaudhry, Shaw and Abdelguerfi, 2005) and data stream mining (Aggarwal, 2007). However, in advanced applications there is a need to mine multiple data streams. Many systems use a centralized model (Babcock, et al., 2002). Here, the distributed data streams are directed to one central location before they are mined. Such a model is limited in many aspects. Recently, several researchers have proposed a distributed model considering distributed data sources and computational resources–an excellent survey was provided by

REFERENCES 53 Parthasarathy, Ghoting and Otey (2007). We believe that investigation of grid-based distributed mining of data streams is also an important future research direction.

3.7 Conclusions

The characteristics of data exploration in modern scientific applications impose unique require-ments for analytical tools and services as well as their organization into scientific workflows.

In this chapter, we have introduced our approach to the e-science analytics that has been in-corporated into the GridMiner framework. The framework aims to exploit modern software engineering and data engineering methods based on service-oriented Web and grid technolo-gies. An additional goal of the framework is to give the data mining expert powerful support to achieve appealing results. There are two main issues driving the development of such an infrastructure. The first is the increasing volumes and complexity of involved data sets, and the second is the heterogeneity and geographic distribution of these data sets and the scientists who want to analyse them. The GridMiner framework attempts to address both challenges, and we believe successfully. Several data mining services have been already deployed and are ready to perform the knowledge discovery tasks and OLAP.

References

Aggarwal, C. (2007), Data Streams: Models and Algorithms, Advances in Database Systems, Springer.

Antonioletti, M., Atkinson, M., Baxter, R., Borley, A., Hong, N. P. C., Collins, B., Hardman, N., Hume, A. C., Knox, A., Jackson, M., Krause, A., Laws, S., Magowan, J., Paton, N. W., Pearson, D., Sug-den, T., Watson, P. and Westhead, M. (2005), ‘The design and implementation of grid database ser-vices in OGSA-DAI: Research articles’, Concurrency and Computation: Practice and Experience 17 (2–4), 357–376.

Antonioletti, M., Hong, N. C., Atkinson, M., Krause, A., Malaika, S., McCance, G., Laws, S., Magowan, J., Paton, N. and Riccardi, G. (2003), ‘Grid data service specification’, http://www.

gridpp.ac.uk/papers/DAISStatementSpec.pdf

Babcock, B., Babu, S., Datar, M., Motwani, R. and Widom, J. (2002), Models and issues in data stream systems, in ‘Proceedings of the 21st ACM SIGMOD–SIGACT–SIGART Symposium on Principles of Database Systems (PODS’02)’, ACM, New York, NY, pp. 1–16.

Berkley, C., Bowers, S., Jones, M., Ludaescher, B., Schildhauer, M. and Tao, J. (2005), Incorporating semantics in scientific workflow authoring, in ‘Proceedings of the 17th International Conference on Scientific and Statistical Database Management (SSDBM’05)’, Lawrence Berkeley Laboratory, Berkeley, CA, pp. 75–78.

Brezany, P., Janciak, I. and Han, Y. (2006), Parallel and distributed grid services for building classification models based on neural networks, in ‘Second Austrian Grid Symposium’, Innsbruck.

Brezany, P., Janciak, I. and Tjoa, A. M. (2007), Ontology-based construction of grid data mining work-flows, in H. O. Nigro, S. E. G. Císaro and D. H. Xodo, eds, ‘Data Mining with Ontologies: Imple-mentations, Findings, and Frameworks’, Hershey, New York, pp. 182–210.

Brezany, P., Kloner, C. and Tjoa, A. M. (2005), Development of a grid service for scalable decision tree construction from grid databases, in ‘Sixth International Conference on Parallel Processing and Applied Mathematics’, Poznan.

Brezany, P., Tjoa, A. M., Rusnak, M. and Janciak, I. (2003a), Knowledge grid support for treatment of traumatic brain injury victims, in ‘2003 International Conference on Computational Science and its Applications’, Vol. 1, Montreal, pp. 446–455.

Brezany, P., Tjoa, A. M., Wanek, H. and Woehrer, A. (2003b), Mediators in the architecture of grid infor-mation systems, in ‘Fifth International Conference on Parallel Processing and Applied Mathematics, PPAM 2003’, Springer, Czestochowa, Poland, pp. 788–795.

Cannataro, M. and Talia, D. (2003), ‘The knowledge grid’, Communications of the ACM 46 (1), 89–93.

Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C. and Wirth, R. (2000), CRISP-DM 1.0: Step-by-Step Data Mining Guide, CRISP-CRISP-DM Consortium: NCR Systems Engineering Copenhagen (USA and Denmark) DaimlerChrysler AG (Germany), SPSS Inc. (USA) and OHRA Verzekeringen en Bank Groep BV (The Netherlands).

Chaudhry, N., Shaw, K. and Abdelguerfi, M. (2005), Stream Data Management (Advances in Database Systems), Springer.

Churches, D., Gombas, G., Harrison, A., Maassen, J., Robinson, C., Shields, M., Taylor, I. and Wang, I. (2006), ‘Programming scientific and distributed workflow with Triana services: Research articles’, Concurrency and Computation: Practice and Experience 18 (10), 1021–1037.

Curˇcin, V., Ghanem, M., Guo, Y., K¨ohler, M., Rowe, A., Syed, J. and Wendel, P. (2002), Discovery´ Net: towards a grid of knowledge discovery, in ‘Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’02)’, ACM Press, New York, NY, pp. 658–663.

Czajkowski, K., Ferguson, D., Foster, I., Frey, J., Graham, S., Maguire, T., Snelling, D. and Tuecke, S. (2004), ‘From Open Grid Services Infrastructure to WS-Resource Framework: Refactoring and Evolution, version 1.1’, Global Grid Forum.

Data Mining Group(2004), ‘Predictive Model Markup Language’, http://www.dmg.org/

Elsayed, I. and Brezany, P. (2005), Online analytical mining of association rules on the grid, Technical Re-port Deliverable of the TU/Uni-Vienna GridMiner Project, Institute for Software Science, University of Vienna.

Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P. (1996), ‘From data mining to knowledge discovery in databases’, Ai Magazine 17, 37–54.

Fiser, B., Onan, U., Elsayed, I., Brezany, P. and Tjoa, A. M. (2004), Online analytical processing on large databases managed by computational grids, in ‘DEXA 2004’.

Foster, I. and Kesselman, C. (1998), The Globus Project: a status report, in ‘Proceedings of the 7th Heterogeneous Computing Workshop (HCW’98)’, IEEE Computer Society, Washington, DC, p. 4.

Foster, I., Kesselman, C., Nick, J. M. and Tuecke, S. (2002), ‘The physiology of the grid: an open grid services architecture for distributed systems integration’, http://www.globus.org/research/ pa-pers/ogsa.pdf

Franklin, M., Halevy, A. and Maier, D. (2005), ‘From databases to dataspaces: a new abstraction for information management’, SIGMOD Record 34 (4), 27–33.

Goil, S. and Choudhary, A. (1997), ‘High performance OLAP and data mining on parallel computers’, Journal of Data Mining and Knowledge Discovery 1 (4), 391–417.

Han, J. (2005), Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, CA.

Hofer, J. and Brezany, P. (2004), DIGIDT: distributed classifier construction in the grid data mining framework GridMiner-Core, in ‘Workshop on Data Mining and the Grid (DM-Grid 2004), held in conjunction with the 4th IEEE International Conference on Data Mining (ICDM’04)’, Brighton, UK.

Janciak, I., Kloner, C. and Brezany, P. (2007), ‘Workflow Enactment Engine Project’, http://weep.

gridminer.org

REFERENCES 55 Janciak, I., Sarnovsky, M., Tjoa, A. M. and Brezany, P. (2006), Distributed classification of textual documents on the grid, in ‘The 2006 International Conference on High Performance Computing and Communications, LNCS 4208’, Munich, pp. 710–718.

Jones, M. B., Ludaescher, B., Pennington, D., Pereira, R., Rajasekar, A., Michener, W., Beach, J. H.

and Schildhauer, M. (2006), ‘A knowledge environment for the biodiversity and ecological sciences’, Journal of Intelligent Information Systems 29 (1), 111–126.

Kargupta, H., Joshi, A., Sivakumar, K. and Yesha, Y., eds(2003), Data Mining: Next Generation Chal-lenges and Future Directions, AAAI/MIT Press, Menlo Park, CA.

Kickinger, G., Brezany, P., Tjoa, A. M. and Hofer, J. (2004), Grid knowledge discovery processes and an architecture for their composition, in ‘IASTED Conference’, Innsbruck, pp. 76–81.

Kickinger, G., Hofer, J., Tjoa, A. M. and Brezany, P. (2003), Workflow management in GridMiner, in

‘3rd Cracow Grid Workshop’, Cracow.

Langella, S., Oster, S., Hastings, S., Siebenlist, F., Kurc, T. and Saltz, J. (2006), Dorian: grid service infrastructure for identity management and federation, in ‘Proceedings of the 19th IEEE Sympo-sium on Computer-Based Medical Systems (CBMS’06)’, IEEE Computer Society, Washington, DC, pp. 756–761.

Netz, A., Chaudhuri, S., Fayyad, U. M. and Bernhardt, J. (2001), Integrating data mining with SQL databases: OLE DB for data mining, in ‘Proceedings of the 17th International Conference on Data Engineering’, pp. 379–387.

Parthasarathy, S., Ghoting, A. and Otey, M. E. (2007), A survey of distributed mining of data streams, in C. C. Aggarwal, ed., ‘Data Streams: Models and Algorithms’, Springer, pp. 289–307.

Pyle, D. (1999), Data Preparation for Data Mining, Morgan Kaufmann San Francisco, CA.

Romberg, M. (2002), ‘The UNICORE grid infrastructure’, Scientific Programming 10 (2), 149–157.

Sarang, P., Mathew, B. and Juric, M. B. (2006), Business Process Execution Language for Web Services, 2nd Edition. An Architects and Developers Guide to BPEL and BPEL4WS, Packt.

Shafer, J. C., Agrawal, R. and Mehta, M. (1996), SPRINT: A scalable parallel classifier for data mining, in T. M. Vijayaraman, A. P. Buchmann, C. Mohan and N. L. Sarda, eds, ‘Proceedings of the 22nd International Conference on Very Large Databases (VLDB’96)’, Morgan Kaufmann, pp. 544–555.

Stankovski, V., Swain, M., Kravtsov, V., Niessen, T., Wegener, D., Kindermann, J. and Dubitzky, W.

(2008), ‘Grid-enabling data mining applications with DataMiningGrid: an architectural perspective’, Future Generation Computer Systems 24, 259–279.

Taylor, I. J., Deelman, E., Gannon, D. B. and Shields, M. (2007), Workflows for e-Science: Scientific Workflows for Grids, Springer, Secaucus, NJ.

Wang, J. T.-L., Zaki, M. J., Toivonen, H. and Shasha, D., eds(2005), Data Mining in Bioinformatics, Springer.

Woehrer, A., Brezany, P. and Tjoa, A. M. (2005), ‘Novel mediator architectures for grid information systems’, Future Generation Computer Systems 21 (1), 107–114.

W¨ohrer, A., Novakova, L., Brezany, P. and Tjoa, A. M. (2006), D3G: novel approaches to data statistics, understanding and pre-processing on the grid, in ‘IEEE 20th International Conference on Advanced Information Networking and Applications’, Vol. 1, Vienna, pp. 313–320.

Wolstencroft, K., Oinn, T., Goble, C., Ferris, J., Wroe, C., Lord, P., Glover, K. and Stevens, R. (2005), Panoply of utilities in Taverna, in ‘Proceedings of the First International Conference on e-Science and Grid Computing (E-SCIENCE’05)’, IEEE Computer Society, Washington, DC, pp. 156–162.

4

ADaM services: Scientific data