
Discovery Net system

Moustafa Ghanem, Vasa Curcin, Patrick Wendel and Yike Guo


Figure 8.1 Simple workflow: data from a Web source and an RDBMS is staged, pre-processed and integrated (data-parallel), split into training and test sets, then modelled, validated and stored

way of describing and managing distributed data mining tasks – data integration, processing and analysis. Moreover, each of the different steps in the workflow can be implemented as a remote service using grid resources. The advantage is that users of the workflow system are shielded from the complexities of dealing with the underlying grid infrastructure and can focus on the logic of their data mining tasks. Accessing the remote services and managing data transfers between them become the responsibility of the workflow system. We refer to workflows handled by such a system as analytical workflows, since their key feature is managing data analysis tasks.

8.2 Discovery Net system

The Discovery Net system has been designed around an analytical workflow model for integrating distributed data sources and analytical tools within a grid computing framework. The system was originally developed as part of the UK e-science project Discovery Net (2001–2005) (Rowe et al., 2003) with the aim of producing a high-level, application-oriented platform, focused on enabling end-user scientists to derive new knowledge from devices, sensors, databases, analysis components and computational resources that reside across the Internet or grid.

Over the years, the system has been used in a large number of scientific data mining projects in both academia and industry. These include life sciences applications (Ghanem et al., 2002, 2005; Lu et al., 2006), environmental monitoring (Richards et al., 2006) and geo-hazard modelling (Guo et al., 2005). Many of the research ideas developed within the system have also been incorporated within the InforSense KDE system, a commercial workflow management and data mining system that has been widely used for business-oriented applications. A number of extensions have also been based on the research outputs of the EU-funded SIMDAT project.

8.2.1 System overview

Figure 8.2 provides a high-level overview of the Discovery Net system. The system is based on a multi-tier architecture, with a workflow server providing a number of supporting functions needed for workflow authoring and execution, such as integration and access

1http://www.inforsense.com/

2http://www.simdat.org/

Figure 8.2 Discovery Net concept: a workflow server provides collaboration, resource integration, data management and workflow management functions over data sources, grid resources and services; workflows can be deployed as portal applications, Web services, grid services, shell scripts, etc.

to remote computational and data resources, collaboration tools, visualizers and publishing mechanisms.

A key feature of the system is that it is targeted at domain experts, i.e. scientific and business end users, rather than at distributed and grid computing developers. These domain experts can develop and execute their distributed data mining workflows through a workflow authoring client, where users can drag and drop icons representing the task nodes and connect them together. The workflows can also be executed from specialized Web-based client interfaces.

Although generic, the three-tier architecture model presented in Figure 8.2 serves as a consistent and simple abstraction throughout the different versions of the Discovery Net system.

The implementation of the system itself has evolved over the past few years from a prototype targeted at specific projects to an industrial-strength system widely used by commercial and academic organizations.

8.2.2 Workflow representation in DPML

Within Discovery Net, workflows are represented and stored using Discovery Process Markup Language (DPML) (Syed, Ghanem and Guo, 2002), an XML-based representation language for workflow graphs supporting both a data flow model of computation (for analytical workflows) and a control flow model (for orchestrating multiple disjoint workflows).

Within DPML, each node in a workflow graph represents an executable component (e.g. a computational tool or a wrapper that can extract data from a particular data source). Each component has a number of parameters that can be set by the user, as well as a number of input and output ports for receiving and transmitting data, as shown in Figure 8.3. Each directed edge in the graph represents a connection from an output port (the tail of the edge) to an input port (the head of the edge). A port is connected if there are one or more connections from or to that port.

In addition, each node in the graph provides metadata describing the input and output ports of the component, including the type of data that can be passed to the component and parameters


Figure 8.3 Structure of a component in Discovery Net: input and output ports carrying data and metadata, together with user-settable parameters

of the service that a user might want to change. Such information is used for the verification of workflows and to ensure meaningful chaining of components. A connection between an input and an output port is valid only if the types are compatible, which is strictly enforced.
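To illustrate how such type checking can work, the following sketch shows connection verification in a typed workflow graph. The class and port names are our own for illustration, not the actual DPML schema or Discovery Net API:

```python
# Minimal sketch of typed-port connection verification, in the spirit of DPML.
# All names (Port, Component, connect, the type strings) are hypothetical.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Port:
    name: str
    datatype: str                 # e.g. "RelationalTable", "GeneSequenceList"

@dataclass
class Component:
    name: str
    inputs: list
    outputs: list
    parameters: dict = field(default_factory=dict)  # user-settable parameters

def compatible(out_port: Port, in_port: Port) -> bool:
    """A connection is valid only if the port types are compatible."""
    return out_port.datatype == in_port.datatype

def connect(src: Component, out_name: str,
            dst: Component, in_name: str, edges: list):
    """Add an edge from an output port (tail) to an input port (head),
    rejecting it if the types do not match."""
    out_port = next(p for p in src.outputs if p.name == out_name)
    in_port = next(p for p in dst.inputs if p.name == in_name)
    if not compatible(out_port, in_port):
        raise TypeError(f"cannot connect {src.name}.{out_name} "
                        f"({out_port.datatype}) to {dst.name}.{in_name} "
                        f"({in_port.datatype})")
    edges.append((src.name, out_name, dst.name, in_name))

# Example: a data source feeding a cleaning component; the types match,
# so the connection is accepted at workflow-construction time.
edges = []
source = Component("WebSource", [], [Port("out", "RelationalTable")])
cleaner = Component("Cleaner", [Port("in", "RelationalTable")],
                    [Port("out", "RelationalTable")])
connect(source, "out", cleaner, "in", edges)
```

A mismatched connection (say, feeding a relational table into a port expecting gene sequences) would raise an error at construction time rather than at execution time, which is the point of strict enforcement.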

8.2.3 Multiple data models

One important aspect of a workflow system is which data models it supports. Discovery Net is based on a typed workflow language and therefore supports arbitrary data types. However, for the purpose of supporting data mining and other scientific applications, we included support for a relational data model, a bioinformatics data model for representing gene sequences and a stand-off markup model for text mining based on the Tipster architecture (Grishman, 1997).

Each model has an associated set of data import and export components, as well as specific visualizers, which integrate with the generic import, export and visualization tools already present in the system. As an example, chemical compounds represented in the widely used SMILES format can be imported inside data tables, where they can be rendered adequately using either a three-dimensional representation or a structural formula. The relational model also serves as the base data model for data integration, and is used for the majority of generic data cleaning and transformation tasks. Having a typed workflow language not only ensures that workflows can be easily verified during their construction, but also helps in optimizing the data management.

8.2.4 Workflow-based services

To simplify the process of deploying workflow-based applications, Discovery Net supports the concept of workflow services. A workflow-based service is a service derived from a workflow definition, exposed either as a Web/grid service or as a Web application front-end to that service.

Deploying a workflow as a service requires selecting which parameters of the components used within the workflow, and which input and output data elements, are to be exposed to the user of the service. It also requires specifying the properties of the user interface, if any, to be associated with the service. Such services are then created and registered dynamically, from the workflow building client, and do not necessitate any re-initialization of the various servers.
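The idea of exposing only selected parameters can be sketched as follows. This is a toy illustration under our own naming, not the Discovery Net deployment mechanism; the "workflow" here is a plain function over a settings dictionary:

```python
# Hypothetical sketch: deploy a workflow as a service by exposing only
# chosen parameters; everything else stays fixed at its default value.
def deploy_as_service(run, defaults, exposed):
    """run: callable taking a settings dict and executing the workflow;
    defaults: values for every internal parameter;
    exposed: service-level name -> internal parameter key."""
    def service(**kwargs):
        settings = dict(defaults)
        for name, key in exposed.items():
            if name in kwargs:          # override only exposed parameters
                settings[key] = kwargs[name]
        return run(settings)
    return service

# A toy "workflow" that filters values; only the threshold is exposed
# to service users, under the public name "cutoff".
def run(settings):
    return [x for x in settings["data"] if x >= settings["threshold"]]

svc = deploy_as_service(run,
                        defaults={"data": [1, 5, 9], "threshold": 0},
                        exposed={"cutoff": "threshold"})
svc(cutoff=4)   # -> [5, 9]
```

Callers of the service see only the exposed parameter names; the internal workflow structure, and any parameters not selected for exposure, remain hidden.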

8.2.5 Multiple execution models

Workflow execution within Discovery Net is delegated to the execution module on the workflow server, which executes the data flow and control flow logic of the workflow definition, invoking the execution of the different components and handling data transfer between them.

Figure 8.4 Abstract workflow graph (nodes A–H)

Each component receives input objects (data in data flow or tokens in control flow) on its input ports, manipulates such objects based on its own predefined implementation and then passes output objects along its output ports. In order to understand how a workflow graph is interpreted and executed by the server, we now describe briefly the execution semantics for Discovery Net workflows.

It should be noted that, although different workflow systems all work on graph definitions similar to those of Discovery Net, the meaning of these workflows and how they are executed may vary greatly. A comparison between the execution semantics of Discovery Net and other systems including Taverna (Hull et al., 2006), Triana (Taylor et al., 2005) and Kepler (Ludäscher et al., 2006) is described by Curcin et al. (2007). In our discussion below we make use of the simple workflow graph shown in Figure 8.4 to highlight the key concepts behind the workflow execution semantics for Discovery Net. The graph is generic, without assumptions about the nodes' functionality, and we shall use it to explain the data flow and control flow semantics of the system.

8.2.6 Data flow pull model

Discovery Net supports a data pull model for data flow graphs, with a workflow acting as an acyclic dependence graph. Within this model a user requests execution of one of the end-point nodes, e.g. G. This node can only execute if its input is available, so it requests an input from the preceding node F, which in turn requests inputs from its preceding nodes, and so on. Within this example both D and E are eligible for parallel execution if they represent computations executing on different resources; however, H will not be executed unless its output is requested explicitly by the user at a later time. Note also that the system supports caching of intermediate data at node outputs. This means that if at a later time the user requests the execution of H, it would be able to re-use the pre-cached output of F. This model is typically suitable for data mining workflows.
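The pull semantics described above can be sketched as a memoized, demand-driven traversal of the dependence graph. The executor class and the node functions below are our own illustration (using the node names of Figure 8.4), not the Discovery Net execution module:

```python
# Sketch of the data-pull execution model with caching of intermediate
# outputs at node outputs. Names and node functions are hypothetical.
class PullExecutor:
    def __init__(self, graph, compute):
        self.graph = graph        # node -> list of predecessor nodes
        self.compute = compute    # node -> function(list of inputs) -> output
        self.cache = {}           # node -> cached output

    def request(self, node):
        """Execute a node on demand; predecessors run only when needed,
        and previously computed outputs are re-used from the cache."""
        if node in self.cache:
            return self.cache[node]
        inputs = [self.request(p) for p in self.graph.get(node, [])]
        self.cache[node] = self.compute[node](inputs)
        return self.cache[node]

# D and E both feed F; F feeds G and H. Requesting G runs D, E, F and G,
# but H stays untouched until it is requested explicitly.
graph = {"F": ["D", "E"], "G": ["F"], "H": ["F"]}
compute = {
    "D": lambda ins: 2,
    "E": lambda ins: 3,
    "F": lambda ins: sum(ins),       # 2 + 3 = 5
    "G": lambda ins: ins[0] * 10,    # 5 * 10 = 50
    "H": lambda ins: ins[0] + 1,     # 5 + 1 = 6, re-using the cached F
}
ex = PullExecutor(graph, compute)
ex.request("G")   # -> 50
ex.request("H")   # -> 6, without re-running D, E or F
```

In a real deployment D and E could run in parallel once F demands both inputs; the sequential recursion here keeps the sketch minimal.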

8.2.7 Streaming and batch transfer of data elements

Discovery Net supports both streaming and batch transfers of data elements between components connected in a data flow graph. Note that, although metadata can be associated with the different input/output ports, specifying which data will be placed on the port (e.g. a list of protein sequences), the streaming behaviour of each component is not specified. There is no indication in the graph definition of whether a component will process list elements on its input or output ports one by one, or whether it should wait for the whole list to arrive before it is processed. This means that the same graph can be executed in either mode by the server, depending on the nature of the environment and the other components in the workflow. Whether streaming or batch mode is to be used is specified by the user when a workflow is submitted for execution.
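The key point that the same component logic can serve both modes can be shown with a small sketch. The function names and the driver loop are ours, purely for illustration:

```python
# Sketch: one component body, run either in batch mode (whole list at once)
# or in streaming mode (element by element), chosen at submission time.
def to_upper(batch):
    """Hypothetical component body: transforms a whole list of strings."""
    return [s.upper() for s in batch]

def run_batch(items, component):
    """Wait for the entire list to arrive, then process it in one call."""
    return component(list(items))

def run_streaming(items, component):
    """Forward elements one by one, wrapping each as a one-element list."""
    for item in items:
        yield component([item])[0]

data = ["a", "b", "c"]
run_batch(data, to_upper)                  # -> ['A', 'B', 'C']
list(run_streaming(iter(data), to_upper))  # -> ['A', 'B', 'C']
```

Both drivers produce the same result from the same component definition; only memory behaviour and latency differ, which is why the choice can be deferred to submission time rather than baked into the graph.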

8.2.8 Control flow push model

As opposed to data flow graphs, nodes within Discovery Net control flow graphs represent special constructs for manipulating execution tokens and controlling iteration and check point behaviour. These control flow graphs can be cyclic in their definition, and communication between them is based on passing tokens rather than on passing data elements. The execution of Discovery Net control flow graphs is based on a data push paradigm, where workflow components are invoked left to right and nodes with multiple output ports can decide on which port(s) they will place the tokens, hence determining the direction of further execution.
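The push paradigm, with a node choosing its output port(s), can be sketched as follows. The routing convention and all names here are hypothetical, chosen only to illustrate token-based control flow:

```python
# Sketch of the control-flow push model: tokens are pushed left to right,
# and a node with several output ports decides where to place its token.
def branch(token):
    """A hypothetical control node with two output ports; it routes the
    token to 'ok' or 'retry' depending on the token's status field."""
    port = "ok" if token["status"] == "done" else "retry"
    return {port: token}

def push(node, token, downstream):
    """Invoke the node, then push its output token(s) to whichever
    successors are wired to the chosen port(s)."""
    results = []
    for port, out_token in node(token).items():
        for successor in downstream.get(port, []):
            results.append(successor(out_token))
    return results

# Successors per port; in Discovery Net these would be embedded data flows.
downstream = {"ok": [lambda t: "archive"], "retry": [lambda t: "resubmit"]}
push(branch, {"status": "done"}, downstream)     # -> ['archive']
push(branch, {"status": "failed"}, downstream)   # -> ['resubmit']
```

Unlike the pull model, execution here starts from the token source and propagates forward, so the routing decision made at each node determines which part of the graph runs at all.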

8.2.9 Embedding

Since the strength of the workflow approach for data mining lies primarily in the use of data flows for structured application design, control flows in Discovery Net are used mainly to coordinate disconnected workflows whose ordering is driven by business logic, rather than an explicit data dependence. Therefore, control flow nodes contain within them data flows (deployed as services) that are invoked every time the node executes. This concept is shown in Figure 8.5.

Figure 8.5 Embedding of data flows inside control flows