

Moustafa Ghanem, Vasa Curcin, Patrick Wendel and Yike Guo

8.3 Architecture for Discovery Net

8.3.10 Prototyping and production clients

The user (the workflow builder) is presented with a client tool for prototyping. The interface allows her to compose services, execute them, cache temporary results, visualize intermediate and final data products, and modify particular parameters of the workflow either at design time or while its execution is suspended (i.e. when user interaction is requested during execution). Once the workflow is complete, the client interface also allows the user to define how the workflow should be published, i.e. turned into either a Web-based application for domain experts or a Web/grid service for integration into a service-oriented architecture.

The workflows are then transformed into workflow-based services. These published services can in turn be reused from the workflow client as activities in their own right, as the sketch below illustrates.
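As a rough illustration of this prototype-then-publish cycle, here is a small, self-contained Java sketch that mimics composing activities, executing the graph, re-parameterizing an activity and publishing the result. Every class and method name in it is hypothetical, invented for illustration; this is not the actual Discovery Net client API.

import java.util.*;

// Hypothetical sketch of the prototyping client's programmatic equivalent.
// None of these types exist in Discovery Net; they only illustrate the cycle
// described above: compose, execute, re-parameterize, publish.
public class PrototypingSketch {
    static class Activity {
        final String name;
        final Map<String, String> params = new HashMap<>();
        Activity(String name) { this.name = name; }
    }

    static class Workflow {
        final List<Activity> activities = new ArrayList<>();
        final Map<Activity, Activity> edges = new LinkedHashMap<>();
        Activity add(String name) { Activity a = new Activity(name); activities.add(a); return a; }
        void connect(Activity from, Activity to) { edges.put(from, to); }
        void execute() { System.out.println("executing " + activities.size() + " activities"); }
        void publishAsService(String id) { System.out.println("published as service: " + id); }
    }

    public static void main(String[] args) {
        Workflow wf = new Workflow();
        Activity load = wf.add("LoadTable");
        Activity filter = wf.add("FilterRows");
        filter.params.put("predicate", "score > 0.9");
        wf.connect(load, filter);

        wf.execute();                                    // prototype run
        filter.params.put("predicate", "score > 0.8");   // re-parameterize at design time
        wf.execute();                                    // iterate until satisfied

        wf.publishAsService("my-analysis");              // expose for reuse as an activity
    }
}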


8.4 Data management

Figure 8.8 shows the data management system of Discovery Net. It is designed to support persistence and caching of intermediate data products, and to support scalable workflow execution over potentially large data sets using remote compute resources. The processing of the data is often performed outside the system. Apart from libraries such as Weka (Witten et al., 1999) or Colt [3], which process data directly in Java and can therefore be used for tighter integration, data are processed externally, using local execution with streams or temporary files, socket communication (such as the communication with an R server [4]) or the SSH protocol for secure communication.

Figure 8.8 Data management in execution

[3] http://dsd.lbl.gov/~hoschek/colt/
[4] http://www.r-project.org/
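To make the external-execution pattern concrete, here is a minimal, self-contained Java sketch of the temporary-file variant: an intermediate data product is written to a temporary file, an external executable is run over it, and the result is read back. The sort command merely stands in for an arbitrary analysis executable; this is not the actual Discovery Net integration code.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ExternalActivity {

    // Stage rows to a temp file, run an external command over it,
    // and read the produced rows back; both files are cleaned up afterwards.
    public static List<String> run(List<String> rows) throws IOException, InterruptedException {
        Path in = Files.createTempFile("dnet-in-", ".txt");
        Path out = Files.createTempFile("dnet-out-", ".txt");
        try {
            Files.write(in, rows);
            Process p = new ProcessBuilder("sort", in.toString())
                    .redirectOutput(out.toFile())
                    .start();
            if (p.waitFor() != 0) {
                throw new IOException("external activity failed");
            }
            return Files.readAllLines(out);
        } finally {
            Files.deleteIfExists(in);
            Files.deleteIfExists(out);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(List.of("beta", "alpha", "gamma"))); // [alpha, beta, gamma]
    }
}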

An analytical workflow management system cannot persist every data product generated in the workflow, given that during prototyping only a fraction of the data is relevant and deserves longer-term storage. Storing a result back is a one-off event that occurs when the user is satisfied with the results. Long-term repositories therefore cannot always be used as a place to cache intermediate data sets; the means and location to stage them must be part of the workflow system itself.
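A minimal sketch of such a prototyping cache, assuming intermediate results are keyed by the producing activity and its parameter settings; the class and key scheme are invented for illustration and are not the actual Discovery Net mechanism.

import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class IntermediateCache {
    private final Map<String, Object> cache = new ConcurrentHashMap<>();

    // Keyed by activity plus its (sorted) parameters: re-running an unchanged
    // activity is free, while re-parameterization computes a fresh product.
    public Object getOrCompute(String activityId, Map<String, String> params,
                               Supplier<Object> compute) {
        String key = activityId + "|" + new TreeMap<>(params);
        return cache.computeIfAbsent(key, k -> compute.get());
    }

    // Cached products are transient; storing a result back for the long term
    // is the separate, one-off event described above.
    public void clear() {
        cache.clear();
    }
}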

During workflow prototyping we cannot assume a particular, fixed data model, since the workflow itself is not yet fixed: re-parameterization or changes to the graph structure may change the schemas of these tables. Storing these intermediate data products therefore does not follow the same pattern as storing tables in an RDBMS, and instead has the following characteristics (a sketch of a matching data structure follows the list).

• Many temporary and unrelated tables need to be stored, and there is no notion of a database schema or of referential integrity between these tables.

• For the purpose of caching, any data generated need to be preserved for the duration of the cache. Thus no intermediate data product is modified; each is created and populated exactly once.

• The pattern of access is mainly based on scanning the entire data set at least once, for importation, exportation, loading into memory or deriving new attributes. Indexing and optimized querying are not essential, given that this model does not aim to replace a database.

• The ordering of attributes and instances may be relevant and needs to be preserved. For instance, time series may assume a range of attributes in a particular order, or that the data instances themselves are ordered. In general, data mining algorithms can be affected by the ordering of the data, and it is important to preserve it for the process to be deterministic.
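A minimal sketch of a table structure with these characteristics, write-once, scan-oriented and order-preserving, with no schema shared across tables; it assumes a simple in-memory representation and is not the actual Discovery Net data set implementation.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class IntermediateTable implements Iterable<List<Object>> {
    private final List<String> attributes;                       // attribute order is preserved
    private final List<List<Object>> rows = new ArrayList<>();   // row order is preserved
    private boolean sealed = false;                              // write-once: immutable after sealing

    public IntermediateTable(List<String> attributes) {
        this.attributes = List.copyOf(attributes);
    }

    public void append(List<Object> row) {
        if (sealed) throw new IllegalStateException("table is write-once");
        if (row.size() != attributes.size()) throw new IllegalArgumentException("arity mismatch");
        rows.add(List.copyOf(row));
    }

    public void seal() { sealed = true; }                        // populated exactly once, then read-only

    @Override
    public Iterator<List<Object>> iterator() {                   // access is by full scan
        if (!sealed) throw new IllegalStateException("table is still being populated");
        return rows.iterator();
    }

    public static void main(String[] args) {
        IntermediateTable t = new IntermediateTable(List.of("gene", "score"));
        t.append(List.of("BRCA1", 0.93));
        t.append(List.of("TP53", 0.88));
        t.seal();
        for (List<Object> row : t) System.out.println(row);
    }
}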

The processing thus bears more resemblance to the way files and documents are usually processed by analysis executables in workflows, which create new output files instead of modifying the input ones, than to the way data are modified in a relational database through declarative queries. However, relying on text files to hold the data is not practical either: they cannot easily handle binary data or documents, and the cost of parsing the data in and out would be too high.

The other issue is the execution model over these data sets. Data sets may need to be used not only for caching and storage but also for streaming data instances, in order to model pipeline parallelism in the workflow, particularly when the processing of these data sets is performed at a remote location. In other words, to the end-user it is natural that the data sets should act simultaneously as a data type and a communication channel between activities.
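A minimal sketch of this dual role: a bounded queue acts as the communication channel, so a producing activity and a consuming activity overlap in time instead of materializing the whole data set first. The names and the end-of-stream convention are invented for illustration.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineDemo {
    private static final String END = "<end-of-stream>";   // sentinel closing the channel

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> channel = new ArrayBlockingQueue<>(16);

        // Producer activity: streams data instances into the channel.
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 100; i++) channel.put("row-" + i);
                channel.put(END);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        // Consumer activity: starts processing before the producer finishes.
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String row = channel.take();
                    if (row.equals(END)) break;
                    System.out.println("processed " + row);  // downstream transformation here
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
    }
}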

While this unified view is partly a usability requirement, in the context of data-flow modelling of analysis workflows, having a common model introduces two important changes. First, activities can output new relations derived from their input, by adding, removing or reordering attributes, with only the modifications being stored, for efficiency. This pattern is preferred to the simpler pattern of creating entirely new output, as it makes it easier to maintain data record provenance through activities. Second, table-based operations, such as relational operators, aggregation functions or data mining algorithms, can be used in the workflow regardless of whether the preceding components are streaming or not. Hence streaming and batch components can be freely combined, and the component semantics determines the behavior at runtime.
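A minimal sketch of the first change: a derived relation that materializes only the attribute it adds and refers back to its input for everything else, which both saves space and keeps the link to the parent records. Names are illustrative, not the actual implementation.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class DerivedColumn {
    private final List<Map<String, Object>> parent;   // input relation, held by reference
    private final String name;                        // the single added attribute
    private final List<Object> values = new ArrayList<>();

    public DerivedColumn(List<Map<String, Object>> parent, String name,
                         Function<Map<String, Object>, Object> f) {
        this.parent = parent;
        this.name = name;
        for (Map<String, Object> row : parent) values.add(f.apply(row));  // store only the delta
    }

    // A row of the derived relation: the parent's attributes plus the new one,
    // so record provenance follows directly from the shared parent row.
    public Map<String, Object> row(int i) {
        Map<String, Object> r = new LinkedHashMap<>(parent.get(i));
        r.put(name, values.get(i));
        return r;
    }
}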

Therefore the main features of Discovery Net data sets are the following.

(a) Temporary and short-term persistence for exploring and caching data products during the workflow prototyping phase.

(b) Streaming of data instances, to enable pipeline parallelism either in memory with no additional I/O cost or over a file-based representation.

(c) Ad hoc integration of information from each activity by creating temporary relations without requiring pre-defined schemas in external databases.

It is important to note that the Discovery Net data set is not a replacement for traditional data management systems such as relational database management systems, data warehouses and other long-term data storage mechanisms. Rather, it acts as a data structure for staging the data to be processed in the workflow.