
Operations for workflow construction

Dennis Wegener and Michael May


In the following, the components of the DataMiningGrid Workflow Editor supporting various data mining scenarios are described.

A component inside a Triana workflow is called a unit. Each unit, which can be seen as a wrapper component, refers to special operations. The Triana units are grouped in a treelike structure. In the user interface, units are split into several subgroups referring to their functionality, e.g. applications, data resources, execution, provenance and security.

The generic job template is a component of the DataMiningGrid system supporting the user in the construction of workflows and in the specification of grid jobs. The template consists of four different groups of units for application exploration, data selection, application control and execution.

10.3 Operations for workflow construction

A workflow in a data mining application is a series of operations consisting of data access and preparation, modeling, visualization, and deployment. Such a workflow can typically be seen as a series of data transformations, starting with the input data and having a data mining model as final output. In a distributed environment, additional operations are related to splitting a data mining task into sub-problems that can be computed in a distributed manner, and selecting an appropriate distribution strategy. This section describes the facilities the DataMiningGrid offers for defining workflows.

Workflows in Triana are data driven. Control structures are not part of the workflow language but are handled by special components – e.g., looping and branching are handled by specific Triana units and are not directly supported by the workflow language (Shields, 2007; Taylor et al., 2007).

In the DataMiningGrid system the client-side components are implemented as extensions to the Triana Workflow Editor and Manager. The workflow can be constructed by using (a) the standard units provided by Triana and (b) the DataMiningGrid extensions. By using and combining these units, workflows performing many different operations can be defined.

10.3.1 Chaining

A typical data mining task spans a series of steps from data preparation to analysis and visualization that have to be executed in sequence. Chaining is the concatenation of the execution of different algorithms in a serial manner. The result data of an algorithm execution are used by the next execution in the series.

Concatenation of analysis tasks is achieved by using multiple generic job templates inside the Triana workflow. Each generic job template represents a single analysis step with a (not necessarily different) application. The output of the previous task can be used as input for the next task. Different tasks can run on different machines. It is up to the user to decide whether it makes sense to run a second task on the same or on a different machine, which could mean transferring the results of the first task over the network.
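The chaining pattern can be sketched in a few lines of Python. The names below (run_job, chain) are hypothetical stand-ins, not part of the DataMiningGrid API; the sketch only mirrors the data flow in which each step's result URI becomes the next step's input:

```python
def run_job(application: str, input_uri: str) -> str:
    """Stand-in for one generic job template execution.

    Returns a URI referencing the location of the results in the grid.
    """
    return f"grid://results/{application}?src={input_uri}"

def chain(applications, initial_uri):
    """Execute a series of applications; each consumes its predecessor's output."""
    uri = initial_uri
    for app in applications:
        uri = run_job(app, uri)  # output of one step is input to the next
    return uri

final_uri = chain(["preprocess", "train", "visualize"], "grid://data/input.arff")
```

In a real workflow each step could be scheduled on a different machine; the file-based URI hand-off is what makes that possible.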

10.3.2 Looping

Looping means the repeated execution of an algorithm. The algorithm can run on the same (or different) data or with the same (or different) parameter settings until a condition is fulfilled.

The DataMiningGrid system provides different ways of performing loops. Triana contains a Loop unit that controls repeated execution of a subworkflow (Shields, 2007; Taylor et al., 2007). Additionally, it provides a loop functionality when grouping units. The DataMiningGrid components do not directly provide loops but parameter sweeps (see Section 10.3.6). Depending on the kind of loop to be performed, one or more of these choices are possible.
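As an illustration of the loop semantics, the following sketch repeats a stand-in algorithm execution until a condition is fulfilled. Both run_once and loop_until are hypothetical names, not Triana or DataMiningGrid components:

```python
def run_once(error: float) -> float:
    """Stand-in for a single algorithm execution; here it simply halves an error."""
    return error / 2

def loop_until(initial_error: float, threshold: float, max_iterations: int = 10):
    """Repeat execution until the error drops below a threshold,
    analogous to a Loop unit controlling a subworkflow."""
    error = initial_error
    iterations = 0
    while error >= threshold and iterations < max_iterations:
        error = run_once(error)
        iterations += 1
    return iterations, error

iterations, error = loop_until(1.0, 0.1)
```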

10.3.3 Branching

Branching means letting a program flow follow several parallel execution paths from a specified point, or follow a particular execution path depending on a condition.

In a workflow there are different possibilities for branching without a condition. A user could, for example, divide the workflow into two (or more) branches after the execution of an application (e.g. pre-processing) is finished and start the parallel execution of other applications. After each application execution the Execution unit returns a URI as a reference to the results’ location in the grid. As inputs and outputs of grid jobs are file based, they can be accessed at any later step of the workflow. To set up a branch, the GridURI output node of the Execution unit has to be cloned. Analogously, branching can be performed at various steps of the workflow (provided the unit at which the branches start has a clonable output node). In addition, Triana contains a Duplicator unit that enables workflow branching (Taylor et al., 2007). This unit can be used if a unit has no clonable output node; its function is to duplicate any object received at its input node.

Triana also provides a unit supporting conditional branching, which is designed for executing a designated part of the workflow dependent on a condition (Shields, 2007; Taylor et al., 2007).
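The role of the Duplicator unit can be illustrated as follows. The names duplicate and run_branch are hypothetical stand-ins that only mirror how the same output object is handed to parallel branches:

```python
def duplicate(value, n: int):
    """Analogue of Triana's Duplicator unit: hand the same object to n branches."""
    return [value] * n

def run_branch(application: str, input_uri: str) -> str:
    """Stand-in for executing one branch of the workflow."""
    return f"grid://results/{application}?src={input_uri}"

# After pre-processing, branch into two independent analyses.
preprocessed = "grid://results/preprocess?src=grid://data/input.arff"
branch_inputs = duplicate(preprocessed, 2)
results = [run_branch(app, uri)
           for app, uri in zip(["clustering", "classification"], branch_inputs)]
```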

10.3.4 Shipping algorithms

Shipping of algorithms means sending the algorithm to the machine where the data it operates on are located. This is one of the major options in setting up a distributed data mining application. It is required when, as is often the case, no pre-configured pool of machines is available that already has the data mining functionality installed. The option to ship algorithms to data allows for flexibility in the selection of machines and reduces the overhead in setting up the data mining environment. This is especially important when the data naturally exist in a distributed manner and it is not possible to merge them. This may be the case when data sets are too large to be transferred without significant overhead or when, e.g., security policies prevent the data being moved.

The execution system of the DataMiningGrid allows algorithms to be shipped. The executable file that belongs to the application is transferred to the execution machine at each application execution. If the algorithm is to be shipped to the data to avoid copying files among different sites, the machine where the data are located has to be selected as the execution machine.

If a job is submitted to a Grid Resource Allocation Manager (GRAM)2 that is connected to a Condor cluster, it is not possible to specify exactly on which machine of the cluster the job should run. By selecting the appropriate execution mode, the job can be submitted and processed either on the GRAM itself or on the cluster, where the clusterware is responsible for the resource management.
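The shipping decision can be sketched as follows. The lookup table and function names are assumptions made for illustration; in the real system the user selects the execution machine through the client:

```python
# Hypothetical mapping from data URIs to the machines hosting them.
DATA_HOSTS = {"grid://data/patients.csv": "node-a.example.org"}

def submit_job(executable: str, input_uri: str, machine: str) -> dict:
    """Stand-in for job submission: the executable is transferred to `machine`."""
    return {"executable": executable, "input": input_uri, "machine": machine}

def ship_algorithm_to_data(executable: str, input_uri: str) -> dict:
    """Ship the algorithm to the data: run on the machine where the data reside."""
    machine = DATA_HOSTS[input_uri]  # no data transfer between sites needed
    return submit_job(executable, input_uri, machine)

job = ship_algorithm_to_data("j48-wrapper.jar", "grid://data/patients.csv")
```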

10.3.5 Shipping data

Shipping of data means sending the data to the machine where the algorithm that processes them is running. This operation is important in cases where either the full data set is partitioned to facilitate distributed computation (e.g. application of k-NN) or the same source data set is moved to different machines and repeatedly analysed, for instance for ensemble learning.

2A GRAM enables the user to access the grid in order to run, terminate and monitor jobs remotely.

Figure 10.2 Parameter sweep – parameter loops and lists can be specified in the GUI

Each time an application is executed in the DataMiningGrid environment the input data (consisting of one or more files) for the application are copied to a local work directory on the execution machine (when using the DataMiningGrid unit built-in functionality). If only the data and not the algorithm are to be shipped (e.g. for copyright reasons) the machine can be specified where the job should run (see Section 10.3.4).
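The staging of input data into a local work directory can be sketched as follows; stage_inputs is a hypothetical stand-in for the built-in functionality, not part of the actual system:

```python
import pathlib
import shutil
import tempfile

def stage_inputs(input_files):
    """Copy the input files into a fresh local work directory on the
    execution machine before the application runs (illustrative sketch)."""
    work_dir = pathlib.Path(tempfile.mkdtemp(prefix="dmg-job-"))
    for source in input_files:
        # Each input file is copied under its own name into the work directory.
        shutil.copy(source, work_dir / pathlib.Path(source).name)
    return work_dir
```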

10.3.6 Parameter variation

Parameter variation means executing the same algorithm with different input data and different parameter settings. Setting up a parameter variation as a distributed computation requires shipping the data and the algorithms.

The DataMiningGrid system provides the possibility of using parameter sweeps, which means that a loop (for numeric values) or a list (for any data type) can be specified for each option of the application as well as a loop for the input files or directories (see Figure 10.2, showing parts of the GUI). With this approach it is possible to submit hundreds of jobs at the same time.

When performing a sweep the system handles the naming of the result files automatically to ensure that results are not overwritten when they are collected in the result directory.
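A parameter sweep of this kind boils down to a Cartesian product of parameter loops, value lists and input files, with a unique result name per job. The sketch below is an illustration under assumed integer loop semantics, not the system's actual sweep implementation:

```python
from itertools import product

def sweep_jobs(loops, lists, input_files):
    """Expand numeric loops, value lists and input files into one job
    description per combination, each with a unique result name."""
    names, axes = [], []
    for name, (start, stop, step) in loops.items():
        names.append(name)
        axes.append(list(range(start, stop + 1, step)))  # integer loop for clarity
    for name, values in lists.items():
        names.append(name)
        axes.append(values)
    names.append("input")
    axes.append(input_files)
    jobs = []
    for index, combination in enumerate(product(*axes)):
        job = dict(zip(names, combination))
        job["result"] = f"result_{index}"  # unique name: results are never overwritten
        jobs.append(job)
    return jobs

jobs = sweep_jobs({"k": (1, 3, 1)}, {"metric": ["euclidean", "manhattan"]}, ["a.arff"])
# 3 values of k x 2 metrics x 1 input file = 6 jobs
```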

10.3.7 Parallelization

Parallelization means distributing subtasks of an algorithm and executing them in parallel.

The system supports the concurrent execution of applications, e.g. by performing a parameter sweep (Section 10.3.6) or by executing workflow branches in parallel (Section 10.3.3).

In general, parallelization of a single algorithm is difficult on the grid because of the communication overheads involved. The QosCosGrid3 project is currently investigating whether grids can be used for parallel computing purposes. Thus, the DataMiningGrid environment does not support the parallelization of the execution of a single algorithm. If the application itself takes care of the parallelization process, then an integration is possible.

10.4 Extensibility

Data mining tasks are highly diverse. In many cases the functionality already provided by a system will not be sufficient for the task at hand. This makes extensibility an important feature.

3QosCosGrid Web site: www.QosCosGrid.com

In the context of the DataMiningGrid system, there are two types of extensibility: extensibility on the client side (local extensibility) and extensibility of the grid environment, e.g. through the inclusion of new grid-enabled applications.

Local extensibility A local extension of the DataMiningGrid system is a client-side extension. The Triana workflow editor, which is the main client of the system, can easily be extended with new workflow components. Such components – implemented as Triana units – could, for instance, be viewer or inspection components, which may or may not directly interact with the grid environment.

Extensibility of the grid environment The requirement for extensibility of the grid environment entails the following:

• Extensibility without platform modification. A data miner who wants to make use of a grid-enabled data mining platform typically does not have any knowledge about the details of the underlying system. Therefore, he or she does not want to change – and might not even be capable of changing – any components or program code of the data mining platform. Additionally, the data miner may not want to be dependent on a grid platform developer.

• Extensibility without algorithm modification. The data miner can reuse his or her favorite algorithms and combine them with third-party methods to address complex data mining tasks.

New applications can be grid-enabled with the DataMiningGrid system without platform or algorithm modification. In the following, a simple scenario of extending the system with a decision tree application is described.

The grid-enabling process requires an executable file that contains the application and an application description file (an instance of the ADS). In this example, the Weka-J4.8 decision tree algorithm from Weka v3.4.3 was grid-enabled. The Weka implementation of the algorithm does not use flag-value pairs for specifying command-line calls, so a wrapper component that translates the command-line call into the required format is needed for the integration. The sources together with the wrapper were packaged into an executable jar file.

The application description file has to be created according to the ADS and contains information about the application that is, e.g., used within a grid-wide application registry. It contains the application name, group and domain and the technique used, as well as data about the executable file and the class to call, information about the application’s options, inputs and outputs, and system requirements. The data input for the J4.8 application is a file on which the training is performed; the output is the result file Weka creates. The ADS can be provided either manually or by using the DataMiningGrid Application Enabler, a wizard-style web page that guides the user through the process of grid-enabling. Once the executable file and the application description file are made available in the grid environment, the application can be used with the client-side components.
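To illustrate the kind of information such a description file carries, the following sketch models it as a Python dictionary. All field names and values beyond those stated in the text are illustrative assumptions and do not reproduce the actual ADS format:

```python
# Hypothetical sketch of the information an ADS instance carries for the
# grid-enabled Weka J4.8 application.
ads = {
    "name": "Weka-J4.8",
    "technique": "decision tree",
    "executable": {
        "file": "weka-j48-wrapper.jar",  # jar packaging the sources plus wrapper
        "class": "Wrapper",              # hypothetical class to call
    },
    "options": [
        {"flag": "-C", "description": "confidence factor"},  # illustrative option
    ],
    "inputs": [{"name": "training-data", "type": "file"}],   # file used for training
    "outputs": [{"name": "result", "type": "file"}],         # result file Weka creates
    "requirements": {"java": "any"},     # illustrative system requirement
}
```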

Using this approach, we have grid-enabled a number of new applications. In particular, these are the algorithms for the scenarios used in the case studies (see Section 10.5) as well as a number of helper applications responsible for input file and result processing.

For example, in the k-NN scenario (see Section 10.5.2) there is a need for two additional applications that process the input and output of the k-NN application in order to split the input