
Taverna for bioinformatics experiments

Robert Stevens, Paul Fisher, Jun Zhao, Carole Goble and Andy Brass


always completely captured in an analysis, potentially leading to problems of scientific record (Zhao et al., 2004).

In a ‘human-driven’ approach, bioinformaticians carry out the data pipeline by manually transferring data between the different sources. They implement the housekeeping tasks as a set of ‘purpose-built scripts’ (Stein, 2002). These purpose-built scripts are brittle, as the data publishers often update their data representation formats. The functionality of these scripts is not always documented, and the tasks performed by the scripts are obscure. Finally, there is massive duplication of effort across the community in order to cope with this volatility (Stein, 2002).

Ideally, the materials and methods used in an experiment are recorded in a scientist’s laboratory book. However, some implicit steps, such as the housekeeping scripts, may be lost in the documentation. Also, given the repetitive nature of experiments and the volume of data they generate, manually collecting this provenance information is time consuming, error prone and not scalable.

In a ‘workflow’ approach, the applications from different providers are deployed as (Web) services in order to achieve a unified interface. The workflows make explicit the data passed between the experiment steps (i.e. the services), including any hidden steps, such as the housekeeping steps (Stevens et al., 2004). The experiment protocol documented in a workflow can be reused and repeatedly executed without duplicated effort (Stevens et al., 2004).
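The contrast between hidden housekeeping scripts and an explicit workflow can be sketched as follows. This is a minimal, hypothetical illustration — the step names and the demo data are invented, not part of any real Taverna workflow:

```python
# Hypothetical sketch: a workflow makes every step explicit, including
# the 'housekeeping' steps that ad hoc scripts would otherwise hide.

def fetch_sequence(accession):
    # Stand-in for a remote data source; a real step would call a service.
    return ">seq1 demo\nACGTACGT"

def strip_fasta_header(fasta):
    # A housekeeping step, recorded explicitly as part of the protocol.
    return "".join(line for line in fasta.splitlines()
                   if not line.startswith(">"))

def gc_content(sequence):
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

# The workflow is the explicit chain of steps, so the same protocol can
# be re-run without duplicated manual effort.
workflow = [fetch_sequence, strip_fasta_header, gc_content]

def run(workflow, datum):
    trace = []                      # record of each step, for the record
    for step in workflow:
        datum = step(datum)
        trace.append((step.__name__, datum))
    return datum, trace

result, trace = run(workflow, "X12345")
print(result)  # 0.5 for the demo sequence
```

Because every step, including the housekeeping conversion, appears in the chain, the `trace` doubles as a simple record of materials and methods.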

9.4 Taverna for bioinformatics experiments

Taverna (Oinn et al., 2006) is a workflow workbench that is targeted at bioinformaticians who need to deal with the data sources and services described above, including the issues associated with the bioinformatics landscape and the repetitive nature of bioinformatics experiments. To realize these goals, it provides a multi-tiered open architecture and an open-typing and data-centric workflow language, in addition to the following technologies.

• Shim services. Implement the housekeeping scripts that reconcile the type mismatches between the heterogeneous bioinformatics services (Hull et al., 2006).

• Iterations. Handle the collections of data products passed between services.

• Nested workflows. Enable modularized workflows. A nested workflow is a workflow that is treated as an experiment step and contained within another higher-level workflow.

• Fault tolerance. Helps bioinformaticians cope with fragile domain services through dynamic service substitution and retry (Oinn et al., 2006).

• Multiple service styles. Taverna can accommodate a variety of service styles. For example, as well as WSDL services, Taverna can use local Java services, beanshell scripts or R processes.

Taverna also provides mechanisms for users to develop their own services. The SoapLab software package allows command-line programmes to be wrapped as Web services, and Taverna is also able to collect and use (scavenge) Java APIs from a local computer. The extensible plugin framework of Taverna also enables new processor types to be developed, bringing an even wider range of services to bear.

• Provenance and data services. Collect both final and intermediate experiment data products and their provenance information (Stevens et al., 2004; Stevens, Zhao and Goble, 2007).
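The role of a shim service can be sketched concretely. The example below is hypothetical — the function and data are invented — but it shows the kind of format reconciliation a shim performs in place of a hand-written housekeeping script:

```python
# Hypothetical shim: one service emits a FASTA record, while the next
# service expects a bare sequence string. The shim reconciles this
# mismatch, playing the role of a purpose-built housekeeping script.

def fasta_to_raw_sequence(fasta_record):
    """Shim: strip the FASTA header line and line breaks."""
    lines = fasta_record.strip().splitlines()
    body = [line.strip() for line in lines if not line.startswith(">")]
    return "".join(body)

upstream_output = ">P12345 example protein\nMKT\nAYI"
print(fasta_to_raw_sequence(upstream_output))  # MKTAYI
```

Placed between two processors in a workflow, such a shim appears as an explicit, documented step rather than an undocumented script.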

Table 9.1 How Taverna addresses issues of the bioinformatics landscape

Issues                                            Taverna
------------------------------------------------  ----------------------------------
Heterogeneous data                                Use of open typing system
Heterogeneous interfaces for data access          Use of open typing system
Collection data products                          Iterations
Housekeeping scripts                              Shim services
Volatile resources and repetitive experiments     Explicit and repeatable workflows;
                                                  provenance and data services
Volatile and fragile services                     Fault tolerance
Amplified data products and their relationships   Provenance and data services

Table 9.1 summarizes how the issues caused by the bioinformatics data sources and experiments are addressed by Taverna’s design considerations and technologies. The autonomy of data sources and services is deliberately not managed by Taverna, in order to leave bioinformaticians with open choices. These designs and technologies are detailed in the following subsections.

9.4.1 Three-tiered enactment in Taverna

Taverna provides a three-tiered abstraction architecture, shown in Figure 9.1, in order to separate user concerns from operational concerns:

• The data-centric workflow language, Simplified Conceptual Unified Workflow Language (SCUFL) (Oinn et al., 2006), is at the abstraction layer that enables the users to build a workflow without writing a large, bespoke application.

• The run of a workflow is managed at the execution layer by the Freefluo engine following the execution semantics (Turi, 2006), and the experiment data products are also handled at this layer by the Baclava data model. The Freefluo enactor dispatches execution details to the Event Listener plug-in during critical state changes, such as WorkflowRunCompletion, ProcessRunCompletion etc. This enables provenance to be automatically collected on behalf of users by an enactor-centric provenance collection architecture.

Figure 9.1 The three-tiered abstraction architecture of Taverna. The processor invocation layer provides access to a variety of third-party services. The middle execution flow layer co-ordinates use of these services. The application data-flow layer (top layer) allows the user to describe how these services are to be used

• The invocation of different families of services is managed by the Freefluo workflow enactor at the invocation layer. A set of Processor plug-ins is defined in this layer in order to invoke the domain services of heterogeneous styles (Oinn et al., 2006).
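The enactor-centric provenance collection described above can be sketched as an enactor that dispatches state-change events to registered listeners. The two event names come from the text; everything else in this sketch (class names, the toy workflow) is hypothetical:

```python
# Sketch of enactor-centric provenance collection. The event names
# WorkflowRunCompletion and ProcessRunCompletion are from the text;
# the class and method names here are invented for illustration.

class ProvenanceListener:
    """Records the state-change events dispatched by the enactor."""
    def __init__(self):
        self.events = []

    def on_event(self, event, detail):
        self.events.append((event, detail))

class Enactor:
    def __init__(self, listeners):
        self.listeners = listeners

    def dispatch(self, event, detail):
        for listener in self.listeners:
            listener.on_event(event, detail)

    def run(self, workflow, datum):
        for step in workflow:
            datum = step(datum)
            self.dispatch("ProcessRunCompletion", step.__name__)
        self.dispatch("WorkflowRunCompletion", datum)
        return datum

listener = ProvenanceListener()
enactor = Enactor([listener])
enactor.run([str.upper, str.strip], "  acgt  ")
# listener.events now holds the provenance trail, collected on the
# user's behalf rather than recorded by hand.
```

Because collection happens inside the enactor's dispatch loop, the user's workflow design is untouched — the separation of layers is what makes automatic provenance capture possible.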

This abstraction architecture enables bioinformaticians to design workflows without worrying about the underlying service executions. Taverna’s openness to heterogeneous domain data and services is underpinned by both the flexible Processor plug-ins and the open-typing Taverna data models.

9.4.2 The open-typing data models

Neither the SCUFL model for defining a Taverna workflow nor the Baclava model for managing the data products constrains the types of data passed between services or their invocations.

The SCUFL model

The SCUFL model contains the following components.

• A set of input ports (source ports) and output ports (sink ports) for a workflow and for a processor. Each port is associated with free text descriptions and its MIME type. The MIME types describe the data format for transport between services, such as text/plain, and they say little about the type of the data. This is very much a reflection of the bioinformatics data landscape described earlier.

• A set of processors, each representing an individual step within a workflow. A processor specification specifies the ‘style’ of the processor, such as a SoapLab service or a NestedWorkflow.

• A set of data links that link a source port with a sink port. This reflects SCUFL as a data-flow-centric workflow language.

• A set of co-ordination links that link two processors together and enable additional constraints on the behaviour of these linked processors. For instance, a processor B will not process its data until processor A completes.

• A set of optional fault tolerance policies, which either provide alternative services to replace a failed service or specify the number of times a failed service invocation should be re-tried before finally reporting failure.

• A set of optional iteration configurations that specify the behaviour of a processor when it receives an actual data product that mismatches the data cardinality of its port. For example, if a port expects the input as a ‘string’ but actually receives a ‘list’, the implicit iteration in Taverna handles this mismatch (Oinn et al., 2006).
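The SCUFL components listed above can be mirrored in a small object model. This is a hypothetical sketch, not the real SCUFL schema — the class names, port naming convention and example processors are all invented:

```python
# A minimal, hypothetical model of the SCUFL components described above:
# processors with a 'style', input/output ports, data links between
# ports, and a co-ordination link that makes one processor wait for
# another. Not the real SCUFL schema.

from dataclasses import dataclass, field

@dataclass
class Processor:
    name: str
    style: str                                    # e.g. 'SoapLab', 'NestedWorkflow'
    inputs: list = field(default_factory=list)    # sink port names
    outputs: list = field(default_factory=list)   # source port names

@dataclass
class Workflow:
    processors: dict = field(default_factory=dict)
    data_links: list = field(default_factory=list)    # (source port, sink port)
    coord_links: list = field(default_factory=list)   # (upstream, downstream)

    def add(self, proc):
        self.processors[proc.name] = proc

    def link(self, source_port, sink_port):
        # Ports carry MIME types, not data types, so no type check here.
        self.data_links.append((source_port, sink_port))

wf = Workflow()
wf.add(Processor("blast", "SoapLab", inputs=["sequence"], outputs=["report"]))
wf.add(Processor("parse", "beanshell", inputs=["report"], outputs=["hits"]))
wf.link("blast.report", "parse.report")
wf.coord_links.append(("blast", "parse"))  # parse waits for blast
```

Note that, as in SCUFL itself, nothing in this model constrains what flows along a data link — only the wiring between ports is recorded.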

Data typing constraints are associated with a port neither in the SCUFL model nor in the Baclava data model.

The Baclava data model

The Baclava model manages the data products, or data entities, generated during workflow executions. This model describes a data entity’s data structure and its MIME type. This information comes from the data entity’s corresponding port defined in the SCUFL model, e.g. the input of a particular processor.

Figure 9.2 A BLAST report, which contains a collection of sequence records, is regarded as an atomic data product in the Baclava model. However, it is a collection at the domain level

A data entity can be either atomic or a collection. The Baclava model describes a data entity at the transportation level, to help the enactor handle the implicit iterations, rather than at the domain level. This means the internal structure of a DataThing, whether atomic or a collection, is not exposed in the Baclava model, so that Taverna can accommodate any services presented by the bioinformatics community (Oinn et al., 2006).

A BLAST report is a typical example illustrating this domain-independent Baclava model. A BLAST report, as shown in Figure 9.2, is produced by the BLAST service (Altschul et al., 1990), which searches for similar sequences in a sequence database. At the domain level, this BLAST report is a ‘collection’, as it contains a collection of pairwise sequence alignments and associated links from the target sequence database. These pairwise sequence alignment records are, however, not exposed in the Baclava model. Rather, such a BLAST report is treated as an ‘atomic’ item by Baclava when it is transferred between services during workflow execution.
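The implicit iteration that this transport-level view supports can be sketched as follows. The sketch is hypothetical — the dispatch function and the toy service are invented — but it shows the cardinality-mismatch behaviour the text describes: a service expecting a single string, when handed a list, is invoked once per element and the results are gathered back into a list:

```python
# Hypothetical sketch of implicit iteration: a processor port declared
# for a single item that receives a list is invoked once per element,
# and the outputs are collected one level deeper.

def invoke_with_implicit_iteration(service, datum):
    if isinstance(datum, list):
        # Cardinality mismatch: iterate over the collection. Nested
        # lists recurse, deepening the output structure to match.
        return [invoke_with_implicit_iteration(service, item)
                for item in datum]
    return service(datum)

def sequence_length(sequence):
    # A toy service that expects exactly one string.
    return len(sequence)

print(invoke_with_implicit_iteration(sequence_length, "ACGT"))
print(invoke_with_implicit_iteration(sequence_length, ["ACGT", "AC"]))
# -> 4, then [4, 2]
```

Only the list-versus-atom distinction matters to the dispatcher; what each string means at the domain level (a sequence, a BLAST report, anything else) is never inspected, mirroring Baclava's transport-level view.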

These open-typing data models enable Taverna to remain open to the heterogeneous services and data of the bioinformatics landscape. This, however, makes it difficult for bioinformaticians to interpret experiment data products at the domain level. The domain-independent models also constrain the kinds of identity that Taverna can allocate for a DataThing.