

Robert Stevens, Paul Fisher, Jun Zhao, Carole Goble and Andy Brass

9.6 Workflow case study

9.6.5 Workflows and the systematic approach

Since workflows automatically pass data from one service to the next, we can process a far greater volume of information than is achievable by manual analysis. This in turn provides the opportunity to analyse systematically any results we obtain, without prematurely filtering the data for human convenience. An example of this triaging process occurred when researchers on the Wellcome Trust Pathogen–Host Project (see Acknowledgements) failed to identify Daxx as a candidate gene for trypanosomiasis resistance. While manually analysing the microarray and QTL data, the researchers hypothesized that the mouse–cow syntenic QTL region might contain the same QT genes. A later systematic analysis showed that Daxx lay outside this region; the mouse QTL data had therefore been prematurely filtered on the basis of researcher bias (although this does not preclude the discovery of other QT genes within the syntenic region).

The use of a hypothesis-driven approach is essential for the construction of a scientifically sound investigation; however, a data-driven approach should also be considered. This allows the experimental data to evolve in parallel with a given hypothesis and to suggest hypotheses of their own, regardless of any previous assumptions (Kell and Oliver, 2004), as shown by this case.

Worthy of note is that the expression of genes and their subsequent pathways can be investigated with little or no prior knowledge, other than the selection of all candidate genes from the entire QTL region. This reduces the bias that may be encountered with traditional hypothesis- or expert-driven approaches. By implementing the manually conducted experiments as workflows, we have shown that an automated systematic approach reduces, if not eliminates, these biases whilst providing an explicit methodology for recording the processes involved. These explicit analysis pipelines increase the reproducibility of our methods and also provide a framework within which future experiments can adapt or update the underlying methods.
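The contrast between manual triage and the systematic pass can be illustrated with a minimal sketch. The gene names, positions and interval boundaries below are invented for illustration; the point is only that the systematic approach keeps every gene in the QTL interval for downstream annotation, so a gene such as Daxx cannot be lost to an early, assumption-driven cut.

```python
def genes_in_qtl(genes, qtl_start, qtl_end):
    """Return every gene whose position falls inside the QTL interval,
    with no further filtering - the systematic selection step."""
    return [name for name, pos in genes if qtl_start <= pos <= qtl_end]

# Hypothetical gene positions (Mb) on a mouse chromosome.
genes = [("Daxx", 34.1), ("GeneA", 30.2), ("GeneB", 41.7), ("GeneC", 55.0)]

# Systematic pass over the whole (hypothetical) QTL interval.
candidates = genes_in_qtl(genes, 28.0, 45.0)

# A manual triage restricted to a presumed syntenic sub-region, say
# 36-45 Mb, would have dropped Daxx before any pathway analysis ran.
triaged = genes_in_qtl(genes, 36.0, 45.0)
```

The systematic list retains Daxx while the triaged list does not, which is exactly the failure mode described above.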

In using the Taverna workflow workbench we are able to state the services used and the parameters chosen at execution time. Specifying how these services interact with one another in the Scufl workflow language enables researchers to re-use and re-run experiments. An additional feature of the Taverna system is the capture of experimental provenance.

The workflow parameters and individual results are captured in this execution log (Stevens, Zhao and Goble, 2007).
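A toy sketch conveys the kind of execution log meant here: each service invocation is recorded with its parameters and result. Taverna's actual provenance model (Stevens, Zhao and Goble, 2007) is far richer; the function and service names below are purely illustrative.

```python
import datetime

# Illustrative provenance store: one record per service invocation.
execution_log = []

def run_service(name, func, **params):
    """Invoke a workflow service and append a provenance record
    capturing the service name, its parameters and its result."""
    result = func(**params)
    execution_log.append({
        "service": name,
        "parameters": params,
        "result": result,
        "timestamp": datetime.datetime.now().isoformat(),
    })
    return result

# Hypothetical service call; any callable stands in for a real service.
out = run_service("uppercase", lambda text: text.upper(), text="daxx")
```

Because every invocation passes through the same wrapper, the log reconstructs exactly which services ran, with which inputs, in which order.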

The generality of these workflows allows them to be re-used for the integration of mapping and microarray data in other cases than that of response to trypanosomiasis. Furthermore, the QTL gene annotation workflow may be utilized in projects that use the mouse model organism and do not have any gene expression data to back up findings. Likewise, the microarray gene annotation workflow may be used in studies with no quantitative evidence for the observed phenotype.

It should be noted that an unavoidable ascertainment bias is introduced into the methodology by the use of remote resources for candidate selection. Because the method relies on extant knowledge, the lack of pathway annotations limits our ability to narrow down the true candidate genes from the total set identified in the QTL region. As the number of genes annotated with their pathways grows rapidly, however, the number of candidate QTG identified in subsequent analyses is sure to increase. The workflows described here provide the means to readily repeat the analysis.

The KEGG pathway database was chosen as the primary source of pathway information due to its being publicly available and containing a large set of biological pathway annotations.

This results in a bias, relying on extant knowledge from a single data repository; however, this investigation was established as a proof of concept for the proposed methodology and, with further work, may be modified to query any number of pathway databases, provided they offer (Web) service functionality.
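Querying several pathway databases behind a common interface, as envisaged here, might be sketched as follows. The class and annotation names are invented; the real workflows queried KEGG through its (Web) service interface, and any additional database offering such an interface could be wrapped the same way.

```python
class PathwaySource:
    """Common interface that any pathway database service could implement."""
    def pathways_for(self, gene):
        raise NotImplementedError

class DictBackedSource(PathwaySource):
    """Stand-in for a remote service, backed by a local annotation table."""
    def __init__(self, annotations):
        self.annotations = annotations
    def pathways_for(self, gene):
        return set(self.annotations.get(gene, ()))

def annotate(genes, sources):
    """Union the pathway annotations for each gene across all sources."""
    return {g: set().union(*(s.pathways_for(g) for s in sources))
            for g in genes}

# Hypothetical annotations from two independent repositories.
kegg_like = DictBackedSource({"Daxx": {"Apoptosis"}})
other_db = DictBackedSource({"Daxx": {"p53 signaling"}})
result = annotate(["Daxx", "GeneX"], [kegg_like, other_db])
```

Adding a second repository then requires only a new `PathwaySource` implementation, which is the extension the text anticipates; unannotated genes simply yield an empty set rather than being dropped.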

The workflows developed in this project are freely available for re-use – either by ourselves or others in future analyses. And so, by revisiting issues (a)–(f) outlined earlier in this section, we can show that by utilizing workflows within this investigation we have achieved the following.

(a) Successfully reduced the premature filtering of data sets: all data are now processed systematically through the workflows.

(b) The systematic analysis of the gene expression and QTL data has supported a data-driven analysis approach.

(c) The use of a data-driven approach has enabled a number of novel hypotheses to be inferred from the workflow results, including the role of apoptosis and Daxx in trypanosomiasis resistance.

(d) The workflows have explicitly captured the data analysis methodologies used in this investigation.

(e) Capturing these data analysis methods enables the direct re-use of the workflows in subsequent investigations.

(f) The total number of errors within this investigation has been reduced as a result of addressing all of the issues above.


9.7 Discussion

myGrid’s Taverna, like workflow management systems in general, has supported a step change in how bioinformatics analyses are conducted. The industrialization of bioinformatics analysis has begun, just as the industrialization of much of biology’s experimentation has already been accomplished. Industrial data production must be matched by industrial data analysis. In addition, as with any technological shift, how best to use these innovations has to be understood. This industrialization is not simply mass production; it also changes the nature of the work, that is, the methodology of data analysis.

Much bioinformatics is rooted in human-driven analysis through the manual transfer of data between services, delivered by web pages. In the days when single-sequence analysis with one service was all that was necessary, this approach was adequate, but it lacked support for easily attaining rigour in recording what happened during an analysis. With today’s increase in complexity – a larger range of data types, larger quantities of data and more services needed in an analysis – such an approach does not scale. In these situations other consequences of human fallibility are also seen: biases in the pursuit of one data item in preference to another; premature triage to cope with too many data; and the other issues described in Section 9.6.

Taverna workflows can overcome many of these issues. Tedious repetition of data transfer can be automated. Large numbers of services can be coordinated. Workflows can be utterly systematic, thus avoiding bias in the selection of data items. There is no temptation to shed data to ease processing, and thus no risk of discarding results as false negatives. These consequences are typical of automation, be it workflows or bespoke scripts. Workflows, however, have other advantages.

They are high level, allowing a bioinformatician to concentrate better on the bioinformatics and avoiding the ‘plumbing’. They afford re-use much more easily than scripts, by allowing service substitution, extension of the workflow, fragmentation, data and parameter change etc. They are explicit about the nature of the analysis and, through common services such as provenance, afford better recording of an in silico experiment’s events, data lineage and origins, etc.

Taverna also gives a means by which a bioinformatician can address the issues arising from the autonomy, and consequent heterogeneity, of data and tools that exist in a volatile knowledge environment. Taverna does not solve these issues, but provides a mechanism by which they can be addressed. One option is to require that the services providing data and analysis tools conform to a particular type system that would hide the issues of heterogeneity. This would, in effect, mean providing a wide range of bioinformatics services in-house, or necessitate a high ‘activation energy’ before any workflows could be developed. Taverna took the alternative route of being entirely open to any form of service. This immediately provides access to a large number of tools and data services (3500 at the time of writing). This openness avoids having to provide services or make them conform to some internal type model. This choice does, however, mean that heterogeneity is mostly left to the workflow composer to manage.

The cost is still there, but it is lessened and itself distributed. Consequently, the cost of start-up is lower.

Taverna addresses, with ease, the factor of distribution. It makes the data and tools appear as if they are within the Taverna workbench. The bioinformaticians still have to use their skill to address basic bioinformatics issues of syntax transformation, extraction and mapping of identifiers between data resources. In a perfect world, this would not be the case, but the Taverna approach does exploit the target audience’s skills.

In the case study, significant problems were encountered in all workflows when attempting to cross-reference between database identifiers. This matter, together with the naming conventions assigned to biological objects (Drăghici, Sellamuthu and Khatri, 2006), has proven to be a considerable barrier in bioinformatics involving distributed resources. In an attempt to resolve this, a single, explicit methodology was provided for this cross-referencing, and it is captured within the workflows themselves.
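The essence of such an explicit cross-referencing step can be sketched in a few lines. The probe and gene identifiers below are hypothetical, and in the actual workflows the mapping was performed by service calls rather than a local table; the point is that failures are recorded explicitly instead of identifiers being silently dropped.

```python
def cross_reference(ids, mapping):
    """Map each identifier to its counterpart in another resource,
    recording unmapped identifiers rather than silently dropping them."""
    mapped, unmapped = {}, []
    for i in ids:
        if i in mapping:
            mapped[i] = mapping[i]
        else:
            unmapped.append(i)
    return mapped, unmapped

# Hypothetical probe-to-gene identifier table.
probe_to_gene = {"1418571_at": "ENSMUSG00000002307"}

mapped, unmapped = cross_reference(["1418571_at", "1449876_at"],
                                   probe_to_gene)
```

Keeping the unmapped list visible makes the cross-referencing step auditable, which is precisely what capturing the methodology in a workflow provides.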

In the initial use of Taverna (Stevens et al., 2004), workflows were developed that were direct instantiations of human protocols. When developing genetic maps for Williams–Beuren syndrome (Stevens et al., 2004), three workflows were developed that replicated the actions a human bioinformatician performed in gathering data for human analysis. This task took two days and biological insight was gained over this time-span. The Taverna workflows gathered all these data (some 3.5 MB in numerous files (Stevens et al., 2004)) within a 20-minute period. This left a bioinformatician with a near unmanageable set of data to interpret.

These data were crudely coordinated and simply created one problem whilst solving another. In light of this experience, the high-level methodology for the goals of workflows has changed. Data still have to be gathered, but this is no longer the end of the workflow. In the African trypanosomiasis workflows reported here, the strategy is now one of gathering and managing data.

By using workflows the nature of the question asked of bioinformaticians and their collaborating biologists has changed. When there is simply a large amount of gathered data, they are in effect asked to ‘make sense of these data’. The workflow approach, through mining the data for patterns and relationships, changes the nature of the question to ‘does this pattern or these relationships make sense to you?’. Rather than requiring biologists to become data processors, their background and training are exploited by asking them to design analysis protocols (experiments) and then to assess the data mining outcomes. The patterns and relationships that are the outcomes of an analysis are really just a hypothesis that will prompt further analysis, either in the wet lab or through in silico experimentation.

As well as exploiting a biologist’s background and training, the computer is exploited for the tasks it performs best. Computers are good at being systematic and unbiased – addressing two of the problems identified in analyses of microarray and QTL data.

This change in approach to analyses has led to a methodology for analysing microarray and QTL data. An approach has been developed, implemented as workflows, that exploits the systematic nature of computation and couples it with an unbiased evaluation of a QTG’s contribution to observed effects. The industrial, systematic approach enables premature triage, and the consequent false negatives, to be avoided. A computer is limited only by its memory size, not by the fragility that afflicts human memory. Finally, this industrial approach alters the granularity with which we look at the data – investigating pathways rather than gene products. This has gained biological insights into tolerance to African trypanosomiasis infection (Fisher et al., 2007). This is just one methodology for one data situation – similar effects are expected in other data situations.

In Taverna an analytical approach has been developed that manages to live with the heterogeneity, autonomy and volatility of the bioinformatics landscape. If bioinformatics were to cure its obvious ills of identity and other forms of heterogeneity, complex analyses of data would be far easier. Such a cure will not happen soon. Taverna provides a vehicle by which a bioinformatician can traverse the rough bioinformatics landscape. In doing so, the myGrid project has changed the nature of bioinformatics analysis.


Acknowledgements

The authors thank the whole myGrid consortium, software developers and its associated researchers. We would also like to thank the researchers of the Wellcome Trust Host–Pathogen Project – grant number GR066764MA. Paul Fisher is supported by the UK e-science EPSRC grant for myGrid GR/R67743.

References

Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990), ‘Basic local alignment search tool’, Journal of Molecular Biology 215 (3), 403–410.

Apweiler, R., Bairoch, A., Wu, C. H., Barker, W. C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M. J., Natale, D. A., O’Donovan, C., Redaschi, N. and Yeh, L.-S. L. (2004), ‘UniProt: the universal protein knowledgebase’, Nucleic Acids Research 32, 115–119.

Bairoch, A., Apweiler, R., Wu, C. H., Barker, W. C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M. J., Natale, D. A., O’Donovan, C., Redaschi, N. and Yeh, L. L. (2005), ‘The universal protein resource (UniProt)’, Nucleic Acids Research 33, D154–D159.

Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J. and Wheeler, D. L. (2005), ‘GenBank’, Nucleic Acids Research 33, 34–38.

Davidson, S., Overton, C. and Buneman, P. (1995), ‘Challenges in integrating biological data sources’, Journal of Computational Biology 2 (4), 557–572.

de Buhr, M., Mähler, M., Geffers, R., Westendorf, W. H. A., Lauber, J., Buer, J., Schlegelberger, B., Hedrich, H. and Bleich, A. (2006), ‘Cd14, Gbp1, and Pla2g2a: three major candidate genes for experimental IBD identified by combining QTL and microarray analyses’, Physiological Genomics 25 (3), 426–434.

Drăghici, S., Sellamuthu, S. and Khatri, P. (2006), ‘Babel’s tower revisited: a universal resource for cross-referencing across annotation databases’, Bioinformatics 22 (23), 2934–2939.

Fisher, P., Hedeler, C., Wolstencroft, K., Hulme, H., Noyes, H., Kemp, S., Stevens, R. and Brass, A. (2007), ‘A systematic strategy for large-scale analysis of genotype–phenotype correlations: identification of candidate genes involved in African trypanosomiasis’, Nucleic Acids Research 35 (16), 5625–5633.

Frawley, W., Piatetsky-Shapiro, G. and Matheus, C. (1992), ‘Knowledge discovery in databases: an overview’, AI Magazine 13 (3), 57–70.

Galperin, M. Y. (2006), ‘The molecular biology database collection: 2006 update’, Nucleic Acids Research 34, 3–5.

Hand, D., Mannila, H. and Smyth, P. (2001), Principles of Data Mining, MIT Press.

Hanotte, O. and Ronin, Y. (2003), ‘Mapping of quantitative trait loci controlling trypanotolerance in a cross of tolerant West African N’Dama and susceptible East African Boran cattle’, Proceedings of the National Academy of Sciences 100 (13), 7443–7448.

Hedeler, C., Paton, N., Behnke, J., Bradley, E., Hamshere, M. and Else, K. (2006), ‘A classification of tasks for the systematic study of immune response using functional genomics data’, Parasitology 132, 157–167.

Hill, E., O’Gorman, G., Agaba, M., Gibson, J., Hanotte, O., Kemp, S., Naessens, J., Coussens, P. and MacHugh, D. (2005), ‘Understanding bovine trypanosomiasis and trypanotolerance: the promise of functional genomics’, Veterinary Immunology and Immunopathology 105 (3–4), 247–258.

Huang, L., Walker, D. W., Rana, O. F. and Huang, Y. (2006), Dynamic workflow management using performance data, in ‘Proceedings of the 6th International Symposium on Cluster Computing and the Grid (CCGrid’06)’, pp. 154–157.

Hull, D., Zolin, E., Bovykin, A., Horrocks, I., Sattler, U. and Stevens, R. (2006), Deciding semantic matching of stateless services, in ‘Proceedings of the 21st National Conference on Artificial Intelligence and the 18th Innovative Applications of Artificial Intelligence Conference’, Boston, MA, pp. 1319–1324.

Iraqi, F., Clapcott, S. J., Kumari, P., Haley, C. S., Kemp, S. J. and Teale, A. J. (2000), ‘Fine mapping of trypanosomiasis resistance loci in murine advanced intercross lines’, Mammalian Genome 11 (8), 645–648.

Kanehisa, M. and Goto, S. (2000), ‘KEGG: Kyoto Encyclopedia of Genes and Genomes’, Nucleic Acids Research 28 (1), 27–30.

Karp, P. (1995), ‘A strategy for database interoperation’, Journal of Computational Biology 2 (4), 573–586.

Kell, D. (2002), ‘Genotype–phenotype mapping: genes as computer programs’, Trends Genetics 18 (11), 555–559.

Kell, D. and Oliver, S. (2004), ‘Here is the evidence, now what is the hypothesis?, the complementary roles of inductive and hypothesis-driven science in the post-genomic era’, Bioessays 26, 99–105.

Koudandé, O. D., van Arendonk, J. A. and Koud, F. (2005), ‘Marker-assisted introgression of trypanotolerance QTL in mice’, Mammalian Genome 16 (2), 112–119.

Lord, P., Alper, P., Wroe, C. and Goble, C. (2005), Feta: a light-weight architecture for user oriented semantic service discovery, in ‘Proceedings of the 2nd European Semantic Web Conference’, pp. 17–31.

Maglott, D., Ostell, J., Pruitt, K. D. and Tatusova, T. (2007), ‘Entrez gene: gene-centered information at NCBI’, Nucleic Acids Research 35, D26–D31.

Mitchell, J. and McCray, A. (2003), ‘From phenotype to genotype: issues in navigating the available information resources’, Methods of Information in Medicine 42 (5), 557–563.

Naessens, J. (2006), ‘Bovine trypanotolerance: a natural ability to prevent severe anaemia and haemophagocytic syndrome?’, International Journal for Parasitology 36 (5), 521–528.

Nature Editorial (2006), ‘Illuminating the black box’, Nature 442 (7098), 1.

Oinn, T., Greenwood, M., Addis, M., Alpdemir, M. N., Ferris, J., Glover, K., Goble, C., Goderis, A., Hull, D., Marvin, D., Li, P., Lord, P., Pocock, M. R., Senger, M., Stevens, R., Wipat, A. and Wroe, C. (2006), ‘Taverna: lessons in creating a workflow environment for the life sciences’, Concurrency and Computation: Practice and Experience 18 (10), 1067–1100.

Schadt, E. (2006), ‘Novel integrative genomics strategies to identify genes for complex traits’, Animal Genetics 37 (1), 18–23.

Shi, M., Wei, G., Pan, W. and Tabel, H. (2005), ‘Impaired Kupffer cells in highly susceptible mice infected with Trypanosoma congolense’, Infection and Immunity 73 (12), 8393–8396.

Stein, L. (2002), ‘Creating a bioinformatics nation’, Nature 417 (6885), 119–120.

Stein, L. (2003), ‘Integrating biological databases’, Nature Reviews Genetics 4 (5), 337–345.

Stevens, R., Tipney, H. J., Wroe, C., Oinn, T., Senger, M., Lord, P., Goble, C., Brass, A. and Tassabehji, M. (2004), ‘Exploring Williams–Beuren syndrome using myGrid’, Bioinformatics 20, i303–i310.

Stevens, R., Zhao, J. and Goble, C. (2007), ‘Using provenance to manage knowledge of in silico experi-ments’, Briefings in Bioinformatics 8 (3), 183–194.

Turi, D. (2006), Taverna workflows: syntax and semantics, Internal technical report, University of Manchester.

Yan, Y., Wang, M., Lemon, W. and You, M. (2004), ‘Single nucleotide polymorphism (SNP) analysis of mouse quantitative trait loci for identification of candidate genes’, Journal of Medical Genetics 41 (9), e111.

Yang, X. and Khosravi-Far, R. (1997), ‘Daxx, a novel Fas-binding protein that activates JNK and apoptosis’, Cell 89 (7), 1067–1076.

Zhao, J., Wroe, C., Goble, C., Stevens, R., Quan, D. and Greenwood, M. (2004), Using semantic Web technologies for representing e-science provenance, in ‘Proceedings of the 3rd International Semantic Web Conference’, Vol. 3298, Hiroshima, pp. 92–106.

Zuñiga, E. and Motran, C. (2000), ‘Trypanosoma cruzi-induced immunosuppression: B cells undergo spontaneous apoptosis and lipopolysaccharide (LPS) arrests their proliferation during acute infection’, Clinical and Experimental Immunology 119 (3), 507–515.
