Example of a workflow study - Moustafa Ghanem, Vasa Curcin, Patrick Wendel and Yike Guo

Moustafa Ghanem, Vasa Curcin, Patrick Wendel and Yike Guo

8.5 Example of a workflow study

In this section we briefly present an analytical workflow application developed within Discov-ery Net to highlight some of the key features of the system. The chosen example is based on developing workflow-based data analysis services for use in studying adverse drug reaction (ADR) studies. The aim of the analysis is to build a classification model to determine what factors contribute to patients having an adverse reaction to a particular drug.

Although the workflow makes use of many features of the Discovery Net system, including integration of remote data mining tools and underlying data management features of the system, we primarily focus on demonstrating one of the key advantages of the system, its ability to develop and re-use workflow-based services that can be re-used in other studies.

The workflow itself uses multiple data mining tools made accessible as remote grid services executing on high-performance computing resources at the Imperial College London Internet Centre. These include logistic regression, bootstrapping, text mining and others. Previous to use, these tools were integrated as activities within the Activity Service Definition Layer, thereby making the access to them, as well as the associated data management, transparent to the user.

8.5.1 ADR studies

ADR studies are a common type of analysis in pharmacovigilance, a branch of pharmacology concerned with the detection, assessment, understanding and prevention of adverse effects, and particularly long-term and short-term side effects of medicines (World Health Organization, 2002). ADR studies can be contrasted to drug safety trials that are required before the drug enters the market, in that the safety trials typically operate on a limited set of healthy test subjects, while ADR studies focus on patients with specific conditions (diabetes, liver damage etc.) not directly connected to the drug being tested as well as on patients on a combination of drugs. Also, ADR studies typically cover a longer follow-up periods in observations.

However, the mechanisms for conducting ADR studies are still limited, with typical source data consisting of voluntary patient and GP reporting, and lacking in quantity needed for meaningful analysis as well as length coverage. One data source in the UK containing suitable follow-up and all relevant information is the primary care data, available from a number of providers, such as the General Practitioners Research Database (GPRD) and The Health Integration Network (THIN), containing full prescription data and observations for a large body of patients over an extended period of time (10–20 years). In this way, researchers can perform an in silico study by creating their own control sets and running analyses on them.

Furthermore, in addition to traditional statistics techniques, data mining can be utilized to answer more ambitious questions.

8.5.2 Analysis overview

Non-steroidal anti-inflammatory drugs (NSAIDs) are one of the most popular classes of drugs in the world, including aspirin, ibuprofen and others. One common problem with them is that they can cause gastro-intestinal side-effects, and significant research effort has gone into producing NSAIDs that did not demonstrate this behaviour. With this goal in mind, selective cyclo-oxygenase-2 (COX-2) inhibitors were developed and marketed in celecoxib and rofecoxib (Vioxx) drugs. However, as demonstrated in numerous research papers, these drugs can still,

Event model

transformation Exclusions Exposure

definitions

Corpus retrieval Statistical tests

Figure 8.9 Abstract workflow for adverse drug reaction study

in some cases, produce side-effects ranging from gastrointestinal, via skin rashes to respiratory effects (Silverstein et al., 2000).

The study presented here aims to build a classification model to determine what factors con-tribute to patients having an adverse reaction to COX-2. The list of tasks involves determining side-effects from patients who have been known to previously receive COX-2 inhibitors, defin-ing exposure window, excluddefin-ing conditions and drugs that may produce similar effects and testing the produced model on a sample of patients on COX-2, both with and without ADRs.

The data sources used in this study were derived from GPRD, and consist of a patient table, prescription table and clinical table, containing, respectively, information on the patient, all the prescriptions the patient has been given with full dosage and duration information and all diagnosis and measurements related to the patient. The overall workflow is given in Figure 8.9.

8.5.3 Service for transforming event data into patient annotations

As a first step, the event data in the database, organized by event identifiers and annotated with dates, is transformed into patient samples. For each patient we define a workflow query to establish the first instance of any of the nominated conditions. In order to publish this workflow as a service, several parameters are defined. First, the conditions themselves are made into a parameter. The GPRD selector interface for the conditions utilizes medical dictionaries, and is integrated into the hosting environment, as shown in Figure 8.10. Second, the aggregation function for the condition is generalized, so the researcher can specify first condition, last condition or count of condition occurrences.

8.5.4 Service for defining exclusions

Next step in the analysis is to pick known medical conditions and drugs that may cause gastro-intestinal conditions. Event databases are searched for relevant diagnosis and prescriptions and these patients are removed from the data set. The selection dialogue for drug and condition querying, used in the published workflow, is the same as in the previous service.

8.5 EXAMPLE OF A WORKFLOW STUDY 135

Figure 8.10 Medical term selection widget

8.5.5 Service for defining exposures

The next step is to provide a framework for establishing exposure, i.e. duration of time that the patient is on a drug. One valid approach is to only take the first prescription, and the first occurrence of the event, assuming that ADR will happen the first time patient takes the drug. One potential problem with this scenario is that if dual medication needs to be taken into account, it needs to be treated as a separate drug. Taking this into account, the next iteration was to use the dosage information from prescription table to establish an event timeline for each patient and calculate, for each event, a set of events with which it overlaps, including other conditions and other prescriptions. In this way, it is relatively efficient to find valid drug exposures for each event on the timeline. Publishing this workflow as a service includes parameterizing the definition of events that need to be tracked, using a similar selection dialogue as in the service above.

8.5.6 Service for building the classification model

Once the data have been filtered and annotated with event information, the training set needs to be determined. Since the cleaned data are likely to contain a relatively low percentage of patients with adverse reactions, over-sampling is utilized so that the sample contains 50 per cent of adverse effects. Default separation then involves 40 per cent for training the logistic regression algorithm, and a further 30 per cent for both the validation and evaluation steps. The outputs from this service are the profit matrix and the lift diagram determining the prediction accuracy of the classifier, as well as the list of variables found to contribute to the adverse reaction to COX-2. Parameterization involves allowing the analyst to specify oversampling, training–test–validation sample ratios and various parameters of logistic regression.

8.5.7 Validation service

The last step, after the classification model is calculated, is to create a support corpus for the study. One way of doing this is to query Pubmed with the COX-2 drug names, and retrieve abstracts describing existing scientific knowledge about them. The resulting collection of abstracts can then either be manually checked or passed to a text-mining procedure.

Figure 8.11 ADR workflow

8.5.8 Summary

In this section we have presented how a scientific application for investigating adverse drug reactions can be built using Discovery Net architecture. An example study for finding contribut-ing factors to adverse reaction to COX-2 is presented, with separate workflow-based services designed for each of the main tasks involved in the process. Such services are then published inside a portal environment, with the addition of dictionary lookup widgets for selection of medical terms. The full workflow is presented in Figure 8.11.

Dans le document Data Mining Techniques in Grid Computing Environments (Page 156-159)