
Mining web services

Rahul Ramachandran, Sara Graves, John Rushing, Ken Keizer, Manil Maskey, Hong Lin and Helen Conover

4.5 Mining web services

Because a sequence of operations is often necessary to provide an appropriate data mining solution to a complex problem, the SOA approach allows users to develop a workflow that employs various Web services: pre-processing to prepare the data for mining, one or more mining steps, and post-processing to present the results. For use within an SOA, ADaM Toolkit components support Web service interfaces based on World Wide Web Consortium (W3C) specifications, including SOAP interfaces and WSDL service descriptions. The primary customers for these services are data centres that provide online data to their community, such as the NASA Data Pools, and would like to provide additional processing features to their end users. Because of the volume of data held at different online science data providers, pushing data to distributed mining service locations imposes a large network load. This burden can be avoided if the services are available at the online data provider sites, alongside the data.
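The pre-process, mine and post-process chain described above can be sketched as a simple pipeline. This is an illustrative sketch only: the step names below are hypothetical placeholders, not actual ADaM operation names, and real ADaM steps run as Web services rather than local functions.

```python
# Minimal sketch of a mining workflow as a chain of processing steps.
# The step functions are hypothetical stand-ins for ADaM operations.

def subset_bands(data):
    # pre-processing: keep only the first two "bands" of each record
    return [record[:2] for record in data]

def classify(data):
    # mining: label each record by a toy threshold rule
    return [1 if sum(record) > 1.0 else 0 for record in data]

def to_image(labels):
    # post-processing: render the labels as a tiny string "image"
    return "".join("#" if label else "." for label in labels)

def run_workflow(steps, data):
    # apply each step in order, feeding its output to the next
    for step in steps:
        data = step(data)
    return data

result = run_workflow([subset_bands, classify, to_image],
                      [[0.9, 0.8, 0.1], [0.1, 0.2, 0.3]])
```

The key design point is that every step shares one calling convention, so steps can be reordered or swapped without changing the driver.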

Therefore, the ADaM mining Web services have to be deployable and able to reside at many different distributed online data repositories. The need to be deployable places implementation constraints on these Web services, in that they must support different Web servers and computing platforms, thus allowing distribution to different groups with minimal site-specific support.

The general architecture for mining Web services is presented in Figure 4.3. The functionality of the various components is the same as the original ADaM architecture (Figure 4.1) but implementation of the components has changed. The mining daemon is now a Web service.

In the implementation described in this chapter, a Business Process Execution Language (BPEL) standards-based engine and associated service workflow have replaced the earlier mining engine and plan. BPEL provides a standard XML schema for the composition of SOAP-based Web services into workflows. The workflow describes how tasks are orchestrated, including which components perform them, what their relative order is, how they are synchronized and how information flows to support the tasks. This standardized composition description is deployable on any BPEL-compliant engine. Deploying a workflow translates into taking a BPEL description and asking a BPEL engine to expose the workflow as a Web service.

Figure 4.3 Generic Web service architecture to provide ADaM mining services at data archiving centres

Being able to expose a composition of Web services makes the workflow itself a Web service and thus reusable. A BPEL engine is also responsible for invoking the individual services.

The advantage of a standard-based approach such as BPEL is interoperability: a common execution language allows for the portability of workflow definitions across multiple systems.

Any BPEL engine will now be able to take an ADaM workflow described in BPEL and invoke the individual services. The database component and the scheduler implementations in the Web service architecture are left to the individual data archives, as they might have different requirements and restrictions.
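The reusability point above — a composition of services is itself an invocable service, usable inside further compositions — can be sketched in miniature. This is illustrative only: real ADaM services are remote SOAP endpoints orchestrated by a BPEL engine, not local callables.

```python
# Sketch of service composition: a "composite" built from atomic
# callables is invoked, and reused, exactly like an atomic one.
# The atomic services here are hypothetical stand-ins.

def compose(*services):
    def composite(value):
        for service in services:
            value = service(value)
        return value
    return composite

calibrate = lambda x: x * 2.0        # hypothetical atomic service
threshold = lambda x: x > 1.0        # hypothetical atomic service

detect = compose(calibrate, threshold)           # composite service
# the composite is itself composable, like a deployed BPEL workflow
report = compose(detect, lambda flag: "dust" if flag else "clear")
```

Invoking `detect(0.6)` or `report(0.4)` is indistinguishable, from the caller's side, from invoking a single atomic service — which is exactly what exposing a BPEL workflow as a Web service achieves.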

4.5.1 Implementation architecture

For one implementation of a data mining SOA, shown in Figure 4.4, ITSC has teamed with the NASA Goddard Earth Sciences Data and Information Services Center (GES DISC). The GES DISC provides a large online repository of many NASA scientific data sets, together with a processing environment called the Simple, Scalable, Script-Based Science Processor for Measurements (S4PM) (Lynnes, 2006), for applying user-supplied processing components to the data. ITSC provides an external ‘sandbox’ that allows the user to experiment with a small subset of the data and a variety of ADaM Toolkit components and parameter settings. A mining-specific workflow composition application hides the complexities of the BPEL language from users. Note that the user client, the toolkit of mining services and the BPEL service orchestration engine may be co-located at the data centre or distributed across the Web. Once the user is satisfied with a workflow, it is deployed to a BPEL engine, which generates a WSDL description for the composite service (the workflow) and returns its URL. This URL is then transmitted to the GES DISC via a Web service request, along with a specification of the data to be mined, such as the data set identifier and temporal or spatial constraints. The request is passed to the GES DISC's processing engine, S4PM, for mining data at the data centre. The S4PM engine is responsible for acquiring the right data from the archive, staging the data and then executing the requested workflow on each input file. Note, however, that the workflow itself is not provided to S4PM. Instead, S4PM uses the supplied URL to fetch the WSDL document and then invoke the corresponding composite Web service in the BPEL engine, supplying the full path of the input file. The BPEL engine invokes the atomic Web services in the proper order at the GES DISC. The result is then staged to an FTP directory, from which it can be retrieved and verified.

Figure 4.4 Implementation architecture to provide ADaM mining services at NASA GES DISC

Figure 4.5 Example mining workflow to detect airborne dust in MODIS imagery
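The contract between S4PM and the BPEL engine — S4PM receives only a workflow URL plus a data specification, resolves the URL, and invokes the composite service once per staged file — can be sketched as follows. The URL and the registry are hypothetical; in reality the URL points to a WSDL document and the invocation is a SOAP call.

```python
# Sketch of the S4PM-side contract: the processing engine receives only
# a workflow URL and input files, looks the workflow up, and invokes it
# per file. The registry stands in for WSDL resolution by a BPEL engine.

WORKFLOW_REGISTRY = {  # hypothetical: URL -> deployed composite service
    "http://example.org/wsdl/dust": lambda path: path + ".mask",
}

def run_mining_request(wsdl_url, input_files):
    workflow = WORKFLOW_REGISTRY[wsdl_url]   # "fetch the WSDL"
    results = []
    for path in input_files:
        results.append(workflow(path))       # invoke per input file
    return results                           # handles to staged outputs

outputs = run_mining_request("http://example.org/wsdl/dust",
                             ["granule1.hdf", "granule2.hdf"])
```

Because S4PM holds only a URL, the workflow definition can change at the BPEL engine without any redeployment on the S4PM side.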

4.5.2 Workflow example

An example mining workflow to detect and classify airborne dust in NASA's Moderate Resolution Imaging Spectroradiometer (MODIS) imagery is presented in Figure 4.5. The workflow consists of pre-processing steps that calibrate the MODIS data, perform the data model translation and subset the spectral bands to those most sensitive to dust classification. A naïve Bayes classifier is used for the detection, and the result is converted into an image for visualization. The original data (Figure 4.6(a)) show dust blowing off the northwest coast of Africa. The classification result is presented as a thematic map in Figure 4.6(b). The pixels coloured black represent dust over water, whereas the pixels coloured grey represent dust over land. Details of the dust detection study are discussed by Ramachandran et al. (2007).

This example dust detection using classifier and data processing components demonstrates the different kinds of mining application that scientists will be able to perform with the ADaM services at data centres such as GES DISC.
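As a minimal illustration of the classification step, the following is a toy naïve Bayes classifier over categorical pixel features. The features, labels and training samples are invented for illustration; the actual ADaM classifier operates on calibrated spectral band values.

```python
# Toy naive Bayes classifier in the spirit of the dust-detection step.
# Training data and feature names are invented for illustration only.
import math
from collections import Counter, defaultdict

def train(samples):
    # samples: list of (feature_tuple, label)
    priors = Counter(label for _, label in samples)
    counts = defaultdict(Counter)
    for features, label in samples:
        for i, value in enumerate(features):
            counts[(label, i)][value] += 1
    return priors, counts

def classify(priors, counts, features):
    # pick the label maximizing log P(label) + sum_i log P(f_i | label),
    # with add-one smoothing (two possible values per feature here)
    def log_posterior(label):
        score = math.log(priors[label])
        for i, value in enumerate(features):
            col = counts[(label, i)]
            score += math.log((col[value] + 1) / (sum(col.values()) + 2))
        return score
    return max(priors, key=log_posterior)

samples = [(("bright", "warm"), "dust"), (("bright", "warm"), "dust"),
           (("dark", "cool"), "clear"), (("dark", "warm"), "clear")]
priors, counts = train(samples)
label = classify(priors, counts, ("bright", "warm"))
```

The per-label, per-feature counting is why naïve Bayes trains in a single pass — a property that matters when the mining step runs granule by granule at a data centre.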

4.5.3 Implementation issues

Several issues were encountered while implementing ADaM Web services and are listed below.

SOAP wrappers The Web service implementation of existing ADaM modules involved placing a SOAP wrapper around each ADaM operation. Several SOAP implementations were evaluated before deciding on SOAP::Lite [2], the Perl implementation of SOAP. Although Java's implementation of SOAP, Axis [3], provides a wide variety of tools that assist in implementing services and clients, Java was not appropriate in this case because the ADaM Toolkit is implemented in C++, and interfacing Java to C++ through the Java Native Interface (JNI) can be problematic. Axis C++ offers a C++ implementation of SOAP; however, it does not appear to have been widely adopted by the software industry so far. The Python implementation of SOAP, pySOAP, has not been updated or maintained in recent years. The Web Service Definition Language (WSDL) description for each of the ADaM services has been published, and each of the Web services was tested for interoperability using Perl and Java clients.

Figure 4.6 (A) Example MODIS true colour imagery depicting airborne dust off the coast of Africa. (B) Classification mask generated by the workflow (black, dust over water; grey, dust over land)

[2] http://www.soaplite.com/

Security issues Even though this initial implementation did not focus on security issues associated with Web services, some basic rules were enforced. To give the data centre full control over enforcing local policies for file creation, the SOAP wrappers specify only the full path of the input file, plus additional parameters as necessary. The data centre constructs the output directory and filename and returns it to the calling program in the SOAP response. The output of one service may then be used as the input file to the next step in the workflow. This interface mechanism allows the data centre to control where files are created on its system and to implement safeguards against inappropriate use.
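The file-handle convention above can be sketched as follows. The directory layout and operation names are hypothetical; the point is only that the (simulated) data centre, not the caller, chooses the output path, and that the returned handle feeds the next step.

```python
# Sketch of the file-handle interface: the caller names only the input
# file; the simulated data centre picks the output path and returns it.
# The /scratch/adam layout is a hypothetical local policy.

def service_call(operation, input_path):
    # The data centre, not the caller, decides where output lives, so
    # local file-creation policies can be enforced in one place.
    filename = input_path.rsplit("/", 1)[-1]
    output_path = "/scratch/adam/" + operation + "/" + filename + ".out"
    return output_path   # the SOAP response would carry this handle

# the handle returned by one step becomes the input of the next
step1 = service_call("calibrate", "/archive/modis/granule1.hdf")
step2 = service_call("classify", step1)
```

Because callers never name output locations, a data centre can relocate its scratch space or add quota checks without changing any client.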

Data staging There are two types of architecture available for service orchestration: data-centric and Web service-centric. In a Web service-centric architecture, the services are deployed at one centralized location. Data to be processed are made available using a file handling protocol such as the File Transfer Protocol (FTP), the UDP-Based Data Transfer Protocol (UDT) (Gu and Grossman, 2007) or GridFTP [4]. These protocols transfer the data to the host of the Web services. Additional Web services are created to wrap these file handlers, so that they can be incorporated into a workflow. In any workflow, these file transfer services will always precede the actual processing components. This architecture also makes maintaining the services easier, as changes to the services can be handled quickly. However, implementations using this architecture are limited by the rate at which data transfers occur.

[3] Apache Axis SOAP implementation: http://ws.apache.org/axis/

[4] Globus Toolkit 4.0, GridFTP: http://www.globus.org/toolkit/docs/4.0/data/gridftp/

In a data-centric architecture, the Web services are deployed at the data centres, so that all the data to be processed are available locally. Thus, the overhead of data transfer is avoided.

The BPEL engine that hosts the workflow services can be hosted separately. A user can deploy a workflow to any BPEL engine using the service WSDLs that are exposed by the Web service hosts (in this case the data centres). The BPEL engine exposes a new WSDL for the composite service workflow, so that data centre users can invoke the workflow to process specified data by calling the composite service. The processed data are not transferred to and from the BPEL engine every time a service in the workflow is invoked; instead, a file handle is passed in and out as a parameter. The advantages of such third-party transfers are highlighted by Ludäscher et al. (2005). The implementation architecture described in this chapter is data-centric, as the primary target customers are large online data repositories interested in providing data processing and analysis services to their user communities.
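The scale of the difference between the two orchestration styles can be made concrete with a back-of-the-envelope comparison. The granule size, handle size and step count below are hypothetical round numbers, not measurements.

```python
# Rough comparison of network traffic per workflow run under the two
# orchestration styles. All sizes are hypothetical round numbers.

DATA_SIZE = 500_000_000   # one granule, in bytes (assumed)
HANDLE_SIZE = 64          # a file path string, in bytes (assumed)
STEPS = 3                 # services in the workflow (assumed)

# service-centric: each step ships the data to the service host
service_centric_traffic = STEPS * DATA_SIZE

# data-centric: only file handles cross the wire; data stays in place
data_centric_traffic = STEPS * HANDLE_SIZE
```

Even for a single granule, the service-centric style moves gigabytes where the data-centric style moves a few hundred bytes, which is why the architecture in this chapter keeps the services next to the archive.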

Deployment descriptors for BPEL engines A BPEL engine is required to execute the instructions described in BPEL. Many BPEL engines and visual workflow composers are available as open source software. The problem with having different BPEL engines is that, even though BPEL is a standard language, each engine requires an additional deployment descriptor that describes engine-specific dependencies, and this descriptor varies from engine to engine. The ActiveBPEL [5] engine was selected because of its popularity.