
Figure 6.5 Elements of the data mining toolkit

Processing tools These tools are used to process data generated as output from a Web service.

The processing tools include a tool to visualize the list of classifiers from the Classifier Web service as a tree organized by type, a tool to assist the user in selecting run-time options and a tool to visualize the attributes embedded in a data set.

Visualization tools These tools implement various visualization services based on the output generated by a particular Web service. These visualization tools include Tree Plotter, Image Plotter and Cluster Visualizer. Additional Web services have been implemented that communicate with GNUPlot.

Figure 6.5(b) illustrates the components in the data mining toolkit. Triana provides a centralized enactor that coordinates the execution of a number of different Web services. It is also possible to connect services provided within a signal processing toolbox (containing a fast Fourier transform, StringViewer, StringReader etc.) to the inputs of Web services.

6.8 Availability

The toolkit can be downloaded from the FAEHIM download site under the GNU General Public License.3 To use the toolkit, the Triana workflow engine must first be downloaded and installed; the data mining toolkit is then installed as a folder within Triana. Installation instructions are also provided. A user can also add additional Web services to the data mining toolbox.

6.9 Empirical experiments

To evaluate the effectiveness of the distributed analysis approach, it is necessary to identify metrics that are quantifiable and easy to measure. These metrics help determine the value of the framework, along with its strengths and weaknesses. We use accuracy and running time as evaluation criteria.

3http://users.cs.cf.ac.uk/Ali.Shaikhali/faehim/Downloads/index.htm

Figure 6.6 The output of the J4.8 classifier

• Accuracy. The data mining Web service must produce the same result as the monolithic data mining algorithm.

• Running time. The amount of time that the Web services require to perform the data mining process, compared with the stand-alone algorithm. We are also interested in comparing the running times of processing a data mining request through the single Web service framework and through the distributed Web service framework.

6.9.1 Evaluating the framework accuracy

The data used in this experiment are obtained from the UCI Machine Learning Repository.4 The data set contains information about a number of breast-cancer cases, with patient data anonymized. It includes 201 instances of one class and 85 instances of another class. The instances are described by nine attributes, some of which are linear and others nominal. The data set is already in the attribute-relation file format (ARFF).
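ARFF is a plain-text format in which a header declares the relation and its attributes, followed by a @data section of comma-separated instances. A minimal sketch of such a file is shown below; only the 'node-caps' attribute is named in this chapter, so the remaining attribute names and values are illustrative assumptions:

    % Illustrative ARFF sketch; attributes other than 'node-caps' are assumptions.
    @relation breast-cancer
    @attribute node-caps {yes, no}
    @attribute deg-malig {1, 2, 3}
    @attribute class {no-recurrence-events, recurrence-events}
    @data
    yes,3,recurrence-events
    no,1,no-recurrence-events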

Using the Weka Toolkit to extract knowledge from the data set The Weka Toolkit is installed on the machine where the data set resides. The J4.8 classifier is used to classify the data set. In this instance, the attribute ‘node-caps’ is chosen as the root of the tree. The result of the J4.8-based classification is shown in Figure 6.6.
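For readers unfamiliar with the Weka API, the stand-alone classification step can be sketched in a few lines of Java. This is a minimal sketch, not the toolkit's own code; the file name breast-cancer.arff and the choice of the last attribute as the class label are assumptions:

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class J48Example {
        public static void main(String[] args) throws Exception {
            // Load the ARFF file; the file name is an assumption.
            Instances data = DataSource.read("breast-cancer.arff");
            // Assume the class label is the last attribute.
            data.setClassIndex(data.numAttributes() - 1);
            J48 tree = new J48();        // J4.8: Weka's implementation of C4.5
            tree.buildClassifier(data);  // induce the pruned decision tree
            System.out.println(tree);    // prints the tree, as in Figure 6.6
        }
    }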

Using the Classification Web service to extract knowledge from the data set In order to use the Classifier Web service, we first need to obtain the available classifiers that the Web service supports, and the options for the selected classification algorithm. The supported classifier algorithms are obtained by invoking the getClassifiers() operation. For our case study, we select J4.8. The run-time options that the J4.8 algorithm requires are obtained by invoking the getOptions() operation. Once the classifier algorithm and the options have been identified, we can invoke the classifyInstance() operation. The result of the classification in this instance is viewed in a text viewer. As expected, the result of the J4.8 Web service classifier is identical to that of the stand-alone version shown above (J4.8 pruned tree).
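To make the sequence of calls concrete, the following JAX-WS-style client is a hedged sketch of the interaction just described. Only the three operation names come from the text; the WSDL location, namespace, service endpoint interface and parameter types are assumptions, not the toolkit's actual interface:

    import java.net.URL;
    import javax.jws.WebService;
    import javax.xml.namespace.QName;
    import javax.xml.ws.Service;

    @WebService // assumed service endpoint interface
    interface ClassifierPort {
        String[] getClassifiers();                      // list supported algorithms
        String[] getOptions(String classifier);         // run-time options for one algorithm
        String classifyInstance(String classifier,
                                String[] options,
                                String arffData);       // returns the classifier output
    }

    public class ClassifierWsClient {
        public static void main(String[] args) throws Exception {
            // WSDL location and namespace are hypothetical.
            URL wsdl = new URL("http://example.org/faehim/Classifier?wsdl");
            QName name = new QName("http://example.org/faehim", "ClassifierService");
            ClassifierPort port = Service.create(wsdl, name).getPort(ClassifierPort.class);

            for (String c : port.getClassifiers())      // step 1: discover classifiers
                System.out.println(c);
            String[] opts = port.getOptions("J4.8");    // step 2: fetch run-time options
            String arff = "...";                        // ARFF content of the data set
            System.out.println(port.classifyInstance("J4.8", opts, arff)); // step 3
        }
    }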

This example involved the use of four Web services: (i) a call to read the data file from a URI and convert it into a format suitable for analysis, (ii) a call to perform the classification (i.e. one that wraps the J4.8 classifier), (iii) a call to analyse the output generated from the decision tree, and (iv) a call to visualize the output.

4http://archive.ics.uci.edu/ml/

Through the two experiments above, we confirmed that the Web services produce the same results as the stand-alone Weka Toolkit.

6.9.2 Evaluating the running time of the framework

In this set of experiments, we evaluated the running time required by a single Web service to respond to a classification request, compared with the time that the stand-alone Weka Toolkit required to classify a data set residing on the same machine. We also compared the running time of a single Web service with the time that distributed Web services required to respond to the same classification request.

Comparing the running time of a Web service and a stand-alone Weka Toolkit The data used in this experiment are stored in a 518 KB ARFF file containing information about weather conditions. The data set is derived from the work of Zupan et al. (1997) and is available at the UCI Machine Learning Repository. It includes 1850 instances, each described by five attributes (outlook, temperature, humidity, windy and play), some of which are ordinal and others nominal.

We chose the J4.8 classifier to carry out the classification and kept its default run-time parameters. We measured the running time of the Weka Toolkit processing 20 classification requests for the data set, which resided on the same machine, and the running time the Web service required to respond to the same classification requests. Figure 6.7 shows a comparison between the running times of the Weka Toolkit and the Web service.
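The measurement procedure can be sketched as follows; classifyOnce() is a hypothetical stand-in for either the local Weka call or the Web service invocation, and only the count of 20 requests comes from the experiment:

    public class TimingHarness {
        // Hypothetical stand-in for one classification request: either the
        // local Weka call or the Classifier Web service invocation.
        static void classifyOnce() {
        }

        public static void main(String[] args) {
            final int requests = 20;   // number of requests, as in the experiment
            long total = 0;
            for (int i = 0; i < requests; i++) {
                long start = System.nanoTime();
                classifyOnce();
                total += System.nanoTime() - start;   // accumulate elapsed time
            }
            System.out.printf("mean running time: %.2f ms%n",
                              total / (requests * 1e6));
        }
    }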

The above experiment demonstrated that the running time required by a Web service to perform a classification request on the same data set was, on average, twice the time required by the stand-alone algorithm from the Weka Toolkit. This additional time can primarily be attributed to messaging delays and the time to migrate the data set to the remote location.

Comparing the running time of a single Web service and distributed Web services In this experiment, we used a 10 MB data set containing information on weather obtained from the UCI Machine Learning Repository. We again chose the J4.8 classifier and kept its default run-time parameters.

Figure 6.7 Running time of the Weka Toolkit versus Web service

Figure 6.8 Running time of a single Web service versus distributed Web service

We measured the running time of a single Web service for 20 classification requests on the data set and compared it with the running time that distributed Web services (consisting of a master and two nodes) required to process the same requests. Figure 6.8 shows a comparison between the running times of the single and distributed Web services.

The experiment showed that the total running time of the distributed Web services was shorter than that of a single Web service. This arises as a consequence of splitting the data set across multiple Web services, enabling a “high throughput” mode of execution. Figure 6.8 demonstrates the benefits of using multiple distributed services for processing large data sets, under the assumption that a large data set can be split into multiple independent sets.
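The high-throughput pattern can be sketched as follows; split() and classify() are hypothetical stand-ins for the master's data partitioning and a worker node's classification call, and the two-worker setup mirrors the experiment:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class DistributedSketch {
        // Hypothetical stand-in for the master splitting the data set into
        // independent subsets (here it simply duplicates the input).
        static List<String> split(String arff, int parts) {
            List<String> subsets = new ArrayList<>();
            for (int i = 0; i < parts; i++) subsets.add(arff);
            return subsets;
        }

        // Hypothetical stand-in for one node's classification Web service call.
        static String classify(String subset) {
            return "result for subset of length " + subset.length();
        }

        public static void main(String[] args) throws Exception {
            ExecutorService nodes = Executors.newFixedThreadPool(2); // two worker nodes
            List<Future<String>> results = new ArrayList<>();
            for (String subset : split("...arff content...", 2))
                results.add(nodes.submit(() -> classify(subset)));   // dispatch in parallel
            for (Future<String> r : results)
                System.out.println(r.get());                         // master gathers results
            nodes.shutdown();
        }
    }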

6.10 Conclusions

This chapter describes the FAEHIM Toolkit. FAEHIM enables the composition of Web services from a pre-defined toolbox. The Web services have been developed from the Java-based templates provided by the Weka library of algorithms. Using the Triana workflow system, data analysis can be performed on both local and remote data sets. A variety of additional services that facilitate the entire data mining process, covering data translation, visualization and session management, are also supported. A set of empirical experiments is reported, demonstrating the use of the toolkit on publicly available data sets (at the UCI Machine Learning Repository).

References

Khoussainov, R., Zuo, X. and Kushmerick, N. (2004), ‘Grid-enabled Weka: a toolkit for machine learning on the grid’, ERCIM News.

Witten, I. H. and Frank, E. (2005), Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann.

Zupan, B., Bohanec, M., Bratko, I. and Demsar, J. (1997), Machine learning by function decomposition, in ‘ICML-97’.
