
5 Experiments

5.1 Artificial Data

5.2 Real Data

The second set of experiments was performed on real data, most often taken from [16].

For the experiments, the author also used a few commercial datasets that cannot be published; nevertheless, the classification results still show the differences between classifiers (Table 3).

The dataset ‘frauds’ contains fraudulent reward-collecting transactions from the VITAY loyalty program of PKN ORLEN (PKN ORLEN is the biggest oil company in Central Europe). The number of fraudulent transactions is 3072. The dataset ‘Clients’ contains customer profiles that are divided into 6 classes. The smallest class contains 919 profiles and the largest contains 7598 profiles. Each customer is described by a set of demographic features (e.g. sex, age) as well as by behavioural features based on transaction history (e.g. mobility, loyalty).

Table 3 Real datasets used for the classification experiments

Dataset        Events   Attributes   Classes   Source
Alizadehdata   62       4026         3         [2]
Golubdata      38       7129         2         [11]
Frauds         15123    33           2         PKN Orlen
Mushroom       8123     22           2         [16]
Iris           150      4            3         [16]
AbaloneData    4177     8            3         [16]
Soybean        683      35           19        [16]
Tic_tac_toe    958      9            2         [16]
Hypothyroid    3772     30           4         [16]
Ionosphere     351      35           2         [16]
Splice.asx     3190     61           3         [16]
Vote           435      17           2         [16]
Zoo            101      18           7         [16]
Clients        16920    59           6         Real Polska

For each dataset, 3-fold cross-validation was repeated 10 times. The average result of the 10 cross-validations is given in Table 4. The classifiers NaiveBayes (NB), C4.5 (J48), SVM (SMO) and kNN (IBk), as implemented in WEKA ver. 3.4.11, were run with their default parameters (kNN: 1-NN; linear kernel for SVM). The implementation of ADX, written in Java 1.6, was also run with its default parameters. The factor wAcc is the well-known weighted/balanced accuracy, which is sensitive to problems where classes are highly unbalanced. The time given in the table is the average time of a single 3-fold cross-validation run.
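For reference, the evaluation protocol can be approximated with the standard WEKA Evaluation API. The sketch below is only an illustration, not the author's original experiment code: the dataset file name and random seeds are placeholders, and the balanced accuracy is computed here as the unweighted mean of per-class recalls, which may differ in detail from the wAcc used in the paper.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RepeatedCV {
    public static void main(String[] args) throws Exception {
        // Load an ARFF file; the path is a placeholder.
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        Classifier cls = new J48();          // default parameters, as in the experiment
        int folds = 3, repeats = 10;
        double accSum = 0.0, wAccSum = 0.0;

        for (int r = 0; r < repeats; r++) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(cls, data, folds, new Random(r));
            accSum += eval.pctCorrect() / 100.0;

            // Balanced accuracy as the unweighted mean of per-class recalls (an assumption).
            double wAcc = 0.0;
            for (int c = 0; c < data.numClasses(); c++) {
                wAcc += eval.truePositiveRate(c);
            }
            wAccSum += wAcc / data.numClasses();
        }
        System.out.printf("Acc = %.3f, wAcc = %.3f%n",
                accSum / repeats, wAccSum / repeats);
    }
}
```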

The classification quality (see Table 4) of the ADX implementation is comparable to that achieved by popular, effective classifiers, and for most datasets ADX is one of the best classifiers. However, there are a few datasets for which the default parameters of ADX did not lead to the highest performance. By applying a slightly different method for the final selection of rules, we can achieve an average wAcc = 0.985 for ‘tic_tac_toe’, an average wAcc = 0.864 for ‘Soybean’ and an average wAcc = 0.811 for ‘zoo’.

The time needed to process a single cross-validation run is much more diversified. The good scalability of ADX is noticeable especially for large datasets: for the datasets ‘frauds’ and ‘Clients’ the average cross-validation time amounts to several or tens of seconds, compared with hundreds or thousands of seconds for the other classifiers. For the smaller datasets the ADX algorithm is still one of the fastest techniques, but the difference is not as significant.

Table 4 Results of classification of real datasets

Dataset       Measure      ADX          J48          1-NN         SVM          NB
Alizadeh
frauds        CV[s]        6.4          24.6         831.9        196.5        2.9
mushroom
iris          CV[s]        0.01         0.01         0.02         1.93         0.01
abalone       Acc(wAcc)    52.4(52)     52.7(52.7)   50.1(50.2)   54.5(53.1)   51.9(53.7)
abalone       CV[s]        1.64         1.46         16.42        1.24         0.25
soybean
tic_tac       CV[s]        0.16         0.05         0.85         1.22         0.02
hypothyroid
vote          CV[s]        0.2          0.03         0.31         0.13         0.01
zoo           Acc(wAcc)    83.8(72.8)   93.4(85.5)   95(89.9)     95.2(88.1)   95(91)
zoo           CV[s]        0.16         0.01         0.06         1.43         0.01
Clients       Acc(wAcc)    81.6(73.6)   89.6(85.5)   76.2(68.6)   87(83)       76.1(68.5)
Clients       CV[s]        31           50           1739         483          1759

6 Conclusions

In the process of optimizing a classifier's parameters, the speed of learning and testing is very important. Today, in practice, we deal with huge amounts of data, and even a data sample often contains thousands of events. Therefore, if various techniques offer comparable classification quality, the scalability of the classifier is, in many practical applications, the most important factor. The ADX algorithm gives classification results comparable to the well-known popular classification techniques, but its scalability is much better.

The general idea of combining single selectors to lengthen the rules in ADX is very similar to the lengthening of itemsets in the Apriori algorithm [1], but ADX produces classification rules and uses a combination of pcov and ncov to estimate rule quality.

The quality Q of the rules is used to select a fixed number of the best complexes (parents) from which candidate complexes are built. This solution has a positive influence on the efficiency of the algorithm because it limits the number of candidates that can be created. It also helps to create rules that do not need post-pruning, because a high-quality complex can still cover some negative events. The main idea of specializing rules to increase their quality is commonly used in many algorithms based on the sequential covering schema (e.g. AQ, LEM2, CN2, RIPPER [5]). The creation of many rules by combining simple selectors is in some sense analogous to the creation of spline functions in MARS (Multivariate Adaptive Regression Splines [13]) by combining basis functions. However, the ADX algorithm is a new rule-based classifier that is fast and highly efficient, which makes it very attractive for large datasets.
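The parent-selection and candidate-generation scheme described above can be pictured with the following sketch. It is only an illustration of the beam-like specialization idea, not the original ADX code: the complex representation is simplified to attribute-value pairs, and the quality function shown (a hypothetical pcov/(pcov+ncov)) merely stands in for the measure Q defined earlier in the paper.

```java
import java.util.*;

/** Illustrative sketch of beam-like rule specialization (not the original ADX implementation). */
public class ComplexSearchSketch {

    /** A complex is a conjunction of attribute = value selectors. */
    static class Complex {
        final Map<Integer, String> selectors = new HashMap<>();   // attribute index -> value
        double pcov, ncov;                                        // positive / negative coverage

        /** Hypothetical quality measure; the real Q of ADX combines pcov and ncov differently. */
        double quality() {
            return (pcov + ncov) == 0 ? 0.0 : pcov / (pcov + ncov);
        }
    }

    /** Keep only a fixed number of the best complexes to act as parents of the next generation. */
    static List<Complex> selectParents(List<Complex> complexes, int beamWidth) {
        complexes.sort(Comparator.comparingDouble(Complex::quality).reversed());
        return new ArrayList<>(complexes.subList(0, Math.min(beamWidth, complexes.size())));
    }

    /** Specialize each parent by adding one single selector, Apriori-style. */
    static List<Complex> generateCandidates(List<Complex> parents, List<Complex> singleSelectors) {
        List<Complex> candidates = new ArrayList<>();
        for (Complex parent : parents) {
            for (Complex sel : singleSelectors) {
                Integer attr = sel.selectors.keySet().iterator().next();
                if (!parent.selectors.containsKey(attr)) {        // do not reuse an attribute
                    Complex child = new Complex();
                    child.selectors.putAll(parent.selectors);
                    child.selectors.putAll(sel.selectors);
                    candidates.add(child);                        // pcov/ncov re-counted on data later
                }
            }
        }
        return candidates;
    }
}
```

Limiting the parent set in this way keeps the number of generated candidates roughly proportional to the beam width times the number of single selectors, which is the efficiency argument made above.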

References

1. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th international conference on very large databases, Santiago, Chile

2. Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J Jr, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, Staudt LM (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403:503–511

3. Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth International Group, Monterey

4. Clark P, Niblett T (1989) The CN2 induction algorithm. Mach. Learn. 3:261–283

5. Cohen W (1995) Fast effective rule induction. In: Machine learning: proceedings of the twelfth international conference, Lake Tahoe, California

6. Cover TM, Hart PE (1967) Nearest neighbor pattern classification. IEEE Trans. Inform. Theory IT-13(1):21–27

7. Dramiński M (2004) ADX Algorithm: a brief description of a rule based classifier. In: Proceedings of the new trends in intelligent information processing and web mining IIS'2004 symposium. Springer, Zakopane, Poland

8. Dramiński M (2005) Description and practical application of rule based classifier ADX. In: Dramiński M, Grzegorzewski P, Trojanowski K, Zadrozny S (eds) Issues in intelligent systems. Models and techniques (Proceedings of the first Warsaw international seminar on intelligent systems, WISIS 2004). Exit, ISBN 83-87674-91-5

9. Fayyad UM, Irani KB (1992) On the handling of continuous-valued attributes in decision tree generation. Mach. Learn. 8:87–102

10. Fix E, Hodges JL (1951) Discriminatory analysis nonparametric discrimination: Consistency properties, Project 21–49-004, Report no. 4, USAF School of Aviation Medicine, Randolph Field, pp 261–279

11. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh M, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286:531–537

12. Grzymala-Busse JW (2003) MLEM2 - discretization during rule induction. In: Intelligent information processing and web mining: proceedings of the international IIS:IIPWM'03 conference held in Zakopane, Poland

13. Hastie T, Tibshirani R, Friedman J (2001) Elements of statistical learning: data mining, inference and prediction. Springer, New York

14. Kaufman KA, Michalski RS (1999) Learning from inconsistent and noisy data: the AQ18 approach. In: Proceedings of the eleventh international symposium on methodologies for intelligent systems (ISMIS'99), Warsaw, pp 411–419

15. Michalski RS, Kaufman KA (2001) The AQ19 system for machine learning and pattern discovery: a general description and user's guide. Reports of the machine learning and inference laboratory, MLI 01-2. George Mason University, Fairfax

16. Newman DJ, Hettich S, Blake CL, Merz CJ (1998) UCI repository of machine learning databases. University of California, Department of Information and Computer Science, Irvine. http://www.ics.uci.edu/mlearn/MLRepository.html

17. Pawlak Z (1991) Rough sets: theoretical aspects of reasoning about data. Kluwer Academic Publishing, Dordrecht

18. Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann

19. Quinlan JR (1986) Induction of decision trees. Mach. Learn. 1:81–106

20. Stefanowski J (1998) On rough set based approaches to induction of decision rules. In: Polkowski L, Skowron A (eds) Rough sets in data mining and knowledge discovery. Physica-Verlag, pp 500–529

21. The R project for statistical computing. http://www.r-project.org/

22. Witten IH, Frank E (2005) Weka 3: data mining software in Java. Data mining: practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco
