Meta-mining: a meta-learning framework to support the recommendation, planning and optimization of data mining workflows


NGUYEN, Phong. Meta-mining: a meta-learning framework to support the recommendation, planning and optimization of data mining workflows. Thèse de doctorat : Univ. Genève, 2015, no. Sc. 4936

URN : urn:nbn:ch:unige-861312

DOI : 10.13097/archive-ouverte/unige:86131

Available at:

http://archive-ouverte.unige.ch/unige:86131


Université de Genève — Faculté des sciences
Département d'informatique

Professeur Christian Pellegrini
Professeur Stéphane Marchand-Maillet
Professeur Alexandros Kalousis
Docteur Mélanie Hilario

Meta-mining: a Meta-learning Framework to Support the Recommendation, Planning and Optimization of Data Mining Workflows

Thesis presented to the Faculty of Sciences of the University of Geneva to obtain the degree of Doctor of Sciences, specialization in computer science

by Phong Nguyen

of Onex (GE)

Thesis no. 4936

Genève 2016

First of all, I would like to thank my two main supervisors, Dr. Mélanie Hilario and Prof. Alexandros Kalousis, who welcomed me into their research group and allowed me to take part in academic projects of international scope. Without such an environment, I would not have found the strength and motivation to complete my doctoral thesis and become the researcher I am today.

I am deeply grateful to the members of my defense committee, Prof. Christian Pellegrini, Prof. Stéphane Marchand-Maillet and Prof. Michèle Sebag, for their invaluable advice on scientific research. Many thanks also to all my friends and colleagues from the AI group, the University of Geneva, and the e-LICO project, in particular Jun Wang, Adam Woznica, Huyen Do, Jörg-Uwe Kietz, Floarea Serban and Simon Fisher, with whom I shared so many discussions and ideas throughout these years of academic research.

To my family, I am deeply indebted for their unfailing support during all these years of hard work, in particular to Marie, my dear and loving wife, who has been and remains at my side every day, as well as to my brother Duy and his family, and to my parents, who gave me the inspiration to pursue an academic path.

Finally, I dedicate this thesis to my daughter, Mai Lan, who brightens my life every day.



Data mining (DM) can be an extremely complex process in which the data analyst has to assemble into a DM workflow a number of data preprocessing and data mining operators in order to address her/his mining task. To support the DM user in the design of her/his DM process, we propose a novel framework, aptly called meta-mining or process-oriented meta-learning, which significantly extends previous state-of-the-art meta-learning and DM workflow planning approaches. We analyze the whole DM process in order to build meta-models which account for both the specificities of datasets and workflows, each pair of which is related by some performance metric such as classification accuracy. We build the meta-mining models by combining in an innovative manner a number of different technologies and by devising new algorithms; we combine mining over complex structures, workflows, in the presence of domain knowledge, and we derive planning techniques as well as new metric-learning and learning-to-rank algorithms defined over heterogeneous spaces. The planning system we propose is, to our knowledge, the first able to design DM workflows whose expected performance is maximum for a given mining problem. Finally, our framework can also be applied to any recommendation problem with side information, i.e. descriptors on users and items, to solve the cold-start problem.

Contents

List of Figures
List of Tables

Chapter 1  Introduction
    1.1  Main Contributions
    1.2  Thesis Outline

Chapter 2  Background
    2.1  Introduction
    2.2  The Data Mining Process
    2.3  State of the Art in Data Mining Workflow Design Support
        2.3.1  Ontology-based Planning of DM Workflows
        2.3.2  Meta-Learning
    2.4  Discussion

Chapter 3  Data Mining Workflow Pattern Analysis
    3.1  Introduction
    3.2  Motivation
    3.3  Related Work
    3.4  The Data Mining Optimization Ontology
        3.4.1  Core Concepts
        3.4.2  Taxonomy of Classification Algorithms
        3.4.3  Characteristics of Classification Algorithms: C4.5
        3.4.4  Characteristics of Feature Selection Algorithms
    3.5  A Formal Definition of Data Mining Workflow
    3.6  Data Mining Workflow Frequent Pattern Extraction
        3.6.1  Workflow Representation for Generalized Frequent Pattern Mining
        3.6.2  Tree Pattern Mining
    3.7  Discussion

Chapter 4  Meta-mining as a Classification Problem
    4.1  Introduction
    4.2  Motivation
    4.3  Notations
    4.4  Meta-mining Framework
        4.4.1  Rice Model
        4.4.2  Revisiting the Rice Model
        4.4.3  Meta-mining Tasks
    4.5  Building a Meta-Learning and a Meta-mining Problem
        4.5.1  Gathering the Meta-Data
        4.5.2  Representing the Meta-Data
    4.6  Experiments
        4.6.1  Meta-Learning Results
        4.6.2  Meta-mining Results
    4.7  Discussion

Chapter 5  Meta-mining as a Recommendation Problem
    5.1  Introduction
    5.2  Motivation
    5.3  Related Work
    5.4  Learning Similarities for Hybrid Recommendations
        5.4.1  Learning a Dataset Metric
        5.4.2  Learning a Data Mining Workflow Metric
        5.4.3  Learning a Heterogeneous Metric over Datasets and Workflows
    5.5  Experiments
        5.5.1  Baseline Strategies and Evaluation Methodologies
        5.5.2  Experiment Results on the Biological Datasets
        5.5.3  Model Analysis
    5.6  Discussion

Chapter 6
    6.1  Introduction
    6.2  Motivation
    6.3  System Architecture and Operational Pipeline
    6.4  Workflow Planning
        6.4.1  HTN Planning
        6.4.2  Workflow Selection Task
    6.5  The Meta-miner
        6.5.1  Planning with the homogeneous similarity metrics (P1)
        6.5.2  Planning with the heterogeneous similarity measure (P2)
    6.6  Experimental Evaluation
        6.6.1  Base-level Datasets and Data Mining Workflows
        6.6.2  Meta-Learning & Default Methods
        6.6.3  Evaluation Methodology
        6.6.4  Meta-mining Model Selection
        6.6.5  Experimental Results
        6.6.6  Result Analysis
    6.7  Discussion

Chapter 7  Meta-mining as a Learning to Rank Problem
    7.1  Introduction
    7.2  Motivation
    7.3  Learning to Rank
        7.3.1  Evaluation Metric
        7.3.2  LambdaMART
    7.4  Factorized LambdaMART
    7.5  Regularization
        7.5.1  Input-Output Space Regularization
        7.5.2  Weighted NDCG Cost
        7.5.3  Regularized LambdaMART-MF
    7.6  Experiments
        7.6.2  Comparison Baselines
        7.6.3  Meta-mining
        7.6.4  MovieLens
    7.7  Discussion

Chapter 8  Conclusion and Future Work

Appendix A  Supplementary Material for Chapter 4
    A.1  Table of Dataset Characteristics
    A.2  Dataset Descriptions

Appendix B  Supplementary Material for Chapter 6
    B.1  Detailed Results for Scenario 1
    B.2  Detailed Results for Scenario 2

Appendix C  Supplementary Material for Chapter 7
    C.1  Table Results for Meta-mining
    C.2  Table Results for MovieLens

Bibliography

List of Figures

2.1  The knowledge discovery (KD) process and its steps, adapted from Fayyad et al. (1996).
2.2  A snapshot of the DM ontology used by the IDEA system (Bernstein et al., 2005).
3.1  dmop's core concepts (Hilario et al., 2011).
3.2  dmop's classification algorithm taxonomy.
3.3  dmop's characteristics of the C4.5 decision tree algorithm.
3.4  dmop's characteristics of feature selection algorithms.
3.5  Example of a DM workflow that does performance estimation of a combination of feature selection and classification.
3.6  Topological order of the DM workflow given in Figure 3.5.
3.7  Augmented parse tree of the DM workflow originally given in Figure 3.6. Thin edges depict workflow decomposition, double lines depict dmop's concept subsumption and bold lines depict dmop's implement relation.
3.8  Rewritten parse tree of the augmented parse tree given in Figure 3.7.
3.9  Embedded subtree (d) from trees (a), (b) and (c).
3.10 Parse trees of the four experimented feature selection workflows.
3.11 Augmented parse trees of feature selection with ReliefF and CFS. Thin edges depict workflow decomposition, double lines depict dmop's concept subsumption and bold lines depict dmop's implement relation.
3.12 Six patterns extracted from the augmented parse trees of the four workflows given in Figure 3.10.
4.1  Rice model and its four components.
4.2  The new Rice model and its five components.
5.1  Top-ranked workflow patterns according to their average absolute weights given in matrix V.
6.1  The meta-mining system's components and its pipeline.
6.2  HTN plan of the DM workflow given in Figure 3.5. Non-terminal nodes are HTN tasks/methods, except for the dominating operator X-Validation. Abstract operators are in bold and simple operators in italic, each of which is annotated with its I/O specification.
6.3  Percentage of times that a workflow is among the top-5 workflows over the different datasets.
6.4  Average correlation gain K̄g of the different methods against the baseline on the 65 bio-datasets. In the x-axis, k = 2...35, we have the number of top-k workflows suggested to the user. P1 and P2 are the two planning strategies. Metric and Eucl are baseline methods and defX is the default strategy computed over the set of X workflows.
7.1  µ1 and µ2 heatmap parameter distribution at the different truncation levels k = 1, 3, 5 of NDCG@k from the User Cold Start meta-mining experiments. In the y-axis we have the µ1 parameter and in the x-axis the µ2 parameter. We validated each parameter with three-fold inner cross-validation to find the best value in the range [0.1, 1, 5, 7, 10].
7.2  µ1 and µ2 heatmap parameter distribution at the different truncation levels k = 1, 3, 5 of NDCG@k from the Full Cold Start meta-mining experiments. The figure explanation is as before.

List of Tables

4.1  Summary of notations used.
4.2  Average estimated errors for the default classifier, meta-learning task 1 and the three meta-mining tasks 1, 2 and 3. A + sign indicates that the error for the given task was significantly better than the default error, an = that there was no significant difference, and a − that it was significantly worse.
5.1  Evaluation results. δdef and δEC denote comparison results with the default (def) and the Euclidean baseline strategy (EC) respectively. ρ is Spearman's rank correlation coefficient, the higher the better. In t5p we give the average accuracy of the top five workflows proposed by each strategy, the higher the better. mae is the mean average error, the lower the better. X/Y indicates the number of times X that a method was better over all the Y experiments than the default or the baseline strategy, where we denote by (+) a statistically significant improvement, by (=) no performance difference and by (-) a significant loss. In bold, the best method for a given evaluation measure.
6.1  Average accuracy of the top-k workflows suggested by each method. W indicates the number of datasets on which a method achieved a top-k average accuracy larger than that of the default, and L the number of datasets on which it was smaller. p-value is the result of McNemar's statistical significance test; + indicates that the method is statistically better than the default.
7.1  Dataset statistics.
A.1  Dataset characteristics used for the meta-learning/mining experiments.
A.2  Datasets used in the meta-learning/mining experiments.
B.1  Wins/Losses and respective p-values of McNemar's test on the number of times that the Kendall similarity of a method is better than the Kendall similarity of the default, Scenario 1.
B.2  Average Accuracy, Wins/Losses, and respective p-values of McNemar's test on the number of times the Average Accuracy of a method is better than the Average Accuracy of the default, Scenario 1.
B.3  Wins/Losses and p-values of McNemar's test on the number of times the Kendall similarity of a method is better than the Kendall similarity of the default, Scenario 2.
B.4  Avg. Acc., Wins/Losses, and respective p-values of McNemar's test on the number of times the Average Accuracy of a method is better than the Average Accuracy of the default, Scenario 2.
C.1  NDCG@5 results on meta-mining for the Matrix Completion setting. N is the number of workflows we keep in each dataset for training. For each method, we give the comparison results against the CofiRank and LambdaMART methods in the rows denoted by δCR and δLM respectively. More precisely we report the numbers of wins/losses, the p-values of McNemar's test on these values, and denote by (+) a statistically significant improvement, by (=) no performance difference and by (-) a significant loss. In bold, the best method for a given N.
C.2  NDCG@k results on meta-mining for the User Cold Start setting. For each method, we give the comparison results against the user memory-based and LambdaMART methods in the rows denoted by δUB and δLM respectively. The table explanation is as before. In bold, the best method for a given k.
C.3  NDCG@k results on meta-mining for the Full Cold Start setting. For each method, we give the comparison results against the full memory-based and LambdaMART methods in the rows denoted by δFB and δLM respectively. The table explanation is as in Table C.2.
C.4  NDCG@k results on the two MovieLens datasets for the User Cold Start setting. For each method, we give the comparison results against the user memory-based and LambdaMART methods in the rows denoted by δUB and δLM respectively. More precisely we report the numbers of wins/losses, the p-values of McNemar's test on these values, and denote by (+) a statistically significant improvement, by (=) no performance difference and by (-) a significant loss. In bold, the best method for a given k.
C.5  NDCG@k results on the two MovieLens datasets for the Full Cold Start setting. For each method, we give the comparison results against the full memory-based and LambdaMART methods in the rows denoted by δFB and δLM respectively. The table annotation is as before.


Introduction

Learning models and extracting knowledge from data using data mining (DM) can be an extremely complex process which requires combining into a DM workflow a number of DM operators selected from large pools of available operators. Workflows have recently emerged as a new paradigm for representing and managing complex computations, accelerating the pace of scientific progress. With the growing number and complexity of available operators, workflow (meta-)analysis is becoming increasingly challenging (Gil, Deelman, Ellisman, Fahringer, Fox, Gannon, Goble, Livny, Moreau, and Myers, 2007). Providing (meta-)analytic tools and systems that support the DM user in the modeling of her/his DM workflows has recently been identified as one of the top-ten data mining challenges; Yang and Wu (2006), who surveyed the most active researchers in data mining and machine learning on worthy topics for future research, note:

"Specific issues include how to automate the composition of data mining operations and building a methodology into data mining systems to help users avoid many data mining mistakes."

Nonetheless, few systems have been proposed during the last decade that can address the above challenge. Most of them rely on an ontology-based specification of DM operators from which they plan DM workflows following a standard heuristic planning approach: DM operators are selected according to their input-output data specifications, which must fulfill basic characteristics of the given mining problem; for a survey, see Serban, Vanschoren, Kietz, and Bernstein (2012). However, the number of valid plans built by these systems can be extremely large. Moreover, none of them can design DM workflows according to their expected performance, such as classification accuracy, on the given dataset.
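The input-output matching that such planners perform can be illustrated with a small sketch. The operator names and data types below are hypothetical, and the code is not the specification language of any particular system: a valid workflow is simply any chain of operators whose I/O specifications connect the input data type to the goal type.

```python
# Hypothetical operator catalog: name -> (input type, output type).
OPS = {
    "Discretize": ("numeric_table", "nominal_table"),
    "ReliefF":    ("nominal_table", "nominal_table"),  # feature selection
    "NaiveBayes": ("nominal_table", "model"),
    "C4.5":       ("nominal_table", "model"),
}

def plan(state, goal, prefix=(), depth=4):
    """Enumerate operator chains whose I/O specs connect state to goal."""
    if state == goal:
        yield prefix
        return
    if depth == 0:
        return
    for name, (inp, out) in OPS.items():
        if inp == state and name not in prefix:  # no operator repetition
            yield from plan(out, goal, prefix + (name,), depth - 1)

workflows = list(plan("numeric_table", "model"))
print(workflows)
```

Even this four-operator toy catalog yields four valid plans; with hundreds of operators the space of valid workflows explodes, and nothing in the I/O matching itself indicates which plan will perform well — exactly the limitation noted above.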


On the other hand, there is a large body of literature on meta-learning, or learning to learn (Brazdil, Giraud-Carrier, Soares, and Vilalta, 2008; Hilario and Kalousis, 2001; Kalousis, Gama, and Hilario, 2004; Köpf, Taylor, and Keller, 2000; Michie, Spiegelhalter, Taylor, and Campbell, 1994; Pfahringer, Bensusan, and Giraud-Carrier, 2000; Soares and Brazdil, 2000). Meta-learning studies how learning systems can improve their efficiency through experiments; it addresses the task of selecting learning algorithms or models according to their match with a given dataset in terms of performance. The meta-model is learned on a predefined meta-space in which datasets are characterized by various measures and algorithms by their relative performances; for a survey on meta-learning, see Smith-Miles (2008). The typical meta-learning approach focuses solely on the learning phase of the DM process. Moreover, it can only select algorithms from a predefined pool of training algorithms. Thus, with meta-learning we cannot fully automate the DM process, because this approach can neither generalize training performances to new algorithms nor plan DM workflows.
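In its simplest form, such a meta-learner is an ordinary classifier trained on dataset characterizations. The sketch below (with made-up meta-features and past-performance labels, using scikit-learn; not the method of any particular paper cited above) recommends, for a new dataset, the algorithm that performed best on the most similar previously seen dataset:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Each row characterizes one previously seen dataset by hypothetical
# meta-features: log #instances, log #features, average class entropy.
meta_features = np.array([
    [3.0, 1.0, 0.9],
    [3.1, 1.2, 0.8],
    [6.0, 3.0, 0.5],
    [6.2, 2.8, 0.4],
])
# Label = the algorithm that achieved the best accuracy on that dataset,
# as recorded from past experiments.
best_algorithm = np.array(["naive_bayes", "naive_bayes", "svm", "svm"])

# The meta-model: algorithm selection reduced to classification.
meta_model = KNeighborsClassifier(n_neighbors=1)
meta_model.fit(meta_features, best_algorithm)

# Recommend an algorithm for a new, unseen dataset.
recommendation = meta_model.predict([[6.1, 2.9, 0.45]])[0]
print(recommendation)
```

Note how this captures the limitation discussed above: the meta-model can only ever output one of the algorithms it was trained on, and it says nothing about how to chain preprocessing and learning operators into a full workflow.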

In this thesis we will explore a new approach which we call meta-mining, or process-oriented meta-learning. Our framework significantly extends the state-of-the-art approaches to automating the DM process, including meta-learning and ontology-based DM workflow planning systems. We will build our framework on top of several artificial intelligence tools, ranging from meta-learning, ontology-based knowledge representation and pattern mining to planning and machine learning, the result of which can support the DM user in the design of her/his DM process in a robust manner.

Essentially, our goal is to analyze the whole DM process in order to learn meta-models which can account for both the specificities of datasets and workflows, each pair of which will be related by some performance metric such as classification accuracy. To build the workflow descriptors we will make use of a unified framework of data mining and machine learning algorithms, the Data Mining Optimization (dmop) ontology (Hilario, Kalousis, Nguyen, and Woznica, 2009; Hilario, Nguyen, Do, Woznica, and Kalousis, 2011), in order to extract generalized relational features from the graph specification of workflows. To build the meta-mining models we will explore a number of different methods, ranging from classification and metric learning (Wang, Kalousis, and Woznica, 2012; Weinberger and Saul, 2009) to learning to rank (Fürnkranz and Hüllermeier, 2010). Finally, combining meta-mining with DM workflow planning, we will provide a planning system, the first to our knowledge that can dynamically optimize the construction of potentially new DM workflows with respect to their expected performance on a given mining problem.

In addition, the proposed framework is not restricted to meta-mining: it can also be applied to any recommendation problem, such as movie recommendation, where it can address the full cold-start problem of recommender systems; that is, to recommend potentially new items to new users, i.e. instances for which we have no historical data, by exploiting user and item side information.

1.1 Main Contributions

The main contributions of this thesis are the following:

1. We propose a new framework which extends meta-learning to the whole DM process.

We now see the problem of automating the DM process as learning how to match datasets and workflows (or parts thereof) according to the performance one achieves on the other. The learned models can support DM users in various recommendation tasks, including matching new workflows to datasets, matching new datasets to workflows, and eventually matching new datasets to new workflows.

2. We develop specific algorithms to support our framework; we devise a pattern-mining approach to mine DM workflows with the help of domain knowledge, and we propose a metric-learning approach, which implicitly defines two inductive matrix factorization algorithms, for hybrid recommendations in meta-mining. We tailor the latter two algorithms to the cold-start problem by learning meta-mining models which associate dataset and workflow side information with their relative performance in a similarity-based manner.

3. We propose a novel approach to planning DM workflows; we develop the first system which can plan DM workflows with respect to their expected performance on a given mining problem. The system is built on top of a DM workflow planner which we combine with our meta-mining framework. In addition, the system has the unique capability of designing new DM workflows, i.e. workflows that have never been experimented with before, which achieve good performance on the given mining problem.


Publications

This thesis is based on the following publications.

The Data Mining Optimization ontology and the frequent workflow pattern analysis were published in:

• Melanie Hilario, Alexandros Kalousis, Phong Nguyen, and Adam Woznica. A data mining ontology for algorithm selection and meta-learning. In Proceedings of the ECML/PKDD09 Workshop on Third Generation Data Mining: Towards Service-oriented Knowledge Discovery, 2009.

• Melanie Hilario, Phong Nguyen, Huyen Do, Adam Woznica, and Alexandros Kalousis. Ontology-based meta-mining of knowledge discovery workflows. In N. Jankowski, W. Duch, and K. Grabczewski, editors, Meta-Learning in Computational Intelligence. Springer, 2011.

The matrix factorization algorithms for hybrid recommendation in meta-mining were published in:

• Phong Nguyen, Jun Wang, Melanie Hilario, and Alexandros Kalousis. Learning heterogeneous similarity measures for hybrid-recommendations in meta-mining. In IEEE 12th International Conference on Data Mining (ICDM), pages 1026-1031, Dec. 2012.

• Phong Nguyen, Jun Wang, and Alexandros Kalousis. Factorizing LambdaMART for cold start recommendations. Machine Learning, Special Issue of the ECML-PKDD 2016 Journal Track, Springer, Vol. 105, 2016.

The meta-mining planning system was published in:

• Phong Nguyen, Melanie Hilario, and Alexandros Kalousis. A meta-mining infrastructure to support KD workflow optimization. In Proceedings of the ECML/PKDD11 Workshop on Planning to Learn and Service-Oriented Knowledge Discovery, 2011.

• Phong Nguyen, Melanie Hilario, and Alexandros Kalousis. Experimental evaluation of the e-LICO Meta-Miner. In 5th Planning to Learn Workshop WS28 at ECAI 2012, 2012.

• Phong Nguyen, Melanie Hilario, and Alexandros Kalousis. Using meta-mining to support data mining workflow planning and optimization. Journal of Artificial Intelligence Research, Vol. 51, 2014, pp. 605-644.

1.2 Thesis Outline

The outline of the thesis is the following. In Chapter 2, we will define the context of the thesis, where we will introduce the problem of supporting DM users in the modeling of their DM process. We will present the state-of-the-art approaches, which fall broadly into two main categories: ontology-based DM workflow planning systems and meta-learning. We will discuss the limitations of these approaches, which will define the starting point of the thesis.

In Chapter 3, we will describe the Data Mining Optimization (dmop) ontology (Hilario et al., 2009, 2011), a new conceptual framework for data mining and machine learning which formally describes DM algorithms, such as classification and feature selection algorithms, by their learning components, such as cost function and optimization strategy. We will use dmop to extract abstract relational features of DM workflows, with which we will characterize the different operator combinations that can compose a DM workflow.

In Chapter 4, we will describe the Rice model (Rice, 1976), which is used in meta-learning for defining the algorithm selection task, and which we will extend in order to define our meta-mining framework by accounting for workflow characteristics. We will describe the three meta-mining scenarios that we will address in the thesis: learning workflow preferences for a new dataset, learning dataset preferences for a given new workflow, and learning dataset-workflow preferences for a given new pair of dataset and workflow. We will finally describe a real-world meta-mining problem on which we will experiment with the three above scenarios in a classification setting.

In Chapter 5, we will develop a metric-learning approach for hybrid recommendation in meta-mining. We will develop two algorithms to learn similarity measures in the feature space of datasets and workflows: in the first algorithm, we will learn two homogeneous metrics, one for datasets and one for workflows; and in the second algorithm, we will learn a heterogeneous metric which will measure the similarity between datasets and workflows according to their relative performance. We experimentally show that the approach significantly outperforms the state-of-the-art meta-learning methods on the three meta-mining cold-start scenarios introduced in Chapter 4.
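To fix intuitions, a common way to parameterize such a heterogeneous similarity measure (given here as a generic sketch, not necessarily the exact objective of Chapter 5) is a bilinear form between a dataset feature vector $\mathbf{x}$ and a workflow feature vector $\mathbf{w}$:

```latex
\operatorname{sim}(\mathbf{x}, \mathbf{w}) = \mathbf{x}^{\top} \mathbf{M}\, \mathbf{w},
\qquad \mathbf{M} = \mathbf{U}\mathbf{V}^{\top},
```

where the low-rank factorization $\mathbf{M} = \mathbf{U}\mathbf{V}^{\top}$ maps datasets ($\mathbf{U}^{\top}\mathbf{x}$) and workflows ($\mathbf{V}^{\top}\mathbf{w}$) into a shared latent space; learning $\mathbf{U}$ and $\mathbf{V}$ so that $\operatorname{sim}(\mathbf{x}, \mathbf{w})$ tracks the observed performance of a workflow on a dataset is what makes such a method inductive, i.e. applicable to unseen datasets and workflows.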

In Chapter 6, we will explore the combination of the meta-mining models with a DM workflow planner (Kietz, Serban, Bernstein, and Fischer, 2009, 2012). At each planning step, the planner will select the candidate partial workflows whose expected performance is maximum with respect to a given mining problem. We will test two planning scenarios: in the first, we will design DM workflows from an already experimented set of DM operators; in the second, we will design DM workflows with DM operators that have never been experimented with before. In both scenarios we obtain performance improvements over the default baselines, with significant results for the first scenario.

In Chapter 7, we will develop a learning-to-rank approach for cold-start recommendations. The approach is based on LambdaMART (Burges, 2010), the state-of-the-art learning-to-rank algorithm, which we will extend to learn new user and item representations, that is, low-rank user and item profiles that describe their behavior in a latent space. In addition, we will propose several similarity-based regularization methods to learn robust profiles. We evaluate our approach on a meta-mining problem but also on a standard movie recommendation problem, MovieLens. In both experiments we outperform in a statistically significant manner the different baselines, including LambdaMART itself for the cold-start problem, and CofiRank, the state-of-the-art matrix factorization algorithm in collaborative ranking (Weimer, Karatzoglou, Le, and Smola, 2007), for the task of matrix completion.
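The ranking quality measure that LambdaMART optimizes, NDCG@k, can be computed as follows (a standard implementation of the graded-relevance formula, independent of the thesis's own code):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of a ranked list, truncated at k."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance grades of items in the order a model ranked them, e.g. the
# graded performance of the workflows suggested for a dataset.
print(round(ndcg_at_k([3, 1, 2, 0], k=3), 3))
```

A perfect ranking scores 1.0; the truncation level k reflects that a user typically inspects only the top-k suggested workflows (or movies).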

We finally conclude and outline future work in Chapter 8.


Background

2.1 Introduction

In this chapter, we will set the context in which our work takes place. We will present in Section 2.2 the research problem that lies at the heart of our thesis, namely the problem of designing valid and useful data mining workflows. Then we will describe in Section 2.3 state-of-the-art DM workflow support systems, which broadly fall into two main categories: ontology-based workflow planning systems and meta-learning. We finally discuss the main limitations of those systems in Section 2.4.

2.2 The Data Mining Process

Data Mining(DM) or Knowledge Discovery in Databases (KDD)1 refers to the computa- tional process in which low-level data are analyzed in order to extract high-level knowledge (Fayyad et al., 1996). This process is carried out through the specification of a DM work- flow, i.e. the assembly of individual data transformations and analysis steps, implemented by DM operators, which composes the DM process with which a data analyst chooses to address her/his DM task. Standard workflow models such as the crisp-dm model (Chapman, Clinton, Kerber, Khabaza, Reinartz, Shearer, and Wirth, 2000) decomposes the life cycle of this process into five principal steps; selection, pre-processing, transfor- mation, learning or modeling, and post-processing. Each step can be further decomposed

1 Following current usage, we use these two terms synonymously.


[Figure: the knowledge discovery pipeline: Data -(Selection)-> Target Data -(Pre-processing)-> Preprocessed Data -(Transformation)-> Transformed Data -(Modeling)-> Patterns -(Interpretation/Evaluation)-> Knowledge]

Figure 2.1: The knowledge discovery (KD) process and its steps, adapted from Fayyad et al. (1996).

into lower steps. At each workflow step, data objects are consumed by the respective operators which either transform them or produce new data objects that flow to the next step following the control flow defined by the DM workflow. The process is repeated until relevant knowledge is created, see Figure 2.1.
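The data-flow character of this process can be sketched in a few lines of code. The sketch below is purely illustrative (all operator names and data are invented, not from the thesis): each step consumes the data objects produced by the previous one, and the final modeling step produces a predictor.

```python
# Illustrative sketch of a DM workflow as a chain of operators (all names and
# data invented): each step consumes data objects and produces new ones that
# flow to the next step, mirroring selection -> pre-processing ->
# transformation -> modeling.

def select(records):
    # selection: keep only labeled records relevant to the target task
    return [r for r in records if r["label"] is not None]

def preprocess(records):
    # pre-processing: impute missing numeric values with 0.0
    return [dict(r, x=r["x"] if r["x"] is not None else 0.0) for r in records]

def transform(records):
    # transformation: min-max scale the single feature
    xs = [r["x"] for r in records]
    lo, hi = min(xs), max(xs)
    span = (hi - lo) or 1.0
    return [dict(r, x=(r["x"] - lo) / span) for r in records]

def learn(records):
    # modeling: a trivial "learner" thresholding on the scaled feature
    thr = sum(r["x"] for r in records) / len(records)
    return lambda x: int(x >= thr)

raw = [{"x": 3.0, "label": 1}, {"x": None, "label": 0},
       {"x": 9.0, "label": 1}, {"x": 5.0, "label": None}]

data = raw
for step in (select, preprocess, transform):  # the data-flow of the workflow
    data = step(data)
predictor = learn(data)
```

Real KDSS workflows differ in that each step is itself a configurable operator with its own control flow, but the pipeline of data objects flowing between steps is the same.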

Despite recent efforts to standardize the DM workflow modeling process with a workflow model such as crisp-dm, the (meta-)analysis of DM workflows is becoming increasingly challenging with the growing number and complexity of available operators (Gil et al., 2007). Today's second generation knowledge discovery support systems (KDSS) allow complex modeling of workflows and contain several hundred operators; the RapidMiner platform (Klinkenberg, Mierswa, and Fischer, 2007), in its extended version with Weka (Hall, Frank, Holmes, Pfahringer, Reutemann, and Witten, 2009) and R (R Core Team, 2013), currently offers more than 500 operators, some of which can have very complex data and control flows, e.g. bagging or boosting operators, in which several sub-workflows are interleaved. As a consequence, the number of workflows that can be modeled within these systems is on the order of several million, ranging from simple workflows to very elaborate ones with several hundred operators. The data analyst therefore has to carefully select among those operators the ones that can be meaningfully combined to address his/her knowledge discovery problem. However, even the most sophisticated data miner can be overwhelmed by the complexity of such modeling, having to rely on his/her experience and biases as well as on thorough experimentation in the hope of finding the best operator combination.

With the advance of new-generation KDSS that provide even more advanced functionalities, it becomes important to provide automated support to the user in the workflow modeling process, an issue that has been identified as one of the top-ten challenges in data


mining (Yang and Wu, 2006). During the last decade, a rather limited number of systems have been proposed to address this challenge. In the next section, we will review the two most important research avenues.

2.3 State of the Art in Data Mining Workflow Design Support

In order to support the DM user in building her/his workflows, two main approaches have been developed over the last decades: the first takes a planning approach based on an ontology of DM operators to automatically design DM workflows, while the second, referred to as meta-learning, makes use of learning methods to address, among other tasks, the task of algorithm selection for a given dataset.

2.3.1 Ontology-based Planning of DM Workflows

Bernstein, Provost, and Hill (2005) propose an ontology-based Intelligent Discovery Assistant (ida) that plans valid DM workflows – valid in the sense that they can be executed without any failure – according to basic descriptions of the input dataset such as attribute types, presence of missing values, number of classes, etc. By describing in a DM ontology the input conditions and output effects of DM operators, according to the three main steps of the DM process – pre-processing, modeling and post-processing, see Figure 2.2 – ida systematically enumerates with a workflow planner all possible valid operator combinations, i.e. workflows, that fulfill the data input request. A ranking of the workflows is then computed according to user-defined criteria such as speed or memory consumption, which are measured from past experiments.

Žáková, Křemen, Železný, and Lavrač (2011) propose the kd ontology to support automatic design of DM workflows for relational DM. In this ontology, relational DM algorithms and datasets are modeled with the semantic web language OWL-DL, thereby providing semantic reasoning and inference to query over a DM workflow repository.

Similarly to ida, the ontology characterizes DM algorithms by their data input/output specifications to address DM workflow planning. The authors have developed a translator from their ontology representation to the Planning Domain Definition Language (PDDL) (McDermott et al., 1998), with which they can produce abstract directed acyclic graph workflows using an FF-style planning algorithm (Hoffmann, 2001). They demonstrate their approach on genomic and product engineering (CAD) use cases, producing complex workflows that can make use of relational data structures and background knowledge.
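The precondition/effect style of operator description these planners rely on can be sketched as a toy forward-search planner. Everything below is a hypothetical illustration (operator names and state tokens are invented); real systems such as ida or the kd-ontology planner work over far richer ontological descriptions and PDDL/FF-style machinery.

```python
# Hypothetical sketch of precondition/effect workflow planning in the spirit
# of the systems above: each operator declares the data properties it
# requires (pre), adds, and removes; the planner forward-searches over
# operator sequences until the goal properties hold.

OPERATORS = {
    "Discretize":   {"pre": {"continuous"}, "add": {"categorical"}, "del": {"continuous"}},
    "NaiveBayes":   {"pre": {"categorical"}, "add": {"model"}, "del": set()},
    "DecisionTree": {"pre": {"continuous"}, "add": {"model"}, "del": set()},
}

def plan(state, goal, path=(), depth=4):
    """Yield every operator sequence (up to `depth` steps) that reaches `goal`."""
    if goal <= state:
        yield list(path)
        return
    if depth == 0:
        return
    for name, op in OPERATORS.items():
        if op["pre"] <= state and name not in path:
            next_state = (state - op["del"]) | op["add"]
            yield from plan(next_state, goal, path + (name,), depth - 1)

# dataset with continuous attributes; goal: some model has been built
plans = list(plan({"continuous"}, {"model"}))
```

On this toy domain the planner enumerates exactly the two valid workflows: applying a decision tree directly, or discretizing first and then applying Naive Bayes, whose precondition forbids continuous data.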


[Figure: three example operator descriptions, Discretize (pre-processing), Naive Bayes (induction algorithm) and Tree Pruning (post-processing), each specified by its Input, Output, Preconditions, Incompatibilities, Effects and Heuristic Indicators; e.g. Discretize requires continuous data and has the effects of adding categorical data and removing continuous data]

Figure 2.2: A snapshot of the DM ontology used by the IDEA system (Bernstein et al., 2005).

More recently, the e-LICO project2 featured another ida built upon a planner that constructs DM plans following a hierarchical task network (HTN) planning approach.

The specification of the HTN is given in the Data Mining Workflow (dmwf) ontology (Kietz, Serban, Bernstein, and Fischer, 2009). Like its predecessors, the e-LICO ida has been designed to identify operators whose preconditions are met at a given planning step in order to plan valid DM workflows, and it performs an exhaustive search in the space of possible DM plans.

None of the three DM support systems that we have just discussed considers the eventual performance of the workflows it plans with respect to the DM task they are supposed to address. For example, if our goal is to provide workflows that solve a classification problem, then in planning these workflows we would like to consider a measure of classification performance, such as accuracy, and deliver workflows that optimize it. All the discussed DM support systems deliver an extremely large number of plans, i.e. DM workflows, which are typically ranked with simple heuristics such as workflow complexity or expected execution time, leaving the user at a loss as to which is the best workflow in terms of the expected performance on the DM task that he/she needs to address. Even worse, the planning search space can be so large that the systems may fail to complete the planning process; see for example the discussion in Kietz et al. (2012).

2 http://www.e-lico.eu


2.3.2 Meta-Learning

There has been considerable work aiming to support the user, in view of performance maximization, for a very specific part of the DM process: that of modeling or learning.

A number of approaches have been proposed, collectively identified as meta-learning or learning-to-learn (Brazdil, Giraud-Carrier, Soares, and Vilalta, 2008; Hilario, 2002; Kalousis, 2002; Kalousis and Theoharis, 1999; Soares and Brazdil, 2000). The main idea in meta-learning is that, given an unseen dataset, the system should be able to select or rank a pool of learning algorithms with respect to their expected performance on this dataset; this is referred to as the algorithm selection task (Smith-Miles, 2008). To do so, one builds a meta-learning model from the analysis of past learning experiments, searching for associations between algorithm performances and dataset characteristics.

In the statlog (King, Feng, and Sutherland, 1995; Michie, Spiegelhalter, Taylor, and Campbell, 1994) and metal projects, the members compare a number of classification algorithms on large real-world datasets in order to understand the relation between dataset characteristics and algorithm performances: they use statistical characteristics, as well as information-theoretic measures and the landmarking approach (Peng, Flach, Soares, and Brazdil, 2002), to build a meta-learning model that can predict the class, either best or rest, of an algorithm on unseen datasets, relative to its performance on seen datasets. Other works on algorithm selection include the use of learning curves to estimate the performance of algorithms on dataset samples (Leite and Brazdil, 2005, 2010), the construction of geometrical and topological dataset characteristics to capture the geometrical complexity of classification problems (Ho and Basu, 2002, 2006), and various regression- and ranking-based approaches to build a meta-model; these are most notably non-parametric methods, including instance-based learning, rule-based learning, decision tree and naive Bayes algorithms (Bensusan and Kalousis, 2001; Kalousis and Hilario, 2001; Kalousis and Theoharis, 1999; Soares and Brazdil, 2000), and more recently a random forest approach built on the relative performances of pairs of algorithms over a set of datasets to extract pairwise meta-rules (Sun and Pfahringer, 2013).

It is also worth mentioning the works on portfolio-based algorithm selection for the propositional satisfiability (SAT) problem (Nudelman, Leyton-Brown, Devkar, Shoham, and Hoos, 2004a; Nudelman, Leyton-Brown, Hoos, Devkar, and Shoham, 2004b; Xu, Hutter, Hoos, and Leyton-Brown, 2008). These works follow the same meta-learning approach that we have just described: they build a portfolio of SAT-solver algorithms, from which one can select the best-performing algorithm for a given problem instance. As in meta-learning, the task is to build for each SAT-solver algorithm an empirical hardness


model, i.e. a meta-model, which can predict the runtime or cost of the algorithm on a selected problem instance according to the instance's features. Dataset descriptors for SAT problems are provided by domain expert knowledge and include various statistical features of the logical problems, such as the number of clauses, the number of variables, variable-clause graph features, proximity to Horn formula, etc. Experimental results reported by approaches like SATzilla (Xu et al., 2008, 2012) on different SAT-solver competitions have shown the effectiveness of this approach.

In addition to the algorithm selection task there is the task of model – or parameter – selection3, whose goal is to adjust the procedural – or search/preference – bias in order to give priority to certain hypotheses over others in the hypothesis space of a learning algorithm. Ali and Smith-Miles (2006) propose a meta-learning approach to learn the best kernel to use within support vector machines (SVMs) for classification. The same authors use a similar approach to determine the best method for selecting the width of the RBF kernel (Ali and Smith-Miles, 2007). Another approach to parameter selection is algorithm configuration (Coy, Golden, Runger, and Wasil, 2001; Gratch and Dejong, 1992; Hutter, Hoos, Leyton-Brown, and Stützle, 2009; Hutter, Hoos, and Leyton-Brown, 2011; Minton, 1993; Terashima-Marín and Ross, 1999). In these works, the goal is to search for the best-performing parameter configuration of an algorithm when applied to a given problem instance. Various search algorithms have been proposed, including hill-climbing (Gratch and Dejong, 1992), beam search (Minton, 1993), genetic algorithms (Terashima-Marín and Ross, 1999), experimental design approaches (Coy et al., 2001) and more recently a trajectory-based method (Hutter et al., 2009), all of which try to sequentially minimize the cost related to the application of a given algorithm configuration to the problem instance.
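A minimal sketch of such an algorithm-configuration loop is given below, under the assumption of a synthetic cost function; real systems measure the actual cost of running the configured algorithm on the problem instance, and model-based methods such as smac replace the random sampler with a guided search.

```python
# Hedged sketch of the algorithm-configuration loop: sample candidate
# configurations and keep the one minimizing the cost of applying the
# configured algorithm to the problem instance. The cost function here is a
# synthetic stand-in with its optimum near C=1.0, gamma=0.1.

import random

def cost(config):
    return (config["C"] - 1.0) ** 2 + (config["gamma"] - 0.1) ** 2

def random_search(n_trials=200, seed=0):
    rng = random.Random(seed)
    best_cfg, best_cost = None, float("inf")
    for _ in range(n_trials):
        cfg = {"C": rng.uniform(0.01, 10.0), "gamma": rng.uniform(0.001, 1.0)}
        c = cost(cfg)
        if c < best_cost:
            best_cfg, best_cost = cfg, c
    return best_cfg, best_cost

best_cfg, best_cost = random_search()
```

The sequential structure, propose a configuration, evaluate its cost, keep the incumbent, is common to all the configuration methods cited above; they differ in how the next candidate is proposed.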

More recently, Thornton, Hutter, Hoos, and Leyton-Brown (2013) proposed a novel approach called AutoWeka which is able to automatically and simultaneously choose a learning algorithm and its hyper-parameters to optimize empirical performance on a given learning problem; they combine the Weka data mining platform (Hall et al., 2009) with a Bayesian procedure, the sequential model-based algorithm configuration smac (Hutter et al., 2011), to estimate the performance of a learning algorithm given the most promising candidate set of parameters selected a priori. This approach makes use of conditional parameters, i.e. parameters that are hierarchically conditioned on others, in order to build complex algorithm configurations and selections. For instance,

3 Note that we do not follow the same terminology as Thornton et al. (2013), where the authors use the term "model selection" to refer to what we call algorithm selection and "hyper-parameter optimization" to refer to what we call model selection.


algorithm selection is carried out with what the authors call "root-level" parameters, one per algorithm, which condition the selection of learning parameters with respect to the selected algorithm. It is also possible to configure feature selection component parameters, such as the type of search (greedy or best-first) or the type of feature evaluation (Relief or CFS), from which one can select and configure complex feature selection plus classification workflows for a given learning problem.
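The conditional-parameter idea can be sketched as follows; the parameter space below is invented and far smaller than AutoWeka's, but it shows how a root-level parameter gates which lower-level parameters become active.

```python
# Illustrative sketch (parameter space invented) of conditional parameters:
# the root-level "algorithm" parameter determines which lower-level
# parameters are active, so each sampled configuration contains only the
# parameters valid for the chosen algorithm.

import random

SPACE = {
    "algorithm": ["svm", "tree"],        # root-level parameter
    "svm":  {"C": (0.01, 10.0)},         # active only when algorithm == "svm"
    "tree": {"max_depth": (1.0, 20.0)},  # active only when algorithm == "tree"
}

def sample_configuration(rng):
    algo = rng.choice(SPACE["algorithm"])
    cfg = {"algorithm": algo}
    for name, (lo, hi) in SPACE[algo].items():
        cfg[name] = rng.uniform(lo, hi)
    return cfg
```

Joint algorithm selection and configuration then amounts to searching this hierarchical space with a procedure like the configuration loop above, instead of sampling it at random.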

Overall, meta-learning differs fundamentally from base-level learning in its objectives: while the latter assumes a fixed bias in which learning occurs to find the best hypothesis – or search – bias for the given problem (Mitchell, 1997), the former seeks the algorithm – or representational – bias which will best fit the problem at hand. Representational bias specifies the structure of an algorithm, e.g. the cost function it uses and the decision boundary it draws, such as linear versus non-linear. Search bias specifies how this structure is built during learning. Thus, while base-level learning is only concerned with model selection, meta-learning amounts to dynamically adjusting or selecting the right learning or search bias by which an algorithm or its model will restrict the hypothesis space of a given problem (Vilalta and Drissi, 2002).

There are, however, two main limitations in meta-learning, as well as in the algorithm portfolio and configuration approaches that we have just described. First, algorithms are considered as black boxes; the only relation considered between learning methods is their relative performance on datasets. As a consequence, one first has to run an algorithm on one's datasets in order to characterize it. Moreover, there is no way to generalize the learned meta-models to select algorithms which have not been experimented with in the past; in order to account for a new/unseen algorithm, one has to train it before being able to draw conclusions on its relations with other algorithms.

Second, by focusing mainly on the learning or modeling phase of the knowledge discovery process, meta-learning simply ignores the other steps that can compose this process, such as pre-processing, transformation and post-processing, which can also affect the performance of the process. For all these reasons, it is not yet possible in meta-learning to automatically plan/design DM workflows for a given dataset as in ontology-based DM workflow planning systems.

2.4 Discussion

In Section 2.2, we presented the core problem that we will address in this thesis: that is, to provide intelligent automated support to the DM user in her DM workflow modeling process. We have reviewed in Section 2.3 the state-of-the-art approaches that try to


address this problem. On the one hand, there are knowledge-based planning systems that rely on an ontology of DM operators to build valid DM workflows according to the input/output conditions of the candidate operators. On the other hand, there are the meta-learning and algorithm portfolio approaches, which learn associations between dataset characteristics and algorithm performances to address the task of algorithm selection for a given learning problem.

As already said, none of these approaches can design potentially new DM workflows, i.e. combinations of DM algorithms, whose construction is optimized with respect to a performance measure such as classification accuracy, for the main reason that DM algorithms are viewed in these methods as independent black-box components. To go beyond this limitation, we propose to uncover not only relations between datasets and algorithms, as in meta-learning, but also relations within (combinations of) learning algorithms that lead to good or bad performance on datasets. More precisely, while previous meta-learning approaches aim at characterizing a learning problem in order to select the most appropriate algorithm by taking into consideration known structural properties of the problem (Smith-Miles, 2008), we focus in this thesis on the characteristics, the structural properties, of learning algorithms and the relations they can have between them inside a DM workflow. We will map these workflow characteristics to dataset characteristics according to the performance that the former achieve on the latter, in order to build or select sets of DM operators, i.e. workflows, which are the most appropriate for a given learning problem in terms of their performance.

Our work is close to the recent works on automatic algorithm configuration for algorithm selection, like AutoWeka (Thornton et al., 2013) and AutoFolio (Lindauer, Hoos, Hutter, and Schaub, 2015), but with the ability to generalize our meta-models to unseen datasets and workflows. More precisely, we will combine the two approaches described above as follows. From the ontology-based approach, we will exploit a new DM ontology, the Data Mining Optimization (dmop) ontology (Hilario et al., 2009, 2011), the goal of which is to pry open DM algorithms and characterize them by their learning behavior. On top of this ontology, we will derive new meta-models that associate dataset and algorithm characteristics with respect to their relative performance, in order to support the tasks of DM operator/workflow selection and planning in view of performance optimization. As we will see in the next chapters, our work provides a unique blend of data mining, machine learning and planning algorithms, with which we build the first system, to our knowledge, that is able to design potentially new DM workflows such that their performance, e.g. classification accuracy, is optimized for a given dataset.


Data Mining Workflow Pattern Analysis

3.1 Introduction

In this chapter, we describe a generalized pattern mining approach to extract frequent workflow patterns from the annotated graph structure of DM workflows. These relational patterns will give us provisions to build our meta-mining framework, where we will learn how to associate dataset and algorithm/workflow characteristics with respect to their relative performance. To extract the workflow patterns, we will use the Data Mining OPtimization (dmop) ontology (Hilario et al., 2009, 2011), which we describe in Section 3.4. Before proceeding to the high-level description of dmop, we first review related work in Section 3.3. In Section 3.5, we give a formal definition of DM workflows, and in Section 3.6 we describe our generalized pattern mining approach. Finally, we discuss our approach in Section 3.7.

3.2 Motivation

In Chapter 2, we introduced the notion of DM workflows. As we will see in Section 3.5, these are hierarchical graph structures composed of various data transformation and analysis steps, and they can be very complex, i.e. they can contain several nested sub-structures that specify complex operator combinations; e.g. an operator of type boosting is typically composed of several interleaved combinations of learning algorithms with


which different learning models are produced. These structures are inherently difficult to describe and analyze, not only because of their "spaghetti-like" aspect but also because we do not have any abstract information on which subtask is addressed by the sub-workflows or the different operator combinations (Gil, Deelman, Ellisman, Fahringer, Fox, Gannon, Goble, Livny, Moreau, and Myers, 2007; Van der Aalst and Günther, 2007).

In order to build our meta-mining framework, where we will learn how to prioritize – and eventually plan – workflows so as to recommend the most promising ones for a given dataset, we first need an appropriate description of the possible DM workflow structures and their components, so that we can use these descriptors as workflow characteristics in our framework. More precisely, we will follow the process mining analysis approach (Bose and Aalst, 2009; Greco, Guzzo, and Pontieri, 2008; Medeiros, Karla, and Aalst, 2008; Polyvyanyy, Smirnov, and Weske, 2008), whose task is to extract general patterns over workflow structures using abstractions, where the taxonomical patterns found characterize relations among similar groups of operators or workflows. We will use the dmop ontology – a formal DM ontology which overlays ground specifications of DM workflows and conceptualizes the DM domain in terms of DM algorithms, tasks, models and workflows (Hilario et al., 2009, 2011) – in order to extract generalized frequent workflow patterns from the graph structure of a set of DM workflows. These generalized frequent workflow patterns will give us provisions to characterize DM workflows, with which we will build new meta-mining models and a novel planning system that accounts both for dataset and workflow characteristics with respect to their relative performance.

The main contributions of this Chapter are as follows:

1. By providing a formal definition of DM workflows, we propose to analyze them using the dmop ontology. The goal of this meta-analysis is to decompose DM workflows in a bottom-up approach following the dmop ontology in order to extract frequent abstract workflow patterns that can be reused, interpreted, or adapted, in the DM workflow modeling process (Gil et al., 2007).

2. To address our meta-analysis, we will develop a new abstract representation of DM workflows which can be used with standard pattern mining methods. We demonstrate our approach with a tree mining approach, but the proposed method can be adapted to more complex workflow representations such as graphs, as well as to more advanced mining methods such as constraint-based data mining (Han, Kamber, and Pei, 2006) or relational data mining (Džeroski, 2010).


This Chapter is based on the following publications.

• Melanie Hilario, Alexandros Kalousis, Phong Nguyen, and Adam Woznica. A data mining ontology for algorithm selection and meta-learning. In Proceedings of the ECML/PKDD09 Workshop on Third Generation Data Mining: Towards Service-oriented Knowledge Discovery, 2009.

• Melanie Hilario, Phong Nguyen, Huyen Do, Adam Woznica, and Alexandros Kalousis. Ontology-based meta-mining of knowledge discovery workflows. In N. Jankowski, W. Duch, and K. Grabczewski, editors, Meta-Learning in Computational Intelligence. Springer, 2011.

3.3 Related Work

Frequent pattern extraction is a research problem that has received considerable attention.

There exists a large body of work on mining over transaction datasets, sequences, trees and graphs; for an overview of the domain, see Han, Cheng, Xin, and Yan (2007). Clearly, in the case of DM workflows, we are mostly interested in work on frequent pattern extraction from sequences, trees, and graphs since all of them can be used to represent DM workflows.

Additionally, there has been considerable work on the analysis and mining of workflow data, especially in the field of business process analysis, under different names such as process abstraction, process simplification, and semantic process mining (Bose and Aalst, 2009; Greco, Guzzo, and Pontieri, 2008; Medeiros, Karla, and Aalst, 2008; Polyvyanyy, Smirnov, and Weske, 2008).

The main goal in these works is to derive simplified versions of the workflows in view of the understandability and reuse of their components. Such simplifications have been derived through frequent pattern extraction (Bose and Aalst, 2009) or through clustering of workflows (Greco et al., 2008). In the latter, the authors use top-down hierarchical clustering to group similar workflows together in order to discover different variants of the workflows, i.e. different usage scenarios; they then use a bottom-up approach over the cluster tree in order to automatically build a taxonomy of the different variants, from specific to more general usage scenarios.
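As a simple illustration of frequent pattern extraction over workflow data, the sketch below (operator names invented) flattens each workflow to a sequence of operator names and keeps the contiguous subsequences supported by a minimum number of workflows; the process mining works cited above operate on much richer graph abstractions.

```python
# Toy sketch of frequent pattern extraction over workflows (operator names
# invented): each workflow is flattened to a sequence of operator names and
# every contiguous subsequence occurring in at least `min_support` workflows
# is kept as a frequent pattern.

def frequent_patterns(workflows, min_support=2, max_len=3):
    counts = {}
    for wf in workflows:
        seen = set()
        for n in range(1, max_len + 1):
            for i in range(len(wf) - n + 1):
                seen.add(tuple(wf[i:i + n]))
        for pat in seen:  # count each pattern at most once per workflow
            counts[pat] = counts.get(pat, 0) + 1
    return {p: c for p, c in counts.items() if c >= min_support}

workflows = [
    ["Normalize", "FeatureSelection", "SVM"],
    ["Normalize", "FeatureSelection", "DecisionTree"],
    ["Discretize", "NaiveBayes"],
]
patterns = frequent_patterns(workflows)
# ("Normalize", "FeatureSelection") is frequent with support 2
```

Replacing exact operator names with their ontology ancestors before counting is what turns such patterns into the generalized, taxonomy-aware patterns pursued in this chapter.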


[Figure: dmop's core concepts (DM-Task, DM-Algorithm, DM-Operator, DM-Workflow, DM-Operation, DM-Experiment, Data, DM-Model, DM-Report) and the relations between them (addresses, implements, executes, realizes, achieves, hasInput, hasOutput, hasStep, hasNode, specifiesInputType, specifiesOutputType); DM-Algorithm, DM-Task and DM-Operator instances are instantiated in the DMKB, experiment-level instances in a DMEX-DB]

Figure 3.1: dmop’s core concepts (Hilario et al., 2011)

3.4 The Data Mining Optimization Ontology

In this section, we will give a brief description of the Data Mining Optimization ontology (dmop), since we will use it extensively to characterize the building blocks of DM workflows, i.e. the DM operators. The purpose of dmop is to provide a formal conceptualization of the DM domain by describing DM algorithms and defining their relations in terms of DM tasks, models and workflows. Other DM ontologies, such as ida's DM ontology (Bernstein et al., 2005), the kd ontology (Žáková et al., 2011), and the Data Mining Workflow (dmwf) ontology (Kietz et al., 2009, 2012), describe DM algorithms with basic characteristics such as the types of I/O objects they use and produce or the DM task they achieve, in view of DM workflow planning. The dmop ontology takes a different perspective: it is the first to pry open the DM algorithms' black boxes. It describes DM algorithms in terms of their model structure, optimization cost function and decision strategies, as well as their learning behavior, such as their bias/variance profile, their sensitivity to attribute types, etc. dmop thus provides a rich and in-depth conceptual framework to characterize the DM domain, the goal of which is to support the meta-analysis of DM workflows and their application to a given mining problem, and overall to support all the decision-making steps that determine the outcome of the DM process.

In the next section, we will present dmop's core concepts and how they are organized. In Section 3.4.2, we will describe a taxonomy which categorizes classification algorithms according to how they build their models. Then we will take as an example in Section 3.4.3 the


C4.5 algorithm (Quinlan, 1986, 1993), showing how it is conceptualized with dmop. We will do the same for feature selection algorithms in Section 3.4.4. On the basis of these conceptualizations, we will see in the remainder of this chapter how we can extract frequent workflow patterns from feature selection plus classification workflows.

3.4.1 Core Concepts

At the core of the dmop ontology is the concept of DM-Algorithm. A DM algorithm is related to the DM-Task it addresses, such as predictive or descriptive modeling, and to the input Data it has to analyze. The execution of a DM algorithm on the input data outputs knowledge in the form of a descriptive or predictive DM-Model, typically accompanied by some kind of DM-Report containing the learned models, estimated performance and other meta-data. From a workflow perspective, a DM algorithm is implemented by a DM-Operator, which is a node of the complex graph structure given by a DM-Workflow. The execution of a DM workflow gives a DM-Experiment, where the execution of each DM operator gives a DM-Operation.
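The relations just listed can be summarized, purely for illustration, as subject-predicate-object triples; this sketch follows the prose above and is not dmop's actual OWL serialization.

```python
# Purely illustrative: the core dmop relations described in the text, written
# as subject-predicate-object triples. A sketch following the prose, not the
# ontology's actual OWL serialization.

TRIPLES = [
    ("DM-Algorithm",  "addresses",  "DM-Task"),
    ("DM-Algorithm",  "hasOutput",  "DM-Model"),
    ("DM-Operator",   "implements", "DM-Algorithm"),
    ("DM-Workflow",   "hasNode",    "DM-Operator"),
    ("DM-Experiment", "executes",   "DM-Workflow"),
    ("DM-Operation",  "executes",   "DM-Operator"),
]

def objects_of(subject, predicate):
    # answer simple queries such as "what does a DM-Operator implement?"
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]
```

In the actual ontology these relations are OWL object properties over which a reasoner can infer, e.g., which algorithm a given experiment ultimately executed.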

In the dmop ontology, instances of the DM-Algorithm, DM-Task and DM-Operator concepts are instantiated in the DM knowledge base (dmkb), dmop's assertion box. For a given application domain in which we want to experiment with specific DM workflows on a given set of datasets, instances of the Data, DM-Model and DM-Report concepts, as well as those of the DM-Workflow, DM-Experiment and DM-Operator concepts, are instantiated in the respective DM experiment database (dmex-db). Each dmex-db is located at the lowest level of dmop's architecture, from which further meta-analysis can be carried out with the help of the dmop ontology. Figure 3.1 gives dmop's core concepts and their relations.

3.4.2 Taxonomy of Classification Algorithms

In data mining or machine learning, classification modeling algorithms constitute one of the main classes of algorithms, in which the task is to learn a predictive model from an input space X to an output space Y. Given the plethora of existing classification modeling algorithms, providing a hierarchy of those algorithms is important in order to characterize them. Figure 3.2 shows a part of the dmop ontology's concept hierarchy, or taxonomy, for classification algorithms. In this figure, the top concept ClassificationModellingAlgorithm is a direct subclass of the core concept DM-Algorithm.

Classification modeling algorithms are divided into three broad categories (Bishop and Nasrabadi, 2006). Generative methods approximate the class conditional distribution


[Figure: dmop's classification algorithm taxonomy: ClassificationModellingAlgorithm subsumes GenerativeAlgorithm, DiscriminativeAlgorithm and DiscriminantFunc; below these sit families such as NaiveBayes (NaiveBayesNormal, NaiveBayesKernel, NaiveBayesDiscretized, NaiveBayesMultinomial), discriminants (FishersLinearDiscriminant, NormalLinearDiscriminant, NormalQuadDiscriminant), LogisticRegression, KNearestNeighbors, RecursivePartitioning (CHAID, CART, C4.5), SupportVectorClassifier (LinearSVC, SoftMarginSVC, HardMarginSVC), NeuralNetworks (MLP-Backprop) and SetCovering]

Figure 3.2: dmop’s classification algorithm taxonomy.

P(x|y; Θ) and the class priors P(y; Θ), or the joint probability distribution P(x, y; Θ), by computing those values of the Θ parameters that optimize a given cost function, most often the likelihood of the data. Having done so, they use Bayes' theorem to compute the class posterior P(y|x; Θ). Discriminative methods, such as logistic regression and k-nearest neighbors, approximate the class posterior P(y|x; Θ) directly to determine class memberships. Discriminant functions learn directly a mapping function f(x) from input x onto class label y; most state-of-the-art machine learning methods, such as support vector classifiers (SVC), neural networks and decision trees, follow this approach.
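The generative route just described can be illustrated numerically: estimate the class priors and class-conditionals, then invert them with Bayes' theorem to obtain the class posterior. The probability values below are made up for illustration.

```python
# Toy numeric illustration (probabilities invented) of the generative route:
# from class priors P(y) and class-conditionals P(x|y), Bayes' theorem gives
# the class posterior P(y|x) = P(x|y) P(y) / P(x).

def posterior(x, priors, conditionals):
    joint = {y: priors[y] * conditionals[y][x] for y in priors}  # P(x, y)
    evidence = sum(joint.values())                               # P(x)
    return {y: p / evidence for y, p in joint.items()}           # P(y | x)

priors = {"pos": 0.4, "neg": 0.6}
conditionals = {"pos": {"a": 0.7, "b": 0.3},
                "neg": {"a": 0.2, "b": 0.8}}

post = posterior("a", priors, conditionals)
# P(pos | a) = 0.4 * 0.7 / (0.4 * 0.7 + 0.6 * 0.2) = 0.28 / 0.40 = 0.7
```

Discriminative methods would instead fit P(y|x) directly, and discriminant functions would skip probabilities altogether and output a label.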

Classification methods of the same algorithm family, i.e. those which produce the same type of model structure, are grouped together to form the third stage of the taxonomy, such as NaiveBayes, SupportVectorClassifier, RecursivePartitioning, etc. Below each algorithm family are the different variants that we can find in the literature. For instance, in the work of John and Langley (1995), we have the description of three different versions of the Naive Bayes algorithm, each of which has a specific approach to modeling probabilities on numeric attributes: the normal one, NaiveBayesNormal, which assumes a normal distribution for each numeric attribute; the kernelized one, NaiveBayesKernel, which uses kernel density estimation on those numeric attributes; and the discretized one, NaiveBayesDiscretized, which makes use of a discretization approach to compute probabilities on numeric attributes. In addition, we have the multinomial Naive Bayes


[Figure 3.3 here: graph centred on the C4.5 concept with properties hasFeatureTestEval {InfoGain, InfoGainRatio}; hasOptimizationProblem MinCondClassEntropy, which hasObjectiveFunction CondClassEntropy, assumes IID-Assumption and hasOptimizationGoal Minimize; hasComponentStep ErrorBasedPruning; hasLeafPredictor MajorityVoteClassifier; and hasQuality attributes HandleContinuousFeature, HandleCategoricalFeature, HandleMulticlassClassification, TolerateIrrelevantFeatures, TolerateHighDimensionality, TolerateMissingValues, EagerLearningPolicy, HighVarianceProfile and MultiWayTreeBranchingFactor.]
Figure 3.3: dmop’s characteristics of theC4.5 decision tree algorithm.

version, NaiveBayesMultiNomial, for text classification, in which word probabilities follow a multinomial distribution (McCallum and Nigam, 1998). In the same manner, we model the support vector classification algorithm (Cortes and Vapnik, 1995; Vapnik, 1998) with respect to its optimization strategy, using either a soft or a hard margin, which gives the SoftMarginSVC and HardMarginSVC algorithm concepts, or with respect to the kernel it uses: linear, which gives the LinearSVC algorithm concept, radial, polynomial, etc. For decision tree algorithms, also known as recursive partitioning algorithms, we have specific algorithm concepts such as CHAID (Kass, 1980), CART (Breiman, 2001), and C4.5 (Quinlan, 1986, 1993).
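The difference between the NaiveBayesNormal and NaiveBayesKernel variants lies entirely in how they estimate the class-conditional density of a numeric attribute. A minimal sketch, with an invented bimodal sample, shows why the distinction matters: a single Gaussian underestimates density near the modes, while a kernel estimate follows them.

```python
import math

def normal_likelihood(x, values):
    """NaiveBayesNormal: fit a single Gaussian to the class's attribute values."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    sigma = math.sqrt(var) or 1e-9  # guard against zero variance
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def kernel_likelihood(x, values, h=0.5):
    """NaiveBayesKernel: average of Gaussian kernels centred on each training value
    (h is an illustrative bandwidth, not a value prescribed by the variant)."""
    k = lambda u: math.exp(-u * u / 2) / math.sqrt(2 * math.pi)
    return sum(k((x - v) / h) for v in values) / (len(values) * h)

# Hypothetical bimodal attribute values for one class: a single Gaussian fits poorly.
vals = [1.0, 1.2, 5.0, 5.3]
```

Near the mode at 1.1, the kernel estimate assigns noticeably higher density than the single-Gaussian fit, which spreads its mass over the gap between the two clusters.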

3.4.3 Characteristics of Classification Algorithms: C4.5

We now present an example of classification algorithm modelling with dmop: the C4.5 decision tree algorithm (Quinlan, 1986, 1993), shown in Figure 3.3. Decision tree algorithms recursively partition the training examples by finding, at each stage of the tree, the best feature split according to a given cost function. In the case of C4.5, the algorithm uses information gain or information gain ratio as its splitting criterion to minimize the conditional class entropy. dmop describes these two learning components by the two taxonomic relations:

C4.5 ⊑ ∀hasFeatureTestEval.InfoGain

C4.5 ⊑ ∀hasOptimizationProblem.MinCondClassEntropy

where ⊑ denotes a concept inclusion axiom in description logic (Baader, Calvanese, McGuinness, Nardi, and Patel-Schneider, 2003).
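The splitting criterion referred to by these axioms can be sketched directly: information gain is the reduction in conditional class entropy obtained by partitioning the examples on a feature. The snippet below is a simplified illustration for a categorical feature, not C4.5's full implementation (which also handles continuous thresholds and the gain-ratio normalization).

```python
import math
from collections import Counter

def entropy(labels):
    """Class entropy H(Y) in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """InfoGain = H(Y) - H(Y|X): the drop in conditional class entropy
    achieved by splitting on a categorical feature."""
    n = len(labels)
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - cond
```

A feature that perfectly separates the classes yields a gain equal to the full class entropy, while a feature independent of the class yields zero gain.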

There is an additional post-processing step in C4.5 in which leaves are pruned according to their error rate on a validation set to avoid over-fitting, referred to as error-based pruning (Breiman, 2001). Finally, to predict class labels on new instances, C4.5 applies a majority vote rule within the leaf into which each instance falls. dmop describes these two properties with the following taxonomic relations:

C4.5 ⊑ ∀hasComponentStep.ErrorBasedPruning

C4.5 ⊑ ∀hasLeafPredictor.MajorityVoteClassifier
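These two components are simple to illustrate. The leaf predictor is a plain majority vote over the training examples reaching a leaf; the pruning step is sketched here only schematically, since C4.5 actually compares pessimistic upper bounds on the error rates rather than the raw error counts used below.

```python
from collections import Counter

def leaf_prediction(leaf_labels):
    """MajorityVoteClassifier: predict the most frequent class among the
    training examples that reached this leaf."""
    return Counter(leaf_labels).most_common(1)[0][0]

def prune_if_worse(subtree_errors, leaf_errors):
    """Schematic pruning test: replace a subtree by a single leaf when the
    leaf's (estimated) error is no worse than the subtree's. This is a
    simplification of C4.5's error-based pruning."""
    return leaf_errors <= subtree_errors
```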

In addition to those characteristics which describe the structure of the C4.5 algorithm, dmop also provides qualitative algorithm characteristics via the hasQuality property of the dolce upper-ontology (Keet, Lawrynowicz, d'Amato, and Hilario, 2013; Keet, Lawrynowicz, d'Amato, Kalousis, Nguyen, Palma, Stevens, and Hilario, 2015). See for example the taxonomic relations:

C4.5 ⊑ ∀hasQuality.HandleContinuousFeature

C4.5 ⊑ ∀hasQuality.TolerateHighDimensionality

in Figure 3.3. These qualitative attributes describe the capacities of an algorithm with respect to the input dataset: whether it can handle continuous features, tolerate high-dimensional datasets, tolerate missing values, etc. They also describe learning capabilities, such as tolerating irrelevant features or having a high bias or a high variance profile (Domingos, 2000; Kohavi, Wolpert, et al., 1996).
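Such qualitative attributes are what makes the ontology usable for recommendation: given the requirements a dataset imposes, one can filter out algorithms whose declared qualities do not cover them. The rendering below as Python sets is a hypothetical simplification; the real dmop encodes these assertions as OWL restrictions, and the quality sets shown are illustrative, not exhaustive.

```python
# Hypothetical, simplified rendering of dmop-style hasQuality assertions.
ALGORITHM_QUALITIES = {
    "C4.5": {"HandleContinuousFeature", "HandleCategoricalFeature",
             "TolerateMissingValues", "TolerateHighDimensionality"},
    "NaiveBayesNormal": {"HandleContinuousFeature",
                         "TolerateHighDimensionality"},
}

def candidate_algorithms(required, qualities=ALGORITHM_QUALITIES):
    """Return the algorithms whose declared qualities cover all of a
    dataset's requirements (subset test on the quality sets)."""
    return sorted(a for a, q in qualities.items() if required <= q)
```

For a dataset with missing values, only C4.5 survives the filter under these (invented) quality sets, while both algorithms qualify when the only requirement is handling continuous features.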

3.4.4 Characteristics of Feature Selection Algorithms

Another important class of DM algorithms are feature selection (FS) algorithms. Feature selection is a particular case of dimensionality reduction where the feature dimensionality is reduced by eliminating those features that are irrelevant or redundant according to
