Discussion - Meta-mining: a meta-learning framework to support the recommendation, planning and

Chapter 2 Background

2.4 Discussion

In Section 2.2, we presented the core problem addressed in this thesis: providing intelligent, automated support to the DM user in her DM workflow modeling process. In Section 2.3, we reviewed the state-of-the-art approaches that try to address this problem. On the one hand, there are knowledge-based planning systems that rely on an ontology of DM operators to build valid DM workflows according to the input/output conditions of the candidate operators. On the other hand, there are the meta-learning and algorithm portfolio approaches, which learn associations between dataset characteristics and algorithm performance to address the task of algorithm selection for a given learning problem.

As noted earlier, none of these approaches can design potentially new DM workflows, i.e. combinations of DM algorithms, whose construction is optimized with respect to a performance measure such as classification accuracy. The main reason is that these methods view DM algorithms as independent black-box components. To go beyond this limitation, we propose to uncover not only relations between datasets and algorithms, as in meta-learning, but also relations within (combinations of) learning algorithms that lead to good or bad performance on datasets. More precisely, while previous meta-learning approaches aim at characterizing a meta-learning problem in order to select the most appropriate algorithm by taking into consideration known structural properties of the problem (Smith-Miles, 2008), we will focus in this thesis on the characteristics, i.e. the structural properties, of learning algorithms and on the relations they can have with one another inside a DM workflow. We will relate these workflow characteristics to dataset characteristics according to the performance that workflows achieve on datasets, in order to build or select the sets of DM operators, i.e. workflows, that are most appropriate for a given learning problem in terms of performance.

Our work is close to recent work on automatic algorithm configuration for algorithm selection, such as AutoWeka (Thornton et al., 2013) and AutoFolio (Lindauer, Hoos, Hutter, and Schaub, 2015), but with the ability to generalize our meta-models to unseen datasets and workflows. More precisely, we will combine the two approaches described above as follows. From the ontology-based approach, we will exploit a new DM ontology, the Data Mining Optimization (dmop) ontology (Hilario et al., 2009, 2011), the goal of which is to pry open DM algorithms and characterize them by their learning behaviors. On top of this ontology, we will derive new meta-models that associate dataset and algorithm characteristics with respect to their relative performance, in order to support DM operator/workflow selection and planning with a view to performance optimization. As we will see in the next Chapters, our work provides a unique blend of data mining, machine learning and planning algorithms: we will build what is, to our knowledge, the first system able to design potentially new DM workflows whose performance, such as classification accuracy, is optimized for a given dataset.

Chapter 3 Data Mining Workflow Pattern Analysis

3.1 Introduction

In this Chapter, we describe a generalized pattern mining approach to extract frequent workflow patterns from the annotated graph structure of DM workflows. These relational patterns will provide the basis for our meta-mining framework, in which we will learn how to associate dataset and algorithm/workflow characteristics with respect to their relative performance. To extract the workflow patterns, we will use the Data Mining OPtimization (dmop) ontology (Hilario et al., 2009, 2011), which we describe in Section 3.4. Before proceeding to the high-level description of dmop, we first review related work in Section 3.3. In Section 3.5, we give a formal definition of DM workflows, and in Section 3.6 we describe our generalized pattern mining approach. Finally, we discuss our approach in Section 3.7.

3.2 Motivation

In Chapter 2, we introduced the notion of DM workflows. As we will see in Section 3.5, these are hierarchical graph structures composed of various data transformation and analysis steps. They can be very complex, i.e. they can contain several nested sub-structures that specify complex operator combinations; e.g. an operator of type boosting is typically composed of several interleaved combinations of learning algorithms with which different learning models are produced. These structures are inherently difficult to describe and analyze, not only because of their “spaghetti-like” aspects but also because we do not have any abstract information on which subtask is addressed by the sub-workflows or the different operator combinations (Gil, Deelman, Ellisman, Fahringer, Fox, Gannon, Goble, Livny, Moreau, and Myers, 2007; Van der Aalst and Günther, 2007).

In order to build our meta-mining framework, in which we will learn how to prioritize, and possibly plan, workflows so as to recommend the most promising ones for a given dataset, we first need an appropriate description of the possible DM workflow structures and their components, so that we can use these descriptors as workflow characteristics in our framework.

More precisely, we will follow the process mining analysis approach (Bose and Aalst, 2009; Greco, Guzzo, and Pontieri, 2008; Medeiros, Karla, and Aalst, 2008; Polyvyanyy, Smirnov, and Weske, 2008), whose task is to extract general patterns over workflow structures using abstractions, where the taxonomical patterns found characterize relations among similar groups of operators or workflows. We will use the dmop ontology, a formal DM ontology which overlays ground specifications of DM workflows and conceptualizes the DM domain in terms of DM algorithm, task, model and workflow (Hilario et al., 2009, 2011), in order to extract generalized frequent workflow patterns from the graph structure of a set of DM workflows. These generalized frequent workflow patterns will allow us to characterize DM workflows; with them we will build new meta-mining models and a novel planning system that account for both dataset and workflow characteristics with respect to their relative performance.

The main contributions of this Chapter are as follows:

1. By providing a formal definition of DM workflows, we propose to analyze them using the dmop ontology. The goal of this meta-analysis is to decompose DM workflows in a bottom-up approach following the dmop ontology in order to extract frequent abstract workflow patterns that can be reused, interpreted, or adapted, in the DM workflow modeling process (Gil et al., 2007).

2. To address our meta-analysis, we will develop a new abstract representation of DM workflows which can be used with standard pattern mining methods. We demonstrate our approach with a tree mining approach, but the proposed method can be adapted to more complex workflow representations such as graphs, as well as to more advanced mining methods such as constraint-based data mining (Han, Kamber, and Pei, 2006) or relational data mining (Džeroski, 2010).

This Chapter is based on the following publications.

Melanie Hilario, Alexandros Kalousis, Phong Nguyen, and Adam Woznica. A data mining ontology for algorithm selection and meta-learning. In Proceedings of the ECML/PKDD09 Workshop on Third Generation Data Mining: Towards Service-oriented Knowledge Discovery, 2009.

Melanie Hilario, Phong Nguyen, Huyen Do, Adam Woznica, and Alexandros Kalousis. Ontology-based meta-mining of knowledge discovery workflows. In N. Jankowski, W. Duch, and K. Grabczewski, editors, Meta-Learning in Computational Intelligence. Springer, 2011.

3.3 Related Work

Frequent pattern extraction is a research problem that has received considerable attention. There exists a large body of work on mining over transaction datasets, sequences, trees and graphs; for an overview of the domain, see Han, Cheng, Xin, and Yan (2007). Clearly, in the case of DM workflows, we are mostly interested in work on frequent pattern extraction from sequences, trees, and graphs, since all of them can be used to represent DM workflows. Additionally, there has been considerable work on the analysis and mining of workflow data, especially in the field of business process analysis, under different names such as process abstraction, process simplification, and semantic process mining (Bose and Aalst, 2009; Greco, Guzzo, and Pontieri, 2008; Medeiros, Karla, and Aalst, 2008; Polyvyanyy, Smirnov, and Weske, 2008).

The main goal of these works is to derive simplified versions of the workflows, with a view to the understandability and reuse of their components. Such simplifications have been derived through the use of frequent pattern extraction (Bose and Aalst, 2009), or through clustering of workflows (Greco et al., 2008). In the latter, the authors use a top-down hierarchical clustering to group similar workflows together in order to discover different variants of the workflows, i.e. different usage scenarios; they then use a bottom-up approach over the cluster tree in order to automatically build a taxonomy of the different variants, from specific to more general usage scenarios.


Figure 3.1: dmop’s core concepts (Hilario et al., 2011)

3.4 The Data Mining Optimization Ontology

In this section, we give a brief description of the Data Mining Optimization ontology (dmop), since we will use it extensively to characterize the building blocks of DM workflows, i.e. the DM operators. The purpose of dmop is to provide a formal conceptualization of the DM domain by describing DM algorithms and defining their relations in terms of DM tasks, models and workflows. Other DM ontologies, such as the ida’s DM ontology (Bernstein et al., 2005), the kd ontology (Záková et al., 2011), and the Data Mining Workflow (dmwf) ontology (Kietz et al., 2009, 2012), describe DM algorithms with basic characteristics, such as the types of I/O objects they use and produce or the DM task they achieve, with a view to DM workflow planning. The dmop ontology takes a different perspective: it is the first one that pries open the black boxes of DM algorithms. It describes DM algorithms in terms of their model structure, optimization cost function and decision strategies, as well as their learning behavior, such as their bias/variance profile, their sensitivity to the type of attributes, etc. dmop thus provides a rich and in-depth conceptual framework to characterize the DM domain. Its goal is to support the meta-analysis of DM workflows and their application to a given mining problem and, overall, to support all decision-making steps that determine the outcome of the DM process.

In Section 3.4.1, we present dmop’s core concepts and how they are organized. In Section 3.4.2, we describe a taxonomy which categorizes classification algorithms according to their model building. Then, in Section 3.4.3, we use the C4.5 algorithm (Quinlan, 1986, 1993) as an example, showing how it is conceptualized with dmop. We do the same for feature selection algorithms in Section 3.4.4. On the basis of these conceptualizations, we will see in the remainder of this Chapter how we can extract frequent workflow patterns from feature selection plus classification workflows.

3.4.1 Core Concepts

At the core of the dmop ontology is the concept of DM-Algorithm. A DM algorithm is related to the DM-Task it addresses, such as predictive modeling or descriptive modeling, and to the input Data it will have to analyze. The execution of a DM algorithm on the input data will output knowledge in the form of a descriptive or predictive DM-Model, typically accompanied by some kind of DM-Report containing the learned models, estimated performance and other meta-data. From a workflow perspective, a DM algorithm is implemented by a DM-Operator, which is a node of the complex graph structure given by a DM-Workflow. The execution of a DM workflow gives a DM-Experiment, where the execution of each DM operator gives a DM-Operation.

In the dmop ontology, instances of the DM-Algorithm, DM-Task and DM-Operator concepts are stored in the DM knowledge base (dmkb) or dmop’s assertion box. For a given application domain in which we want to experiment with specific DM workflows on a given set of datasets, instances of the Data, DM-Model and DM-Report concepts, as well as those of the DM-Workflow, DM-Experiment and DM-Operator concepts, are instantiated in the respective DM experiment database (dmex-db). Each dmex-db is located at the lowest level of dmop’s architecture, and further meta-analysis can be carried out on it with the help of the dmop ontology. Figure 3.1 gives dmop’s core concepts and their relations.

3.4.2 Taxonomy of Classification Algorithms

In data mining or machine learning, classification modeling algorithms constitute one of the main classes of algorithms, in which the task is to learn a predictive model from an input space X to an output space Y. Given the plethora of existing classification modeling algorithms, providing a hierarchy of those algorithms is important in order to characterize them. Figure 3.2 shows a part of the concept hierarchy, or taxonomy, for classification algorithms of the dmop ontology. In this Figure, the top concept ClassificationModellingAlgorithm is a direct subclass of the core concept DM-Algorithm.

Figure 3.2: dmop’s classification algorithm taxonomy.

Classification modeling algorithms are divided into three broad categories (Bishop and Nasrabadi, 2006). Generative methods approximate the class conditional distribution P(x|y;Θ) and the class priors P(y;Θ), or the joint probability distribution P(x, y;Θ), by computing those values of the Θ parameters that optimize a given cost function, most often the likelihood of the data. Having done so, they use Bayes’ theorem to compute the class posterior P(y|x;Θ). Discriminative methods, such as logistic regression and k-nearest neighbors, directly approximate the class posterior P(y|x;Θ) to determine class memberships. Discriminative functions directly learn a mapping function f(x) from an input x onto a class label y; most state-of-the-art machine learning methods, such as support vector classifiers (SVC), neural networks and decision trees, follow this approach.
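For reference, the Bayes step used by generative methods can be written out as follows. The sum over class labels y' in the denominator is the standard normalization term; the symbol y' is introduced here for illustration and is not part of the notation used above.

```latex
P(y \mid x; \Theta) \;=\; \frac{P(x \mid y; \Theta)\, P(y; \Theta)}{\sum_{y'} P(x \mid y'; \Theta)\, P(y'; \Theta)}
```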

Classification methods of the same algorithm family, i.e. those which produce the same type of model structure, are grouped together to form the third level of the taxonomy, such as NaiveBayes, SupportVectorClassifier, RecursivePartitioning, etc. Below each algorithm family are the different variants that we can find in the literature. For instance, John and Langley (1995) describe three versions of the Naive Bayes algorithm, each with a specific approach to modeling probabilities on numeric attributes: the normal one, NaiveBayesNormal, which assumes a normal distribution for each numeric attribute; the kernelized one, NaiveBayesKernel, which uses kernel density estimation on those numeric attributes; and the discrete one, NaiveBayesDiscretized, which uses a discretization approach to compute probabilities on numeric attributes. In addition, we have the multinomial Naive Bayes version, NaiveBayesMultiNomial, for text classification, in which word probabilities follow a multinomial distribution (McCallum and Nigam, 1998). In the same manner, we model support vector classification algorithms (Cortes and Vapnik, 1995; Vapnik, 1998) with respect to their optimization strategy, using either a soft or a hard margin, which gives the SoftMarginSVC and HardMarginSVC algorithm concepts, or with respect to the kernel they use: linear, which gives the LinearSVC algorithm concept, radial, polynomial, etc. For decision tree algorithms, also known as recursive partitioning algorithms, we have specific algorithm concepts such as CHAID (Kass, 1980), CART (Breiman, 2001), and C4.5 (Quinlan, 1986, 1993).

Figure 3.3: dmop’s characteristics of the C4.5 decision tree algorithm.
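To make the layered taxonomy described above more concrete, the following minimal sketch stores a fragment of it as a child-to-parent map and walks it upward. It is an illustration only, not the dmop implementation: the names of the three category-level concepts (GenerativeAlgorithm, DiscriminativeAlgorithm, DiscriminantFunctionAlgorithm) and the helper functions are assumptions, while the family and variant names are those given in the text.

```python
# Hypothetical fragment of the classification taxonomy (child -> parent).
# The three category-level concept names are assumptions for illustration;
# the family and variant names follow the text.
PARENT = {
    "ClassificationModellingAlgorithm": "DM-Algorithm",
    "GenerativeAlgorithm": "ClassificationModellingAlgorithm",            # assumed name
    "DiscriminativeAlgorithm": "ClassificationModellingAlgorithm",        # assumed name
    "DiscriminantFunctionAlgorithm": "ClassificationModellingAlgorithm",  # assumed name
    "NaiveBayes": "GenerativeAlgorithm",
    "NaiveBayesNormal": "NaiveBayes",
    "NaiveBayesKernel": "NaiveBayes",
    "NaiveBayesDiscretized": "NaiveBayes",
    "NaiveBayesMultiNomial": "NaiveBayes",
    "SupportVectorClassifier": "DiscriminantFunctionAlgorithm",
    "LinearSVC": "SupportVectorClassifier",
    "RecursivePartitioning": "DiscriminantFunctionAlgorithm",
    "CHAID": "RecursivePartitioning",
    "CART": "RecursivePartitioning",
    "C4.5": "RecursivePartitioning",
}


def ancestors(concept):
    """Return the chain of superclasses of `concept`, nearest first."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def common_ancestor(a, b):
    """Return the most specific concept that subsumes both `a` and `b`, or None."""
    subsumers_of_b = set([b] + ancestors(b))
    for concept in [a] + ancestors(a):
        if concept in subsumers_of_b:
            return concept
    return None
```

Under these assumptions, `common_ancestor("C4.5", "CART")` yields the family concept RecursivePartitioning, which is the kind of generalization that lets two workflows using different decision tree variants share an abstract description.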

3.4.3 Characteristics of Classification Algorithms: C4.5

We now look at an example of how a classification algorithm is modeled with dmop: the C4.5 decision tree algorithm (Quinlan, 1986, 1993), shown in Figure 3.3. Decision tree algorithms recursively partition the training examples by finding, at each stage of the tree, the best feature split that minimizes a given cost function. C4.5 uses information gain or information gain ratio as its splitting criterion to minimize the conditional class entropy. dmop describes these two learning components by the two taxonomic relations:

C4.5 ⊑ ∀hasFeatureTestEval.InfoGain

C4.5 ⊑ ∀hasOptimizationProblem.MinCondClassEntropy

where ⊑ denotes the concept inclusion axiom in the description logic language (Baader, Calvanese, McGuinness, Nardi, and Patel-Schneider, 2003).
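The link between the splitting criterion and the optimization problem can be made explicit with a small sketch: since the class entropy H(Y) is fixed for a given set of examples, maximizing the information gain H(Y) - H(Y|X) is the same as minimizing the conditional class entropy H(Y|X). The code below is a minimal illustration for discrete features only; the function names are ours, and it is not C4.5 itself (no handling of continuous features, missing values or pruning).

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(feature_values, labels):
    """Class entropy that remains after partitioning on a discrete feature."""
    n = len(labels)
    partitions = {}
    for value, label in zip(feature_values, labels):
        partitions.setdefault(value, []).append(label)
    return sum(len(part) / n * entropy(part) for part in partitions.values())


def info_gain(feature_values, labels):
    """Reduction in class entropy obtained by splitting on the feature."""
    return entropy(labels) - conditional_entropy(feature_values, labels)


def gain_ratio(feature_values, labels):
    """Information gain normalized by the entropy of the split itself."""
    split_info = entropy(feature_values)
    return info_gain(feature_values, labels) / split_info if split_info > 0 else 0.0
```

A feature that separates the classes perfectly leaves no conditional entropy and therefore attains the full class entropy as gain, whereas a feature that is independent of the class has zero gain.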

There is an additional post-processing step in C4.5 in which leaves are pruned according to their error rate on a validation set to avoid over-fitting, referred to as error-based pruning (Breiman, 2001). Finally, to predict class labels on new instances, C4.5 applies a majority vote rule in the leaf into which those instances fall. dmop describes these two properties with the following taxonomic relations:

C4.5 ⊑ ∀hasComponentStep.ErrorBasedPruning

C4.5 ⊑ ∀hasLeafPredictor.MajorityVoteClassifier

In addition to those characteristics, which describe the structure of the C4.5 algorithm, dmop also provides qualitative algorithm characteristics following the hasQuality property of the dolce upper ontology (Keet, Lawrynowicz, d’Amato, and Hilario, 2013; Keet, Lawrynowicz, d’Amato, Kalousis, Nguyen, Palma, Stevens, and Hilario, 2015). See, for example, the following taxonomic relations in Figure 3.3:

C4.5 ⊑ ∀hasQuality.HandleContinuousFeature

C4.5 ⊑ ∀hasQuality.TolerateHighDimensionality

These qualitative attributes describe capacities of the algorithms with respect to the input dataset: whether they are able to handle continuous features, whether or not they tolerate high-dimensional datasets, whether they tolerate missing values, etc. They also describe learning capabilities, such as tolerating irrelevant features or having a high-bias or high-variance profile (Domingos, 2000; Kohavi, Wolpert, et al., 1996).
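As a rough illustration of how such relations can serve as algorithm descriptors, the sketch below stores the C4.5 relations quoted in this section as (algorithm, property, concept) triples and groups them by property. The triple encoding and the `characteristics` helper are assumptions made for this example; dmop itself is a description logic ontology, and a real system would query it through an ontology reasoner rather than a Python list.

```python
# Hypothetical in-memory view of the relations quoted in this section:
# each triple reads "algorithm ⊑ ∀property.Concept".
C45_AXIOMS = [
    ("C4.5", "hasFeatureTestEval", "InfoGain"),
    ("C4.5", "hasOptimizationProblem", "MinCondClassEntropy"),
    ("C4.5", "hasComponentStep", "ErrorBasedPruning"),
    ("C4.5", "hasLeafPredictor", "MajorityVoteClassifier"),
    ("C4.5", "hasQuality", "HandleContinuousFeature"),
    ("C4.5", "hasQuality", "TolerateHighDimensionality"),
]


def characteristics(algorithm, axioms):
    """Group the concepts asserted for one algorithm by property name."""
    result = {}
    for subject, prop, concept in axioms:
        if subject == algorithm:
            result.setdefault(prop, []).append(concept)
    return result


c45 = characteristics("C4.5", C45_AXIOMS)
```

Here `c45["hasQuality"]` collects the two qualitative characteristics, while the remaining keys describe the structure of the algorithm.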

3.4.4 Characteristics of Feature Selection Algorithms

Another important class of DM algorithms is feature selection (FS) algorithms. Feature selection is a particular case of dimensionality reduction where the feature dimensionality is reduced by eliminating those features that are irrelevant or redundant according to some criterion. For instance, in the case of classification, the selection criterion is the discriminative power of a feature with respect to the class labels. It is thus a combinatorial search inside the feature space where, at each step, one or more features are evaluated until the best ones are found. FS algorithms can be characterized along four dimensions that we will briefly describe now; see Figure 3.4.

Figure 3.4: dmop’s characteristics of feature selection algorithms. Labels shown in the figure: FeatureSelectionAlgorithm, FeatureWeightingAlgorithm; interactsWithLearnerAs {Filter, Wrapper, Embedded}; hasFeatureEvaluator; hasEvaluationTarget {SingleFeature, FeatureSubset}; hasEvaluationContext {Univariate, Multivariate}; hasEvaluationFunction {InfoGain, Chi2, CFS-Merit, Consistency ...}; hasOptimizationStrategy (DiscreteOptimizationStrategy, RelaxationStrategy, SearchStrategy); hasChoicePolicy {Irrevocable, Tentative}; hasSearchGuidance {Blind, Informed}; hasUncertaintyLevel {Deterministic, Stochastic}; hasCoverage {Global, Local}; hasSearchDirection {Forward, Backward ...}; hasDecisionStrategy (DecisionStrategy, StatisticalTest, DecisionRule).

The first dimension, interactsWithLearnerAs, describes how they are coupled with the learning algorithm. In filter methods, such as Correlation Feature Selection (CFS) (Hall, 1998) or ReliefF (Kononenko, 1994), feature selection is done separately from the learning method, as a pre-processing step. The quality of the selected feature subsets is then evaluated by the learning procedure itself. In wrapper methods, feature selection is wrapped around the learning procedure, and the estimated performance of the learned model is used as the selection criterion. In embedded methods, such as SVM-RFE (Guyon, Gunn, Nikravesh, and Zadeh, 2006) or decision trees, feature selection is directly encoded in the learning procedure.
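The difference between the filter and wrapper couplings described above can be sketched as follows. This is a toy illustration under stated assumptions, not CFS, ReliefF or any published method: `agreement` is a made-up learner-independent score, `lookup_accuracy` is a made-up learner evaluated on its own training data, and all function names are ours. Embedded methods are not shown, since there the selection happens inside the learner.

```python
from collections import Counter, defaultdict


def filter_select(X, y, score, k):
    """Filter: rank each feature with a learner-independent score, keep the top k."""
    columns = list(zip(*X))
    ranked = sorted(range(len(columns)), key=lambda j: score(columns[j], y), reverse=True)
    return sorted(ranked[:k])


def wrapper_forward_select(X, y, evaluate, k):
    """Wrapper: greedy forward search guided by the learner's estimated performance."""
    selected, remaining = [], list(range(len(X[0])))
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda j: evaluate(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return sorted(selected)


def agreement(column, y):
    """Toy filter score: fraction of examples whose feature value equals the label."""
    return sum(value == label for value, label in zip(column, y)) / len(y)


def lookup_accuracy(X, y, features):
    """Toy learner: majority label per feature-value combination, scored on training data."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[tuple(row[j] for j in features)].append(label)
    hits = sum(Counter(labels).most_common(1)[0][1] for labels in groups.values())
    return hits / len(y)


# Toy dataset: feature 0 copies the class label, features 1 and 2 carry no class information.
X = [[0, 1, 0], [0, 0, 1], [1, 1, 0], [1, 0, 1]]
y = [0, 0, 1, 1]
```

The filter never calls the learner when choosing features, whereas the wrapper calls `evaluate` for every candidate subset; this is why wrappers are typically more expensive but tailored to the learner at hand.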

The second dimension, hasOptimizationStrategy, describes the (discrete) optimization strategy that FS algorithms use to search in the discrete space of feature subsets. It is determined by five properties: its search coverage (global, local), its direction (forward, backward), its choice policy (irrevocable, tentative), the amount of state knowledge that
