Chapter 3. Data Mining Workflow Pattern Analysis

3.4 The Data Mining Optimization Ontology

3.4.1 Core Concepts

At the core of the dmop ontology is the concept of DM-Algorithm. A DM algorithm is related to the DM-Task it addresses, such as predictive or descriptive modeling, and to the input Data it will have to analyze. The execution of a DM algorithm on the input data outputs knowledge in the form of a descriptive or predictive DM-Model, typically accompanied by some kind of DM-Report containing the learned models, estimated performance and other meta-data. From a workflow perspective, a DM algorithm is implemented by a DM-Operator, which is a node of the complex graph structure given by a DM-Workflow. The execution of a DM workflow gives a DM-Experiment, where the execution of each DM operator gives a DM-Operation.

In the dmop ontology, instances of the DM-Algorithm, DM-Task and DM-Operator concepts are instantiated in the DM knowledge base (dmkb) or dmop’s assertion box.

For a given application domain, in which we want to experiment with specific DM workflows on a given set of datasets, instances of the Data, DM-Model and DM-Report concepts, as well as those of the DM-Workflow, DM-Experiment and DM-Operator concepts, are instantiated in the respective DM experiment database (dmex-db). Each dmex-db is located at the lowest level of the dmop architecture, from which further meta-analysis can be carried out with the help of the dmop ontology. Figure 3.1 gives dmop's core concepts and their relations.
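As a minimal illustration of how these core concepts hang together, they could be sketched as plain data structures; all class, attribute and instance names below are ours for illustration, not dmop identifiers.

```python
from dataclasses import dataclass

# Hypothetical rendering of dmop's core concepts and their relations.
@dataclass
class DMAlgorithm:
    name: str
    addresses: str            # the DM-Task, e.g. predictive modeling

@dataclass
class DMOperator:
    name: str
    implements: DMAlgorithm   # a DM-Operator implements a DM-Algorithm

c45 = DMAlgorithm("C4.5", addresses="PredictiveModelling")
op = DMOperator("RM DecisionTree", implements=c45)  # hypothetical operator name
print(op.implements.addresses)  # -> PredictiveModelling
```

Following the implements link from an operator back to its algorithm is exactly what later allows reasoning at the algorithm rather than the operator level.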

3.4.2 Taxonomy of Classification Algorithms

In data mining or machine learning, classification modeling algorithms constitute one of the main classes of algorithms, in which the task is to learn a predictive model from an input space X to an output space Y. Given the plethora of existing classification modeling algorithms, providing a hierarchy of those algorithms is important in order to characterize them. Figure 3.2 shows a part of the concept hierarchy, or taxonomy, for classification algorithms of the dmop ontology. In this figure, the top concept ClassificationModellingAlgorithm is a direct subclass of the core concept DM-Algorithm.

Classification modeling algorithms are divided into three broad categories (Bishop and Nasrabadi, 2006). Generative methods approximate the class conditional distribution


Figure 3.2: dmop’s classification algorithm taxonomy.

P(x|y; Θ) and the class priors P(y; Θ), or the joint probability distribution P(x, y; Θ), by computing those values of the Θ parameters that optimize a given cost function, most often the likelihood of the data. Having done so, they use Bayes' theorem to compute the class posterior P(y|x; Θ). Discriminative methods such as logistic regression and k-nearest neighbors approximate directly the class posterior P(y|x; Θ) to determine class memberships. Discriminative functions learn directly a mapping function f(x) from input x onto class label y; most state-of-the-art machine learning methods, such as support vector classifiers (SVC), neural networks and decision trees, follow this approach.
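The posterior computation that the generative route relies on is just Bayes' theorem applied to the estimated class-conditionals and priors:

```latex
P(y \mid x; \Theta) \;=\;
\frac{P(x \mid y; \Theta)\, P(y; \Theta)}
     {\sum_{y'} P(x \mid y'; \Theta)\, P(y'; \Theta)}
```

Discriminative methods skip this step and model the left-hand side directly.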

Classification methods of the same algorithm family, i.e. which produce the same type of model structure, are grouped together to form the third level of the taxonomy, such as NaiveBayes, SupportVectorClassifier, RecursivePartitioning, etc. Below each algorithm family are the different variants that we can find in the literature. For instance, in the work of John and Langley (1995), we have the description of three different versions of the Naive Bayes algorithm, each of which has a specific approach to model probabilities on numeric attributes: the normal one, NaiveBayesNormal, which assumes a normal distribution for each numeric attribute; the kernelized one, NaiveBayesKernel, which uses a kernel density estimation on those numeric attributes; and the discrete one, NaiveBayesDiscretized, which makes use of a discretization approach to compute probabilities on numeric attributes. In addition, we have the multinomial Naive Bayes

Figure 3.3: dmop's characteristics of the C4.5 decision tree algorithm.

version, NaiveBayesMultiNomial, for text classification, in which word probabilities follow a multinomial distribution (McCallum and Nigam, 1998). In the same manner, we model the support vector classification algorithms (Cortes and Vapnik, 1995; Vapnik, 1998) with respect to their optimization strategies, either using a soft or hard margin, which gives the SoftMarginSVC and HardMarginSVC algorithm concepts, or with respect to the kernel they use: linear, which gives the LinearSVC algorithm concept, radial, polynomial, etc. For decision tree algorithms, also known as recursive partitioning algorithms, we have specific algorithm concepts such as CHAID (Kass, 1980), CART (Breiman, 2001), and C4.5 (Quinlan, 1986, 1993).

3.4.3 Characteristics of Classification Algorithms: C4.5

We will now see an example of classification algorithm modeling with dmop: the C4.5 decision tree algorithm (Quinlan, 1986, 1993), whose characteristics are shown in Figure 3.3. Decision tree algorithms recursively partition the training examples by finding at each stage of the tree the best feature split that minimizes a given cost function. In the case of C4.5, this algorithm uses information gain or information gain ratio as splitting


criterion to minimize the conditional class entropy. dmop describes these two learning components by the two taxonomic relations:

C4.5 ⊑ ∀hasFeatureTestEval.InfoGain

C4.5 ⊑ ∀hasOptimizationProblem.MinCondClassEntropy

where ⊑ defines the concept inclusion axiom in the description logic language (Baader, Calvanese, McGuinness, Nardi, and Patel-Schneider, 2003).

There is an additional post-processing step in C4.5, in which leaves are pruned according to their error rate on a validation set to avoid over-fitting, referred to as error-based pruning (Breiman, 2001). Finally, to predict class labels on new instances, C4.5 uses a majority vote rule over the leaf in which those instances fall. dmop describes these two properties with the following taxonomic relations:

C4.5 ⊑ ∀hasComponentStep.ErrorBasedPruning

C4.5 ⊑ ∀hasLeafPredictor.MajorityVoteClassifier

In addition to those characteristics, which describe the structure of the C4.5 algorithm, dmop also provides qualitative algorithm characteristics following the hasQuality property of the dolce upper ontology (Keet, Lawrynowicz, d'Amato, and Hilario, 2013; Keet, Lawrynowicz, d'Amato, Kalousis, Nguyen, Palma, Stevens, and Hilario, 2015). See for example the taxonomic relations:

C4.5 ⊑ ∀hasQuality.HandleContinuousFeature

C4.5 ⊑ ∀hasQuality.TolerateHighDimensionality

in Figure 3.3. These qualitative attributes describe the capacities of the algorithms with respect to the input dataset: whether they can handle continuous features, whether they are tolerant to high-dimensional datasets, whether they tolerate missing values, etc. They also describe learning capabilities, such as tolerating irrelevant features or having a high-bias or a high-variance profile (Domingos, 2000; Kohavi, Wolpert, et al., 1996).
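To make the use of such axioms concrete, the characteristics above could be stored as simple property-filler pairs and queried; the key-value layout and the function name below are our own sketch, not dmop's API.

```python
# Each entry records a universal restriction C ⊑ ∀p.F as ((C, p), F);
# multi-valued properties map to a set of fillers.
KB = {
    ("C4.5", "hasFeatureTestEval"): "InfoGain",
    ("C4.5", "hasOptimizationProblem"): "MinCondClassEntropy",
    ("C4.5", "hasComponentStep"): "ErrorBasedPruning",
    ("C4.5", "hasLeafPredictor"): "MajorityVoteClassifier",
    ("C4.5", "hasQuality"): {"HandleContinuousFeature",
                             "TolerateHighDimensionality"},
}

def filler(algorithm, prop):
    """Return the filler concept(s) of the restriction on `prop`, if any."""
    return KB.get((algorithm, prop))

print(filler("C4.5", "hasLeafPredictor"))  # -> MajorityVoteClassifier
```

Queries like this one are what later lets the pattern miner test whether a workflow operator implements an algorithm with a given characteristic.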

3.4.4 Characteristics of Feature Selection Algorithms

Another important class of DM algorithms is that of feature selection (FS) algorithms. Feature selection is a particular case of dimensionality reduction in which the feature dimensionality is reduced by eliminating those features that are irrelevant or redundant according to

Figure 3.4: dmop's characteristics of feature selection algorithms.

some criterion. For instance, in the case of classification, the selection criterion is the discriminative power of a feature with respect to the class labels. Feature selection is thus a combinatorial search inside the feature space, where at each step one or more features are evaluated until the best ones are found. FS algorithms can be characterized along four dimensions, which we will briefly describe now; see Figure 3.4.

The first dimension, interactsWithLearnerAs, describes how they are coupled with the learning algorithm. In filter methods such as Correlation-based Feature Selection (CFS) (Hall, 1998) or ReliefF (Kononenko, 1994), feature selection is done separately from the learning method as a pre-processing step. The quality of the selected feature subsets is then evaluated by the learning procedure itself. In wrapper methods, feature selection is wrapped around the learning procedure, where the estimated performance of the learned model is used as the selection criterion. In embedded methods such as SVM-RFE (Guyon, Gunn, Nikravesh, and Zadeh, 2006) or decision trees, feature selection is directly encoded in the learning procedure.

The second dimension, hasOptimizationStrategy, describes the (discrete) optimization strategy that FS algorithms use to search in the discrete space of feature subsets. It is determined by five properties: its search coverage (global, local), its direction (forward, backward), its choice policy (irrevocable, tentative), the amount of state knowledge that guides the search (blind, informed), and its level of uncertainty (deterministic, stochastic).


For instance, C4.5 uses a global greedy (blind) forward selection scheme, while SVM-RFE uses a global greedy backward elimination scheme.
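As a sketch, the five properties could be captured in one record per algorithm; the field names are ours, the values come from the text, and the fields the text leaves unspecified stay None rather than being guessed.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class OptimizationStrategy:
    coverage: str                       # "Global" | "Local"
    direction: str                      # "Forward" | "Backward"
    guidance: str                       # "Blind" | "Informed"
    choice_policy: Optional[str] = None  # "Irrevocable" | "Tentative"
    uncertainty: Optional[str] = None    # "Deterministic" | "Stochastic"

# The two example strategies above (unstated properties left as None):
c45_strategy = OptimizationStrategy("Global", "Forward", "Blind")
svm_rfe_strategy = OptimizationStrategy("Global", "Backward", "Blind")
print(c45_strategy.direction, svm_rfe_strategy.direction)  # -> Forward Backward
```

Making the record frozen keeps strategy descriptions hashable, so they can serve directly as pattern features later on.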

The third dimension, hasFeatureEvaluator, determines the way FS algorithms evaluate/weight the features found at each step of the optimization procedure. The evaluation can be targeted towards a single feature or a feature subset. Its context can be either univariate, as in InformationGain, or multivariate, as in SVM-RFE. ReliefF is the only FS algorithm whose target is a single feature with a multivariate context. Finally, we have the evaluation function, which gives the selection criterion under which (subsets of) features are evaluated.

Finally, the fourth dimension, hasDecisionStrategy, concerns the decision to select the final feature subset. This can be done either with a statistical test, as in χ2, or using a simple thresholding cut-off function over the feature weights, where the threshold is given as a parameter, or by keeping the top-k features with the highest weights.

In the next section, we will see how to mine feature selection and classification workflows, from which we will extract frequent patterns defined over the above algorithm characteristics. These generalized frequent patterns will describe different combinations of algorithm characteristics appearing frequently in a set of training workflows, which we will use in the rest of the thesis to characterize the workflow space and to address our different meta-mining tasks.

3.5 A Formal Definition of Data Mining Workflow

We will now give a formal definition of a DM workflow and of how we represent it. DM workflows are directed acyclic typed graphs (DAGs), in which nodes correspond to operators and edges between nodes to data input/output objects. In fact, they are hierarchical DAGs, since they can have dominating nodes/operators that themselves contain sub-workflows.

A typical example is the cross-validation operator, whose control flow is given by the execution in parallel of training and testing sub-workflows, or a complex operator of the boosting type. More formally, let:

• O be the set of all available operators that can appear in a DM workflow, e.g. classification operators such as J48, SVMs, etc. O also includes dominating operators, which are defined by one or more sub-workflows they dominate, e.g. cross-validation or model combination operators such as boosting, etc.

• E be the set of all available data types that can appear in a DM workflow, namely the data types of the various I/O objects that can appear in a DM workflow, such as models, datasets, attributes, etc.

Formally, an operator o ∈ O is defined by its name, given through a labelling function λ(o), by the data types e ∈ E of its inputs and outputs, and by its direct sub-workflows if o is a dominating operator. The graph structure of a DM workflow is then a pair (O, E), which also contains all sub-workflows, if any, where O ⊂ O is the set of vertices or nodes, corresponding to all the operators used in this DM workflow and its sub-workflow(s), and E ⊂ E is the set of pairs of nodes (oi, oj), called directed edges, which correspond to the data types of the input/output objects passed from operator oi to operator oj.

Following the above definition, we can build a vector representation of a DM workflow by considering the topological order of its operators; the topological sort, or order, of a DM workflow is a permutation of the vertices of its graph structure such that an edge (oi, oj) implies that oi appears before oj, i.e. it is a complete ordering of the nodes of a directed acyclic graph, given by the node sequence:

wl = [o1, ..., ol]   (3.1)

where the subscript in wl denotes the length l (i.e. the number of operators) of the topological sort. If the topological sort is not unique, then it is always possible to obtain a unique sort using second-order information, such as the lexicographic order of the vertex labels.
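A lexicographic tie-break of this kind can be sketched with Kahn's algorithm and a min-heap over the ready nodes; the operator labels in the example are made up for illustration.

```python
import heapq

def topological_sort(nodes, edges):
    """Topological sort of a DAG; ties among ready nodes are broken by
    the lexicographic order of their labels, so the result is unique."""
    succs = {n: [] for n in nodes}
    indeg = {n: 0 for n in nodes}
    for src, dst in edges:
        succs[src].append(dst)
        indeg[dst] += 1
    ready = [n for n in nodes if indeg[n] == 0]
    heapq.heapify(ready)              # min-heap gives lexicographic tie-break
    order = []
    while ready:
        n = heapq.heappop(ready)
        order.append(n)
        for m in succs[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                heapq.heappush(ready, m)
    if len(order) != len(nodes):
        raise ValueError("graph has a cycle")
    return order

# Hypothetical operator labels for a linear feature-selection workflow:
ops = ["Retrieve", "WeightByInfoGain", "SelectByWeights", "NaiveBayes"]
deps = [("Retrieve", "WeightByInfoGain"),
        ("WeightByInfoGain", "SelectByWeights"),
        ("SelectByWeights", "NaiveBayes")]
print(topological_sort(ops, deps))
# -> ['Retrieve', 'WeightByInfoGain', 'SelectByWeights', 'NaiveBayes']
```

With the heap in place, two workflows with the same graph structure always map to the same node sequence, which is what the vector representation requires.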

The topological sort of a DM workflow can be structurally represented with a rooted, labelled and ordered tree (Bringmann, 2004; Zaki, 2005), by doing a depth-first search over its graph structure, where the maximum depth is given by recursively expanding the sub-workflows of the dominating operators. Thus the topological sort of a workflow, or its tree representation, is a reduction of the original directed acyclic graph in which the nodes and edges have been fully ordered.
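A sketch of this reduction: a node is a (label, children) pair, a dominating operator's children are the expanded nodes of its sub-workflows in topological order, and a depth-first (pre-order) traversal yields the fully ordered node sequence. All labels are illustrative.

```python
def preorder(node):
    """Depth-first (pre-order) walk of a (label, children) parse tree."""
    label, children = node
    yield label
    for child in children:
        yield from preorder(child)

# Hypothetical nesting of training/testing steps under a dominating
# cross-validation operator:
xval = ("X-Validation",
        [("WeightByInformationGain", []),
         ("SelectByWeights", []),
         ("NaiveBayes", []),
         ("ApplyModel", []),
         ("Performance", [])])
print(list(preorder(xval)))
# -> ['X-Validation', 'WeightByInformationGain', 'SelectByWeights',
#     'NaiveBayes', 'ApplyModel', 'Performance']
```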

An example of the hierarchical directed acyclic graph representing a RapidMiner DM workflow is given in Figure 3.5. The graph corresponds to a DM workflow that cross-validates a feature selection method followed by a classification model building step with the Naive Bayes classifier. RM X-Validation is a typical example of a dominating operator which is itself a workflow: it has two basic blocks, a training block, which can be any arbitrary workflow that receives as input a dataset and outputs a model, and a testing


Figure 3.5: Example of a DM workflow that does performance estimation of a combination of feature selection and classification.

block, which receives as input a model and a dataset and outputs a performance measure.

In particular, this specific cross-validation operator has a training block in which feature weights are computed by the RM WeightByInformationGain operator, after which a number of features is selected by the RM SelectByWeights operator. These steps are followed by the final model building, given by the RM NaiveBayes operator. The testing block is rather trivial, consisting of the RM ApplyModel operator followed by the RM Performance computation operator. The topological sort of this graph is given by the tree of Figure 3.6.

3.6 Data Mining Workflow Frequent Pattern Extraction

In this section, we will focus on the extraction of frequent patterns from DM workflows. As already explained, DM workflows are characterized by structures that appear and are reused very often. The simplest example was the chain of operators RM WeightByInformationGain and RM SelectByWeights, which, when they appear together, actually performs feature selection. Along the same lines, we can have complex workflows

Figure 3.6: Topological order of the DM workflow given in Figure 3.5.

that perform tasks such as model evaluation (we have already shown the example of the composite cross-validation operator) or model building with bagging, boosting, stacking, etc. What is important here is the combined use of the basic operators, which results in a specific higher-level meaning.

In addition, there might be interesting frequent structures that go beyond the concept of complex workflows whose structure is known a priori. A typical example could be frequent structures of the form: feature selection with information gain is often used together with decision trees. Such rules are not known a priori to novice data miners, and we can only discover them by analyzing DM workflows produced by experienced data miners. Finally, we can use such frequent structures as higher-level descriptors of workflows in the process of meta-mining for the discovery of success patterns.

So far we have been discussing structural components, sub-workflows, under the implicit assumption that these are defined over specific operators. However, exactly the same sub-workflow can be defined over different operators, or over families of operators. For example, model building with bagging can be defined for any classification algorithm. If we are not able to account for such generalizations, the frequent patterns that we discover will always be defined over specific operators. It is clear that we will be able to discover patterns with much stronger support if we include in the frequent pattern search procedure the means to also search over generalizations. This is one of the roles of the dmop ontology. If we examine again Figures 3.1 and 3.2, we see that we can easily define generalizations over operators by following the executes and implements relations in the first figure, generalizations which will correspond to the different families of algorithms defined in the second figure.
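The effect of generalization on support can be sketched in a few lines; the operator names, family names, and the taxonomy mapping below are all hypothetical stand-ins for what dmop's implements relation and algorithm hierarchy provide.

```python
# Hypothetical operator-to-algorithm-family taxonomy:
FAMILY = {
    "RM NaiveBayes": "BayesianClassifier",
    "W-NaiveBayesSimple": "BayesianClassifier",
}

def generalize(workflow):
    """Lift each ground operator to its algorithm family, if known."""
    return [FAMILY.get(op, op) for op in workflow]

workflows = [
    ["RM SelectByWeights", "RM NaiveBayes"],
    ["RM SelectByWeights", "W-NaiveBayesSimple"],
]

# No ground pattern covers both workflows, but the generalized one does:
pattern = ["RM SelectByWeights", "BayesianClassifier"]
support = sum(generalize(w) == pattern for w in workflows)
print(support)  # -> 2
```

Here support counting uses simple sequence equality; a real pattern miner would test sub-tree or sub-sequence containment, but the gain from generalization is the same.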

In the following sections, we will describe how we search for frequent patterns over DM workflows that can also be defined over generalizations of operators, and not only over grounded operators. We will thus be looking for generalized workflow patterns; similar


Figure 3.7: Augmented parse tree of the DM workflow originally given in Figure 3.6. Thin edges depict workflow decomposition, double lines depict dmop's concept subsumption, and bold lines depict dmop's implements relation.

work has been done in frequent sequence mining; see for example the generalized sequence pattern algorithm of Srikant and Agrawal (1996).

3.6.1 Workflow Representation for Generalized Frequent Pattern Mining

As discussed in the last section, DM workflows are hierarchical directed acyclic graphs, where the hierarchy is a result of the presence of composite operators. In addition, we will see in Section 6.4 that HTN plans are best represented in the form of a tree, which gives the task-to-method-to-operator decomposition. In both cases, it is more natural to look for frequent patterns under a tree representation. We already saw that we can represent a DM workflow as a parse tree, which delivers the unique topological order of the nodes of the workflow, i.e. the order of execution of the operators, with a depth-first search, and the hierarchy of the composite operators as the tree structure; see Figures 3.5 and 3.6 for an example of a DM workflow under a graph representation and its parse tree.

Augmented Parse Trees

Given the parse tree representation of a workflow, we will now show how we can augment it with the help of the dmop ontology, in view of deriving frequent patterns over generalizations of the components of the workflows. The generalizations that we will be using are

given by the concepts, relations and subsumptions of the dmop ontology. Starting from the DM-Operator concept, we have seen in Section 3.4.1 that an operator o ∈ O implements some algorithm a ∈ A (Figure 3.1). In addition, we have seen in Section 3.4.2 that dmop defines a very refined taxonomy over the algorithms, a snapshot of which we have already seen in Figure 3.2.

We also define a distance measure between two concepts C and D, which is related to the terminological inclusion axiom, C ⊑ D, as the length of the shortest path between

