Chapter 3 Data Mining Workflow Pattern Analysis 15

3.4 The Data Mining Optimization Ontology

3.4.4 Characteristics of Feature Selection Algorithms

Feature selection (FS) algorithms are another important class of DM algorithms. Feature selection is a particular case of dimensionality reduction in which the feature dimensionality is reduced by eliminating those features that are irrelevant or redundant according to some criterion. For instance, in the case of classification, the selection criterion is the discriminative power of a feature with respect to the class labels. It is thus a combinatorial search inside the feature space where, at each step, one or more features are evaluated until the best ones are found. FS algorithms can be characterized along four dimensions, which we briefly describe now (see Figure 3.4).

Figure 3.4: dmop’s characteristics of feature selection algorithms. The diagram shows the concepts FeatureSelectionAlgorithm and FeatureWeightingAlgorithm together with the properties interactsWithLearnerAs {Filter, Wrapper, Embedded}; hasOptimizationStrategy (DiscreteOptimizationStrategy, RelaxationStrategy, SearchStrategy) with hasCoverage {Global, Local}, hasSearchDirection {Forward, Backward, ...}, hasChoicePolicy {Irrevocable, Tentative}, hasSearchGuidance {Blind, Informed} and hasUncertaintyLevel {Deterministic, Stochastic}; hasFeatureEvaluator with hasEvaluationTarget {SingleFeature, FeatureSubset}, hasEvaluationContext {Univariate, Multivariate} and hasEvaluationFunction {InfoGain, Chi2, CFS-Merit, Consistency, ...}; and hasDecisionStrategy (DecisionStrategy: StatisticalTest, DecisionRule).

The first dimension, interactsWithLearnerAs, describes how FS algorithms are coupled with the learning algorithm. In filter methods, such as Correlation Feature Selection (CFS) (Hall, 1998) or ReliefF (Kononenko, 1994), feature selection is done separately from the learning method as a pre-processing step. The quality of the selected feature subsets is then evaluated by the learning procedure itself. In wrapper methods, feature selection is wrapped around the learning procedure, and the estimated performance of the learned model is used as the selection criterion. In embedded methods, such as SVM-RFE (Guyon, Gunn, Nikravesh, and Zadeh, 2006) or decision trees, feature selection is directly encoded in the learning procedure.

The second dimension, hasOptimizationStrategy, describes the (discrete) optimization strategy that FS algorithms use to search in the discrete space of feature subsets. It is determined by five properties: its search coverage (global, local), its direction (forward, backward), its choice policy (irrevocable, tentative), the amount of state knowledge that guides search (blind, informed), and its level of uncertainty (deterministic, stochastic).


For instance, C4.5 uses a global greedy (blind) forward selection scheme, while SVM-RFE uses a global greedy backward elimination scheme.
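To make these search properties concrete, the following minimal Python sketch implements a greedy forward search with an irrevocable choice policy and a deterministic outcome. It is only an illustration: the function name `greedy_forward_selection`, the `score` callback, the `RELEVANCE` table and the `toy_merit` criterion are assumptions introduced here and are not part of dmop or of any specific FS algorithm.

```python
from typing import Callable, FrozenSet, Iterable, List


def greedy_forward_selection(
    features: Iterable[str],
    score: Callable[[FrozenSet[str]], float],
) -> List[str]:
    """Greedy forward search over feature subsets.

    Direction: forward (start from the empty set and add features).
    Choice policy: irrevocable (an added feature is never removed).
    Uncertainty: deterministic (ties are broken by feature name).
    """
    remaining = sorted(features)
    selected: List[str] = []
    best = score(frozenset())
    while remaining:
        # Evaluate every one-feature extension of the current subset.
        candidates = [(score(frozenset(selected + [f])), f) for f in remaining]
        top_score = max(s for s, _ in candidates)
        if top_score <= best:
            break  # no extension improves the selection criterion
        # Take the first candidate (in name order) that reaches the top score.
        top_feature = next(f for s, f in candidates if s == top_score)
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected


# Hypothetical relevance values, used only to build a toy criterion.
RELEVANCE = {"a": 0.9, "b": 0.5, "c": 0.05}


def toy_merit(subset: FrozenSet[str]) -> float:
    """Toy criterion: reward relevant features, penalise subset size."""
    return sum(RELEVANCE[f] for f in subset) - 0.1 * len(subset)


print(greedy_forward_selection(RELEVANCE, toy_merit))
```

A backward elimination scheme would start from the full feature set and remove one feature per step instead; a tentative choice policy would additionally allow earlier choices to be undone.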

The third dimension, hasFeatureEvaluator, determines the way that FS algorithms evaluate or weight the features found at each step of the optimization procedure. The evaluation can be targeted towards a single feature or a feature subset. Its context can be either univariate, as in InformationGain, or multivariate, as in SVM-RFE. ReliefF is the only FS algorithm whose target is a single feature with a multivariate context. Finally, we have the evaluation function, which gives the selection criterion under which (subsets of) features are evaluated.

Finally, the fourth dimension, hasDecisionStrategy, concerns the final decision that selects the feature subset. This can be done either with a statistical test, as in χ2, or with a simple thresholding cut-off function over the feature weights, where the threshold is given as a parameter, or by keeping the top-k features with the highest weights.
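The two weight-based decision rules mentioned above can be sketched in a few lines of Python. The function names `select_by_threshold` and `select_top_k` and the example weights are hypothetical; they only illustrate the idea of a decision rule applied to the output of a feature weighting step.

```python
from typing import Dict, List


def select_by_threshold(weights: Dict[str, float], threshold: float) -> List[str]:
    """Keep every feature whose weight reaches the user-given threshold."""
    return sorted(f for f, w in weights.items() if w >= threshold)


def select_top_k(weights: Dict[str, float], k: int) -> List[str]:
    """Keep the k features with the highest weights (ties broken by name)."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [f for f, _ in ranked[:k]]


# Hypothetical feature weights, e.g. the output of a feature weighting algorithm.
weights = {"age": 0.42, "income": 0.31, "zip": 0.02, "height": 0.10}

print(select_by_threshold(weights, threshold=0.1))
print(select_top_k(weights, k=2))
```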

In the next section, we will see how to mine feature selection and classification workflows, where we will extract frequent patterns defined over the above algorithm characteristics. These generalized frequent patterns will describe different combinations of algorithm characteristics appearing frequently in a set of training workflows, which we will use in the rest of the thesis to characterize the workflow space and to address our different meta-mining tasks.

3.5 A Formal Definition of Data Mining Workflow

We will now give a formal definition of a DM workflow and how we represent it. DM workflows are directed acyclic typed graphs (DAGs), in which nodes correspond to operators and edges between nodes to data input/output objects. In fact, they are hierarchical DAGs, since they can have dominating nodes/operators that themselves contain sub-workflows. A typical example is the cross-validation operator, whose control flow is given by the parallel execution of training and testing sub-workflows, or a complex operator of the boosting type. More formally, let:

• 𝒪 be the set of all available operators that can appear in a DM workflow, e.g. classification operators such as J48, SVMs, etc. 𝒪 also includes dominating operators, which are defined by one or more sub-workflows they dominate, e.g. cross-validation or model combination operators such as boosting, etc.

• ℰ be the set of all available data types that can appear in a DM workflow, namely the data types of the various I/O objects that can appear in a DM workflow, such as models, datasets, attributes, etc.

Formally, an operator o ∈ 𝒪 is defined by its name through a labelling function λ(o), the data types e ∈ ℰ of its inputs and outputs, and its direct sub-workflows if o is a dominating operator. The graph structure of a DM workflow is then a pair (O, E), which also contains all sub-workflows, if any, where O ⊂ 𝒪 is the set of vertices (nodes), corresponding to all the operators used in this DM workflow and its sub-workflow(s), and E ⊂ ℰ is the set of pairs of nodes (o_i, o_j), called directed edges, that correspond to the data types of the output/input objects passed from operator o_i to operator o_j.

Following the above definition, we can build a vector representation of a DM workflow by considering the topological order of its operators. The topological sort or order of a DM workflow is a permutation of the vertices of its graph structure such that an edge (o_i, o_j) implies that o_i appears before o_j, i.e. it is a complete ordering of the nodes of a directed acyclic graph, given by the node sequence:

w_l = [o_1, ..., o_l]    (3.1)

where the subscript in w_l denotes the length l (i.e. number of operators) of the topological sort. If the topological sort is not unique, it is always possible to obtain a unique sort using second-order information such as the lexicographic order of the vertex labels.
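As an illustration of Equation (3.1), the sketch below computes a topological sort with Kahn's algorithm and breaks ties by the lexicographic order of the vertex labels, so that the result is unique. The function name `topological_sort` and the flattened edge list are assumptions made for this example; in particular, the example ignores the hierarchy introduced by dominating operators.

```python
import heapq
from typing import Dict, Iterable, List, Tuple


def topological_sort(nodes: Iterable[str], edges: Iterable[Tuple[str, str]]) -> List[str]:
    """Kahn's algorithm; ties are broken by lexicographic order of the labels."""
    successors: Dict[str, List[str]] = {n: [] for n in nodes}
    in_degree: Dict[str, int] = {n: 0 for n in successors}
    for src, dst in edges:
        successors[src].append(dst)
        in_degree[dst] += 1

    # Min-heap of operators whose inputs are all available.
    ready = [n for n, d in in_degree.items() if d == 0]
    heapq.heapify(ready)
    order: List[str] = []
    while ready:
        node = heapq.heappop(ready)  # smallest label first
        order.append(node)
        for nxt in successors[node]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                heapq.heappush(ready, nxt)
    if len(order) != len(successors):
        raise ValueError("graph contains a cycle")
    return order


# Hypothetical flat workflow: weight features, select them, then train a model.
operators = ["Retrieve", "WeightByInformationGain", "SelectByWeights", "NaiveBayes"]
data_flow = [
    ("Retrieve", "WeightByInformationGain"),
    ("Retrieve", "SelectByWeights"),
    ("WeightByInformationGain", "SelectByWeights"),
    ("SelectByWeights", "NaiveBayes"),
]
print(topological_sort(operators, data_flow))
```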

The topological sort of a DM workflow can be structurally represented with a rooted, labelled and ordered tree (Bringmann, 2004; Zaki, 2005), by doing a depth-first search over its graph structure where the maximum depth is given by expanding recursively the sub-workflows of the dominating operators. Thus the topological sort of a workflow or its tree representation is a reduction of the original directed acyclic graph where the nodes and edges have been fully ordered.

An example of the hierarchical directed acyclic graph representing a RapidMiner DM workflow is given in Figure 3.5. The graph corresponds to a DM workflow that cross-validates a feature selection method followed by a classification model building step with the Naive Bayes classifier. RM X-Validation is a typical example of a dominating operator which is itself a workflow: it has two basic blocks, a training block, which can be any arbitrary workflow that receives a dataset as input and outputs a model, and a testing block, which receives a model and a dataset as input and outputs a performance measure.

Figure 3.5: Example of a DM workflow that does performance estimation of a combination of feature selection and classification.

In particular, this specific cross-validation operator has a training block in which feature weights are computed by the RM WeightByInformationGain operator, after which a number of features are selected by the RM SelectByWeights operator. These steps are followed by the final model building, given by the RM NaiveBayes operator. The testing block is rather trivial, consisting of the RM ApplyModel operator followed by the RM Performance computation operator. The topological sort of this graph is given by the tree of Figure 3.6.

3.6 Data Mining Workflow Frequent Pattern Extraction

In this section, we will focus on the extraction of frequent patterns from DM workflows. As already explained, DM workflows are characterized by structures that appear and are reused very often. The simplest example was that of the chain of operators RM WeightByInformationGain and RM SelectByWeights which, when they appear together, actually perform feature selection. Along the same lines, we can have complex workflows that perform tasks such as model evaluation (we have already shown the example of the composite cross-validation operator) or model building with bagging, boosting, stacking, etc. What is important here is the combined use of the basic operators, which results in a specific higher-level meaning.

Figure 3.6: Topological order of the DM workflow given in Figure 3.5.

In addition, there might be interesting frequent structures that go beyond the concept of complex workflows whose structure is known a priori. A typical example could be frequent structures of the form: feature selection with information gain is often used together with decision trees. Such rules are not known a priori to novice data miners, and we can only discover them by analyzing DM workflows produced by experienced data miners. Finally, we can use such frequent structures as higher-level descriptors of workflows in the process of meta-mining for the discovery of success patterns.

So far we have been discussing structural components, i.e. sub-workflows, under the implicit assumption that these are defined over specific operators. However, exactly the same sub-workflow can be defined over different operators, or over families of operators. For example, model building with bagging can be defined for any classification algorithm. If we are not able to account for such generalizations, the frequent patterns that we discover will always be defined over specific operators. It is clear that we will be able to discover patterns with much stronger support if we include in the frequent pattern search procedure the means to also search over generalizations. This is one of the roles of the dmop ontology. If we examine Figures 3.1 and 3.2 again, we see that we can easily define generalizations over operators by following the executes and implements relations in the first figure; these generalizations correspond to the different families of algorithms defined in the second figure.

In the following sections, we will describe how we search for frequent patterns over DM workflows that can also be defined over generalizations of operators and not only over grounded operators. We will thus be looking for generalized workflow patterns; similar work has been done in frequent sequence mining, see for example the generalized sequence pattern algorithm of Srikant and Agrawal (1996).

Figure 3.7: Augmented parse tree of the DM workflow originally given in Figure 3.6. Thin edges depict workflow decomposition, double lines depict dmop’s concept subsumption and bold lines depict dmop’s implements relation.

3.6.1 Workflow Representation for Generalized Frequent Pattern Mining

As discussed in the last section, DM workflows are hierarchical directed acyclic graphs, where the hierarchy results from the presence of composite operators. In addition, we will see in Section 6.4 that HTN plans are best represented in the form of a tree, which gives the task-to-method-to-operator decomposition. In both cases, it is more natural to look for frequent patterns under a tree representation. We already saw that we can represent a DM workflow as a parse tree, which delivers the unique topological order of the nodes of the workflow (i.e. the order of execution of the operators) through a depth-first search, and the hierarchy of the composite operators as the tree structure; see Figures 3.5 and 3.6 for an example of a DM workflow under a graph representation and its parse tree.

Augmented Parse Trees

Given the parse tree representation of a workflow, we will now show how we can augment it with the help of the dmop ontology in view of deriving frequent patterns over generalizations of the components of the workflows. The generalizations that we will be using are given by the concepts, relations and subsumptions of the dmop ontology. Starting from the DM-Operator concept, we have seen in Section 3.4.1 that an operator o ∈ 𝒪 implements some algorithm a ∈ 𝒜 (Figure 3.1). In addition, we have seen in Section 3.4.2 that dmop defines a very refined taxonomy over the algorithms, a snapshot of which we have already seen in Figure 3.2.

We also define a distance measure between two concepts C and D, which is related to the terminological inclusion axiom C ⊑ D, as the length of the shortest path between the two concepts. We will use this measure to order the subsumptions. Note that the subsumption order is not unique if we work with an inferred taxonomy, since in an inferred taxonomy a concept can have multiple ancestors. In this work, however, we assume that each concept has at most one subsumption order.

For instance, taking the implementation of the NaiveBayesNormal algorithm provided by the RM NaiveBayes operator, we have the following concept and role subsumption order:

RM NaiveBayes ⊑ ∀implements.NaiveBayesNormal
RM NaiveBayes ⊑ ∀implements.NaiveBayesAlgorithm
RM NaiveBayes ⊑ ∀implements.GenerativeAlgorithm

which reflects the taxonomic relations:

NaiveBayesNormal ⊑ NaiveBayesAlgorithm ⊑ GenerativeAlgorithm ⊑ ...

Given a parse tree of a DM workflow, we can derive its augmented parse tree using dmop’s concept and role subsumptions given above. An augmented parse tree is derived from an original parse tree T by adding, for each node v ∈ T, the concept subsumption order between v and its parent π(v). For instance, the augmented parse tree of Figure 3.6 is given in Figure 3.7, where for each v of T we add the algorithm ancestors of v. Note that not all nodes v have a concept subsumption in the dmop ontology. Mining over augmented parse trees will allow us to capture workflow structures that are not limited to the use of ground operators, making it possible to detect abstract workflow patterns which would otherwise go unnoticed simply because they do not have strong support.
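The augmentation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the `Node` class, the `implements` and `parent_of` dictionaries and the helper names are hypothetical, the taxonomy fragment is the one given above for RM NaiveBayes, and each concept is assumed to have a single parent, in line with the assumption of at most one subsumption order per concept.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def ancestors(concept: str, parent_of: Dict[str, str]) -> List[str]:
    """Chain from `concept` up to the taxonomy root, most specific first."""
    chain = [concept]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain


def augment(node: Node, implements: Dict[str, str], parent_of: Dict[str, str]) -> Node:
    """Return a copy of the tree in which every operator with a known
    algorithm is placed below the chain of its algorithm ancestors."""
    new_node = Node(node.label, [augment(c, implements, parent_of) for c in node.children])
    if node.label not in implements:
        return new_node  # no concept subsumption known for this node
    for concept in ancestors(implements[node.label], parent_of):
        new_node = Node(concept, [new_node])  # most general concept ends on top
    return new_node


def path_labels(node: Node) -> List[str]:
    """Labels along the leftmost root-to-leaf path (for display only)."""
    labels = [node.label]
    while node.children:
        node = node.children[0]
        labels.append(node.label)
    return labels


# Taxonomy fragment taken from the subsumptions given in the text.
parent_of = {"NaiveBayesNormal": "NaiveBayesAlgorithm",
             "NaiveBayesAlgorithm": "GenerativeAlgorithm"}
implements = {"RM NaiveBayes": "NaiveBayesNormal"}

tree = Node("RM X-Validation", [Node("RM NaiveBayes")])
print(path_labels(augment(tree, implements, parent_of)))
```

Under the same single-parent assumption, the distance between a concept and one of its ancestors is simply the position of that ancestor in the list returned by `ancestors`.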

Parse Trees Rewriting

Figure 3.8: Rewritten parse tree of the augmented parse tree given in Figure 3.7.

An augmented parse tree is derived by adding concept subsumptions given by the dmop ontology. However, we would also like to be able to express composite operator and composite algorithm definitions and include them in the rewriting of the parse trees. For the moment, dmop offers the possibility to define modeling algorithms that must be preceded or followed by other specific algorithms using object properties, e.g.:

ModelingAlgorithm ⊑ ∃hasPreProcessor.DataProcessingAlgorithm
ModelingAlgorithm ⊑ ∃hasPostProcessor.ModelProcessingAlgorithm

A typical example of the second property is the C4.5 modeling algorithm, which first constructs an unpruned decision tree and is then followed by a ModelProcessingAlgorithm that does pruning. We would like to be able to define more complex rules using concept equivalence, such as the following:

The first rule states that a FeatureWeightingAlgorithm followed by some DecisionRule is equivalent to a FeatureSelectionAlgorithm, i.e. the feature weights learned by the FeatureWeightingAlgorithm are used to select features based on some user decision, such as a TopKRule. In fact, the rule could be made much more explicit and state that the output of the FeatureWeightingAlgorithm should be passed as input to the DecisionRule; this would be the appropriate definition if we were using the full graph representation, in which the data flow is also described.

The second and third rules are specializations of the first over the classes of univariate and multivariate feature weighting and feature selection. Such ontological statements should also be taken into account when we process the parse tree of a workflow. This concept equivalence, when applied to the augmented parse tree of Figure 3.7, results in inserting the left-hand side of the equivalence, FeatureSelectionAlgorithm, between the ancestors of the RM WeightByInformationGain operator, and moving the RM DecisionRule operator instance, RM SelectByWeights, to the right side of the new node and at the same level as the FeatureWeightingAlgorithm concept, producing the rewritten augmented tree of Figure 3.8.

We use the PρLog rule-based system (Dundua et al., 2010; Kutsia, 2006; Marin and Kutsia, 2006), which allows us to define rewriting rules for trees with context and sequence variables. It is this representation of the parse trees that we will use to mine for frequent patterns.

3.6.2 Tree Pattern Mining

We will use the tree miner of Zaki (2005) to search for frequent trees over the augmented tree representation of our DM workflows. A key concept of the algorithm is the embedded tree. A tree t′ is embedded in a tree t, denoted as t′ ⪯_e t, if and only if there exists a mapping ϕ: O_t′ → O_t such that

∀u, v ∈ O_t′: λ(u) = λ(ϕ(u))
    ∧ u ≺ v ⇐ ϕ(u) ≺ ϕ(v)
    ∧ π(u) = v ⇔ π(ϕ(u)) = ϕ(v)

This subtree definition has the property that the order of the children of a node is preserved, as well as the transitive closure of its parents (ancestors). This is a less restrictive definition than the induced subtree definition and, as such, embedded subtrees are able to capture patterns "hidden" or embedded deep within large trees, which might be missed by the induced subtree definition (Zaki, 2005).
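The following brute-force Python sketch illustrates the embedded subtree relation as described in the paragraph above: the mapping must preserve labels, the preorder (left-to-right) order of the nodes, and the ancestor relation. It is a naive backtracking check meant for small trees only and is not the tree miner algorithm of Zaki (2005); the tree encoding and the function names are assumptions made for this example.

```python
from typing import List, Optional, Tuple

# A tree is a list of (label, parent_index) pairs in preorder; the root has parent None.
Tree = List[Tuple[str, Optional[int]]]


def is_ancestor(tree: Tree, a: int, d: int) -> bool:
    """True if node a is a proper ancestor of node d."""
    p = tree[d][1]
    while p is not None:
        if p == a:
            return True
        p = tree[p][1]
    return False


def is_embedded(pattern: Tree, target: Tree) -> bool:
    """Search for a label-, order- and ancestor-preserving mapping."""

    def extend(mapping: List[int]) -> bool:
        u = len(mapping)  # next pattern node, in preorder
        if u == len(pattern):
            return True
        start = mapping[-1] + 1 if mapping else 0  # keep preorder positions increasing
        for x in range(start, len(target)):
            if target[x][0] != pattern[u][0]:
                continue
            # Ancestor relations must agree with every earlier pattern node.
            if all(
                is_ancestor(pattern, w, u) == is_ancestor(target, mapping[w], x)
                for w in range(u)
            ):
                if extend(mapping + [x]):
                    return True
        return False

    return extend([])


# Target tree A(B(C), D) and a pattern A(C, D) that skips the middle node B.
target = [("A", None), ("B", 0), ("C", 1), ("D", 0)]
pattern = [("A", None), ("C", 0), ("D", 0)]
print(is_embedded(pattern, target))
```

In this example the pattern is not an induced subtree of the target, because C is not a direct child of A, but it is an embedded one, because A is an ancestor of C.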

Figure 3.9: Embedded subtree (d) from trees (a), (b) and (c).

For instance, in Figure 3.9, given three trees T1, T2 and T3 and a support of 100%, the resulting embedded subtree is shown in Figure 3.9(d). This pattern has skipped the middle node in each tree, showing the common ordered structure within the three trees. This is what we want from our frequent pattern extractor, since it keeps the total order of the parse trees and their common structure.

Key Definitions

Given a database (forest) D of trees, the tree miner algorithm will produce a set 𝒫 of embedded subtrees (patterns). For a given tree T ∈ D and a pattern S ∈ 𝒫, if S ⪯_e T,

Two subtrees P, Q ∈ 𝒫 are said to be equivalent if they share the same set of occurrences, i.e., sup(P) = sup(Q). We denote such equivalence by P ≡ Q, and the equivalence class for P is defined by [P] = {Q ∈ 𝒫 : P ≡ Q} (Arimura, 2008).

Definition (Maximal subtree)

Given an equivalence class [P], a subtree P ∈ [P] is said to be maximal if and only if there exists no strictly more specific subtree Q ∈ [P] such that P ⊏ Q.

An Example

We will demonstrate the extraction of frequent tree patterns from workflows using the knowledge encoded in dmop with a simple scenario, in which we have four DM workflows that use cross-validation to evaluate the performance of feature selection and classification with different algorithms. More precisely, the four workflows are:

a) feature selection based on Information Gain and classification with NaiveBayes

b) feature selection based on ReliefF and classification with C4.5

c) feature selection with CFS and classification with C4.5

d) feature selection using the wrapper approach, in which the search in the feature space is guided by the performance of NaiveBayes, and classification with NaiveBayes.

Their parse trees are given in Figure 3.10. Workflow a) performs univariate feature

