

3.6 Data Mining Workflow Frequent Pattern Extraction

3.6.2 Tree Pattern Mining

We will use the tree miner of Zaki (2005) to search for frequent trees over the augmented tree representation of our DM workflows. A key concept of the algorithm is the embedded tree. A tree t is embedded in a tree t′, denoted as t ⊑e t′, if and only if there exists a mapping ϕ : O_t → O_t′ such that

∀u, v ∈ O_t : λ(u) = λ(ϕ(u)) ∧ (u ≺ v ⇔ ϕ(u) ≺ ϕ(v)) ∧ (π(u) = v ⇔ ϕ(v) ∈ π⁺(ϕ(u)))

where O_t is the node set of t, λ the node labeling function, ≺ the left-to-right node order, π the parent function and π⁺ its transitive closure (the ancestors).

This subtree definition has the properties that the order of the children of a node is preserved, as well as the transitive closure of its parents (ancestors). This is a less restricted definition than the induced subtree definition, and as such, embedded subtrees are able to extract patterns "hidden" or embedded deep within large trees which might be missed by the induced subtree definition (Zaki, 2005).


Figure 3.9: Embedded subtree (d) from trees (a), (b) and (c).

For instance, in Figure 3.9, given three trees T1, T2 and T3, and a support of 100%, the resulting embedded subtree is shown in Figure 3.9(d). This pattern has skipped the middle node in each tree, exposing the common ordered structure within the three trees.

This is exactly what we want from our frequent pattern extractor, since it preserves the total order of the parse trees and their common structure.
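To make the definition above concrete, here is a minimal Python sketch (the tree encoding and all names are ours, not the thesis's) that checks whether a given mapping ϕ satisfies the three conditions of the embedded subtree definition, assuming trees are given with parent pointers and preorder ranks:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    label: str                       # lambda(u): the node label
    parent: Optional["Node"] = None  # pi(u): the parent node
    pre: int = 0                     # preorder rank, encodes the node order u < v

def is_ancestor(a: Node, d: Node) -> bool:
    """True if a is a proper ancestor of d, i.e. a lies in the transitive
    closure pi+(d) of d's parents."""
    p = d.parent
    while p is not None:
        if p is a:
            return True
        p = p.parent
    return False

def is_embedding(phi: dict, pattern_nodes: list) -> bool:
    """Check that phi maps the pattern into the host tree so that labels,
    the node order, and the ancestor relation are all preserved."""
    for u in pattern_nodes:
        if u.label != phi[u].label:               # lambda(u) == lambda(phi(u))
            return False
        for v in pattern_nodes:
            if u is v:
                continue
            if (u.pre < v.pre) != (phi[u].pre < phi[v].pre):
                return False                      # node order preserved
            if is_ancestor(v, u) != is_ancestor(phi[v], phi[u]):
                return False                      # ancestorship preserved
    return True
```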

Key Definitions

Given a database (forest) D of trees, the tree miner algorithm will produce a set P of embedded subtrees (patterns). For a given tree T ∈ D and a pattern S ∈ P, if S ⊑e T, we say that T contains S, or that S occurs in T; the support sup(S) of a pattern S is the set of trees of D that contain it, and S is frequent if |sup(S)| is no smaller than a given minimum support threshold.

Two subtrees P, Q ∈ P are said to be equivalent if they share the same set of occurrences, i.e., sup(P) = sup(Q). We denote such equivalence by P ≡ Q, and the equivalence class of P is defined by [P] = {Q ∈ P : P ≡ Q} (Arimura, 2008).

Definition (Maximal subtree)

Given an equivalence class [P], a subtree P ∈ [P] is said to be maximal if and only if there exists no strictly more specific subtree Q ∈ [P] such that P ⊏ Q.
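As an illustration, the following minimal sketch groups patterns into equivalence classes by their occurrence sets and filters the maximal ones; `occurs` and `more_specific` are hypothetical helpers standing in for the occurrence computation and the ⊏ relation, which are not spelled out here:

```python
from collections import defaultdict

def equivalence_classes(patterns, occurs):
    """Group patterns sharing the same occurrence set: P = Q iff sup(P) = sup(Q).
    occurs(p) returns the set of tree ids in which pattern p occurs."""
    classes = defaultdict(list)
    for p in patterns:
        classes[frozenset(occurs(p))].append(p)
    return list(classes.values())

def maximal_patterns(eq_class, more_specific):
    """Keep the patterns P of an equivalence class [P] for which no strictly
    more specific Q exists in the class; more_specific(q, p) means p < q."""
    return [p for p in eq_class
            if not any(more_specific(q, p) for q in eq_class if q is not p)]
```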

An Example

We will demonstrate the extraction of frequent tree patterns from workflows using the knowledge encoded in dmop with a simple scenario, in which we have four DM workflows that evaluate, using cross-validation, the performance of feature selection and classification with different algorithms. More precisely, the four workflows are:

a) feature selection based on Information Gain and classification with NaiveBayes

b) feature selection based on ReliefF and classification with C4.5

c) feature selection with CFS and classification with C4.5

d) feature selection using the wrapper approach, in which the search in the feature space is guided by the performance of NaiveBayes, and classification with NaiveBayes.

Their parse trees are given in Figure 3.10. Workflow a) performs univariate feature selection based on a univariate weighting algorithm. The three remaining workflows all perform multivariate feature selection: in b) this is done using a multivariate feature weighting algorithm, and in c) and d) using heuristic search in the space of feature sets, implemented by the OptimizeSelection operator, where the cost function used to guide the search is CFS and the NaiveBayes accuracy respectively. In Figure 3.11, we give the augmented parse trees of workflows a) and c).

Since the feature selection part of dmop is not yet mature, the two versions of the RM OptimizeSelection operator are both registered under MultiVariateFeatureSelection.

However, future extensions of dmop describing feature selection algorithms and operators should take such differences into account, e.g. describe that some of the methods use an explicit search mechanism coupled with a cost function, define the wrapper-based approach to feature selection, etc.
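To sketch how such taxonomic augmentation of a parse tree can be produced, the hypothetical fragment below inserts, above each operator node, the chain of concepts it is registered under; the taxonomy dictionary is a made-up excerpt for illustration, not dmop itself:

```python
# Hypothetical taxonomy fragment (illustrative only, not dmop):
# operator -> chain of subsuming concepts, most general first.
TAXONOMY = {
    "RM_WeightByReliefF": ["FeatureWeightingAlgorithm",
                           "MultivariateFeatureWeightingAlgorithm"],
    "RM_OptimizeSelection": ["FeatureSelectionAlgorithm",
                             "MultiVariateFeatureSelection"],
}

def augment(tree):
    """Recursively insert, above each node registered in the taxonomy,
    the chain of concepts that subsume it; a tree is (label, [children])."""
    label, children = tree
    node = (label, [augment(c) for c in children])
    for concept in reversed(TAXONOMY.get(label, [])):
        node = (concept, [node])  # most specific concept wraps first
    return node

# e.g. augment(("RM_WeightByReliefF", [])) yields the operator leaf nested
# under its multivariate and general feature weighting concepts.
```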

We applied the tree miner algorithm of Zaki (2005) to this small set of workflows, setting the minimum support to two in order to discover frequent embedded subtrees. Some of the extracted patterns and their support are shown in Figure 3.12. Pattern (a) shows that in two of the four workflows (a) and b)), a feature weighting algorithm is followed by the RM SelectByWeights operator, and that this pair forms a feature selection algorithm nested altogether in a cross-validation operator. Pattern (b) captures the fact that two of the four workflows (b) and c)) contain a multivariate feature selection followed by a decision tree algorithm, again nested inside a cross-validation. Pattern (c) corresponds to a MultivariateFeatureSelectionAlgorithm performed by the RM OptimizeSelection operator and then followed by some classification algorithm. As discussed above, the RM OptimizeSelection operator corresponds to a heuristic search over feature sets with some search strategy and some cost function over the feature sets, which nevertheless are not described for the moment. Pattern (d) is a generalization of patterns (a), (b) and (c), and covers all four workflows. It simply says that a feature selection algorithm is followed by a classification modeling algorithm. Finally, patterns (e) and (f) show two more patterns with a support of 100%. Pattern (e) corresponds to the validation step, where a learned model is applied on a testing set inside a cross-validation, and some performance metrics are produced from the applied model. Pattern (f) is a super-pattern of pattern (e), where we identify that before doing the validation step, some learned model should be produced by some classification modeling algorithm.

3.7 Discussion

In this chapter, we have proposed a method for mining generalized frequent patterns from DM workflows using a novel DM ontology, dmop, whose purpose is to provide an in-depth characterization of DM algorithms and their relations in terms of DM tasks, models and workflows, see Section 3.4. As we will see later, these relational workflow descriptors will be one of the major building blocks of our meta-mining framework, where we will use them to characterize DM workflows so that we can abstract from the ground specification of DM workflows and use these abstractions for two main purposes.

First, we will build meta-mining models which will account for both dataset and workflow characteristics to learn how to match datasets and workflows with respect to their relative performance, and we will propose different approaches to learn this match. Second, we will use the ontological representation of DM workflows to define the applicability model of our planning system. We will characterize the different workflow steps, i.e. DM operator combinations, of the partial candidate workflows produced by the DM workflow planner with these patterns, where at each planning step we will select those candidates which are expected to achieve maximum performance on the given dataset according to the learned meta-mining models.

Note that so far we have examined in this Chapter the use of trees to represent workflows, where we have used dmop with its taxonomic information to derive the augmented parse trees. Two comments should be added here. First, the graph representation of DM workflows, although more difficult to manage, might contain more information. It could thus be interesting to explore in future work graph-based approaches to frequent pattern extraction from DM workflows. Second, though very challenging, it should also be possible to extract patterns directly from the tree representation of DM workflows, i.e. without the help of an ontology like dmop, so that we can automatically learn algorithm relations and taxonomies. It will however be very difficult to extract patterns that contain the same level of knowledge as provided by the dmop ontology. For instance, it will be very challenging to infer that the ReliefF and SVM-RFE feature selection methods are both multivariate methods. One way to do this could be to explore supervised pattern mining approaches which take into account workflow performances on a set of datasets. It could thus be interesting to explore in future work a supervised pattern mining approach able to automatically learn DM algorithm taxonomies based on their relative performances.

Finally, the propositional workflow representation proposed in this Thesis, see Section 4.5.2, can easily deal with the parameter values of the different operators that appear within the workflows. To do so, we can discretize the range of values of a continuous parameter into ranges such as low, medium and high, or other ranges depending on the nature of the parameter, and treat these discretized values simply as a property of the operators.

The resulting patterns will then be "parameter-aware" and can be used to support the task of parameter or model selection from a DM workflow perspective, together with the task of workflow selection and/or planning, in a manner similar to works on automatic algorithm configuration for algorithm selection like AutoWeka (Thornton et al., 2013) and AutoFolio (Lindauer et al., 2015).
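As a minimal sketch of such a discretization (the bin boundaries, labels and the parameter name are illustrative assumptions, not those of the Thesis):

```python
def discretize(value, lo, hi, labels=("low", "medium", "high")):
    """Map a continuous parameter value in [lo, hi] to an ordinal label
    by splitting the range into equal-width bins."""
    step = (hi - lo) / len(labels)
    idx = min(int((value - lo) / step), len(labels) - 1)
    return labels[idx]

# An operator annotated with a discretized parameter value, which the
# pattern extractor can then treat as a simple property of the operator:
op = ("RM_DecisionTree", {"maximal_depth": discretize(12, 1, 30)})  # -> "medium"
```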


Figure 3.10: Parse trees of the four feature selection workflows: (a) feature selection with InformationGain, (b) feature selection with ReliefF, (c) feature selection with CFS, and (d) feature selection with the wrapper approach using NaiveBayes.

Figure 3.11: Augmented parse trees of feature selection with ReliefF and CFS: (a) rewritten augmented parse tree of Figure 3.10(b), (b) augmented parse tree of Figure 3.10(c). Thin edges depict workflow decomposition, double lines depict dmop's concept subsumption and bold lines depict dmop's implements relation.


Figure 3.12: Six patterns extracted from the augmented parse trees of the four workflows given in Figure 3.10.

Chapter 4 Meta-mining as a Classification Problem

4.1 Introduction

In this Chapter, we will present the different pieces of our meta-mining framework. We will first describe in Section 4.4 the extended Rice model in which meta-mining will take place. In addition, we will describe the three selection tasks that we will address with our framework. In Section 4.5 we will provide the full description of a real-world learning problem from which we extract dataset characteristics to build a meta-learning problem, on top of which we also extract workflow characteristics to build a meta-mining-oriented problem. Then in Section 4.6 we will address these two problems as classification problems, comparing the built meta-mining models with the baseline meta-learning model. Finally, we will discuss our approach in Section 4.7.

4.2 Motivation

In standard meta-learning, we only consider dataset characteristics and try to make predictions concerning either the performance of individual learning algorithms, or the relative performance of groups of algorithms (Anderson and Oates, 2007; Brazdil, Giraud-Carrier, Soares, and Vilalta, 2008; Giraud-Carrier, Vilalta, and Brazdil, 2004; Hilario, 2002; Kalousis, 2002; Kalousis and Theoharis, 1999; Kalousis, Gama, and Hilario, 2004; Leite and Brazdil, 2010; Pfahringer, Bensusan, and Giraud-Carrier, 2000; Soares and Brazdil, 2000; Vilalta, Giraud-Carrier, Brazdil, and Soares, 2004). The main idea in meta-learning is to learn meta-models which can describe the dataset characteristics under which a given algorithm or group of algorithms will perform well. As a result, there is no way to generalize the learned meta-models to new algorithms, since these models do not account for any algorithm characteristics other than their relative performance.

In this Chapter, we will describe the framework in which meta-mining will take place.

The main idea in meta-mining is to go beyond algorithms and consider workflows of algorithms, and try to make predictions concerning the relative performance of these workflows over different datasets, where we will characterize workflows with frequent workflow patterns as these are described in Chapter 3. Principally, we will propose to extend the Rice model (Rice, 1976) by accounting now for workflow descriptors, in addition to dataset descriptors. This will provide support for building new predictive models between learning problems and workflows with several new features. We evaluate our approach on a real-world meta-mining problem composed of related biological datasets on which we applied various feature selection plus classification workflows. We show that our approach outperforms the standard meta-learning approach solely based on dataset descriptors.

The main contributions are as follows:

(i) Unlike the traditional algorithm selection setting, in which the built models can only account for dataset characteristics to predict the (relative) performance of some already-experimented (groups of) algorithms, we will model the structural characteristics of algorithm combinations, i.e. workflows, by combining the frequent workflow patterns defined in the previous chapter with dataset characteristics. We will thus deepen the notion of algorithm to the notion of workflow and consider relations between algorithms within a workflow to address the task of workflow selection. This will give us the possibility to select not only over already-experimented workflows but also over workflows that have never been seen during past experimentation.

(ii) In addition, we will address two other selection tasks with our modeling framework: first, to select datasets for a given new workflow, which we call the dataset selection task, and second, to select over dataset-workflow pairs, i.e. pairs where both the dataset and the workflow have never been seen during past experimentation, which we call the dataset-workflow selection task.

This Chapter is based on the following publication.

Melanie Hilario, Phong Nguyen, Huyen Do, Adam Woznica, and Alexandros Kalousis. Ontology-based meta-mining of knowledge discovery workflows. In N. Jankowski, W. Duch, and K. Grabczewski, editors, Meta-Learning in Computational Intelligence. Springer, 2011.

4.3 Notations

We will provide the most basic notations here; we will introduce additional notations as they are needed. We will use the term DM experiment to designate the execution of a DM workflow w ∈ W on a dataset x ∈ X. We will denote by X the n × d matrix which contains the descriptions of the n datasets that will be used for the training of the meta-models, where a dataset x ∈ X is described by a d-dimensional column vector x = (x_1, ..., x_d)^T; we describe in detail the dataset descriptions that we use in Section 4.5.2.

Let us denote by W a collection of m workflows which will be used in the training of the meta-model, by P the set of frequent workflow patterns extracted from W using the method described in the previous chapter, and by W the corresponding m × |P| matrix that contains their propositional representations. For a given workflow, we will have two representations. First, w_l ∈ W will denote the l-length operator sequence that constitutes a DM workflow as described in Section 3.5. Note that different workflows can have different lengths, so this is not a fixed-length representation. Second, w ∈ W will denote the fixed-length |P|-dimensional binary vector representation of the workflow w_l; each feature of w indicates the presence or absence of some relational feature/pattern in the workflow. Essentially, w is the propositional representation of the workflow w_l; we will describe in more detail in Section 4.5.2 how we extract this propositional representation. Depending on the context, the different workflow notations can be used interchangeably.
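A minimal sketch of building this m × |P| binary matrix W; the `contains` predicate is a hypothetical helper testing whether a pattern is embedded in a workflow's augmented parse tree, as in Chapter 3:

```python
import numpy as np

def propositionalize(workflows, patterns, contains):
    """Build the m x |P| binary matrix W, with W[i, j] = I(p_j occurs in w_i),
    i.e. each workflow becomes a fixed-length |P|-dimensional binary vector."""
    W = np.zeros((len(workflows), len(patterns)), dtype=int)
    for i, wf in enumerate(workflows):
        for j, p in enumerate(patterns):
            W[i, j] = int(contains(p, wf))
    return W
```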

In addition, we will characterize each DM experiment by some performance measure; for example, if the mining problem we are addressing is a classification problem, one such performance measure can be the accuracy. From the performance measures of the workflows applied on a given dataset x we will extract the relative performance score y(x, w) ∈ R+ of each workflow w for the dataset x. We will do so by statistically comparing the performance differences of the different workflows for the given dataset, see Section 4.5.2. We will denote by Y the n × m preference matrix between datasets and workflows, the Y_ij entry of which gives some measure of appropriateness, preference, or match, like classification accuracy, of the x_i and w_j instances. Table 4.1 summarizes the most important notations.

Symbol                                        Meaning
o                                             a workflow operator.
e                                             a workflow data type.
w_l = [o_1, ..., o_l]                         a ground DM workflow as a sequence of l operators.
w = (I(p_1 ⊑ w_l), ..., I(p_|P| ⊑ w_l))^T     the fixed-length |P|-dimensional vector description of a workflow w_l.
x = (x_1, ..., x_d)^T                         the d-dimensional vector description of a dataset.
y(x, w)                                       the relative performance score of workflow w on dataset x.
g                                             a DM goal.
t                                             an HTN task.
m                                             an HTN method.
Ô = {o_1, ..., o_n}                           an HTN abstract operator with n possible operators.
C_l                                           a set of candidate workflows at some abstract operator Ô.
S_l                                           a set of candidate workflows selected from C_l.

Table 4.1: Summary of notations used.
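The precise scoring scheme is deferred to Section 4.5.2; purely as an illustrative stand-in (not the scheme of the Thesis), one could score each workflow on a dataset by the fraction of rival workflows it beats significantly under a paired t-test over per-fold accuracies:

```python
import numpy as np
from scipy import stats

def preference_scores(fold_acc, alpha=0.05):
    """fold_acc: m x k array of per-fold accuracies of m workflows on one
    dataset. Return one row of the preference matrix Y, scoring each
    workflow by its share of statistically significant pairwise wins."""
    m = fold_acc.shape[0]
    wins = np.zeros(m)
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            t, p = stats.ttest_rel(fold_acc[i], fold_acc[j])
            if p < alpha and t > 0:  # workflow i significantly beats j
                wins[i] += 1
    return wins / (m - 1)
```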

4.4 Meta-mining Framework

We will provide here the general description of our framework. In the next section, we will define the task of algorithm selection (Hilario, 2002; Smith-Miles, 2008), which is best described with the Rice model and is one of the two main tasks in meta-learning – the other being model selection (Ali and Smith-Miles, 2006, 2007). Then we will see how to extend this model to account for workflow descriptors. From this new model, we will describe in Section 4.4.3 the three meta-mining tasks that we will address with our framework.

4.4.1 Rice Model

The task of algorithm selection traces back to the mid-seventies, when J. Rice formalized this task in a seminal paper (Rice, 1976). In his model, this task is decomposed into four main components: the problem space X, the algorithm space A, the performance space Y between problems and algorithms, and a problem feature space which describes the original problem space. Basically, the task of algorithm selection amounts to extracting relevant features x ∈ R^d from different problems x ∈ X in order to find the best map S(x) to select the algorithm a ∈ A which will maximize the performance metric y(x, a). Figure 4.1 shows the Rice model and its four components.

Figure 4.1: The Rice model and its four components: the problem space X with its problem feature space x ∈ R^d, the selection mapping a* = arg max_{a∈A} S(x) into the algorithm space A, and the performance mapping y(x, a) into the performance space Y.

In meta-learning, we will typically proceed as follows. First, we will extract training meta-data, i.e. a set of meta-examples of the form ⟨x, y(x, a)⟩, from a set of learning problems x ∈ X and a set of algorithms A ⊆ A. With these training data, we will train one or more meta-learning models of the form H : X × A → Y, which will be used to predict from the pool A of base-level learners the best algorithm a* ∈ A that maximizes a given performance metric y ∈ Y on a given learning problem x ∈ X:

a* = S(x) = arg max_{a∈A} h(x, a)    (4.1)

where h ∈ H is a meta-learning model to predict the performance of a given algorithm a ∈ A for a given learning problem x. Several variants for building a meta-learning model exist in the literature; one can build either one meta-learning model per algorithm and select the algorithm that delivers the highest performance on a given dataset (King et al., 1995; Michie et al., 1994), or build one meta-learning model per pair of algorithms and then rank each algorithm according to its relative performance on the dataset (Kalousis and Theoharis, 1999; Peng et al., 2002; Sun and Pfahringer, 2013), or even build a single model to predict the performance of similarly performing algorithms on the dataset (Aha, 1992). Overall, the main assumption behind the algorithm selection task in meta-learning is that problems characterized by similar features map to the same algorithm and exhibit similar performance (Giraud-Carrier, 2008).
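A hedged sketch of the first variant, one meta-model per algorithm, implementing the selection rule of Eq. (4.1); the regressor choice is ours for illustration, not that of the cited works:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_meta_models(X_meta, perf):
    """X_meta: n x d matrix of dataset descriptions; perf: dict mapping each
    algorithm a to its n observed performances. Fit one model h(x, a) per a."""
    return {a: RandomForestRegressor().fit(X_meta, y) for a, y in perf.items()}

def select_algorithm(models, x):
    """a* = arg max_{a in A} h(x, a): pick the algorithm whose meta-model
    predicts the highest performance on the new dataset description x."""
    x = np.asarray(x).reshape(1, -1)
    return max(models, key=lambda a: models[a].predict(x)[0])
```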

However, the main drawback in meta-learning is that this framework is restricted to model solely over the specific algorithm entities given in the pool A of base-level learners;