# Tree Pattern Mining

## Chapter 3 Data Mining Workflow Pattern Analysis

### 3.6.2 Tree Pattern Mining

We will use the tree miner of Zaki (2005) to search for frequent trees over the augmented tree representation of our DM workflows. A key concept of the algorithm is the embedded tree. A tree S is embedded in a tree T, denoted as S ⊑e T, if and only if there exists a mapping ϕ: O_S → O_T such that

∀u, v ∈ O_S: λ(u) = λ(ϕ(u))

∧ (u ≺ v ⇔ ϕ(u) ≺ ϕ(v))

∧ (π(u) = v ⇔ π(ϕ(u)) = ϕ(v))

This subtree definition has the properties that the order of the children of a node is preserved, as well as the transitive closure of its parents (ancestors). This is a less restrictive definition than the induced subtree definition, and as such, embedded subtrees are able to extract patterns "hidden" or embedded deep within large trees which might be missed by the induced subtree definition (Zaki, 2005).
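To make the definition concrete, the following sketch checks whether one tree is embedded in another by brute force, using the standard (pre, post) interval encoding of nodes. The tuple-based tree encoding and function names are our own illustrative choices, not part of Zaki's TreeMiner implementation, which uses a far more efficient enumeration strategy.

```python
from itertools import combinations

def index_tree(tree):
    """Assign a (pre, post) interval to each node of tree = (label, children);
    return the list of (label, pre, post) triples in preorder."""
    nodes, counter = [], [0]
    def walk(node):
        label, children = node
        entry = [label, counter[0], None]
        counter[0] += 1
        nodes.append(entry)
        for child in children:
            walk(child)
        entry[2] = counter[0]
        counter[0] += 1
    walk(tree)
    return [tuple(n) for n in nodes]

def is_ancestor(a, b):
    # a is an ancestor of b iff a's interval strictly encloses b's
    return a[1] < b[1] and a[2] > b[2]

def embedded_in(pattern, tree):
    """True iff `pattern` is an embedded subtree of `tree`: node labels match,
    and the ancestor relation (and hence left-to-right order) is preserved."""
    P, T = index_tree(pattern), index_tree(tree)
    for idxs in combinations(range(len(T)), len(P)):
        cand = [T[i] for i in idxs]
        if all(p[0] == c[0] for p, c in zip(P, cand)) and all(
            is_ancestor(P[i], P[j]) == is_ancestor(cand[i], cand[j])
            for i in range(len(P)) for j in range(i + 1, len(P))
        ):
            return True
    return False
```

Because candidate nodes are taken in increasing preorder, checking the ancestor relation for each pair suffices: two nodes with neither being an ancestor of the other are automatically in left-to-right order.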


Figure 3.9: Embedded subtree (d) from trees (a), (b) and (c)

For instance, in Figure 3.9, given three trees T1, T2 and T3, and a support of 100%, the resulting embedded subtree is shown in Figure 3.9(d). This pattern has skipped the middle node in each tree, revealing the common ordered structure within the three trees.

This is exactly what we want from our frequent pattern extractor, since it preserves the total order of the parse trees and their common structure.

Key Definitions

Given a database (forest) D of trees, the tree miner algorithm will produce a set P of embedded subtrees (patterns). For a given tree T ∈ D and a pattern S ∈ P, if S ⊑e T, we say that S occurs in T; we denote by sup(S) the set of occurrences of S in D.

Two subtrees P, Q ∈ P are said to be equivalent if they share the same set of occurrences, i.e., sup(P) = sup(Q). We denote such equivalence by P ≡ Q, and the equivalence class of P is defined by [P] = {Q ∈ P : P ≡ Q} (Arimura, 2008).

Definition (Maximal subtree)

Given an equivalence class [P], a subtree P ∈ [P] is said to be maximal if and only if there exists no strictly more specific subtree Q ∈ [P] such that P ⊏ Q.
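As a minimal illustration of occurrence sets, equivalence classes and maximality, the sketch below deliberately simplifies patterns to operator-label sequences and containment to the subsequence relation; the real setting uses embedded-subtree containment, but the grouping and maximality logic is the same.

```python
from collections import defaultdict

def is_subpattern(p, q):
    """p ⊑ q under our simplification: p is a subsequence of q."""
    it = iter(q)
    return all(x in it for x in p)

def occurrences(pattern, forest):
    """sup(pattern): the set of tree ids in `forest` containing `pattern`."""
    return frozenset(i for i, t in enumerate(forest) if is_subpattern(pattern, t))

def maximal_patterns(patterns, forest):
    """Group patterns into equivalence classes (identical occurrence sets)
    and keep, per class, only the maximal ones: patterns not strictly
    contained in another pattern of the same class."""
    classes = defaultdict(list)
    for p in patterns:
        classes[occurrences(p, forest)].append(p)
    maximal = []
    for cls in classes.values():
        for p in cls:
            if not any(q != p and is_subpattern(p, q) for q in cls):
                maximal.append(p)
    return maximal
```

For example, if two patterns occur in exactly the same trees and one contains the other, only the more specific one is reported as maximal.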

An Example

We will demonstrate the extraction of frequent tree patterns from workflows using the knowledge encoded in dmop with a simple scenario, in which we have four DM workflows that evaluate, using cross-validation, the performance of feature selection and classification with different algorithms. More precisely, the four workflows are:

a) feature selection based on Information Gain and classification with NaiveBayes

b) feature selection based on ReliefF and classification with C4.5

c) feature selection with CFS and classification with C4.5

d) feature selection using the wrapper approach, in which the search in the feature space is guided by the performance of NaiveBayes, and classiﬁcation with NaiveBayes.

Their parse trees are given in Figure 3.10. Workflow a) performs univariate feature selection based on a univariate weighting algorithm. The three remaining workflows all perform multivariate feature selection, where in b) this is done using a multivariate feature weighting algorithm, and in c) and d) using heuristic search, implemented by the OptimizeSelection operator, in the space of feature sets, where the cost function used to guide the search is CFS and the NaiveBayes accuracy respectively. In Figure 3.11, we give the augmented parse trees of workflows a) and c).

Since the feature selection part of dmop is not yet mature enough, the two versions of the RM OptimizeSelection operator are both registered under MultiVariateFeatureSelection.

However, the extensions of dmop describing feature selection algorithms and operators should take such differences into account and describe them, e.g. the fact that some of the methods use an explicit search mechanism coupled with a cost function, which defines the wrapper-based approach to feature selection, etc.

We applied the tree miner algorithm of Zaki (2005) on this small set of workflows, setting the minimum support to two in order to discover frequent embedded subtrees. Some of the extracted patterns and their support are shown in Figure 3.12. Pattern (a) shows that in two of the four workflows (a) and b)), a feature weighting algorithm is followed by the RM SelectByWeights operator, and that this pair forms a feature selection algorithm nested inside a cross-validation operator. Pattern (b) captures the fact that two of the four workflows (b) and c)) contain a multivariate feature selection followed by a decision tree algorithm, again nested inside a cross-validation. Pattern (c) corresponds to a MultivariateFeatureSelectionAlgorithm performed by the RM OptimizeSelection operator and then followed by some classification algorithm. As discussed above, the RM OptimizeSelection operator corresponds to a heuristic search over feature sets with some search strategy and some cost function over the feature sets, which nevertheless are not described for the moment. Pattern (d) is a generalization of patterns (a), (b) and


(c), and covers all four workflows. It simply says that a feature selection algorithm is followed by a classification modeling algorithm. Finally, patterns (e) and (f) show two more patterns with a support of 100%. Pattern (e) corresponds to the validation step, where a learned model is applied on a testing set inside a cross-validation and some performance metrics are produced from the applied model. Pattern (f) is a super-pattern of pattern (e), in which we identify that before the validation step, a learned model should be produced by some classification modeling algorithm.

### 3.7 Discussion

In this chapter, we have proposed a method for mining generalized frequent patterns from DM workflows using a novel DM ontology, dmop, whose purpose is to provide an in-depth characterization of DM algorithms and their relations in terms of DM tasks, models and workflows, see Section 3.4. As we will see later, these relational workflow descriptors will be one of the major building blocks of our meta-mining framework, where we will use them to characterize DM workflows, so that we can abstract from the ground specification of DM workflows and use these abstractions for two main purposes.

First, we will build meta-mining models which will account for both dataset and workflow characteristics to learn how to match datasets and workflows with respect to their relative performance, and we will propose different approaches to learn this match. Second, we will use the ontological representation of DM workflows to define the applicability model of our planning system. We will characterize the different workflow steps, i.e. DM operator combinations, of the partial candidate workflows produced by the DM workflow planner with these patterns, where at each planning step we will select those candidates which are expected to achieve maximum performance on the given dataset according to the learned meta-mining models.

Note that so far in this chapter we have examined the use of trees to represent workflows, where we have used dmop with its taxonomic information to derive the augmented parse trees. Two comments should be added here. First, the graph representation of DM workflows, although more difficult to manage, might contain more information. It could thus be interesting to explore graph-based approaches to frequent pattern extraction from DM workflows in future work. Second, though very challenging, it should also be possible to extract patterns directly from the tree representation of DM workflows, i.e.

without the help of an ontology like dmop, so that we can automatically learn algorithm relations and taxonomies. It will however be very difficult to extract patterns that contain the same level of knowledge as provided by the dmop ontology. For instance, it

will be very challenging to infer that the ReliefF and SVM-RFE feature selection methods are both multivariate methods. One way to do this could be to explore supervised pattern mining approaches which take into account workflow performances on a set of datasets. It could thus be interesting to explore in future work a supervised pattern mining approach able to automatically learn DM algorithm taxonomies based on their relative performances.

Finally, the propositional workflow representation proposed in this Thesis, see Section 4.5.2, can easily deal with the parameter values of the different operators that appear within the workflows. To do so, we can discretize the range of values of a continuous parameter into ranges such as low, medium and high, or other ranges depending on the nature of the parameter, and treat these discretized values simply as a property of the operators.

The resulting patterns will now be "parameter-aware" and can be used to support the task of parameter or model selection from a DM workflow perspective, together with the task of workflow selection and/or planning, in a similar manner to works on automatic algorithm configuration for algorithm selection like AutoWeka (Thornton et al., 2013) and AutoFolio (Lindauer et al., 2015).
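A minimal sketch of this discretization step, using equal-width bins; the operator token format (e.g. `RM_DecisionTree[confidence=low]`) and the parameter names are illustrative assumptions, not RapidMiner's or dmop's actual conventions.

```python
def discretize(value, lo, hi, labels=("low", "medium", "high")):
    """Map a continuous parameter value to a coarse categorical label by
    splitting the range [lo, hi] into equal-width bins."""
    if hi <= lo:
        raise ValueError("invalid parameter range")
    frac = min(max((value - lo) / (hi - lo), 0.0), 1.0)
    idx = min(int(frac * len(labels)), len(labels) - 1)
    return labels[idx]

def parameter_aware(op_name, params, ranges):
    """Augment an operator token with discretized parameter properties,
    so that patterns over these tokens become 'parameter-aware'."""
    props = ",".join(
        f"{k}={discretize(v, *ranges[k])}" for k, v in sorted(params.items())
    )
    return f"{op_name}[{props}]" if props else op_name
```

Treating the discretized value as part of the operator token lets the existing pattern miner run unchanged over the augmented labels.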


Figure 3.10: Parse trees of the four experimented feature selection workflows: (a) feature selection with InformationGain, (b) feature selection with ReliefF, (c) feature selection with CFS, and (d) feature selection with the wrapper approach using NaiveBayes.

Figure 3.11: Augmented parse trees of feature selection with ReliefF and CFS: (a) the rewritten augmented parse tree of Figure 3.10(b), and (b) the augmented parse tree of Figure 3.10(c). Thin edges depict workflow decomposition, double lines depict dmop's concept subsumption, and bold lines depict dmop's implements relation.


Figure 3.12: Six patterns extracted from the augmented parse trees of the four workflows given in Figure 3.10.

## Meta-mining as a Classification Problem

### 4.1 Introduction

In this Chapter, we will present the different pieces of our meta-mining framework. We will first describe in Section 4.4 the extended Rice model in which meta-mining will take place. In addition, we will describe the three selection tasks that we will address with our framework. In Section 4.5 we will provide the full description of a real-world learning problem, from which we extract dataset characteristics to build a meta-learning problem, and on top of which we also extract workflow characteristics to build a meta-mining-oriented problem. Then, in Section 4.6, we will address these two problems as classification problems and compare the built meta-mining models with the baseline meta-learning model. Finally, we will discuss our approach in Section 4.7.

### 4.2 Motivation

In standard meta-learning, we only consider dataset characteristics and try to make predictions concerning either the performance of individual learning algorithms, or the relative performance of groups of algorithms (Anderson and Oates, 2007; Brazdil, Giraud-Carrier, Soares, and Vilalta, 2008; Giraud-Carrier, Vilalta, and Brazdil, 2004; Hilario, 2002; Kalousis, 2002; Kalousis and Theoharis, 1999; Kalousis, Gama, and Hilario, 2004;


Leite and Brazdil, 2010; Pfahringer, Bensusan, and Giraud-Carrier, 2000; Soares and Brazdil, 2000; Vilalta, Giraud-Carrier, Brazdil, and Soares, 2004). The main idea in meta-learning is to learn meta-models which can describe the dataset characteristics under which a given algorithm or group of algorithms will perform well. As a result, there is no way to generalize the learned meta-models to new algorithms, since these models do not account for any algorithm characteristics other than their relative performance.

In this Chapter, we will describe the framework in which meta-mining will take place.

The main idea in meta-mining is to go beyond algorithms and consider workflows of algorithms, and to try to make predictions concerning the relative performance of these workflows over different datasets, where we will characterize workflows with frequent workflow patterns as described in Chapter 3. Principally, we will propose to extend the Rice model (Rice, 1976) by now accounting for workflow descriptors in addition to dataset descriptors. This will provide support for building new predictive models between learning problems and workflows with several new features. We evaluate our approach on a real-world meta-mining problem composed of related biological datasets on which we applied various feature selection plus classification workflows. We show that our approach outperforms the standard meta-learning approach solely based on dataset descriptors.

The main contributions are as follows:

(i) Unlike the traditional algorithm selection setting, in which the built models can only account for dataset characteristics to predict the (relative) performance of some already-experimented (groups of) algorithms, we will model the structural characteristics of algorithm combinations, i.e. workflows, by combining the frequent workflow patterns defined in the previous chapter with dataset characteristics. We will thus deepen the notion of algorithm to the notion of workflow and consider relations between algorithms within a workflow to address the task of workflow selection. This gives us the possibility to select not only over already-experimented workflows but also over workflows that have never been seen during past experimentation.

(ii) We will also address two new selection tasks: first, to select datasets for a given new workflow, which we call the dataset selection task; and second, to select over dataset-workflow pairs, i.e. pairs where both the dataset and the workflow have never been seen during past experimentation, which we call the dataset-workflow selection task.

This Chapter is based on the following publication:

Melanie Hilario, Phong Nguyen, Huyen Do, Adam Woznica, and Alexandros Kalousis. Ontology-based meta-mining of knowledge discovery workflows. In N. Jankowski, W. Duch, and K. Grabczewski, editors, Meta-Learning in Computational Intelligence. Springer, 2011.

### 4.3 Notations

We will provide the most basic notations here; additional notations will be introduced as they are needed. We will use the term DM experiment to designate the execution of a DM workflow w ∈ W on a dataset x ∈ X. We will denote by X the n × d matrix which contains the descriptions of the n datasets that will be used for the training of the meta-models, where a dataset x ∈ X is described by a d-dimensional column vector x = (x1, . . . , xd)^T; we describe in detail the dataset descriptions that we use in Section 4.5.2.

Let us denote by W a collection of m workflows which will be used in the training of the meta-model, by P the set of frequent workflow patterns extracted from W using the method described in the previous chapter, and by W the corresponding m × |P| matrix that contains their propositional representations. For a given workflow, we will have two representations. First, wl ∈ W will denote the l-length operator sequence that constitutes a DM workflow, as described in Section 3.5. Note that different workflows can have different lengths, so this is not a fixed-length representation. Second, w ∈ W will denote the fixed-length |P|-dimensional binary vector representation of the workflow wl; each feature of w indicates the presence or absence of some relational feature/pattern in the workflow.

Essentially, w is the propositional representation of the workflow wl; we will describe in more detail in Section 4.5.2 how we extract this propositional representation. Depending on the context, the different workflow notations can be used interchangeably.
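A sketch of how the m × |P| matrix W can be assembled once pattern containment is available; `contains` is an assumed callback (in our setting, the embedded-subtree test over augmented parse trees), shown here with a trivial membership test purely for illustration.

```python
def workflow_matrix(workflows, patterns, contains):
    """Build the m x |P| binary matrix W: W[i][j] = 1 iff pattern j
    occurs in workflow i according to the supplied containment test."""
    return [[int(contains(p, w)) for p in patterns] for w in workflows]

# Illustration: workflows reduced to the sets of pattern ids they match.
patterns = ["p1", "p2", "p3"]
workflows = [{"p1", "p3"}, {"p2"}]
W = workflow_matrix(workflows, patterns, lambda p, w: p in w)
```

Keeping containment as a callback separates the propositionalization step from the (more expensive) tree-matching step.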

In addition, we will characterize each DM experiment by some performance measure; for example, if the mining problem we are addressing is a classification problem, one such performance measure can be the accuracy. From the performance measures of the workflows applied on a given dataset x, we will extract the relative performance score y(x, w) ∈ R+ of each workflow w for the dataset x. We will do so by statistically comparing the performance differences of the different workflows for the given dataset, see Section 4.5.2. We will denote by Y the n × m preference matrix between datasets and workflows, whose entry Yij gives some measure of appropriateness, preference, or match, such as classification accuracy, of the xi and wj instances. Table 4.1 summarizes the most important notations.

| Symbol | Meaning |
| --- | --- |
| o | a workflow operator |
| e | a workflow data type |
| wl = [o1, . . . , ol] | a ground DM workflow as a sequence of l operators |
| w = (I(p1 ⊑ wl), . . . , I(p\|P\| ⊑ wl))^T | the fixed-length \|P\|-dimensional vector description of a workflow wl |
| x = (x1, . . . , xd)^T | the d-dimensional vector description of a dataset |
| y(x, w) | the relative performance score of workflow w on dataset x |
| g | a DM goal |
| m | an HTN method |
| Ô = {o1, . . . , on} | an HTN abstract operator with n possible operators |
| Cl | a set of candidate workflows at some abstract operator Ô |
| Sl | a set of candidate workflows selected from Cl |

Table 4.1: Summary of notations used.
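As one hedged stand-in for the statistical comparison of Section 4.5.2, the sketch below scores each workflow on a dataset by the fraction of the other workflows it strictly outperforms; the actual thesis procedure relies on significance tests over performance differences rather than raw accuracy comparisons.

```python
def preference_matrix(acc):
    """acc[i][j]: accuracy of workflow j on dataset i (n x m nested lists).
    Returns Y with Y[i][j] = fraction of the other workflows that
    workflow j strictly beats on dataset i."""
    return [
        [sum(a_j > a_k for a_k in row) / (len(row) - 1) for a_j in row]
        for row in acc
    ]
```

The resulting Y is the n × m preference matrix used to train the meta-models.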

### 4.4 Meta-mining Framework

We will provide here the general description of our framework. In the next section, we will define the task of algorithm selection (Hilario, 2002; Smith-Miles, 2008), which is best described by the Rice model and is one of the two main tasks in meta-learning – the other being model selection (Ali and Smith-Miles, 2006, 2007). Then we will see how we extend this model to account for workflow descriptors. From this new model, we will describe in Section 4.4.3 the three meta-mining tasks that we will address with our framework.

### 4.4.1 Rice Model

The task of algorithm selection traces back to the mid-seventies, when J. Rice formalized it in a seminal paper (Rice, 1976). In his model, this task is decomposed into four main components: the problem space X, the algorithm space A, the performance space Y between problems and algorithms, and a problem feature space which describes the original problem space. Basically, the task of algorithm selection amounts to extracting relevant features x ∈ R^d from the different problems x ∈ X in order to find the best map S(x) to select the algorithm a ∈ A which will maximize the performance metric y(x, a). Figure 4.1 shows the Rice model and its four components.

Figure 4.1: The Rice model and its four components, linked by the selection mapping a = argmax_{a∈A} S(x) and the performance mapping y(x, a).

In meta-learning, we typically proceed as follows. First, we extract training meta-data, i.e. a set of meta-examples of the form ⟨x, y(x, a)⟩, from a set of learning problems x ∈ X and a set of algorithms A ⊆ A. With these training data, we train one or more meta-learning models of the form H : X × A → Y, which will be used to predict, from the pool A of base-level learners, the best algorithm a ∈ A that maximizes a given performance metric y ∈ Y on a given learning problem x ∈ X:

a = S(x) = argmax_{a∈A} h(x, a)    (4.1)

where h ∈ H is a meta-learning model that predicts the performance of a given algorithm a ∈ A for a given learning problem x. Several variants for building a meta-learning model exist in the literature: one can build either one meta-learning model per algorithm and select the algorithm that delivers the highest performance on a given dataset (King et al., 1995; Michie et al., 1994), or build one meta-learning model per pair of algorithms and then rank each algorithm according to its relative performance on the dataset (Kalousis and Theoharis, 1999; Peng et al., 2002; Sun and Pfahringer, 2013), or even build a single model to predict the performance of similarly performing algorithms on the dataset (Aha, 1992). Overall, the main assumption behind the algorithm selection task in meta-learning is that problems characterized by similar features map to the same algorithm and exhibit similar performance (Giraud-Carrier, 2008).
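The "one meta-learning model per algorithm" variant can be sketched as follows, using a 1-nearest-neighbour performance predictor purely for illustration; real meta-learners would use a proper regressor, and the feature vectors and algorithm names here are hypothetical.

```python
from collections import defaultdict

def train_per_algorithm(meta_examples):
    """meta_examples: iterable of (x, algo, perf) triples, where x is a
    vector of dataset characteristics. Returns one model h_a per algorithm;
    here h_a simply predicts the performance observed on the nearest
    training dataset (1-NN)."""
    by_algo = defaultdict(list)
    for x, a, y in meta_examples:
        by_algo[a].append((x, y))

    def make_h(pairs):
        def h(x):
            # predicted performance = performance on the closest dataset
            nearest = min(
                pairs,
                key=lambda p: sum((u - v) ** 2 for u, v in zip(p[0], x)),
            )
            return nearest[1]
        return h

    return {a: make_h(pairs) for a, pairs in by_algo.items()}

def select_algorithm(models, x):
    """Eq. (4.1): a = argmax_{a in A} h_a(x)."""
    return max(models, key=lambda a: models[a](x))
```

Selection then reduces to evaluating every per-algorithm model on the new dataset's features and taking the arg max, exactly as in Eq. (4.1).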

However, the main drawback of meta-learning is that this framework is restricted to model solely over the specific algorithm entities given in the pool A of base-level learners;