# Characteristics of Feature Selection Algorithms

## Chapter 3 Data Mining Workflow Pattern Analysis

### 3.4.4 Characteristics of Feature Selection Algorithms

Another important class of DM algorithms is that of feature selection (FS) algorithms. Feature selection is a particular case of dimensionality reduction, in which the feature dimensionality is reduced by eliminating those features that are irrelevant or redundant according to

Figure 3.4: dmop's characteristics of feature selection algorithms.

some criterion. For instance, in the case of classification, the selection criterion is the discriminative power of a feature with respect to the class labels. Feature selection is thus a combinatorial search over the feature space, in which one or more features are evaluated at each step until the best ones are found. FS algorithms can be characterized along four dimensions, which we briefly describe now; see Figure 3.4.

The first dimension, interactsWithLearnerAs, describes how FS algorithms are coupled with the learning algorithm. In filter methods, such as Correlation-based Feature Selection (CFS) (Hall, 1998) or ReliefF (Kononenko, 1994), feature selection is done separately from the learning method, as a pre-processing step. In wrapper methods, feature selection is wrapped around the learning procedure: the estimated performance of the learned model is used as the selection criterion, i.e. the quality of the selected feature subsets is evaluated by the learning procedure itself. In embedded methods, such as SVM-RFE (Guyon, Gunn, Nikravesh, and Zadeh, 2006) or decision trees, feature selection is directly encoded in the learning procedure.

The second dimension, hasOptimizationStrategy, describes the (discrete) optimization strategy that FS algorithms use to search the discrete space of feature subsets. It is determined by five properties: its search coverage (global, local), its search direction (forward, backward), its choice policy (irrevocable, tentative), the amount of state knowledge that guides the search (blind, informed), and its level of uncertainty (deterministic, stochastic).


For instance, C4.5 uses a global greedy (blind) forward selection scheme, while SVM-RFE uses a global greedy backward elimination scheme.
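The five properties of an optimization strategy can be sketched as a simple record type. The encoding below is purely illustrative, not dmop's actual modelling; in particular, reading "greedy" as an irrevocable choice policy is our assumption:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OptimizationStrategy:
    """Toy encoding of the five hasOptimizationStrategy properties."""
    coverage: str       # "global" | "local"
    direction: str      # "forward" | "backward"
    choice_policy: str  # "irrevocable" | "tentative"
    guidance: str       # "blind" | "informed"
    uncertainty: str    # "deterministic" | "stochastic"

# The two examples from the text: C4.5's forward selection scheme
# and SVM-RFE's backward elimination scheme.
c45 = OptimizationStrategy("global", "forward", "irrevocable", "blind", "deterministic")
svm_rfe = OptimizationStrategy("global", "backward", "irrevocable", "blind", "deterministic")
```

Such a record makes it easy to compare two FS algorithms property by property, which is exactly what the generalized pattern mining of Section 3.6 exploits.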

The third dimension, hasFeatureEvaluator, determines the way FS algorithms evaluate/weight the features found at each step of the optimization procedure. The evaluation can be targeted towards a single feature or a feature subset. Its context can be either univariate, as in InformationGain, or multivariate, as in SVM-RFE. ReliefF is the only FS algorithm whose target is a single feature with a multivariate context. Finally, the evaluation function gives the selection criterion under which (subsets of) features are evaluated.

Finally, the fourth dimension, hasDecisionStrategy, copes with the final decision of selecting the final feature subset. This can be done either with a statistical test, as in χ2, or by using a simple thresholding cut-off function over the feature weights, where the threshold is given as a parameter, or by keeping the top-k features with the highest weights.
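The two weight-based decision strategies can be sketched in a few lines; the function names are our own, not part of any FS library:

```python
def select_by_threshold(weights, threshold):
    """Thresholding cut-off: keep every feature whose weight reaches the threshold."""
    return [f for f, w in weights.items() if w >= threshold]

def select_top_k(weights, k):
    """Top-k rule: keep the k features with the highest weights."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:k]

w = {"f1": 0.9, "f2": 0.1, "f3": 0.5}
select_by_threshold(w, 0.4)  # → ['f1', 'f3']
select_top_k(w, 1)           # → ['f1']
```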

In the next section, we will see how to mine feature selection and classification workflows, from which we will extract frequent patterns defined over the above algorithm characteristics. These generalized frequent patterns describe different combinations of algorithm characteristics that appear frequently in a set of training workflows; we will use them in the rest of the thesis to characterize the workflow space and to address our different meta-mining tasks.

### 3.5 A Formal Definition of Data Mining Workflow

We will now give a formal definition of a DM workflow and of how we represent it. DM workflows are directed acyclic typed graphs (DAGs), in which nodes correspond to operators and edges between nodes to data input/output objects. In fact, they are hierarchical DAGs, since they can have dominating nodes/operators that themselves contain sub-workflows.

A typical example is the cross-validation operator, whose control flow is given by the execution in parallel of training and testing sub-workflows, or a complex operator of the boosting type. More formally, let:

- O be the set of all available operators that can appear in a DM workflow, e.g. classification operators such as J48, SVMs, etc. O also includes dominating operators, which are defined by one or more sub-workflows they dominate, e.g. cross-validation or model-combination operators such as boosting;

- E be the set of all available data types that can appear in a DM workflow, namely the data types of the various I/O objects that can appear in a DM workflow, such as models, datasets, attributes, etc.

Formally, an operator o ∈ O is defined by its name, through a labelling function λ(o), by the data types e ∈ E of its inputs and outputs, and by its direct sub-workflows if o is a dominating operator. The graph structure of a DM workflow is then a pair (O, E), which also contains all sub-workflows, if any, where: O ⊂ O is the set of vertices or nodes, which correspond to all the operators used in this DM workflow and its sub-workflow(s), and E ⊂ E is the set of pairs of nodes (o_i, o_j), called directed edges, which correspond to the data types of the input/output objects passed from operator o_i to operator o_j.

Following the above definition, we can build a vector representation of a DM workflow by considering the topological order of its operators. The topological sort, or order, of a DM workflow is a permutation of the vertices of its graph structure such that an edge (o_i, o_j) implies that o_i appears before o_j; i.e., it is a complete ordering of the nodes of a directed acyclic graph, given by the node sequence:

w_l = [o_1, .., o_l]   (3.1)

where the subscript in w_l denotes the length l (i.e. the number of operators) of the topological sort. If the topological sort is not unique, it is always possible to obtain a unique sort using second-order information such as the lexicographic order of the vertex labels.
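The tie-breaking idea can be sketched with Kahn's algorithm, where a min-heap over node labels enforces the lexicographic order; the operator names in the example are hypothetical:

```python
import heapq

def topological_sort(nodes, edges):
    """Kahn's algorithm; ties broken by lexicographic order of the node
    labels, so the resulting sort w_l = [o_1, .., o_l] is unique."""
    succs = {n: [] for n in nodes}
    indeg = {n: 0 for n in nodes}
    for oi, oj in edges:
        succs[oi].append(oj)
        indeg[oj] += 1
    ready = [n for n in nodes if indeg[n] == 0]
    heapq.heapify(ready)  # min-heap keeps ready nodes in lexicographic order
    order = []
    while ready:
        n = heapq.heappop(ready)
        order.append(n)
        for m in succs[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                heapq.heappush(ready, m)
    return order

# After "Retrieve", two independent operators become ready at once;
# the heap then picks them in lexicographic order.
print(topological_sort(["Retrieve", "ApplyModel", "NaiveBayes"],
                       [("Retrieve", "NaiveBayes"), ("Retrieve", "ApplyModel")]))
# → ['Retrieve', 'ApplyModel', 'NaiveBayes']
```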

The topological sort of a DM workﬂow can be structurally represented with a rooted, labelled and ordered tree (Bringmann, 2004; Zaki, 2005), by doing a depth-ﬁrst search over its graph structure where the maximum depth is given by expanding recursively the sub-workﬂows of the dominating operators. Thus the topological sort of a workﬂow or its tree representation is a reduction of the original directed acyclic graph where the nodes and edges have been fully ordered.

An example of the hierarchical directed acyclic graph representing a RapidMiner DM workflow is given in Figure 3.5. The graph corresponds to a DM workflow that cross-validates a feature selection method followed by a classification model building step with the Naive Bayes classifier. RM X-Validation is a typical example of a dominating operator, which is itself a workflow: it has two basic blocks, a training block, which can be any arbitrary workflow that receives as input a dataset and outputs a model, and a testing


Figure 3.5: Example of a DM workﬂow that does performance estimation of a combination of feature selection and classiﬁcation.

block, which receives as input a model and a dataset and outputs a performance measure.

In particular, this specific cross-validation operator has a training block in which feature weights are computed by the RM WeightByInformationGain operator, after which a number of features is selected by the RM SelectByWeights operator. These steps are followed by the final model building, given by the RM NaiveBayes operator. The testing block is rather trivial, consisting of the RM ApplyModel operator followed by the RM Performance computation operator. The topological sort of this graph is given by the tree of Figure 3.6.
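The structure just described can be sketched as a nested parse tree whose depth-first traversal yields the node sequence of the tree representation. The `(label, children)` tuple encoding and the `training`/`testing` block labels are our own simplification, not RapidMiner's data model:

```python
# A dominating operator is a (label, [children]) pair; leaves are plain labels.
workflow = ("RM_X-Validation", [
    ("training", ["RM_WeightByInformationGain", "RM_SelectByWeights", "RM_NaiveBayes"]),
    ("testing", ["RM_ApplyModel", "RM_Performance"]),
])

def dfs(node):
    """Depth-first traversal: recursively expands dominating operators,
    yielding the node sequence of the parse tree."""
    if isinstance(node, str):
        return [node]
    label, children = node
    out = [label]
    for c in children:
        out.extend(dfs(c))
    return out
```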

### 3.6 Data Mining Workflow Frequent Pattern Extraction

In this section, we will focus on the extraction of frequent patterns from DM workflows. As already explained, DM workflows are characterized by structures that appear and are reused very often. The simplest example was the chain of operators RM WeightByInformationGain and RM SelectByWeights, which, when they appear together, actually perform feature selection. Along the same lines, we can have complex workflows


Figure 3.6: Topological order of the DM workﬂow given in Figure 3.5.

that perform tasks such as model evaluation (we have already shown the example of the composite cross-validation operator) or model building with bagging, boosting, stacking, etc. What is important here is the combined use of basic operators, which results in a specific higher-level meaning.

In addition, there might be interesting frequent structures that go beyond the concept of complex workflows whose structure is known a priori. A typical example could be frequent structures of the form: feature selection with information gain is often used together with decision trees. Such rules are not known a priori to novice data miners, and we can only discover them by analyzing DM workflows produced by experienced data miners. Finally, we can use such frequent structures as higher-level descriptors of workflows in the process of meta-mining for the discovery of success patterns.

So far we have been discussing structural components, sub-workflows, under the implicit assumption that these are defined over specific operators. However, exactly the same sub-workflow can be defined over different operators, or over families of operators. For example, model building with bagging can be defined for any classification algorithm. If we are not able to account for such generalizations, the frequent patterns that we discover will always be defined over specific operators. It is clear that we will be able to discover patterns with much stronger support if we include in the frequent pattern search procedure the means to search also over generalizations. This is one of the roles of the dmop ontology. If we examine again Figures 3.1 and 3.2, we see that we can easily define generalizations over operators by following the executes and implements relations in the first figure; these generalizations correspond to the different families of algorithms defined in the second figure.

In the following sections, we will describe how we search for frequent patterns over DM workﬂows that can also be deﬁned over generalizations of operators and not only over grounded operators. So we will be looking for generalized workﬂow patterns; similar


Figure 3.7: Augmented parse tree of the DM workflow originally given in Figure 3.6. Thin edges depict workflow decomposition, double lines depict dmop's concept subsumption, and bold lines depict dmop's implements relation.

work has been done in frequent sequence mining, see for example the generalized sequence pattern algorithm of Srikant and Agrawal (1996).

#### 3.6.1 Workflow Representation for Generalized Frequent Pattern Mining

As discussed in the previous section, DM workflows are hierarchical directed acyclic graphs, where the hierarchy is a result of the presence of composite operators. In addition, we will see in Section 6.4 that HTN plans are best represented in the form of a tree, which gives the task-to-method-to-operator decomposition. In both cases, it is more natural to look for frequent patterns under a tree representation. We have already seen that we can represent a DM workflow as a parse tree, which delivers, by a depth-first search, the unique topological order of the nodes of the workflow, i.e. the order of execution of the operators, with the hierarchy of the composite operators as the tree structure; see Figures 3.5 and 3.6 for an example of a DM workflow under a graph representation and its parse tree.

##### Augmented Parse Trees

Given the parse tree representation of a workflow, we will now show how we can augment it with the help of the dmop ontology, in view of deriving frequent patterns over generalizations of the components of the workflows. The generalizations that we will be using are given by the concepts, relations and subsumptions of the dmop ontology. Starting from the DM-Operator concept, we have seen in Section 3.4.1 that an operator o ∈ O implements some algorithm a ∈ A (Figure 3.1). In addition, we have seen in Section 3.4.2 that dmop defines a very refined taxonomy over the algorithms, a snapshot of which was given in Figure 3.2.

We also define a distance measure between two concepts C and D, related to the terminological inclusion axiom C ⊑ D, as the length of the shortest path between the two concepts. We will use this measure to order the subsumptions. Note that the subsumption order is not unique if we work with an inferred taxonomy, i.e., in an inferred taxonomy a concept can have multiple ancestors; in this work, however, we assume that each concept has at most one subsumption order.
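Under the single-parent assumption, this distance is just the length of the ⊑ chain from one concept up to the other. A minimal sketch over a toy fragment of the taxonomy (the dictionary stands in for dmop's inferred taxonomy):

```python
def subsumption_distance(c, d, parent):
    """Length of the subsumption chain from concept c up to concept d,
    assuming a single-parent taxonomy; None if d does not subsume c."""
    steps = 0
    while c is not None:
        if c == d:
            return steps
        c = parent.get(c)  # climb one ⊑ step
        steps += 1
    return None

tax = {"NaiveBayesNormal": "NaiveBayesAlgorithm",
       "NaiveBayesAlgorithm": "GenerativeAlgorithm"}
subsumption_distance("NaiveBayesNormal", "GenerativeAlgorithm", tax)  # → 2
```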

For instance, taking the implementation of the NaiveBayesNormal algorithm provided by the RM NaiveBayes operator, we have the following concept and role subsumption order:

RM NaiveBayes ⊑ ∀implements.NaiveBayesNormal
RM NaiveBayes ⊑ ∀implements.NaiveBayesAlgorithm
RM NaiveBayes ⊑ ∀implements.GenerativeAlgorithm

which reﬂects the taxonomic relations:

NaiveBayesNormal ⊑ NaiveBayesAlgorithm ⊑ GenerativeAlgorithm ⊑ ...

Given the parse tree of a DM workflow, we can derive its augmented parse tree using dmop's concept and role subsumptions given above. An augmented parse tree is derived from an original parse tree T by adding, for each node v ∈ T, the concept subsumption order between v and its parent π(v). For instance, the augmented parse tree of Figure 3.6 is given in Figure 3.7, where for each v of T we add the algorithm ancestors of v. Note that not all v's have a concept subsumption in the dmop ontology. Mining over augmented parse trees will allow us to capture workflow structures that are not limited to the use of ground operators, making it possible to detect abstract workflow patterns which would otherwise have gone unnoticed simply because they do not have strong support.
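The augmentation step can be sketched over the same toy fragment of the taxonomy; the two dictionaries are assumptions standing in for dmop's implements relation and its inferred taxonomy, not dmop's actual API:

```python
PARENT = {"NaiveBayesNormal": "NaiveBayesAlgorithm",
          "NaiveBayesAlgorithm": "GenerativeAlgorithm"}
IMPLEMENTS = {"RM_NaiveBayes": "NaiveBayesNormal"}

def ancestors(concept):
    """Concept subsumption order: the chain of ancestors up to the root."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain

def augment(operator):
    """Insert the implemented algorithm and its ancestors above the operator,
    most general concept first, as in the augmented parse tree."""
    algo = IMPLEMENTS.get(operator)
    if algo is None:
        return [operator]  # not every node has a concept in the ontology
    return list(reversed([algo] + ancestors(algo))) + [operator]

augment("RM_NaiveBayes")
# → ['GenerativeAlgorithm', 'NaiveBayesAlgorithm', 'NaiveBayesNormal', 'RM_NaiveBayes']
```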

##### Parse Trees Rewriting

Figure 3.8: Rewritten parse tree of the augmented parse tree given in Figure 3.7.

An augmented parse tree is derived by adding the concept subsumptions given by the dmop ontology. However, we would also like to be able to express composite operator and composite algorithm definitions and include them in the rewriting of the parse trees. dmop offers, for the moment, the possibility to define modeling algorithms that must be preceded or followed by other specific algorithms, using object properties, e.g.:

ModelingAlgorithm ⊑ ∃hasPreProcessor.DataProcessingAlgorithm
ModelingAlgorithm ⊑ ∃hasPostProcessor.ModelProcessingAlgorithm

A typical example of the second property is the C4.5 modeling algorithm, which first constructs an unpruned decision tree and is then followed by a ModelProcessingAlgorithm that does the pruning. We would like to be able to define more complex rules using concept equivalence, such as the following:

The first rule states that a FeatureWeightingAlgorithm followed by some DecisionRule is equivalent to a FeatureSelectionAlgorithm, i.e. the feature weights learned by the FeatureWeightingAlgorithm are used to select features based on some user decision, such as a TopKRule. In fact, the rule could be made much more explicit and define that the output of the FeatureWeightingAlgorithm should be passed as input to the DecisionRule; this would be the appropriate definition if we were using the full graph representation, in which the data flow is also described.

The second and third rules are specializations of the first over the classes of univariate and multivariate feature weighting and feature selection. Such ontological statements should also be taken into account when we process the parse tree of a workflow. This concept equivalence, when applied to the augmented parse tree of Figure 3.7, results in inserting the left-hand side of the equivalence, FeatureSelectionAlgorithm, among the ancestors of the RM WeightByInformationGain operator, and moving the RM DecisionRule operator instance, RM SelectByWeights, to the right side of the new node, at the same level as the FeatureWeightingAlgorithm concept, producing the rewritten augmented tree of Figure 3.8.
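A single pass of this kind of rewriting can be sketched in plain Python; this is only an illustrative stand-in for the actual rule-based machinery, with the predicates and operator names supplied by us:

```python
def rewrite_fs(children, is_fwa, is_dr):
    """One pass of the equivalence rule: an operator subsumed by
    FeatureWeightingAlgorithm immediately followed by a DecisionRule
    operator is wrapped under a new FeatureSelectionAlgorithm node."""
    out, i = [], 0
    while i < len(children):
        if i + 1 < len(children) and is_fwa(children[i]) and is_dr(children[i + 1]):
            out.append(("FeatureSelectionAlgorithm", [children[i], children[i + 1]]))
            i += 2
        else:
            out.append(children[i])
            i += 1
    return out

block = ["RM_WeightByInformationGain", "RM_SelectByWeights", "RM_NaiveBayes"]
rewritten = rewrite_fs(block,
                       is_fwa=lambda n: n == "RM_WeightByInformationGain",
                       is_dr=lambda n: n == "RM_SelectByWeights")
# → [('FeatureSelectionAlgorithm',
#     ['RM_WeightByInformationGain', 'RM_SelectByWeights']), 'RM_NaiveBayes']
```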

We use the PρLog rule-based system (Dundua et al., 2010; Kutsia, 2006; Marin and Kutsia, 2006), which allows us to define rewriting rules for trees with context and sequence variables. It is this representation of the parse trees that we will use to mine for frequent patterns.

#### 3.6.2 Tree Pattern Mining

We will use the tree miner of Zaki (2005) to search for frequent trees over the augmented tree representation of our DM workflows. A key concept of the algorithm is the embedded tree. A tree t′ is embedded in a tree t, denoted as t′ ⊑e t, if and only if there exists a mapping ϕ : Ot′ → Ot such that

∀u, v ∈ Ot′ : λ(u) = λ(ϕ(u))

∧ u ≺ v ⇔ ϕ(u) ≺ ϕ(v)

∧ π(u) = v ⇔ ϕ(v) ∈ π*(ϕ(u))

where ≺ denotes the left-to-right node order and π*(·) denotes the set of ancestors of a node, i.e. the transitive closure of the parent function π.

This subtree definition preserves the order of the children of a node as well as the transitive closure of its parents (ancestors). It is a less restrictive definition than the induced subtree definition, and as such, embedded subtrees are able to extract patterns "hidden" or embedded deep within large trees which might be missed by the induced subtree definition (Zaki, 2005).
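The three conditions can be checked mechanically for a candidate mapping. The node encoding below, a (label, preorder position, ancestor set) triple per node id, is our own sketch, not Zaki's data structure:

```python
def checks_embedding(phi, S, T):
    """Verify the embedded-subtree conditions for a candidate mapping phi.
    S and T map node ids to (label, preorder_position, ancestor_id_set)."""
    for u in S:
        if S[u][0] != T[phi[u]][0]:                    # labels preserved
            return False
        for v in S:
            if (S[u][1] < S[v][1]) != (T[phi[u]][1] < T[phi[v]][1]):
                return False                           # node order preserved
            if (u in S[v][2]) != (phi[u] in T[phi[v]][2]):
                return False                           # ancestry preserved
    return True

# Chain a -> b -> c; the pattern a -> c skips b but is still embedded.
T = {"ta": ("a", 0, set()), "tb": ("b", 1, {"ta"}), "tc": ("c", 2, {"ta", "tb"})}
S = {"sa": ("a", 0, set()), "sc": ("c", 1, {"sa"})}
checks_embedding({"sa": "ta", "sc": "tc"}, S, T)  # → True
```

Note how the mapping skips the middle node "tb" entirely, which is exactly the behaviour the induced subtree definition forbids.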


Figure 3.9: Embedded subtree (d) of trees (a), (b) and (c).

For instance, in Figure 3.9, given three trees T1, T2 and T3 and a support of 100%, the resulting embedded subtree is shown in Figure 3.9(d). This pattern has skipped the middle node in each tree, exposing the common ordered structure within the three trees. This is exactly what we want from our frequent pattern extractor, since it keeps both the total order of the parse trees and their common structure.

##### Key Definitions

Given a database (forest) D of trees, the tree miner algorithm will produce a set P of embedded subtrees (patterns). For a given tree T ∈ D and a pattern S ∈ P, if S ⊑e T, we say that S occurs in T; the support sup(S) of S is the set of trees of D in which S occurs. Two subtrees P, Q ∈ P are said to be equivalent if they share the same set of occurrences, i.e., sup(P) = sup(Q). We denote this equivalence by P ≡ Q, and the equivalence class of P is defined by [P] = {Q ∈ P : P ≡ Q} (Arimura, 2008).

Definition (Maximal subtree). Given an equivalence class [P], a subtree P ∈ [P] is said to be maximal if and only if there exists no strictly more specific subtree Q ∈ [P] such that P ⊏ Q.
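These definitions can be illustrated on a deliberately coarse model in which each "tree" is reduced to its set of node labels and occurrence is a subset test; this simplification is ours, purely for illustration:

```python
def support(pattern, database, occurs):
    """sup(P): the set of trees of the database D in which pattern P occurs."""
    return frozenset(i for i, t in enumerate(database) if occurs(pattern, t))

# Toy database: trees coarsened to label sets; occurrence = subset test.
D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}]
occ = lambda p, t: p <= t

pa, pab = frozenset({"a"}), frozenset({"a", "b"})
support(pa, D, occ)   # pa occurs in all three trees
support(pab, D, occ)  # pab occurs in trees 0 and 1 only

# pab and {"b"} share the same support set, so pab ≡ {"b"}; pab is the
# maximal member of that class, since no strict superset keeps this support.
```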

##### An Example

We will demonstrate the extraction of frequent tree patterns from workflows, using the knowledge encoded in dmop, with a simple scenario in which we have four DM workflows that evaluate, using cross-validation, the performance of feature selection and classification with different algorithms. More precisely, the four workflows are:

a) feature selection based on Information Gain and classification with NaiveBayes;

b) feature selection based on ReliefF and classification with C4.5;

c) feature selection with CFS and classification with C4.5;

d) feature selection using the wrapper approach, in which the search in the feature space is guided by the performance of NaiveBayes, and classification with NaiveBayes.

Their parse trees are given in Figure 3.10. Workflow a) performs univariate feature
