
Chapter 2 Background

2.2 The Data Mining Process

Data Mining (DM) or Knowledge Discovery in Databases (KDD)1 refers to the computational process in which low-level data are analyzed in order to extract high-level knowledge (Fayyad et al., 1996). This process is carried out through the specification of a DM workflow, i.e. the assembly of individual data transformation and analysis steps, implemented by DM operators, which compose the DM process with which a data analyst chooses to address her/his DM task. Standard workflow models such as the crisp-dm model (Chapman, Clinton, Kerber, Khabaza, Reinartz, Shearer, and Wirth, 2000) decompose the life cycle of this process into five principal steps: selection, pre-processing, transformation, learning or modeling, and post-processing. Each step can be further decomposed

1Following current usage, we use these two terms synonymously.


Figure 2.1: The knowledge discovery (KD) process and its steps, adapted from Fayyad et al. (1996).

into lower-level steps. At each workflow step, data objects are consumed by the respective operators, which either transform them or produce new data objects that flow to the next step following the control flow defined by the DM workflow. The process is repeated until relevant knowledge is created; see Figure 2.1.
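To illustrate the idea of a workflow as a chain of operators that consume and produce data objects, the following Python sketch chains toy operators along the selection, pre-processing and modeling steps. The operators and the data are invented for illustration; they are not taken from any actual KDSS.

```python
# A minimal sketch of a DM workflow as a chain of operators, mirroring the
# selection / pre-processing / modeling steps of the KD process. All operator
# names and data below are hypothetical.

def select(data):
    # selection: keep only complete records
    return [row for row in data if None not in row]

def normalize(data):
    # pre-processing: rescale the numeric feature to [0, 1]
    lo = min(row[0] for row in data)
    hi = max(row[0] for row in data)
    return [((row[0] - lo) / (hi - lo), row[1]) for row in data]

def majority_model(data):
    # modeling: a trivial "learner" that predicts the majority class
    labels = [label for _, label in data]
    return max(set(labels), key=labels.count)

def run_workflow(data, operators, learner):
    # each operator consumes the data objects produced by the previous step
    for op in operators:
        data = op(data)
    return learner(data)

raw = [(1.0, "a"), (None, "b"), (3.0, "a"), (5.0, "b"), (4.0, "a")]
model = run_workflow(raw, [select, normalize], majority_model)
print(model)  # the majority class among the complete records
```

Real workflows differ in that each step is itself a complex, parameterized operator, but the control-flow principle is the same.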

Despite recent efforts to standardize the DM workflow modeling process with a workflow model such as crisp-dm, the (meta-)analysis of DM workflows is becoming increasingly challenging with the growing number and complexity of available operators (Gil et al., 2007). Today's second-generation knowledge discovery support systems (KDSS) allow complex modeling of workflows and contain several hundred operators; the RapidMiner platform (Klinkenberg, Mierswa, and Fischer, 2007), in its extended version with Weka (Hall, Frank, Holmes, Pfahringer, Reutemann, and Witten, 2009) and R (R Core Team, 2013), currently offers more than 500 operators, some of which can have very complex data and control flows, e.g. bagging or boosting operators, in which several sub-workflows are interleaved. As a consequence, the number of workflows that can be modeled within these systems is on the order of several million, ranging from simple ones to very elaborate workflows with several hundred operators. The data analyst therefore has to carefully select among those operators the ones that can be meaningfully combined to address his/her knowledge discovery problem. However, even the most sophisticated data miner can be overwhelmed by the complexity of such modeling, having to rely on his/her experience and biases as well as on thorough experimentation in the hope of finding the best operator combination.

With the advance of new-generation KDSS that provide even more advanced functionalities, it becomes important to provide automated support to the user in the workflow modeling process, an issue that has been identified as one of the top-ten challenges in data mining (Yang and Wu, 2006). During the last decade, a rather limited number of systems have been proposed to address this challenge. In the next section, we review the two most important research avenues.

2.3 State of the Art in Data Mining Workflow Design Support

In order to support the DM user in building her/his workflows, two main approaches have been developed over the last decades: the first takes a planning approach based on an ontology of DM operators to automatically design DM workflows, while the second, referred to as meta-learning, makes use of learning methods to address, among other tasks, the task of algorithm selection for a given dataset.

2.3.1 Ontology-based Planning of DM Workflows

Bernstein, Provost, and Hill (2005) propose an ontology-based Intelligent Discovery Assistant (ida) that plans valid DM workflows – valid in the sense that they can be executed without any failure – according to basic descriptions of the input dataset, such as attribute types, presence of missing values, number of classes, etc. By describing in a DM ontology the input conditions and output effects of DM operators, according to the three main steps of the DM process – pre-processing, modeling and post-processing, see Figure 2.2 – ida systematically enumerates with a workflow planner all possible valid operator combinations, i.e. workflows, that fulfill the data input request. A ranking of the workflows is then computed according to user-defined criteria, such as speed or memory consumption, which are measured from past experiments.
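The precondition/effect style of planning behind such assistants can be sketched as follows. The operator descriptions, dataset properties and the exhaustive enumeration below are a deliberately simplified toy version, not the actual ida implementation.

```python
# A toy illustration of planning over operator pre-conditions and effects:
# each operator declares the dataset properties it requires and those it
# produces, and the planner chains operators until the goal is reached.
# Operator names and properties are hypothetical.

OPERATORS = {
    # name: (required properties, properties added after application)
    "impute_missing": ({"has_missing"}, {"complete"}),
    "discretize":     ({"complete"},    {"complete", "nominal"}),
    "naive_bayes":    ({"complete", "nominal"}, {"model"}),
    "decision_tree":  ({"complete"},    {"model"}),
}

def plan(state, goal, prefix=(), depth=4):
    """Enumerate all operator sequences that reach `goal` from `state`."""
    if goal <= state:
        return [prefix]
    if depth == 0:
        return []
    plans = []
    for name, (pre, post) in OPERATORS.items():
        if pre <= state and name not in prefix:
            plans += plan(state | post, goal, prefix + (name,), depth - 1)
    return plans

# a dataset described only by its basic properties, as in ida
workflows = plan({"has_missing"}, {"model"})
for wf in sorted(workflows):
    print(" -> ".join(wf))
```

Note that such a planner only guarantees that the workflows are executable; it says nothing about how well they will perform, which is precisely the limitation discussed below.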

Žáková, Křemen, Železný, and Lavrač (2011) propose the kd ontology to support automatic design of DM workflows for relational DM. In this ontology, relational DM algorithms and datasets are modeled with the semantic web language OWL-DL, thereby providing semantic reasoning and inference to query over a DM workflow repository.

Similarly to ida, the ontology characterizes DM algorithms with their data input/output specifications to address DM workflow planning. The authors have developed a translator from their ontology representation to the Planning Domain Definition Language (PDDL) (McDermott et al., 1998), with which they can produce abstract directed-acyclic-graph workflows using an FF-style planning algorithm (Hoffmann, 2001). They demonstrate their approach on genomic and product engineering (CAD) use cases where complex workflows are produced which can make use of relational data structures and background knowledge.


Figure 2.2: A snapshot of the DM ontology used by the IDEA system (Bernstein et al., 2005).

More recently, the e-LICO project2 featured another ida built upon a planner which constructs DM plans following a hierarchical task network (HTN) planning approach.

The specification of the HTN is given in the Data Mining Workflow (dmwf) ontology (Kietz, Serban, Bernstein, and Fischer, 2009). Like its predecessors, the e-LICO ida has been designed to identify operators whose preconditions are met at a given planning step in order to plan valid DM workflows, and it performs an exhaustive search in the space of possible DM plans.

None of the three DM support systems that we have just discussed considers the expected performance of the workflows they plan with respect to the DM task that they are supposed to address. For example, if our goal is to provide workflows that solve a classification problem, in planning these workflows we would like to consider a measure of classification performance, such as accuracy, and deliver workflows that optimize it. All the discussed DM support systems deliver an extremely large number of plans, i.e. DM workflows, which are typically ranked with simple heuristics, such as workflow complexity or expected execution time, leaving the user at a loss as to which is the best workflow in terms of the expected performance on the DM task that he/she needs to address. Even worse, the planning search space can be so large that the systems can fail to complete the planning process; see for example the discussion in Kietz et al. (2012).

2http://www.e-lico.eu

2.3.2 Meta-Learning

There has been considerable work that tries to support the user in view of performance maximization for a very specific part of the DM process, that of modeling or learning.

A number of approaches have been proposed, collectively identified as meta-learning or learning-to-learn (Brazdil, Giraud-Carrier, Soares, and Vilalta, 2008; Hilario, 2002; Kalousis, 2002; Kalousis and Theoharis, 1999; Soares and Brazdil, 2000). The main idea in meta-learning is that, given an unseen dataset, the system should be able to select or rank a pool of learning algorithms with respect to their expected performance on this dataset; this is referred to as the algorithm selection task (Smith-Miles, 2008). To do so, one builds a meta-learning model from the analysis of past learning experiments, searching for associations between algorithms' performances and dataset characteristics.

In the statlog (King, Feng, and Sutherland, 1995; Michie, Spiegelhalter, Taylor, and Campbell, 1994) and metal projects, the members compare a number of classification algorithms on large real-world datasets in order to understand the relation between dataset characteristics and algorithms' performances: they use statistical characteristics, as well as information-theoretic measures and the landmarking approach (Peng, Flach, Soares, and Brazdil, 2002), to build a meta-learning model that can predict the class, either best or rest, of an algorithm on unseen datasets, relative to its performance on seen datasets. Other works on algorithm selection include the use of algorithm learning curves to estimate the performance of algorithms on dataset samples (Leite and Brazdil, 2005, 2010), the use of geometrical and topological dataset characteristics to capture the geometrical complexity of classification problems (Ho and Basu, 2002, 2006), and various regression- and ranking-based approaches to build a meta-model; these are most notably non-parametric methods, including instance-based learning, rule-based learning, decision tree and naive Bayes algorithms (Bensusan and Kalousis, 2001; Kalousis and Hilario, 2001; Kalousis and Theoharis, 1999; Soares and Brazdil, 2000), and more recently a random forest approach built on the relative performances of pairs of algorithms over a set of datasets to extract pairwise meta-rules (Sun and Pfahringer, 2013).
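The meta-learning setup described above can be sketched schematically: each past experiment yields a vector of dataset characteristics together with the algorithm that performed best, and a meta-model then predicts the best algorithm for an unseen dataset. In the sketch below, the meta-features, the past experiments and the 1-nearest-neighbour meta-model are all invented for brevity; real meta-learners use much richer characterizations and learners.

```python
# A schematic meta-learning example: dataset characteristics -> best algorithm.
# Meta-features and labels are hypothetical.

import math

# (n_instances, n_features, class_entropy) -> best algorithm on that dataset
META_EXAMPLES = [
    ((100,   5, 0.9), "naive_bayes"),
    ((200,  10, 0.8), "naive_bayes"),
    ((5000, 50, 0.4), "svm"),
    ((8000, 80, 0.3), "svm"),
]

def meta_predict(meta_features):
    # 1-NN over log-scaled size features: recommend the algorithm that was
    # best on the most similar past dataset
    def dist(a, b):
        return math.dist((math.log(a[0]), math.log(a[1]), a[2]),
                         (math.log(b[0]), math.log(b[1]), b[2]))
    nearest = min(META_EXAMPLES, key=lambda ex: dist(ex[0], meta_features))
    return nearest[1]

print(meta_predict((150, 8, 0.85)))   # a small, high-entropy dataset
print(meta_predict((6000, 60, 0.35))) # a large, low-entropy dataset
```

The crucial point is that the meta-model never inspects the algorithms themselves, only their observed performances, which is the black-box limitation discussed at the end of this section.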

It is also worth mentioning the works on portfolio-based algorithm selection for the propositional satisfiability (SAT) problem (Nudelman, Leyton-Brown, Devkar, Shoham, and Hoos, 2004a; Nudelman, Leyton-Brown, Hoos, Devkar, and Shoham, 2004b; Xu, Hutter, Hoos, and Leyton-Brown, 2008). These works follow the same meta-learning approach that we have just described: they build a portfolio of SAT-solver algorithms, from which one can select the best-performing algorithm for a given problem instance. As in meta-learning, the task is to build for each SAT-solver algorithm an empirical hardness model, i.e. a meta-model, which can predict the runtime or cost of the algorithm on a selected problem instance according to the instance's features. Dataset descriptors for SAT problems are provided by domain expert knowledge and include various statistical features of the logical problems, such as the number of clauses, the number of variables, variable-clause graph features, proximity to a Horn formula, etc. Experimental results reported by approaches like SATzilla (Xu et al., 2008, 2012) on different SAT-solver competitions have shown the effectiveness of this approach.
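The empirical hardness model idea can be sketched as follows: one model per solver maps instance features to a predicted runtime, and the portfolio runs the solver with the lowest prediction. The features and linear model weights below are made up, assuming per-solver models were already fitted on past runtime data.

```python
# A sketch of portfolio-based selection in the SATzilla spirit. The linear
# runtime models and their weights are hypothetical.

# instance features: (n_variables, n_clauses)
def features(instance):
    n_vars, n_clauses = instance
    return (1.0, n_vars, n_clauses, n_clauses / n_vars)  # bias + clause/var ratio

# per-solver weights of a (pretend, already-fitted) linear runtime model
HARDNESS_MODELS = {
    "solver_a": (0.5, 0.002, 0.001, 2.0),   # fast on low clause/var ratio
    "solver_b": (5.0, 0.001, 0.0005, 0.1),  # flat, robust on hard instances
}

def predicted_runtime(solver, instance):
    return sum(w * x for w, x in zip(HARDNESS_MODELS[solver], features(instance)))

def select_solver(instance):
    # run the solver whose hardness model predicts the lowest runtime
    return min(HARDNESS_MODELS, key=lambda s: predicted_runtime(s, instance))

easy = (100, 200)  # clause/var ratio 2.0
hard = (100, 430)  # clause/var ratio 4.3, near the phase transition
print(select_solver(easy))
print(select_solver(hard))
```

In SATzilla the models are regressions fitted on measured runtimes over dozens of features; the sketch only conveys the selection mechanism.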

Besides the algorithm selection task, there is the task of model – or parameter – selection3, whose goal is to adjust the procedural – or search/preference – bias in order to give priority to certain hypotheses over others in the hypothesis space of a learning algorithm. Ali and Smith-Miles (2006) propose a meta-learning approach to learn the best kernel to use within support vector machines (SVMs) for classification. The same authors use a similar approach to determine the best method for selecting the width of the RBF kernel (Ali and Smith-Miles, 2007). Another approach to parameter selection is algorithm configuration (Coy, Golden, Runger, and Wasil, 2001; Gratch and Dejong, 1992; Hutter, Hoos, Leyton-Brown, and Stützle, 2009; Hutter, Hoos, and Leyton-Brown, 2011; Minton, 1993; Terashima-Marín and Ross, 1999). In these works, the goal is to search for the best-performing parameter configuration of an algorithm when applied to a given problem instance. Various search algorithms have been proposed, including hill-climbing (Gratch and Dejong, 1992), beam search (Minton, 1993), genetic algorithms (Terashima-Marín and Ross, 1999), experimental design approaches (Coy et al., 2001) and more recently a trajectory-based method (Hutter et al., 2009), all of which try to sequentially minimize the cost related to the application of a given algorithm configuration on the problem instance.

More recently, Thornton, Hutter, Hoos, and Leyton-Brown (2013) propose a novel approach called AutoWeka, which is able to automatically and simultaneously choose a learning algorithm and its hyper-parameters for empirical performance optimization on a given learning problem; they combine the Weka data mining platform (Hall et al., 2009) with a Bayesian procedure, the sequential model-based algorithm configuration smac (Hutter et al., 2011), to estimate the performance of a learning algorithm given the most promising candidate set of parameters selected a priori. This approach makes use of conditional parameters, i.e. parameters that are hierarchically conditioned by some others, in order to build complex algorithm configurations and selections. For instance,

3Note that we do not follow the same terminology as Thornton et al. (2013), where the authors use the term "model selection" to refer to what we call algorithm selection and "hyper-parameter optimization" to what we call "model selection".

algorithm selection is carried out with what the authors call "root-level" parameters, one per algorithm, which condition the selection of learning parameters with respect to the selected algorithm. It is also possible to configure the parameters of a feature selection component, like the type of search (greedy or best-first) or the type of feature evaluation (Relief or CFS), from which one can select and configure complex feature selection plus classification workflows for a given learning problem.
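The notion of a root-level parameter conditioning the rest of the search space can be sketched as follows. The encoding below is hypothetical and much simpler than AutoWeka's actual parameter space; it only shows how a top-level algorithm choice activates a sub-space of hyper-parameters.

```python
# An illustrative encoding of a conditional parameter space: a root-level
# parameter picks the algorithm, and only the hyper-parameters conditioned
# on that choice become active. Names and value ranges are hypothetical.

import random

SPACE = {
    "algorithm": ["svm", "decision_tree"],            # root-level parameter
    "svm": {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]},
    "decision_tree": {"max_depth": [3, 5, 10]},
}

def sample_configuration(rng):
    algo = rng.choice(SPACE["algorithm"])
    # only the sub-space conditioned on the chosen algorithm is sampled
    params = {name: rng.choice(values) for name, values in SPACE[algo].items()}
    return {"algorithm": algo, **params}

rng = random.Random(0)
for _ in range(3):
    print(sample_configuration(rng))
```

A configurator such as smac would not sample uniformly but would model the performance of past configurations to propose promising ones; the conditional structure, however, is as sketched.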

Overall, meta-learning differs fundamentally from base-level learning in its objectives: while the latter assumes a fixed bias under which learning will occur to find the best hypothesis – or search bias – for the given problem (Mitchell, 1997), the former seeks the algorithm – or representational – bias which will best fit the problem at hand. Representational bias specifies the structure of an algorithm, e.g. the cost function it uses and the decision boundary it draws, such as linear versus non-linear. Search bias specifies how this structure is built during learning. Thus, while base-level learning is only concerned with model selection, meta-learning amounts to dynamically adjusting or selecting the right learning or search bias by which an algorithm or its model will restrict the hypothesis space of a given problem (Vilalta and Drissi, 2002).

There are, however, two main limitations in meta-learning as well as in the algorithm portfolio and configuration approaches that we have just described. First, algorithms are considered as black boxes; the only relation considered between learning methods is their relative performance on datasets. As a consequence, one first has to run an algorithm on one's datasets in order to characterize it. Moreover, there is no way to generalize the learned meta-models to select algorithms that have not been tried in the past; in order to account for a new/unseen algorithm, one has to train it before being able to draw conclusions on its relations with other algorithms.

Second, by focusing mainly on the learning or modeling phase of the knowledge discovery process, meta-learning simply ignores the other steps that compose this process, such as pre-processing, transformation and post-processing, which can also impact the performance of the process. For all these reasons, it is not yet possible in meta-learning to automatically plan/design DM workflows for a given dataset as in ontology-based DM workflow planning systems.

2.4 Discussion

In Section 2.2, we presented the core problem that we will address in this thesis: to provide intelligent automated support to the DM user in her DM workflow modeling process. We have reviewed in Section 2.3 the state-of-the-art approaches that try to address this problem. On the one hand, there are knowledge-based planning systems that rely on an ontology of DM operators to build valid DM workflows according to the input/output conditions of the candidate operators. On the other hand, there are the meta-learning and algorithm portfolio approaches, which learn associations between dataset characteristics and algorithms' performances to address the task of algorithm selection for a given learning problem.

As already said, none of these approaches can design potentially new DM workflows, i.e. combinations of DM algorithms, such that their construction is optimized with respect to a performance measure like classification accuracy, for the main reason that DM algorithms are viewed in these methods as independent black-box components. To go beyond this limitation, we will propose to uncover not only relations between datasets and algorithms, as in meta-learning, but also relations within (combinations of) learning algorithms that lead to good or bad performance on datasets. More precisely, while the previous meta-learning approaches aim at characterizing a learning problem in order to select the most appropriate algorithm by taking into consideration known structural properties of the problem (Smith-Miles, 2008), we will focus in this thesis on the characteristics, i.e. structural properties, of learning algorithms and the relations they can have between them inside a DM workflow. We will map these workflow characteristics to dataset characteristics according to the performance that the former achieve on the latter, in order to build or select sets of DM operators, i.e. workflows, which are the most appropriate for a given learning problem in terms of their performance.

Our work is close to the recent works on automatic algorithm configuration for algorithm selection like AutoWeka (Thornton et al., 2013) and AutoFolio (Lindauer, Hoos, Hutter, and Schaub, 2015), but with the ability to generalize our meta-models to unseen datasets and workflows. More precisely, we will combine the two approaches described above as follows. From the ontology-based approach, we will exploit a new DM ontology, the Data Mining Optimization (dmop) ontology (Hilario et al., 2009, 2011), the goal of which is to pry open DM algorithms and characterize them by their learning behaviors. On top of this ontology, we will derive new meta-models that will associate dataset and algorithm characteristics with respect to their relative performance, to support the task of DM operator/workflow selection and planning in view of performance optimization. As we will see in the next Chapters, our work provides a unique blending of data mining, machine learning and planning algorithms, in which we build the first system, to our knowledge, that is able to design potentially new DM workflows such that their performance, e.g. classification accuracy, is optimized for a given dataset.

Chapter 3 Data Mining Workflow Pattern Analysis

3.1 Introduction

In this Chapter, we will describe a generalized pattern mining approach to extract frequent workflow patterns from the annotated graph structure of DM workflows. These relational patterns will give us provisions to build our meta-mining framework, in which we will learn how to associate dataset and algorithm/workflow characteristics with respect to their relative performance. To extract the workflow patterns, we will use the Data Mining OPtimization (dmop) ontology (Hilario et al., 2009, 2011), which we describe in Section 3.4. Before proceeding to the high-level description of dmop, we first review related work in Section 3.3. In Section 3.5, we give a formal definition of DM workflows, and in Section 3.6 we describe our generalized pattern mining approach. Finally, we discuss our approach in Section 3.7.

3.2 Motivation

In Chapter 2, we introduced the notion of DM workflows. As we will see in Section 3.5, these are hierarchical graph structures composed of various data transformation and analysis steps, and they can be very complex, i.e. they can contain several nested sub-structures that specify complex operator combinations; e.g. an operator of type boosting is typically composed of several interleaved combinations of learning algorithms with which different learning models are produced. These structures are inherently difficult to