• Aucun résultat trouvé

The outline of the thesis will be the following. In Chapter 2, we will define the context of the thesis where we will introduce the problem of supporting DM users in the modeling of their DM process. We will present the state-of-the-art approaches which fall broadly into two main categories: ontology-based DM workflow planning systems and meta-learning.

We will discuss the limitations of the current approaches from which we will define the starting point of the thesis.

In Chapter 3, we will describe theData Mining Optimization(dmop) ontology (Hilario et al., 2009, 2011), a new conceptual framework for data mining and machine learning which describes in a formal manner DM algorithms such as classification and feature selection algorithms by their learning components such as cost function and optimization strategy. We will usedmopin order to extract abstract relational features of DM workflows which we will use to characterize the different operator combinations that can compose a DM workflow.

In Chapter 4, we will describe the Rice model (Rice, 1976) which is used in meta-learning for defining the algorithm selection task, and which we will extend in order to define our meta-mining framework by accounting on workflow characteristics. We will describe the three meta-mining scenarios that we will address in the thesis: learning work-flow preferencesfor a new dataset, learning dataset preferences for a given new workflow, and learning dataset-workflow preferences for a given new pair of dataset-workflow. We will finally describe a real-world meta-mining problem on which we will experiment with the three above scenarios in a classification setting.

In Chapter 5, we will develop a metric-learning approach for hybrid recommendation in meta-mining. We will develop two algorithms to learn similarity measures in the fea-ture space of datasets and workflows: in the first algorithm, we will learn two homogenous metrics, one for datasets and one for workflows; and in the second algorithm, we will learn a heterogeneous metric which will measure the similarity between datasets and workflows according to their relative performance. We experimentally show that the approach signif-icantly outperforms the state-of-the-art meta-learning methods for the three meta-mining

1.2. Thesis Outline

cold start scenarios introduced in Chapter 3.

In Chapter 6, we will explore the combination of the meta-mining models with a DM workflow planner (Kietz, Serban, Bernstein, and Fischer, 2009, 2012). At each planning steps the planner will select the candidate partial workflows for which the expected per-formance will be maximum with respect to a given mining problem. We will test with two planning scenarios: in the first we will design DM workflows from an already experimented set of DM operators, in the second we will design DM workflows with DM operators that have never been experimented before. In both scenarios we provide performance improve-ments over the default baselines with significant results for the first scenario.

In Chapter 7, we will develop a learning-to-rank approach for cold start recommen-dations. The approach is based on LambdaMART (Burges, 2010), the state-of-the-art learning to rank algorithm, which we will extend to learn user and item new represen-tation, that is low-rank user and item profiles that describe their behavior in a latent space. In addition, we will propose several similarity-based regularization methods to learn robust profiles. We evaluate our approach on a meta-mining problem but also on a standard movie recommendation problem, MovieLens. In the two experiments we outper-form in a statistically significant manner the different baselines including LambdaMART itself for the cold start problem, and CofiRank, the state-of-the-art matrix factorization algorithm in collaborative ranking (Weimer, Karatzoglou, Le, and Smola, 2007), for the task of matrix completion.

We finally conclude and outline future work in Chapter 8.

Background

2.1 Introduction

In this chapter, we will give the context in which our work will take place. We will present in Section 2.2 the research problem that lies at the heart of our thesis, namely the problem of designing valid and useful data mining workflows. Then we will describe in Section 2.3 state-of-the-art DM workflow support systems which broadly fall into two main categories:

ontology-based workflow planning system and meta-learning. We finally discuss the main limitations of those systems in Section 2.4.

2.2 The Data Mining Process

Data Mining(DM) or Knowledge Discovery in Databases (KDD)1 refers to the computa-tional process in which low-level data are analyzed in order to extract high-level knowledge (Fayyad et al., 1996). This process is carried out through the specification of a DM work-flow, i.e. the assembly of individual data transformations and analysis steps, implemented by DM operators, which composes the DM process with which a data analyst chooses to address her/his DM task. Standard workflow models such as the crisp-dm model (Chapman, Clinton, Kerber, Khabaza, Reinartz, Shearer, and Wirth, 2000) decomposes the life cycle of this process into five principal steps; selection, pre-processing, transfor-mation, learning or modeling, and post-processing. Each step can be further decomposed

1Following current usage, we use these two terms synonymously.

2.2. The Data Mining Process

Figure 2.1: The knowledge discovery (KD) process and its steps, adapted from Fayyad et al. (1996).

into lower steps. At each workflow step, data objects are consumed by the respective operators which either transform them or produce new data objects that flow to the next step following the control flow defined by the DM workflow. The process is repeated until relevant knowledge is created, see Figure 2.1.

Despite the recent efforts to standardize the DM workflow modeling process with a workflow model such ascrisp-dm, the (meta-)analysis of DM workflows is becoming in-creasingly challenging with the growing number and complexity of available operators (Gil et al., 2007). Today’s second generation knowledge discovery support systems (KDSS) al-low complex modeling of workflows and contain several hundreds of operators; the Rapid-Miner platform (Klinkenberg, Mierswa, and Fischer, 2007), in its extended version with Weka (Hall, Frank, Holmes, Pfahringer, Reutemann, and Witten, 2009) and R (R Core Team, 2013), proposes actually more than 500 operators, some of which can have very complex data and control flows, e.g. bagging or boosting operators, in which several sub-workflows are interleaved. As a consequence, the possible number of sub-workflows that can be modeled within these systems is on the order of several millions, ranging from simple to very elaborated workflows with several hundred operators. Therefore the data analyst has to carefully select among those operators the ones that can be meaningfully combined to address his/her knowledge discovery problem. However, even the most sophisticated data miner can be overwhelmed by the complexity of such modeling, having to rely on his/her experience and biases as well as on thorough experimentation in the hope of finding out the best operator combination.

With the advance of new generation KDSS that provide even more advanced func-tionalities, it becomes important toprovide automated support to the user in the workflow modelling process, an issue that has been identified as one of the top-ten challenges in data

mining (Yang and Wu, 2006). During the last decade, a rather limited number of systems have been proposed to address this challenge. In the next section, we will review the two most important research avenues.

2.3 State of the Art in Data Mining Workflow Design Sup-port

In order to support the DM user in the building of her/his workflows, two main approaches have been developed in the last decades: the first one takes a planning approach based on an ontology of DM operators to automatically design DM workflows, while the second one, referred as meta-learning, makes use of learning methods to address, among other tasks, the task of algorithm selection for a given dataset.

2.3.1 Ontology-based Planning of DM Workflows

Bernstein, Provost, and Hill (2005) propose an ontology-based Intelligent Discovery As-sistant(ida) that plans valid DM workflows – valid in the sense that they can be executed without any failure – according to basic descriptions of the input dataset such as attribute types, presence of missing values, number of classes, etc. By describing into a DM on-tology the input conditions and output effects of DM operators, according to the three main steps of the DM process, pre-processing, modeling and post-processing, see Figure 2.2, ida systematically enumerates with a workflow planner all possible valid operator combinations, workflows, that fulfill the data input request. A ranking of the workflows is then computed according to user defined criteria such as speed or memory consumption which are measured from past experiments.

Z´akov´a, Kremen, Zelezny, and Lavrac (2011) propose the kd ontology to support automatic design of DM workflows for relational DM. In this ontology, DM relational algorithms and datasets are modeled with the semantic web language OWL-DL, provid-ing thereby semantic reasonprovid-ing and inference to query over a DM workflow repository.

Similarly toida, the ontology characterizes DM algorithms with their data input/output specifications to address DM workflow planning. The authors have developed a translator from their ontology representation to thePlanning Domain Definition Language(PDDL), (McDermott et al., 1998), with which they can produce abstract directed-acyclic graph workflows using a FF-style planning algorithm, (Hoffmann, 2001). They demonstrate their approach on genomic and product engineering (CAD) use-cases where complex workflows are produced which can make use of relational data structure and background knowledge.

2.3. State of the Art in Data Mining Workflow Design Support

Figure 2.2: A snapshot of the DM ontology used by the IDEA system (Bernstein et al., 2005).

More recently, the e-LICO project2 featured another ida built upon a planner which constructs DM plans following a hierarchical task networks (HTN) planning approach.

The specification of the HTN is given in the Data Mining Workflow (dmwf) ontology, (Kietz, Serban, Bernstein, and Fischer, 2009). As its predecessors the e-LICO ida has been designed to identify operators which preconditions are met at a given planning step in order to plan valid DM workflows and does an exhaustive search in the space of possible DM plans.

None of the three DM support systems that we have just discussed consider the eventual performance of the workflows they plan with respect to the DM task that they are supposed to address. For example if our goal is to provide workflows that solve a classification problem, in planning these workflows we would like to consider a measure of classification performance, such as accuracy, and deliver workflows that optimize it. All the discussed DM support systems deliver an extremely large number of plans, DM workflows, which are typically ranked with simple heuristics, such as workflow complexity or expected execution time, leaving the user at a loss as to which is the best workflow in terms of the expected performance in the DM task that he/she needs to address. Even worse the planning search space can be so large that the systems can even fail to complete the planning process, see for example the discussion in Kietz et al. (2012).

2http://www.e-lico.eu

2.3.2 Meta-Learning

There has been considerable work that tries to support the user in view of performance maximization for a very specific part of the DM process, that of modeling or learning.

A number of approaches have been proposed, collectively identified as meta-learning or learning-to-learn, (Brazdil, Giraud-Carrier, Soares, and Vilalta, 2008; Hilario, 2002;

Kalousis, 2002; Kalousis and Theoharis, 1999; Soares and Brazdil, 2000). The main idea in meta-learning is that given an unseen dataset the system should be able to select or rank a pool of learning algorithms with respect to their expected performance on this dataset; this is referred as thealgorithm selectiontask (Smith-Miles, 2008). To do so, one builds a meta-learning model from the analysis of past learning experiments, searching for associations between algorithm’s performances and dataset characteristics.

In thestatlog(King, Feng, and Sutherland, 1995; Michie, Spiegelhalter, Taylor, and Campbell, 1994) and metal projects, the members compare a number of classification algorithms on large real-world datasets in order to understand the relation between dataset characteristics and algorithm’s performances: they use statistical characteristics, as well as information-theoretic measures and the landmarking approach (Peng, Flach, Soares, and Brazdil, 2002), to build a meta-learning model that can predict the class, either best or rest, of an algorithm on unseen datasets, relatively to its performance on seen datasets. Other works on algorithm selection include the use of algorithm learning curves to estimate the performance of algorithms on dataset samples (Leite and Brazdil, 2005, 2010), the building of geometrical plus topological dataset characteristics to draw the geometrical complexity of classification problems (Ho and Basu, 2002, 2006), and various regression- and ranking-based approaches to build a meta-model; these are most notably non-parametric methods including instance-based learning, rule-based learning, decision tree and naive bayes algorithms (Bensusan and Kalousis, 2001; Kalousis and Hilario, 2001;

Kalousis and Theoharis, 1999; Soares and Brazdil, 2000), and more recently a random forest approach built on the relative performances of pairs of algorithms over a set of datasets to extract pairwise meta-rules (Sun and Pfahringer, 2013).

It is also worth mentioning the works on portfolio-based algorithm selection for propo-sitional satisfiability problem (SAT) (Nudelman, Leyton-Brown, Devkar, Shoham, and Hoos, 2004a; Nudelman, Leyton-Brown, Hoos, Devkar, and Shoham, 2004b; Xu, Hutter, Hoos, and Leyton-Brown, 2008). These works follow the same meta-learning approach that we just have described where they build a portfolio of SAT-solver algorithms, from which one can select the best performing algorithm for a given problem instance. As in meta-learning, the task is to build for each SAT-solver algorithm an empirical hardness

2.3. State of the Art in Data Mining Workflow Design Support

model, i.e. a meta-model, which can predict the runtime or cost of the algorithm on a selected problem instance according to the instance’s features. Dataset descriptors for SAT problems are provided by domain expert knowledge, which include various statistical features of the logical problems such as number of clauses, number of variables, variable-clause graph features, proximity to Horn Formula, etc. Experiment results reported by approaches like SATzilla (Xu et al., 2008, 2012) on different SAT-solver competitions have showed the effectiveness of this approach.

In addition to the algorithm selection task is the task of model – or parameter – selection3, whose goal is to adjust the procedural – or search/preference – bias in order to give priority to certain hypotheses over the others in the hypothesis space of a learning algorithm. Ali and Smith-Miles (2006) propose a meta-learning approach to learn the best kernel to use within support vector machines (SVMs) for classification. The same authors use a similar approach to determine the best method for selecting the width of the RBF kernel (Ali and Smith-Miles, 2007). Another approach for parameter selection is the algorithm configuration approach (Coy, Golden, Runger, and Wasil, 2001; Gratch and Dejong, 1992; Hutter, Hoos, Leyton-Brown, and St¨utzle, 2009; Hutter, Hoos, and Leyton-Brown, 2011; Minton, 1993; Terashima-Mar´ın and Ross, 1999). In these works, the goal is to search for the best performing parameter configuration of an algorithm when applied on a given problem instance. Various search algorithms have been proposed including hill-climbing (Gratch and Dejong, 1992), beam search (Minton, 1993), genetic algorithms (Terashima-Mar´ın and Ross, 1999), experimental design approaches (Coy et al., 2001) and more recently a trajectory-based method (Hutter et al., 2009), all of which try to sequentially minimize the cost related to the application of a given algorithm configuration on the problem instance.

More recently, Thornton, Hutter, Hoos, and Leyton-Brown (2013) propose a novel approach called AutoWekawhich is able to automatically and simultaneously choose a learning algorithm and its hyper-parameters for empirical performance optimization on a given learning problem; they combine theWekadata mining platform (Hall et al., 2009) together with a bayesian procedure, the sequential model-based algorithm configuration smac(Hutter et al., 2011), to estimate the performance of a learning algorithm given the most promising candidate set of parameters selected a priori. This approach makes use of conditional parameters, i.e. parameters that are hierarchically conditioned by some others, in order to build complex algorithm configurations and selections. For instance,

3Note that we do not follow the same terminology as in Thornton et al. (2013) where the authors use the terms ”model selection” to refer to what we call algorithm selection and ”hyper-parameter optimization”

to what we call ”model selection”.

algorithm selection is carried out with what the authors call ”root-level” parameters, one per algorithm, which condition the selection of learning parameters with respect to the selected algorithm. It is also possible to configure feature selection component parameters, like the type of search (greedy or best first) or the type of feature evaluation (Relief or CFS), from which one can select and configure complex feature selection plus classification workflows for a given learning problem.

Overall, meta-learning differs fundamentally from base-level learning in its objectives;

while the latter assumes a fixed bias in which learning will occur to find the best hypoth-esis – orsearch – bias for the given problem (Mitchell, 1997), the former will seek for the algorithm – orrepresentational – bias which will best fit the problem at hand. Represen-tational bias specifies the structure of an algorithm; e.g. the cost function it uses and the decision boundary it draws such as linear versus non-linear. Search bias specifies how this structure is built during learning. Thus while base-level learning is only concerned with model selection, meta-learning amounts to adjust or select dynamically the right learning or search bias by which an algorithm or its model will restrict the hypothesis space of a given problem (Vilalta and Drissi, 2002).

There are however two main limitations in meta-learning as well as in the algorithm portfolio and configuration approaches that we just have described here. First, algorithms are considered as black-boxes; the only relation which is considered between learning meth-ods is their relative performance on datasets. As a consequence, one has to experiment first an algorithm on his datasets in order to characterize it. Moreover, there is no way where one can generalize the learned meta-models to select algorithms which have not been experimented in the past; in order to account for a new/unseen algorithm, one has to train it before being able to draw conclusions on its relations with other algorithms.

Second, by focusing mainly on the learning or modeling phase of the knowledge discov-ery process, meta-learning simply ignores the other steps that can compose this process, such as pre-processing, transformation and post-processing, and which can impact on the performance of the process. For all these reasons, it is not possible yet in meta-learning to automatically plan/design DM workflows for a given dataset as in ontology-based DM

Second, by focusing mainly on the learning or modeling phase of the knowledge discov-ery process, meta-learning simply ignores the other steps that can compose this process, such as pre-processing, transformation and post-processing, and which can impact on the performance of the process. For all these reasons, it is not possible yet in meta-learning to automatically plan/design DM workflows for a given dataset as in ontology-based DM