
E.3 QTIPrefixSpan results


In this section, we present some results obtained with QTIPrefixSpan. QTIPrefixSpan processes sequences of interval-based events. In the data at hand, we have punctual events corresponding to drug deliveries. We therefore apply a preprocessing step to transform drug deliveries into drug exposures. This preprocessing is detailed in the next section; then we present the results.

E.3.1 Dataset preparation

The transformation of drug deliveries into drug exposures is a three-step procedure:

1. each drug delivery event (e, t) is transformed into an interval-based event (e, [t, t + 30]). In the French system, a drug delivery must be renewed at least every month. Hence, we assume that a drug delivery corresponds to a 30-day drug exposure.

2. joint intervals are recursively merged: if two interval-based events (e, [l1, u1]) and (e, [l2, u2]) are such that u1 + 7 ≥ l2, then they are merged into the interval (e, [l1, u2]). We made the hypothesis of a delivery renewed every 30 days; the value 7 adds some flexibility when deciding that successive deliveries are contiguous (a sketch of this preprocessing is given below).
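
To make the procedure concrete, here is a minimal Python sketch of this preprocessing, assuming the 30-day exposure and 7-day tolerance above; function names and the data layout are illustrative and do not correspond to the actual implementation used in the experiments.

```python
# Minimal sketch of the delivery-to-exposure preprocessing (illustrative names).
# A raw sequence is a list of punctual events (event_type, delivery_day).

EXPOSURE_DAYS = 30  # assumed exposure duration after a single delivery
TOLERANCE = 7       # flexibility for considering successive deliveries as contiguous

def deliveries_to_exposures(sequence):
    """Turn punctual delivery events into merged interval-based exposure events."""
    # Step 1: each delivery (e, t) becomes the interval-based event (e, [t, t + 30]).
    intervals = sorted((e, t, t + EXPOSURE_DAYS) for (e, t) in sequence)

    # Step 2: merge intervals of the same event type when u1 + 7 >= l2.
    merged = []
    for (e, l, u) in intervals:
        if merged and merged[-1][0] == e and merged[-1][2] + TOLERANCE >= l:
            prev_e, prev_l, _ = merged.pop()
            merged.append((prev_e, prev_l, u))
        else:
            merged.append((e, l, u))
    return merged

# Two deliveries of drug 114 at days 0 and 32 merge into a single exposure [0, 62].
print(deliveries_to_exposures([(114, 0), (114, 32), (383, 70)]))
# -> [(114, 0, 62), (383, 70, 100)]
```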

The dataset of 18,499 sequences preceding seizures has been preprocessed to generate as many interval-based sequences. The transformation reduces the number of events from 341,100 to 188,512.

E.3.2 QTIPrefixSpan

The setup of QTIPrefixSpan is the following: minimal frequency threshold f_min = 15%, similarity threshold of 30 with the Hausdorff distance. We bounded the size of the symbolic signature to 4 items.
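
For illustration, the sketch below computes a plain Hausdorff distance between two finite sets of interval bounds, each interval [l, u] being seen as the point (l, u); this is the textbook definition and is not claimed to be the exact similarity computation implemented inside QTIPrefixSpan.

```python
# Generic Hausdorff distance between two finite point sets.
# Here, each temporal interval [l, u] is represented by the 2D point (l, u).

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two finite sets of 2D points."""
    def directed(X, Y):
        # Worst case, over the points of X, of the distance to the closest point of Y.
        return max(min(((xl - yl) ** 2 + (xu - yu) ** 2) ** 0.5 for (yl, yu) in Y)
                   for (xl, xu) in X)
    return max(directed(A, B), directed(B, A))

# Two sets of intervals described by their bounds (illustrative values).
p = [(0, 93), (33, 85), (63, 98)]
q = [(6, 90), (30, 80), (70, 95)]
print(hausdorff(p, q))  # compared to the similarity threshold (30 in our setup)
```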

The algorithm extracted 142 patterns in a few seconds. Focusing on patterns containing both events 114 and 383, two patterns are especially interesting (illustrated in Figure 3.19):

• p_1^QTI = ⟨(86, [0, 93]) (114, [33, 85]) (383, [63, 98])⟩

• p_2^QTI = ⟨(86, [6, 36]) (383, [6, 36]) (114, [61, 105]) (86, [61, 111])⟩

[Figure omitted: for each pattern, horizontal bars show the intervals of events 86, 114 and 383 on a 0–100 time axis.]

Figure 3.19: Illustration of two patterns extracted by QTIPrefixSpan from interval-based temporal sequences of drug exposures preceding an epileptic seizure.

These two examples are interesting because they illustrate two types of switches between drugs 114 and 383 preceding epileptic seizures. p_1^QTI illustrates a change from 114 to 383. The overlap between the two intervals has to be carefully analyzed because it does not necessarily mean that the patient took both drugs at the same time. Indeed, we have seen that the representative nature of QTI patterns leads algorithms to enlarge the interval extents (see discussion in Section B.4).

p_2^QTI illustrates a switch from 383 to 114. The event 86 (paracetamol) is here split into two different events. This pattern illustrates that our approach can extract repetitive patterns.

F Conclusion and perspectives

This Chapter presented pattern domains and algorithms to extract frequent patterns where quantitative temporal information is part of the extracted knowledge. It contributes to answering the challenge of enhancing sequential patterns with metric temporal information.

More precisely, we initially had two questions about temporal pattern mining:

• is it practically tractable? Our answer is that the proposed solutions are practical for mid-size datasets (datasets with less than 100,000 sequences of length lower than 30). Indeed, the experiments on synthetic datasets showed that extracting temporal information requires significant computational resources (time and memory). Such data characteristics still allow addressing a wide range of applications, even if it may require preparing the dataset (e.g. by sampling sequences). Large datasets may be difficult to mine with our algorithms.

• does it bring new interesting information? Our case study on care pathway analysis illustrates the potential interest of such expressive patterns when analyzing temporal data. Nonetheless, we highlighted the necessity of carefully discussing the semantics of patterns when proposing a new pattern domain (e.g. boundaries of discriminant patterns, negation in negative sequential patterns). A simple syntax may hide complex semantics and possible misinterpretations.

A focus on pattern semantics

The main motivation of our work is to extract meaningful patterns. Contrary to most pattern mining algorithms, we do not think about computational efficiency first, but about the semantic enhancement of patterns. Then, we explore the formal properties of the desired pattern domains (e.g. chronicles or negative sequential patterns) to propose algorithmic strategies that extract patterns in reasonable time on mid-size datasets. Scaling up our algorithms is still a huge challenge.

Despite the limitations of the proposed algorithms, mining temporal sequences attracts the interest of data analysts in various applications (e.g. health care pathways, emotion detection from facial images, electricity consumption, ECG), and some extensions have been proposed by other research groups. The DCM algorithm has been used in two other contexts:

• Sellami et al. [2018] used DCM for predictive maintenance purposes in an industrial context,

• K. Fauvel (PhD candidate in LACODAM) explored chronicles and DCM for predicting oestrus of cows during the I. Harada internship.

The QTempIntMiner and QTIPrefixSpan algorithms have been extended by several research teams in various directions:

• Ruan et al. [2014] proposed to parallelize the algorithm,

• Hassani et al. [2016] explored alternative distances,

• Dermouche and Pelachaud [2016] proposed to use a vertical format with a search strategy inspired by SPADE [Zaki, 2001].

Finally, NTGSP has been used by EDF to analyze their customer pathways.

Toward a general framework

This research line started with the proposal of QTempIntMiner. To the best of our knowledge, this algorithm, proposed in 2008 [Guyet and Quiniou, 2008], was the first temporal pattern mining algorithm to combine pattern domain exploration and machine learning. It initiated our work, which led to the several algorithms presented in this Chapter: QTempIntMiner, QTIPrefixSpan, DCM, NTGSP. Their principles are similar, based on a temporal pattern definition in two parts:

• a symbolic signature: the structural skeleton of the pattern. Symbolic signatures are structured in a poset (ideally a concept lattice) such that the classical tricks of frequent pattern mining ensure their efficient extraction. The different types of symbolic signatures are sequences, multisets or negative patterns.

• temporal constraints: they enhance the symbolic signature with temporal information. Through the “projection” of each symbolic signature on the sequences, we build a tabular dataset representing the timestamps of its occurrences. This tabular dataset is processed by a clustering algorithm (KMeans or CLIQUE) or a classification algorithm (Ripperk) to extract meaningful temporal information. A toy illustration of this step is sketched below.
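
The sketch below illustrates this second step on a toy example, assuming scikit-learn is available: the occurrence timestamps of a fixed symbolic signature ⟨a b⟩ are gathered in a table and clustered with KMeans, and each cluster centre summarizes one quantitative temporal variant of the signature. Names and values are illustrative only.

```python
# Toy illustration of the "projection then clustering" step (assumes scikit-learn).
# Each row contains the occurrence timestamps of the signature <a b> in one sequence.
import numpy as np
from sklearn.cluster import KMeans

occurrences = np.array([
    [2.0, 10.0], [3.0, 11.0], [2.5, 9.5],      # first temporal behaviour of <a b>
    [20.0, 22.0], [21.0, 23.5], [19.5, 22.5],  # second temporal behaviour of <a b>
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(occurrences)
for centre in model.cluster_centers_:
    # Each centre gives representative timestamps for one temporal pattern.
    print("representative timestamps of <a b>:", centre)
```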

[Figure omitted: three parallel pipelines (sequences + KMeans → QTI patterns; multisets + Ripperk → chronicles; negative patterns + CLIQUE → NTP).]

Figure 3.20: Parallelism between the three proposed algorithms, from left to right: QTempIntMiner, DCM and NTGSP.

Figure 3.20 illustrates the three proposed algorithms. It highlights the obvious similarities between these approaches and suggests that a generic framework may be developed.

Perspectives

Then, the first perspective of this research line is to unify some of our approaches, namely chronicles and negative patterns. With NTGSP, we showed that modeling absent events and modeling temporal delays between events may be seen as constraints on the occurrences of a sequential pattern. With chronicles, multisets are preferred to sequential patterns, with constraints between pairs of events. A unification would be to have couples (E, T) where E is a multiset of event types and T is a set of constraints that are either temporal constraints or negation constraints. Designing such a pattern domain would combine the expressivity of negative patterns and chronicles, and we expect to inherit their computational properties to propose efficient algorithms.
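
As a purely hypothetical illustration of what such couples (E, T) could look like, the sketch below combines a multiset of event types with a mixed set of temporal and negation constraints; it only illustrates the envisioned pattern domain and is not an existing implementation.

```python
# Hypothetical representation of a unified pattern (E, T):
# E is a multiset of event types, T mixes temporal and negation constraints.
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class TemporalConstraint:
    a: str        # first event type
    b: str        # second event type
    lower: float  # minimal delay between an occurrence of a and one of b
    upper: float  # maximal delay between an occurrence of a and one of b

@dataclass
class NegationConstraint:
    absent: str               # event type that must not occur
    between: Tuple[str, str]  # pair of event types delimiting the forbidden gap

@dataclass
class UnifiedPattern:
    events: Counter                                                   # the multiset E
    constraints: List[Union[TemporalConstraint, NegationConstraint]]  # the set T

# A pattern stating: A then B within 10 to 30 time units, with no C in between.
p = UnifiedPattern(
    events=Counter({"A": 1, "B": 1}),
    constraints=[TemporalConstraint("A", "B", 10, 30),
                 NegationConstraint("C", ("A", "B"))],
)
```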

To improve the efficiency of our approaches, we believe in sampling approaches (see Section A.3.3 of Chapter 2). Approaches based on MCTS are dedicated to the exploration of large search spaces, and the solution developed by Bosc et al. [2018] to extract diverse sets of patterns may be adapted to temporal patterns.

Finally, there is some room to better promote our algorithm implementations. Indeed, the most popular implementation of sequential pattern mining, namely the SPMF library [Fournier-Viger et al., 2016], does not provide any algorithm to extract temporal patterns. Releasing solid implementations of our algorithms as a temporal pattern mining library may be useful for a broad range of applications.

Declarative Sequential Pattern Mining 1

The challenge of mining a deluge of data is about to be solved, but is also about to be replaced by another issue: the deluge of patterns. In fact, the size of the complete set of frequent patterns explodes with the database size and density [Lhote, 2010]. The data analyst cannot handle such volumes of results. A broad range of research, from data visualization [Perer and Wang, 2014] to database sampling [Low-Kam et al., 2013], is currently attempting to tackle this issue. The data mining community has mainly focused on the addition of expert constraints on sequential patterns [Pei et al., 2004].

Recent approaches have renewed the field of Inductive Logic Programming (ILP) [Muggleton and De Raedt, 1994] by exploring declarative data mining. Declarative data mining consists in encoding data mining or machine learning tasks in declarative languages (SAT, CP, Linear Programming, Logic Programs, etc.) [De Raedt, 2012]. One of the objectives is to benefit from their versatility to ease the addition of constraints. This approach has been developed for clustering [Adam et al., 2013; Dao et al., 2018] and for supervised machine learning tasks (e.g. rule learning, ILP), but more especially for pattern mining tasks.

Many approaches have addressed the problem of declarative pattern mining, especially itemset mining [Guns et al., 2015; Järvisalo, 2011; Jabbour et al., 2018]. Some propositions have extended the approach to sequence mining [Negrevergne and Guns, 2015; Métivier et al., 2013; Jabbour et al., 2013; Coquery et al., 2012]. Their practical use depends on the efficiency of their encodings to process real datasets. Thanks to the improvements in satisfiability (SAT) and constraint programming (CP) solving techniques and solvers, such approaches have become realistic alternatives for highly constrained mining tasks. Their computational performances closely approach those of dedicated algorithms.

The long term objective is to benefit from the genericity and versatility of solvers to let a user specify a potentially infinite range of constraints on the patterns. Thus, we expect to go from specific algorithm constraints to a rich query language for pattern mining.

This part presents the use of Answer Set Programming (ASP) to mine sequential patterns.

ASP is a high-level declarative logic programming paradigm for encoding combinatorial and optimization problem solving as well as knowledge representation and reasoning. In ASP, the programmer has to specify a search space (choice rules) and a description of the search space elements to select (constraints). This maps to the problem of pattern mining, in which interesting patterns (e.g. frequent ones) are selected among the space of patterns. Thus, ASP is a good candidate for implementing pattern mining with background knowledge, which has been a data mining issue for a long time.
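
As a tiny preview of this generate-and-test style (Section A gives a proper introduction), the sketch below runs a minimal ASP program through the clingo Python API, assuming the clingo package is installed; the encoding is only illustrative and is not one of the encodings presented in this chapter.

```python
# Minimal illustration of ASP's choice rules and constraints with the clingo Python API.
# The choice rule generates candidate subsets of items; the integrity constraint
# discards candidates that do not contain item a (a miniature "select in a search space").
import clingo

PROGRAM = """
item(a; b; c).
{ in(I) : item(I) }.   % choice rule: any subset of items is a candidate answer set
:- not in(a).          % constraint: reject candidates that do not contain item a
#show in/1.
"""

ctl = clingo.Control(["0"])      # "0" asks the solver for all answer sets
ctl.add("base", [], PROGRAM)
ctl.ground([("base", [])])
ctl.solve(on_model=lambda model: print(model))
```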

The contributions of this chapter are the following:

1 This part is a collective work with Yves Moinard, René Quiniou, Torsten Schaub, Martin Gebser and Javier Romero.


• We propose encodings of the classical frequent sequential pattern mining task with two representations of embeddings (fill-gaps vs skip-gaps).

• We illustrate the expressiveness of ASP by presenting encodings of different variants of sequential pattern mining tasks (closed, maximal, constrained, rare, etc.).

• We apply our framework to the mining of care pathways. It shows the advantage of ASP compared to other declarative paradigms to blend mining and reasoning.

A ASP – Answer Set Programming

In this section, we introduce the Answer Set Programming (ASP) paradigm, its syntax and tools.

Section A.1 introduces the main principles and notations of ASP. Section A.2 illustrates them on the well-known graph coloring problem.
