Mining frequent sequential patterns in alternative datasets

Sequential pattern mining is also classically defined for two alternative types of data:

• temporal sequences,
• streams (or event logs).

Sequential pattern mining in these contexts remains very similar to the problem presented above.

The pattern domain is identical, (L,C), but the difference lies in the way the support of a pattern is evaluated.

The next sections introduce some additional definitions for temporal sequences, logs and streams, and redefine the notion of support of a sequential pattern so that it fits our framework (anti-monotonicity).

A.4.1 Frequent sequential patterns in temporal sequences

In most real applications, the sequences that are collected are sequences of timestamped events. For instance, in a sequence of supermarket purchases, each purchase has a date. A temporal sequence is a sequence of timestamped items, called events.

Definition 12 (Event and temporal sequence). An event is a couple $(e, t)$ where $e \in I$ is the event type and $t \in \mathbb{R}$ is a timestamp. $(e, t)$ is strictly before an event $(e', t')$, denoted $(e, t) \lhd (e', t')$, iff $t < t'$ or $t = t' \wedge e <_E e'$.

A temporal sequence $s$ is a set of events $s = \langle (e_1, t_1)\,(e_2, t_2) \dots (e_n, t_n) \rangle$ such that $(e_i, t_i) \lhd (e_{i+1}, t_{i+1})$ for all $i \in [n-1]$. The length of sequence $s$, denoted $|s|$, is equal to its number of events.

The duration of the sequence is $t_n - t_1$.

In addition, we assume that a temporal sequence does not contain two identical events (same event type at same date).
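The ordering constraint of Definition 12 can be checked mechanically. The following is a minimal Python sketch, not part of the original framework: it assumes that events are `(event_type, timestamp)` tuples and that the order $<_E$ on event types is the lexicographic order on strings. The names `strictly_before`, `is_temporal_sequence` and `duration` are hypothetical.

```python
from typing import List, Tuple

# Assumption: an event is an (event_type, timestamp) tuple.
Event = Tuple[str, float]


def strictly_before(a: Event, b: Event) -> bool:
    """(e, t) is strictly before (e', t') iff t < t', or t == t' and e < e'.

    Assumption: the order on event types is the lexicographic string order.
    """
    (e, t), (e2, t2) = a, b
    return t < t2 or (t == t2 and e < e2)


def is_temporal_sequence(events: List[Event]) -> bool:
    """Check that each event is strictly before its successor."""
    return all(
        strictly_before(events[i], events[i + 1])
        for i in range(len(events) - 1)
    )


def duration(events: List[Event]) -> float:
    """Duration t_n - t_1 of a non-empty temporal sequence."""
    return events[-1][1] - events[0][1]


s = [("Chocolate", 34), ("Bread", 35), ("Milk", 35), ("Apple", 42)]
print(is_temporal_sequence(s), len(s), duration(s))
```

In this sketch, a sequence containing the same event twice fails the check, because an event is never strictly before itself. With the lexicographic order assumed here, Bread must precede Milk when both share a timestamp.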

It is worth noticing that an event is associated with one item, not with an itemset. Nonetheless, it is possible to represent two event types occurring at the same time by two events with the same timestamp.

Then, it is easy to convert a temporal sequence to a sequence and conversely.

Example 11 (Sequences and temporal sequences). For instance, for the sequence of purchases $\langle (Chocolate, 34)\,(Milk, 35)\,(Bread, 35)\,(Apple, 42) \rangle$, the corresponding sequence is $\langle Chocolate\,(Milk, Bread)\,Apple \rangle$.

Conversely, the sequence $\langle Chocolate\,(Milk, Bread)\,Apple \rangle$ can be converted into the temporal sequence $\langle (Chocolate, 1)\,(Milk, 2)\,(Bread, 2)\,(Apple, 3) \rangle$.

One can notice that the conversion from sequence to temporal sequence (and conversely) preserves the order of the items, but not the timestamps.
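The two conversions of Example 11 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: events are `(event_type, timestamp)` tuples, itemsets are represented as `frozenset` objects, and the position of an itemset (starting at 1) is used as its timestamp. The function names are hypothetical.

```python
from itertools import groupby
from typing import FrozenSet, List, Tuple

Event = Tuple[str, float]


def temporal_to_sequence(events: List[Event]) -> List[FrozenSet[str]]:
    """Group events sharing a timestamp into one itemset; timestamps are dropped."""
    ordered = sorted(events, key=lambda ev: ev[1])
    return [
        frozenset(e for e, _ in group)
        for _, group in groupby(ordered, key=lambda ev: ev[1])
    ]


def sequence_to_temporal(sequence: List[FrozenSet[str]]) -> List[Event]:
    """Use the 1-based position of each itemset as the timestamp of its items."""
    return [
        (item, position)
        for position, itemset in enumerate(sequence, start=1)
        for item in sorted(itemset)  # assumption: lexicographic order on items
    ]


purchases = [("Chocolate", 34), ("Milk", 35), ("Bread", 35), ("Apple", 42)]
seq = temporal_to_sequence(purchases)
print(seq)
print(sequence_to_temporal(seq))
```

Converting back and forth keeps the relative order of the itemsets, while the original timestamps (34, 35, 42) are replaced by positions (1, 2, 3).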

Remark 2. Please note that wherever we write $(E, n)$ for some event type $E$ and timestamp $n$ (as in Example 11), the reduced font size of the timestamps has no meaning, as it only serves readability purposes.

Definition 13 (Occurrence and embedding of a sequential pattern). Let $s = \langle (e_1, t_1)\,(e_2, t_2) \dots (e_n, t_n) \rangle$ be a temporal sequence and $p = \langle p_1\, p_2 \dots p_m \rangle$ be a sequential pattern. $p$ occurs in $s$ iff there exists a sequence of integers $1 \le i_1 \le i_2 \le \dots \le i_m \le n$ such that $e_{i_k} \in p_k$ for all $k \in [m]$, $k < l \implies t_{i_k} = t_{i_l}$ and $k < l \implies t_{i_k} < t_{i_l}$. In other words, $(i_k)_{1 \le k \le m}$ defines a mapping from $[m]$, the set of indexes of $p$, to $[n]$, the set of indexes of $s$. We denote by $t \prec s$ the strict sub-sequence relation such that $t \preceq s$ and $t \neq s$.

$(i_k)_{1 \le k \le m}$ is called an embedding.

A.4.2 Frequent pattern in a stream

The mining of a unique long sequence is usually referred to as serial episode mining⁶. This task consists in extracting frequent sub-sequences from a unique sequence. It typically corresponds to the mining of recurrent behaviors from a long operating trace of a single system.

⁶ In episode mining, introduced by Mannila et al. [1997], "serial" is opposed to concurrent episodes, i.e. episodes that consider concurrency between item occurrences. Sequential patterns are very similar to serial episodes.

On the one hand, this type of dataset can easily be transformed into a set of smaller sequences by windowing the sequence. The windows can be regularly spaced in time, they can be sliding windows as proposed in the WinEPI algorithm [Mannila et al., 1997], or they can be windows preceding a given item c, as we did in our analysis of ECG datasets. The way the windows are built strongly influences the results. In the last case, the extracted patterns correspond to patterns that frequently occur before occurrences of item c. The expert may then infer some correlations with c, which is not the case with the other strategies.
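The windowing strategies mentioned above can be sketched as follows. This is a simplified Python illustration, not the WinEPI algorithm itself and not the procedure used for the ECG analysis: events are assumed to be `(event_type, timestamp)` tuples sorted by timestamp, and the function names and the parameters `width` and `step` are hypothetical.

```python
from typing import List, Tuple

Event = Tuple[str, float]


def sliding_windows(events: List[Event], width: float, step: float) -> List[List[Event]]:
    """Cut a long sequence into windows [start, start + width), moved by `step`.

    With step == width the windows are regularly spaced and disjoint;
    with step < width they overlap (sliding windows).
    """
    if not events:
        return []
    start, t_end = events[0][1], events[-1][1]
    windows = []
    while start <= t_end:
        windows.append([(e, t) for e, t in events if start <= t < start + width])
        start += step
    return windows


def windows_before(events: List[Event], target: str, width: float) -> List[List[Event]]:
    """One window per occurrence of `target`: the events in [t - width, t)."""
    return [
        [(e, t) for e, t in events if tc - width <= t < tc]
        for ec, tc in events
        if ec == target
    ]


log = [("a", 1), ("b", 2), ("c", 3), ("a", 5), ("c", 6)]
print(sliding_windows(log, width=3, step=3))
print(windows_before(log, target="c", width=2))
```

Each resulting window is a small temporal sequence, so the set of windows can be mined like a database of sequences. The choice between the two functions changes which patterns can be frequent, which illustrates why the windowing strategy matters.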

On the other hand, pattern mining algorithms for such a long sequence may be adapted by considering alternative enumerations of pattern occurrences. Achar et al. [2010] reviewed several ways of enumerating occurrences and evaluated their respective complexity. Indeed, in this setting, enumerating the occurrences of a pattern in a sequence is a combinatorial task.
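To give an intuition of this combinatorial growth, the following Python sketch counts all embeddings of a pattern in a single sequence. It is only one possible way of counting occurrences, chosen here for illustration, and it makes simplifying assumptions: the sequence and the pattern are sequences of single items (no itemsets, no timestamps), and an embedding is any strictly increasing tuple of positions matching the pattern. The function name is hypothetical.

```python
from typing import Sequence


def count_embeddings(sequence: Sequence[str], pattern: Sequence[str]) -> int:
    """Count index tuples i_1 < ... < i_m with sequence[i_k] == pattern[k].

    counts[k] holds the number of embeddings of the first k pattern items
    in the part of the sequence read so far (dynamic programming).
    """
    counts = [1] + [0] * len(pattern)
    for item in sequence:
        # Go backwards so that one sequence item extends each prefix at most once.
        for k in range(len(pattern), 0, -1):
            if pattern[k - 1] == item:
                counts[k] += counts[k - 1]
    return counts[len(pattern)]


print(count_embeddings("abab", "ab"))
print(count_embeddings("a" * 10, "aaa"))
print(count_embeddings("a" * 40, "a" * 20))
```

Counting is cheap thanks to the dynamic programming scheme, but the number of embeddings itself can grow very quickly with the sequence length (binomially in the last two calls), which is why listing them explicitly is costly and why restricted notions of occurrence are of interest.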

Some works have addressed the mining of sequences from a stream of items. The notion of stream brings the idea of incremental modifications of the dataset. In sequence mining, various kinds of modifications have been proposed: some algorithms consider transactional incrementality and add new sequences to a database (CISpan [Yuan et al., 2008], PL4UP [Ezeife and Monwar, 2007]); some algorithms consider temporal incrementality and add new items at the end of the dataset transactions (IncSpam [Ho et al., 2006], IncSpan [Cheng et al., 2004], [Huang et al., 2006]); and we explored the mining of a unique stream of itemsets and adapted serial episode mining to the streaming case [Guyet and Quiniou, 2012].

B Case study: reproducing a pharmaco-epidemiological analysis

In this section, we introduce the case study that we use as a running example in the different chapters of this manuscript. This case study has been investigated within a collaborative project with REPERES, a joint laboratory between Rennes University Hospital and EHESP. REPERES studies medico-administrative databases to answer pharmaco-epidemiological questions.

Pharmaco-epidemiology is the study of the uses of health products in the population in real life. It applies the methodologies developed in general epidemiology [Public Policy Committee et al., 2016] to answer questions about the uses of health products, drugs [Polard et al., 2015; Nowak et al., 2015] or medical devices [Colas et al., 2015], in the population. In classical pharmaco-epidemiology studies, people who share common characteristics are recruited to build a cohort. Then, meaningful data (drug exposures, events or outcomes, health determinants, etc.) are collected from the sampled population within a defined period of time. Finally, a statistical analysis highlights the links (or the lack of links) between drug exposures and outcomes (e.g. adverse effects). The main drawback of such prospective cohort studies is the time required to collect and integrate the data. Indeed, in some cases of health product safety, health authorities have to respond quickly to public health issues.

A recent approach consists in using administrative databases to perform such studies on care pathways, i.e. individual sequences of drug exposures, medical procedures and hospitalizations. Medico-administrative databases (MADB) have proved useful for pharmaco-epidemiological issues⁷, since the data are readily available and cover a large population. The French national health insurance database (SNDS/SNIIRAM) has these characteristics, and recent national initiatives foster research efforts on using it to improve care.⁸ However, these databases were designed for administrative purposes, to ensure health reimbursements, and their use in epidemiology is not straightforward: it requires specific skills to process the data and to interpret the results. They record, with some level of detail, all drug deliveries and all medical procedures for all insured people. Such a database gives an abstract view of the longitudinal patient care pathways. Medico-administrative databases are often used for risk assessment purposes, and more especially for the detection of adverse drug reactions [Zhao et al., 2015]. While pharmaco-vigilance is related to the detection, assessment and prevention of risks related to health products, pharmaco-epidemiology can also be viewed as discovering or explaining the consequences of health product consumption. The latter is a knowledge discovery process from a large data warehouse of patient care pathways.

⁷ The interest of using this medico-administrative database has been demonstrated by the withdrawal of benfluorex [Weill et al., 2010].

⁸ In the Villani report on AI, which founded the French National Plan on AI, the SNDS database is mentioned as one of the data sources to be exploited by AI to improve the health care system.
