Introduction to Sequential Pattern Mining


3.2 Introduction to Sequential Pattern Mining

Sequential pattern mining is the mining of frequently occurring patterns related to time or other sequences [HK01]. In Example 1, we described a frequent sequential pattern related to other sequences in the structured log file (i.e. 10% of all the sequences).

Using sequential pattern mining, one can identify the paths that users frequently follow on a Web site. Sequential patterns are therefore widely used in the WUM field, as mentioned in Section 1.2.2. In this section, we give a brief description of the sequential pattern mining problem, then explain the issues that arise when applying sequential pattern mining techniques in WUM and, finally, present some limitations of current algorithms.

3.2.1 Definitions

Sequential pattern mining (SPM) is based on association rule mining; we present here the definitions of both terms, as given in [AIS93] and [SA96]. In [AIS93], an association rule is defined as follows:

Definition 1 Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of $m$ literals (items). Let $D = \{t_1, t_2, \ldots, t_n\}$ be a set of $n$ transactions. Associated with each transaction is a unique identifier called its TID and an itemset $I$. $I$ is a $k$-itemset when $k$ is the number of items in $I$.

We say that a transaction $T$ contains $X$, a set of some items in $I$, if $X \subseteq T$.

The support of an itemset $I$ is the fraction of transactions in $D$ containing $I$: $\mathrm{supp}(I) = \lVert \{t \in D \mid I \subseteq t\} \rVert / \lVert \{t \in D\} \rVert$.

An association rule is an implication of the form $I_1 \Rightarrow I_2$, where $I_1, I_2 \subset I$ and $I_1 \cap I_2 = \emptyset$. The rule $I_1 \Rightarrow I_2$ holds in the transaction set $D$ with confidence $c$ if $c\%$ of transactions in $D$ that contain $I_1$ also contain $I_2$. The rule $r : I_1 \Rightarrow I_2$ has support $s$ in the transaction set $D$ if $s\%$ of transactions in $D$ contain $I_1 \cup I_2$ (i.e. $\mathrm{supp}(r) = \mathrm{supp}(I_1 \cup I_2)$).

Given two parameters specified by the user, minsupp and minconfidence, the problem of association rule mining in a database D consists in first finding the set of frequent itemsets in D, i.e. all the itemsets having a support greater than or equal to minsupp. Association rules with a confidence greater than minconfidence are then generated from these itemsets.
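The support and confidence measures of Definition 1 can be illustrated with a minimal Python sketch. The toy database `D`, its item names and the helper names `supp` and `confidence` are assumptions made for illustration only; confidence is computed as supp(I1 ∪ I2) / supp(I1), which is the fraction of transactions containing I1 that also contain I2.

```python
def supp(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent => consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return supp(both, transactions) / supp(antecedent, transactions)


# Hypothetical toy database D with four transactions.
D = [frozenset(t) for t in (["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"])]

print(supp({"a", "b"}, D))           # support of the itemset {a, b}
print(confidence({"a"}, {"b"}, D))   # confidence of the rule {a} => {b}
```

In this toy database, {a, b} occurs in 2 of the 4 transactions and {a} in 3 of them, so the rule {a} => {b} has a support of 50% and a confidence of about 67%.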

The definition of an association rule does not take time into consideration, contrary to the sequential pattern, which is defined in [SA96] as follows:

Definition 2 A sequence is an ordered list of itemsets denoted by $\langle s_1 s_2 \ldots s_n \rangle$ where $s_j$ is an itemset. The data-sequence of a customer $c$ is the sequence in $D$ corresponding to customer $c$. A sequence $\langle a_1 a_2 \ldots a_n \rangle$ is a subsequence of another sequence $\langle b_1 b_2 \ldots b_m \rangle$ if there exist integers $i_1 < i_2 < \ldots < i_n$ such that $a_1 \subseteq b_{i_1}, a_2 \subseteq b_{i_2}, \ldots, a_n \subseteq b_{i_n}$.

Example 2 Let C be a client and S = <(3) (4 5) (8)> be the purchases of that client. S can be read as “C bought the item 3, then the items 4 and 5 at the same moment (i.e. in the same transaction) and, finally, C bought the item 8”.
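The subsequence relation of Definition 2 can be checked with a short Python sketch. The function name `is_subsequence` and the representation of a sequence as a list of Python sets are assumptions made for illustration. The sketch uses a greedy left-to-right scan, which matches each itemset of the candidate to the earliest remaining itemset that contains it.

```python
def is_subsequence(a, b):
    """Return True if sequence `a` is a subsequence of sequence `b`.

    Both arguments are lists of itemsets (Python sets).
    """
    pos = 0
    for itemset in a:
        # Skip itemsets of b that do not contain the current itemset of a.
        while pos < len(b) and not set(itemset) <= set(b[pos]):
            pos += 1
        if pos == len(b):
            return False
        pos += 1  # the next itemset of a must match strictly later in b
    return True


# The sequence S of Example 2.
S = [{3}, {4, 5}, {8}]

print(is_subsequence([{3}, {5}], S))   # True: 3 is bought before 5
print(is_subsequence([{3}, {8}], S))   # True: itemsets may be skipped
print(is_subsequence([{4}, {5}], S))   # False: 4 and 5 are in the same transaction
print(is_subsequence([{8}, {3}], S))   # False: the order is not respected
```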

Definition 3 The support for a sequence s, also called supp(s), is defined as the fraction of the total data-sequences that contain s. If supp(s) ≥ minsupp, with a minimum support value minsupp given by the user, s is considered a frequent sequential pattern.

3.2.2 Log Files Analysis with Sequential Patterns

The format of the (access) log file is not suitable for the direct application of sequential pattern mining algorithms, because these techniques were originally designed to analyze market basket transactions. Therefore, the log data needs to be transformed into a (client, date, item) format.

The general process of Web Usage Mining is similar to the principle proposed in [FPSS96]. It relies on three main steps. First, starting from a raw data file, a preprocessing step is needed to clean out the “useless” information, as presented in Chapter 2. The second step starts from this preprocessed data and applies data mining algorithms to find frequent itemsets or frequent sequential patterns. Finally, the third step helps the user analyze the results by providing a visualization of them.

Raw data is collected in log files by Web servers. Each entry in the log file records a request from a client machine to the server (HTTP daemon). Access log file formats can differ, depending on the system hosting the Web site. We illustrate our approaches with an access log file in the ECLF format (an extended version of the Common Log Format (CLF) given by CERN and NCSA [NCS95]). A log entry contains 9 fields, separated by spaces:

host user authuser [date:time] “request” status bytes “referrer” “user agent”
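As an illustration of the nine-field layout above, the following Python sketch parses one ECLF entry with a regular expression. The sample line, its IP address and its URLs are invented for the example, and the sketch assumes that the quoted fields use plain ASCII double quotes and contain no embedded quotes, which real logs do not always guarantee.

```python
import re

# One named group per field of the ECLF layout.
ECLF = re.compile(
    r'(?P<host>\S+) (?P<user>\S+) (?P<authuser>\S+) '
    r'\[(?P<datetime>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# Hypothetical log entry, made up for this example.
line = (
    '192.0.2.1 - - [10/Oct/2005:13:55:36 +0200] '
    '"GET /index.html HTTP/1.0" 200 2326 '
    '"http://www.example.org/start.html" "Mozilla/4.08"'
)

match = ECLF.fullmatch(line)
entry = match.groupdict() if match else {}

print(entry["host"], entry["request"], entry["status"])
```

Lines that do not match the expression yield an empty dictionary in this sketch; a real preprocessing step would have to decide whether to log or discard them.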

Definition 4 Let $Log$ be a set of server access log entries. An entry $g$, $g \in Log$, is a tuple $g = \langle ip_g, ([l_1^g.\mathit{URL}, l_1^g.\mathit{time}] \ldots [l_m^g.\mathit{URL}, l_m^g.\mathit{time}]) \rangle$ such that for $1 \leq k \leq m$, $l_k^g.\mathit{URL}$ is the item requested by the user $ip_g$ at time $l_k^g.\mathit{time}$, and for all $1 \leq j < k$, $l_k^g.\mathit{time} > l_j^g.\mathit{time}$.
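The structure described in Definition 4 can be sketched in Python as a mapping from each IP address to its time-ordered list of requests. The function name `build_entries`, the integer timestamps and the sample requests are assumptions made for illustration. Definition 4 requires strictly increasing times, and this sketch does not resolve ties between requests with identical timestamps.

```python
from collections import defaultdict


def build_entries(requests):
    """Group (ip, time, url) triples by ip and order each group by time."""
    by_ip = defaultdict(list)
    for ip, time, url in requests:
        by_ip[ip].append((time, url))
    # Sorting the (time, url) pairs orders each user's requests chronologically.
    return {ip: sorted(reqs) for ip, reqs in by_ip.items()}


# Hypothetical requests, given in arbitrary order.
requests = [
    ("192.0.2.1", 3, "/c.html"),
    ("192.0.2.2", 1, "/a.html"),
    ("192.0.2.1", 1, "/a.html"),
    ("192.0.2.1", 2, "/b.html"),
]

log = build_entries(requests)
for ip, reqs in log.items():
    print(ip, [url for _, url in reqs])
```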

User \ Item (position)    1    2    3    4    5
1                        10   30   40   20   30
2                        10   30   20   60   50
3                        10   70   30   20   30

Table 3.1: File Obtained after the Preprocessing Step, from the DB

The preprocessing applied to the access log files, described in Chapter 2, cleans the log files and groups the requests into sessions¹. At the end of this process, the preprocessed data is stored in a relational database. From this database, it is straightforward to extract a file in the format (client, date, item), containing the sessions corresponding to all the users we want to analyze. Table 3.1 gives an example of a file extracted from a database containing a preprocessed log for 3 users.

For each user, we have an ordered list of the requested URLs. For instance, user 2 requested the URL with the ID “60” in the fourth request.

Thus, our goal is, according to Definition 3 and by means of a sequential pattern mining step, to find in this file the sequential patterns that can be considered frequent. The result may be, for instance, <(10) (30) (20) (30)> (using the file illustrated in Table 3.1 and a minimum support of 66%: the pattern is contained in the sessions of users 1 and 3, i.e. in 2 of the 3 sessions). Such a result, once mapped back to the URLs, reveals a frequent behavior common to n users (with n calculated as the support of the pattern times the total number of users) and also gives the sequence of events (i.e. clicks) composing that behavior.
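The support of this pattern can be verified on the data of Table 3.1 with a minimal Python sketch. The names `contains` and `supp` are assumptions made for illustration, and the sketch relies on the simplification that every itemset holds a single item (one URL per click), so a session is a plain list of URL identifiers.

```python
def contains(session, pattern):
    """True if the items of `pattern` occur in `session` in the same order."""
    remaining = iter(session)
    # `item in remaining` consumes the iterator up to the first match,
    # so each item is searched for strictly after the previous one.
    return all(item in remaining for item in pattern)


def supp(pattern, sessions):
    """Fraction of sessions that contain `pattern` (Definition 3)."""
    return sum(contains(s, pattern) for s in sessions.values()) / len(sessions)


# Sessions of Table 3.1, indexed by user.
sessions = {
    1: [10, 30, 40, 20, 30],
    2: [10, 30, 20, 60, 50],
    3: [10, 70, 30, 20, 30],
}

pattern = [10, 30, 20, 30]
print(supp(pattern, sessions))           # fraction of users following the pattern
print(supp(pattern, sessions) >= 0.66)   # frequent for a minimum support of 66%?
```

User 2 requests 10, 30 and 20 in order but never requests 30 again afterwards, which is why the pattern is supported by only two of the three users.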

3.2.3 Challenges for Sequential Pattern Mining Algorithms

As mentioned in Chapter 1, the growth in Internet usage has led to a growth in the size of Web logs. Moreover, Web sites are becoming larger, as more Web pages are added than removed. Web site owners prefer to keep old Web pages because storing information is relatively cheap nowadays and because users bookmark Web pages and expect to always find them online.

Sequential pattern mining algorithms are usually based on the well-known Apriori principle of candidate generation [AIS93]. In this type of algorithm, at a given step k, candidate patterns of length k + 1 are generated from the current patterns of length k and then tested against all the sequences in the database. This strategy has proved inefficient for a large number of sequences or a large number of items, as the latter implies that a large number of candidates is generated. In the case of Web logs, we have a high number of both sequences (i.e. sessions) and items (i.e. Web pages). Therefore, applying standard algorithms to extract sequential patterns from Web logs is feasible only for higher supports (for which the number of candidates generated decreases rapidly). However, this leads to few or uninteresting results (e.g. short patterns or patterns involving the main pages of the Web site). For example, one may discover that “20% of the users visited the home page and then the research teams page”.

¹ In this chapter we use the term session to designate the objects analyzed, representing the user sessions, the visits or the episodes.

To obtain more interesting sequential patterns, i.e. longer ones or ones involving pages with a higher depth or syntactic level, we need to be able to extract sequential patterns with a very low support. The existing sequential pattern mining algorithms are not capable of doing this because of the exponential number of candidate patterns generated at each step. For example, assume that for a very low support $s_{vl}$, the algorithm generates $n_1$ patterns of length 1. The number of candidates of length 2 is then $n_1^2$, and these candidates need to be tested against all the sequences in the database.
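The quadratic growth of the length-2 candidates can be made concrete with a small Python sketch. It assumes that each element of a pattern is a single page, so a length-2 candidate is an ordered pair of frequent items, repetitions included; the function name `length2_candidates` and the sample values of n1 are hypothetical.

```python
from itertools import product


def length2_candidates(frequent_items):
    """All ordered pairs <(a) (b)> of frequent items, repetitions included."""
    return list(product(frequent_items, repeat=2))


frequent_items = [10, 20, 30]   # hypothetical frequent pages, n1 = 3
candidates = length2_candidates(frequent_items)
print(len(candidates))          # n1 ** 2 candidates of length 2

# For larger n1 only the count is computed; building the list would be costly.
for n1 in (100, 10_000, 1_000_000):
    print(n1, n1 ** 2)
```

Each of these candidates must then be counted against every session, so the cost of the second step alone grows with the product of the number of candidates and the number of sequences.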

To face these challenges, we propose to use a general divisive sequential pattern mining methodology and we designed three approaches for implementing it.
