
2.2 Building blocks for pattern mining

2.2.1 Pattern syntaxes

A myriad of pattern syntaxes has been proposed, each for a specific problem setting, whether faced in practice or imagined. Common examples include:

• emerging patterns [3, 4], association rules [1, 5, 6], subgroups [7–9] in flat tabular data;

• motifs (or frequent subsequences) [10–13], time series shapelets [14–16], outliers [17–21], episodes [22–25] in sequential data;

• frequent subgraphs [26–29], dense subgraphs [30–33], trees [34–36], cycles [37, 38] in graph data;

• moving together patterns [39–41], sequential trajectories [42–44], outlier trajectories [45, 46] in spatial-temporal data.

2.2.2 Interestingness measures

Interestingness measures are intended to quantify how good a discovered pattern is deemed to be. Although research on formalizing interestingness measures has been highly active and well established, there is so far no consensus on how interestingness should be precisely defined. Among the many diverse definitions and approaches to interestingness proposed in the literature, two major categories can be distinguished [47]: objective and subjective interestingness measures, according to whether the user's prior knowledge is considered.

Objective interestingness measures depend only on the data and the patterns. Most of them are formalized based on theories in probability, statistics, or information theory, and are intended to prioritize or consider factors such as conciseness (e.g., the cardinality of a pattern set), coverage (e.g., the support of an itemset [48]), reliability (e.g., the accuracy of a classification rule [49]), peculiarity (e.g., the peculiarity of a data object [50]) or diversity (e.g., the cover redundancy of a subgroup set [9]). Though straightforward forms of objective measures often lend themselves to efficient mining algorithms, the patterns discovered with respect to them are often either obvious or already known by the user (and are thus not truly interesting). This is because they ignore variation among users: what is interesting to one may be nothing but useless information to another.
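To make the simplest of these measures concrete, the following minimal sketch computes the support of an itemset on a toy transaction database; the data and item names are purely illustrative.

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

# Toy transaction database; items and values are purely illustrative.
db = [frozenset(t) for t in ({"bread", "milk"}, {"bread", "butter"},
                             {"bread", "milk", "butter"}, {"milk"})]
print(support({"bread", "milk"}, db))  # 0.5: present in 2 of 4 transactions
```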

Subjective measures take into account the user, in addition to the data and patterns. To do so, the user's prior knowledge or expectations about the data are modelled by what is commonly called background knowledge, and the interestingness of a discovered pattern is then measured by how much the pattern deviates from that background knowledge. According to how the background knowledge is encoded and how the deviation is defined, two major classes of methods can be distinguished: syntactical and probabilistic methods. The former encodes the background knowledge as a collection of independent patterns with the same syntax as the patterns to be mined (thus the user has to hold some explicit knowledge of the required form), and then employs a distance measure to evaluate their similarity or difference: the more distant, the more interesting (e.g., using fuzzy matching [51] or logical contradiction [52]). The latter utilizes a probability distribution over the data (expressed explicitly or implicitly), called the background model, to encode the background knowledge. The interestingness can then be measured either by the deviation between a statistic of the pattern calculated on the empirical data and the same statistic under the background model (e.g., using Bayesian networks [53], swap randomization [54, 55]), or directly by the probability of the pattern under the background model (e.g., using maximum entropy models [56, 57]).
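As a toy instance of the probabilistic view, the sketch below measures how far the observed support of an itemset deviates from the support expected under the simplest possible background model, namely independent items (this ratio is the classical lift). The approaches cited above use far richer background models; the data here are again illustrative.

```python
from math import prod

def lift(itemset, transactions):
    """Observed support divided by the support expected under an
    independence background model over the individual items."""
    n = len(transactions)
    observed = sum(set(itemset) <= t for t in transactions) / n
    expected = prod(sum(i in t for t in transactions) / n for i in itemset)
    return observed / expected  # > 1: more frequent than the model expects

db = [frozenset(t) for t in ({"bread", "milk"}, {"bread", "butter"},
                             {"bread", "milk", "butter"}, {"milk"})]
print(lift({"bread", "milk"}, db))  # 0.5 / (0.75 * 0.75) ≈ 0.89
```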

For excellent surveys of different interestingness measures, we refer interested readers to [47, 58, 59].

2.2.3 Mining algorithms

An ideal mining algorithm is an effective and efficient one, i.e., it can identify interesting patterns within a short response time. Existing algorithms can be categorized into three groups:

1. Candidate generation-and-test algorithms. The first line of research adopts a candidate generation-and-test approach, where a sufficiently large enumeration space (hopefully) guaranteed to contain all interesting patterns is generated, and a portion of top-scoring patterns is then selected by testing the quality of each candidate. Existing algorithms of this paradigm can be further divided into the following three types:

• Exhaustive search plus pruning strategies. Exhaustive search is often used in tandem with some pruning strategies, in which the former serves effectiveness and the latter serves efficiency. A bellwether algorithm of this type is Apriori [48], also known as the level-wise algorithm. Though originally designed for association rule mining, it has since been substantially studied and extended to mine many other patterns. Basically, Apriori proceeds in a breadth-first manner, i.e., pattern candidates of size k are generated from size-(k−1) candidates, along with a pruning strategy based on an anti-monotone property: if a size-(k−1) itemset is not frequent, none of its size-k supersets can be frequent (a minimal sketch is given after this list). Other algorithms of this type mostly borrow the spirit of Apriori, but differ in the properties exploited to prune the potentially complete search space (e.g., monotonicity [60], convertible constraints [12], succinctness [60, 61], condensed representations [62]). Though equipped with pruning strategies, exhaustive enumeration may still be infeasible, especially when confronting gigantic data.

• Heuristic search. The second type explores the enumeration space heuristically, i.e., it picks the most promising branches according to a certain rule of thumb, termed a heuristic. Typical examples include hill climbing [63, 64], beam search [7, 9, 65], and evolutionary algorithms [66, 67], among others. These algorithms scale better than exhaustive search with pruning strategies, but they cannot guarantee optimality; for many of them, not even an error bound can be given (a beam search sketch also follows this list).

• Anytime algorithms. Anytime pattern mining algorithms are also enumerative methods, but exhibit the so-called anytime feature [68]: they can be interrupted at any point in time to supply patterns whose quality gradually improves over time, and hence the whole process converges to an exhaustive search if sufficient time is given, guaranteeing to return the exact result. Recently, this type of algorithm has been employed to mine frequent itemsets [69], interval patterns [70], outliers [71], and so on. Notably, the algorithm proposed by Belfodil et al. [70] can always provide a guaranteed bound on the quality difference between the best pattern found so far and the best pattern possible.

It is worth mentioning that Branch-and-bound (BnB) [72], a principal algorithmic methodology usually used to find exact solutions to combinatorial optimization problems, can be naturally adapted into an anytime one: BnB algorithms evaluate the search space gradually, decomposing a given problem into smaller subproblems (according to a certain branching rule), each of which may be further decomposed or pruned (according to a certain bounding rule); they thus find better solutions as fewer unexplored subproblems remain (the third sketch below illustrates this scheme). BnB algorithms have also been applied to mine patterns such as boxes [73], maximal cliques [74, 75], discriminative patterns [76], and so forth.
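The first sketch below is a minimal Apriori-style miner, reduced to the two essential ingredients named above: breadth-first candidate generation and anti-monotone pruning. It is a toy illustration under those assumptions, not the engineered algorithm of [48].

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining with anti-monotone pruning."""
    n = len(transactions)

    def frequent(cands):
        return {c for c in cands
                if sum(c <= t for t in transactions) / n >= min_support}

    items = {i for t in transactions for i in t}
    level = frequent(frozenset({i}) for i in items)  # frequent size-1 itemsets
    result = set(level)
    k = 2
    while level:
        # Generate size-k candidates from frequent size-(k-1) itemsets ...
        cands = {a | b for a in level for b in level if len(a | b) == k}
        # ... and prune any candidate with an infrequent size-(k-1) subset
        # (anti-monotonicity: all subsets of a frequent itemset are frequent).
        cands = {c for c in cands
                 if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = frequent(cands)
        result |= level
        k += 1
    return result

db = [frozenset(t) for t in ({"a", "b", "c"}, {"a", "b"},
                             {"a", "c"}, {"b", "c"})]
print(sorted(map(sorted, apriori(db, min_support=0.5))))
```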
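The second sketch illustrates heuristic search with a greedy beam search over the same kind of itemset language; the quality function is left abstract, and the example in the closing comment (reusing `support` and `db` from the earlier sketches) is arbitrary.

```python
def beam_search(items, quality, beam_width=2, max_size=3):
    """Greedy level-wise search keeping only the `beam_width` best
    patterns per level: fast, but with no optimality guarantee."""
    beam = [frozenset()]
    best, best_q = frozenset(), float("-inf")
    for _ in range(max_size):
        # Refine every pattern in the beam by one item, keep the top few.
        candidates = {p | {i} for p in beam for i in items if i not in p}
        beam = sorted(candidates, key=quality, reverse=True)[:beam_width]
        for p in beam:
            if quality(p) > best_q:
                best, best_q = p, quality(p)
    return best

# E.g., reusing `support` and `db` from the sketch in Section 2.2.2:
# beam_search({"a", "b", "c"}, lambda p: support(p, db) * len(p))
```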
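The third sketch adapts branch-and-bound to the same setting: the branching rule extends a pattern by each remaining item, and the bounding rule exploits the anti-monotonicity of support. The quality support(p)·|p| is chosen purely for illustration; because the incumbent best solution only improves, interrupting the search at any point yields the anytime behaviour described above.

```python
def branch_and_bound(items, transactions):
    """Maximize quality(p) = support(p) * |p| over all itemsets.
    The incumbent `best` improves monotonically, so stopping early
    still returns a valid (anytime) answer."""
    n = len(transactions)

    def support(p):
        return sum(p <= t for t in transactions) / n

    best, best_q = frozenset(), 0.0

    def explore(pattern, rest):
        nonlocal best, best_q
        q = support(pattern) * len(pattern)
        if q > best_q:
            best, best_q = pattern, q
        # Bounding rule: support is anti-monotone, so no extension of
        # `pattern` can exceed support(pattern) * (|pattern| + |rest|).
        if support(pattern) * (len(pattern) + len(rest)) <= best_q:
            return  # prune the whole subproblem
        # Branching rule: one subproblem per remaining item.
        for i, item in enumerate(rest):
            explore(pattern | {item}, rest[i + 1:])

    explore(frozenset(), sorted(items))
    return best, best_q

db = [frozenset(t) for t in ({"a", "b", "c"}, {"a", "b"},
                             {"a", "c"}, {"b", "c"})]
print(branch_and_bound({"a", "b", "c"}, db))  # (frozenset({'a', 'b'}), 1.0)
```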

2. Pattern-growth algorithms. More often than not, candidate set generation is costly. To circumvent this, the second main line of research follows a pattern-growth paradigm [77–80], led by Han et al. [77] in handling frequent pattern mining. The core of this paradigm is the construction of a highly compact data structure, e.g., the frequent pattern tree (FP-tree) [77], which stores compressed, crucial information about frequent patterns. A pattern-growth method is then performed in a divide-and-conquer manner: the database is partitioned and projected based on the currently discovered frequent patterns, and new, longer patterns are attained directly by growing discovered ones, through a traversal of that compact data structure. Hence, unlike candidate generation-and-test approaches, which require many scans of the entire database (the k-th scan checks the frequency of each size-k candidate), pattern-growth approaches need only two scans (to construct the compact data structure); the remaining steps mine the compact structure rather than the (usually substantially larger) original data, which saves a huge cost.
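To convey the divide-and-conquer recursion, the following simplified sketch mines frequent itemsets by recursing on explicit projected (conditional) databases; the actual FP-growth algorithm [77] performs the same recursion over the compressed FP-tree rather than over raw transaction lists, which is where its efficiency comes from.

```python
def pattern_growth(transactions, min_count, suffix=frozenset()):
    """Mine frequent itemsets by divide-and-conquer: for every frequent
    item, recurse on the database projected on that item."""
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    result = {}
    for item, count in counts.items():
        if count >= min_count:
            pattern = suffix | {item}
            result[pattern] = count
            # Conditional (projected) database: transactions containing
            # `item`, restricted to items that come after it in a fixed
            # order, so every itemset is grown exactly once.
            projected = [frozenset(i for i in t if i > item)
                         for t in transactions if item in t]
            result.update(pattern_growth(projected, min_count, pattern))
    return result

db = [frozenset(t) for t in ({"a", "b", "c"}, {"a", "b"},
                             {"a", "c"}, {"b", "c"})]
print(pattern_growth(db, min_count=2))  # each frequent itemset with its count
```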

3. Algorithms applying the sampling strategy. Recent pattern mining approaches set up a third and elegant paradigm, relying on a (controlled) sampling strategy [81–84]. More specifically, algorithms falling under this category design an efficient sampling procedure to access the pattern language L, i.e., they simulate a distribution s : L → [0, 1] that reflects the corresponding interestingness measure m, e.g., s(·) = m(·)/Z where Z is a normalizing constant. This makes it possible to obtain a pattern collection whose size is under control and which is representative of the distribution s, and hence of the underlying interestingness m, without expensive candidate set generation. However, it is still quite likely to draw an uninteresting pattern, because the distribution s is long-tailed: there are far more uninteresting patterns than interesting ones.
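The sketch below realizes this idea in the only way a toy example allows: it enumerates a tiny pattern language explicitly and draws itemsets with probability proportional to their support, i.e., s(x) = m(x)/Z. Practical samplers [81–84] achieve the same target distribution without ever materializing L, which is precisely their appeal; the enumeration here is for illustration only.

```python
import random
from itertools import combinations

def sample_patterns(transactions, k, rng=random.Random(0)):
    """Draw k itemsets with probability proportional to their support,
    i.e. s(x) = m(x)/Z; the division by Z is implicit in `choices`.
    Enumerating the language only works for toy data."""
    items = sorted({i for t in transactions for i in t})
    language = [frozenset(c)                      # all non-empty itemsets
                for r in range(1, len(items) + 1)
                for c in combinations(items, r)]
    weights = [sum(x <= t for t in transactions) for x in language]  # m(x)
    return rng.choices(language, weights=weights, k=k)

db = [frozenset(t) for t in ({"a", "b", "c"}, {"a", "b"},
                             {"a", "c"}, {"b", "c"})]
print(sample_patterns(db, k=5))
```

Running such a sampler repeatedly also exhibits the long tail noted above: although each low-support itemset is individually unlikely, such itemsets vastly outnumber the high-support ones and so dominate the draws in aggregate.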