Interval support and confidence: - Summarizing Data Cubes Using Blocks

Summarizing Data Cubes Using Blocks

Definition 9. Interval support and confidence:

( , ) count b m conf b m

= b

As argued in the introductory section, considering separately all measure values present in the cube can lead to consider nonrelevant blocks, which will be very small. For instance, in a cube containing billions of cells and where the measure values range from 1 to 1,000 with very few repetitions, almost 1,000 values have to be considered sepa-rately. Alternatively, in this case, 5 and 5.2 are likely to be considered as similar measure values and thus, should be processed as such.

In order to take this important point into account, we propose two ways to build blocks, based on intervals of measure values, on the one hand, and on fuzzy intervals, on the other hand.

In these cases, the support and the confidence of a block are defined as follows.

Definition 9. Interval support and confidence:

The interval support of a block b in C for a measure value interval [m1;m2] is defined as:

( ,[ 1, 2]) _ ( ,[ 1, 2]) iCount b m m i sup b m m

= C

where iCount(b, [m1;m2]) is the number of m-cells in b such that m ∈ [m1;m2]. Similarly, the interval confidence of b for [m1;m2] is defined as:

( ,[ 1, 2]) _ ( ,[ 1, 2]) iCount b m m i conf b m m

= b

When considering fuzzy intervals instead of intervals, counting cells in a block b with respect to a fuzzy interval j can be computed according to the following methods (Dubois, Hülermeier &

Prade, 2003):

1. The Σ-count sums up the membership de-grees of all cells of b.

2. The threshold-count counts those cells of b whose membership degree is greater than a user-defined threshold.

3. The threshold-Σ-count sums up those cell membership degrees that are greater than a user-defined threshold.

In what follows, given a fuzzy interval j and a cell c with content m, we denote by µ(c, j) the membership value of m in j. Moreover, given a block b, ^f

c b∈

∑

(c, j) denotes the count of cells in b whose content is in j, according to one of the three counting methods mentioned above. In this case, the support and confidence of a block b are defined as follows.

Definition 10: Fuzzy Support and Confidence

The fuzzy support of a block b in C for a fuzzy interval j is defined as:

( , )

We first show that the support is anti-monotonic with respect to ∠, in either of the three cases defined above.

Proposition 1: For all blocks b and b’ such that b ∠ b’, we have: given above, fCount(b, j) ≤ fCount(b’, j).

Therefore, the proposition follows from the definitions of the support, which completes the proof.

In our levelwise algorithm for computing blocks given in the next section, proposition 1 is used in the following way in the case of single mea-sure values (the other two cases being similar):

given a block b, a measure value m and a support threshold σ, b is not σ-frequent for m if there exists a block b’ such that b’ ∠ b and b’ is not σ-frequent for m.

The following proposition shows that the sup-port (respectively confidence) of blocks based on intervals and fuzzy intervals are greater than the support (respectively confidence) blocks based on single values.

Proposition 2: For every block b and every mea-sure value m, let m1 and m2 be meamea-sure values such that m ∈ [m1;m2], and let j be a fuzzy interval such that kernel(j) = [m1;m2]. Then, for any of the three fuzzy counting methods Σ^f, we have:

Count(b, m) ≤ iCount(b, [m1;m2]) ≤ fCount(b, j) c contributes to 1 in Count(b, m), iCount(b, [m1;m2]) and fCount(b, j). Similarly, if m_C(c) ≠ m but m_C(c) ∈ [m₁;m₂] then c contributes to 1 in iCount(b, [m1;m2]) and fCount(b, j). Finally, if m_C(c) ∉ [m₁;m₂] then it might be that c contributes to less than 1 only in fCount(b, j). Therefore, the proof is complete.

AlGorIthMs

In this chapter, our goal is to discover blocks whose support and confidence are greater than user specified thresholds. To this end, our method is based on a levelwise Apriori-like algorithm (Agrawal, Imielinski, & Swami, 1993), for scal-ability reasons. In this section, we first present the algorithms for the discovery of blocks in the case of single values, and then we show how the cases of (fuzzy) interval based measure values can be processed.

Roughly speaking, in the case of single mea-sure values, our method works as follows: for every single measure value m in C do the following:

1. For every i = 1,..., k, compute all maximal intervals I of values in dom_i such that, for every v in I, the slice T(v) is σ-frequent for m.

2. Combine the intervals in a levelwise manner as follows: at level l (2 ≤ l ≤ k), compute all σ-frequent blocks b = I₁x...x I_k such that ex-actly l intervals defining b are different than ALL. Assuming that all blocks σ-frequent for m have been computed at the previous levels, this step can be achieved in much the

same way as frequent itemsets are computed in the algorithm Apriori.

3. Considering the set of all blocks computed in the previous step, sort out those that are not MS-blocks and those having a confidence for m less than γ.

It should be clear from definition 7 that a block can be frequent only if it contains at least σ.|C| cells.

Similarly, it follows from definition 8 that, for a given confidence threshold γ, a block b is output only if it contains at least γ.|b| cells containing the measure value m. Therefore, when fixing the support and the confidence thresholds, the user actually determines thresholds concerning the size and the purity of the blocks the user wishes to obtain.

block Generation for single Measure values

In the following algorithms, MS- or MG-blocks are computed according to the user’s specification.

Algorithm 1 performs step 1, while algorithm 2 performs steps 2 and 3.

Referring to the cube of example 1, the sup-ports for measure value 8 of all slices of the cube is displayed in Figure 3, while Figure 1 depicts all blocks output by algorithm 2. These blocks

Note that there are two overlappings: one between b2 and b3, and one between b3 and b4.

Processing Interval-based blocks In this section, we consider the computation of blocks when intervals and fuzzy intervals are considered, instead of single measure values. In

this case, the following modifications must be made in the two algorithms given previously:

1. The supports and confidences must be computed accordingly. That is, in the case of interval based measure values, sup and conf must be replaced by i_sup and i_conf, respectively, and in the case of fuzzy interval based measure values, sup and conf must be replaced by f_sup and f_conf, respec-tively.

2. In algorithm 2, the most outer loop must range over the set of intervals (fuzzy or not according to the considered case), instead of over the set of all single measure values.

On the other hand, when considering intervals or fuzzy intervals, a preprocessing task must be ap-plied on the data in order to discretize the measure values into (fuzzy) intervals. This discretization process can be automatically performed, provided that the user defines the number of intervals she wants to consider.

Denoting by N this number of intervals, and assuming that m_b (respectively m_t) is the bottom value (respectively the top value) of the measure values, [m_b;m_t] can be divided into N intervals either in an equi-width manner (i.e., the widths of all intervals are equal), or in an equi-depth man-ner (i.e., all intervals cover the same number of cells). These intervals are denoted by [bot_i;top_i] for i = 1,...,N, and we note that for every i = 2,...,N, bot_i = top_i_{- 1}.

Then, in the case of fuzzy intervals, we con-sider N trapezoidal membership functions μ₁,..., μ_N such that:

Algorithm 1.

Figure 3. Occurences of measure value 8

It should also be noted from proposition 2 that blocks based on intervals and fuzzy intervals of measure values are larger than blocks based on single values.

complexity Issues

In this section, we show that our method for com-puting blocks is polynomial in time with respect to the size of the cube C, but not complete, in the sense that blocks that fulfill the threshold condi-tions might not be output by our algorithms.

Let m be a measure value. In algorithm 1, the cube is scanned once for each dimension d_i. Thus, this step requires k scans of the cube C. Regard-ing the complexity of algorithm 2, at each level, the whole cube is scanned at most once, since the intervals produced by algorithm 1 for a given measure value do not overlap. As in algorithm

2, at most k iterations are processed, its execu-tion requires at most k scans of the cube C. As a consequence, the computation of all frequent blocks associated to a given measure value m is in O(k.|C|). As computing the confidence of a block does not require the scanning of the cube (because the size of a block is the product of the sizes of the intervals defining this block), the time complexity of algorithm 2 is in O(k.|C|).

Hence, in the case of single measure values, the computation of all blocks is in O(k.|C|²), because C contains at most |C| distinct measure values. We note that although polynomial, the complexity of our method is not linear.

On the other hand, in the cases of (fuzzy) in-tervals of measure values, if we denote by N the number of these intervals, then the computation of blocks is in O(k.|C|.N), that is, linear with respect to the size |C| of C. The experiments reported in Algorithm 2. Discovery of MS-blocks

section 5 show the influence of the three factors (i.e., k, |C| and N) on the computation time.

As mentioned in the introductory section, the general problem of computing blocks of in a cube is known to be NP-hard, and the polyno-mial method proposed in the chapter computes an approximation of the solution, meaning that our method is not complete. As an example, we recall from example 1 that the block b = [C1;C2]

x [P1;P2] is not an output, although its support and confidence are greater than the support and confidence thresholds, respectively. This is due to the fact that the slice T(C2) is not frequent for 6, which implies that the interval [C1;C2] is not computed by algorithm 1.

However, the following example shows that this is not the only reason for noncompleteness.

More precisely, even if completeness would mean the computation of all blocks b such that, for a given measure value m:

1. sup(b, m) ≥ σ and conf(b, m) ≥ γ , and 2. (∀i = 1,..., k)(∀v ∈ δ_i)(sup(T(v), m) ≥ σ ) then example 2 shows that our method is not complete either.

Example 2: Consider the cube C as shown in Figure 4 in which blocks are to be computed according to support and confidence thresholds equal to 1/15 and 80%, respectively.

Using algorithm 1 for the measure value 8, we obtain the intervals [C1;C6] and [P1;P4] because

all slices contain at least two 8-cells. However, since the confidence for 8 of the block [C1;C6] x [P1;P4] is equal to 12/24, which is less than 80%, this block is not computed by algorithm 2. On the other hand, it is easy to see that the two 8-blocks [C1;C3] x [P3;P4] and [C4;C6] x [P1;P2] satisfy the two items above.

In the next section, we study how to modify algorithm 1 so as to take into account situations as in example 2. Then we study the completeness of this modified method.

refInInG the coMPutAtIon of blocks

In this section, we take into account the cell neigh-borhoods in order to enhance the completeness of our method. Intuitively, cells are considered as neighbors if they share one side in the representa-tion. For instance, in Figure 4 the cells 〈P2, C3, 5〉 and 〈P2, C4, 8〉 are neighbors.

cell neighborhood

Definition 11. Neighbor: Two distinct cells c =

〈v₁,..., v_k, m〉 and c’ = 〈v’₁,..., v’_k, m’〉 are neighbors if there exists a unique i₀ in {1,..., k} such that:

• |rep_i0(v_i0) - rep_i0(v’_i0)| = 1 and

• for every i = 1,..., k such that i ≠ i₀, v_i = v_i’.

PRODUCT

P1 5 5 5 8 8 8

P2 5 5 5 8 8 8

P3 8 8 8 6 6 6

P4 8 8 8 6 6 6

C1 C2 C3 C4 C5 C6 CITY

Figure 4. The cube for example 2

136

We note that in a k-dimensional cube, a cell has at most 2.k neighbors. Moreover, considering a slice T(v) where v is a member value of the domain dom_i, we denote by v- and v+ the member values of dom_i such that rep_i(v-) = rep_i(v) - 1 and rep_i(v+)

= rep_i(v) + 1, respectively.

Clearly, every cell c in T(v) has exactly one neighbor in each of the slices T(v-) and T (v+).

As will be seen next, the refined computation of blocks requires to count, for every cell c, the number of neighbors of c that contain the same value. Clearly, this counting depends on whether we consider single measure values or intervals or fuzzy intervals. The following definition states how neighbors of a cell are counted in each of these cases.

In order to state this definition, we recall from Kaufmann (1973) that assessing the fact that the contents of two cells c and c’ both belong to a given fuzzy interval ϕ, denoted by (c, ϕ) ⊗ (c’, ϕ), can be done according to the following t-norms:

• Probalistic t-norm: μ ((c,(c, ϕ) ⊗ (c’, ϕ)) = μ (c,μ (c,(c,

Definition 12. Neighborhood counting: Let v be a member value.

1. If single measure values are considered, then for every such measure value m, we denote by n(v-, m), respectively n(v+, m), the number of m-cells in T(v) whose neighbor in T(v-), respectively in T(v+), is also an m-cell. Then neighbors(v-, m) and neighbors(v+, m) are defined as follows:

2. If intervals of measure values are considered, then for every such interval [m1;m2], given a member value v, we denote by i_n(v-, [m1;m2]), respectively i_n(v+, [m1;m2]), the number of cells in T(v) whose content is in [m1;m2] and whose neighbor in T(v-), respectively in T(v+), is a cell whose content is in [m1;m2].

Then, i_neighbors(v-, [m1;m2]) and i_neighbors (v+, [m1;m2]) are defined as follows:

( -, [ 1; 2]) given one of the above t-norms, a member value v and a fuzzy interval ϕ, we denote by f_neighbors(v-, ϕ) and f_neighbors(v+, ϕ) the following:

Intuitively, neighbors(v-, m) and neighbors(v+, m) are respectively the ratios of m-cells in a given slice having m-cells as neighbors in the previousneighbors in the previous in the previous slice, respectively in the next slice. The same

remark also holds in the cases of intervals and fuzzy intervals.

Modified Computation of Blocks Based on the notation previously introduced, our method works roughly as follows: We assume that, in addition to the support and confidence thresholds, we are given a new threshold called the neighbor threshold, denoted by ν. When scan-ning dimension i for a given measure value m, let v be the member in dom_i of which the slice is being considered. If an interval of the form [V;NIL] where V ≠ NIL is under construction and

if the support of T(v) is greater than the support threshold, then:

• If neighbors(v-, m) < ν, then the interval [V;v-] is output and the computation of the new interval [v;NIL] is considered.

• If neighbors(v+, m) < ν, then the interval [V;v] is output and the computation of the new interval [v+;NIL] is considered.

• Otherwise, the next slice, that is, the slice defined by v+, is considered for the interval [V;NIL].

Algorithm 3.

The corresponding algorithm is given below and is referred to as algorithm 3. In the remainder of the chapter, we call algorithm 2.1 the algorithm obtained from algorithm 2 by replacing algorithm 1 with algorithm 3 for the computation of L₁. It should also be clear that in the case of intervals (respectively fuzzy intervals) of measure values, in algorithm 3, sup and neighbors should respectively be replaced by i_sup and i_neighbors (respectively by f_sup and f_neighbors).

Example 3: We illustrate algorithm 3 using the cube of Figure 4 and we consider the same thresholds as in example 2, that is: σ = 1/15 and γ = 80%. Moreover, let the neighbor threshold ν be 60%.

In this case, for dimension CITY and measure value 8, algorithm 3 starts with [NIL;NIL] as the value for currentInterval and processes the first slice T(C1). As its support is greater than 1/12, and as j = 1, only n+ is computed, and is found equal to 1. Therefore the value of currentInterval is set to [1;NIL] and the next slice, that is, T(C2) is processed. In this case, n+ and n- are computed and both are found equal to 1. Thus, the slice T(C3) is processed.

At this stage, we find n- = 1 and n+ = 0. Since 0 ≤ ν, the interval [C1;C3] is output, the value of currentInterval is set to [C4;NIL] and the slice T(C4) is processed. Now, we find n- = 0 and n+

= 1. As in this case, j = α in algorithm 3, no com-putation is done and the slice T(C5) is processed,

which does not lead to any change. Finally the processing of the slice T(C6) results in the interval [C4;C6], since |dom_CITY| = 6.

It can be seen that, for dimension PRODUCT and measure value 8, the computation is similar and outputs the two intervals [P1;P2] and [P3;P4].

Therefore, for measure value 8, we obtain L1(8)

= {[C1;C3], [C4;C6], [P1;P2], [P3;P4]}.

Regarding the computation of the blocks, it is easy to see that, in this example, algorithm 2.1 computes the two blocks [C1;C3] x [P3;P4] and [C4;C6] x [P1;P2], since their confidence is 1.

completeness Properties

In this section, we study the completeness of our approach in the case of single measure values.

In particular, we show that if we consider a cube that can be partitioned into nonoverlapping blocks containing the same measure value, then algorithm 2.1 computes these blocks.

First, we show that, for limit thresholds, our approach is complete for any cube C, in the sense that Algorithm 2.1 outputs blocks that represent exactly the content of C. In what follows, we call limit thresholds:

• A support threshold such that 0 ≤ σ <

1/|C|,

• A confidence threshold such that γ = 1,

• A neighbor threshold such that ν = 1.

PRODUCT

P1 5 5 5 5 6 6

P2 5 8 5 5 6 6

P3 5 8 8 5 6 6

P4 8 8 8 8 6 6

C1 C2 C3 C4 C5 C6 CITY

Figure 5. The cube for example 4

Before giving the corresponding theorem, we note that considering limit thresholds implies the following:

• A block b is frequent for m if and only if b contains at least one m-cell.

• A block b such that conf(b, m) > γ contains only m-cells.

• If in algorithm 3, L₁(m) is computed accord-ing to ν = 1, then for every m-cell c in a block b returned by algorithm 2.1, all neighbors of c that belong to b are also m-cells.

Now, the completeness of our approach can be stat-ed as follows, in the case of limit thresholds.

Theorem

Let C be a k-dimensional cube. Then for limit thresholds, algorithm 2.1 outputs a set of blocks B such that for every cell c = 〈v₁,..., v_k, m〉 in C, there exists one and only one block b in B associ-ated with m that contains c.

• Proof: Let us first consider a cell c = 〈v₁,..., v_k, m〉 in C. Since we assume that 0 ≤ σ ≤ 1/|C|, every slice T(v_i) is frequent for m.

Therefore, according to algorithm 3, each v_i is in an interval δ_i of L₁(m). Let us consider the block b = δ₁ x...x δ_k that, clearly, contains c.

Since we assume limit thresholds, all neigh-bors of c in b are m-cells, and thus, so are all cells of b. As a consequence, conf(b, m) = 1 and thus b is output by algorithm 2.1, which shows that there exists at least one block b in B associated with m that contains c.

Assuming that two such blocks b and b’ can exist implies that b and b’ overlap and that they both contain only m-cells. However, this situation cannot happen because for any given measure value m, algorithm 3 computes nonoverlapping intervals. So, the proof is complete.

We note that, although Theorem 1 shows an important theoretical feature of our approach, its impact in practice is of little relevance. Indeed, as shown in the following example, in the worst case, algorithm 2.1 outputs blocks that are reduced to one single cell. However, the following example also shows that, with realistic threshold values, our approach can compute relevant blocks, even if in C, the measure values are not displayed in the form of blocks.

Example 4: Let us consider the 2-dimensional cube C of Figure 5 and limit thresholds, for instance σ = 0 and γ = ν = 1. In this case, for measure values 5 and 8, algorithm 2.1 computes blocks that are reduced to one single cell, whereas for the measure value 6, algorithm 2.1 computes the block [C5;C6] x ALL_PRODUCT .

To see this, let us consider the computation of L1(8) for dimension CITY. First, for σ = 0, every slice T(Ci), i = 1,...,6, is frequent for 8. Moreover, for all slices, neighbors(Ci-, 8) or neighbors(Ci+, 8) are less than 1. Therefore, algorithm 3 computes the intervals [Ci;Ci], for i = 1,..., 6. For dimension PRODUCT, a similar computation is produced and we obtain:

L₁(8) = {[Ci;Ci] | i = 1,..., 6} ∪ {[Pj;Pj] | j = 1,..., 4}

Then, when combining these intervals, algorithm 2.1 outputs each 8-cell of C as a block.

It can be seen that a similar computation is done for the measure value 5, whereas, for measure value 6 the block [C5;C6] x ALL_PRODUCT is returned by algorithm 2.1.

Now, we would like to emphasize that if we consider nonlimit thresholds, our approach computes relevant blocks, even in the case of this example. Indeed, let us consider as in example 2, σ = 1/15, γ = 80% and ν = 60%. Then, algorithm 3 returns the following:

L₁(5) = {[C1;C1], [C3;C4], [P1;P3]},

L₁(6) = {[C1;C1], [P1;P4]}, L₁(8) = {[C2;C3], [P3;P4]}.

Applying algorithm 2.1, we obtain the following blocks:

• For measure value 5: [C1;C1] x [P1;P3] and [C3;C4] x [P1;P3].

• For measure value 6: [C5;C6] x [P1;P4].

• For measure value 8: [C2;C3] x [P3;P4].

We note that, in this case, two blocks overlap and that only two cells (namely 〈C2, P1, 5〉 and 〈C4,

Dans le document Data Mining Patterns: New Methods and Applications (Page 147-166)