• Aucun résultat trouvé

Closed sets based discovery of association rules

N/A
N/A
Protected

Academic year: 2021

Partager "Closed sets based discovery of association rules"

Copied!
9
0
0

Texte intégral

(1)

HAL Id: hal-00467749

https://hal.archives-ouvertes.fr/hal-00467749

Submitted on 26 Apr 2010

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Closed sets based discovery of association rules

Nicolas Pasquier

To cite this version:

Nicolas Pasquier. Closed sets based discovery of association rules. GDR-I3’1999 Meeting, Mar 1999, Marseille, France. pp.55-64. �hal-00467749�

(2)

Association Rules

Nicolas Pasquier

Laboratoire d'Informatique (LIMOS)

Université Blaise Pascal - Clermont-Ferrand II

Complexe Scientique des Cézeaux

24 Avenue des Landais, 63177 Aubière Cedex France

pasquier@libd1.univ-bpclermont.fr

Plan of the Presentation

1 Association rule framework

2 Existing algorithms

3 A-Close algorithm

4 Illustration

5 Experimental results

6 Conclusion

7 Present work

(3)

1 Association Rules

 Data mining context (dataset)



binary relation ROI



O : nite set of objets (transactions)



I : nite set of items (attributes)

OID Items 1 A C D 2 B C E 3 A B C E 4 B E 5 A B C E

Figure 1: The example data mining context D

 Itemset (set of items) support



proportion of objects containing the itemset

support(BC)=k2;3;5k=5=3=5

 Association rules



implications between two itemsets

r :BC!E (support%;confidence%)

 Association rule support



support of the union of antecedent and consequent of the rule

support(r)=support(BCE)=k2;3;5k=5=3=5

 Association rule condence



proportion of objects verifying the implication

confidence(r)=support(BCE)=support(BC)=1

(4)

2 Existing Algorithms

 Problem decomposition

1. determination of frequent itemsets (supportminsupport)

2. generation of association rules using frequent itemsets (confidenceminconfidence)

 The problem of extracting association rules is reduced to the problem of discovering

fre-quent itemsets

 Pruning subset latticeL

I to extract frequent itemsets

A B C D E Ø A B A C A D A E B C B D B E C D C E D E A B D E B C D E A B C E A B C D A C D E A B C D E A B C A B D A B E A C D A C E A D E B C D B C E B D E C D E

Frequent itemset (minsupport= 2/5) Infrequent itemset

Figure 2: Subset lattice of D

 Size is exponential kL

I

k=2

(5)

3 A-Close Algorithm

 Closure operator of the Galois connection of a binary relation

 Closed itemset: maximal set of items common to a set of objects

ex: BC is not closed sinceObjets(BC)=2;3;5 but Items(2;3;5)=BCE

 Problem decomposition

1. discovering frequent closed itemsets

2. deriving frequent itemsets from frequent closed itemsets 3. generating association rules using frequent itemsets

 The problem of extracting association rules is reduced to the problem of discovering

fre-quent closed itemsets

 Closed itemset properties

i)

all maximal frequent itemsets are maximal frequent closed itemsets

ii)

the support of a non-closed itemset is equal to the support of its closure

iii)

the maximal frequent closed itemsets characterise all frequent itemsets

 Pruning closed itemset latticeL

C to extract frequent closed itemsets

A B C D E A B C E A C B E C A C D B C E Ø

Frequent closed itemset (minsupport= 2) Infrequent closed itemset

Figure 3: Closed itemset lattice ofD

 Determining minimal generator itemsets of all frequent closed itemsets



generators of a closed itemset: itemsets for which closure is the closed itemset



X is a minimal generator itemset if 8X

0

X;support(X)6=support(X 0

)

 Closure of an itemset is the intersection of all objects containing it

(6)

Algorithm 1

A-Close frequent closed itemset discovery

1. G1 {frequent 1-itemsets}; // scan

D

2.

for

(i 2;G

i:generators

6=;i++)

do

3. Gi join generators in Gi 1;

4. Test presence of subsets(Gi) in Gi 1;

5. Determine support(Gi); // scan

D

6. Prune infrequent generators in Gi;

7. Prune non-minimal generators in Gi; // level variable i-1

8.

end

9. Determine closures(S Gi); // scan D Support-Count ! G 1 Generator Support {A} 3/5 {B} 4/5 {C} 4/5 {D} 1/5 {E} 4/5 Pruning infrequent generators ! G 1 Generator Support {A} 3/5 {B} 4/5 {C} 4/5 {E} 4/5 AC-Generator ! G 2 Generator Support {AB} 2/5 {AC} 3/5 {AE} 2/5 {BC} 3/5 {BE} 4/5 {CE} 3/5 Pruning ! G 2 Generator Support {AB} 2/5 {AE} 2/5 {BC} 3/5 {CE} 3/5 AC-Closure ! G 0

Generator Closure Support

{A} {AC} 3/5 {B} {BE} 4/5 {C} {C} 4/5 {E} {BE} 4/5 {AB} {ABCE} 2/5 {AE} {ABCE} 2/5 {BC} {BCE} 3/5 {CE} {BCE} 3/5 Pruning ! Answer: FC Closure Support {AC} 3/5 {BE} 4/5 {C} 4/5 {ABCE} 2/5 {BCE} 3/5

(7)

4 Experimental Results

 Synthetic data: execution times



weakly correlated data: nearly all frequent itemsets are closed



additional time for A-Close in T20I6D100K (0.5%,0.33%): closure computations

 Census data: C20D10K



correlated data: few frequent itemsets are closed



closure mecanism allows to skip some iterations and consider less candidates

 Census data: C73D10K



dierences between execution times can be measured in hours



maximal execution times: Apriori 14h, A-Close 1h15

5 Conclusion

 Correlated data



dicult cases: long execution times



few frequent itemsets are closed: A-Close is particularly ecient

statistical data, medical data, text data, etc.

 Weakly correlated data



nearly all frequent itemsets are closed



acceptable execution times

(8)

0 5 10 15 20 25 30 35 40 2% 1.5% 1% 0.75% 0.5% 0.33% Time (seconds) Minimum support A-Close Apriori

Execution times on T10I4D100K

0 100 200 300 400 500 600 700 800 0.33% 0.5% 0.75% 1% 1.5% 2% Time (seconds) Minimum support A-Close Apriori

Execution times on T20I6D100K Figure 5: Performance of Apriori and Close on synthetic data

0 500 1000 1500 2000 2500 3000 20% 15% 10% 7.5% 5% 4% 3% Time (seconds) Minimum support A-Close Apriori Execution times 10 11 12 13 14 15 16 3% 4% 5% 7.5% 10% 15% 20% Number of passes Minimum support A-Close Apriori

Number of database passes Figure 6: Performance of Apriori and Close on census data C20D10K

0 10000 20000 30000 40000 50000 60000 70% 75% 80% 85% 90% Time (seconds) Minimum support A-Close Apriori Execution times 10 12 14 16 18 70% 75% 80% 85% 90% Number of passes Minimum support A-Close Apriori

Number of database passes Figure 7: Performance of Apriori and Close on census data C73D10K

(9)

6 Present Work

 Problem of the understandability and usefulness of association rules extracted

 Discovering small covers for association rules



small informative and structural cover for exact association rules



small informative cover for approximate association rules



small structural cover for approximate association rules

Dataset Minimum Minimum Total Informative Structural

support condence rules cover cover

T10I4D100K 0.5% 90% 16,260 3,511 916

C73D10K 90% 90% 2,053,896 4,104 941

Mushrooms 50% 50% 1,248 87 44

Figure 8: Preliminary experimental results

References

[1] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Ecient mining of association rules using closed itemset lattices. Journal of Information Systems. To appear.

[2] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Pruning closed itemset lattices for association rules. Proceedings of the 14th BDA Conference on Advanced Databases, pages 177196, October 1998.

[3] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. Proceedings of the 7th ICDT Int'l Conference on Database Theory, pages 398416, January 1999.

Figure

Figure 1: The example data mining context D
Figure 2: Subset lattice of D
Figure 3: Closed itemset lattice of D
Figure 4: A-Close frequent closed itemset discovery in D for minsup = 2/5 (40%)
+2

Références

Documents relatifs

A set of labeled intervals is obtained from a multivariate time series and mined for patterns with a sliding window to restrict pattern length.. Then patterns are filtered by

In this paper, we proposes the ClosedPattern global constraint to capture the closed frequent pattern mining problem without requiring reified constraints or extra variables..

The paper is organized as follows: Section 2 dis- cusses related works, while Section 3 describes our ap- proach in detail, including how to encode images as data mining

In particular, in radix-2 floating-point arithmetic, we show that among the set of the algo- rithms with no comparisons performing only floating- point additions/subtractions, the

In the next section, we describe the interface through two use cases: a student using the tool to attend a distant lecture in real time, and a teacher giving the lecture to

The paper estab- lishes a surprisingly simple characterization of Palm distributions for such a process: The reduced n-point Palm distribution is for any n ≥ 1 itself a log Gaussian

Under idealized assumptions such as negligible node to node communication cost and uniform data item size, we have developed a framework for data redistribution across

Negotiations to remove tariffs on Environmental Goods (EGs) and Environmental Services (ESs) at the Doha Round, followed by negotiations towards an Environmental Goods Agreement