Closed sets based discovery of small covers for association rules


HAL Id: hal-00467748

https://hal.archives-ouvertes.fr/hal-00467748

Submitted on 26 Apr 2010


Closed sets based discovery of small covers for

association rules

Nicolas Pasquier, Yves Bastide, Rafik Taouil, Lotfi Lakhal

To cite this version:

Nicolas Pasquier, Yves Bastide, Rafik Taouil, Lotfi Lakhal. Closed sets based discovery of small covers

for association rules. BDA’1999 international conference on Advanced Databases, Oct 1999, Bordeaux,

France. pp.361-381. ⟨hal-00467748⟩

for Association Rules

Nicolas Pasquier, Yves Bastide, Rafik Taouil, Lotfi Lakhal
Laboratoire d'Informatique (LIMOS)
Université Blaise Pascal - Clermont-Ferrand II
Complexe Scientifique des Cézeaux
24 Avenue des Landais, 63177 Aubière Cedex, France
{pasquier,bastide,taouil,lakhal}@libd2.univ-bpclermont.fr

Abstract

In this paper, we address the problem of the usefulness of the set of discovered association rules. This problem is important since real-life databases most often yield several thousands of rules with high confidence. We propose new algorithms based on Galois closed sets to reduce the extraction to small covers, or bases, for exact and approximate rules. Once frequent closed itemsets (which constitute a generating set for both frequent itemsets and association rules) have been discovered, no additional database pass is needed to derive these bases. Experiments conducted on real-life databases show that these algorithms are efficient and valuable in practice.

Keywords: data mining, Galois closure operator, frequent closed itemsets, bases for association rules, algorithms.

1 Introduction and Motivation

Data mining has been extensively studied in recent years, especially the problem of discovering association rules. The aim when discovering association rules is to exhibit relationships between data items (or attributes) and to compute the precision of each relationship in the database. The usual precision measures are support and confidence [1], which indicate the proportion of database transactions (or objects) upholding each rule. When an association rule has support and confidence exceeding some user-defined minimum thresholds, the rule is considered relevant and the extracted knowledge is likely to be used for supporting decision making. A classical example of association rules fits in the context of market basket data analysis and highlights a particular feature of customer behavior: 80% of customers who buy cereals and sugar also buy milk.

Since the problem was stated [1], various approaches have been proposed to increase the efficiency of rule discovery [2, 4, 8, 17, 23, 24, 26, 30, 33]. However, taking full advantage of the exhibited knowledge requires the ability to handle such knowledge. In fact, using a synthetic dataset containing 100,000 objects, each of which encompasses around 10 items, our experiments yield more than 16,000 rules with confidence exceeding 90%. The problem is much more critical when the collected data is highly correlated or dense, as in statistical or medical databases. For instance, when applied to a census dataset of 10,000 objects, each of which is characterized by the values of 73 attributes, experiments result in more than 2,000,000 rules with support and confidence exceeding 90%.

Thus the issue at hand could be rephrased as follows: which relevant knowledge can be learned from several thousands of highly redundant rules? Which aid could be offered to users for handling countless rules and focusing on useful ones? Before explaining how our approach answers the previous questions,

Among approaches addressing the described issue, two main trends can be distinguished. The former provides users with mechanisms for filtering rules. In [3, 16], the user defines templates, and rules not matching them are discarded. In [22, 29], boolean operators are introduced for selecting rules including (or not) given items. A similar approach, expanded with a measure of usefulness of extracted rules called improvement, is proposed in [5]. In [21], an SQL-like operator called MINE RULE, allowing the specification of general extraction criteria, is proposed. The quoted approaches operate "a posteriori", i.e. once a huge amount of rules has been extracted, querying facilities make it possible to handle rule subsets selected according to the user preferences. In contrast, the second trend addresses the problem with an "a priori" vision, by attempting to minimize the number of exhibited rules. In [14, 28], information about taxonomies is used to define criteria of interest which apply for pruning redundant rules. In [7, 25], statistical measures such as Pearson's correlation or the chi-squared test are used instead of the confidence measure.

1.2 Contribution: an Overview

The approach presented in this paper belongs to the second trend since it aims to extract not all possible rules but a subset called small cover or basis for association rules. When computing such a basis, redundant rules are discarded since they do not convey relevant knowledge. Such a pruning operation is a key step during rule extraction, and significantly reduces the resulting set. For example, experiments performed using a real-life dataset describing characteristics of mushrooms yield the 9 following association rules with free gills in the antecedent and eatable in the consequent, and with common support (51%) and confidence (54%).

1) free gills → eatable
2) free gills → eatable, partial veil
3) free gills → eatable, white veil
4) free gills → eatable, partial veil, white veil
5) free gills, white veil → eatable
6) free gills, white veil → eatable, partial veil
7) free gills, partial veil → eatable
8) free gills, partial veil → eatable, white veil
9) free gills, white veil, partial veil → eatable

Among these rules, 8 are redundant because they can be deduced from the 4th rule: free gills → eatable, partial veil, white veil. Moreover, since rules unexpected by the user are important [18, 27], presenting a list of rules covering all the frequent items in the dataset is also needed.

First, using the closure operator of the Galois connection [6], we characterize frequent closed itemsets [23, 24]. Then, we show that frequent closed itemsets represent a generating set for both frequent itemsets and association rules. The underlying theorem states the foundations of our approach since it makes it possible to generate the bases from frequent closed itemsets while avoiding the handling of large sets of rules. We propose two new algorithms: the former derives frequent closed itemsets from frequent itemsets without accessing the dataset, and the latter, called Apriori-Close, extends the Apriori algorithm [2] by discovering simultaneously frequent itemsets and frequent closed itemsets without additional execution time.

Then, using the frequent closed itemsets and the pseudo-closed itemsets defined by Duquenne and Guigues in lattice theory [9, 11], we define the Duquenne-Guigues basis for exact association rules (rules with a 100% confidence). Rules in this basis are non-redundant exact rules with minimal antecedent and maximal consequent. Besides, using the frequent closed itemsets and results proposed by Luxenburger in lattice theory [19, 32], we define the proper basis and the structural basis for approximate association rules. The proper basis is a small set containing the most informative and useful approximate rules: the non-redundant informative rules. The structural basis can be viewed as an abstract of all approximate rules that hold and can be useful when the proper basis is large. We propose three algorithms intended for yielding these three bases. Using the set of frequent closed itemsets, generating the evoked bases is performed without any access to the dataset.

number of items some tens. From the results presented in [19], no algorithm was proposed. In [24], the association rule framework based on the Galois connection is defined. Fitting in this groundwork, two efficient algorithms that discover frequent closed itemsets for association rules are defined: the Close algorithm [24] for correlated data and the A-Close algorithm [23] for weakly correlated data. The work presented in this paper differs from [23, 24] in the following points:

1. It shows that frequent closed itemsets constitute a generating set for frequent itemsets and association rules.

2. It extends the Apriori algorithm and algorithms for discovering maximal frequent itemsets to generate frequent closed itemsets.

3. It adapts the Duquenne-Guigues basis and Luxenburger results for exact and partial implications to the context of association rules. This adaptation is based on 1. (generating set).

4. It presents new algorithms for generating bases for exact and approximate association rules using frequent closed itemsets.

5. It shows that the algorithms proposed are efficient for both improving the usefulness of extracted association rules and decreasing the execution time of the association rule extraction.

As shown by experiments, the proposed process for extracting bases does not require any overhead compared with the traditional approaches for discovering association rules.

1.3 Paper Organization

In Section 2, we present the association rule framework based on the Galois connection. Section 3 addresses the concept of basis for both exact and approximate association rules. New algorithms for discovering frequent and frequent closed itemsets are described in Section 4 and the following section presents algorithms computing the bases for association rules from the frequent closed itemsets. Experimental results obtained from various datasets are given in Section 6. Finally, as a conclusion, we evoke further work in Section 7.

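2 Association Rule Framework

In this section, we present the association rule framework based on the Galois connection, primarily introduced in [23, 24].

Definition 1 (Data mining context) A data mining context is defined as $\mathcal{D} = (\mathcal{O}, \mathcal{I}, \mathcal{R})$, where $\mathcal{O}$ and $\mathcal{I}$ are finite sets of objects and items respectively. $\mathcal{R} \subseteq \mathcal{O} \times \mathcal{I}$ is a binary relation between objects and items. Each couple $(o, i) \in \mathcal{R}$ denotes the fact that the object $o \in \mathcal{O}$ is related to the item $i \in \mathcal{I}$.

Depending on the target system, a data mining context can be a relation, a class, or the result of an SQL/OQL query.

Example 1 An example data mining context $\mathcal{D}$ consisting of 5 objects (identified by their OID) and 5 items is illustrated in Table 1.

Definition 2 (Galois connection) Let $\mathcal{D} = (\mathcal{O}, \mathcal{I}, \mathcal{R})$ be a data mining context. For $O \subseteq \mathcal{O}$ and $I \subseteq \mathcal{I}$, we define:

$$f: 2^{\mathcal{O}} \to 2^{\mathcal{I}}, \qquad f(O) = \{ i \in \mathcal{I} \mid \forall o \in O,\ (o, i) \in \mathcal{R} \}$$

$$g: 2^{\mathcal{I}} \to 2^{\mathcal{O}}, \qquad g(I) = \{ o \in \mathcal{O} \mid \forall i \in I,\ (o, i) \in \mathcal{R} \}$$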

OID   Items
1     A C D
2     B C E
3     A B C E
4     B E
5     A B C E

Table 1: The Example Data Mining Context D.

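$f(O)$ associates with $O$ the items common to all objects $o \in O$ and $g(I)$ associates with $I$ the objects related to all items $i \in I$. The couple of applications $(f, g)$ is a Galois connection between the power set of $\mathcal{O}$ ($2^{\mathcal{O}}$) and the power set of $\mathcal{I}$ ($2^{\mathcal{I}}$). The following properties hold for all $I, I_1, I_2 \subseteq \mathcal{I}$ and $O, O_1, O_2 \subseteq \mathcal{O}$:

(1) $I_1 \subseteq I_2 \Rightarrow g(I_1) \supseteq g(I_2)$
(1') $O_1 \subseteq O_2 \Rightarrow f(O_1) \supseteq f(O_2)$
(2) $O \subseteq g(I) \Leftrightarrow I \subseteq f(O)$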
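As a minimal sketch of Definition 2, assuming the example context of Table 1 is stored as a Python dictionary (the names `CONTEXT`, `ITEMS`, `f` and `g` are illustrative choices, not part of the paper), the two applications of the Galois connection could be written as follows.

```python
# Example context D of Table 1: object identifier -> set of related items.
CONTEXT = {
    1: {"A", "C", "D"},
    2: {"B", "C", "E"},
    3: {"A", "B", "C", "E"},
    4: {"B", "E"},
    5: {"A", "B", "C", "E"},
}
ITEMS = frozenset().union(*CONTEXT.values())


def f(objects):
    """Items common to all the given objects; for no object at all, every item."""
    common = set(ITEMS)
    for o in objects:
        common &= CONTEXT[o]
    return frozenset(common)


def g(items):
    """Objects related to all the given items."""
    return frozenset(o for o, row in CONTEXT.items() if set(items) <= row)


print(sorted(g({"B", "C"})))  # [2, 3, 5]
print(sorted(f({2, 3, 5})))   # ['B', 'C', 'E']
```

On this context, the objects containing both B and C are 2, 3 and 5, and the items shared by these three objects are B, C and E.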

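Definition 3 (Frequent itemsets) Let $I \subseteq \mathcal{I}$ be a set of items from $\mathcal{D}$. The support count of the itemset $I$ in $\mathcal{D}$ is:

$$supp(I) = \frac{\| g(I) \|}{\| \mathcal{O} \|}$$

$I$ is said to be frequent if the support of $I$ in $\mathcal{D}$ is at least minsupp. The set $L$ of frequent itemsets in $\mathcal{D}$ is:

$$L = \{ I \subseteq \mathcal{I} \mid supp(I) \geq minsupp \}$$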

Definition 4 (Association rules) An association rule is an implication between two itemsets, with the form $I_1 \rightarrow I_2$ where $I_1, I_2 \subseteq \mathcal{I}$, $I_1, I_2 \neq \emptyset$ and $I_1 \cap I_2 = \emptyset$. $I_1$ and $I_2$ are called respectively the antecedent and the consequent of the rule. The support $supp(r)$ and confidence $conf(r)$ of an association rule $r: I_1 \rightarrow I_2$ are defined using the Galois connection as follows:

$$supp(r) = \frac{\| g(I_1 \cup I_2) \|}{\| \mathcal{O} \|}, \qquad conf(r) = \frac{supp(I_1 \cup I_2)}{supp(I_1)}$$

Association rules holding in the context are those that have support and confidence greater than or equal to the minsupp and minconf thresholds respectively. We define the set $AR$ of association rules holding in $\mathcal{D}$ given minsupp and minconf thresholds as follows:

$$AR = \{ r: I_1 \rightarrow I_2 \setminus I_1 \mid I_1 \subset I_2 \subseteq \mathcal{I} \ \wedge\ supp(I_2) \geq minsupp \ \wedge\ conf(r) \geq minconf \}$$

If $conf(r) = 1$ then $r$ is called an exact association rule or implication rule, otherwise $r$ is called an approximate association rule.
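The two measures can be checked by hand on the example context. The following sketch is only an illustration under the assumption that the context is held in a dictionary named `CONTEXT`; the helper names `supp` and `conf` mirror the notation of Definitions 3 and 4 but are otherwise hypothetical. Exact fractions are used, so a confidence written 2/4 in the tables appears in reduced form as 1/2.

```python
from fractions import Fraction

# Example context D: object identifier -> set of related items.
CONTEXT = {
    1: {"A", "C", "D"},
    2: {"B", "C", "E"},
    3: {"A", "B", "C", "E"},
    4: {"B", "E"},
    5: {"A", "B", "C", "E"},
}


def supp(items):
    """Proportion of objects related to every item of `items`."""
    covered = [o for o, row in CONTEXT.items() if set(items) <= row]
    return Fraction(len(covered), len(CONTEXT))


def conf(antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    return supp(set(antecedent) | set(consequent)) / supp(antecedent)


print(supp("AC"), conf("C", "A"))  # 3/5 3/4
print(conf("A", "C"))              # 1
```

The rule C → A is approximate (confidence 3/4), whereas A → C is exact (confidence 1).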

Example 2 Exact and approximate association rules extracted from $\mathcal{D}$ for minsupp = 2/5 and minconf = 1/2 are given in Table 2.

3 Bases for Association Rules

In this section, we first demonstrate that the frequent closed itemsets constitute a generating set for frequent itemsets and association rules. Then, we characterize the Duquenne-Guigues basis for exact association rules and the proper and structural bases for approximate association rules. The Duquenne-Guigues basis, as defined in [11], is extended in this paper to the context of association rules. Proofs of Theorems 2, 3 and 4 are straightforward from Theorem 1 and [11, 19, 32]. Interested readers could refer

Exact rule  Supp  |  Approximate rule  Supp  Conf  |  Approximate rule  Supp  Conf
ABC ⇒ E     2/5   |  BCE → A           2/5   2/3   |  B → AE            2/5   2/4
ABE ⇒ C     2/5   |  AC → BE           2/5   2/3   |  E → AB            2/5   2/4
ACE ⇒ B     2/5   |  BE → AC           2/5   2/4   |  A → CE            2/5   2/3
AB ⇒ CE     2/5   |  CE → AB           2/5   2/3   |  C → AE            2/5   2/4
AE ⇒ BC     2/5   |  AC → B            2/5   2/3   |  E → AC            2/5   2/4
AB ⇒ C      2/5   |  BC → A            2/5   2/3   |  B → CE            3/5   3/4
AB ⇒ E      2/5   |  BE → A            2/5   2/4   |  C → BE            3/5   3/4
AE ⇒ B      2/5   |  AC → E            2/5   2/3   |  E → BC            3/5   3/4
AE ⇒ C      2/5   |  CE → A            2/5   2/3   |  A → B             2/5   2/3
BC ⇒ E      3/5   |  BE → C            3/5   3/4   |  B → A             2/5   2/4
CE ⇒ B      3/5   |  A → BCE           2/5   2/3   |  C → A             3/5   3/4
A ⇒ C       3/5   |  B → ACE           2/5   2/4   |  A → E             2/5   2/3
B ⇒ E       4/5   |  C → ABE           2/5   2/4   |  E → A             2/5   2/4
E ⇒ B       4/5   |  E → ABC           2/5   2/4   |  B → C             3/5   3/4
                  |  A → BC            2/5   2/3   |  C → B             3/5   3/4
                  |  B → AC            2/5   2/4   |  C → E             3/5   3/4
                  |  C → AB            2/5   2/4   |  E → C             3/5   3/4
                  |  A → BE            2/5   2/3   |

Table 2: Association Rules Extracted from D for minsupp = 2/5 and minconf = 1/2.

3.1 Generating Set

Definition 5 (Galois closure operators) The operators $h = f \circ g$ in $2^{\mathcal{I}}$ and $h' = g \circ f$ in $2^{\mathcal{O}}$ are Galois closure operators. Given the Galois connection $(f, g)$, the following properties hold for all $I, I_1, I_2 \subseteq \mathcal{I}$ and $O, O_1, O_2 \subseteq \mathcal{O}$ [6]:

Extension: (3) $I \subseteq h(I)$  (3') $O \subseteq h'(O)$
Idempotency: (4) $h(h(I)) = h(I)$  (4') $h'(h'(O)) = h'(O)$
Monotonicity: (5) $I_1 \subseteq I_2 \Rightarrow h(I_1) \subseteq h(I_2)$  (5') $O_1 \subseteq O_2 \Rightarrow h'(O_1) \subseteq h'(O_2)$

Definition 6 (Frequent closed itemsets) An itemset $I \subseteq \mathcal{I}$ in $\mathcal{D}$ is a closed itemset iff $h(I) = I$. A closed itemset $I$ is said to be frequent if the support of $I$ in $\mathcal{D}$ is at least minsupp. The smallest (minimal) closed itemset containing an itemset $I$ is $h(I)$, the closure of $I$. The set $FC$ of frequent closed itemsets in $\mathcal{D}$ is defined as follows:

$$FC = \{ I \subseteq \mathcal{I} \mid I = h(I) \ \wedge\ supp(I) \geq minsupp \}$$

Example 3 A frequent closed itemset is a maximal set of items common to a set of objects, for which support is at least minsupp. The frequent closed itemsets in the context $\mathcal{D}$ for minsupp = 2/5 are presented in Table 3. The itemset BCE is a frequent closed itemset since it is the maximal set of items common to the objects {2, 3, 5}. The itemset BC is not a frequent closed itemset since it is not a maximal set of items common to some objects: all objects in relation with the items B and C (objects 2, 3 and 5) are also in relation with the item E.
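To make the closure operator concrete, the sketch below enumerates the frequent closed itemsets of the example context by brute force. It is an illustration only: it tests every subset of items, which is not how the algorithms of this paper proceed, and the names `CONTEXT`, `ITEMS`, `g`, `h`, `supp` and `frequent_closed` are assumptions made for the example.

```python
from fractions import Fraction
from itertools import combinations

# Example context D: object identifier -> set of related items.
CONTEXT = {
    1: {"A", "C", "D"},
    2: {"B", "C", "E"},
    3: {"A", "B", "C", "E"},
    4: {"B", "E"},
    5: {"A", "B", "C", "E"},
}
ITEMS = sorted(set().union(*CONTEXT.values()))


def g(items):
    """Objects related to all the given items."""
    return [o for o, row in CONTEXT.items() if set(items) <= row]


def h(items):
    """Closure h = f o g: items shared by every object containing `items`."""
    closed = set(ITEMS)
    for o in g(items):
        closed &= CONTEXT[o]
    return frozenset(closed)


def supp(items):
    return Fraction(len(g(items)), len(CONTEXT))


def frequent_closed(minsupp):
    """Brute-force enumeration of {closed itemset: support} (exponential)."""
    result = {}
    for size in range(len(ITEMS) + 1):
        for combo in combinations(ITEMS, size):
            itemset = frozenset(combo)
            if supp(itemset) >= minsupp and h(itemset) == itemset:
                result[itemset] = supp(itemset)
    return result


FC = frequent_closed(Fraction(2, 5))
for itemset, support in FC.items():
    print("".join(sorted(itemset)) or "{}", support)
```

For minsupp = 2/5 this yields the six itemsets ∅, C, AC, BE, BCE and ABCE, and `h("BC")` returns BCE, in line with Example 3.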

Hereafter, we demonstrate that the set of frequent closed itemsets with their support is the smallest collection from which frequent itemsets with their support and association rules can be generated (it is a generating set).

Lemma 1 [24] The support of an itemset $I$ is equal to the support of the smallest closed itemset containing $I$: $supp(I) = supp(h(I))$.

Lemma 2 [24] The set of maximal frequent itemsets $M = \{ I \in L \mid \nexists I' \in L \text{ where } I \subset I' \}$ is identical to the set of maximal frequent closed itemsets $MC = \{ I \in FC \mid \nexists I' \in FC \text{ where } I \subset I' \}$.

Frequent closed itemset   Support
{∅}                       5/5
{C}                       4/5
{AC}                      3/5
{BE}                      4/5
{BCE}                     3/5
{ABCE}                    2/5

Table 3: Frequent Closed Itemsets Extracted from D for minsupp = 2/5.

Theorem 1 (Generating set) The set $FC$ of frequent closed itemsets with their support is a generating set for all frequent itemsets and their support, and for all association rules holding in the dataset, their support and their confidence.

Proof. Based on Lemma 2, all frequent itemsets can be derived from the maximal frequent closed itemsets. Based on Lemma 1, the support of each frequent itemset can be derived from the support of frequent closed itemsets. Then, the set of frequent closed itemsets $FC$ is a generating set for both the set of frequent itemsets $L$ and the set of association rules $AR$. □
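Lemma 1 suggests a direct way of recovering the support of any frequent itemset from the frequent closed itemsets alone: look up its smallest closed superset. The sketch below assumes the frequent closed itemsets of the example context (minsupp = 2/5) are stored in a dictionary named `FC`; the function names are hypothetical.

```python
from fractions import Fraction

# Frequent closed itemsets of the example context for minsupp = 2/5.
FC = {
    frozenset(): Fraction(5, 5),
    frozenset("C"): Fraction(4, 5),
    frozenset("AC"): Fraction(3, 5),
    frozenset("BE"): Fraction(4, 5),
    frozenset("BCE"): Fraction(3, 5),
    frozenset("ABCE"): Fraction(2, 5),
}


def closure_from_fc(itemset):
    """Smallest frequent closed superset, or None if the itemset is infrequent."""
    supersets = [c for c in FC if set(itemset) <= c]
    return min(supersets, key=len) if supersets else None


def supp_from_fc(itemset):
    """Support derived from FC only (Lemma 1), without reading the dataset."""
    closure = closure_from_fc(itemset)
    return FC[closure] if closure is not None else None


print(supp_from_fc("AB"))  # 2/5
print(supp_from_fc("BC"))  # 3/5
print(supp_from_fc("D"))   # None
```

The itemset AB has closure ABCE and therefore support 2/5, while D has no frequent closed superset and is reported as infrequent.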

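3.2 Duquenne-Guigues Basis for Exact Association Rules

Definition 7 (Frequent pseudo-closed itemsets) An itemset $I \subseteq \mathcal{I}$ in $\mathcal{D}$ is a pseudo-closed itemset iff $h(I) \neq I$ and $\forall I' \subset I$ such that $I'$ is a pseudo-closed itemset, we have $h(I') \subset I$. The set $FP$ of frequent pseudo-closed itemsets in $\mathcal{D}$ is defined as

$$FP = \{ I \subseteq \mathcal{I} \mid supp(I) \geq minsupp \ \wedge\ I \neq h(I) \ \wedge\ \forall I' \in FP \text{ such that } I' \subset I \text{ we have } h(I') \subset I \}$$

Theorem 2 (Duquenne-Guigues Basis for Exact Association Rules) Let $FP$ be the set of frequent pseudo-closed itemsets in $\mathcal{D}$. The set

$$DG = \{ r: I_1 \Rightarrow h(I_1) \setminus I_1 \mid I_1 \in FP \ \wedge\ I_1 \neq \emptyset \}$$

is a basis for all exact association rules holding in the dataset.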

The Duquenne-Guigues basis is minimal with respect to the number of rules since there can be no complete set with fewer rules than there are frequent pseudo-closed itemsets [10, 13].

Example 4 A frequent pseudo-closed itemset $I$ is a frequent non-closed itemset that includes the closures of all frequent pseudo-closed itemsets included in $I$. The set $FP$ of frequent pseudo-closed itemsets and the Duquenne-Guigues basis for exact association rules extracted from $\mathcal{D}$ for minsupp = 2/5 and minconf = 1/2 are presented in Table 4. The itemset AB is not a frequent pseudo-closed itemset since the closures of A and B (respectively AC and BE) are not included in AB. ABCE is not a frequent pseudo-closed itemset since it is closed.

Frequent pseudo-closed itemset   Support
{A}                              3/5
{B}                              4/5
{E}                              4/5

Exact rule   Support
A ⇒ C        3/5
B ⇒ E        4/5
E ⇒ B        4/5

Table 4: Frequent Pseudo-Closed Itemsets and Duquenne-Guigues Basis Extracted from D for minsupp = 2/5.

Theorem 3 (Proper Basis for Approximate Association Rules) Let $FC$ be the set of frequent closed itemsets in $\mathcal{D}$. The set

$$PB = \{ r: I_1 \rightarrow I_2 \setminus I_1 \mid I_1, I_2 \in FC \ \wedge\ I_1 \neq \emptyset \ \wedge\ I_1 \subset I_2 \ \wedge\ conf(r) \geq minconf \}$$

is a basis for all approximate association rules holding in the dataset. Association rules in $PB$ are proper approximate association rules.

Example 5 The proper basis for approximate association rules extracted from $\mathcal{D}$ for minsupp = 2/5 and minconf = 1/2 is presented in Table 5.

Approximate rule   Support   Confidence
BCE → A            2/5       2/3
AC → BE            2/5       2/3
BE → AC            2/5       2/4
BE → C             3/5       3/4
C → ABE            2/5       2/4
C → BE             3/5       3/4
C → A              3/5       3/4

Table 5: Proper Basis Extracted from D for minsupp = 2/5 and minconf = 1/2.

3.4 Structural Basis for Approximate Association Rules

Definition 8 (Undirected graph $G_{FC}$) Let $FC$ be the set of frequent closed itemsets in $\mathcal{D}$. We define $G_{FC} = (V, E)$ as the undirected graph associated with $FC$ where the set of vertices $V$ and the set of edges $E$ are defined as follows:

$$V = \{ I \subseteq \mathcal{I} \mid I \in FC \}$$

$$E = \{ (I_1, I_2) \in V \times V \mid I_1 \subset I_2 \ \wedge\ supp(I_2) / supp(I_1) \geq minconf \}$$

With each edge in $G_{FC}$ between two vertices $I_1$ and $I_2$ with $I_1 \subset I_2$ is associated the confidence $= supp(I_2) / supp(I_1)$ of the proper approximate association rule $I_1 \rightarrow I_2 \setminus I_1$ represented by the edge.

Definition 9 (Maximal Confidence Spanning Forest $F_{FC}$) Let $F_{FC} = (V, E')$ be the maximal confidence spanning forest associated with $FC$. $F_{FC}$ is obtained from the undirected graph $G_{FC} = (V, E)$ by suppressing transitive edges and cycles. Cycles are removed by deleting some edges that enter the last vertex $I$ (maximal vertex with respect to the inclusion) of the cycle. Among all edges entering $I$, those with confidence less than the maximal confidence value associated with an edge with the form $(I', I) \in E$ are deleted. If more than one edge has the maximal confidence value, the first one in lexicographic order is kept.

Theorem 4 (Structural Basis for Approximate Association Rules) Let $SB$ be the set of association rules represented by edges in $F_{FC}$ except rules from the vertex {∅}. The set

$$SB = \{ r: I_1 \rightarrow I_2 \setminus I_1 \mid I_1, I_2 \in V \ \wedge\ I_1 \subset I_2 \ \wedge\ I_1 \neq \emptyset \ \wedge\ (I_1, I_2) \in E' \}$$

is a basis for all approximate association rules holding in the dataset ($I_2$ is the consequent of at most one approximate association rule in $SB$).

Example 6 The structural basis for approximate association rules extracted from $\mathcal{D}$ for minsupp = 2/5 and minconf = 1/2 is presented in Table 6.

Figure 1: Undirected Graph G_FC and Maximal Confidence Spanning Forest F_FC (a tree in this example) Derived from D for minsupp = 2/5 and minconf = 1/2.

Approximate rule   Support   Confidence
AC → BE            2/5       2/3
BE → C             3/5       3/4
C → A              3/5       3/4

Table 6: Structural Basis Extracted from D for minsupp = 2/5 and minconf = 1/2.
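One possible reading of Definitions 8 and 9 is sketched below: build the edges between frequent closed itemsets, drop the transitive ones, and keep for each vertex only the entering edge of maximal confidence (first antecedent in lexicographic order in case of ties). This is a simplified interpretation rather than the authors' implementation; keeping a single entering edge per vertex is assumed here to be enough to remove the cycles, and the names `FC`, `key` and `structural_basis` are hypothetical.

```python
from fractions import Fraction

# Frequent closed itemsets of the example context for minsupp = 2/5.
FC = {
    frozenset(): Fraction(5, 5),
    frozenset("C"): Fraction(4, 5),
    frozenset("AC"): Fraction(3, 5),
    frozenset("BE"): Fraction(4, 5),
    frozenset("BCE"): Fraction(3, 5),
    frozenset("ABCE"): Fraction(2, 5),
}


def key(itemset):
    """Items of an itemset as a sorted string, used for lexicographic order."""
    return "".join(sorted(itemset))


def structural_basis(fc, minconf):
    # Edges of G_FC: strict inclusion with sufficient confidence.
    edges = {(a, b): fc[b] / fc[a]
             for a in fc for b in fc
             if a < b and fc[b] / fc[a] >= minconf}
    # Suppress transitive edges (a, c) for which some path a -> b -> c exists.
    direct = {(a, c): conf for (a, c), conf in edges.items()
              if not any((a, b) in edges and (b, c) in edges for b in fc)}
    rules = []
    for target in fc:
        entering = [(a, conf) for (a, c), conf in direct.items() if c == target]
        if not entering:
            continue
        best = max(conf for _, conf in entering)
        source = min((a for a, conf in entering if conf == best), key=key)
        if source:  # rules from the empty itemset are excluded (Theorem 4)
            rules.append((key(source), key(target - source), fc[target], best))
    return sorted(rules)


for antecedent, consequent, support, confidence in structural_basis(FC, Fraction(1, 2)):
    print(antecedent, "->", consequent, support, confidence)
```

On the example context this prints the three rules AC → BE, BE → C and C → A with the supports and confidences of Table 6.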

4 Discovering Frequent and Frequent Closed Itemsets

In Section 4.1, we propose a new algorithm to derive frequent closed itemsets from frequent itemsets without accessing the dataset. This algorithm can be used, for instance, to obtain the frequent closed itemsets when an algorithm for discovering maximal frequent itemsets [4, 17, 33] is used. In Section 4.2, we present an extension of the Apriori algorithm [2] called Apriori-Close for discovering frequent and frequent closed itemsets without additional computation time. As in the Apriori algorithm, we assume in the following that items are sorted in lexicographic order and that $k$ is the size of the largest frequent itemsets. Based on Lemma 2, $k$ is also the size of the largest frequent closed itemsets.

4.1 Computing Frequent Closed Itemsets from Frequent Itemsets

Many efficient algorithms for mining frequent itemsets and their support have been proposed. Well-known proposals are presented in [2, 8, 26, 30]. Efficient algorithms for discovering the maximal frequent itemsets and then deriving all frequent itemsets have also been proposed [4, 17, 33]. All these algorithms give as result the set $L = \bigcup_{i=1}^{i=k} L_i$ where $L_i$ contains all frequent $i$-itemsets (itemsets of size $i$). Based on Proposition 1 and Lemma 2 (Section 3.1), the frequent closed itemsets and their support can be computed from the frequent itemsets and their support without any dataset access.

The pseudo-code to determine frequent closed itemsets among frequent itemsets is given in Algorithm 1. Notations are given in Table 7. The input of the algorithm is the sets $L_i$, $1 \leq i \leq k$, containing all frequent itemsets in the dataset. It iteratively generates the sets $FC_i$, $0 \leq i \leq k$, of frequent closed $i$-itemsets from $FC_k$ to $FC_0$.

L_i        Set of frequent i-itemsets and their support.
FC_i       Set of frequent closed i-itemsets and their support.
isclosed   Variable indicating if the considered itemset is closed or not.

Table 7: Notations.

the Galois connection). If $g(l) = g(s)$ then $h(l) = h(s) \Rightarrow l = h(s) \Rightarrow s \subseteq l$ (absurd). It follows that $g(l) \supset g(s) \Rightarrow supp(l) > supp(s)$. □

Algorithm 1 Deriving Frequent Closed Itemsets from Frequent Itemsets.
1) FC_k ← L_k;
2) for (i ← k−1; i ≠ 0; i--) do begin
3)     FC_i ← {};
4)     forall itemsets l ∈ L_i do begin
5)         isclosed ← true;
6)         forall itemsets l' ∈ L_{i+1} do begin
7)             if (l ⊂ l') and (l.support = l'.support) then isclosed ← false;
8)         end
9)         if (isclosed = true) then FC_i ← FC_i ∪ {l};
10)     end
11) end
12) FC_0 ← {∅};
13) forall itemsets l ∈ L_1 do begin
14)     if (l.support = ‖O‖) then FC_0 ← {};
15) end

First, the set $FC_k$ is initialized with the set of largest frequent itemsets $L_k$ (step 1). Then, the algorithm iteratively determines which $i$-itemsets in $L_i$ are closed, from $L_{k-1}$ to $L_1$ (steps 2 to 11). At the beginning of the $i$th iteration the set $FC_i$ of frequent closed $i$-itemsets is empty (step 3). In steps 4 to 10, for each frequent itemset $l$ in $L_i$, we verify whether $l$ has the same support as a frequent $(i+1)$-itemset $l'$ in $L_{i+1}$ in which it is included. If so, we have $l' \subseteq h(l)$ and then $l \neq h(l)$: $l$ is not closed (step 7). Otherwise, $l$ is a frequent closed itemset and is inserted in $FC_i$ (step 9). During the last phase, the algorithm determines if the empty itemset is closed by first initializing $FC_0$ with the empty itemset (step 12) and then considering all frequent 1-itemsets in $L_1$ (steps 13 to 15). If a 1-itemset $l$ has a support equal to the number of objects in the context, meaning that $l$ is common to all objects, then the itemset ∅ cannot be closed (we have $supp(\{\emptyset\}) = \| \mathcal{O} \| = supp(l)$) and is removed from $FC_0$ (step 14). Thus, at the end of the algorithm, each set $FC_i$ contains all frequent closed $i$-itemsets.

Correctness Since all maximal frequent itemsets are maximal frequent closed itemsets (Lemma 2), the computation of the set $FC_k$ containing the largest frequent closed itemsets is correct. The correctness of the computation of sets $FC_i$ for $i < k$ relies on Proposition 1. This proposition makes it possible to determine if a frequent $i$-itemset $l$ is closed by comparing its support and the supports of the frequent $(i+1)$-itemsets in which $l$ is included. If one of them has the same support as $l$, then $l$ cannot be closed.
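The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative transcription under a few assumptions: the sets $L_i$ are given as a dictionary `LEVELS` mapping each size to a `{itemset: support}` table, supports are relative (so a support of 1 plays the role of ‖O‖), and the names `level`, `LEVELS` and `closed_from_frequent` are not taken from the paper.

```python
from fractions import Fraction


def level(spec):
    """Build {itemset: support} from {'AB': 2, ...}, counts out of 5 objects."""
    return {frozenset(name): Fraction(count, 5) for name, count in spec.items()}


# Frequent itemsets of the example context for minsupp = 2/5, by size.
LEVELS = {
    1: level({"A": 3, "B": 4, "C": 4, "E": 4}),
    2: level({"AB": 2, "AC": 3, "AE": 2, "BC": 3, "BE": 4, "CE": 3}),
    3: level({"ABC": 2, "ABE": 2, "ACE": 2, "BCE": 3}),
    4: level({"ABCE": 2}),
}


def closed_from_frequent(levels):
    """Derive the sets FC_0..FC_k from L_1..L_k without reading the dataset."""
    k = max(levels)
    fc = {k: dict(levels[k])}                      # step 1
    for i in range(k - 1, 0, -1):                  # steps 2 to 11
        fc[i] = {
            l: s for l, s in levels[i].items()
            # l is closed iff no (i+1)-superset has the same support
            if not any(l < l2 and s == s2 for l2, s2 in levels[i + 1].items())
        }
    # steps 12 to 15: the empty itemset is closed unless some item is in every object
    in_all_objects = any(s == 1 for s in levels[1].values())
    fc[0] = {} if in_all_objects else {frozenset(): Fraction(1)}
    return fc


FC_LEVELS = closed_from_frequent(LEVELS)
for i in sorted(FC_LEVELS):
    print(i, sorted("".join(sorted(l)) for l in FC_LEVELS[i]))
```

On the example, the result is FC_0 = {∅}, FC_1 = {C}, FC_2 = {AC, BE}, FC_3 = {BCE} and FC_4 = {ABCE}.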

4.2 Apriori-Close Algorithm

In this section, we present an extension of the Apriori algorithm [2] computing simultaneously frequent and frequent closed itemsets. The pseudo-code is given in Algorithm 2 and notations in Table 8. The algorithm iteratively generates the sets $L_i$ of frequent $i$-itemsets from $L_1$ to $L_k$. Besides, during the $i$th iteration, all frequent closed $(i-1)$-itemsets in $FC_{i-1}$ are determined. The set $FC_k$ is determined during the last step of the algorithm.

L_i    Set of frequent i-itemsets, their support and marker isclosed indicating if closed or not.
FC_i   Set of frequent closed i-itemsets and their support.

Table 8: Notations.

Algorithm 2 Apriori-Close.
1) k ← 0;
2) itemsets in L_1 ← {1-itemsets};
3) L_1 ← Support-Count(L_1);
4) FC_0 ← {∅};
5) forall itemsets l ∈ L_1 do begin
6)     if (l.support < minsupp) then L_1 ← L_1 \ {l};
7)     else if (l.support = ‖O‖) then FC_0 ← {};
8) end
9) for (i ← 1; L_i ≠ {}; i++) do begin
10)     forall itemsets l' ∈ L_i do l'.isclosed ← true;
11)     L_{i+1} ← Apriori-Gen(L_i);
12)     forall itemsets l ∈ L_{i+1} do begin
13)         forall i-subsets l' of l do begin
14)             if (l' ∉ L_i) then L_{i+1} ← L_{i+1} \ {l};
15)         end
16)     end
17)     L_{i+1} ← Support-Count(L_{i+1});
18)     forall itemsets l ∈ L_{i+1} do begin
19)         if (l.support < minsupp) then L_{i+1} ← L_{i+1} \ {l};
20)         else do begin
21)             forall i-subsets l' ∈ L_i of l do begin
22)                 if (l.support = l'.support) then l'.isclosed ← false;
23)             end
24)         end
25)     end
26)     FC_i ← {l ∈ L_i | l.isclosed = true};
27)     k ← i;
28) end
29) FC_k ← L_k;

First, the variable $k$ is initialized to 0 (step 1). Then, the set $L_1$ is initialized with the 1-itemsets and their supports are computed (steps 2 and 3). The set $FC_0$ is initialized with the empty itemset (step 4) and the supports of itemsets in $L_1$ are considered (steps 5 to 8). All infrequent 1-itemsets are removed from $L_1$ (step 6) and if a frequent 1-itemset has a support equal to the number of objects in the context then the empty itemset is removed from $FC_0$ (step 7). During each of the following iterations (steps 9 to 28), frequent itemsets of size $i+1$, $k > i \geq 1$, and frequent closed itemsets of size $i$ are computed as follows. For all frequent $i$-itemsets in $L_i$, the marker isclosed is initialized to true (step 10). A set $L_{i+1}$ of possible frequent $(i+1)$-itemsets is created by applying the Apriori-Gen function to the set $L_i$ (step 11). For each of these possible frequent $(i+1)$-itemsets, we check that all its subsets of size $i$ exist in $L_i$ (steps 12 to 16). One pass is performed to compute the supports of the remaining itemsets in $L_{i+1}$ (step 17). Then, for each $(i+1)$-itemset $l \in L_{i+1}$ (steps 18 to 25), if $l$ is infrequent then it is discarded from $L_{i+1}$ (step 19). Otherwise, for all $i$-subsets $l'$ of $l$, we verify whether the supports of $l'$ and $l$ are equal; if so, then $l'$ cannot be a closed itemset and its marker isclosed is set to false (steps 20 to 24). Then, all frequent $i$-itemsets in $L_i$ for which the marker isclosed is true are inserted in the set $FC_i$ of frequent closed $i$-itemsets (step 26) and the variable $k$ is set to the value of $i$ (step 27). Finally, the set $FC_k$ is initialized with the frequent $k$-itemsets in $L_k$ (step 29).

Apriori-Gen function The Apriori-Gen function [2] applies to a set $L_i$ of frequent $i$-itemsets. It returns a set $L_{i+1}$ of potential frequent $(i+1)$-itemsets. A new itemset in $L_{i+1}$ is created by joining two itemsets in $L_i$ sharing common first $i-1$ items.
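The join performed by Apriori-Gen can be illustrated by the short sketch below. It covers only the join described above (the subset check is done separately by Apriori-Close), assumes that itemsets are compared as sorted tuples of items, and uses the hypothetical name `apriori_gen`.

```python
def apriori_gen(level):
    """Join two i-itemsets sharing their first i-1 items into an (i+1)-itemset."""
    itemsets = sorted(tuple(sorted(s)) for s in level)
    candidates = set()
    for index, a in enumerate(itemsets):
        for b in itemsets[index + 1:]:
            if a[:-1] == b[:-1]:  # same prefix of length i-1
                candidates.add(frozenset(a + b[-1:]))
    return candidates


L2 = {frozenset(s) for s in ("AB", "AC", "AE", "BC", "BE", "CE")}
print(sorted("".join(sorted(c)) for c in apriori_gen(L2)))
# ['ABC', 'ABE', 'ACE', 'BCE']
```

Applied to the six frequent 2-itemsets of the example, the join produces the four candidate 3-itemsets ABC, ABE, ACE and BCE.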

Support-Count function The Support-Count function takes a set $L_i$ of $i$-itemsets as argument. It efficiently computes the supports of all itemsets $l \in L_i$. Only one dataset pass is required: for each object $o$ read, the supports of all itemsets $l \in L_i$

Correctness Since the support of a frequent closed itemset $l$ is different from the support of all its supersets (Proposition 1), the computation of sets $FC_i$ for $i < k$ is correct. Hence, a frequent $i$-itemset $l' \in L_i$ is determined closed or not by comparing its support with the supports of all frequent $(i+1)$-itemsets $l \in L_{i+1}$ for which $l' \subset l$. Lemma 2 ensures the correctness of the computation of the set $FC_k$ containing the largest frequent closed itemsets.

Example 7 Figure 2 illustrates the execution of the Apriori-Close algorithm with the context $\mathcal{D}$ for a minimum support of 2/5.

Scan D → L_1: {A} 3/5, {B} 4/5, {C} 4/5, {D} 1/5, {E} 4/5
Pruning infrequent → L_1: {A} 3/5, {B} 4/5, {C} 4/5, {E} 4/5
Determining closed → FC_0: {∅} 5/5
Scan D → L_2: {AB} 2/5, {AC} 3/5, {AE} 2/5, {BC} 3/5, {BE} 4/5, {CE} 3/5
Pruning infrequent → L_2: {AB} 2/5, {AC} 3/5, {AE} 2/5, {BC} 3/5, {BE} 4/5, {CE} 3/5
Determining closed → FC_1: {C} 4/5
Scan D → L_3: {ABC} 2/5, {ABE} 2/5, {ACE} 2/5, {BCE} 3/5
Pruning infrequent → L_3: {ABC} 2/5, {ABE} 2/5, {ACE} 2/5, {BCE} 3/5
Determining closed → FC_2: {AC} 3/5, {BE} 4/5
Scan D → L_4: {ABCE} 2/5
Pruning infrequent → L_4: {ABCE} 2/5
Determining closed → FC_3: {BCE} 3/5
Closed k-itemsets (L_4: {ABCE} 2/5) → FC_4: {ABCE} 2/5

Figure 2: Discovering Frequent and Frequent Closed Itemsets with Apriori-Close.
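The trace of Figure 2 can be reproduced with a compact Python sketch of Apriori-Close. It follows the structure of the pseudo-code but is not the authors' implementation: supports are recomputed by scanning an in-memory dictionary `CONTEXT`, the sets are plain dictionaries rather than a prefix tree, and the names `support_count` and `apriori_close` are assumptions.

```python
from fractions import Fraction
from itertools import combinations

CONTEXT = {
    1: {"A", "C", "D"},
    2: {"B", "C", "E"},
    3: {"A", "B", "C", "E"},
    4: {"B", "E"},
    5: {"A", "B", "C", "E"},
}


def support_count(candidates):
    """One pass over the context: relative support of each candidate."""
    n = len(CONTEXT)
    return {c: Fraction(sum(1 for row in CONTEXT.values() if c <= row), n)
            for c in candidates}


def apriori_close(minsupp):
    items = set().union(*CONTEXT.values())
    first = support_count({frozenset([x]) for x in items})
    freq = {1: {l: s for l, s in first.items() if s >= minsupp}}
    # The empty itemset is closed unless a frequent item occurs in every object.
    some_item_everywhere = any(s == 1 for s in freq[1].values())
    closed = {0: {} if some_item_everywhere else {frozenset(): Fraction(1)}}
    i = 1
    while freq[i]:
        is_closed = dict.fromkeys(freq[i], True)
        keys = sorted(tuple(sorted(s)) for s in freq[i])
        # Join step, then check that every i-subset is frequent.
        cands = {frozenset(a + b[-1:]) for n, a in enumerate(keys)
                 for b in keys[n + 1:] if a[:-1] == b[:-1]}
        cands = {c for c in cands
                 if all(frozenset(sub) in freq[i] for sub in combinations(c, i))}
        nxt = {l: s for l, s in support_count(cands).items() if s >= minsupp}
        # An i-itemset with the same support as a frequent superset is not closed.
        for l, s in nxt.items():
            for sub in combinations(l, i):
                if freq[i][frozenset(sub)] == s:
                    is_closed[frozenset(sub)] = False
        closed[i] = {l: s for l, s in freq[i].items() if is_closed[l]}
        freq[i + 1] = nxt
        i += 1
    del freq[i]  # drop the final empty level
    return freq, closed


FREQ, CLOSED = apriori_close(Fraction(2, 5))
for size in sorted(CLOSED):
    print(size, sorted("".join(sorted(l)) for l in CLOSED[size]))
```

In this sketch the largest frequent itemsets are marked closed inside the loop, since no larger frequent itemset can unmark them, which plays the role of the final step of the pseudo-code.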

5 Generating Bases for Association Rules

In Section 5.1, we present an algorithm to generate the Duquenne-Guigues basis for exact association rules. In Sections 5.2 and 5.3 are described algorithms computing the proper basis and the structural basis

The pseudo-code generating the Duquenne-Guigues basis for exact association rules is given in Algorithm 3. Notations are given in Table 9. The algorithm takes as input the sets $L_i$, $1 \leq i \leq k$, containing the frequent itemsets and their support, and the sets $FC_i$, $0 \leq i \leq k$, containing the frequent closed itemsets and their support. It first computes the frequent pseudo-closed itemsets iteratively (steps 2 to 16) and then uses them to generate the Duquenne-Guigues basis for exact association rules DG (steps 17 to 21).

L_i    Set of frequent i-itemsets and their support.
FC_i   Set of frequent closed i-itemsets and their support.
FP_i   Set of frequent pseudo-closed i-itemsets, their closure and their support.
DG     Duquenne-Guigues basis for exact association rules.

Table 9: Notations.

First, the set DG is initialized to the empty set (step 1). If the empty itemset is not a closed itemset (it is then necessarily a pseudo-closed itemset), it is inserted in $FP_0$ (step 2). Otherwise $FP_0$ is empty (step 3). Then, the algorithm iteratively determines which $i$-itemsets in $L_i$ are pseudo-closed, from $L_1$ to $L_k$ (steps 4 to 16). At each iteration, the set $FP_i$ is initialized with the list of frequent $i$-itemsets that are not closed (step 5) and each frequent $i$-itemset $l$ in $FP_i$ is considered as follows (steps 6 to 15). The variable pseudo is set to true (step 7). We verify for each frequent pseudo-closed itemset $p$ previously discovered (i.e. in $FP_j$ with $j < i$) if $p$ is contained in $l$ (steps 8 to 13). In that case and if the closure of $p$ is not included in $l$, then $l$ is not pseudo-closed and is removed from $FP_i$ (steps 9 to 12). Otherwise, the closure of $l$ (i.e. the smallest frequent closed itemset containing $l$) is determined (step 14). Once all frequent pseudo-closed itemsets $p$ and their closure are computed, all rules with the form $r: p \Rightarrow (p.closure \setminus p)$ are generated (steps 17 to 21). The algorithm results in the set DG containing all rules in the Duquenne-Guigues basis for exact association rules.

Algorithm 3 Generating Duquenne-Guigues Basis for Exact Association Rules.
1) DG ← {};
2) if (FC_0 = {}) then FP_0 ← {∅};
3) else FP_0 ← {};
4) for (i ← 1; i ≤ k; i++) do begin
5)     FP_i ← L_i \ FC_i;
6)     forall itemsets l ∈ FP_i do begin
7)         pseudo ← true;
8)         forall itemsets p ∈ FP_j with j < i do begin
9)             if (p ⊂ l) and (p.closure ⊄ l) then do begin
10)                 pseudo ← false;
11)                 FP_i ← FP_i \ {l};
12)             end
13)         end
14)         if (pseudo = true) then l.closure ← Min({c ∈ FC_{j>i} | l ⊂ c});
15)     end
16) end
17) forall sets FP_i where FP_i ≠ {} do begin
18)     forall pseudo-closed itemsets p ∈ FP_i do begin
19)         DG ← DG ∪ {r: p ⇒ (p.closure \ p), p.support};
20)     end
21) end
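A possible Python rendering of the generation of the Duquenne-Guigues basis is sketched below. It is a simplified illustration: frequent itemsets and frequent closed itemsets are assumed to be given as flat dictionaries `FREQUENT` and `CLOSED` (values taken from the running example), the empty antecedent is left out as in Theorem 2, and the helper names `table` and `duquenne_guigues` are hypothetical.

```python
from fractions import Fraction


def table(spec):
    """Build {itemset: support} from {'AB': 2, ...}, counts out of 5 objects."""
    return {frozenset(name): Fraction(count, 5) for name, count in spec.items()}


# Frequent itemsets and frequent closed itemsets of the example, minsupp = 2/5.
FREQUENT = table({"A": 3, "B": 4, "C": 4, "E": 4,
                  "AB": 2, "AC": 3, "AE": 2, "BC": 3, "BE": 4, "CE": 3,
                  "ABC": 2, "ABE": 2, "ACE": 2, "BCE": 3, "ABCE": 2})
CLOSED = table({"": 5, "C": 4, "AC": 3, "BE": 4, "BCE": 3, "ABCE": 2})


def duquenne_guigues(frequent, closed):
    pseudo = {}  # frequent pseudo-closed itemset -> its closure
    # The empty itemset is examined first, then itemsets by increasing size.
    for l in [frozenset()] + sorted(frequent, key=len):
        if l in closed:
            continue
        # Definition 7: the closure of every pseudo-closed proper subset of l
        # must be included in l.
        if all(closure <= l for p, closure in pseudo.items() if p < l):
            pseudo[l] = min((c for c in closed if l < c), key=len)
    # Rules p => closure(p) \ p, skipping the empty antecedent (Theorem 2).
    return [(p, closure - p, frequent[p]) for p, closure in pseudo.items() if p]


for p, consequent, support in duquenne_guigues(FREQUENT, CLOSED):
    print("".join(sorted(p)), "=>", "".join(sorted(consequent)), support)
# A => C 3/5
# B => E 4/5
# E => B 4/5
```

Processing itemsets by increasing size guarantees that all pseudo-closed proper subsets of an itemset are known when it is examined, which is the role of the loop over the sets $FP_j$ with $j < i$ in the pseudo-code.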

Correctness Since the itemset ∅ has no subset, if it is not a closed itemset then it is by definition a pseudo-closed itemset and the computation of the set $FP_0$ is correct. The correctness of the computation of frequent pseudo-closed $i$-itemsets in $FP_i$ for $1 \leq i \leq k$ relies on Definition 7. All frequent $i$-itemsets $l$ that are not closed and that contain the closures of all frequent pseudo-closed itemsets that are subsets of $l$ are inserted in $FP_i$. According to Definition 7, these $i$-itemsets are all frequent pseudo-closed $i$-itemsets and the sets $FP_i$ are correct. The association rules generated in the last phase of the algorithm are all rules with a frequent pseudo-closed itemset in the antecedent. Then, the resulting set DG corresponds to the rules in the Duquenne-Guigues basis for exact association rules defined in Theorem 2.

Example8 Figure 3shows the generation of the Duquenne-Guigues basisfor exa tasso iation rules from the ontext Dforaminimumsupportof2/5.

Step 1: L_1 and FC_1 give FP_1.

L_1                 FC_1                FP_1
Itemset  Supp       Itemset  Supp       Itemset  Closure  Supp
{A}      3/5        {C}      4/5        {A}      {AC}     3/5
{B}      4/5                            {B}      {BE}     4/5
{C}      4/5                            {E}      {BE}     4/5
{E}      4/5

Step 2: L_2 and FC_2 give FP_2 = ∅.

L_2                 FC_2
Itemset  Supp       Itemset  Supp
{AB}     2/5        {AC}     3/5
{AC}     3/5        {BE}     4/5
{AE}     2/5
{BC}     3/5
{BE}     4/5
{CE}     3/5

Step 3: L_3 and FC_3 give FP_3 = ∅.

L_3                 FC_3
Itemset  Supp       Itemset  Supp
{ABC}    2/5        {BCE}    3/5
{ABE}    2/5
{ACE}    2/5
{BCE}    3/5

Step 4: L_4 and FC_4 give FP_4 = ∅.

L_4                 FC_4
Itemset  Supp       Itemset  Supp
{ABCE}   2/5        {ABCE}   2/5

Step 5: FP = ∪ FP_i gives DG.

FP = ∪ FP_i                      DG
Itemset  Closure  Supp           Rule
{A}      {AC}     3/5            A ⇒ C
{B}      {BE}     4/5            B ⇒ E
{E}      {BE}     4/5            E ⇒ B

Figure 3: Generating Duquenne-Guigues Basis for Exact Association Rules.
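As an illustration of Algorithm 3, the following Python sketch recomputes Example 8. It is not the authors' C++ implementation: the names `duquenne_guigues`, `supp`, `FREQUENT`, `CLOSED` and the `empty_is_closed` flag are assumptions made here for illustration, and the itemsets and supports are simply copied from Figure 3. Supports are written as fractions of the five objects of D.

```python
from fractions import Fraction


def supp(n):
    """Support n/5, as in the context D of Example 8."""
    return Fraction(n, 5)


# Frequent itemsets (L_1..L_4) and frequent closed itemsets (FC_1..FC_4),
# copied from Figure 3.
FREQUENT = {frozenset(s): supp(n) for s, n in [
    ("A", 3), ("B", 4), ("C", 4), ("E", 4),
    ("AB", 2), ("AC", 3), ("AE", 2), ("BC", 3), ("BE", 4), ("CE", 3),
    ("ABC", 2), ("ABE", 2), ("ACE", 2), ("BCE", 3), ("ABCE", 2),
]}
CLOSED = {frozenset(s): supp(n) for s, n in [
    ("C", 4), ("AC", 3), ("BE", 4), ("BCE", 3), ("ABCE", 2),
]}


def duquenne_guigues(frequent, closed, empty_is_closed=True):
    """Return the rules (antecedent, consequent, support) of the basis."""
    def closure(itemset):
        # Smallest frequent closed itemset containing `itemset`.
        return min((c for c in closed if itemset <= c), key=len)

    supports = dict(frequent)
    supports[frozenset()] = Fraction(1)  # the empty itemset is in every object
    candidates = sorted(frequent, key=len)  # smaller itemsets first
    if not empty_is_closed:
        candidates.insert(0, frozenset())  # plays the role of FP_0 = {∅}

    pseudo = {}  # pseudo-closed itemset -> its closure
    for l in candidates:
        if l in closed:
            continue
        # l is pseudo-closed when it contains the closure of every
        # pseudo-closed itemset p that is a proper subset of l.
        if all(p_closure <= l for p, p_closure in pseudo.items() if p < l):
            pseudo[l] = closure(l)
    return [(p, c - p, supports[p]) for p, c in pseudo.items()]


if __name__ == "__main__":
    for antecedent, consequent, support in duquenne_guigues(FREQUENT, CLOSED):
        print("".join(sorted(antecedent)), "=>",
              "".join(sorted(consequent)), support)
```

Processing candidates by increasing size guarantees that every pseudo-closed proper subset of l has already been identified when l is examined, which is the role played by the loop over FP_j with j < i in the pseudo-code.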

5.2 Generating Proper Basis for Approximate Association Rules

The pseudo-code generating the proper basis for approximate association rules is presented in Algorithm 4. Notations are given in Table 10. The algorithm takes as input the sets FC_i, 1 ≤ i ≤ k, containing the frequent closed non-empty itemsets and their support. The output of the algorithm is the proper basis for approximate association rules PB.

The set PB is first initialized to the empty set (step 1). Then, the algorithm iteratively considers all frequent closed itemsets l ∈ FC_i for 2 ≤ i ≤ k. It determines which frequent closed itemsets l' ∈ FC_{j<i} are subsets of l and generates association rules of the form l' → l \ l' that have sufficient confidence (steps 2 to 12) as follows. During the i-th iteration, each itemset l in FC_i is considered (steps 3 to 11). For each set FC_j, 1 ≤ j < i, a set S_j containing all frequent closed j-itemsets in FC_j that are subsets of l is created (step 5). Then, for each of these subsets l', the confidence of the proper approximate association rule r: l' → l \ l' is computed (step 7). If the confidence of r is sufficient then r is inserted in PB (step 8). At the end of the algorithm, the set PB contains all rules of the proper basis for approximate association rules.

FC_i   Set of frequent closed i-itemsets and their support.
S_j    Set of j-itemsets that are subsets of the considered itemset.
PB     Proper basis for approximate association rules.

Table 10: Notations.

Algorithm 4 Generating Proper Basis for Approximate Association Rules.
 1) PB ← {}
 2) for (i ← 2; i ≤ k; i++) do begin
 3)   forall itemsets l ∈ FC_i do begin
 4)     for (j ← i−1; j > 0; j--) do begin
 5)       S_j ← Subsets(FC_j, l);
 6)       forall itemsets l' ∈ S_j do begin
 7)         conf(r) ← l.support / l'.support;
 8)         if (conf(r) ≥ minconf) then PB ← PB ∪ {r: l' → l \ l', l.support, conf(r)};
 9)       end
10)     end
11)   end
12) end

Subset function. The Subsets function takes a set X of itemsets and an itemset y as arguments. It determines all itemsets x ∈ X that are subsets of y. In the implementation of the algorithm, frequent and frequent closed itemsets are stored in a prefix-tree structure [24] in order to improve the efficiency of the subset search.

Correctness. The correctness of the algorithm relies on the fact that we inspect all proper approximate association rules holding in the dataset. For each frequent closed itemset, the algorithm computes, among its subsets, all other frequent closed itemsets. Then, the generation of all rules between two frequent closed itemsets having sufficient confidence is ensured. These rules are all proper approximate association rules holding in the dataset, and the resulting set PB is the proper basis for approximate association rules defined in Theorem 3.
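The subset search performed by the Subsets function can be sketched with a small prefix tree. The text does not detail the structure used in [24], so the class `PrefixTree` below and its methods `insert` and `subsets` are hypothetical: each itemset is stored as the path of its items in sorted order, and a query for y only descends into branches labelled by items of y.

```python
class PrefixTree:
    """Hypothetical minimal prefix tree: one node per item along a sorted path."""

    def __init__(self):
        self.children = {}   # item -> child node
        self.itemset = None  # set when a stored itemset ends at this node

    def insert(self, itemset):
        """Store an itemset as the path of its items in sorted order."""
        node = self
        for item in sorted(itemset):
            node = node.children.setdefault(item, PrefixTree())
        node.itemset = frozenset(itemset)

    def subsets(self, y):
        """Return all stored itemsets that are subsets of y."""
        y = frozenset(y)
        found = []
        stack = [self]
        while stack:
            node = stack.pop()
            if node.itemset is not None:
                found.append(node.itemset)
            # Only branches labelled with an item of y can lead to subsets of y.
            for item, child in node.children.items():
                if item in y:
                    stack.append(child)
        return found


if __name__ == "__main__":
    tree = PrefixTree()
    for s in ["C", "AC", "BE", "BCE"]:
        tree.insert(s)
    print(sorted("".join(sorted(x)) for x in tree.subsets("BCE")))
```

Compared with testing every stored itemset against y, the traversal abandons a whole branch as soon as its label is not in y, which is the kind of saving a prefix tree is meant to provide.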

Example 9. Figure 4 shows the generation of the proper basis for approximate association rules in the context D for a minimum support of 2/5 and a minimum confidence of 1/2.
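A minimal Python sketch of Algorithm 4 is given below for illustration. The function name `proper_basis` and the dictionary `CLOSED` are assumptions made here, not the authors' implementation; the frequent closed itemsets and their supports are those of Example 9, and a plain scan over the closed itemsets stands in for the Subsets function.

```python
from fractions import Fraction

# Frequent closed non-empty itemsets of D and their supports (minimum support 2/5).
CLOSED = {frozenset(s): Fraction(n, 5) for s, n in [
    ("C", 4), ("AC", 3), ("BE", 4), ("BCE", 3), ("ABCE", 2),
]}


def proper_basis(closed, minconf):
    """Rules (l', l minus l', support, confidence) between closed itemsets l' ⊂ l."""
    basis = []
    for l in sorted(closed, key=len):              # steps 2 and 3
        for lp in closed:                          # steps 4 to 6
            if lp < l:                             # l' is a proper subset of l
                conf = closed[l] / closed[lp]      # step 7
                if conf >= minconf:                # step 8
                    basis.append((lp, l - lp, closed[l], conf))
    return basis


if __name__ == "__main__":
    for lp, rest, support, conf in proper_basis(CLOSED, Fraction(1, 2)):
        print("".join(sorted(lp)), "->", "".join(sorted(rest)), support, conf)
```

With a minimum confidence of 1/2 the sketch yields seven rules, the same ones as the final PB table of Figure 4; raising the threshold to 3/4 keeps only C → A, C → BE and BE → C.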

5.3 Generating Structural Basis for Approximate Association Rules

The pseudo-code generating the structural basis for approximate association rules is given in Algorithm 5. Notations are given in Table 11. The algorithm takes as input the sets FC_i, 1 ≤ i ≤ k, of frequent closed non-empty itemsets and their support. It generates the structural basis for approximate association rules SB represented by the maximal confidence spanning forest F_FC associated with FC = ∪_{i=1..k} FC_i (without the empty itemset).

FC_i   Set of frequent closed i-itemsets and their support.
S_j    Set of j-itemsets that are subsets of the itemset considered.
CR     Set of candidate approximate association rules.

Table 11: Notations.

Step 1: itemsets of FC_2 against FC_1.

FC_2                FC_1                PB
Itemset  Supp       Itemset  Supp       Rule      Supp  Conf
{AC}     3/5        {C}      4/5        C → A     3/5   3/4
{BE}     4/5

Step 2: itemsets of FC_3 against FC_2 ∪ FC_1.

FC_3                FC_2 ∪ FC_1         PB
Itemset  Supp       Itemset  Supp       Rule      Supp  Conf
{BCE}    3/5        {AC}     3/5        C → A     3/5   3/4
                    {BE}     4/5        C → BE    3/5   3/4
                    {C}      4/5        BE → C    3/5   3/4

Step 3: itemsets of FC_4 against FC_3 ∪ FC_2 ∪ FC_1.

FC_4                FC_3 ∪ FC_2 ∪ FC_1  PB
Itemset  Supp       Itemset  Supp       Rule      Supp  Conf
{ABCE}   2/5        {BCE}    3/5        C → A     3/5   3/4
                    {AC}     3/5        C → BE    3/5   3/4
                    {BE}     4/5        BE → C    3/5   3/4
                    {C}      4/5        C → ABE   2/5   2/4
                                        AC → BE   2/5   2/3
                                        BE → AC   2/5   2/4
                                        BCE → A   2/5   2/3

Figure 4: Generating Proper Basis for Approximate Association Rules.

The set SB is first initialized to the empty set (step 1). Then, the algorithm iteratively considers all frequent closed itemsets l ∈ FC_i for 2 ≤ i ≤ k. It determines which frequent closed itemsets l' ∈ FC_{j<i} are covered by l, i.e. are direct predecessors of l, and then generates the maximal confidence association rules of the form l' → l \ l' that hold (steps 2 to 25). During the i-th iteration, each itemset l in FC_i is considered (steps 3 to 24) as follows. The set CR of candidate association rules with l in the consequent is initialized to the empty set (step 4). For 1 ≤ j < i, sets S_j containing all frequent closed j-itemsets in FC_j that are subsets of l are created (steps 5 to 7). Then, all these subsets of l are considered in decreasing order of their sizes (steps 8 to 18). For each of these subsets l' ∈ S_j, the confidence of the proper approximate association rule r: l' → l \ l' is computed (step 10). If the confidence of r is sufficient, r is inserted in CR (step 12) and all subsets l'' of l' are removed from the sets S_n with n < j (steps 13 to 15). This is because rules of the form l'' → l \ l'' with l'' ∈ S_{n<j} are transitive proper approximate rules. Finally, the candidate proper approximate rules with l in the consequent that are in CR are pruned (steps 19 to 23): the maximum confidence value maxconf of rules in CR is determined (step 20) and the first rule with such a confidence is inserted in SB (steps 21 and 22). At the end of the algorithm, the set SB thus contains all rules in the structural basis for approximate association rules.

Correctness. The algorithm considers all association rules l' → l \ l' with confidence ≥ minconf between two frequent closed itemsets l and l' where l covers l'. These rules are all proper non-transitive approximate association rules that hold and can be represented by the edges of the graph G_FC (Definition 8) without transitive edges. Moreover, among all rules of the form X → l \ X (generated from l), we keep only the first one with confidence equal to the maximal confidence of rules X → l \ X. Preserving only this rule is equivalent to removing cycles in the graph G_FC in the manner explained in Definition 9. Then, the resulting set SB can be represented as the maximal confidence spanning forest F_FC without edges from the empty itemset. SB contains all rules in the structural basis for approximate association rules defined in Theorem 4.

Example 10. Figure 5 depicts the generation of the structural basis for approximate association rules in the context D for a minimum support of 2/5 and a minimum confidence of 1/2.

Algorithm 5 Generating Structural Basis for Approximate Association Rules.
 1) SB ← {};
 2) for (i ← 2; i ≤ k; i++) do begin
 3)   forall itemsets l ∈ FC_i do begin
 4)     CR ← {};
 5)     for (j ← i−1; j > 0; j--) do begin
 6)       S_j ← Subsets(FC_j, l);
 7)     end
 8)     for (j ← i−1; j > 0; j--) do begin
 9)       forall itemsets l' ∈ S_j do begin
10)         conf(r) ← l.support / l'.support;
11)         if (conf(r) ≥ minconf) then do begin
12)           CR ← CR ∪ {r: l' → l \ l', l.support, conf(r)};
13)           for (n ← j−1; n > 0; n--) do begin
14)             S_n ← S_n \ Subsets(S_n, l');
15)           end
16)         end
17)       end
18)     end
19)     if (CR ≠ {}) then do begin
20)       maxconf ← Max_{r ∈ CR}(conf(r));
21)       find first {r ∈ CR | conf(r) = maxconf};
22)       SB ← SB ∪ {r};
23)     end
24)   end
25) end

Step 1: itemsets of FC_2.

FC_2                SB
Itemset  Supp       Rule      Conf
{AC}     3/5        C → A     3/4
{BE}     4/5

Step 2: itemsets of FC_3.

FC_3                SB
Itemset  Supp       Rule      Conf
{BCE}    3/5        C → A     3/4
                    BE → C    3/4

Step 3: itemsets of FC_4.

FC_4                SB
Itemset  Supp       Rule      Conf
{ABCE}   2/5        C → A     3/4
                    BE → C    3/4
                    AC → BE   2/3

Figure 5: Generating Structural Basis for Approximate Association Rules.
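The following Python sketch illustrates Algorithm 5 on the frequent closed itemsets of Example 10. It is an assumption-laden illustration rather than the authors' code: the names `structural_basis` and `CLOSED` are invented here, and a set of discarded antecedents replaces the explicit sets S_n. The pseudo-code only says that the "first" rule of maximal confidence is kept, so when several candidates tie (BCE → A and AC → BE both have confidence 2/3 for l = ABCE) the rule retained depends on the visiting order and may differ from the one displayed in Figure 5.

```python
from fractions import Fraction

# Frequent closed non-empty itemsets of D and their supports (minimum support 2/5).
CLOSED = {frozenset(s): Fraction(n, 5) for s, n in [
    ("C", 4), ("AC", 3), ("BE", 4), ("BCE", 3), ("ABCE", 2),
]}


def structural_basis(closed, minconf):
    """Keep, for each closed itemset l, one non-transitive rule of maximal confidence."""
    basis = []
    for l in sorted(closed, key=len):                       # steps 2 and 3
        # Closed proper subsets of l, largest first (steps 5 to 8).
        subsets = sorted((c for c in closed if c < l), key=len, reverse=True)
        discarded = set()                                   # removed from the S_n
        candidates = []                                     # the set CR (step 4)
        for lp in subsets:
            if lp in discarded:
                continue
            conf = closed[l] / closed[lp]                   # step 10
            if conf >= minconf:                             # step 11
                candidates.append((lp, l - lp, conf))       # step 12
                # Smaller subsets of l' would only give transitive rules.
                discarded.update(c for c in subsets if c < lp)   # steps 13 to 15
        if candidates:                                      # steps 19 to 23
            best = max(conf for _, _, conf in candidates)
            basis.append(next(r for r in candidates if r[2] == best))
    return basis


if __name__ == "__main__":
    for lp, rest, conf in structural_basis(CLOSED, Fraction(1, 2)):
        print("".join(sorted(lp)), "->", "".join(sorted(rest)), conf)
```

On this input the sketch returns three rules, one each for AC, BCE and ABCE, with confidences 3/4, 3/4 and 2/3, which agrees with the confidences listed in Figure 5.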

6 Experimental Results

Experiments were performed on a Pentium II PC with a 350 MHz clock rate, 128 MBytes of RAM, running the Linux operating system. Algorithms were implemented in C++. Characteristics of the datasets used are given in Table 12. These datasets are the T10I4D100K⁴ synthetic dataset that mimics market basket data, the C20D10K and the C73D10K census datasets from the PUMS sample file⁵, and the Mushrooms⁶ dataset describing mushroom characteristics. In all experiments, we attempted to choose significant minimum support and confidence threshold values: we observed threshold values used in other papers for experiments on similar data types and inspected rules extracted in the bases.

⁴ http://www.almaden.ibm.com/cs/quest/syndata.html
⁵ ftp://ftp2.cc.ukans.edu/pub/ippbr/census/pums/pums90ks.zip
⁶

Name         Number of objects   Average size   Number of items
T10I4D100K   100,000             10             1,000
Mushrooms    8,416               23             127
C20D10K      10,000              20             386
C73D10K      10,000              73             2,177

Table 12: Datasets.

6.1 Relative Performance of Apriori and Apriori-Close

We conducted experiments to compare response times obtained with Apriori and Apriori-Close on the four datasets. Results for the T10I4D100K and Mushrooms datasets are presented in Table 13. We can observe that execution times are nearly identical for the two algorithms: adding the frequent closed itemset derivation to the frequent itemset discovery does not induce additional computation time. Similar results were obtained for the C20D10K and C73D10K datasets.

T10I4D100K
Minsupp   Apriori   Apriori-Close
2.0%      1.99s     1.97s
1.0%      3.47s     3.46s
0.5%      9.62s     9.70s
0.25%     15.02s    14.92s

Mushrooms
Minsupp   Apriori   Apriori-Close
90%       0.28s     0.28s
70%       0.73s     0.73s
50%       2.40s     2.70s
30%       18.22s    17.93s

Table 13: Execution Times of Apriori and Apriori-Close.

6.2 Number of Rules and Execution Times of the Rule Generation

Table 14 shows the total number of exact association rules and their number in the Duquenne-Guigues basis for exact rules. Table 15 shows the total number of approximate association rules, their number in the proper basis and in the structural basis for approximate rules, and the number of non-transitive rules in the proper basis for approximate rules (5th column). For example, in the context D, rules C → A and AC → BE are extracted, as well as the rule C → ABE, which is clearly transitive. Since, by construction, its confidence (retrieved by multiplying the confidences of the two former rules) is less than theirs, this rule is the least interesting among the three. Reducing the extraction to non-transitive rules in the proper basis for approximate rules can also be interesting. Such rules are generated by a variant of Algorithm 5 with the last pruning strategy (steps 20 and 21) removed: all candidate rules in CR are inserted in SB.
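The multiplication of confidences mentioned above can be written out for the context D, using the supports supp(C) = 4/5, supp(AC) = 3/5 and supp(ABCE) = 2/5 reported in the figures of Section 5 (the notation conf and supp is used here only for this worked example):

```latex
\[
\mathrm{conf}(C \to ABE)
  = \frac{\mathrm{supp}(ABCE)}{\mathrm{supp}(C)}
  = \frac{\mathrm{supp}(AC)}{\mathrm{supp}(C)} \cdot \frac{\mathrm{supp}(ABCE)}{\mathrm{supp}(AC)}
  = \mathrm{conf}(C \to A) \cdot \mathrm{conf}(AC \to BE)
  = \frac{3}{4} \cdot \frac{2}{3}
  = \frac{1}{2}
\]
```

Both factors are at most 1, so the confidence of the transitive rule cannot exceed that of either rule it is composed from.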

Table 16 shows, for the four datasets, the average relative size of the bases compared with the sets of all rules obtained. In the case of weakly correlated data (T10I4D100K), no exact rule is generated and the proper basis for approximate rules contains all approximate rules that hold. The reason is that, in such data, all frequent itemsets are frequent closed itemsets. In the case of correlated data (Mushrooms, C20D10K and C73D10K), the number of extracted rules in the bases is much smaller than the total number of rules that hold.
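The averages of Table 16 can be related to the counts of Table 15 with a few lines of Python. The sketch below assumes that "average relative size" means the arithmetic mean, over the four minimum confidence settings, of the ratio between the size of a basis and the total number of approximate rules; this interpretation is an assumption, and the names `MUSHROOMS` and `average_relative_size` are introduced here for illustration only. The figures are the Mushrooms rows of Table 15.

```python
# Mushrooms rows of Table 15:
# minconf (%) -> (approximate rules, proper basis, non-transitive basis, structural basis)
MUSHROOMS = {
    90: (12911, 806, 563, 313),
    70: (37671, 2454, 968, 384),
    50: (56703, 3870, 1169, 410),
    30: (71412, 5727, 1260, 424),
}


def average_relative_size(rows, column):
    """Mean over the minconf settings of basis size / number of all rules, in percent."""
    ratios = [row[column] / row[0] for row in rows.values()]
    return 100 * sum(ratios) / len(ratios)


if __name__ == "__main__":
    for name, column in [("proper", 1), ("non-transitive", 2), ("structural", 3)]:
        print(f"{name}: {average_relative_size(MUSHROOMS, column):.2f}%")
```

Under this assumption the three averages round to 6.90%, 2.69% and 1.19%, the values given for Mushrooms in Table 16.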

Figure 6 shows, for each dataset, the execution times of the computation of all rules (using the algorithm described in [2]) and of the bases. Execution times of the derivation of the Duquenne-Guigues basis for exact rules and the proper basis for non-transitive approximate rules are not presented since they are identical to those of the derivation of the Duquenne-Guigues basis for exact rules and the structural basis for approximate rules (Duquenne-Guigues and structural bases).

7 Conclusion

Dataset      Minsupp   Exact rules   Duquenne-Guigues basis
T10I4D100K   0.5%      0             0
Mushrooms    30%       7,476         69
C20D10K      50%       2,277         11
C73D10K      90%       52,035        15

Table 14: Number of Exact Association Rules Extracted.

Dataset      Minconf   Approximate   Proper    Non-transitive   Structural
(Minsupp)              rules         basis     basis            basis
T10I4D100K   90%       16,260        16,260    3,511            916
(0.5%)       70%       20,419        20,419    4,004            1,058
             50%       21,686        21,686    4,191            1,140
             30%       22,952        22,952    4,519            1,367
Mushrooms    90%       12,911        806       563              313
(30%)        70%       37,671        2,454     968              384
             50%       56,703        3,870     1,169            410
             30%       71,412        5,727     1,260            424
C20D10K      90%       36,012        4,008     1,379            443
(50%)        70%       89,601        10,005    1,948            455
             50%       116,791       13,179    1,948            455
             30%       116,791       13,179    1,948            455
C73D10K      95%       1,606,726     23,084    4,052            939
(90%)        90%       2,053,896     32,644    4,089            941
             85%       2,053,936     32,646    4,089            941
             80%       2,053,936     32,646    4,089            941

Table 15: Number of Approximate Association Rules Extracted.

Dataset      Duquenne-Guigues   Proper    Non-transitive   Structural
             basis              basis     basis            basis
T10I4D100K   -                  100.00%   20.05%           5.49%
Mushrooms    0.92%              6.90%     2.69%            1.19%
C20D10K      0.48%              11.21%    2.33%            0.63%
C73D10K      0.03%              1.55%     0.21%            0.05%

Table 16: Average Relative Size of Bases.

information. Moreover, its size is significantly reduced compared with the set of all possible rules because redundant, and thus useless, rules are discarded. Our approach has a twofold advantage: on the one hand, the user is provided with a smaller set of resulting rules, easier to handle, and conveying information of improved quality. On the other hand, execution times are reduced compared with the discovery of all association rules. Such results are proved (in the groundwork of lattice theory) and illustrated by experiments carried out on real-life datasets.

Integrating reduction methods. Templates, as defined in [3, 16], can directly be used for extracting from the bases all association rules matching some user-specified patterns. Information in taxonomies associated with the dataset can also be integrated in the process as proposed in [14, 28] for extracting bases for generalized (multi-level) association rules. Integrating item constraints and statistical measures, such as described in [5, 22, 29] and [7, 25] respectively, in the generation of bases requires further work.

Functional and approximate dependencies. Algorithms presented in this paper can be adapted to generate bases for functional and approximate dependencies. In [15, 20], such bases and algorithms for generating them were proposed. However, the Duquenne-Guigues basis is smaller than the basis for functional dependencies constituted of minimal non-trivial functional dependencies. Hence, the number

[Figure 6 consists of four panels, one per dataset (T10I4D100K, Mushrooms, C20D10K, C73D10K). Each panel plots Time (s) against Minimum confidence (%) for three curves: All rules; Duquenne-Guigues and proper bases; Duquenne-Guigues and structural bases.]

Figure 6: Execution Times of the Association Rule Derivation.

maximal consequent [10, 13]. Furthermore, the proper and structural bases for approximate rules are also smaller than the basis for approximate dependencies defined in [15]. Adapting our algorithms to the discovery of functional and approximate dependencies is ongoing research.

Acknowledgements

The authors would like to gratefully acknowledge Rosine Cicchetti and Mohand-Saïd Hacid for their constructive comments.

References

[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. Proc. of the ACM SIGMOD Conference, pages 207-216, May 1993.

[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. Proc. of the 20th VLDB Conference, pages 478-499, June 1994. Expanded version in IBM Research Report RJ9839.

[3] E. Baralis and G. Psaila. Designing templates for mining association rules. Journal of Intelligent Information Systems, 9(1):7-32, July 1997.

[4] R. J. Bayardo. Efficiently mining long patterns from databases. Proc. of the ACM SIGMOD Conference, pages 85-93, June 1998.


1967. Third edition.

[7] S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlation. Proc. of the ACM SIGMOD Conference, pages 265-276, May 1997.

[8] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. Proc. of the ACM SIGMOD Conference, pages 255-264, May 1997.

[9] P. Burmeister. Formal concept analysis with ConImp: Introduction to the basic features. Technical report, Technische Hochschule Darmstadt, Germany, 1998.

[10] J. Demetrovics, L. Libkin, and I. B. Muchnik. Functional dependencies in relational databases: A lattice point of view. Discrete Applied Mathematics, 40:155-185, 1992.

[11] V. Duquenne and J.-L. Guigues. Famille minimale d'implication informatives résultant d'un tableau de données binaires. Mathématiques et Sciences Humaines, 24(95):5-18, 1986.

[12] B. Ganter and K. Reuter. Finding all closed sets: A general approach. In Order, pages 283-290. Kluwer Academic Publishers, 1991.

[13] B. Ganter and R. Wille. Formal Concept Analysis: Mathematical Foundations. Springer, 1998.

[14] J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. Proc. of the 21st VLDB Conference, pages 420-431, September 1995.

[15] Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. Efficient discovery of functional and approximate dependencies using partitions. Proc. of the 14th ICDE Conference, pages 392-401, February 1998.

[16] M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. Proc. of the 3rd CIKM Conference, pages 401-407, November 1994.

[17] D. Lin and Z. M. Kedem. Pincer-search: A new algorithm for discovering the maximum frequent set. Proc. of the 6th EDBT Conference, pages 105-119, March 1998.

[18] B. Liu, W. Hsu, and S. Chen. Using general impressions to analyse discovered classification rules. Proc. of the 3rd KDD Conference, pages 31-36, August 1997.

[19] M. Luxenburger. Implications partielles dans un contexte. Mathématiques, Informatique et Sciences Humaines, 29(113):35-55, 1991.

[20] H. Mannila and K. J. Räihä. Algorithms for inferring functional dependencies from relations. Data & Knowledge Engineering, 12(1):83-99, February 1994.

[21] R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. Proc. of the 22nd VLDB Conference, pages 122-133, September 1996.

[22] R. T. Ng, V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained association rules. Proc. of the ACM SIGMOD Conference, pages 13-24, June 1998.

[23] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. Proc. of the 7th ICDT Conference, pages 398-416, January 1999.

[24] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Efficient mining of association rules using closed itemset lattices. Information Systems, 24(1):25-46, 1999.

large databases. Proc. of the 21st VLDB Conference, pages 432-444, September 1995.

[27] A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Transactions on Knowledge and Data Engineering, 8(6):970-974, December 1996.

[28] R. Srikant and R. Agrawal. Mining generalized association rules. Proc. of the 21st VLDB Conference, pages 407-419, September 1995.

[29] R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. Proc. of the 3rd KDD Conference, pages 67-73, August 1997.

[30] H. Toivonen. Sampling large databases for association rules. Proc. of the 22nd VLDB Conference, pages 134-145, September 1996.

[31] R. Wille. Concept lattices and conceptual knowledge systems. Computers and Mathematics with Applications, 23:493-515, 1992.

[32] M. J. Zaki and M. Ogihara. Theoretical foundations of association rules. 3rd SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, June 1998.

[33] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. Proc. of the 3rd KDD Conference, pages 283-286, August 1997.
