HAL Id: hal-00467748
https://hal.archives-ouvertes.fr/hal-00467748
Submitted on 26 Apr 2010
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of
sci-entific research documents, whether they are
pub-lished or not. The documents may come from
teaching and research institutions in France or
abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est
destinée au dépôt et à la diffusion de documents
scientifiques de niveau recherche, publiés ou non,
émanant des établissements d’enseignement et de
recherche français ou étrangers, des laboratoires
publics ou privés.
Closed sets based discovery of small covers for
association rules
Nicolas Pasquier, Yves Bastide, Rafik Taouil, Lotfi Lakhal
To cite this version:
Nicolas Pasquier, Yves Bastide, Rafik Taouil, Lotfi Lakhal. Closed sets based discovery of small covers
for association rules. BDA’1999 international conference on Advanced Databases, Oct 1999, Bordeaux,
France. pp.361-381. �hal-00467748�
for Asso iation Rules
Ni olas Pasquier Yves Bastide RakTaouil Lot Lakhal Laboratoired'Informatique(LIMOS)
UniversiteBlaisePas al- Clermont-FerrandII ComplexeS ientiquedesCezeaux
24AvenuedesLandais,63177AubiereCedexFran e fpasquier,bastide,taouil,lakhalglibd2.univ-bp lermont.fr
Abstra t
Inthispaper,weaddresstheproblemoftheusefulnessofthesetofdis overedasso iationrules. This problemis importantsin ereal-lifedatabases yieldmostofthetimeseveralthousandsofruleswith high onden e. We propose new algorithmsbased onGalois losed sets to redu e the extra tion tosmall overs, or bases, for exa t andapproximaterules. On efrequent losed itemsets {whi h onstituteageneratingsetforbothfrequentitemsetsand asso iationrules {havebeendis overed, no additional database pass is needed to derive these bases. Experiments ondu ted on real-life databasesshowthatthesealgorithmsareeÆ ientandvaluableinpra ti e.
Keywords: data mining, Galois losure operator, frequent losed itemsets, bases for asso iation rules,algorithms.
1 Introdu tion and Motivation
Data mining has been extensively addressed for the last years, spe ially the problem of dis overing asso iationrules. Theaim when dis overingasso iation rules is to exhibit relationships betweendata items (or attributes) and ompute the pre ision of ea h relationship in the database. Usual pre ision measures aresupportand onden e [1℄thatpointtheproportionofdatabasetransa tions(or obje ts) upholding ea h rule out. When an asso iation rule has support and onden e ex eeding some user-dened minimumthresholds,theruleis onsideredasrelevantandtheextra tedknowledgewouldlikely be used for supporting de isionmaking. A lassi al exampleof asso iationrules ts in the ontext of marketbasketdataanalysisandhighlightsaparti ularfeaturein ustomersbehavior: 80%of ustomers whobuy erealsand sugaralsobuy milk.
Sin etheproblem wasstated [1℄,various approa heshavebeenproposed for anin reasedeÆ ien y ofruledis overy[2,4,8,17,23,24,26,30,33℄. However,fullytakingadvantageofexhibitedknowledge means apabilitiesto handlesu haknowledge. Infa t,byusingasyntheti dataset ontaining100,000 obje ts,ea hofwhi hen ompassingaround10items,ourexperimentsyieldmorethan16,000ruleswith onden e out oming90%. Theproblemis mu h more riti alwhen olle teddata ishighly orrelated ordense,likeinstatisti alormedi aldatabases. Forinstan e,whenappliedtoa ensusdatasetof10,000 obje ts,ea hofwhi h hara terizedbyvaluesof73attributes,experimentsresultinmorethan2,000,000 ruleswithsupport and onden e out oming90%.
Thus the talkedissue ould berephrasedas follows: whi h relevantknowledge an belearnedfrom several thousandsofruleshighly redundant? Whi haid ouldbeoeredto usersforhandling ountless rulesand fo usingonuseful ones? Beforeexplaininghowourapproa hanswersthepreviousquestions,
Among approa hes addressingthe des ribed issue, twomain trends an be distinguished. The former provides userswith me hanismsfor lteringrules. In[3, 16℄, the userdenes templates, and rulesnot mat hing with them are dis arded. In [22, 29℄, boolean operators are introdu ed for sele ting rules in luding(or not) givenitems. A similar approa hexpanded withameasure ofusefulnessof extra ted rules, alledimprovement,isproposedin[5℄. In[21℄,anSQL-likeoperator alledMINE RULE,allowingthe spe i ation ofgeneral extra tion riteria,is proposed. Thequoted approa hesoperate \aposteriori", i.e. on e hugeamountof rulesare extra ted, queryingfa ilities makeit possibleto handle rulesubsets sele ted a ording to the user preferen es. In ontrast, the se ond trend addresses the problem with an\apriori"vision, byattempting to minimizethenumberof exhibitedrules. In[14, 28℄, information about taxonomies are used to dene riteria of interest whi h apply for pruning redundant rules. In [7, 25℄,statisti al measuressu has Pearson's orrelationor the hi-squaredtestare usedinsteadofthe onden emeasure.
1.2 Contribution: an Overview
Theapproa hpresentedinthispaperbelongstothese ondtrendsin eitaimstoextra tnotallpossible rulesbut asub-set alled small overor basis for asso iation rules. When omputing su h a basis, re-dundantrulesaredis ardedsin etheydonotvehi ulerelevantknowledge. Su h apruningoperationis akey-step duringrule extra tion, andsigni antlyredu estheresulting set. For example,experiments performed usingareal-lifedatasetdes ribing hara teristi sofmushroomsyieldthe 9following asso i-ation rules with free gills in the ante edent and eatable in the onsequent, and with ommon support (51%)and onden e(54%).
1)freegills!eatable 6)freegills, white veil!eatable,partialveil 2)freegills!eatable, partial veil 7)freegills, partialveil !eatable
3)freegills!eatable, white veil 8)freegills, partialveil !eatable, whiteveil 4)freegills!eatable, partial veil, whiteveil 9)freegills, white veil, partialveil! eatable 5)freegills, whiteveil! eatable
Amongtheserules,8areredundantbe ausethey anbededu edfromthe4 th
rule: freegills! eatable, partialveil, whiteveil. Moreover,sin erulesunexpe tedbytheuserareimportant[18,27℄,presentinga listofrules overingallthefrequentitemsin thedatasetisalsoneeded.
First,usingthe losureoperatoroftheGalois onne tion[6℄,we hara terizefrequent loseditemsets [23,24℄. Then,weshowthatfrequent loseditemsetsrepresentageneratingsetforbothfrequentitemsets andasso iation rules. The underlyingtheorem statesthefoundationsof ourapproa h sin eit makesit possibleto generate thebases fromfrequent loseditemsets byavoidinghandlingof largesets ofrules. We propose two new algorithms: the former a hievesfrequent losed itemsets from frequent itemsets withouta essingthedataset,andthelatter, alled Apriori-Close,extends theApriorialgorithm[2℄by dis overingsimultaneously frequent itemsets and frequent losed itemsets withoutadditionalexe ution time.
Then, using the frequent losed itemsets and the pseudo- losed itemsets dened by Duquenne and Guiguesinlatti etheory[9,11℄,wedenetheDuquenne-Guiguesbasis forexa t asso iation rules (rules witha100% onden e). Rulesinthisbasisarenon-redundantexa truleswithminimalante edentand maximal onsequent. Besides, using thefrequent losed itemsets andresults proposed by Luxenburger inlatti etheory[19,32℄, wedenetheproperbasis andthestru tural basis for approximate asso iation rules. Theproperbasisisasmallset ontainingthemostinformativeandusefulapproximaterules: the non-redundantinformativerules. The stru turalbasis an beviewed as anabstra t ofall approximate rulesthathold and an beusefulwhentheproperbasisislarge. Weproposethree algorithmsintended foryielding these threebases. Usingthe set offrequent losed itemsets, generating theevokedbases is performedwithoutanya essto thedataset.
number of items some tens. From the results presented in [19℄, no algorithm was proposed. In [24℄, the asso iation rule framework basedon theGalois onne tion is dened. Fitting in this groundwork, twoeÆ ientalgorithmsthat dis overfrequent loseditemsetsforasso iationrulesaredened: theClose algorithm [24℄for orrelated dataandtheA-Close algorithm[23℄ forweakly orrelateddata. Thework presentedin thispaperdiersfrom [23,24℄inthefollowingpoints:
1. Itshowsthatfrequent loseditemsets onstituteagenerating setforfrequentitemsets and asso i-ationrules.
2. ItextendstheApriorialgorithmandalgorithmsfordis overingmaximalfrequentitemsetsto gen-eratefrequent loseditemsets.
3. Itadapts theDuquenne-Guigues basisandLuxenburgerresultsfor exa tand partial impli ations tothe ontextofasso iationrules. Thisadaptationisbasedon1. (generatingset).
4. Itpresentsnewalgorithmsforgenerating basesfor exa tand approximate asso iationrulesusing frequent loseditemsets.
5. Itshowsthatthealgorithms proposed areeÆ ientforbothimprovingtheusefulnessofextra ted asso iationrulesandde reasingtheexe utiontime oftheasso iationruleextra tion.
As shown by experiments, the proposed pro ess for extra ting bases does not require any overhead omparedwiththetraditionalapproa hesfordis overingasso iationrules.
1.3 Paper Organization
In Se tion 2, we present the asso iation rule framework based on the Galois onne tion. Se tion 3 addresses the on ept of basis for both exa t and approximate asso iation rules. New algorithms for dis overing frequent and frequent losed itemsets are des ribed in Se tion 4 and the following se tion presentsalgorithms omputingthebasesforasso iationrulesfromthefrequent loseditemsets. Experi-mental resultsa hieved fromvarious datasets aregivenin Se tion 6. Finally, as a on lusion,weevoke further workin Se tion7.
2 Asso iation Rule Framework
In this se tion, we present the asso iation rule framework based on the Galois onne tion, primarily introdu edin[23, 24℄.
Denition 1(Data mining ontext) A data mining ontext 1
is denedas D =(O;I;R), where O andI arenitesetsofobje tsanditemsrespe tively. ROI isabinaryrelationbetweenobje tsand items. Ea h ouple(o;i)2Rdenotes the fa t thatthe obje to2O isrelatedtothe itemi2I.
Depending onthetargetsystem,adatamining ontext an bearelation, a lass,ortheresultofan SQL/OQLquery.
Example1 An exampledatamining ontextD onsistingof5obje ts(identiedbytheirOID) and5 itemsisillustratedinTable1.
Denition 2(Galois onne tion) Let D = (O, I, R) be a data mining ontext. For O O and I I, wedene: f:2 O !2 I g:2 I !2 O
f(O)=fi2Ij8o2O;(o;i)2Rg g(I)=fo2Oj8i2I;(o;i)2Rg 1
1 A C D 2 B C E 3 A B C E 4 B E
5 A B C E
Table1: TheExampleDataMiningContextD.
f(O) asso iates with O the items ommon to all obje ts o 2 O and g(I) asso iates with I the obje ts relatedtoallitemsi2I. The oupleofappli ations(f;g)isaGalois onne tionbetweenthepowersetof O(2
O
)andthepowersetofI(2 I
). ThefollowingpropertiesholdforallI;I 1 ;I 2 I andO;O 1 ;O 2 O: (1) I 1 I 2 )g(I 1 )g(I 2 ) (1') O 1 O 2 )f(O 1 )f(O 2 ) (2) Og(I)()I f(O)
Denition3(Frequentitemsets) Let I I be a set of items from D. The support ount of the itemsetI inDis:
supp(I)= kg(I)k
kOk
I issaidtobefrequent ifthe supportof I inD isatleastminsupp. The setL of frequent itemsetsinD is:
L=fIIjsupp(I)minsuppg
Denition4(Asso iation rules) An asso iation rule is an impli ation between two itemsets, with the form I 1 !I 2 where I 1 ;I 2 I, I 1 ;I 2 6= ? and I 1 \I 2 = ?. I 1 and I 2
are alled respe tively the ante edentandthe onsequentof the rule. Thesupportsupp(r) and onden e onf(r)of an asso iation ruler:I
1 !I
2
aredenedusingthe Galois onne tion asfollows:
supp(r)= kg(I 1 [I 2 )k kOk ; onf(r)= supp(I 1 [I 2 ) supp(I 1 )
Asso iation rulesholding in the ontext arethosethat havesupportand onden egreaterthanor equal tothe minsuppandmin onfthresholdsrespe tively. Wedenethe setAR ofasso iation rulesholding in Dgiven minsuppandmin onfthresholds asfollows:
AR=fr:I 1 !I 2 I 1 jI 1 I 2 I ^ supp(I 2
)minsupp ^ onf(r)min onfg
If onf(r)=1thenris alledanexa tasso iationruleorimpli ationrule,otherwiseris alledapproximate asso iation rule.
Example2 Exa tandapproximateasso iationrulesextra tedfromDforminsupp=2/5andmin onf =1/2aregiveninTable2.
3 Bases for Asso iation Rules
In this se tion, we rst demonstrate that the frequent losed itemsets onstitute a generating set for frequent itemsets and asso iation rules. Then, we hara terize the Duquenne-Guigues basis for exa t asso iation rules andtheproper andstru tural basesfor approximate asso iation rules. The Duquenne-Guiguesbasis, asdened in[11℄, isextended in thispaperto the ontextofasso iationrules. Proofsof Theorems2,3and4arestraightforwardfrom Theorem1and[11,19,32℄. Interestedreaders ouldrefer
ABC)E 2/5 BCE!A 2/5 2/3 B!AE 2/5 2/4 ABE)C 2/5 AC!BE 2/5 2/3 E!AB 2/5 2/4 ACE)B 2/5 BE!AC 2/5 2/4 A!CE 2/5 2/3 AB)CE 2/5 CE!AB 2/5 2/3 C!AE 2/5 2/4 AE)BC 2/5 AC!B 2/5 2/3 E!AC 2/5 2/4 AB)C 2/5 BC!A 2/5 2/3 B!CE 3/5 3/4 AB)E 2/5 BE!A 2/5 2/4 C!BE 3/5 3/4 AE)B 2/5 AC!E 2/5 2/3 E!BC 3/5 3/4 AE)C 2/5 CE!A 2/5 2/3 A!B 2/5 2/3 BC)E 3/5 BE!C 3/5 3/4 B!A 2/5 2/4 CE)B 3/5 A!BCE 2/5 2/3 C!A 3/5 3/4 A)C 3/5 B!ACE 2/5 2/4 A!E 2/5 2/3 B)E 4/5 C!ABE 2/5 2/4 E!A 2/5 2/4 E)B 4/5 E !ABC 2/5 2/4 B!C 3/5 3/4 A!BC 2/5 2/3 C!B 3/5 3/4 B!AC 2/5 2/4 C!E 3/5 3/4 C!AB 2/5 2/4 E!C 3/5 3/4 A!BE 2/5 2/3
Table2: Asso iationRulesExtra tedfrom Dforminsup =2/5andmin onf =1/2.
3.1 Generating Set
Denition 5(Galois losure operators) Theoperatorsh=fÆgin2 I andh 0 =gÆf in2 O areGalois losure operators 2
. Given the Galois onne tion (f;g), the following properties hold for allI;I 1 ;I 2 I andO;O 1 ;O 2 O[6 ℄: Extension: (3) I h(I) (3') Oh 0 (O) Idempoten y: (4) h(h(I))=h(I) (4') h
0 (h 0 (O))=h 0 (O) Monotoni ity: (5) I 1 I 2 )h(I 1 )h(I 2 ) (5') O 1 O 2 )h 0 (O 1 )h 0 (O 2 )
Denition 6(Frequent losed itemsets) An itemsetI I in Disa loseditemsetih(I)=I. A loseditemsetI issaidtobefrequentifthesupportofI inDisatleastminsupp. Thesmallest(minimal) loseditemset ontaining an itemsetI is h(I), the losure of I. The set FC of frequent losed itemsets in Disdenedasfollows:
FC=fIIjI =h(I) ^ supp(I)minsuppg
Example3 A frequent loseditemset isamaximal setof items ommonto asetof obje ts,forwhi h supportisatleastminsupp. Thefrequent loseditemsetsinthe ontextDforminsupp=2/5arepresented in Table3. Theitemset BCE is afrequent losed itemset sin eitis themaximalset ofitems ommon to theobje tsf2;3;5g. TheitemsetBC isnotafrequent loseditemsetsin eitisnotamaximalset of items ommontosomeobje ts: allobje tsinrelation withtheitemsB andC (obje ts2,3and5) are alsoin relationwiththeitemE.
Hereafter,wedemonstratethattheset offrequent loseditemsetswith theirsupportisthesmallest olle tionfrom whi h frequentitemsets withtheirsupport andasso iationrules an begenerated(it is ageneratingset).
Lemma1 [24 ℄ The support of an itemsetI is equal tothe support of the smallest losed itemset on-taining I: supp(I)=supp(h(I)).
Lemma2 [24 ℄ The set of maximalfrequentitemsets M=fI 2L j I 0
2L where I I 0
gisidenti al tothe setof maximalfrequent loseditemsets MC=fI 2FC jI
0
2FC where I I 0
g. 2
f?g 5/5 fCg 4/5 fACg 3/5 fBEg 4/5 fBCEg 3/5 fABCEg 2/5
Table3: FrequentClosedItemsetsExtra tedfromDforminsupp =2/5.
Theorem 1(Generatingset) ThesetFCoffrequent loseditemsetswiththeirsupportisagenerating setfor allfrequentitemsets andtheir support, andfor all asso iation rulesholding inthe dataset, their support andtheir onden e.
Proof. Based on Lemma 2, all frequent itemsets an be derived from the maximal frequent losed itemsets. Based onLemma 1,the support ofea h frequentitemset an bederivedfrom thesupport of frequent losed itemsets. Then, theset of frequent loseditemsets FC isagenerating set forboththe setoffrequentitemsetsLandthesetofasso iationrulesAR
3 .
3.2 Duquenne-Guigues Basis for Exa t Asso iation Rules
Denition7(Frequentpseudo- losed itemsets) AnitemsetI I in Disapseudo- losed itemset ih(I)6=I and8I
0
I su hasI 0
isapseudo- loseditemset, wehaveh(I 0
)I. ThesetFPoffrequent pseudo- loseditemsets inDisdenedas
FP =fI I jsupp(I)minsupp ^I 6=h(I) ^ 8I 0 2FP su hasI 0 I wehaveh(I 0 )Ig Theorem 2(Duquenne-Guigues Basis forExa t Asso iation Rules) Let FP be the set of fre-quentpseudo- loseditemsetsinD. Theset
DG=fr:I 1 )h(I 1 ) I 1 j I 1 2FP ^ I 1 6=?g isabasis forallexa t asso iation rules holding inthe dataset.
The Duquenne-Guigues basisis minimal with respe tto thenumber ofrules sin e there an beno ompletesetwithfewerrulesthantherearefrequentpseudo- loseditemsets[10,13℄.
Example4 Afrequentpseudo- loseditemsetIisafrequentnon- loseditemsetthatin ludesthe losures ofallfrequentpseudo- loseditemsetsin ludedinI. ThesetFPoffrequentpseudo- loseditemsetsandthe Duquenne-Guiguesbasisforexa tasso iationrulesextra tedfromDforminsupp=2=5andmin onf=1=2 arepresentedin Table 4. Theitemset AB isnotafrequent pseudo- loseditemset sin ethe losuresof A and B (respe tively AC and BE) are notin luded in AB. ABCE is not afrequent pseudo- losed itemsetsin eitis losed.
Frequentpseudo- loseditemset Support
fAg 3/5
fBg 4/5
fEg 4/5
Exa trule Support A)C 3/5 B)E 4/5 E)B 4/5
Table4: FrequentPseudo-ClosedItemsetsandDuquenne-Guigues BasisExtra tedfrom Dforminsupp =2=5.
3
Theorem 3 (ProperBasis for Approximate Asso iation Rules) Let FC be the set of frequent loseditemsetsinD. Theset
PB=fr:I 1 !I 2 I 1 jI 1 ;I 2 2FC ^I 1 6=?^ I 1 I 2 ^ onf(r)min onfg
isabasisforallapproximateasso iationrulesholdinginthedataset. Asso iationrulesinPB areproper approximate asso iation rules.
Example5 Theproperbasisforapproximateasso iationrulesextra tedfromDforminsupp=2/5and min onf=1/2arepresentedin Table5.
Approximaterule Support Conden e BCE!A 2/5 2/3 AC!BE 2/5 2/3 BE!AC 2/5 2/4 BE!C 3/5 3/4 C!ABE 2/5 2/4 C!BE 3/5 3/4 C!A 3/5 3/4
Table5: ProperBasisExtra tedfrom Dforminsupp =2/5andmin onf =1/2.
3.4 Stru tural Basis for Approximate Asso iation Rules Denition 8(Undire ted graph G
FC
) LetFC bethesetoffrequent loseditemsetsinD. Wedene G
FC
=(V;E)astheundire tedgraphasso iatedwithFC wherethesetofverti esV andthesetofedges E aredenedasfollows:
V =fIIjI 2FCg E=f(I 1 ;I 2 )2V V j I 1 I 2 ^ supp(I 2 )=supp(I 1 )min onfg Withea hedgeinG
FC
betweentwoverti esI 1 andI 2 withI 1 I 2
isasso iatedthe onden e=supp(I 2
) /supp(I
1
)of the proper approximateasso iation ruleI 1
!I 2
I 1
representedbythe edge. Denition 9(MaximalConden e Spanning Forest F
FC ) LetF
FC
=(V;E 0
)bethemaximal on-den e spanning forest asso iated with FC. F
FC
is obtained from the undire ted graph G FC
= (V;E) by suppressingtransitive edges and y les. Cy les areremovedby deletingsomeedges that enterthe last vertexI (maximalvertexwithrespe ttothein lusion) ofthe y le. Amongalledges enteringinI,those with onden eless thanthe maximal onden evalueasso iatedwith anedge with theform(I
0
;I)2E aredeleted. Ifmorethanoneedgehave themaximal onden evalue,therstoneinlexi ographi order iskept.
Theorem 4 (Stru tural Basisfor Approximate Asso iation Rules) LetSB be thesetof asso i-ation rulesrepresentedby edges inF
FC
ex ept rules fromthe vertexf?g. Theset SB=fr:I 1 !I 2 I 1 jI 1 ;I 2 2V ^ I 1 I 2 ^ I 1 6=? ^(I 1 ;I 2 )2E 0 g
isabasisfor allapproximateasso iation rulesholding inthedataset(I isthe onsequentofatmostone approximate asso iation ruleinSB).
Example6 Thestru tural basisforapproximateasso iationrulesextra tedfrom D forminsupp=2/5 andmin onf=1/2ispresentedin Table6.
2/3
A B C E
3/4
3/4
A C
B E
C
B C E
3/4
2/3
2/4
2/4
4/5
4/5
Ø
2/5
3/5
3/5
G FCA B C E
3/4
3/4
A C
B E
C
B C E
2/3
4/5
4/5
Ø
F FC Figure1: Undire tedGraphGFC
andMaximalConden eSpanningForestF FC
(atreeinthisexample) DerivedfromDforminsupp =2/5andmin onf =1/2.
Approximaterule Support Conden e AC!BE 2/5 2/3 BE!C 3/5 3/4 C!A 3/5 3/4
Table6: Stru turalBasisExtra tedfromDforminsupp=2/5andmin onf =1/2.
4 Dis overing Frequent and Frequent Closed Itemsets
InSe tion 4.1, we propose a newalgorithm to a hievefrequent losed itemsets from frequent itemsets withouta essing the dataset. This algorithm dis oversfrequent losed itemsets while for instan e an algorithm for dis overingmaximal frequent itemsets [4, 17, 33℄ is used. In Se tion 4.2, we present an extension of theApriori algorithm [2℄ alled Apriori-Close fordis overingfrequent and frequent losed itemsetswithoutadditional omputationtime. LikeintheApriorialgorithm,weassumeinthefollowing thatitemsaresortedinlexi ographi orderandthat kisthesizeofthelargestfrequentitemsets. Based onLemma 2,k isalsothesize ofthelargestfrequent loseditemsets.
4.1 Computing Frequent Closed Itemsets from Frequent Itemsets
Many eÆ ient algorithms for mining frequent itemsets and their support have been proposed. Well-knownproposalsarepresentedin[2,8,26,30℄. EÆ ientalgorithmsfordis overingthemaximalfrequent itemsetsandthena hieveallfrequentitemsetshavealsobeenproposed [4,17, 33℄. All thesealgorithms giveasresultthesetL=
S i=k i=1 L i whereL i
ontainsallfrequenti-itemsets(itemsetsofsizei). Basedon Proposition1andLemma2(Se tion3.1),thefrequent loseditemsetsandtheirsupport anbe omputed fromthefrequentitemsets andtheirsupportwithoutanydataseta ess.
Thepseudo- odetodeterminefrequent loseditemsetsamongfrequentitemsetsisgiveninAlgorithm 1. NotationsaregiveninTable7. TheinputofthealgorithmaresetsL
i
,1ik, ontainingallfrequent itemsetsinthedataset. Itre ursivelygeneratesthesetsFC
i
,0ik,offrequent losedi-itemsetsfrom FC k to FC 0 . L i
Setoffrequenti-itemsetsandtheirsupport. FC
i
Setoffrequent losedi-itemsetsandtheirsupport.
is losed Variableindi atingifthe onsidereditemset is losedornot. Table7: Notations.
the Galois onne tion). If g(l)= g(s) then h(l) =h(s) ) l = h(s) ) s l (absurd). It followsthat g(l)g(s))supp(l)>supp(s).
Algorithm1DerivingFrequentClosedItemsetsfromFrequentItemsets. 1) FCk Lk;
2) for(i k 1;i6=0;i--)dobegin 3) FCi fg;
4) forallitemsetsl2Li dobegin 5) is l osed true;
6) forallitemsetsl 0 2L i+1 dobegin 7) if (ll 0 )and(l .support=l 0
.support)thenis l osed fal se; 8) end
9) if (is l osed=true)thenFCi FCi[fl g; 10) end
11) end
12) FC0 f?g;
13) forallitemsetsl2L1 dobegin
14) if (l .support=kOk)thenFC0 fg; 15) end
First, the set FC k
is initialized with the set of largest frequent itemsets L k
(step 1). Then, the algorithm iterativelydetermineswhi h i-itemsetsin L
i
are losed from L k 1
to L 1
(steps 2to11). At thebeginningofthei
th
iterationthesetFC i
offrequent losedi-itemsetsisempty(step 3). Insteps4 to10,forea hfrequentitemsetl inL
i
,weverifythatlhasthesamesupportasafrequent(i+1)-itemset l
0 in L
i+1
in whi h it is in luded. If so, wehave l 0
h(l)and then l 6=h(l): l is not losed (step 7). Otherwise, l is afrequent losed itemset and is inserted in FC
i
(step 9). During the last phase, the algorithmdeterminesiftheemptyitemsetis losedbyrstinitializingFC
0
withtheemptyitemset(step 12) and then onsidering all frequent1-itemsets in L
1
(steps 13 to 15). Ifa 1-itemset l has asupport equaltothenumberofobje tsinthe ontext,meaningthatl is ommontoallobje ts,thentheitemset ? annotbe losed(wehavesupp(f?g)=kOk=supp(l))andisremovedfromFC
0
(step14). Thus,at theendofthealgorithm,ea hsetFC
i
ontainsallfrequent losed i-itemsets.
Corre tness Sin eallmaximalfrequentitemsetsaremaximalfrequent loseditemsets(Lemma2),the omputationofthesetFC
k
ontainingthelargestfrequent loseditemsetsis orre t. The orre tnessof the omputationofsets FC
i
fori<kreliesonProposition1. Thispropositionenablesto determineifa frequenti-itemsetl is losedby omparing its supportand thesupportsof thefrequent(i+1)-itemsets in whi h lisin luded. Ifoneofthemhasthesamesupport asl,thenl annotbe losed.
4.2 Apriori-Close Algorithm
Inthis se tion,wepresentanextension oftheApriorialgorithm [2℄ omputingsimultaneouslyfrequent and frequent losed itemsets. Thepseudo- ode is givenin Algorithm2 andnotations in Table8. The algorithm iterativelygeneratesthesetsL
i
offrequenti-itemsetsfromL 1
to L k
. Besides, duringthei th iteration,allfrequent losed(i 1)-itemsetsinFC
i 1
aredetermined. ThesetFC k
isdeterminedduring thelaststepofthealgorithm.
L i
Setoffrequenti-itemsets,theirsupportand markeris losedindi atingif losedornot. FC
i
Setoffrequent losedi-itemsetsandtheirsupport. Table8: Notations. First,thevariablek isinitializedto 0(step1). Then, theset L
1
1) k 0;
2) itemsetsinL1 f1-itemsetsg; 3) L1 Support-Count(L1); 4) FC0 f?g;
5) forallitemsetsl2L1 dobegin
6) if (l .support<minsupp)thenL1 L1nfl g; 7) elseif (l .support=kOk)thenFC0 fg; 8) end 9) for(i 1;L i 6=fg; i++)dobegin 10) forallitemsetsl 0 2L i dol 0 .is losed true; 11) L i+1 Apriori-Gen(L i ); 12) forallitemsetsl2L i+1 dobegin 13) foralli-subsetsl 0 ofldobegin 14) if (l 0 62L i )thenL i+1 L i+1 nfl g; 15) end 16) end 17) Li+1 Support-Count(Li+1); 18) forallitemsetsl2Li+1dobegin
19) if (l .support<minsupp)thenLi+1 Li+1nfl g; 20) elsedobegin 21) foralli-subsetsl 0 2L i ofldobegin 22) if (l .support=l 0 .support)thenl 0
.is losed fal se;
23) end 24) end 25) end 26) FCi fl2Lijl :is losed =trueg; 27) k i; 28) end 29) FCk Lk; 3). Theset FC 0
is initialized withthe empty itemset (step 4) and thesupports of itemsets in L 1
are onsidered (steps 5 to 8). All infrequent1-itemsets are removedfrom L
1
(step 6) and if afrequent 1-itemsethasasupportequaltothe numberofobje tsinthe ontext thentheemptyitemsetis removed fromFC
0
(step7). Duringea hofthefollowingiterations(steps9to28),frequentitemsets ofsize i+1, k>i1,and frequent loseditemsets ofsize i are omputedas follows. For allfrequenti-itemsetsin L
i
,themarkeris losedisinitializedto true(step 10). AsetL i+1
ofpossiblefrequent(i+1)-itemsetsis reatedbyapplyingtheApriori-Genfun tiontothesetL
i
(step11). Forea hofthese possiblefrequent (i+1)-itemsets,we he kthatallitssubsetsofsizeiexistinL
i
(steps12to16). Onepassisperformedto omputethesupportsoftheremainingitemsetsinL
i+1
(step17). Then,forea h(i+1)-itemsetsl2L i+1 (steps18to 25),ifl isinfrequentthenitisdis ardedfromL
1+1
(step 19). Otherwiseforalli-subsetsl 0 ofl,weverifythatsupportsofl
0
andl areequal;ifso,thenl 0
annotbea loseditemsetanditsmarker is losedis setto false(steps 20to 24). Then, allfrequenti-itemsetsinL
i
forwhi hmarkeris losedis trueare insertedin theset FC
i
of frequent losed i-itemsets(step 26) andthe variable k isset to the valueofi(step 27). Finally,thesetFC
k
isinitializedwiththefrequentk-itemsetsinL k
(step29). Apriori-Gen fun tion The Apriori-Gen fun tion [2℄ applies to a set L
i
of frequent i-itemsets. It returnsasetL
i+1
ofpotentialfrequent(i+1)-itemsets. Anewitemsetin L i+1
is reatedbyjoining two itemsetsin L
i
sharing ommonrsti-1items.
Support-Count fun tion TheSupport-Count fun tion takesaset L i
of i-itemsetsas argument. It eÆ iently omputesthesupportsofallitemsetsl2L
i
. Onlyonedatasetpassisrequired: forea hobje t o read, thesupports of allitemsets l 2 L
i
Corre tness Sin e the support of a frequent losed itemset l is dierent from the support of all its supersets(Proposition1), the omputationof sets FC
i
fori<k is orre t. Hen e,afrequenti-itemset l
0 2L
i
is determined losedor notby omparing its support with thesupports of all frequent(i+ 1)-itemsets l2L
i+1
for whi h l 0
l. Lemma2ensuresthe orre tnessofthe omputation oftheset FC k ontainingthelargestfrequent losed itemsets.
Example7 Figure 2illustrates theexe ution ofthe Apriori-Closealgorithm with the ontext Dfor a minimumsupportof2/5. S anD ! L1 Itemset Supp fAg 3/5 fBg 4/5 fCg 4/5 fDg 1/5 fEg 4/5 Pruning infrequent ! L 1 Itemset Supp fAg 3/5 fBg 4/5 fCg 4/5 fEg 4/5 Determining losed ! FC 0 Itemset Supp f?g 5/5 S anD ! L 2 Itemset Supp fABg 2/5 fACg 3/5 fAEg 2/5 fBCg 3/5 fBEg 4/5 fCEg 3/5 Pruning infrequent ! L 2 Itemset Supp fABg 2/5 fACg 3/5 fAEg 2/5 fBCg 3/5 fBEg 4/5 fCEg 3/5 Determining losed ! FC1 Itemset Supp fCg 4/5 S anD ! L3 Itemset Supp fABCg 2/5 fABEg 2/5 fACEg 2/5 fBCEg 3/5 Pruning infrequent ! L3 Itemset Supp fABCg 2/5 fABEg 2/5 fACEg 2/5 fBCEg 3/5 Determining losed ! FC2 Itemset Supp fACg 3/5 fBEg 4/5 S anD ! L 4 Itemset Supp fABCEg 2/5 Pruning infrequent ! L 4 Itemset Supp fABCEg 2/5 Determining losed ! FC 3 Itemset Supp fBCEg 3/5 L 4 Itemset Supp fABCEg 2/5 Closed k-itemsets ! FC 4 Itemset Supp fABCEg 2/5
Figure2: Dis overingFrequentandFrequentClosedItemsetswithApriori-Close.
5 Generating Bases for Asso iation Rules
In Se tion 5.1, wepresentan algorithm to generate the Duquenne-Guigues basis for exa tasso iation rules. InSe tions5.2and5.3aredes ribedalgorithmsa hievingtheproperbasisandthestru turalbasis
Thepseudo- odegeneratingtheDuquenne-Guiguesbasisforexa tasso iationrulesisgiveninAlgorithm 3. Notationsare given in Table9. The algorithm takes as input thesets L
i
, 1ik, ontainingthe frequentitemsets andtheirsupport, andthesets FC
i
;0ik, ontainingthefrequent losed itemsets andtheir support. It rst omputesthefrequentpseudo- losed itemsetsiteratively(steps 2to 17)and thenusesthemtogeneratetheDuquenne-Guiguesbasisforexa tasso iationrulesDG(steps18to22).
L i
Setoffrequenti-itemsetsandtheirsupport. FC
i
Setoffrequent losed i-itemsetsandtheirsupport. FP
i
Setoffrequentpseudo- losedi-itemsets,their losureandtheirsupport. DG Duquenne-Guiguesbasisforexa tasso iationrules.
Table9: Notations.
First,thesetDGisinitializedtotheemptyset(step1). Iftheemptyitemsetisnota loseditemset (itisthen ne essarilyapseudo- loseditemset), it isinsertedin FP
0
(step 2). OtherwiseFP 0
isempty (step 3). Then, the algorithm re ursivelydetermineswhi h i-itemsetsin L
i
arepseudo- losed from L 1 to L
k
(steps 4 to 16). At ea h iteration, the set FP i
is initialized with the list of frequent i-itemsets that are not losed (step 5) and ea h frequent i-itemsets l in FP
i
is onsidered as follows (steps 6 to 15). The variable pseudo is set to true (step 7). We verify for ea h frequent pseudo- losed itemset p previouslydis overed(i.e. in FP
j
withj <i) ifpis ontainedin l (steps 8to 13). Inthat ase andif the losureofpisnotin ludedin l,thenlisnotpseudo- losedandisremovedfromFP
i
(steps9to12). Otherwise, the losure of l (i.e. the smallestfrequent losed itemset ontaining l) is determined (step 14). On eallfrequentpseudo- loseditemsets pandtheir losureare omputed, allruleswith theform r:p)(p. losure p) aregenerated (steps17 to 21). Thealgorithmresultsin the setDG ontaining allrulesin theDuquenne-Guigues basisforexa tasso iationrules.
Algorithm3GeneratingDuquenne-GuiguesBasisforExa tAsso iationRules. 1) DG fg;
2) if (FC0=fg)thenFP0 f?g; 3) elseFP0 fg;
4) for(i 1;ik;i++)dobegin 5) FP i L i nFC i ; 6) forallitemsetsl2FP i do begin 7) pseudo true; 8) forallitemsetsp2FP j withj<idobegin 9) if (pl )and(p. losure6l )thendobegin 10) pseudo fal se;
11) FPi FPinfl g;
12) end
13) end
14) if (pseudo=true)thenl . losure Min(f 2FCj>ijl g); 15) end
16) end
17) forallsetsFPiwhereFPi6=fgdobegin 18) forallpseudo- loseditemsetsp2FP i
dobegin 19) DG DG[fr:p)(p. losure p),p.supportg; 20) end
21) end
Corre tness Sin e theitemset ?hasno subset,if itis nota losed itemset thenit isby denition a pseudo- loseditemsetandthe omputationofthesetFP
0
is orre t. The orre tnessofthe omputation offrequentpseudo- losedi-itemsetsin FP for1ikrelies onDenition 7. Allfrequenti-itemsetsl
i i
pseudo- loseditemsetsthataresubsetsoflareinsertedinFP i
. A ordingtoDenition7,thesei-itemsets are allfrequent pseudo- losedi-itemsets andthe sets FP
i
are orre t. Theasso iation rulesgenerated in thelast phaseofthealgorithm areallruleswith afrequentpseudo- loseditemset in theante edent. Then,theresultingsetDG orrespondstotherulesintheDuquenne-Guiguesbasisforexa tasso iation rulesdened inTheorem2.
Example8 Figure 3shows the generation of the Duquenne-Guigues basisfor exa tasso iation rules from the ontext Dforaminimumsupportof2/5.
L 1 Itemset Supp fAg 3/5 fBg 4/5 fCg 4/5 fEg 4/5 FC1 Itemset Supp fCg 4/5 1 ! FP 1
Itemset Closure Supp fAg fACg 3/5 fBg fBEg 4/5 fEg fBEg 4/5 L 3 Itemset Supp fABCg 2/5 fABEg 2/5 fACEg 2/5 fBCEg 3/5 FC3 Itemset Supp fBCEg 3/5 3 ! FP 3 ? L2 Itemset Supp fABg 2/5 fACg 3/5 fAEg 2/5 fBCg 3/5 fBEg 4/5 fCEg 3/5 FC 2 Itemset Supp fACg 3/5 fBEg 4/5 2 ! FP2 ? L4 Itemset Supp fABCEg 2/5 FC4 Itemset Supp fABCEg 2/5 4 ! FP4 ? FP= S FPi Itemset Closure Supp
fAg fACg 3/5 fBg fBEg 4/5 fEg fBEg 4/5 5 ! DG Rule A)C B)E E)B
Figure3: GeneratingDuquenne-GuiguesBasis forExa tAsso iationRules.
5.2 Generating Proper Basis for Approximate Asso iation Rules
Thepseudo- odegeneratingtheproperbasisforapproximateasso iationrulesispresentedinAlgorithm 4. Notationsaregivenin Table10. Thealgorithm takesas inputthesetsFC
i
, 1ik, ontainingthe frequent losed non-emptyitemsets andtheir support. Theoutput ofthealgorithm isthe properbasis forapproximateasso iationrulesPB.
ThesetPB isrstinitializedto theemptyset(step1). Then,thealgorithmiteratively onsidersall frequent loseditemsets l2FC
i
for2ik. Itdetermineswhi hfrequent loseditemsets l 0
2FC j<i are subsetsofl andgeneratesasso iationruleswith theform l
0
!l l 0
that havesuÆ ient onden e (steps 2to 12)as follows. Duringthe i
th
iteration, ea h itemset l in FC i
is onsidered (steps 3to 11). Forea hsetFC
j
,1j<i,asetS j
ontainingallfrequent losedj-itemsetsinFC j
thataresubsetsofl is reated(step 5). Then, forea h ofthese subsetsl
0
i S
j
Setofj-itemsetsthat aresubsetsofthe onsidereditemset. PB Properbasisforapproximateasso iationrules.
Table10: Notations.
theproperapproximateasso iationruler:l 0
!l l 0
(step 7). Ifthe onden eofr issuÆ ientthenr isinsertedinPB (step8). Attheendofthealgorithm,thesetPB ontainsallrulesoftheproperbasis forapproximateasso iationrules.
Algorithm4GeneratingProperBasisforApproximateAsso iationRules. 1) PB fg
2) for(i 2;ik;i++)dobegin 3) forallitemsetsl2FC i dobegin 4) for(j i 1;j>0;j--)dobegin 5) S j Subsets(FC j ;l ); 6) forallitemsetsl 0 2S j dobegin 7) onf(r) l .support/l 0 .support;
8) if ( onf(r)min onf)thenPB PB[fr:l 0 !l l 0 ;l .support, onf(r)g; 9) end 10) end 11) end 12) end
Subset fun tion The subset fun tion takesa set X of itemsets and an itemset y as arguments. It determinesallitemsetsx2X thataresubsetsofy. Inalgorithmimplementation,frequentandfrequent loseditemsetsarestoredinaprex-treestru ture[24℄inordertoimproveeÆ ien yofthesubsetsear h. Corre tness The orre tnessofthealgorithmreliesonthefa tthatweinspe tallproperapproximate asso iationrulesholdinginthedataset. Forea hfrequent loseditemset,thealgorithm omputes,among itssubsets,allotherfrequent loseditemsets. Then,thegenerationofallrulesbetweentwofrequent losed itemsetshavingsuÆ ient onden eisensured. Theserulesareallproperapproximateasso iationrules holding in the dataset, and the resulting set PB is the proper basis forapproximate asso iation rules denedin Theorem3.
Example9 Figure4showsthegeneration oftheproperbasisforapproximate asso iationrulesin the ontextDforaminimumsupportof2/5andaminimum onden eof1/2.
5.3 Generating Stru tural Basis for Approximate Asso iation Rules
Thepseudo- odegeneratingthestru turalbasisforapproximateasso iationrulesisgiveninAlgorithm 5. Notationsare givenin Table11. Thealgorithm takesas inputthe sets FC
i
, 1i k, of frequent losednon-emptyitemsetsandtheirsupport. Itgeneratesthestru turalbasisforapproximateasso iation rulesSB representedby themaximal onden e spanning forestF
FC asso iated withFC= S i=k i=1 FC i (withouttheemptyitemset).
FC i
Setoffrequent losedi-itemsetsandtheirsupport. S
j
Setofj-itemsetsthat aresubsetsoftheitemset onsidered. CR Setof andidateapproximateasso iationrules.
FC 2 Itemset Supp fACg 3/5 fBEg 4/5 FC1 Itemset Supp fCg 4/5 1 ! PB
Rule Supp Conf C!A 3/5 3/4 3 Itemset Supp fBCEg 3/5 FC 2 [FC 1 Itemset Supp fACg 3/5 fBEg 4/5 fCg 4/5 2 ! PB
Rule Supp Conf C!A 3/5 3/4 C!BE 3/5 3/4 BE!C 3/5 3/4 FC 4 Itemset Supp fABCEg 2/5 FC 3 [FC 2 [FC 1 Itemset Supp fBCEg 3/5 fACg 3/5 fBEg 4/5 fCg 4/5 3 ! PB
Rule Supp Conf C!A 3/5 3/4 C!BE 3/5 3/4 BE!C 3/5 3/4 C!ABE 2/5 2/4 AC!BE 2/5 2/3 BE!AC 2/5 2/4 BCE!A 2/5 2/3
Figure4: GeneratingProperBasisforApproximate Asso iationRules.
Theset SB isrstinitializedto theemptyset(step 1). Then,thealgorithmiteratively onsidersall frequent loseditemsets l2FC
i
for2ik. Itdetermineswhi hfrequent loseditemsets l 0
2FC j<i are overedbyl,i.e. aredire tprede essorsofl,andthengeneratesthemaximal onden easso iation ruleswiththeforml!l
0
l thathold(steps2to25). Duringthei th
iteration,ea hitemsetlin FC i
is onsidered(steps3to24)asfollows. ThesetCRof andidateasso iationruleswithlin the onsequent is initializedto the empty set (step 4). For 1j <i, sets S
j
ontainingallfrequent losedj-itemsets in FC
j
that are subsets of l are reated (steps 5 to 7). Then, all these subsets of l are onsidered in de reasingorder of their sizes (steps 8 to 18). For ea h of these subsets l
0 2S
j
, the onden e of the properapproximateasso iationruler:l
0 !l l
0
is omputed(step10). Ifthe onden eofrissuÆ ient, risinsertedinCR(step12)andallsubsetsl
00 ofl
0
areremovedfromS n<j
(steps13to15). Thisbe ause rules with the form l
00 ! l l 00 with l 00 2 S n<j
are transitiveproperapproximate rules. Finally, the andidateproperapproximateruleswithlinthe onsequentthatareinCRarepruned(steps19to23): the maximum onden e value max onf of rulesin CR isdetermined (step 20) and therst rulewith su h a onden e is insertedin SB (steps 21 and 22). At the end of the algorithm, the set SB thus ontainsallrulesin thestru turalbasisforapproximateasso iationrules.
Corre tness The algorithm onsiders allasso iationrules l 0
! l l 0
with onden e min onf be-tweentwofrequent losed itemsets l and l
0
where l oversl 0
. These rules are allpropernon-transitive approximateasso iationrulesthatholdand anberepresentedbytheedgesofthegraphG
FC
(Denition 8) without transitiveedges. Moreover, among all rules with the form X! l X (generated from l), wekeeponlythe rst one with onden e equalto the maximal onden e of rules X! l X. Only preservingthisruleisequivalenttothe y leremovinginthegraphG
FC
inthesamemannerasexplained inDenition9. Then,theresultingsetSB anberepresentedasthemaximal onden espanningforest F
FC
withoutedgesfromtheemptyitemset. SB ontainsallrulesinthestru turalbasisforapproximate asso iationrulesdened inTheorem4.
Example10 Figure5depi tsthegenerationofthestru turalbasisforapproximateasso iationrulesin the ontextDforaminimumsupportof 2/5andaminimum onden eof 1/2.
1) SB fg;
2) for(i 2;ik;i++)dobegin 3) forallitemsetsl2FCi dobegin 4) CR fg; 5) for(j i 1;j>0;j--)dobegin 6) Sj Subsets(FCj;l ); 7) end 8) for(j i 1;j>0;j--)dobegin 9) forallitemsetsl 0 2S j dobegin 10) onf(r) l .support/l 0 .support; 11) if ( onf(r)min onf)thendo begin 12) CR CR[fr:l 0 !l l 0 ;l .support, onf(r)g; 13) for(n j 1;n>0;n--)dobegin 14) S n S n Subsets(S n ;l 0 ); 15) end 16) end 17) end 18) end 19) if (CR6=fg)then dobegin 20) max onf Maxr2CR( onf(r)); 21) nd rstfr2CRj onf(r)=max onfg; 22) SB SB[frg; 23) end 24) end 25) end FC 2 Itemset Supp fACg 3/5 fBEg 4/5 1 ! SB Rule Conf C!A 3/4 FC 3 Itemset Supp fBCEg 3/5 2 ! SB Rule Conf C!A 3/4 BE!C 3/4 FC 4 Itemset Supp fABCEg 2/5 3 ! SB Rule Conf C!A 3/4 BE!C 3/4 AC!BE 2/3
Figure5: GeneratingStru turalBasisforApproximateAsso iationRules.
6 Experimental Results
Experiments were performed on a Pentium II PC with a 350 Mhz lo k rate, 128 MBytes of RAM, running the Linux operating system. Algorithms were implemented in C++. Chara teristi s of the datasetsusedaregivenin Table12. ThesedatasetsaretheT10I4D100K
4
syntheti datasetthatmimi s marketbasketdata,theC20D10KandtheC73D10K ensusdatasetsfromthePUMSsamplele
5 ,and the Mushrooms
6
dataset des ribing mushroom hara teristi s. In all experiments, we attempted to hoosesigni antminimumsupportand onden ethresholdvalues: weobservedthresholdvaluesused inotherpapersforexperimentsonsimilardatatypesandinspe tedrulesextra tedinthebases.
4
http://www.almaden.ibm. om/ s/quest/syndata.html 5
ftp://ftp2. .ukans.edu/pub/ippbr/ ensus/pums/pums90ks.zip 6
T10I4D100K 100,000 10 1, 000 Mushrooms 8, 416 23 127
C20D10K 10, 000 20 386
C73D10K 10, 000 73 2, 177 Table12: Datasets.
6.1 Relative Performan e of Apriori and Apriori-Close
We ondu ted experimentsto ompareresponse timesobtained with Apriori andApriori-Close on the four datasets. Results for the T10I4D100Kand Mushrooms datasets are presented in Table 13. We anobservethatexe utiontimesareidenti alforthetwoalgorithms: addingthefrequent loseditemset derivationtothefrequentitemsetdis overydoesnotindu eadditional omputationtime. Similarresults wereobtainedforC20D10KandC73D10Kdatasets.
Minsupp Apriori Apriori-Close 2.0% 1. 99s 1. 97s 1.0% 3. 47s 3. 46s 0.5% 9. 62s 9. 70s 0.25% 15. 02s 14. 92s
Minsupp Apriori Apriori-Close 90% 0. 28s 0. 28s 70% 0. 73s 0. 73s 50% 2. 40s 2. 70s 30% 18. 22s 17. 93s
T10I4D100K Mushrooms
Table13: Exe utionTimesofAprioriandApriori-Close.
6.2 Number of Rules and Exe ution Times of the Rule Generation
Table14showsthetotalnumberofexa tasso iationrules andtheirnumberin theDuquenne-Guigues basisforexa trules. Table15showsthetotalnumberofapproximateasso iationrules,theirnumberin theproperbasisandinthestru turalbasisforapproximaterules,andthenumberofnon-transitiverules in theproperbasisforapproximaterules(5
th
olumn). Forexamplein the ontextD, rulesC!A and AC!BEareextra ted,aswellastheruleC!ABEwhi his learlytransitive. Sin eby onstru tion, its onden e{retrievedbymultiplyingthe onden esofthetwoformer{islessthantheirs,thisruleis thelessinterestingamongthethree. Redu ingtheextra tiontonon-transitiverulesintheproperbasis forapproximaterules analsobeinteresting. Su hrulesaregeneratedbyavariantofAlgorithm5with thelastpruningstrategy(steps20and21)removed: all andidaterulesinCRareinsertedinSB.
Table16 showsforthe fourdatasets theaverage relativesize ofbases ompared withthe setsof all rulesobtained. Inthe aseofweakly orrelateddata(T10I4D100K), noexa truleisgenerated andthe properbasisforapproximate rules ontainsallapproximate rulesthathold. Thereasonisthat, in su h data, all frequent itemsets are frequent losed itemsets. In the aseof orrelated data (Mushrooms, C20D10KandC73D10K),thenumberofextra tedrulesinbasesismu hsmallerthanthetotalnumber ofrulesthat hold.
Figure6showsforea hdatasettheexe utiontimesofthe omputationofallrules(usingthealgorithm des ribed in [2℄)andbases. Exe utiontimesofthe derivationof theDuquenne-Guigues basisforexa t rulesandtheproperbasisfornon-transitiveapproximaterulesarenotpresentedsin etheyareidenti al to those of the derivation of the Duquenne-Guigues basis for exa t rules and the stru tural basis for approximate rules(Duquenne-Guigues andstru turalbases).
7 Con lusion
T10I4D100K 0.5% 0 0 Mushrooms 30% 7, 476 69 C20D10K 50% 2, 277 11 C73D10K 90% 52, 035 15
Table14: NumberofExa tAsso iationRulesExtra ted.
Dataset Min onf Approximate Proper Non-transitive Stru tural (Minsupp) rules basis basis basis
90% 16, 260 16, 260 3, 511 916 T10I4D100K 70% 20, 419 20, 419 4, 004 1, 058 (0.5%) 50% 21, 686 21, 686 4, 191 1, 140 30% 22, 952 22, 952 4, 519 1, 367 90% 12, 911 806 563 313 Mushrooms 70% 37, 671 2, 454 968 384 (30%) 50% 56, 703 3, 870 1, 169 410 30% 71, 412 5, 727 1, 260 424 90% 36, 012 4, 008 1, 379 443 C20D10K 70% 89, 601 10, 005 1, 948 455 (50%) 50% 116, 791 13, 179 1, 948 455 30% 116, 791 13, 179 1, 948 455 95% 1,606,726 23, 084 4, 052 939 C73D10K 90% 2,053,896 32, 644 4, 089 941 (90%) 85% 2,053,936 32, 646 4, 089 941 80% 2,053,936 32, 646 4, 089 941 Table15: NumberofApproximateAsso iationRulesExtra ted. Dataset Duquenne-Guigues Proper Non-transitive Stru tural
basis basis basis basis T10I4D100K - 100. 00% 20. 05% 5.49% Mushrooms 0.92% 6.90% 2. 69% 1.19% C20D10K 0.48% 11. 21% 2. 33% 0.63% C73D10K 0.03% 1.55% 0. 21% 0.05%
Table16: AverageRelativeSizeofBases.
information. Moreover,itssizeissigni antlyredu ed omparedwiththesetofallpossiblerulesbe ause redundant,and thus useless,rules aredis arded. Ourapproa hhasatwofoldadvantage: on onehand, the useris provided with a smaller set of resulting rules, easier to handle, and vehi uling information ofimprovedquality. Onthe otherhand,exe utiontimes areredu ed ompared withthe dis overingof all asso iation rules. Su h results are proved (in the groundwork of latti e theory) and illustrated by experiments,a hievedfromreal-lifedatasets.
Integrating redu tion methods Templates,as denedin [3,16℄, an dire tlybeusedforextra ting from the bases all asso iation rules mat hing someuser spe ied patterns. Information in taxonomies asso iated withthe dataset an also be integrated in thepro ess as proposed in [14, 28℄for extra ting basesforgeneralized(multi-level)asso iationrules. Integratingitem onstraintsandstatisti almeasures, su hasdes ribedin [5,22,29℄and[7,25℄respe tively,inthegenerationofbasesrequiresfurtherwork. Fun tional and approximate dependen ies Algorithms presented in this paper an be adapted to generatebases for fun tional and approximate dependen ies. In[15, 20℄, su h bases and algorithms forgeneratingthem were proposed. However,theDuquenne-Guigues basisissmallerthanthebasisfor fun tionaldependen ies onstitutedofminimal non-trivialfun tionaldependen ies. Hen e,thenumber
0
0.2
0.4
0.6
0.8
1
1.2
1.4
30
40
50
60
70
80
90
Time (s)
Minimum confidence (%)
All rules
Duquenne-Guigues and proper bases
Duquenne-Guigues and structural bases
T10I4D100K
0
0.5
1
1.5
2
2.5
30
40
50
60
70
80
90
Time (s)
Minimum confidence (%)
All rules
Duquenne-Guigues and proper bases
Duquenne-Guigues and structural bases
Mushrooms
0
0.5
1
1.5
2
2.5
3
3.5
30
40
50
60
70
80
90
Time (s)
Minimum confidence (%)
All rules
Duquenne-Guigues and proper bases
Duquenne-Guigues and structural bases
C20D10K
0
10
20
30
40
50
60
80
85
90
95
Time (s)
Minimum confidence (%)
All rules
Duquenne-Guigues and proper bases
Duquenne-Guigues and structural bases
C73D10K Figure6: Exe utionTimesoftheAsso iationRuleDerivation.
maximal onsequent [10, 13℄. Furthermore, the proper and stru tural bases for approximate rules are alsosmallerthanthebasisforapproximatedependen iesdenedin[15℄. Adaptingouralgorithmstothe dis overyoffun tionalandapproximatedependen iesisanongoingresear h.
A knowledgements
The authors would like to gratefully a knowledge Rosine Ci hetti and Mohand-Said Ha id for their onstru tive omments.
Referen es
[1℄ R. Agrawal, T. Imielinski, and A. Swami. Miningasso iationrules between sets of items in large databases. Pro . of the ACMSIGMODConferen e,pages207{216,May1993.
[2℄ R. AgrawalandR. Srikant. Fast algorithms forminingasso iationrules. Pro . of the 20th VLDB Conferen e,pages478{499,June1994. ExpandedversioninIBMResear hReportRJ9839. [3℄ E. Baralisand G. Psaila. Designingtemplates for miningasso iationrules. Journal of Intelligent
Information Systems,9(1):7{32,July1997.
[4℄ R.J.Bayardo. EÆ ientlymininglongpatternsfrom databases. Pro . of the ACM SIGMOD Con-feren e,pages85{93,June1998.
1967. Third edition.
[7℄ S.Brin, R.Motwani, andC.Silverstein. Beyondmarketbaskets: Generalizingasso iationrules to orrelation. Pro . of the ACMSIGMODConferen e,pages265{276,May1997.
[8℄ S. Brin,R. Motwani, J.D. Ullman, and S. Tsur. Dynami itemset ountingand impli ationrules formarketbasket data. Pro . ofthe ACMSIGMOD Conferen e,pages255{264,May1997. [9℄ P.Burmeister. Formal on eptanalysiswithConImp: Introdu tiontothebasi features. Te hni al
report,Te hnis heHo hs huleDarmstadt,Germany,1998.
[10℄ J.Demetrovi s,L. Libkin, and I.B.Mu hnik. Fun tionaldependen ies in relationaldatabases: A latti epointofview. Dis reteAppliedMathemati s,40:155{185,1992.
[11℄ V.DuquenneandJ.-L.Guigues.Familleminimaled'impli ationinformativesresultantd'untableau dedonneesbinaires. Mathematiques etS ien esHumaines,24(95):5{18,1986.
[12℄ B.Ganter and K. Reuter. Finding all losed sets: A generalapproa h. In Order, pages 283{290. KluwerA ademi Publishers,1991.
[13℄ B.Ganter andR.Wille. Formal Con ept Analysis: Mathemati al Foundations. Springer,1998. [14℄ J.Han andY. Fu. Dis overyof multiple-levelasso iation rulesfrom large databases. Pro . of the
21stVLDB Conferen e,pages420{431,September1995.
[15℄ Y.Huhtala,J.Karkkanen,P.Porkka,andH.Toivonen.EÆ ientdis overyoffun tionaland approx-imatedependen ies using partitions. Pro . of the 14th ICDE Conferen e, pages 392{401,Febuary 1998.
[16℄ M.Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen,and A. I.Verkamo. Finding interesting rules from large sets of dis overed asso iation rules. Pro . of the 3rd CIKM Conferen e, pages 401{407,November1994.
[17℄ D. Linand Z. M. Kedem. Pin er-sear h: A newalgorithm for dis overingthe maximum frequent set. Pro . of the6th EDBT Conferen e,pages105{119,Mar h1998.
[18℄ B.Liu,W. Hsu,andS. Chen. Using generalimpressionsto analyse dis overed lassi ationrules. Pro . ofthe 3rdKDD Conferen e,pages31{36,August1997.
[19℄ M.Luxenburger.Impli ationspartiellesdansun ontexte.Mathematiques,InformatiqueetS ien es Humaines,29(113):35{55,1991.
[20℄ H.MannilaandK.J.Raiha. Algorithmsforinferringfun tionaldependen iesfromrelations. Data &Knowledge Engineering,12(1):83{99,Febuary1994.
[21℄ R.Meo,G.Psaila,andS.Ceri. A newSQL-likeoperatorforminingasso iationrules. Pro . of the 22ndVLDB Conferen e,pages122{133,September1996.
[22℄ R.T.Ng,V. S.Lakshmanan,J.Han,andA.Pang. Exploratoryminingandpruningoptimizations of onstrainedasso iationrules. Pro .of the ACMSIGMODConferen e,pages13{24,June1998. [23℄ N.Pasquier,Y.Bastide,R.Taouil,andL.Lakhal.Dis overingfrequent loseditemsetsforasso iation
rules. Pro .of the 7thICDT Conferen e,pages398{416,January1999.
[24℄ N.Pasquier,Y. Bastide,R.Taouil,andL. Lakhal. EÆ ientminingofasso iationrulesusing losed itemsetlatti es. Information Systems,24(1):25{46,1999.
largesdatabases. Pro . ofthe 21st VLDB Conferen e,pages432{444,September1995.
[27℄ A. Silbers hatzandA. Tuzhilin. Whatmakespatterns interestinginknowledgedis overysystems. IEEETransa tionson Knowledge andData Engineering,8(6):970{974,De ember1996.
[28℄ R.SrikantandR.Agrawal.Mininggeneralizedasso iationrules.Pro .ofthe21stVLDBConferen e, pages407{419,September1995.
[29℄ R.Srikant,Q.Vu,andR.Agrawal. Miningasso iationruleswithitem onstraints. Pro . ofthe 3rd KDDConferen e,pages67{73,August1997.
[30℄ H. Toivonen. Samplinglargedatabasesfor asso iationrules. Pro . ofthe 22ndVLDB Conferen e, pages134{145,September1996.
[31℄ R. Wille. Con ept latti es and on eptualknowledge systems. Computers and Mathemati s with Appli ations,23:493{515,1992.
[32℄ M.J. ZakiandM. Ogihara. Theoreti al foundationsof asso iationrules. 3rdSIGMOD Workshop onResear h IssuesinData Mining andKnowledge Dis overy,June1998.
[33℄ M.J.Zaki,S.Parthasarathy,M.Ogihara,andW.Li.Newalgorithmsforfastdis overyofasso iation rules. Pro . ofthe 3rdKDD Conferen e,pages283{286,August1997.