HAL Id: hal-00467752
https://hal.archives-ouvertes.fr/hal-00467752
Submitted on 26 Apr 2010
Mining association rules using formal concept analysis
Nicolas Pasquier
To cite this version:
Nicolas Pasquier. Mining association rules using formal concept analysis. ICCS'2000 International Conference on Conceptual Structures, Aug 2000, Darmstadt, Germany. pp.259-264. ⟨hal-00467752⟩
LIMOS, Université Blaise Pascal - Clermont-Ferrand II, 24 Avenue des Landais, F-63177 Aubière, France
pasquier@libd2.univ-bpclermont.fr
Abstract. In this paper, we give an overview of the use of Formal Concept Analysis in the framework of association rule extraction. Using frequent closed itemsets and their generators, which are defined using the Galois closure operator, we address two major problems: the response times of association rule extraction and the relevance and usefulness of discovered association rules. We quickly review the Close and the A-Close algorithms for extracting frequent closed itemsets using their generators, which reduce the response times of the extraction, especially in the case of correlated data. We also present definitions of the generic and informative bases for association rules, whose generation improves the relevance and usefulness of discovered association rules.
1 Introduction
Data mining has been extensively addressed in recent years as the computational part of Knowledge Discovery in Databases (KDD), especially the problem of discovering association rules. Its aim is to exhibit relationships between itemsets (sets of binary attributes) in large databases. An example of an association rule, fitting in the context of market basket data analysis, is "cereal $\wedge$ milk $\rightarrow$ sugar (support 10%, confidence 60%)", stating that 60% of customers who buy cereals and milk also buy sugar and that 10% of all customers buy all three items. When an association rule has support and confidence exceeding some user-defined minimum support and minimum confidence thresholds, the rule is considered relevant for supporting decision making [AIS93]. Association rules have been successfully applied in a wide range of domains, among which marketing decision support, diagnosis and medical research support, telecommunication process improvement, web site management and access, and the analysis of multimedia, spatial, geographical and statistical data.
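As an illustration of these two measures, the following minimal Python sketch computes the support and the confidence of the example rule on a hypothetical set of 30 baskets. The basket contents and the helper names `support` and `confidence` are assumptions chosen so that the figures match the example; they do not come from a real dataset.

```python
from fractions import Fraction

# Hypothetical baskets, built so that the rule matches the example figures.
baskets = (
    [{"cereal", "milk", "sugar"}] * 3   # customers buying all three items
    + [{"cereal", "milk"}] * 2          # customers buying cereal and milk only
    + [{"bread"}] * 25                  # customers buying none of the three items
)

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(antecedent, consequent, transactions):
    """support(antecedent | consequent) divided by support(antecedent)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

rule_support = support({"cereal", "milk", "sugar"}, baskets)
rule_confidence = confidence({"cereal", "milk"}, {"sugar"}, baskets)
print(rule_support, rule_confidence)  # 1/10 3/5
```

Exact fractions are used instead of floating point numbers so that the comparison with the thresholds is not affected by rounding.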
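The first phase of the association rule extraction consists in selecting useful data from the database and transforming it into a data mining context. This context is a triplet $\mathcal{D} = (\mathcal{O}, \mathcal{I}, \mathcal{R})$, where $\mathcal{O}$ and $\mathcal{I}$ are finite sets of objects and items respectively, and $\mathcal{R} \subseteq \mathcal{O} \times \mathcal{I}$ is a binary relation. Each couple $(o, i) \in \mathcal{R}$ denotes the fact that the object $o \in \mathcal{O}$ is related to the item $i \in \mathcal{I}$. Two major problems for the association rule extraction give place to interesting research topics: the problem of the response times of the extraction and the problem of the relevance and usefulness of the extracted association rules.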
Existing approaches for mining association rules are based on the following decomposition of the problem: the extraction of frequent itemsets¹ and their supports from the context and then the generation of all valid association rules². The first phase is the most computationally intensive part of the process, since the number of potential frequent itemsets is exponential in the size of the set of items and several database passes are required. Two approaches have been proposed: levelwise algorithms for extracting frequent itemsets and algorithms for extracting maximal frequent itemsets. These algorithms give acceptable response times when mining association rules from weakly correlated data, such as market basket data, but their performance drastically decreases when they are applied to correlated data, such as statistical or medical data for instance. We recall these two approaches and then present our approach, which is based on Formal Concept Analysis [GW99].
2.1 Levelwise algorithms for extracting frequent itemsets
These algorithms consider during each iteration a set of itemsets of a given size, i.e., a set of itemsets of a "level" of the itemset lattice. In order to limit the number of candidate itemsets considered, these algorithms are based on the following properties: all the supersets of an infrequent itemset are infrequent and all the subsets of a frequent itemset are frequent [AS94,MTV94]. Using these properties, the candidate $k$-itemsets (itemsets of size $k$) of the $k$-th iteration are generated by joining two frequent $(k-1)$-itemsets discovered during the preceding iteration. The Apriori [AS94] and OCD [MTV94] algorithms carry out a number of scans of the context equal to the size of the largest frequent itemsets. The Partition [SON95] algorithm allows the parallelization of the extraction process, and the DIC [BMUT97] algorithm reduces the number of context scans by considering itemsets of different sizes during each iteration. The Partition and DIC algorithms involve additional costs in CPU time compared to the Apriori and OCD algorithms, due to the increase in the number of candidate itemsets tested.
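The levelwise principle can be sketched as follows. This is a simplified illustration in the spirit of Apriori, not the published algorithm: the function name `levelwise_frequent_itemsets`, the five-object toy context and the use of an absolute support count `min_count` are assumptions made for the example.

```python
from itertools import combinations

def levelwise_frequent_itemsets(transactions, min_count):
    """Return {frequent itemset: support count}, exploring one level per pass."""
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]      # candidate 1-itemsets
    frequent = {}
    k = 1
    while candidates:
        # One scan of the context per level: count each candidate k-itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Join step: merge two frequent k-itemsets sharing their k-1 first items.
        keys = sorted(tuple(sorted(c)) for c in level)
        next_candidates = set()
        for a, b in combinations(keys, 2):
            if a[:-1] != b[:-1]:
                continue
            cand = frozenset(a) | frozenset(b)
            # Prune step: every k-subset of a candidate must be frequent.
            if all(frozenset(s) in level for s in combinations(sorted(cand), k)):
                next_candidates.add(cand)
        candidates = list(next_candidates)
        k += 1
    return frequent

# Hypothetical toy context: five objects over the items A, B, C, D, E.
toy = [set("ACD"), set("BCE"), set("ABCE"), set("BE"), set("ABCE")]
frequent = levelwise_frequent_itemsets(toy, min_count=2)
print(len(frequent), frequent[frozenset("ABCE")])  # 15 2
```

On this toy context the largest frequent itemset has size 4, so four levels yield frequent itemsets, in line with the number of scans stated above for Apriori and OCD.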
2.2 Algorithms for extracting maximal frequent itemsets
These algorithms are based on the property that the maximal frequent itemsets, i.e., the frequent itemsets of which all the supersets are infrequent, form a border under which all itemsets are frequent. The extraction of the maximal frequent itemsets is carried out by an iterative browsing of the itemset lattice that "advances" by one level from the bottom upwards and by one or more levels from the top downwards during each iteration. Using the maximal frequent itemsets, the frequent itemsets are derived and their supports are computed by performing one final scan of the context. Four algorithms based on this approach were proposed; they are the Pincer-Search [LK98], MaxClique and MaxEclat [ZPOL97], and Max-Miner [Bay98] algorithms. These algorithms reduce the number of iterations, and thus decrease the number of context scans and the number of CPU operations carried out, compared to levelwise algorithms for extracting frequent itemsets.

¹ An itemset is frequent if its support is greater than or equal to the minimal support threshold.
² An association rule is valid if its support and confidence are at least equal to the minimal support and the minimal confidence thresholds.
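The border property of the maximal frequent itemsets can be checked on a small example. The sketch below is a brute-force illustration only and does not implement any of the four cited algorithms; the toy context and the threshold `MIN_COUNT` are hypothetical.

```python
from itertools import combinations

# Hypothetical toy context: five objects over the items A, B, C, D, E.
transactions = [set("ACD"), set("BCE"), set("ABCE"), set("BE"), set("ABCE")]
MIN_COUNT = 3  # assumed minimal support, as an absolute number of objects

items = sorted(set().union(*transactions))

def count(itemset):
    """Number of objects containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

# Brute force: enumerate every non-empty itemset (only viable for tiny contexts).
frequent = [frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if count(frozenset(c)) >= MIN_COUNT]

# A frequent itemset is maximal when no proper superset of it is frequent.
maximal = [f for f in frequent if not any(f < g for g in frequent)]

# Border property: every frequent itemset lies under some maximal one.
under_border = all(any(f <= m for m in maximal) for f in frequent)
print(sorted("".join(sorted(m)) for m in maximal), under_border)
# ['AC', 'BCE'] True
```

The two maximal itemsets are enough to enumerate the nine frequent itemsets of this context, but not their supports, which is why a final scan of the context is needed.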
2.3 Algorithms for extracting frequent closed itemsets
In contrast to the two previous approaches, our approach [PBTL99a] is based on Formal Concept Analysis. The closure operator $\gamma$ of the Galois connection [GW99] is the composition of the application $\phi$, that associates with $O \subseteq \mathcal{O}$ the items common to all objects $o \in O$, and the application $\psi$, that associates with an itemset $I \subseteq \mathcal{I}$ the objects related to all items $i \in I$ (the objects "containing" $I$). The closure operator $\gamma = \phi \circ \psi$ associates with an itemset $I$ the maximal set of items common to all the objects containing $I$, i.e., the intersection of these objects. Using this closure operator, the frequent closed itemsets are defined.
Definition 1 (Frequent closed itemsets). A frequent itemset $I \subseteq \mathcal{I}$ is a frequent closed itemset iff $\gamma(I) = I$.
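The two applications of the Galois connection and their composition can be sketched directly on a small context. The function names `phi`, `psi` and `gamma` are labels chosen here for the two applications and the closure operator, and the context is a hypothetical toy example.

```python
# Hypothetical toy context: object identifier -> set of related items.
context = {
    1: {"A", "C", "D"},
    2: {"B", "C", "E"},
    3: {"A", "B", "C", "E"},
    4: {"B", "E"},
    5: {"A", "B", "C", "E"},
}
ALL_ITEMS = frozenset().union(*context.values())

def phi(objects):
    """Items common to all the given objects."""
    common = set(ALL_ITEMS)
    for o in objects:
        common &= context[o]
    return frozenset(common)

def psi(itemset):
    """Objects related to all items of the itemset."""
    return frozenset(o for o, related in context.items() if itemset <= related)

def gamma(itemset):
    """Closure: intersection of all the objects containing the itemset."""
    return phi(psi(frozenset(itemset)))

def is_frequent_closed(itemset, min_count):
    """Definition 1, with the support given as an absolute object count."""
    itemset = frozenset(itemset)
    return len(psi(itemset)) >= min_count and gamma(itemset) == itemset

print(sorted(gamma({"A"})), sorted(gamma({"B", "C"})))  # ['A', 'C'] ['B', 'C', 'E']
```

Here {A} is not closed because every object containing A also contains C, whereas {A, C} is closed and has the same support as {A}.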
The frequent closed itemsets constitute, together with their supports, a generating set for all frequent itemsets and their supports and thus for all association rules, their supports and their confidences [PBTL99a]. This property relies on the properties that the support of a frequent itemset is equal to the support of its closure and that the maximal frequent itemsets are maximal frequent closed itemsets. Two efficient levelwise algorithms, called Close [PBTL99a] and A-Close [PBTL99b], for extracting frequent closed itemsets from large databases were proposed. In order to improve the efficiency of the extraction, the Close and the A-Close algorithms consider the generator itemsets of the frequent closed itemsets.
Definition 2 (Generator itemsets). An itemset $G \subseteq \mathcal{I}$ is a generator of a closed itemset $I$ iff $\gamma(G) = I$ and $\nexists G' \subseteq \mathcal{I}$ with $G' \subset G$ such that $\gamma(G') = I$.
Close and A-Close perform a breadth-first search for the (frequent) generators of the frequent closed itemsets in a levelwise manner. During an iteration $k$, the Close algorithm considers a set of candidate generators of size $k$, determines their supports and their closures, and then deletes all infrequent generators. The supports and the closures of the candidate $k$-generators are computed by performing one database pass and, for each generator $G$, intersecting all the objects containing $G$ (their number gives the support of $G$). During the $(k+1)$-th iteration, the candidate $(k+1)$-generators are constructed by joining two frequent $k$-generators if their $k-1$ first items are identical, and the candidate $(k+1)$-generators [...]. In the A-Close algorithm, the generators are identified according to their supports only, since the support of a generator itemset is different from the supports of all its subsets, and one more database pass is performed at the end of the algorithm for computing the closures of all frequent generators discovered. Both algorithms initialize at the beginning the set of candidate 1-generators with the list of all itemsets of size 1. Experimental results show that these algorithms are particularly efficient for mining association rules from dense or correlated data, which represent an important part of real-life databases. On such data, Close outperforms A-Close, and they both clearly outperform algorithms for extracting frequent itemsets, whereas for weakly correlated data, A-Close outperforms Close and is in the range of the algorithms described in section 2.1.
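The levelwise search for generators can be sketched as follows. This is a simplified illustration in the spirit of Close, not the published algorithm. In particular, the pruning rule used here (discarding a candidate that is included in the closure of one of its $k$-subsets, since it then has the same closure as that subset) is an assumption, as are the name `close_sketch`, the toy context and the absolute threshold `min_count`.

```python
from itertools import combinations

def close_sketch(transactions, min_count):
    """Return ({generator: (support, closure)}, {closed itemset: support})."""
    all_items = frozenset().union(*transactions)

    def support_and_closure(g):
        # Intersect all the objects containing g; their number is the support.
        covering = [t for t in transactions if g <= t]
        return len(covering), all_items.intersection(*covering)

    candidates = [frozenset([i]) for i in sorted(all_items)]  # all 1-itemsets
    generators = {}
    k = 1
    while candidates:
        level = {}
        for g in candidates:
            sup, clo = support_and_closure(g)
            if sup >= min_count:            # delete infrequent generators
                level[g] = (sup, clo)
        generators.update(level)
        # Join two frequent k-generators whose k-1 first items are identical.
        keys = sorted(tuple(sorted(g)) for g in level)
        candidates = []
        for a, b in combinations(keys, 2):
            if a[:-1] != b[:-1]:
                continue
            cand = frozenset(a) | frozenset(b)
            subsets = [frozenset(s) for s in combinations(sorted(cand), k)]
            if any(s not in level for s in subsets):
                continue                    # a k-subset is not a frequent generator
            if any(cand <= level[s][1] for s in subsets):
                continue                    # assumed pruning: same closure as a subset
            candidates.append(cand)
        k += 1
    closed = {clo: sup for sup, clo in generators.values()}
    return generators, closed

# Hypothetical toy context: five objects over the items A, B, C, D, E.
toy = [set("ACD"), set("BCE"), set("ABCE"), set("BE"), set("ABCE")]
generators, closed = close_sketch(toy, min_count=2)
print(len(generators), len(closed))  # 8 5
```

On this toy context only two levels are explored, while a levelwise search over all frequent itemsets needs four, which illustrates why considering generators helps on correlated data.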
3 Relevance of extracted association rules
The problem of the usefulness and the relevance of discovered association rules is related to the huge number of rules extracted and the presence of many redundancies among them for many datasets, especially for correlated data. Several approaches for solving this problem have been proposed. We first quickly review these approaches and then present the approach we propose, which consists in generating non-redundant association rules with minimal antecedents and maximal consequents using Formal Concept Analysis.
3.1 Previous work
The use of statistical measures other than confidence, such as conviction, Pearson's correlation or the $\chi^2$ test, to compute the precision of rules is proposed in [BMS97,SBM98]. Generalized association rules, that are rules between itemsets that belong to different levels of a taxonomy of the items, are defined in [HF95,SA95]. In [Hec96,ST96], deviation measures, i.e., measures of distance between association rules used for pruning similar ones, are defined using support and confidence. Item constraints [BAG99,NLHP98] are boolean expressions that allow the user to specify the form of association rules that will be selected. In [BG99], A-maximal rules, that are rules for which the population of objects concerned is reduced when an item is added to the antecedent, are defined. In [PBTL99c], the Duquenne-Guigues basis for global implications [DG86,GW99] and the Luxenburger basis for partial implications [Lux91] are adapted to the association rules framework. These bases are minimal with respect to the number of rules extracted, but they are not made up of the most informative association rules, which are the non-redundant rules with minimal antecedents and maximal consequents, called minimal non-redundant association rules. We believe that these rules are the most relevant and useful from the point of view of the user, considering the fact that in practice the user cannot infer all other valid rules from the rules extracted while visualizing them. None of the approaches proposed in [...]
From the point of view of the user, an association rule is redundant if it conveys the same information (or less general information) than the information conveyed by another rule of the same range (support) and the same precision (confidence). In previous work for reducing redundant implication rules (functional dependencies), the notion of non-redundancy considered is related to the inference system using Armstrong axioms [Arm74]. This notion is not to be confused with the notion of non-redundancy we consider here. To our knowledge, such an inference system for association rules, i.e., taking into account the supports and confidences of the rules, does not exist. An association rule $r \in E$ is non-redundant and minimal if there is no other association rule $r' \in E$ with the same support and confidence, whose antecedent is a subset of the antecedent of $r$ and whose consequent is a superset of the consequent of $r$.
Definition 3 (Minimal non-redundant association rules). An association rule $r : I_1 \rightarrow I_2$ is a minimal non-redundant association rule iff there does not exist another association rule $r' : I'_1 \rightarrow I'_2$ such that $support(r) = support(r')$, $confidence(r) = confidence(r')$, $I'_1 \subseteq I_1$ and $I_2 \subseteq I'_2$.
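The redundancy test of this definition can be written as a direct comparison between rules. In the sketch below, the `Rule` record, the function name `is_minimal_non_redundant` and the four example rules are hypothetical; the supports and confidences are those the rules would have in a five-object toy context.

```python
from fractions import Fraction
from typing import NamedTuple

class Rule(NamedTuple):
    antecedent: frozenset
    consequent: frozenset
    support: Fraction
    confidence: Fraction

def is_minimal_non_redundant(r, rules):
    """True iff no other rule has the same support and confidence,
    an antecedent included in that of r and a consequent including that of r."""
    for other in rules:
        if other == r:
            continue
        if (other.support == r.support
                and other.confidence == r.confidence
                and other.antecedent <= r.antecedent
                and r.consequent <= other.consequent):
            return False
    return True

one = Fraction(1)
rules = [
    Rule(frozenset("AB"), frozenset("C"), Fraction(2, 5), one),
    Rule(frozenset("AB"), frozenset("CE"), Fraction(2, 5), one),
    Rule(frozenset("A"), frozenset("C"), Fraction(3, 5), one),
    Rule(frozenset("ABC"), frozenset("E"), Fraction(2, 5), one),
]
flags = [is_minimal_non_redundant(r, rules) for r in rules]
print(flags)  # [False, True, True, False]
```

The rules AB → C and ABC → E are both redundant with respect to AB → CE, which has the same support and confidence, a smaller or equal antecedent and a larger consequent.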
Given this characterization, we define the generic basis for exact association rules (100% confidence rules) and the informative basis for approximate association rules. These bases consist of the minimal non-redundant exact and approximate association rules respectively. Let $FC$ be the set of frequent closed itemsets and let $FG$ be the set of their (minimal) generators.
Definition 4 (Generic basis). The generic basis contains all rules with the form $r : G \rightarrow (F \setminus G)$ between a generator itemset $G \in FG$ and its closure $F = \gamma(G) \in FC$ such that $G \neq \gamma(G)$.
Definition 5 (Informative basis). The informative basis contains all rules with the form $r : G \rightarrow (F \setminus G)$ between a generator itemset $G \in FG$ and a frequent closed itemset $F \in FC$ that is a superset of its closure: $\gamma(G) \subset F$. The transitive reduction of this basis, i.e., for $\nexists F' \in FC$ such that $\gamma(G) \subset F' \subset F$, is also a basis for all approximate association rules.
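Definitions 4 and 5 can be illustrated by a brute-force computation on a small context. The sketch below enumerates all itemsets, which is only viable for tiny examples and is not how the bases are generated in practice; the toy context, the threshold `MIN_COUNT` and all names are hypothetical, and no minimum confidence threshold is applied.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy context: five objects over the items A, B, C, D, E.
transactions = [set("ACD"), set("BCE"), set("ABCE"), set("BE"), set("ABCE")]
MIN_COUNT = 2  # assumed minimal support, as an absolute number of objects
ALL_ITEMS = frozenset().union(*transactions)

def count(itemset):
    return sum(1 for t in transactions if itemset <= t)

def gamma(itemset):
    """Closure: intersection of all the objects containing the itemset."""
    return ALL_ITEMS.intersection(*[t for t in transactions if itemset <= t])

frequent = [frozenset(c)
            for k in range(1, len(ALL_ITEMS) + 1)
            for c in combinations(sorted(ALL_ITEMS), k)
            if count(frozenset(c)) >= MIN_COUNT]
FC = {f for f in frequent if gamma(f) == f}
# G is a minimal generator iff removing any single item changes its closure.
FG = {g for g in frequent if all(gamma(g - {i}) != gamma(g) for i in g)}

# Generic basis: exact rules G -> gamma(G) \ G, as (antecedent, consequent).
generic = {(g, gamma(g) - g) for g in FG if gamma(g) != g}

# Informative basis: approximate rules G -> F \ G with gamma(G) strictly in F,
# stored as (antecedent, consequent, confidence).
informative = {(g, f - g, Fraction(count(f), count(g)))
               for g in FG for f in FC if gamma(g) < f}

# Transitive reduction: keep F only if no closed itemset lies strictly between.
reduction = {(g, f - g, Fraction(count(f), count(g)))
             for g in FG for f in FC
             if gamma(g) < f and not any(gamma(g) < f2 < f for f2 in FC)}

print(len(generic), len(informative), len(reduction))  # 7 10 7
```

For instance, the rule B → ACE (confidence 1/2) belongs to the informative basis but not to its transitive reduction, because the closed itemset BCE lies strictly between the closure BE of B and ABCE.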
All valid association rules, their supports and their confidences can be deduced from the union of the generic basis and the informative basis or its transitive reduction. Results of experiments conducted on real-life databases show that their generation is efficient and useful in practice, particularly when mining association rules from correlated data.
References
[AIS93] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. Proc. SIGMOD conf., 207-216, May 1993.
[AS94] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in [...]
[Arm74] [...] IFIP congress, 580-583, August 1974.
[Bay98] R. J. Bayardo. Efficiently mining long patterns from databases. Proc. SIGMOD conf., 85-93, June 1998.
[BAG99] R. J. Bayardo, R. Agrawal, and D. Gunopulos. Constraint-based rule mining in large, dense databases. Proc. ICDE conf., 188-197, March 1999.
[BG99] R. J. Bayardo and R. Agrawal. Mining the most interesting rules. Proc. KDD conf., 145-154, August 1999.
[BMUT97] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. Proc. SIGMOD conf., 255-264, May 1997.
[BMS97] S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlation. Proc. SIGMOD conf., 265-276, May 1997.
[DG86] V. Duquenne and J.-L. Guigues. Famille minimale d'implications informatives résultant d'un tableau de données binaires. Mathématiques et Sciences Humaines, 24(95):5-18, 1986.
[GW99] B. Ganter and R. Wille. Formal Concept Analysis: Mathematical foundations. Springer, 1999.
[HF95] J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. Proc. VLDB conf., 420-431, September 1995.
[Hec96] D. Heckerman. Bayesian networks for knowledge discovery. Advances in Knowledge Discovery and Data Mining, 273-305, 1996.
[LK98] D. Lin and Z. M. Kedem. Pincer-Search: A new algorithm for discovering the maximum frequent set. Proc. EDBT conf., 105-119, March 1998.
[Lux91] M. Luxenburger. Implications partielles dans un contexte. Mathématiques, Informatique et Sciences Humaines, 29(113):35-55, 1991.
[MTV94] H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. AAAI KDD workshop, 181-192, July 1994.
[NLHP98] R. T. Ng, V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained association rules. Proc. SIGMOD conf., 13-24, June 1998.
[PBTL99a] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Efficient mining of association rules using closed itemset lattices. Information Systems, 24(1):25-46, 1999.
[PBTL99b] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. Proc. ICDT conf., 398-416, January 1999.
[PBTL99c] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Closed set based discovery of small covers for association rules. Proc. BDA conf., 361-381, October 1999.
[SON95] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. Proc. VLDB conf., 432-444, September 1995.
[ST96] A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Transactions on Knowledge and Data Engineering, 8(6):970-974, December 1996.
[SBM98] C. Silverstein, S. Brin, and R. Motwani. Beyond market baskets: Generalizing association rules to dependence rules. Data Mining and Knowledge Discovery, 2(1):39-68, January 1998.
[SA95] R. Srikant and R. Agrawal. Mining generalized association rules. Proc. VLDB conf., 407-419, September 1995.
[ZPOL97] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. Proc. KDD conf., 283-286, August 1997.