Mining association rules using formal concept analysis


HAL Id: hal-00467752

https://hal.archives-ouvertes.fr/hal-00467752

Submitted on 26 Apr 2010

Nicolas Pasquier

To cite this version:

Nicolas Pasquier. Mining association rules using formal concept analysis. ICCS’2000 International Conference on Conceptual Structures, Aug 2000, Darmstadt, Germany. pp.259-264. ⟨hal-00467752⟩

Nicolas Pasquier

LIMOS, Université Blaise Pascal - Clermont-Ferrand II, 24 Avenue des Landais, F-63177 Aubière, France

pasquier@libd2.univ-bpclermont.fr

Abstract. In this paper, we give an overview of the use of Formal Concept Analysis in the framework of association rule extraction. Using frequent closed itemsets and their generators, which are defined using the Galois closure operator, we address two major problems: the response times of association rule extraction and the relevance and usefulness of discovered association rules. We quickly review the Close and A-Close algorithms for extracting frequent closed itemsets using their generators, which reduce the response times of the extraction, especially in the case of correlated data. We also present definitions of the generic and informative bases for association rules, whose generation improves the relevance and usefulness of discovered association rules.

1 Introduction

Data mining has been extensively addressed in recent years as the computational part of Knowledge Discovery in Databases (KDD), especially the problem of discovering association rules. Its aim is to exhibit relationships between itemsets (sets of binary attributes) in large databases. An example of an association rule, fitting in the context of market basket data analysis, is "cereal ∧ milk → sugar (support 10%, confidence 60%)", stating that 60% of customers who buy cereals and milk also buy sugar and that 10% of all customers buy all three items. When an association rule has support and confidence exceeding some user-defined minimum support and minimum confidence thresholds, the rule is considered relevant for supporting decision making [AIS93]. Association rules have been successfully applied in a wide range of domains, among which are marketing decision support, diagnosis and medical research support, telecommunication process improvement, web site management and access, and the analysis of multimedia, spatial, geographical and statistical data.
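To make the two measures concrete, the following minimal sketch computes the support and confidence of the rule cereal ∧ milk → sugar. The five baskets and the function names are illustrative assumptions, not data from the paper, so the resulting values differ from the 10% and 60% quoted above.

```python
from fractions import Fraction

# Hypothetical toy baskets (an assumption, not data from the paper).
baskets = [
    {"cereal", "milk", "sugar"},
    {"cereal", "milk"},
    {"cereal", "milk", "sugar", "bread"},
    {"milk", "bread"},
    {"cereal", "sugar"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item of `itemset`."""
    itemset = set(itemset)
    return Fraction(sum(itemset <= b for b in baskets), len(baskets))

def confidence(antecedent, consequent, baskets):
    """support(antecedent and consequent together) / support(antecedent).

    Raises ZeroDivisionError if no basket contains the antecedent.
    """
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

# Rule: cereal AND milk -> sugar
rule_support = support({"cereal", "milk", "sugar"}, baskets)
rule_confidence = confidence({"cereal", "milk"}, {"sugar"}, baskets)
print(rule_support, rule_confidence)
```

On this toy data, two of the five baskets contain all three items, and two of the three baskets containing cereal and milk also contain sugar.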

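The first phase of association rule extraction consists in selecting useful data from the database and transforming it into a data mining context. This context is a triplet $\mathcal{D} = (\mathcal{O}, \mathcal{I}, \mathcal{R})$, where $\mathcal{O}$ and $\mathcal{I}$ are finite sets of objects and items respectively, and $\mathcal{R} \subseteq \mathcal{O} \times \mathcal{I}$ is a binary relation. Each couple $(o, i) \in \mathcal{R}$ denotes the fact that the object $o \in \mathcal{O}$ is related to the item $i \in \mathcal{I}$. Two major problems in association rule extraction give rise to interesting research topics: the problem of the response times of the extraction and the problem of the relevance and usefulness of the discovered association rules.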
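As an illustration of the data mining context, the sketch below builds the three components (objects, items, relation) from a list of (object, item) couples. The five-object relation and the helper names `build_context`, `items_of` and `objects_of` are assumptions made for this example.

```python
# Hypothetical relation, listed as (object, item) couples (an assumption).
PAIRS = [
    (1, "A"), (1, "C"), (1, "D"),
    (2, "B"), (2, "C"), (2, "E"),
    (3, "A"), (3, "B"), (3, "C"), (3, "E"),
    (4, "B"), (4, "E"),
    (5, "A"), (5, "B"), (5, "C"), (5, "E"),
]

def build_context(pairs):
    """Return the triplet (objects, items, relation) induced by the couples."""
    relation = frozenset(pairs)
    objects = frozenset(o for o, _ in relation)
    items = frozenset(i for _, i in relation)
    return objects, items, relation

def items_of(obj, relation):
    """Items related to one object."""
    return frozenset(i for o, i in relation if o == obj)

def objects_of(item, relation):
    """Objects related to one item."""
    return frozenset(o for o, i in relation if i == item)

objects, items, relation = build_context(PAIRS)
print(sorted(objects), sorted(items), sorted(items_of(3, relation)))
```

Storing the relation as a set of couples mirrors the definition directly; a practical implementation would more likely keep one item set per object to avoid rescanning the relation.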

Existing approaches for mining association rules are based on the following decomposition of the problem: the extraction of frequent itemsets¹ and their supports from the context, and then the generation of all valid association rules². The first phase is the most computationally intensive part of the process, since the number of potential frequent itemsets is exponential in the size of the set of items and several database passes are required. Two approaches have been proposed: levelwise algorithms for extracting frequent itemsets and algorithms for extracting maximal frequent itemsets. These algorithms give acceptable response times when mining association rules from weakly correlated data, such as market basket data, but their performance decreases drastically when they are applied to correlated data, such as statistical or medical data. We recall these two approaches and then present our approach, which is based on Formal Concept Analysis [GW99].

2.1 Levelwise algorithms for extracting frequent itemsets

These algorithms consider during each iteration a set of itemsets of a given size, i.e., a set of itemsets of a "level" of the itemset lattice. To limit the number of candidate itemsets considered, they rely on the following properties: all the supersets of an infrequent itemset are infrequent and all the subsets of a frequent itemset are frequent [AS94, MTV94]. Using these properties, the candidate $k$-itemsets (itemsets of size $k$) of the $k$-th iteration are generated by joining two frequent $(k-1)$-itemsets discovered during the preceding iteration. The Apriori [AS94] and OCD [MTV94] algorithms carry out a number of scans of the context equal to the size of the largest frequent itemsets. The Partition [SON95] algorithm allows the parallelization of the extraction process, and the DIC [BMUT97] algorithm reduces the number of context scans by considering itemsets of different sizes during each iteration. The Partition and DIC algorithms involve additional costs in CPU time compared to the Apriori and OCD algorithms, due to the increase in the number of candidate itemsets tested.
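The levelwise principle can be sketched as follows. This is a simplified illustration in the spirit of Apriori, not the published algorithm: the join step unites any two frequent $k$-itemsets whose union has $k+1$ items rather than using a prefix-based join, supports are absolute counts, and the toy context and the name `apriori` are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: absolute support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while candidates:
        # One scan of the context per level: count every candidate k-itemset.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Join step: unite two frequent k-itemsets that differ by one item.
        joined = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step: keep a candidate only if all its k-subsets are frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k))
        ]
        k += 1
    return frequent

# Hypothetical toy context: each string lists the items of one object.
transactions = ["ACD", "BCE", "ABCE", "BE", "ABCE"]
frequent = apriori(transactions, min_count=2)
print(len(frequent))
```

With a minimum count of 2, item D is discarded at the first level, so no candidate containing D is ever counted; the loop stops after the level of size 4, where ABCE is the only frequent itemset.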

2.2 Algorithms for extracting maximal frequent itemsets

These algorithms are based on the property that the maximal frequent itemsets, i.e., the frequent itemsets all of whose supersets are infrequent, form a border under which all itemsets are frequent. The extraction of the maximal frequent itemsets is carried out by an iterative browsing of the itemset lattice that "advances" by one level from the bottom upwards and by one or more levels from the top downwards during each iteration. Using the maximal frequent itemsets, all frequent itemsets are derived and their supports are computed by performing one final scan of the context. Four algorithms based on this approach were proposed: Pincer-Search [LK98], MaxClique and MaxEclat [ZPOL97], and Max-Miner [Bay98]. These algorithms reduce the number of iterations, and thus decrease the number of context scans and the number of CPU operations carried out, compared to levelwise algorithms for extracting frequent itemsets.

¹ An itemset is frequent if its support is greater than or equal to the minimal support threshold.

² An association rule is valid if its support and confidence are at least equal to the minimal support and minimal confidence thresholds.
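The border property of maximal frequent itemsets can be checked on a small example. The sketch below is a brute-force illustration only (it enumerates every itemset, which none of the cited algorithms do); the toy context and all names are assumptions.

```python
from itertools import chain, combinations

# Hypothetical toy context: each string lists the items of one object.
transactions = [frozenset(t) for t in ("ACD", "BCE", "ABCE", "BE", "ABCE")]
MIN_COUNT = 2

def nonempty_subsets(itemset):
    """All non-empty subsets of `itemset`, as frozensets."""
    items = sorted(itemset)
    return {
        frozenset(c)
        for r in range(1, len(items) + 1)
        for c in combinations(items, r)
    }

def count(itemset):
    """Number of objects containing `itemset`."""
    return sum(itemset <= t for t in transactions)

all_items = frozenset(chain.from_iterable(transactions))
frequent = {s for s in nonempty_subsets(all_items) if count(s) >= MIN_COUNT}

# Maximal frequent itemsets: frequent itemsets with no frequent proper superset.
maximal = {f for f in frequent if not any(f < g for g in frequent)}

# Everything under the border is frequent: derive the frequent itemsets from
# the maximal ones, then obtain the supports with one pass over the context.
derived = set().union(*(nonempty_subsets(m) for m in maximal))
supports = {s: count(s) for s in derived}
print(sorted("".join(sorted(m)) for m in maximal))
```

On this data the border consists of the single itemset ABCE, and its 15 non-empty subsets are exactly the frequent itemsets.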

2.3 Algorithms for extracting frequent closed itemsets

In contrast to the two previous approaches, our approach [PBTL99a] is based on Formal Concept Analysis. The closure operator $\gamma$ of the Galois connection [GW99] is the composition of the application $\phi$, which associates with $O \subseteq \mathcal{O}$ the items common to all objects $o \in O$, and the application $\psi$, which associates with an itemset $I \subseteq \mathcal{I}$ the objects related to all items $i \in I$ (the objects "containing" $I$). The closure operator $\gamma = \phi \circ \psi$ associates with an itemset $I$ the maximal set of items common to all the objects containing $I$, i.e., the intersection of these objects. Using this closure operator, the frequent closed itemsets are defined.

Definition 1 (Frequent closed itemsets). A frequent itemset $I \subseteq \mathcal{I}$ is a frequent closed itemset iff $\gamma(I) = I$.
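The closure operator and Definition 1 can be illustrated with a short sketch. The two mappings of the Galois connection are written here as functions named `phi` and `psi`; these names, the toy context and the helper `is_closed` are assumptions made for the example.

```python
# Hypothetical toy context: object identifier -> items related to it.
context = {
    1: frozenset("ACD"),
    2: frozenset("BCE"),
    3: frozenset("ABCE"),
    4: frozenset("BE"),
    5: frozenset("ABCE"),
}
ALL_ITEMS = frozenset().union(*context.values())

def psi(itemset):
    """Objects related to every item of `itemset` (the objects containing it)."""
    itemset = frozenset(itemset)
    return frozenset(o for o, row in context.items() if itemset <= row)

def phi(objects):
    """Items common to every object of `objects` (all items if it is empty)."""
    common = ALL_ITEMS
    for o in objects:
        common &= context[o]
    return common

def closure(itemset):
    """Galois closure: intersection of all the objects containing `itemset`."""
    return phi(psi(itemset))

def is_closed(itemset):
    return closure(itemset) == frozenset(itemset)

print(sorted(closure("A")), is_closed("BCE"), is_closed("AB"))
```

Here the closure of {A} is {A, C} because every object containing A also contains C, so {A} is not closed while {A, C} is.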

The frequent closed itemsets constitute, together with their supports, a generating set for all frequent itemsets and their supports, and thus for all association rules, their supports and their confidences [PBTL99a]. This property relies on two facts: the support of a frequent itemset is equal to the support of its closure, and the maximal frequent itemsets are maximal frequent closed itemsets. Two efficient levelwise algorithms, called Close [PBTL99a] and A-Close [PBTL99b], were proposed for extracting frequent closed itemsets from large databases. In order to improve the efficiency of the extraction, the Close and A-Close algorithms consider the generator itemsets of the frequent closed itemsets.

Definition 2 (Generator itemsets). An itemset $G \subseteq \mathcal{I}$ is a generator of a closed itemset $I$ iff $\gamma(G) = I$ and $\nexists\, G' \subseteq \mathcal{I}$ with $G' \subset G$ such that $\gamma(G') = I$.
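Definition 2 can be checked by brute force on a small context: a generator of a closed itemset is a minimal itemset having that closure. The sketch below enumerates subsets exhaustively, which is only practical for toy data; the context and the function names are assumptions.

```python
from itertools import combinations

# Hypothetical toy context: each string lists the items of one object.
CONTEXT = [frozenset(t) for t in ("ACD", "BCE", "ABCE", "BE", "ABCE")]
ALL_ITEMS = frozenset().union(*CONTEXT)

def closure(itemset):
    """Intersection of all objects containing `itemset`."""
    common = ALL_ITEMS
    for row in CONTEXT:
        if itemset <= row:
            common &= row
    return common

def generators_of(closed_itemset):
    """Minimal subsets of `closed_itemset` whose closure equals it."""
    closed = frozenset(closed_itemset)
    subsets = [
        frozenset(c)
        for r in range(len(closed) + 1)
        for c in combinations(sorted(closed), r)
    ]
    same_closure = [g for g in subsets if closure(g) == closed]
    # Keep only the itemsets with no proper subset having the same closure.
    return {g for g in same_closure if not any(h < g for h in same_closure)}

print(sorted("".join(sorted(g)) for g in generators_of("ABCE")))
```

Only subsets of the closed itemset need to be examined, since an itemset is always included in its own closure.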

Close and A-Close perform a breadth-first search for the (frequent) generators of the frequent closed itemsets in a levelwise manner. During an iteration $k$, the Close algorithm considers a set of candidate generators of size $k$, determines their supports and their closures, and then deletes all infrequent generators. The supports and the closures of the candidate $k$-generators are computed by performing one database pass and, for each generator $G$, intersecting all the objects containing $G$ (their number gives the support of $G$). During the $(k+1)$-th iteration, the candidate $(k+1)$-generators are constructed by joining two frequent $k$-generators if their $k-1$ first items are identical, and the candidate $(k+1)$-generators [...] identified according to their supports only, since the support of a generator itemset is different from the supports of all its proper subsets, and one more database pass is performed at the end of the algorithm for computing the closures of all frequent generators discovered. Both algorithms initialize the set of candidate 1-generators with the list of all itemsets of size 1. Experimental results show that these algorithms are particularly efficient for mining association rules from dense or correlated data, which represent an important part of real-life databases. On such data, Close outperforms A-Close, and both clearly outperform algorithms for extracting frequent itemsets, whereas for weakly correlated data, A-Close outperforms Close and is in the range of the algorithms described in section 2.1.
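The levelwise search for generators can be sketched as below. This is a simplified reading of the Close description above, not the published algorithm: the pruning of candidates included in the closure of one of their subsets is an assumption about the step that is incomplete in this text, supports are absolute counts, `min_count` is assumed to be at least 1, and the closure of the empty itemset is not handled.

```python
from itertools import combinations

def close(transactions, min_count):
    """Return {frequent closed itemset: absolute support} (min_count >= 1)."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    candidates = [(i,) for i in items]  # generators kept as sorted tuples
    closed = {}
    k = 1
    while candidates:
        level = {}  # frequent k-generator -> its closure
        for g in candidates:
            covering = [t for t in transactions if set(g) <= t]
            if len(covering) >= min_count:
                # Closure = intersection of all objects containing g.
                level[g] = frozenset.intersection(*covering)
                closed[level[g]] = len(covering)
        next_candidates = []
        for a, b in combinations(sorted(level), 2):
            if a[:-1] != b[:-1]:
                continue  # join only generators sharing their k-1 first items
            c = a + (b[-1],)
            subsets = list(combinations(c, k))
            if not all(s in level for s in subsets):
                continue  # some k-subset is infrequent or not a generator
            if any(set(c) <= level[s] for s in subsets):
                continue  # c has the same closure as a subset: not a generator
            next_candidates.append(c)
        candidates = next_candidates
        k += 1
    return closed

# Hypothetical toy context: each string lists the items of one object.
result = close(["ACD", "BCE", "ABCE", "BE", "ABCE"], min_count=2)
print({"".join(sorted(c)): n for c, n in result.items()})
```

On this data only four candidates are examined at level 2 (AB, AE, BC, CE), because AC and BE are already included in the closures of A and B respectively.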

3 Relevance of extracted association rules

The problem of the usefulness and relevance of discovered association rules is related to the huge number of rules extracted and the presence of many redundancies among them for many datasets, especially for correlated data. Several approaches for solving this problem have been proposed. We first quickly review these approaches and then present the approach we propose, which consists in generating non-redundant association rules with minimal antecedents and maximal consequents using Formal Concept Analysis.

3.1 Previous work

The use of statistical measures other than confidence, such as conviction, Pearson's correlation or the $\chi^2$ test, to compute the precision of rules is proposed in [BMS97, SBM98]. Generalized association rules, which are rules between itemsets that belong to different levels of a taxonomy of the items, are defined in [HF95, SA95]. In [Hec96, ST96], deviation measures, i.e., measures of distance between association rules used for pruning similar ones, are defined using support and confidence. Item constraints [BAG99, NLHP98] are boolean expressions that allow the user to specify the form of the association rules that will be selected. In [BG99], A-maximal rules, which are rules for which the population of objects concerned is reduced when an item is added to the antecedent, are defined. In [PBTL99c], the Duquenne-Guigues basis for global implications [DG86, GW99] and the Luxenburger basis for partial implications [Lux91] are adapted to the association rules framework. These bases are minimal with respect to the number of rules extracted, but they are not made up of the most informative association rules, which are the non-redundant rules with minimal antecedents and maximal consequents, called minimal non-redundant association rules. We believe that these rules are the most relevant and useful from the point of view of the user, considering the fact that in practice the user cannot infer all other valid rules from the rules extracted while visualizing them. None of the approaches proposed in [...]

From the point of view of the user, an association rule is redundant if it conveys the same information (or less general information) as that conveyed by another rule of the same range (support) and the same precision (confidence). In previous work on reducing redundant implication rules (functional dependencies), the notion of non-redundancy considered is related to the inference system using Armstrong axioms [Arm74]. This notion is not to be confused with the notion of non-redundancy we consider here. To our knowledge, such an inference system for association rules, i.e., one taking into account the supports and confidences of the rules, does not exist. An association rule $r \in E$ is non-redundant and minimal if there is no other association rule $r' \in E$ with the same support and confidence, whose antecedent is a subset of the antecedent of $r$ and whose consequent is a superset of the consequent of $r$.

Definition 3 (Minimal non-redundant association rules). An association rule $r : I_1 \rightarrow I_2$ is a minimal non-redundant association rule iff there does not exist another association rule $r' : I'_1 \rightarrow I'_2$ such that $\mathrm{support}(r) = \mathrm{support}(r')$, $\mathrm{confidence}(r) = \mathrm{confidence}(r')$, $I'_1 \subseteq I_1$ and $I_2 \subseteq I'_2$.
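Definition 3 translates directly into a filter over a set of rules. The sketch below is an illustration under assumptions: the `Rule` type, the function names and the four example rules (all with the same support and confidence) are hypothetical.

```python
from fractions import Fraction
from typing import NamedTuple

class Rule(NamedTuple):
    antecedent: frozenset
    consequent: frozenset
    support: Fraction
    confidence: Fraction

def is_redundant(rule, rules):
    """True if another rule with the same support and confidence has a
    smaller-or-equal antecedent and a larger-or-equal consequent."""
    return any(
        other != rule
        and other.support == rule.support
        and other.confidence == rule.confidence
        and other.antecedent <= rule.antecedent
        and rule.consequent <= other.consequent
        for other in rules
    )

def minimal_non_redundant(rules):
    return [r for r in rules if not is_redundant(r, rules)]

def make(antecedent, consequent):
    """Hypothetical helper: all example rules share support 2/5, confidence 1."""
    return Rule(frozenset(antecedent), frozenset(consequent),
                Fraction(2, 5), Fraction(1))

rules = [make("AB", "CE"), make("AB", "C"), make("AB", "E"), make("ABC", "E")]
kept = minimal_non_redundant(rules)
print([("".join(sorted(r.antecedent)), "".join(sorted(r.consequent))) for r in kept])
```

Only AB → CE survives: it has the smallest antecedent and the largest consequent among the four rules, so the other three convey no additional information.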

Given this characterization, we define the generic basis for exact association rules (100% confidence rules) and the informative basis for approximate association rules. These bases consist of the minimal non-redundant exact and approximate association rules respectively. Let $FC$ be the set of frequent closed itemsets and let $FG$ be the set of their (minimal) generators.

Definition 4 (Generic basis). The generic basis contains all rules of the form $r : G \rightarrow (F \setminus G)$ between a generator itemset $G \in FG$ and its closure $F = \gamma(G) \in FC$ such that $G \neq \gamma(G)$.

Definition 5 (Informative basis). The informative basis contains all rules of the form $r : G \rightarrow (F \setminus G)$ between a generator itemset $G \in FG$ and a frequent closed itemset $F \in FC$ that is a superset of its closure: $\gamma(G) \subset F$. The transitive reduction of this basis, i.e., the rules for which $\nexists\, F' \in FC$ such that $\gamma(G) \subset F' \subset F$, is also a basis for all approximate association rules.

All valid association rules, their supports and their confidences can be deduced from the union of the generic basis and the informative basis or its transitive reduction. Results of experiments conducted on real-life databases show that their generation is efficient and useful in practice, particularly when mining association rules from correlated data.
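The two bases and the transitive reduction can be computed on a small context as sketched below. This is a brute-force illustration under assumptions: the toy context is hypothetical, frequent itemsets are enumerated exhaustively, generators are detected by comparing supports with those of their immediate subsets, and no minimum confidence threshold is applied.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy context: each string lists the items of one object.
CONTEXT = [frozenset(t) for t in ("ACD", "BCE", "ABCE", "BE", "ABCE")]
MIN_COUNT = 2

def count(itemset):
    """Absolute support: number of objects containing `itemset`."""
    return sum(itemset <= t for t in CONTEXT)

def closure(itemset):
    """Intersection of the objects containing `itemset` (support >= 1)."""
    return frozenset.intersection(*[t for t in CONTEXT if itemset <= t])

items = sorted(set().union(*CONTEXT))
frequent = [
    frozenset(c)
    for r in range(1, len(items) + 1)
    for c in combinations(items, r)
    if count(frozenset(c)) >= MIN_COUNT
]
FC = {closure(f) for f in frequent}
# A generator has a support different from that of each immediate subset.
FG = {f for f in frequent if all(count(f - {i}) > count(f) for i in f)}

def rule(g, f):
    """Rule g -> f minus g, with absolute support and confidence."""
    return (g, f - g, count(f), Fraction(count(f), count(g)))

generic = {rule(g, closure(g)) for g in FG if closure(g) != g}
informative = {rule(g, f) for g in FG for f in FC if closure(g) < f}
reduced = {
    rule(g, f)
    for g in FG
    for f in FC
    if closure(g) < f and not any(closure(g) < h < f for h in FC)
}
print(len(generic), len(informative), len(reduced))
```

For example, the generic basis contains the exact rule AB → CE, and the informative basis contains B → CE with confidence 3/4; the rule B → ACE is dropped by the transitive reduction because BCE lies strictly between the closure BE and ABCE.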

References

[AIS93] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. Proc. SIGMOD conf., 207-216, May 1993.

[AS94] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in [...]

[Arm74] [...] IFIP congress, pp 580-583, August 1974.

[Bay98] R. J. Bayardo. Efficiently mining long patterns from databases. Proc. SIGMOD conf., 85-93, June 1998.

[BAG99] R. J. Bayardo, R. Agrawal, and D. Gunopulos. Constraint-based rule mining in large, dense databases. Proc. ICDE conf., 188-197, March 1999.

[BG99] R. J. Bayardo and R. Agrawal. Mining the most interesting rules. Proc. KDD Conference, 145-154, August 1999.

[BMUT97] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. Proc. SIGMOD conf., 255-264, May 1997.

[BMS97] S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlation. Proc. SIGMOD conf., 265-276, May 1997.

[DG86] V. Duquenne and J.-L. Guigues. Famille minimale d'implications informatives résultant d'un tableau de données binaires. Mathématiques et Sciences Humaines, 24(95):5-18, 1986.

[GW99] B. Ganter and R. Wille. Formal Concept Analysis: Mathematical foundations. Springer, 1999.

[HF95] J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. Proc. VLDB conf., 420-431, September 1995.

[Hec96] D. Heckerman. Bayesian networks for knowledge discovery. Advances in Knowledge Discovery and Data Mining, 273-305, 1996.

[LK98] D. Lin and Z. M. Kedem. Pincer-Search: A new algorithm for discovering the maximum frequent set. Proc. EDBT conf., 105-119, March 1998.

[Lux91] M. Luxenburger. Implications partielles dans un contexte. Mathématiques, Informatique et Sciences Humaines, 29(113):35-55, 1991.

[MTV94] H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. AAAI KDD workshop, 181-192, July 1994.

[NLHP98] R. T. Ng, V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained association rules. Proc. SIGMOD conf., 13-24, June 1998.

[PBTL99a] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Efficient mining of association rules using closed itemset lattices. Information Systems, 24(1):25-46, 1999.

[PBTL99b] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. Proc. ICDT conf., 398-416, January 1999.

[PBTL99c] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Closed set based discovery of small covers for association rules. Proc. BDA conf., 361-381, October 1999.

[SON95] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. Proc. VLDB conf., 432-444, September 1995.

[ST96] A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Transactions on Knowledge and Data Engineering, 8(6):970-974, December 1996.

[SBM98] C. Silverstein, S. Brin, and R. Motwani. Beyond market baskets: Generalizing association rules to dependence rules. Data Mining and Knowledge Discovery, 2(1):39-68, January 1998.

[SA95] R. Srikant and R. Agrawal. Mining generalized association rules. Proc. VLDB conf., 407-419, September 1995.

[ZPOL97] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. Proc. KDD conf., 283-286, August 1997.
