• Aucun résultat trouvé

Mining minimal non-redundant association rules using frequent closed itemsets

N/A
N/A
Protected

Academic year: 2021

Partager "Mining minimal non-redundant association rules using frequent closed itemsets"

Copied!
16
0
0

Texte intégral

(1)

HAL Id: hal-00467751

https://hal.archives-ouvertes.fr/hal-00467751

Submitted on 26 Apr 2010

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of

sci-entific research documents, whether they are

pub-lished or not. The documents may come from

teaching and research institutions in France or

abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est

destinée au dépôt et à la diffusion de documents

scientifiques de niveau recherche, publiés ou non,

émanant des établissements d’enseignement et de

recherche français ou étrangers, des laboratoires

publics ou privés.

Mining minimal non-redundant association rules using

frequent closed itemsets

Yves Bastide, Nicolas Pasquier, Rafik Taouil, Gerd Stumme, Lotfi Lakhal

To cite this version:

Yves Bastide, Nicolas Pasquier, Rafik Taouil, Gerd Stumme, Lotfi Lakhal. Mining minimal

non-redundant association rules using frequent closed itemsets. CL’2000 international conference on

Com-putational Logic, Jul 2000, Londres, United Kingdom. pp.972-986. �hal-00467751�

(2)

Rules using Frequent Closed Itemsets YvesBastide 1 ,Ni olasPasquier 1 , Ra kTaouil 1 ,GerdStumme 1;2 ,Lot Lakhal 1 1

L.I.M.O.S.,UniversiteBlaisePas al{Clermont-FerrandII,

ComplexedesCezeaux,24AvenuedesLandais,F{63177Aubiere edex,Fran e; fbastide,pasquier,taouil,lakhalglibd2.univ-bp lermont.fr

2

Te hnis heUniversitatDarmstadt,Fa hberei hMathematik,S hlogartenstr.7, D{64289Darmstadt,Germany;stummemathematik.tu-darmstadt.de

Abstra t. Theproblemoftherelevan eandtheusefulnessofextra ted asso iation rules is of primary importan e be ause, inthe majority of ases,real-lifedatabasesleadtoseveralthousandsasso iationruleswith high on den e and among whi h are many redundan ies. Using the losureoftheGalois onne tion,wede netwonewbasesforasso iation ruleswhi hunionisageneratingsetforall validasso iation ruleswith support and on den e. These bases are hara terized using frequent loseditemsetsandtheir generators;they onsistofthe non-redundant exa tandapproximateasso iationruleshavingminimalante edentsand maximal onsequents,i.e.themostrelevantasso iationrules.Algorithms forextra tingthesebasesarepresentedandresultsofexperiments arried outonreal-lifedatabases showthattheproposedbases areuseful,and thattheirgenerationisnottime onsuming.

1 Introdu tion

Thepurposeofasso iationruleextra tion,introdu edin [AIS93℄,istodis over signi ant relations between binary attributes extra ted from databases. An example ofasso iationrule extra ted from adatabaseof supermarket sales is: \ ereals^sugar!milk(support 7%, on den e 50%)".Thisrule statesthat the ustomers who buy ereals and sugar also tend to buy milk. The support de nes the range ofthe rule, i.e. the proportion of ustomers who bought the threeitemsamongall ustomers,andthe on den ede nesthepre isionofthe rule,i.e.theproportionof ustomerswhoboughtmilkamongthosewhobought erealsandsugar.Anasso iationruleis onsideredrelevantforde isionmaking if it has support and on den e at least equal to some minimal support and on den ethresholds,minsupportandmin on den e,de nedbytheuser. Theproblemofrelevan eandusefulnessoftheresultisrelatedtothenumberof extra tedasso iationrules{thatisingeneralverylarge{andtothepresen eof ahugeproportionofredundantrules,i.e.rules onveyingthesameinformation, among them. Even though the visualization of a relativelysigni antnumber of rules an be simpli ed by the use of visualization tools su h as the Rule

(3)

jorityoftheextra tedrulesforseveralkindsofdata,theirsuppression redu es onsiderablythenumberofrulestobemanagedduringthevisualization.

Example 1. In order to illustrate the problem of redundant asso iation rules, nine asso iation rules extra ted from UCI KDD's ar hives's dataset Mush-rooms

1

des ribingthe hara teristi sof8416mushroomsarepresentedbelow. These nine rules haveidenti al supports and on den es of 51%and 54% re-spe tively,andtheitem\freegills"in theante edent:

1)freegills!eatable 6)freegills!eatable,partialveil, whiteveil 2)freegills!eatable,partialveil 7)freegills,partialveil!eatable,whiteveil 3)freegills!eatable,whiteveil 8)freegills,whiteveil!eatable,partialveil 4)freegills, whiteveil!eatable 9)freegills,partialveil, whiteveil!eatable 5)freegills, partialveil!eatable

Obviously,givenrule6,rules1to5and7to9areredundant,sin etheydonot onveyany additionalinformation to theuser. Rule 6hasminimal ante edent andmaximal onsequentanditisthemostinformativeamongtheseninerules. Inordertoimprovetherelevan eandtheusefulnessofextra tedrules,onlyrule 6shouldbeextra tedandpresentedto theuser.

Severalmethods havebeenproposed in theliteratureto redu e thenumberof extra ted asso iation rules. Generalized asso iation rules [HF95,SA95℄ are de- ned using ataxonomyof theitems; theyare rulesbetweensets of items that belong to di erent levels ofthe taxonomy. The useof statisti measures other than on den e su h as onvi tion, Pearson's orrelation or 

2

test is stud-iedin[BMS97,SBM98℄.In[He 96,PSM94,ST96℄, theuseofdeviationmeasures, i.e. measures of distan e between asso iation rules, de ned a ordingto their supports and on den es, is proposed.In[BAG99,NLHP98,SVA97℄,theuse of item onstraints,that are boolean expressionsde ned bythe user,in order to spe ify the form of the asso iation rules that will be presented to the user is proposed.Theapproa hproposedin[BG99℄istopresenttotheuserruleswith maximalante edents, alledA-maximalrules,thatarerulesforwhi hthe popu-lationofobje ts on ernedisredu edwhenanitemisaddedtotheante edent. In [PBTL99 ℄, we adapt the Duquenne-Guigues basis for global impli ations [DG86,GW99℄andtheproperbasisforpartialimpli ations[Lux91℄tothe asso- iation rules framework.It is demonstrated that these bases are minimal with respe t to the number of extra ted asso iation rules. However, none of these methods allowsto generate thenon-redundantasso iation rules with minimal ante edents and maximal onsequentswhi h we believe are the most relevant andusefulfromthepointofviewoftheuser.

1.1 Contribution

Intherestofthepaper,twokindsofasso iationrulesaredistinguished:

(4)

validforalltheobje tsofthe ontext.Theserulesarewritten l)l 0

. { Approximate asso iation rules whose on den e is lower than 100%, i.e.

whi h are valid for a proportion of obje ts of the ontext equal to their on den e.Theserulesarewrittenl!l

0 .

Thesolutionproposedinthispaper onsistsingeneratingbases,orredu ed ov-ers, forasso iationrules.These bases ontainnoredundantrule, beingthus of smallersize.Ourgoalisto limittheextra tion tothemostinformative asso i-ationrulesfromthepointofviewoftheuser.

Using thesemanti fortheextra tion ofasso iationrulesbasedon the losure oftheGalois onne tion[PBTL98℄,thegeneri basis for exa tasso iation rules and the informative basis for approximate asso iation rules are de ned. They are onstru ted using the frequent losed itemsets and their generators, and theyminimizethe numberof asso iationrules generatedwhile maximizingthe quantityandthequalityoftheinformation onveyed.Theyallowfor:

1. Thegenerationofonlythemostinformativenon-redundantasso iationrules, i.e.ofthemostusefulandrelevantrules:thosehavingaminimalante edent (left-handside)andamaximal onsequent(right-handside).Thusredundant ruleswhi h represent in ertain databasesthe majority of extra ted rules, parti ularlyinthe aseofdenseor orrelateddataforwhi hthetotalnumber ofvalidrulesis verylarge,willbepruned.

2. Thepresentation to theuser ofa set ofrules overingall theattributes of thedatabase,i.e. ontainingruleswheretheunionoftheante edents(resp. onsequents)isequaltotheunionsoftheante edents(resp. onsequents)of allthe asso iation rules valid in the ontext. This is ne essaryin order to dis overrulesthatare\surprising"to theuser,whi h onstituteimportant informationthatitisne essaryto onsider [He 96,PSM94,ST96℄.

3. The extra tion of a set of rules without any loss of information, i.e. on-veyingalltheinformation onveyedbytheset ofallvalid asso iationrules. It is possible to dedu e eÆ iently, without a ess to the dataset, all valid asso iationruleswiththeirsupports and on den es fromthesebases.

Theunionofthesetwobasesthus onstitutesasmallnon-redundantgenerating set forallvalidasso iationrules,theirsupportsandtheir on den es.

In se tion 2, we re all the semanti for asso iation rules based on the Galois onne tion. Thenewbaseswepropose andalgorithms forgeneratingthem are de ned in se tion3. Resultsof experiments we ondu ted onreal-life datasets arepresentedinse tion4andse tion5 on ludes thepaper.

2 Semanti for asso iation rules based on the Galois

onne tion

(5)

thefa tthat theobje to2O isrelatedto theitemi2I.

Example 2. A datamining ontextD onstitutedofsixobje ts(ea hone iden-ti ed by its OID) and veitems is represented in the table 1. This ontext is usedas supportfortheexamplesintherestofthepaper.

OID Items 1 A C D 2 B C E 3 A B C E 4 B E 5 A B C E 6 B C E

Table1.Datamining ontextD.

The losureoperator ofthe Galois onne tion[GW99℄ is the omposition of theappli ation, thatasso iateswithO Otheitems ommonto allobje ts o2O,andtheappli ation , thatasso iateswithanitemset lI theobje ts relatedtoallitemsi2l (the obje ts\ ontaining"l).

De nition1 (Frequent itemsets). A set of items l I is alled an item-set. The support of an itemsetl is the per entage of obje ts in D ontaining l: support(l)=j (l)j=jOj.l isafrequent itemsetifsupport(l)minsupport.

De nition2 (Asso iation rules). An asso iation rule r is an impli ation between two frequent itemsetsl

1 ;l 2 I of the forml 1 !(l 2 nl 1 )where l 1 l 2 . The support and the on den e of r are de ned as: support(r) = support(l

2 ) and onfiden e(r)=support(l

2

)=support(l 1

).

The losure operator = Æ asso iates with an itemset l the maximal set of items ommon to all the obje ts ontaining l, i.e. the interse tion of these obje ts.Usingthis losureoperator,wede nethefrequent loseditemsets that onstituteaminimalnon-redundantgeneratingsetforallfrequentitemsetsand theirsupports,andthusforallasso iationrules,theirsupportsandtheir on -den es.Thisproperty omesfromthefa tsthatthesupportofafrequentitemset isequaltothesupportofits losureandthatthemaximalfrequentitemsetsare maximalfrequent loseditemsets[PBTL98℄.

De nition3 (Frequent losed itemsets). A frequent itemset l  I is a frequent loseditemseti (l)=l. The smallest (minimal) losed itemset on-taining anitemsetl is (l),i.e. the losureof l.

Inorder toextra tthefrequent loseditemsets, theClose[PBTL98,PBTL99a℄ and the A-Close [PBTL99b℄ algorithms perform a breadth- rstsear h for the generatorsofthefrequent loseditemsetsin alevelwise manner.

De nition4 (Generators). An itemsetg I is a(minimal) generator of a losed itemset l i (g) = l and g

0  I with g 0  g su h that (g 0 ) = l. A

(6)

the Closealgorithm

TheClose algorithmis aniterativealgorithmfor theextra tion ofallfrequent loseditemsets.It oursesgeneratorsofthefrequent loseditemsetsinalevelwise manner. During the k

th

iteration of the algorithm, a set FCC k

of andidates is onsidered.Ea h element of this set onsists of three elds: a andidate k-generator, its losure (whi h is a andidate losed itemset), and theirsupport (thesupportsofthegeneratorandits losurebeingidenti al).Attheendofthe k

th

iteration,thealgorithmstoresasetFC k

ontainingthefrequentk-generators, their losureswhi harefrequent loseditemsets,and theirsupports.

ThealgorithmstartsbyinitializingthesetFCC 1

ofthe andidate1-generators withthelistofthe1-itemsetsofthe ontextandthen arriesoutsomeiterations. Duringea hiterationk:

1. The losuresofallk-generatorsandtheirsupportsare omputed.This om-putationisbasedonthepropertythat the losureofanitemset isequalto theinterse tionof alltheobje tsin the ontext ontainingit.Thenumber ofthese obje tsprovidesthesupportofthegenerator.Onlyones anofthe ontextis thusne essaryto determine the losures andthesupportsof all thek-generators.

2. Allfrequentk-generators,whi h supportis greateror equalto minsupport, their losures and their supports are inserted in the set FC

k

of frequent loseditemsetsidenti edduringtheiterationk.

3. Theset of andidate(k+1)-generators(usedduring thefollowingiteration) is onstru ted,byjoiningthefrequentk-generatorsinthesetFC

k

asfollows. (a) The andidate (k+1)-generatorsare reatedbyjoining thek-generators

inFC k

thathavethesamek 1 rstitems.Forinstan e,the3-generators fABCg and fABDg will be joined in order to reate the andidate 4-generatorfABCDg.

(b) The andidate(k+1)-generatorsthatareknownto beeither infrequent ornon-minimal, be auseone oftheirsubsetiseitherinfrequentor non-minimal, are then removed. These generatorsare identi ed bythe ab-sen e of at least one their subsets of size k among the frequent k-generatorsofFC

k .

( ) Athirdphaseremovesamongtheremaininggeneratorsthosewhi h lo-sureswere already omputed. Su h a(k+1)-generator g is easily iden-ti ed sin e it is in luded in the losure of afrequent k-generator g

0 in FC k :g 0 g (g 0

)(i.e.itisnotaminimalgenerator).

Thealgorithm stops whennonew andidategenerator an be reated.The A-Closealgorithm,developedinordertoimprovethee e tivenessoftheextra tion in the ase of slightly orrelated data, does not ompute the losures of the andidategeneratorsduringtheiterations,butduring anultimates an arried outaftertheendoftheseiterations.

Example 3. Figure1showstheexe utionoftheClosealgorithmonthe ontextD foraminimalsupportthresholdof2/6.Thealgorithm arriesouttwoiterations,

(7)

S an D

!

Generator Closed Support itemset fAg fACg 3/6 fBg fBEg 5/6 fCg fCg 5/6 fDg fACDg 1/6 fEg fBEg 5/6 Suppressing infrequent itemsets ! FC1

Generator Closed Support itemset fAg fACg 3/6 fBg fBEg 5/6 fCg fCg 5/6 fEg fBEg 5/6 S an D ! FCC 2

Generator Closed Support itemset fABg fABCEg 2/6 fAEg fABCEg 2/6 fBCg fBCEg 4/6 fCEg fBCEg 4/6 Suppressing infrequent itemsets ! FC 2

Generator Closed Support itemset

fABg fABCEg 2/6 fAEg fABCEg 2/6 fBCg fBCEg 4/6 fCEg fBCEg 4/6

Fig.1.Extra tingfrequent loseditemsetsfromDwithCloseforminsupport =2/6.

Experimentalresultsshowedthatthese algorithmsareparti ularly eÆ ientfor miningasso iationrulesfromdenseor orrelateddatathatrepresentan impor-tantpartofreal lifedatabases.

3 Minimal non-redundant asso iation rules

Aspointedoutinexample1,itisdesirablethatonlythenon-redundant asso ia-tionruleswithminimalante edentandmaximal onsequent,i.e.themostuseful andrelevantrules,areextra tedandpresentedtotheuser.Su hrulesare alled minimal non-redundantasso iation rules.

Supportand on den eindi atetherangeandthepre isionoftherule,andthus, mustbetakenintoa ountfor hara terizingtheredundantasso iationrules.In previous works on erning theredu tion of redundantimpli ation rules (fun -tionaldependan ies),su hasthede nitionofthe anoni al over[BB79,Mai80℄, thenotionofnon-redundan y onsideredisrelatedtotheinferen esystemusing Armstrongaxioms[Arm74℄.Thisnotionisnottobe onfusedwiththenotionof non-redundan y we onsider here.Toourknowledge, su h aninferen esystem for asso iation rules, that takes into a ount supports and on den es of the rules,doesnotexist.The prin ipleof minimalnon-redundantasso iationrules asde ned hereafteristo identifythemostinformativeasso iationrules onsid-eringthe fa t that in pra ti e, theuser annot inferall other valid rulesfrom therulesextra tedwhilevisualizingthem.

An asso iation rule is redundant if it onveys the same information { or less generalinformation{thantheinformation onveyedbyanotherruleofthesame usefulnessandthe samerelevan e.An asso iationrule r2E isnon-redundant andminimalifthereisnootherasso iationruler

0

2Ehavingthesamesupport andthe same on den e, ofwhi htheante edentisasubsetoftheante edent

(8)

rule r :l 1

!l 2

is aminimal non-redundant asso iation rule i there does not existanasso iationruler

0 :l 0 1 !l 0 2

withsupport(r)=support(r 0 ), on den e(r) = on den e(r 0 ), l 0 1 l 1 andl 2 l 0 2 .

Basedonthis de nition,we hara terizethegeneri basisforexa tasso iation rulesandtheinformativebasisforapproximateasso iationrules, onstitutedof theminimalnon-redundantexa tandapproximateasso iationrulesrespe tively.

3.1 Generi basis for exa t asso iation rules

The exa t asso iation rules, of the form r : l 1

) (l 2

nl 1

), are rules between twofrequent itemsets l

1 and l

2

whose losures areidenti al: (l 1 ) = (l 2 ). In-deed, from (l 1 )= (l 2 )wededu e thatl 1 l 2 andsupport(l 1 )=support(l 2 ), andthus onfiden e(r)=1.Sin ethemaximumitemset amongthese itemsets (whi hhavesamesupports)istheitemset (l

2

),allsupersetsofl 1

thatare sub-setsof (l

2

)havethesamesupport,andtherulesbetweentwooftheseitemsets areexa trules.

Let G (l

2 )

be the set of generators of the frequent losed itemset (l 2

). By de nition,theminimalitemsetsthataresupersetsofl

1

andaresubsetsof (l 2

) arethegeneratorsg2G

(l2)

.Wethus on ludethatrulesoftheformg)( (l 2

)n g)between generatorsg2 G

(l2)

andthe frequent losed itemset (l 2

) arethe rulesofminimalante edentsandmaximal onsequentsamongtherulesbetween thesupersetsofl

1

andthesubsetsof (l 2

). Thegeneralizationofthis property to the set offrequent loseditemsets de nesthe generi basis onsistingof all non-redundant exa t asso iation rules with minimal ante edents and maximal onsequents,as hara terizedinde nition 5.

De nition6 (Generi basis for exa t asso iation rules). LetFC bethe set of frequent loseditemsetsextra tedfromthe ontext and, for ea h frequent loseditemsetf,letdenote G

f

the setofgeneratorsoff. The generi basisfor exa t asso iation rulesis:

GB=fr:g)(fng)j f 2FC ^ g2G f

^ g6=fg:

The ondition g 6= f ensures that rules of the form g ) ? that are non-informativearedis arded.Thefollowingpropositionstatesthatthegeneri basis doesnotleadtoanylossofinformation.

Proposition1. (i) All valid exa t asso iation rules, their supports and their on den es(thatareequalsto100%) anbededu edfromtherulesofthegeneri basis andtheirs supports.(ii) The generi basis for exa tasso iation rules on-tainsonly minimalnon redundant-rules.

Proof. Let r: l 1 )(l 2 nl 1

)be avalid exa tasso iationrule between two fre-quentitemsetswithl

1 l

2

.Sin e onfiden e(r)=100%wehavesupport(l 1

)= support(l

2

).Giventhepropertythatthesupportofanitemsetisequaltothe sup-portof its losure,wededu ethat support( (l ))=support( (l ))=) (l )=

(9)

2

there exists a rule r 0

: g ) (f ng) 2 GB su h that g is a generatorof f for whi h g  l

1

and g  l 2

. We show that the rule r and its support an be dedu ed from the rule r

0

and its support. Sin e g  l 1

 l 2

 f, the rule r an be derived from the rule r

0 . From (l 1 ) = (l 2 ) = f, we dedu e that support(r)=support(l 2 )=support( (l 2 ))=support(f)=support(r 0 ). 

Algorithmfor onstru ting thegeneri basis

Thepseudo- odeoftheGen-GBalgorithmfor onstru tingthegeneri basisfor exa tasso iationrulesusingthefrequent loseditemsetsandtheirgeneratorsis presentedinalgorithm1.Ea helementofasetFC

k

onsists ofthree elements: generator, losureandsupport.

Algorithm1Constru tingthegeneri basiswithGen-GB.

Input :setsFCk ofk-groupsoffrequentk-generators; Output:setGB ofexa tasso iationrules ofthegeneri basis; 1) GB fg

2) forallsetFCk2FCdo begin 3) forallk-generatorg2FC k

su hthatg6= (g)dobegin 4) GB GB[f(r:g)( (g)ng); (g):support)g; 5) end

6) end 7) returnGB;

The algorithm starts by initializing the set GB with the empty set (step 1). Ea hsetFC

k

offrequentk-groupsisthenexaminedsu essively(steps2to6). Forea hk-generatorg2FC

k

ofthefrequent loseditemset (g)forwhi hg is di erent fromits losure (g)(steps 3to 5),the ruler:g )( (g)ng),whose supportisequalto thesupportofg and (g),isinsertedintoGB (step4).The algorithmreturns nallythesetGB ontainingallminimalnon-redundantexa t asso iationrulesbetweengeneratorsandtheir losures(step7).

Example 4. Thegeneri basisforexa tasso iationrulesextra tedfromthe on-textDforaminimalsupportthresholdof2/6ispresentedinTable2.It ontains sevenruleswhereasfourteenexa tasso iationrulesarevalid onthewhole.

3.2 Informative basis forapproximate asso iation rules

Ea happroximateasso iationrulel 1

!(l 2

nl 1

),isarulebetweentwofrequent itemsets l

1 and l

2

su h that the losure of l 1

is a subset of the losure of l 2 : (l 1 ) (l 2

). The non-redundant approximate asso iation rules with minimal ante edentl

1

andmaximal onsequent(l 2

nl 1

)arededu edfromthis hara ter-isation.

Letf 1

bethefrequent loseditemsetwhi histhe losureofl 1 ,andg 1 agenerator of f 1 su h as g 1 l 1 f 1 . Let f 2

bethe frequent losed itemset whi h is the losureofl andg ageneratoroff su hasg l f .Theruleg )(f ng )

(10)

fAg fACg A)C 3/6 fBg fBEg B)E 5/6 fCg fCg fEg fBEg E)B 5/6 fABg fABCEg AB)CE 2/6 fAEg fABCEg AE)BC 2/6 fBCg fBCEg BC)E 4/6 fCEg fBCEg CE)B 4/6

Table 2. Generi basis for exa t asso iation rules extra ted from D for minsup-port =2/6.

betweenthegeneratorg 1

andthefrequent loseditemsetf 2

istheminimal non-redundantruleamongtherulesbetweenanitemsetoftheinterval

2 [g 1 ;f 1 ℄and anitemsetoftheinterval[g

2 ;f

2

℄.Indeed,thegeneratorg 1

istheminimalitemset whose losureisf

1

,whi hmeansthattheante edentg 1

isminimalandthatthe onsequent(f 2 ng 1 ) ismaximal sin ef 2

is themaximalitemset of theinterval [g

2 ;f

2

℄. The generalizationof this property to the set of allrules between two itemsetsl

1 andl

2

de nestheinformativebasiswhi hthus onsistsofallthe non-redundant approximate asso iationrules of minimal ante edents and maximal onsequents hara terizedin de nition5.

De nition7 (Informativebasisforapproximateasso iationrules).Let FC be the setof frequent loseditemsetsand letdenote Gthe set oftheir gen-eratorsextra tedfromthe ontext. The informative basis for approximate asso- iation rules is:

IB=fr:g!(fng)jf 2FC ^ g2G ^ (g)fg:

Proposition2. (i)Allvalid approximate asso iation rules,their supportsand on den es, anbededu edfromtherulesoftheinformativebasis,theirsupports and theirs on den es. (ii) All rules in the informative basis re minimal non-redundantapproximate asso iation rules.

Proof. Let r : l 1 ! (l 2 nl 1

) be a valid approximate asso iation rule between two frequent itemsets with l

1  l

2

. Sin e onfiden e(r) < 1 we also have (l

1 ) (l

2

).Foranyfrequentitemsetsl 1 andl 2 ,thereisageneratorg 1 su hthat g 1 l 1  (l 1 )= (g 1 )and ageneratorg 2 su h that g 2 l 2  (l 2 )= (g 2 ). Sin el 1 l 2 ,wehavel 1  (g 1 )l 2  (g 2

)andtheruler 0 :g 1 !( (g 2 )ng 1 ) belongsto the informativebasisIB. Weshowthat theruler, its support and its on den e an bededu ed from the rule r

0

, its support and its on den e. Sin e g 1  l 1  (g 1 )  g 2  l 2  (g 2

), the ante edent and the onse-quentof r an berebuilt starting from therule r

0 . Moreover,wehave (l 2 )= (g 2

)and thus support(r)=support(l 2 )=support( (g 2 ))=support(r 0 ).Sin e g 1 l 1  (g 1 ), wehavesupport(g 1 )=support(l 1

)and wethus dedu ethat: onfiden e(r) = support(l

2 )=support(l 1 ) = support( (g 2 ))=support(g 1 ) = onfiden e(r 0 ):  2

(11)

sitiveredu tionoftheinformativebasisthatisitselfabasisforallapproximate asso iation rules.We notel

1 ll 2 ifthe itemset l 1 is animmediate prede essor of the itemset l 2 , i.e. l 3 su h that l 1  l 3  l 2

. The transitive rules of the informativebasis are of theform r : g ! (fng) for afrequent losed itemset f and afrequentgeneratorg su hthat (g)f and (g)is notanimmediate prede essoroffinFC: (g)lf6 .Thetransitiveredu tionoftheinformativebasis thus ontainstheruleswiththeformr:g!(fng)forafrequent loseditemset f andafrequentgeneratorgsu has (g)lf.

De nition8 (Transitiveredu tion oftheinformativebasis).LetFC be the set of frequent losed itemsets and let denote G the set of their generators extra tedfrom the ontext. Thetransitiveredu tion ofthe informative basis for approximate asso iation rulesis:

R I=fr:g!(fng)j f 2FC ^ g2G ^ (g)lfg:

Obviously, it is possible to dedu e all the asso iation rules of the informative basiswith theirsupports andtheir on den es, and thus allthevalid approxi-mate rules,from therulesof thistransitiveredu tion,theirsupports andtheir on den es.Thisredu tionmakesitpossibletode reasethenumberof approx-imaterulesextra tedbypreserving theruleswhi h on den esarethehighest (sin e thetransitive rules have on den es lowerthan thenon-transitiverules by onstru tion)withoutlosinganyinformation.

Constru ting the transitiveredu tion of the informativebasis

Thepseudo odeoftheGen-RIalgorithmfor onstru tingthetransitive redu -tionoftheinformativebasisfortheapproximateasso iationrulesusingtheset offrequent loseditemsetsandtheirgeneratorsispresentedinalgorithm2.Ea h elementofasetFC

k

onsistsofthree elds:generator, losureandsupport.The algorithm onstru tsforea hgeneratorg onsideredasetSu

g

ontainingthe frequent loseditemsetsthatareimmediate su essorsofthe losureofg. ThealgorithmstartsbyinitializingthesetR Iwiththeemptyset(step1).Ea h set FC

k

of frequent k-groups is then examined su essively in the in reasing order of the values of k (steps 2 to 14). For ea h k-generator g 2 FC

k of the frequent losed itemset (g) (steps 3to 18),the set Su

g

of thesu essorsof the losure of (g)isinitialized withthe empty set (step 4)and thesets S

j of frequent losed j-itemsets that are supersets of (g) for j (g)j < j  

3 are onstru ted (steps 5 to 7). The sets S

j

are then onsidered in the as ending order of the values of j (steps 8 to 17). For ea h itemset f 2 S

j

that is not a superset of an immediate su essorof (g) in Su

g

(step 10), f is inserted in Su

g

(step 11) and the on den e of therule r : g ! (fng) is omputed (step 12). If the on den e of r is greater or equal to the minimal on den e thresholdmin on den e,theruler isinsertedin R I (steps13to 15).Whenall thegeneratorsofsize lowerthanhavebeen onsidered,thealgorithmreturns thesetR I (step 20).

(12)

Gen-RI.

Input :setsFC k

ofk-groupsoffrequentk-generators;min on den e threshold; Output:Transitiveredu tionoftheinformativebasisR I;

1) R I fg

2) for(k 1;k-1;k++)dobegin 3) forallk-generatorg2FCkdo begin 4) Su g fg;

5) for(j=j (g)j;j;j++)dobegin 6) Sj ff 2FCjf  (g) ^ jfj=jg;

7) end

8) for(j=j (g)j;j;j++)dobegin 9) forallfrequent loseditemsetf2S

j

dobegin 10) if(s2Su g jsf)thendobegin

11) Su g Su g[f;

12) r: on den e f:support=g:support; 13) if(r: on den emin on den e)

14) thenR I R I[fr:g!(fng);r: on den e;f:supportg;

15) endif 16) end 17) end 18) end 19) end 20) returnR I;

Example 5. The transitive redu tion of the informative basis for approximate asso iationrulesextra tedfromthe ontextDforaminimalsupportthreshold of 2/6 and a minimal on den e threshold of 3/6 is presented in Table 3. It ontainssevenrules,versustenrulesintheinformativebasis,whereasthirtysix approximate asso iationrulesare validonthewhole.

4 Experimental results

Weusedthefourfollowingdatasetsduring theseexperiments:

{ T20I6D100K 4

, madeupof syntheti data builta ordingtothe properties ofsalesdata,whi h ontains100,000obje tswithanaveragesizeof20items andanaveragesizeofthepotentialmaximalfrequentitemsetsofsixitems. { Mushrooms,that onsistsof8,416obje tsofanaveragesizeof23attributes (23itemsbyobje tsand127itemsonthewhole)des ribing hara teristi s ofmushrooms.

{ C20D10KandC73D10K 5

whi haresamplesofthe lePubli UseMi rodata Samples ontainingdataof the ensusofKansas arriedoutin 1990.They onsistof10,000obje ts orrespondingtothe rst10,000listedpeople,ea h 4

http://www.almaden.ibm. om/ s/quest/syndata.html 5

(13)

fAg fACg fABCEg A !BCE 2/6 2/3 fBg fBEg fBCEg B!CE 4/6 4/5 fBg fBEg fABCEg fCg fCg fACg C!A 3/6 3/5 fCg fCg fBCEg C!BE 4/6 4/5 fCg fCg fABCEg

fEg fBEg fBCEg E!BC 4/6 4/5

fEg fBEg fABCEg

fABg fABCEg fAEg fABCEg

fBCg fBCEg fABCEg BC!AE 2/6 2/4

fCEg fBCEg fABCEg CE !AB 2/6 2/4

Table 3. Transitive redu tion of the informative basis for approximate asso iation rulesextra tedfromDforminsupport=2/6andmin on den e =3/6.

obje t ontaining20 attributes (20 items byobje tsand 386 itemson the whole)forC20D10Kand73attributes(73itemsbyobje tsand2,178items onthewhole)forC73D10K.

Exe utiontimes (notpresentedhere) ofthegeneration of thebases,as forthe generation of all valid asso iation rules, are negligible ompared to exe ution timesofthefrequent( losed)itemsetsextra tion.

Numberofexa t asso iation rulesextra ted. Thetotalnumberofvalid exa t asso iation rules and the number of rules in the generi basis are pre-sentedin Table4. Noexa tasso iationrule is extra ted from T10I4D100Kas forthissupportthresholdallthefrequentitemsetsarefrequent loseditemsets: they all havedi erent supports and are thus themselves their unique genera-tor. Consequently, there exists no rule of the form l

1 ) (l 2 nl 1 ) between two frequent itemsets whose losuresare identi al: (l

1 )= (l

2

)that are the valid exa tasso iationrules.

Dataset Minsupport Exa trules Generi basis

T10I4D100K 0.5% 0 0

Mushrooms 30% 7,476 543

C20D10K 50% 2,277 457

C73D10K 90% 52,035 1,369

Table4.Numberofexa tasso iation rulesextra ted.

For the three other datasets,made up of denseand orrelated data, the total numberof valid exa t rulesvaries from more than2,000 to more than52,000, whi his onsiderableandmakesitdiÆ ulttodis overinterestingrelationships. Thegeneri basis representsasigni antredu tionofthe numberof extra ted rules(byafa torvaryingfrom12to50)andsin eitdoesnotrepresentanyloss ofinformation,itbringsaknowledgethatis omplete,relevantandeasilyusable fromthepointofviewoftheuser.

(14)

validapproximateasso iationrulesisforthefourdatasetsverysigni antsin e it varies of almost 20,000rules for T20I6D100K to more than 2,000,000 rules forC73D10K.Itisthusessentialtoredu etheset ofextra tedrulesinorderto makeitusablebytheuser.ForT20I6D100K,thisbasisrepresentsadivisionby afa tor of5approximatelyof thenumberof extra tedapproximaterules. For Mushrooms,C20D10K,andC73D10K,thetotalnumberofvalidapproximate asso iationrulesismu hmoreimportantthanforthesyntheti datasin ethese dataaredenseand orrelatedandthusthenumberoffrequentitemsetsismu h higher. As a onsequen e, it is the same forthe number of valid approximate rules. The proportion of frequent losed itemsets among the frequent itemsets beingweak,theredu tionoftheinformativebasisforapproximaterulesmakesit possibletoredu e onsiderably(byafa torvaryingfrom40to500)thenumber ofextra tedrules.

Dataset Min on den e Approximate Informative (Minsupport) Rules basisredu tion

T10I4D100K 70% 20, 419 4,004 (0.5%) 30% 22, 952 4,519 Mushrooms 70% 37, 671 1,221 (30%) 30% 71, 412 1,578 C20D10K 70% 89, 601 1,957 (50%) 30% 116,791 1,957 C73D10K 90% 2,053,896 5,718 (90%) 80% 2,053,936 5,718 Table5.Numberofapproximateasso iation rulesextra ted.

Comparingrulesin thegeneri basisandtheredu tionoftheinformativebasis to all valid rules, we he ked that these bases do not ontain any redundant rules. Considering theexample presented in the se tion 1 on erning the nine approximate rulesextra tedthedatasetMushrooms,onlythe6

th

ruleis gen-eratedamongtheseninerulesinthebases.Indeed,theitemsetsffreegillsgand ffree gills,eatable,partial veil,white veilg are two frequent losed itemsets of whi hthe rstisanimmediateprede essorofthese ondandtheyaretheonly frequent loseditemsets oftheinterval[?,ffreegills,eatable,partialveil,white veilg℄.Moreover,thefrequent losed itemset ffree gillsg beingitself itsunique generator,therule6belongstothetransitiveredu tionoftheinformativebasis: itistheminimalnon-redundantruleamongthesenine rules.

5 Con lusion

Using thefrequent losed itemsets and theirgeneratorsextra ted by the algo-rithms Closeor A-Close, wede ne thegeneri basisfor exa tasso iationrules andthetransitiveredu tionoftheinformativebasisforapproximateasso iation rules.The unionofthese basesprovidesanon-redundantgeneratingset forall

(15)

onsequent)and doesnot representanyloss of information: from the point of viewoftheuser,these rulesarethemostusefulandthemostrelevant asso ia-tionrules. All the information onveyedby theset of valid asso iationrules is also onveyed by the union of these twobases.Twoalgorithms for generating thegeneri basisandthetransitiveredu tionoftheinformativebasisusingthe frequent losed itemsets and their generators, are also presented. These bases arealsoofstronginterestfor:

{ Thevisualizationoftheextra tedrulessin etheredu ednumberofrulesin thesebases,aswellasthedistin tionoftheexa tandtheapproximaterules, fa ilitate the presentation of the rules to the user. Moreover, the absen e of redundant rules in the bases and the generation of the minimal non-redundantrulesareofsigni antinterestfromthepointofviewoftheuser [KMR

+ 94℄.

{ Theidenti ationoftheminimalnon-redundantasso iationrulesamongthe setofvalidasso iationrulesextra ted,usingDe nition5.Itisthuspossible toextend anexisting implementationfor extra tingasso iationrules or to integrate this method in the visualization system in order to present the minimalnon-redundantasso iationrulestotheuser.

{ Thedataanalysisandtheformal on eptanalysissin etheydonotrepresent anylossofinformation omparedtothesetofvalidimpli ationrulesandare onstitutedofthemostusefulandrelevantrules.De nition5oftheminimal non-redundant rules being also valid within the framework of global and partial impli ation rules betweenbinary sets of attributes, de nitions 6 of thegeneri basisand7oftheinformativebasisarealsovalidfortheglobal andpartialimpli ationrulesrespe tively.

Moreover,wethinkthatthisapproa his omplementarywithapproa hesfor se-le tingasso iationrulestobevizualised,su hastemplatesanditem onstraints, that helptheusermanagingtheresult.

Aspointedoutinse tion3,uptonow,theredoesnotexistanyinferen esystem with ompletenessandsoundnessproperties,forinferringasso iationrulesthat takes into a ount supports and on den es of the rules. We think that the de nition of su h aninferen e system,equivalentto theArmstrongaxiomsfor impli ations, onstitutesaninterestingperspe tiveoffuture work.

Referen es

[AIS93℄ R. Agrawal, T. Imielinski,and A. Swami. Mining asso iation rules between setsofitemsinlargedatabases. Pro .SIGMOD onf.,pp207{216,May1993. [AS94℄ R. Agrawal and R. Srikant. Fast algorithms for mining asso iation rules in

largedatabases. Pro .VLDB onf.,pp478{499,September1994.

[Arm74℄ W.W.Armstrong. Dependen ystru turesofdatabaserelationships. Pro . IFIP ongress,pp580{583,August1974.

(16)

Conferen e,pp145{154,August1999.

[BB79℄ C.Beeri, P. A. Bernstein. Computationalproblemsrelated to thedesign of normal form relational s hemas. Transa tions on Database Systems, 4(1):30{59, 1979.

[BMS97℄ S.Brin,R.Motwani,andC.Silverstein.Beyondmarketbaskets:Generalizing asso iation rulesto orrelation. Pro .SIGMOD onf.,pp265{276,May1997. [DG86℄ V.DuquenneandJ.-L.Guigues.Familleminimaled'impli ationsinformatives

resultant d'untableaudedonneesbinaires. Mathematiqueset S ien esHumaines, 24(95):5{18, 1986.

[GW99℄ B.GanterandR.Wille. FormalCon eptAnalysis:Mathemati alfoundations. Springer,1999.

[HF95℄ J. Han and Y. Fu. Dis overy of multiple-level asso iation rules from large databases. Pro .VLDB onf.,pp420{431,September1995.

[He 96℄ D. He kerman. Bayesian networks for knowledge dis overy. Advan es in KnowledgeDis overyandDataMining,pp273{305,1996.

[KMR +

94℄ M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Findinginterestingrules fromlarge setsofdis overed asso iation rules. Pro .CIKM onf.,pp401{407,November1994.

[Lux91℄ M. Luxenburger. Impli ations partielles dans un ontexte. Mathematiques, Informatique etS ien esHumaines,29(113):35{55,1991.

[Mai80℄ D.Maier.Minimum oversinrelationaldatabasemodel.JournaloftheACM, 27(4):664{674,1980.

[NLHP98℄ R. T.Ng, V. S.Lakshmanan,J. Han,and A. Pang. Exploratory mining and pruningoptimizations of onstrainedasso iation rules. Pro .SIGMOD onf., pp13{24,June1998.

[PBTL98℄ N.Pasquier,Y.Bastide,R.Taouil,andL.Lakhal. Pruning loseditemset latti esforasso iation rules. Pro .BDA onf.,pp177{196,O tobre1998.

[PBTL99a℄ N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. EÆ ient mining of asso iation rules using losed itemset latti es. Information Systems, 24(1):25{46, 1999.

[PBTL99b℄ N.Pasquier, Y. Bastide,R.Taouil, andL. Lakhal. Dis overingfrequent losed itemsets for asso iation rules. Pro . ICDT onf.,LNCS 1540, pp 398{416, January1999.

[PBTL99 ℄ N.Pasquier, Y. Bastide,R.Taouil,andL. Lakhal. Closedsetbased dis- overyofsmall oversforasso iationrules. Pro .BDA onf.,pp361{381,O tobre 1999.

[PSM94℄ G.Piatetsky-ShapiroandC.J.Matheus. Theinterestingnessofdeviations. AAAIKDD workshop,pp25{36,July1994.

[ST96℄ A. Silbers hatzand A. Tuzhilin. Whatmakespatterns interesting in knowl-edge dis overysystems. IEEE Transa tions on Knowledgeand DataEngineering, 8(6):970{974, De ember1996.

[SBM98℄ C.Silverstein,S.Brin,andR.Motwani. Beyondmarketbaskets: Generaliz-ing asso iation rulestodependen erules. DataMiningandKnowledgeDis overy, 2(1):39{68, January1998.

[SA95℄ R.SrikantandR.Agrawal. Mininggeneralizedasso iationrules. Pro .VLDB onf.,pp407{419,September1995.

[SVA97℄ R.Srikant,Q.Vu,andR.Agrawal. Mining asso iationrules withitem on-straints.Pro .KDD onf.,pp67{73,August1997.

Figure

Fig. 1. Extrating frequent losed itemsets from D with Close for minsupport = 2/6.
Table 2. Generi basis for exat assoiation rules extrated from D for minsup-
Table 3. Transitive redution of the informative basis for approximate assoiation
Table 5. Number of approximate assoiation rules extrated.

Références

Documents relatifs

topic observations show a signi ficant depletion of 13 C in the atmosphere (∼−0.12‰ in seven years), suggest- ing that increases in methane emissions after 2006 are primarily

L'étude de ces systèmes visait à démontrer la possibilité d'utiliser les couches colonnaires de GeMn comme un matériau fonctionnel, dont il est possible de modier les

The paper estab- lishes a surprisingly simple characterization of Palm distributions for such a process: The reduced n-point Palm distribution is for any n ≥ 1 itself a log Gaussian

In places where remote communication already takes place, touch devices can allow people to further their range of communication by multiplexing existing communication

Negotiations to remove tariffs on Environmental Goods (EGs) and Environmental Services (ESs) at the Doha Round, followed by negotiations towards an Environmental Goods Agreement

In our study, we investigated the role of vascular access stenosis and systemic vascular risk factors (inflammation and especially mineral metabolism indexes)

Beyond the ap- parent complexity and unruly nature of contemporary transnational rule making, we search for those structuring dimensions and regularities that frame