HAL Id: hal-00363015
https://hal.archives-ouvertes.fr/hal-00363015
Submitted on 26 Apr 2010
Generating a condensed representation for association rules
Nicolas Pasquier, Rafik Taouil, Yves Bastide, Gerd Stumme, Lotfi Lakhal
To cite this version:
Nicolas Pasquier, Rafik Taouil, Yves Bastide, Gerd Stumme, Lotfi Lakhal. Generating a condensed representation for association rules. Journal of Intelligent Information Systems, Springer Verlag, 2005, 24 (1), pp.29-60. 10.1007/s10844-005-0266-z. hal-00363015
Nicolas Pasquier (nicolas.pasquier@unice.fr)
I3S (CNRS UMR 6070) - Université de Nice-Sophia Antipolis, 06903 Sophia Antipolis, France
Rafik Taouil (taouil@univ-tours.fr)
LI - Université François Rabelais de Tours, 3 place Jean Jaurès, 41000 Blois, France
Yves Bastide (yves.bastide@irisa.fr)
IRISA-INRIA Rennes, Campus universitaire de Beaulieu, 35042 Rennes, France
Gerd Stumme (stumme@uni-kassel.de)
Fachbereich Mathematik/Informatik, Universität Kassel, 34121 Kassel, Germany
Lotfi Lakhal (lotfi.lakhal@lim.univ-mrs.fr)
LIM (CNRS FRE 2246) - Université de la Méditerranée, case 901, 13288 Marseille, France
Abstract. Association rule extraction from operational datasets often produces several tens of thousands, and even millions, of association rules. Moreover, many of these rules are redundant and thus useless. Using a semantic based on the closure of the Galois connection, we define a condensed representation for association rules. This representation is characterized by frequent closed itemsets and their generators. It contains the non-redundant association rules having minimal antecedent and maximal consequent, called min-max association rules. We think that these rules are the most relevant since they are the most general non-redundant association rules. Furthermore, this representation is a basis, i.e., a generating set for all association rules, their supports and their confidences, and all of them can be retrieved without accessing the data. We introduce algorithms for extracting this basis and for reconstructing all association rules. Results of experiments carried out on real datasets show the usefulness of this approach. In order to generate this basis when an algorithm for extracting frequent itemsets, such as Apriori for instance, is used, we also present an algorithm for deriving frequent closed itemsets and their generators from frequent itemsets without using the dataset.

Keywords: Data mining, Galois closure operator, frequent closed itemsets, generators, min-max association rules, basis for association rules, condensed representation.
1. Introduction

The purpose of association rule extraction, introduced in (Agrawal et al., 1993), is to discover significant relations between binary attributes, called items, in large datasets. An example of association rule extracted from a dataset of supermarket sales is: 'cereals ∧ sugar → milk (support=7%, confidence=67%)'. This rule states that customers who buy cereals and sugar also tend to buy milk. The support measure defines the range of the rule, i.e., the proportion of customers who bought the three items among all customers. The confidence measure defines the precision of the rule, i.e., the proportion of customers who bought milk among those who bought cereals and sugar. Only rules with support and confidence above some minimal support and confidence thresholds, defined by the analyst according to the application, are extracted.
The extraction is carried out in two phases:
1. Extracting frequent itemsets and their support from the dataset. Frequent itemsets are sets of items contained in a proportion of objects above the minimum support threshold.
2. Generating association rules from frequent itemsets and supports. Only rules with confidence above the minimum confidence threshold are generated.
The first phase is the most computationally intensive, since the number of potential frequent itemsets is exponential in the size of the set of items and several dataset scans, very expensive in execution time, are required to count their supports. Classical approaches can be classified into three main trends. Approaches in the first trend are based on the levelwise extraction of frequent itemsets (Agrawal and Srikant, 1994; Mannila et al., 1994). That is a breadth-first exploration of the search space where all potential frequent itemsets of a given size are considered simultaneously (Mannila and Toivonen, 1997). These approaches are efficient for mining association rules from weakly correlated data, such as market basket data, but performance drastically decreases when data are dense or correlated, such as statistical data for instance. Approaches in the second trend are based on the extraction of maximal frequent itemsets (Bayardo, 1998; Lin and Kedem, 1998; Zaki et al., 1997) to improve the efficiency. Once all maximal frequent itemsets are extracted, all frequent itemsets are derived and their supports are counted in the dataset. In the third trend, approaches are based on the extraction of frequent closed itemsets (Pasquier et al., 1998; Zaki and Ogihara, 1998) defined using the Galois closure operator. These approaches first extract all frequent closed itemsets and then both frequent itemsets and their supports are derived from them, without dataset access. In the case of dense or correlated data, there are much fewer frequent closed itemsets than frequent itemsets and thus these approaches improve the extraction efficiency compared to approaches in the first trend. Compared to approaches in the second trend, approaches based on frequent closed itemsets can be more efficient in the case of correlated data due to the cost of generating all subsets of the maximal frequent itemsets and counting their support in the dataset.
Another major research topic in data mining is the problem of relevance and usefulness of extracted association rules. This problem is related to the number of extracted rules, which is most often very large, and to the important proportion of redundant rules, i.e. rules bringing the same information, among them. This problem becomes crucial when data are dense or correlated, such as statistical data, telecommunication data or nominative market basket data (Bayardo et al., 2000; Brin et al., 1997; Silverstein et al., 1998). For instance, using a census dataset sample constituted of 10,000 objects, each one containing values of 73 binary attributes, more than 2,000,000 association rules with support and confidence above 90% were extracted. The analyst is then confronted with the following problems: How to handle such a list of association rules? Is it possible to reduce its size without losing information? Moreover, the inspection of extracted association rules showed that redundant rules represent the majority of them. Their suppression will thus considerably reduce the number of rules to be handled by the analyst. In the previous example, this suppression reduced the number of rules to a few thousand. In addition, redundant rules can be misleading, as discussed in example 1. Thus, the following question arises: How to reduce extracted association rules to a smaller list containing only non-redundant association rules?
Example 1. To illustrate the problem of redundant association rules, we present nine rules extracted from the Mushrooms dataset describing characteristics of 8416 mushrooms (Blake and Merz, 1998) in table I. These rules have identical supports and confidences, of 51% and 54% respectively, and the item free gills in the antecedent.

Table I. Redundant association rules.
1) free gills → edible
2) free gills → edible, partial veil
3) free gills → edible, white veil
4) free gills → edible, partial veil, white veil
5) free gills, partial veil → edible
6) free gills, partial veil → edible, white veil
7) free gills, white veil → edible
8) free gills, white veil → edible, partial veil
9) free gills, partial veil, white veil → edible

Obviously, rules 1 to 3 and 5 to 9 do not add any information to rule 4 since all these rules have identical supports and confidences. We thus say that these rules are redundant compared to rule 4, the most relevant from the analyst's point of view, for it summarizes the nine rules. This rule has a minimal antecedent (left-hand side) and a maximal consequent (right-hand side) among the nine rules. Moreover, examining only one of the eight other rules, say for instance rule 9, the analyst will believe that a mushroom has a 54% chance of being edible if it has free gills and a partial white veil. As a matter of fact, it has a 54% chance of being edible and having a partial white veil if it has free gills. Redundant rules can therefore be misleading and cause misinterpretations of the results. We believe that extracting only rule 4 will improve the result relevance.
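The comparison made in example 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nine rules of table I are encoded by hand as (antecedent, consequent) pairs, the names `RULES` and `is_redundant` are hypothetical, and the equality of supports and confidences is taken for granted rather than checked.

```python
# The nine rules of Table I as (antecedent, consequent) pairs.
# All nine are assumed to share the same support and confidence.
FG, PV, WV, ED = "free gills", "partial veil", "white veil", "edible"
RULES = {
    1: (frozenset({FG}), frozenset({ED})),
    2: (frozenset({FG}), frozenset({ED, PV})),
    3: (frozenset({FG}), frozenset({ED, WV})),
    4: (frozenset({FG}), frozenset({ED, PV, WV})),
    5: (frozenset({FG, PV}), frozenset({ED})),
    6: (frozenset({FG, PV}), frozenset({ED, WV})),
    7: (frozenset({FG, WV}), frozenset({ED})),
    8: (frozenset({FG, WV}), frozenset({ED, PV})),
    9: (frozenset({FG, PV, WV}), frozenset({ED})),
}

def is_redundant(rule, others):
    """True if another rule has an antecedent included in this rule's
    antecedent and a consequent including this rule's consequent."""
    antecedent, consequent = rule
    return any(
        other != rule and other[0] <= antecedent and other[1] >= consequent
        for other in others
    )

non_redundant = [n for n, r in RULES.items() if not is_redundant(r, RULES.values())]
print(non_redundant)  # [4]
```

Only rule 4 survives, which matches the observation that it has the minimal antecedent and the maximal consequent among the nine rules.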
In the rest of the paper, we differentiate exact association rules, noted $l \Rightarrow l'$, that have a 100% confidence, and approximate association rules, noted $l \rightarrow l'$, that have a confidence lower than 100%. Exact association rules are valid for all objects in the dataset whereas approximate association rules are valid for a proportion of objects equal to their confidence.
1.1. Related Work

Approaches addressing this issue can be classified into three main trends. Approaches in the first trend provide mechanisms for filtering extracted association rules. In the two other trends, approaches extend the definition of association rules in order not to extract similar ones.

Approaches in the first trend allow the analyst to define some templates (Baralis and Psaila, 1997; Klemettinen et al., 1994), boolean operators (Bayardo et al., 2000; Ng et al., 1998; Srikant et al., 1997) or SQL-like operators (Meo et al., 1998) [...] boolean operators are coupled with further measures of usefulness of the rules. By selecting a subset of all extracted association rules, these approaches reduce the number of rules to handle during the visualization, but redundancies are not suppressed.
In the second trend, some approaches use a taxonomy of items to extract generalized association rules (Han and Fu, 1999; Srikant and Agrawal, 1995), i.e., association rules between sets of items that belong to different levels of the taxonomy. Some approaches use statistical measures, such as Pearson's correlation or the $\chi^2$ test for instance, instead of the confidence to determine the precision of the rule (Brin et al., 1997; Morimoto et al., 1998; Silverstein et al., 1998). Other approaches in this trend extract only rules with maximal antecedents among those with the same supports and the same consequents (Srikant and Agrawal, 1996; Toivonen et al., 1995). That is, a rule $r$ will be pruned if another rule $r'$ has the same consequent and an antecedent that is a superset of the one of $r$. In example 1, rules 4, 6, 8 and 9 have maximal antecedents and will be extracted. Finally, the approach proposed in (Bayardo and Agrawal, 1999) identifies optimal rules according to several interestingness metrics (confidence, conviction, lift, Laplace, gain, etc.) and a partial order on the rules.
Approaches in the third trend make use of the closure of the Galois connection to extract bases, or reduced covers, for association rules. Informally, a basis is a non-redundant set that is minimal according to some mathematical property and from which all association rules are deducible, with support and confidence, without accessing the dataset. These bases are adaptations of the Duquenne-Guigues basis for global implications (Duquenne and Guigues, 1986; Ganter and Wille, 1999) and the Luxenburger basis for partial implications (Luxenburger, 1991). They were introduced in Formal Concept Analysis and their adaptation to the association rule framework is studied in (Pasquier et al., 1999c; Taouil et al., 2000; Zaki, 2000). In the Duquenne-Guigues basis for exact association rules, antecedents of rules are frequent pseudo-closed itemsets and consequents are frequent closed itemsets. In the Luxenburger basis for approximate association rules, both antecedents and consequents are frequent closed itemsets: We select approximate rules with both a maximal antecedent and a maximal consequent among rules having identical supports and confidences. In example 1, rule 9 will be the only one extracted. The union of the Duquenne-Guigues and the Luxenburger bases is a basis for all association rules. This basis is minimal with respect to the number of rules and, since for most data types there are much fewer frequent closed and pseudo-closed itemsets than there are frequent itemsets, it is very small. However, it does not contain non-redundant rules with minimal antecedent and maximal consequent.
In previous works about the pruning of redundant implication rules (functional dependencies), such as the canonical and the minimum covers definitions (Beeri and Bernstein, 1979; Maier, 1980), redundant rules are defined according to an inference system based on Armstrong axioms (Armstrong, 1974). However, these results cannot be directly applied to the association rule framework since redundant association rules cannot be defined according to this system: Supports and confidences are important information that must be considered to characterize redundant association rules. The idea behind non-redundant association rules as defined hereafter is to identify the most relevant rules, each one bringing the same information as several others.
1.2. Contribution

Our goal is to improve association rules relevance and usefulness by extracting as few rules as possible without losing information. To achieve this, we propose to generate a condensed representation (Mannila and Toivonen, 1996) by maximizing the information brought by each rule. As pointed out in example 1, we believe that the most relevant association rules are the most general² non-redundant rules: Those with minimal antecedent and maximal consequent. Extracting such rules will improve the result usefulness, while reducing its size. Therefore, in the following:
− We define non-redundant association rules with minimal antecedent and maximal consequent, called min-max association rules. These rules are defined using the semantic for association rule extraction based on the Galois closure. Their antecedents and consequents are characterized by frequent closed itemsets and their generators (Pasquier et al., 1998).
− We show that the min-max association rules constitute a basis, called min-max basis for association rules. All association rules can be deduced by generating all the sub-rules of the min-max association rules, considering their supports and confidences.
− We propose efficient algorithms to generate the min-max basis from frequent closed itemsets and their generators, such as extracted by the Close (Pasquier et al., 1998; Pasquier et al., 1999b) and the A-Close (Pasquier et al., 1999a) algorithms. We also introduce algorithms to reconstruct all association rules, or a part of them, from this basis without having to access the data.
− We present the Close+ algorithm that identifies frequent closed itemsets, their generators and their supports among frequent itemsets and their supports. This algorithm is simple and efficient since it does not require any dataset access. It enables the generation of the min-max basis when an algorithm for extracting all frequent itemsets, such as Apriori (Agrawal and Srikant, 1994) for instance, is used.
Extracting min-max association rules minimizes as much as possible the number of rules while keeping the same information in the result: Only the most general non-overlapping association rules are extracted and therefore redundant rules are pruned. Since for many real datasets redundant rules represent the majority of extracted rules, the reduction will almost always be significant. This reduction will be considerable in the case of dense or correlated data for which the total number of rules is very large and most are redundant (Bayardo and Agrawal, 1999; Brin et al., 1997; Silverstein et al., 1998).
² We say that a rule $r : a \rightarrow c$ is more general than a rule $r' : a' \rightarrow c'$ if they have identical supports and confidences, the antecedent $a$ of $r$ is a subset of $a'$ and the consequent $c$ of $r$ is a superset of $c'$. $r'$ is then called a sub-rule of $r$, and $r$ a super-rule of $r'$.
With the min-max basis, the analyst is presented a set of rules covering all the attributes of the dataset: All of the data-space is characterized by the min-max rules, overcoming an important deficiency of most reduction methods where large sub-spaces of the data-space may be poorly characterized or even entirely uncharacterized (Bayardo and Agrawal, 1999). This property helps ensure that rules surprising for the analyst, which are important information (Piatetsky and Matheus, 1994; Silberschatz and Tuzhilin, 1996), will be present. Moreover, the min-max basis does not represent any information loss for the analyst: all information brought by the set of all association rules is brought by the min-max basis. This approach does not suffer from the problem of information loss from the analyst's point of view, which is an important drawback in association rule reduction methods (Liu et al., 1999). If the analyst so wishes, it is also possible to efficiently deduce all other association rules, with supports and confidences, from the min-max basis alone.
1.3. Organization

In section 2, we recall the semantic for association rules based on the Galois connection and the Close algorithm for extracting frequent closed itemsets and generators. We also present the Close+ algorithm for efficiently deriving frequent closed itemsets, their generators and their supports from frequent itemsets and their supports. Min-max association rules and the min-max basis for association rules are defined in section 3. Algorithms for generating this basis are also presented. In section 4, we present simple methods and algorithms for deriving all association rules from the min-max basis. Results of experiments conducted to evaluate the usefulness of this approach are given in section 5 and section 6 concludes the paper.
2. Semantic for association rules based on the Galois connection

The association rule extraction is performed from a data mining context, that is a triplet $\mathcal{D} = (\mathcal{O}, \mathcal{I}, \mathcal{R})$, where $\mathcal{O}$ and $\mathcal{I}$ are finite sets of objects and items respectively, and $\mathcal{R} \subseteq \mathcal{O} \times \mathcal{I}$ is a binary relation. Each couple $(o, i) \in \mathcal{R}$ denotes the fact that the object $o \in \mathcal{O}$ is related to the item $i \in \mathcal{I}$. An itemset $l$ is a set of items $l \subseteq \mathcal{I}$, $l \neq \emptyset$.

Example 2. A data mining context $\mathcal{D}$ constituted of six objects, each one identified by its OID, and five items is represented in table II. This context is used as support for the examples in the rest of the paper.

Table II. Data mining context $\mathcal{D}$.
OID   Items
1     A C D
2     B C E
3     A B C E
4     B E
5     A B C E
6     B C E

The Galois connection of a finite binary relation (Ganter and Wille, 1999) is a couple of applications ($\phi$, $\psi$). $\phi$ associates with a set of objects $O \subseteq \mathcal{O}$ the items related to all objects $o \in O$ and $\psi$ associates with an itemset $l \subseteq \mathcal{I}$ the objects related to all items $i \in l$. When an object $o$ is related to all items $i \in l$, we say that $o$ contains $l$. We denote minsupp and minconf the minimal support and confidence thresholds.

Definition 1. (Frequent itemsets) The support of an itemset $l$ is the proportion of objects in the context containing $l$: $supp(l) = |\psi(l)| / |\mathcal{O}|$. $l$ is a frequent itemset if $supp(l) \geq$ minsupp.

Definition 2. (Association rules) An association rule $r$ is an implication between two frequent itemsets $l_1, l_2 \subseteq \mathcal{I}$ with the form $l_1 \rightarrow (l_2 \setminus l_1)$ where $l_1 \subset l_2$. The support and confidence of $r$ are defined by: $supp(r) = supp(l_2)$, $conf(r) = supp(l_2) / supp(l_1)$.

The closure operator $\gamma = \phi \circ \psi$ associates with an itemset $l$ the maximal set of items common to all the objects containing $l$: The closure of an itemset is equal to the intersection of all the objects containing it. Using this closure operator, we define the frequent closed itemsets.
Definition 3. (Frequent closed itemsets) A frequent itemset $l \subseteq \mathcal{I}$ is a frequent closed itemset iff $\gamma(l) = l$. The minimal closed itemset containing an itemset $l$ is its closure $\gamma(l)$.

The set of frequent closed itemsets and their supports is a minimal non-redundant generating set for all frequent itemsets and their supports, and thus for all association rules, their supports and their confidences. This result relies on the properties that the support of a frequent itemset is equal to the support of its closure and that maximal frequent itemsets are maximal frequent closed itemsets (Pasquier et al., 1998). In order to improve the efficiency of frequent closed itemset extraction, the Close and A-Close algorithms compute generators of frequent closed itemsets.
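To see why the frequent closed itemsets form a generating set, the support of any frequent itemset can be read off its closure, i.e. the smallest closed itemset containing it. The sketch below assumes the frequent closed itemsets of the example context for minsupp = 2/6, with supports stored as object counts out of six; the names `CLOSED` and `derived_support` are hypothetical.

```python
# Frequent closed itemsets of the example context for minsupp = 2/6,
# with supports expressed as numbers of objects (out of 6).
CLOSED = {
    frozenset("C"): 5,
    frozenset("AC"): 3,
    frozenset("BE"): 5,
    frozenset("BCE"): 4,
    frozenset("ABCE"): 2,
}

def derived_support(itemset):
    """Support count of an itemset, derived without accessing the data.

    The closure is the smallest closed superset; since support can only
    decrease when an itemset grows, it is also the closed superset with
    the highest support. Returns None if no frequent closed itemset
    contains the itemset, i.e. if the itemset is not frequent.
    """
    itemset = frozenset(itemset)
    supports = [count for closed, count in CLOSED.items() if itemset <= closed]
    return max(supports, default=None)

print(derived_support("AB"))  # 2
print(derived_support("E"))   # 5
print(derived_support("D"))   # None
```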
Definition 4. (Generators) An itemset $g \subseteq \mathcal{I}$ is a generator of a closed itemset $l$ iff $\gamma(g) = l$ and $\nexists g' \subseteq \mathcal{I}$ with $g' \subset g$ such that $\gamma(g') = l$. A generator of cardinality $k$ is a $k$-generator.

Generators are the minimal itemsets to consider for discovering frequent closed itemsets, by computing their closures. Based on the following lemma, Close and A-Close perform a breadth-first search for generators in a levelwise manner.

Lemma 1. All subsets $s \subseteq \mathcal{I}$ of a generator $g \subseteq \mathcal{I}$ are also generators. The closure of $s$ is a closed subset of the closure of $g$: $\gamma(s) \subset \gamma(g)$.

2.1. Extracting frequent closed itemsets and generators with Close
The Close algorithm is an iterative algorithm for extracting generators and frequent closed itemsets in a levelwise manner. During an iteration $k$, a list of candidate $k$-generators is considered; their closures and their supports are computed from the dataset and infrequent generators are discarded. Frequent generators are then used to construct candidate ($k$+1)-generators. The closures of frequent generators are the frequent closed itemsets and the support of a generator is also the support of its closure.

During the $k$th iteration, a set $FC_k$ is considered. Each element of this set consists of three pieces of information: a $k$-generator, its closure and their support. The algorithm first initializes the candidate 1-generators in $FC_1$ with the list of 1-itemsets and then carries out some iterations. During each iteration $k$:
1. Closures of all candidate $k$-generators and their supports are computed: The number of objects containing a generator determines its support and their intersection generates its closure. Each object is considered once and this phase requires only one scan of the dataset.
2. Infrequent $k$-generators, i.e., generators with support lower than minsupp, are removed from $FC_k$.
3. The set of candidate ($k$+1)-generators is constructed by joining the frequent $k$-generators in $FC_k$ as follows.
   a) Two $k$-generators in $FC_k$ that have the same first $k-1$ items are joined to create a candidate ($k$+1)-generator. For instance, the 3-generators {ABC} and {ABD} will be joined in order to create the candidate 4-generator {ABCD}.
   b) Candidate ($k$+1)-generators that are infrequent or non-minimal are removed. One of the $k$-subsets of such a generator is either infrequent or non-minimal and thus does not belong to the set of frequent $k$-generators in $FC_k$.
   c) The third phase removes ($k$+1)-generators whose closures were already computed. Such a generator $g$ is easily identified as it is included in the closure of a frequent $k$-generator $g'$ in $FC_k$: We have $g' \subset g \subseteq \gamma(g')$.
The algorithm stops when no new candidate generator can be created. Then, each set $FC_k$ stores the frequent $k$-generators, their closures and their supports.

Example 3. Figure 1 shows the execution of the Close algorithm on the context $\mathcal{D}$ for minsupp = 2/6. The set $FC_1$ is initialized with the list of all 1-itemsets. The algorithm computes supports and closures of the 1-generators in $FC_1$ and infrequent ones are discarded. Then, joining the frequent generators in $FC_1$, six new candidate 2-generators are created: {AB}, {AC}, {AE}, {BC}, {BE} and {CE} in $FC_2$. The 2-generators {AC} and {BE} are removed from $FC_2$ because we have $\{AC\} \subseteq \gamma(\{A\})$ and $\{BE\} \subseteq \gamma(\{B\})$. The algorithm determines supports and closures of the remaining 2-generators in $FC_2$ and suppresses infrequent ones. Then, the candidate 3-generator {ABE} is created by joining the frequent generators in $FC_2$ but is removed because the 2-generator $\{BE\} \subset \{ABE\}$ is not in $FC_2$ and the algorithm stops.

Scan $\mathcal{D}$ → $FC_1$
Generator   Closed itemset   Supp
{A}         {AC}             3/6
{B}         {BE}             5/6
{C}         {C}              5/6
{D}         {ACD}            1/6
{E}         {BE}             5/6

Pruning infrequent itemsets → $FC_1$
Generator   Closed itemset   Supp
{A}         {AC}             3/6
{B}         {BE}             5/6
{C}         {C}              5/6
{E}         {BE}             5/6

Scan $\mathcal{D}$ → $FC_2$
Generator   Closed itemset   Supp
{AB}        {ABCE}           2/6
{AE}        {ABCE}           2/6
{BC}        {BCE}            4/6
{CE}        {BCE}            4/6

Pruning infrequent itemsets → $FC_2$
Generator   Closed itemset   Supp
{AB}        {ABCE}           2/6
{AE}        {ABCE}           2/6
{BC}        {BCE}            4/6
{CE}        {BCE}            4/6

Figure 1. Extracting frequent closed itemsets in the context $\mathcal{D}$ with Close.

The A-Close algorithm improves the efficiency of the extraction in case of weakly correlated data. It does not compute closures of candidate generators during the iterations, but during an ultimate scan carried out after the end of these iterations if necessary. Experimental results show that Close and A-Close are particularly efficient for mining association rules from dense or correlated data. On such data, Close outperforms A-Close, and they both outperform algorithms for extracting frequent itemsets and maximal frequent itemsets. In that case, algorithms for extracting maximal frequent itemsets suffer from the cost of the frequent itemset supports computation that requires accessing the dataset. On the contrary, for weakly correlated data, algorithms for extracting maximal frequent itemsets are the most efficient and algorithms for extracting frequent itemsets, as well as A-Close, outperform Close.
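The levelwise procedure of Close described in this section (scan, pruning, then the three-part candidate construction) can be sketched as follows. This is a minimal in-memory illustration, not the authors' implementation: the function name `close`, the list-of-sets encoding of the context and the use of absolute support counts (`min_count`, assumed to be at least 1) are assumptions made for the example.

```python
from itertools import combinations

# Example context D (one frozenset of items per object).
CONTEXT = [frozenset(t) for t in ("ACD", "BCE", "ABCE", "BE", "ABCE", "BCE")]

def close(context, min_count):
    """Levelwise sketch of Close: returns {generator: (closure, support count)}."""
    items = sorted(set().union(*context))
    candidates = [frozenset([i]) for i in items]
    result = {}
    while candidates:
        level = {}
        for g in candidates:
            # Step 1: the objects containing g give its support and its closure.
            containing = [obj for obj in context if g <= obj]
            # Step 2: keep only frequent generators.
            if len(containing) >= min_count:
                level[g] = (frozenset.intersection(*containing), len(containing))
        result.update(level)
        # Step 3: build the candidate (k+1)-generators.
        k = len(next(iter(level))) if level else 0
        candidates = []
        for g1, g2 in combinations(sorted(level, key=sorted), 2):
            if sorted(g1)[:-1] != sorted(g2)[:-1]:
                continue  # 3a: only join generators sharing their first k-1 items
            cand = g1 | g2
            subsets = [frozenset(s) for s in combinations(cand, k)]
            if any(s not in level for s in subsets):
                continue  # 3b: some k-subset is not a frequent generator
            if any(cand <= level[s][0] for s in subsets):
                continue  # 3c: closure already computed from a k-subset
            candidates.append(cand)
    return result

for gen, (closed, count) in close(CONTEXT, 2).items():
    print("".join(sorted(gen)), "".join(sorted(closed)), f"{count}/6")
```

On the example context with a minimum count of 2 (minsupp = 2/6), the sketch yields the eight generators of figure 1 after pruning, each with its closure and support.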
The ChARM (Zaki and Hsiao, 1999) and Closet (Pei et al., 2000) algorithms extract frequent closed itemsets. However, neither of these algorithms extracts generators, so neither can be used as is to generate the min-max basis for association rules. The Pascal (Bastide et al., 2000) algorithm is an optimization of Apriori based on inference counting and equivalence classes defined according to itemset supports. It can easily be extended to generate the min-max basis since generators and closed itemsets are respectively bottom and top patterns of an equivalence class.
2.2. Deriving frequent closed itemsets and generators from frequent itemsets

The Close+ algorithm identifies frequent closed itemsets and generators among frequent itemsets without accessing the dataset. It enables the efficient generation of the min-max basis when an algorithm for extracting frequent itemsets is used. Such an algorithm gives as result the sets $F_k$, each set $F_k$ containing all frequent $k$-itemsets, with $k$ varying from 1 to $\mu$ (the size of the longest maximal frequent itemsets). The frequent closed itemsets and generators are identified among frequent itemsets using propositions 1 and 2 that are derived from the property that an [...] is insured by the property that maximal frequent itemsets are maximal frequent closed itemsets (Pasquier et al., 1998).
Proposition 1. The support of a generator is smaller than the supports of all its subsets.

Proof. Let $g$ be a $k$-generator and $s$ a ($k-1$)-subset of $g$. We then have $s \subset g \Rightarrow \psi(s) \supseteq \psi(g)$. If $\psi(s) = \psi(g)$ then $\gamma(s) = \gamma(g)$ and $g$ is not a generator: It is not a minimal itemset whose closure is $\gamma(g)$. It follows that $\psi(s) \supset \psi(g) \Rightarrow supp(g) < supp(s)$.

Proposition 2. The support of a closed itemset is greater than the supports of all its supersets.

Proof. Let $l$ be a closed $k$-itemset and $s$ a superset of $l$. We then have $l \subset s \Rightarrow \psi(l) \supseteq \psi(s)$. If $\psi(l) = \psi(s)$ then $\gamma(l) = \gamma(s) \Rightarrow l = \gamma(s) \Rightarrow s \subseteq l$ (absurd). It follows that $\psi(l) \supset \psi(s) \Rightarrow supp(l) > supp(s)$.

The pseudo-code of the Close+ algorithm is given in figure 2. It examines successively all frequent itemsets in each set $F_k$, with $k$ varying from 1 to $\mu$. It generates the sets $FC_m$, $1 \leq m \leq \nu$, where $\nu$ is the size of the longest generators, containing the $m$-generators, their closures and their supports. It first determines if a frequent $k$-itemset is a generator by examining all its ($k-1$)-subsets' supports; it then determines if it is a closed itemset by examining all its ($k+1$)-supersets' supports and if so, identifies its generators by examining all its subsets' supports. The boolean variables isclosed and isgenerator are used to determine if an itemset $l$ is a closed itemset or is a generator.

At the beginning of the $k$th iteration (steps 1 to 21), the set $FC_k$ is empty (step 2). In steps 3 to 20, frequent itemsets in $F_k$ are considered successively. If an itemset $l$ has the same support as one of its ($k-1$)-subsets $l'$ in $F_{k-1}$ (steps 5 to 7), then $l$ is not a generator (step 6). Otherwise, $l$ and its support are inserted in $FC_k$ (step 8). Then, we test if $l$ has the same support as one of its ($k$+1)-supersets $l''$ in $F_{k+1}$ (steps 10 to 12). If so, we have $l'' \subseteq \gamma(l)$ and then $l \neq \gamma(l)$: $l$ is not closed (step 11). Otherwise, $l$ is a frequent closed itemset and we determine the generators of $l$ (steps 13 to 19) as follows. For each generator $g$ of size $n$, with $1 \leq n \leq k$, that is a subset of $l$ (steps 14 to 18), if the supports of $g$ and $l$ are equal then $g$ is a generator of $l$ and $l$ is inserted in $FC_n$ as the closure of $g$ (step 16). Thus, at the end of the algorithm, each set $FC_k$ contains all frequent $k$-generators, their closures and their supports.

Correctness. The correctness of the computation of sets $FC_k$ for $1 \leq k \leq \mu$ relies on propositions 1 and 2. Using the first one, we determine if a frequent $k$-itemset $l$ is a generator of a closed itemset by comparing its support and the supports of the frequent ($k-1$)-itemsets included in $l$. The second proposition enables to determine if a frequent $k$-itemset $l$ is closed by comparing its support and the supports of the frequent ($k$+1)-itemsets in which $l$ is included. Since a generator has the same support as its closure, the determination of the generators of a closed itemset is [...]

Input : sets F_k of frequent k-itemsets
Output: sets FC_k of frequent k-generators, with closure and support
1)  for k = 1 to µ do
2)    FC_k ← ∅
3)    forall itemsets l ∈ F_k do
4)      isgenerator ← true
5)      forall subsets l′ ∈ F_{k−1} of l do
6)        if (l′.supp = l.supp) then isgenerator ← false
7)      end
8)      if (isgenerator = true) then insert l in FC_k.generators with l.supp
9)      isclosed ← true
10)     forall supersets l″ ∈ F_{k+1} of l do
11)       if (l″.supp = l.supp) then isclosed ← false
12)     end
13)     if (isclosed = true) then do
14)       for n = k to 1 step −1 do
15)         forall subsets g ∈ FC_n.generators of l do
16)           if (g.supp = l.supp) then insert l in g.closure
17)         end
18)       end
19)     end
20)   end
21) end
22) return ∪ FC_k
Figure 2. Close+ algorithm for deriving frequent closed itemsets and generators.
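The procedure of figure 2 translates almost directly into Python. The sketch below is an illustration under stated assumptions, not the authors' code: frequent itemsets are given as a dictionary from itemset to support count (hypothetical name `FREQUENT`, filled with the frequent itemsets of the example context for minsupp = 2/6), and the inner loops over $FC_n$ are replaced by a single loop over all generators found so far, which visits the same candidates.

```python
from itertools import combinations

# All frequent itemsets of the example context for minsupp = 2/6,
# with supports as object counts (out of 6).
FREQUENT = {
    frozenset(k): v
    for k, v in {
        "A": 3, "B": 5, "C": 5, "E": 5,
        "AB": 2, "AC": 3, "AE": 2, "BC": 4, "BE": 5, "CE": 4,
        "ABC": 2, "ABE": 2, "ACE": 2, "BCE": 4,
        "ABCE": 2,
    }.items()
}

def close_plus(frequent):
    """Returns {generator: (closure, support count)} using supports only."""
    items = set().union(*frequent)
    closure_of = {}  # generator -> closure (None until the closure is met)
    for l in sorted(frequent, key=len):  # itemsets by increasing size k
        k, supp = len(l), frequent[l]
        # Proposition 1: l is a generator iff no (k-1)-subset has the same
        # support (the empty set is not considered a frequent itemset).
        subsets = [frozenset(s) for s in combinations(l, k - 1) if s]
        if all(frequent[s] != supp for s in subsets):
            closure_of[l] = None
        # Proposition 2: l is closed iff no frequent (k+1)-superset has
        # the same support.
        if all(frequent.get(l | {i}) != supp for i in items - l):
            # l is the closure of every generator it contains with equal support.
            for g in closure_of:
                if g <= l and frequent[g] == supp:
                    closure_of[g] = l
    return {g: (c, frequent[g]) for g, c in closure_of.items()}

for gen, (closed, count) in close_plus(FREQUENT).items():
    print("".join(sorted(gen)), "".join(sorted(closed)), f"{count}/6")
```

Processing itemsets by increasing size guarantees that every generator contained in a closed itemset has already been recorded when that closed itemset is reached, so no dataset access is needed at any point.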
Example 4. Figure 3 shows the execution of the Close+ algorithm using the sets $F_1$ to $F_4$ of frequent itemsets extracted from the context $\mathcal{D}$ with minsupp = 2/6. All frequent 1-itemsets are frequent 1-generators since none of their subsets is a frequent itemset: The empty set is not considered as a frequent itemset. The 1-itemset {C} is also its own closure since all its supersets in $F_2$ have a smaller support. In $F_2$, the 2-itemsets {AC} and {BE} are not generators since they have the same support as the itemsets {A}, and {B} and {E}, respectively. These two itemsets are closed since their support is greater than those of all their supersets in $F_3$; {AC} is the closure of {A} and {BE} is the closure of {B} and {E}. No frequent 3-itemset in $F_3$ is a generator and {BCE}, which has the same support as {BC} and {CE} and a greater support than {ABCE} in $F_4$, is the closure of {BC} and {CE} in $FC_2$. Finally, the 4-itemset {ABCE} is closed since it is a maximal frequent itemset; it is the closure of {AB} and {AE}, and is inserted in $FC_2$.

Remark. As a simple optimization, the algorithm can stop testing if frequent $k$-itemsets are generators after the first iteration $n$ during which no frequent $n$-itemset examined is a generator. In example 4, the algorithm will not test if 4-itemsets in $F_4$ are generators since no 3-itemset is a generator ($FC_3$ is empty at the end of the third iteration).

$F_1$
Itemset   Supp
{A}       3/6
{B}       5/6
{C}       5/6
{E}       5/6

Generators of size 1 → $FC_1$
Generator   Closed itemset   Supp
{A}         -                3/6
{B}         -                5/6
{C}         -                5/6
{E}         -                5/6

Closures of size 1 → $FC_1$
Generator   Closed itemset   Supp
{A}         -                3/6
{B}         -                5/6
{C}         {C}              5/6
{E}         -                5/6

$F_2$
Itemset   Supp
{AB}      2/6
{AC}      3/6
{AE}      2/6
{BC}      4/6
{BE}      5/6
{CE}      4/6

Generators of size 2 → $FC_2$
Generator   Closed itemset   Supp
{AB}        -                2/6
{AE}        -                2/6
{BC}        -                4/6
{CE}        -                4/6

Closures of size 2 → $FC_1$
Generator   Closed itemset   Supp
{A}         {AC}             3/6
{B}         {BE}             5/6
{C}         {C}              5/6
{E}         {BE}             5/6

$F_3$
Itemset   Supp
{ABC}     2/6
{ABE}     2/6
{ACE}     2/6
{BCE}     4/6

Generators of size 3 → $FC_3$ (empty)

Closures of size 3 → $FC_2$
Generator   Closed itemset   Supp
{AB}        -                2/6
{AE}        -                2/6
{BC}        {BCE}            4/6
{CE}        {BCE}            4/6

$F_4$
Itemset   Supp
{ABCE}    2/6

Generators of size 4 → $FC_4$ (empty)

Closures of size 4 → $FC_2$
Generator   Closed itemset   Supp
{AB}        {ABCE}           2/6
{AE}        {ABCE}           2/6
{BC}        {BCE}            4/6
{CE}        {BCE}            4/6

Figure 3. Deriving frequent closed itemsets and generators with Close+.
3. Min-max basis for association rules

We first define min-max association rules: the most general non-redundant association rules according to their semantics. Informally, an association rule is redundant if it brings the same information or less information than is brought by another rule of same support and confidence. Then, the min-max association rules are the non-redundant association rules having minimal antecedent and maximal consequent: $r$ is a min-max association rule if no other association rule $r'$ has the same support and confidence, an antecedent that is a subset of the antecedent of $r$ and a consequent that is a superset of the consequent of $r$.

Definition 5. (Min-max association rules) Let $AR$ be the set of association rules extracted. An association rule $r : l_1 \to l_2 \in AR$ is a min-max association rule iff there is no other rule $r' : l'_1 \to l'_2 \in AR$ with $supp(r') = supp(r)$, $conf(r') = conf(r)$, $l'_1 \subseteq l_1$ and $l_2 \subseteq l'_2$.

Based on this definition, we characterize exact and approximate min-max association rules that constitute respectively the min-max exact basis and the min-max approximate basis in the two following sections.
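Definition 5 can be read as a simple filter over a set of rules. The sketch below is only an illustration of that reading: the `Rule` type and the helpers `rule` and `is_min_max` are hypothetical names, and the three rules used as data are exact rules with antecedent {AB} taken from the running example (support 2/6, confidence 1).

```python
from fractions import Fraction
from typing import NamedTuple


class Rule(NamedTuple):
    antecedent: frozenset
    consequent: frozenset
    supp: Fraction
    conf: Fraction


def rule(antecedent, consequent, supp, conf=Fraction(1)):
    """Convenience constructor: itemsets are given as strings of item letters."""
    return Rule(frozenset(antecedent), frozenset(consequent), supp, conf)


def is_min_max(r, rules):
    """Definition 5: r is min-max if no other rule has the same support and
    confidence, an antecedent included in r's and a consequent including r's."""
    for other in rules:
        if other == r:
            continue
        if (other.supp == r.supp and other.conf == r.conf
                and other.antecedent <= r.antecedent
                and r.consequent <= other.consequent):
            return False
    return True


rules = [
    rule("AB", "CE", Fraction(2, 6)),
    rule("AB", "C", Fraction(2, 6)),
    rule("AB", "E", Fraction(2, 6)),
]
min_max = [r for r in rules if is_min_max(r, rules)]
print(min_max)
```

Among these three rules only AB ⇒ CE is kept: the other two have the same antecedent, support and confidence, but a smaller consequent.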
3.1. Exact min-max association rules

First, notice that exact association rules, with the form $r : l_1 \Rightarrow (l_2 \setminus l_1)$, are rules between two frequent itemsets $l_1 \subset l_2$ having the same closure: $\gamma(l_1) = \gamma(l_2)$. Since $conf(r) = 1$ we have $supp(l_1) = supp(l_2)$, and as $l_1 \subset l_2$ we see that $\gamma(l_1) = \gamma(l_2)$. We define min-max association rules among these exact rules.

Let $g$ be a generator of $\gamma(l_1) = \gamma(l_2)$ such that $g \subseteq l_1$. Since $g$ is minimal, we have $g \subseteq l_1 \subset l_2 \subseteq \gamma(l_2)$. Furthermore, all itemsets in the interval $[g, \gamma(l_2)]$, defined by inclusion (i.e. all itemsets $l$ with $g \subseteq l \subseteq \gamma(l_2)$), have the same closure $\gamma(l_2)$ and thus the same support. The min-max association rule among all rules with the form $r : l_1 \Rightarrow (l_2 \setminus l_1)$ with $l_1, l_2 \in [g, \gamma(l_2)]$ is the rule $g \Rightarrow (\gamma(l_2) \setminus g)$. This rule has a minimal antecedent, $g$, and a maximal consequent, $\gamma(l_2) \setminus g$, among all these rules that have the same support.

We generalize this definition to all generators of the frequent closed itemset $\gamma(l_2)$. Let $Gen_{\gamma(l_2)}$ be the set of these generators. All exact min-max association rules constructed with $\gamma(l_2)$ are rules with the form $g \Rightarrow (\gamma(l_2) \setminus g)$ with $g \in Gen_{\gamma(l_2)}$. The extension of this property to all frequent closed itemsets defines the min-max exact basis containing all exact min-max association rules characterized in definition 5.
Definition 6. (Min-max exact basis) Let $Closed$ be the set of frequent closed itemsets extracted from the context and, for each frequent closed itemset $f$, let $Gen_f$ denote the set of generators of $f$. The min-max exact basis is:

$$MinMaxExact = \{\, r : g \Rightarrow (f \setminus g) \mid f \in Closed \wedge g \in Gen_f \wedge g \neq f \,\}.$$
The condition $g \neq f$ discards rules with the form $g \Rightarrow \emptyset$; it is equivalent to the condition $l_1 \subset l_2$ in the definition of association rules. We state in the following proposition that the min-max exact basis does not lead to information loss.
The pseudo-code of the algorithm for constructing the min-max exact basis using frequent closed itemsets and their generators is presented in figure 4. Each element of a set $FC_k$ contains three fields: a $k$-generator $generator$, its closure $closure$ and their support $supp$. The algorithm returns the set $MinMaxExact$ containing the exact min-max rules.

Input : sets $FC_k$
Output: set $MinMaxExact$
1) $MinMaxExact \leftarrow \emptyset$
2) for $k = 1$ to $\nu$ do
3)   forall $k$-generator $g \in FC_k$ do
4)     if ($g \neq g.closure$)
5)     then insert $\{r : g \Rightarrow (g.closure \setminus g),\ g.supp\}$ in $MinMaxExact$
6)   end
7) end
8) return $MinMaxExact$

Figure 4. Algorithm for generating the min-max exact basis.
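The steps of figure 4 can be sketched in Python as follows. This is an illustrative transcription, not a reference implementation: the list `FC` flattens the sets $FC_k$ into (generator, closure, support) triples copied from the running example, and the names `min_max_exact` and `show` are hypothetical.

```python
from fractions import Fraction

# (generator, closure, support) triples of the running example (sets FC_1 and FC_2).
FC = [
    ("A", "AC", Fraction(3, 6)),
    ("B", "BE", Fraction(5, 6)),
    ("C", "C", Fraction(5, 6)),
    ("E", "BE", Fraction(5, 6)),
    ("AB", "ABCE", Fraction(2, 6)),
    ("AE", "ABCE", Fraction(2, 6)),
    ("BC", "BCE", Fraction(4, 6)),
    ("CE", "BCE", Fraction(4, 6)),
]


def min_max_exact(fc):
    """Return one rule (antecedent, consequent, support) per generator
    that differs from its closure."""
    basis = []
    # Sorting by generator size mirrors the loop over k (step 2).
    for generator, closed, supp in sorted(fc, key=lambda entry: len(entry[0])):
        g, f = frozenset(generator), frozenset(closed)
        if g != f:                          # step 4: discard rules g => empty set
            basis.append((g, f - g, supp))  # step 5
    return basis


def show(itemset):
    return "".join(sorted(itemset))


exact_basis = min_max_exact(FC)
for antecedent, consequent, supp in exact_basis:
    print(f"{show(antecedent)} => {show(consequent)}  supp={supp}")
```

The generator {C} is its own closure, so it produces no rule; the other seven generators each produce one rule.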
First, $MinMaxExact$ is initialized with the empty set (step 1). Then, each set $FC_k$ is examined in increasing order of $k$ values (steps 2 to 7). For each $k$-generator $g \in FC_k$ of the frequent closed itemset $\gamma(g)$ (steps 3 to 6), if $g$ is different from its closure $\gamma(g)$ (step 4), the rule $r : g \Rightarrow (\gamma(g) \setminus g)$, whose support is equal to the support of $g$ and $\gamma(g)$, is inserted into $MinMaxExact$ (step 5). Finally, the algorithm returns the set $MinMaxExact$ containing all exact min-max association rules between generators and their closures (step 8).

Example 5. The min-max exact basis extracted from context $D$ for minsupp = 2/6 is presented in table III. It contains seven rules whereas the set of all exact association rules, presented in table IV, contains fourteen rules.

Table III. Min-max exact basis extracted from $D$.

Generator   Closure   Exact rule   Supp
{A}         {AC}      A ⇒ C        3/6
{B}         {BE}      B ⇒ E        5/6
{C}         {C}       -            -
{E}         {BE}      E ⇒ B        5/6
{AB}        {ABCE}    AB ⇒ CE      2/6
{AE}        {ABCE}    AE ⇒ BC      2/6
{BC}        {BCE}     BC ⇒ E       4/6
{CE}        {BCE}     CE ⇒ B       4/6

Table IV. Exact association rules extracted from $D$.

Exact rule   Supp    Exact rule   Supp
A ⇒ C        3/6     BC ⇒ E       4/6
B ⇒ E        5/6     CE ⇒ B       4/6
E ⇒ B        5/6     AB ⇒ CE      2/6
AB ⇒ C       2/6     AE ⇒ BC      2/6
AB ⇒ E       2/6     ABC ⇒ E      2/6
AE ⇒ B       2/6     ABE ⇒ C      2/6
AE ⇒ C       2/6     ACE ⇒ B      2/6

Proposition 3. (i) All exact association rules and their supports can be deduced from the min-max exact basis. (ii) All rules in the min-max exact basis are min-max association rules.
Proof. (i) Let $r : l_1 \Rightarrow (l_2 \setminus l_1)$ be an exact association rule between two frequent itemsets with $l_1 \subset l_2$. Since $conf(r) = 1$, we have $supp(l_1) = supp(l_2)$ and as an itemset's support is equal to its closure's support, we deduce that $supp(\gamma(l_1)) = supp(\gamma(l_2))$ which, together with $\gamma(l_1) \subseteq \gamma(l_2)$, implies that $\gamma(l_1) = \gamma(l_2) = f$. The itemset $f$ is a frequent closed itemset $f \in FC$ and, obviously, there exists a rule $r' : g \Rightarrow (f \setminus g) \in MinMaxExact$ such that $g$ is a generator of $f$ with $g \subseteq l_1$ and $g \subset l_2$. We show now that the rule $r$ and its support can be deduced from the rule $r'$ and its support. Since $g \subseteq l_1 \subset l_2 \subseteq f$, rule $r$'s antecedent and consequent can be derived from those of rule $r'$. From $\gamma(l_1) = \gamma(l_2) = f$, we deduce that $supp(r) = supp(l_2) = supp(\gamma(l_2)) = supp(f) = supp(r')$.
(ii) Let $r : g \Rightarrow (f \setminus g) \in MinMaxExact$. According to definition 6, we have $g \in Gen_f$ and $f \in Closed$. We demonstrate that there is no other rule $r' : l'_1 \Rightarrow (l'_2 \setminus l'_1) \in MinMaxExact$, such that $supp(r') = supp(r)$, $conf(r') = conf(r)$, $l'_1 \subseteq g$ and $f \subseteq l'_2$. If $l'_1 \subset g$ then, according to definition 4, we have $\gamma(l'_1) \subset \gamma(g) = f \Longrightarrow l'_1 \notin Gen_f$ and then $r' \notin MinMaxExact$. If $f \subset l'_2$ then, according to definition 3, we have $f = \gamma(f) = \gamma(g) \subset l'_2 = \gamma(l'_2)$. From definition 4 we deduce $g \notin Gen_{l'_2}$ and we conclude that $r' \notin MinMaxExact$.

3.2. Approximate min-max association rules
Approximate association rules, with the form $r : l_1 \to (l_2 \setminus l_1)$, are rules between two frequent itemsets $l_1 \subset l_2$ such that $\gamma(l_1) \subset \gamma(l_2)$. Since $conf(r) < 1$ we have $supp(l_1) > supp(l_2)$ and we deduce that $\gamma(l_1) \subset \gamma(l_2)$.

We deduce the definition of approximate min-max association rules. Let $g_1$ be a generator of the frequent closed itemset $f_1$ and $g_2$ be a generator of the frequent closed itemset $f_2$ such that $f_1 \subset g_2 \subseteq l_2 \subseteq f_2$. All rules with the form $r : l_1 \to (l_2 \setminus l_1)$ where $l_1 \in [g_1, f_1]$ and $l_2 \in [g_2, f_2]$ have the same confidence and the same support since $g_1$, $l_1$ and $f_1$ have the same support as well as $g_2$, $l_2$ and $f_2$. We then deduce that the min-max association rule among all these rules is $g_1 \to (f_2 \setminus g_1)$. Indeed, $g_1$ is the minimal itemset in $[g_1, f_1]$ and $f_2$ is the maximal itemset in $[g_2, f_2]$.

The generalization of this property to all pairs of frequent itemsets $l_1$ and $l_2$ such that $l_1 \subset l_2$ and $supp(l_1) \neq supp(l_2)$ defines the min-max approximate basis containing all approximate min-max association rules characterized in definition 5.

Definition 7. (Min-max approximate basis) We denote $Gen$ the set of generators of the frequent closed itemsets in $Closed$. The min-max approximate basis is:

$$MinMaxApprox = \{\, r : g \to (f \setminus g) \mid f \in Closed \wedge g \in Gen \wedge \gamma(g) \subset f \,\}.$$
The pseudo-code of the algorithm for generating the set $MinMaxApprox$ of approximate min-max rules using frequent closed itemsets and their generators is presented in figure 5.

Input : sets $FC_k$, confidence threshold $minconf$
Output: set $MinMaxApprox$
1) $MinMaxApprox \leftarrow \emptyset$
2) for $k = 1$ to $\nu - 1$ do
3)   forall $k$-generator $g \in FC_k$ do
4)     forall frequent closed itemset $f \in F_{j>k} \mid f \supset g.closure$ do
5)       if ($f.supp / g.supp \geq minconf$)
6)       then insert $\{r : g \to (f \setminus g),\ f.supp/g.supp,\ f.supp\}$ in $MinMaxApprox$
7)     end
8)   end
9) end
10) return $MinMaxApprox$

Figure 5. Algorithm for generating the min-max approximate basis.
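The loop structure of figure 5 can be sketched in Python as follows. This is a minimal illustration under assumptions: the list `FC` flattens the sets $FC_k$ of the running example into (generator, closure, support) triples, the closed itemsets are collected from the closure fields of these triples, and the function name `min_max_approx` is hypothetical.

```python
from fractions import Fraction

# (generator, closure, support) triples of the running example.
FC = [
    ("A", "AC", Fraction(3, 6)),
    ("B", "BE", Fraction(5, 6)),
    ("C", "C", Fraction(5, 6)),
    ("E", "BE", Fraction(5, 6)),
    ("AB", "ABCE", Fraction(2, 6)),
    ("AE", "ABCE", Fraction(2, 6)),
    ("BC", "BCE", Fraction(4, 6)),
    ("CE", "BCE", Fraction(4, 6)),
]


def min_max_approx(fc, minconf):
    """Return rules (antecedent, consequent, confidence, support) from each
    generator to each closed proper superset of its closure."""
    # A closed itemset has the same support as each of its generators.
    closed_support = {frozenset(closed): supp for _, closed, supp in fc}
    basis = []
    for generator, closed, g_supp in fc:
        g, g_closure = frozenset(generator), frozenset(closed)
        for f, f_supp in closed_support.items():
            if g_closure < f:                # step 4: f strictly contains closure(g)
                conf = f_supp / g_supp       # step 5
                if conf >= minconf:
                    basis.append((g, f - g, conf, f_supp))  # step 6
    return basis


approx_basis = min_max_approx(FC, Fraction(2, 5))
for g, consequent, conf, supp in approx_basis:
    print("".join(sorted(g)), "->", "".join(sorted(consequent)), supp, conf)
```

Exact fractions are used instead of floats so that the comparison with the threshold 2/5 is not affected by rounding.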
The algorithm examines the sets $FC_k$ in increasing order of $k$ values (steps 2 to 9). For each $k$-generator $g \in FC_k$ (steps 3 to 8), it considers all closed supersets $f$ of the closure of $g$ (steps 4 to 7). It computes the confidence of the rule $r : g \to (f \setminus g)$ (step 5) and inserts $r$ in $MinMaxApprox$ if it reaches the $minconf$ threshold (step 6).

Example 6. The min-max approximate basis extracted from context $D$ for minsupp = 2/6 and minconf = 2/5 is presented in table V. It contains ten rules whereas the set of all approximate association rules, presented in table VI, contains thirty-six rules.

Table V. Min-max approximate basis extracted from $D$.

Generator   Closure   Closed superset   Approximate rule   Supp   Conf
{A}         {AC}      {ABCE}            A → BCE            2/6    2/3
{B}         {BE}      {BCE}             B → CE             4/6    4/5
{B}         {BE}      {ABCE}            B → ACE            2/6    2/5
{C}         {C}       {AC}              C → A              3/6    3/5
{C}         {C}       {BCE}             C → BE             4/6    4/5
{C}         {C}       {ABCE}            C → ABE            2/6    2/5
{E}         {BE}      {BCE}             E → BC             4/6    4/5
{E}         {BE}      {ABCE}            E → ABC            2/6    2/5
{AB}        {ABCE}    -                 -                  -      -
{AE}        {ABCE}    -                 -                  -      -
{BC}        {BCE}     {ABCE}            BC → AE            2/6    2/4
{CE}        {BCE}     {ABCE}            CE → AB            2/6    2/4

Table VI. Approximate association rules extracted from $D$.

Approximate rule   Supp   Conf    Approximate rule   Supp   Conf    Approximate rule   Supp   Conf
BCE → A            2/6    2/4     B → ACE            2/6    2/5     B → CE             4/6    4/5
AC → BE            2/6    2/3     C → ABE            2/6    2/5     C → BE             4/6    4/5
BC → AE            2/6    2/4     E → ABC            2/6    2/5     E → BC             4/6    4/5
BE → AC            2/6    2/5     A → BC             2/6    2/3     A → B              2/6    2/3
CE → AB            2/6    2/4     B → AC             2/6    2/5     B → A              2/6    2/5
AC → B             2/6    2/3     C → AB             2/6    2/5     C → A              3/6    3/5
BC → A             2/6    2/4     A → BE             2/6    2/3     A → E              2/6    2/3
BE → A             2/6    2/5     B → AE             2/6    2/5     E → A              2/6    2/5
AC → E             2/6    2/3     E → AB             2/6    2/5     B → C              4/6    4/5
CE → A             2/6    2/4     A → CE             2/6    2/3     C → B              4/6    4/5
BE → C             4/6    4/5     C → AE             2/6    2/5     C → E              4/6    4/5
A → BCE            2/6    2/3     E → AC             2/6    2/5     E → C              4/6    4/5

Proposition 4. (i) All approximate association rules can be deduced, with their supports and confidences, from the min-max approximate basis. (ii) All rules in the min-max approximate basis are min-max association rules.

Proof. (i) Let $r : l_1 \to (l_2 \setminus l_1)$ be an association rule between two frequent itemsets with $l_1 \subset l_2$. Since $conf(r) < 1$ we also have $\gamma(l_1) \subset \gamma(l_2)$. For any frequent itemsets $l_1$ and $l_2$, there is a generator $g_1$ such that $g_1 \subseteq l_1 \subseteq \gamma(l_1) = \gamma(g_1)$ and a generator $g_2$ such that $g_2 \subseteq l_2 \subseteq \gamma(l_2) = \gamma(g_2)$. Since $l_1 \subset l_2$, we have $l_1 \subseteq \gamma(g_1) \subset l_2 \subseteq \gamma(g_2)$ and the rule $r' : g_1 \to (\gamma(g_2) \setminus g_1)$ belongs to the min-max approximate basis. We show that the rule $r$, its support and its confidence can be deduced from the rule $r'$, its support and its confidence. Since $g_1 \subseteq l_1 \subseteq \gamma(g_1) \subset g_2 \subseteq l_2 \subseteq \gamma(g_2)$, the antecedent and the consequent of $r$ can be rebuilt starting from the rule $r'$. Moreover, we have $\gamma(l_2) = \gamma(g_2)$ and thus $supp(r) = supp(l_2) = supp(\gamma(g_2)) = supp(r')$. Since $g_1 \subseteq l_1 \subseteq \gamma(g_1)$, we have $supp(g_1) = supp(l_1)$ and we thus deduce that:

$$conf(r) = \frac{supp(l_2)}{supp(l_1)} = \frac{supp(\gamma(g_2))}{supp(g_1)} = conf(r').$$

(ii) Let $r : g \to (f \setminus g) \in MinMaxApprox$. According to definition 7, we have $f \in Closed$, $g \in Gen_{f'}$ and $f' \subset f$. We demonstrate that there is no other rule $r' : l'_1 \to (l'_2 \setminus l'_1) \in MinMaxApprox$, such that $supp(r') = supp(r)$, $conf(r') = conf(r)$, $l'_1 \subseteq g$ and $f \subseteq l'_2$. If $l'_1 \subset g$ then, according to definition 4, we have $\gamma(l'_1) \subset \gamma(g) = f'$ and then $l'_1 \notin Gen_{f'}$. We deduce that $supp(l'_1) > supp(g)$ and then $conf(r') < conf(r)$. If $f \subset l'_2$ then, according to definition 3, we have $f = \gamma(f) \subset l'_2 = \gamma(l'_2)$. We deduce that $supp(f) > supp(l'_2)$ and we conclude that $conf(r) > conf(r')$.
3.3. Non-transitive approximate min-max association rules

We can further reduce the number of approximate association rules extracted without losing the ability to deduce all approximate association rules, with support and confidence, by removing transitive min-max association rules.

A min-max association rule $g \to (f \setminus g)$ with $\gamma(g) \subset f$ is transitive if there exists a frequent closed itemset $f'$ such that $\gamma(g) \subset f' \subset f$. Let $g'$ be a generator of $f'$ such that $\gamma(g) \subset g' \subseteq f' \subset f$. Then, we have the two following approximate min-max association rules: $g \to (f' \setminus g)$ and $g' \to (f \setminus g')$. The rule $g \to (f \setminus g)$ is the transitive composition of the two previous rules; its support is equal to the second rule's support and its confidence is equal to the product of their confidences.
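As a worked illustration with the values of table V (the choice of rules is ours), the rule B → ACE can be obtained from B → CE and BC → AE, where {BCE} plays the role of the intermediate closed itemset $f'$ and {BC} is one of its generators:

```latex
\begin{align*}
supp(B \to ACE) &= supp(\{ABCE\}) = \tfrac{2}{6} = supp(BC \to AE),\\
conf(B \to ACE) &= \frac{supp(\{ABCE\})}{supp(\{B\})} = \frac{2/6}{5/6} = \frac{2}{5},\\
conf(B \to CE) \times conf(BC \to AE) &= \frac{4}{5} \times \frac{2}{4} = \frac{2}{5}.
\end{align*}
```

The product works because {BC} and its closure {BCE} have the same support, 4/6, so the intermediate terms cancel.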
We generalize this characterization to all triplets consisting of a generator $g$, its closure $f$ and a closed superset $f'$ of $f$ to define the non-transitive min-max approximate basis, that is the transitive reduction of the min-max approximate basis. We write $l_1 \lessdot l_2$ when a frequent closed itemset $l_1$ is an immediate predecessor of a frequent closed itemset $l_2$, i.e. $l_1 \subset l_2$ and $\nexists\, l_3 \in Closed$ such that $l_1 \subset l_3 \subset l_2$. The non-transitive min-max approximate rules are of the form $g \to (f \setminus g)$ where $f$ is a frequent closed itemset and $g$ a frequent generator such that $\gamma(g)$ is an immediate predecessor of $f$.

Definition 8. (Non-transitive min-max approximate basis) The non-transitive min-max approximate basis is:

$$MinMaxReduc = \{\, r : g \to (f \setminus g) \mid f \in Closed \wedge g \in Gen \wedge \gamma(g) \lessdot f \,\}.$$
Remark. This transitive reduction decreases the number of approximate rules extracted, by selecting the most precise rules, i.e. those with highest confidences, since transitive rules have lower confidences than non-transitive rules.
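Definition 8 can be read directly as a filter on pairs (generator, closed itemset). The sketch below follows that reading rather than any stepwise procedure: for each generator it keeps the closed supersets of its closure that have no other closed itemset strictly in between. The list `FC` holds the (generator, closure, support) triples of the running example, and the name `min_max_reduc` is hypothetical.

```python
from fractions import Fraction

# (generator, closure, support) triples of the running example.
FC = [
    ("A", "AC", Fraction(3, 6)),
    ("B", "BE", Fraction(5, 6)),
    ("C", "C", Fraction(5, 6)),
    ("E", "BE", Fraction(5, 6)),
    ("AB", "ABCE", Fraction(2, 6)),
    ("AE", "ABCE", Fraction(2, 6)),
    ("BC", "BCE", Fraction(4, 6)),
    ("CE", "BCE", Fraction(4, 6)),
]


def min_max_reduc(fc, minconf):
    """Return rules (antecedent, consequent, confidence, support) from each
    generator to the immediate closed successors of its closure."""
    closed_support = {frozenset(closed): supp for _, closed, supp in fc}
    basis = []
    for generator, closed, g_supp in fc:
        g, g_closure = frozenset(generator), frozenset(closed)
        supersets = [f for f in closed_support if g_closure < f]
        for f in supersets:
            # f is not an immediate successor if another closed superset of
            # the closure of g is strictly contained in f.
            if any(middle < f for middle in supersets):
                continue
            conf = closed_support[f] / g_supp
            if conf >= minconf:
                basis.append((g, f - g, conf, closed_support[f]))
    return basis


reduced_basis = min_max_reduc(FC, Fraction(2, 5))
for g, consequent, conf, supp in reduced_basis:
    print("".join(sorted(g)), "->", "".join(sorted(consequent)), supp, conf)
```

With minconf = 2/5 this keeps, for instance, B → CE and BC → AE but not their composition B → ACE.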
The algorithm presented in figure 6 constructs the set $MinMaxReduc$ of non-transitive approximate min-max rules using frequent closed itemsets and their generators. For each generator $g$, it determines all frequent closed itemsets $f$ that are immediate successors of the closure of $g$ and then, it generates all rules between $g$ and $f$ that have a sufficient confidence.

First, $MinMaxReduc$ is initialized with the empty set (step 1) and sets $FC_k$ are successively examined in increasing order of $k$ values (steps 2 to 19). For each $k$-generator $g \in FC_k$ (steps 3 to 18), the set $ImSucc_g$ of immediate successors of the closure of $g$ is initialized with the empty set (step 4). The sets $S_j$ of frequent closed $j$-supersets of $\gamma(g)$ for $|\gamma(g)| < j \leq \mu$ are constructed (steps 5 to 7). Then, sets $S_j$ are considered successively in ascending order of $j$ values (steps 8 to 17). For each itemset $f \in S_j$ that is not a superset of an immediate successor of $\gamma(g)$ in $ImSucc_g$ (step 10), $f$ is inserted in $ImSucc_g$ (step 11) and the confidence of the rule $r : g \to (f \setminus g)$ is computed (step 12). If the confidence of $r$ reaches $minconf$, the rule $r$ is inserted in $MinMaxReduc$ (steps 13 and 14). When all the generators of size up to $\nu - 1$ have been considered, the algorithm returns the set $MinMaxReduc$ (step 20).

Input : sets $FC_k$, confidence threshold $minconf$
Output: set $MinMaxReduc$
1) $MinMaxReduc \leftarrow \emptyset$
2) for $k = 1$ to $\nu - 1$ do
3)   forall $k$-generator $g \in FC_k$ do
4)     $ImSucc_g \leftarrow \emptyset$
5)     for $j = |g.closure|$ to $\mu$ do
6)       $S_j \leftarrow \{f \in FC.closure \mid f \supset g.closure \wedge |f| = j\}$
7)     end
8)     for $j = |g.closure|$ to $\mu$ do
9)       forall frequent closed itemset $f \in S_j$ do
10)        if ($\nexists\, s \in ImSucc_g \mid s \subset f$) then do
11)          insert $f$ in $ImSucc_g$
12)          $conf \leftarrow f.supp / g.supp$
13)          if ($conf \geq minconf$)
14)          then insert $\{r : g \to (f \setminus g),\ conf,\ f.supp\}$ in $MinMaxReduc$
15)        end
16)      end
17)    end
18)  end
19) end
20) return $MinMaxReduc$

Figure 6. Algorithm for generating the non-transitive min-max approximate basis.

Example 7. The non-transitive min-max approximate basis extracted from context $D$ for minsupp = 2/6 and minconf = 2/5 is presented in table VII. It contains only seven rules, that is three rules less than the approximate min-max basis. These three rules are B → ACE, C → ABE and E → ABC that have minimal support and confidence measures among the ten rules of the approximate min-max basis.

Table VII. Non-transitive min-max approximate basis extracted from $D$.

Generator   Closure   Closed superset   Approximate rule   Supp   Conf
{A}         {AC}      {ABCE}            A → BCE            2/6    2/3
{B}         {BE}      {BCE}             B → CE             4/6    4/5
{B}         {BE}      {ABCE}            -                  -      -
{C}         {C}       {AC}              C → A              3/6    3/5
{C}         {C}       {BCE}             C → BE             4/6    4/5
{C}         {C}       {ABCE}            -                  -      -
{E}         {BE}      {BCE}             E → BC             4/6    4/5
{E}         {BE}      {ABCE}            -                  -      -
{AB}        {ABCE}    -                 -                  -      -
{AE}        {ABCE}    -                 -                  -      -
{BC}        {BCE}     {ABCE}            BC → AE            2/6    2/4
{CE}        {BCE}     {ABCE}            CE → AB            2/6    2/4

Proposition 5. All approximate association rules, with support and confidence, can be deduced from the non-transitive min-max approximate basis.

First, we show that all approximate min-max association rules can be derived from the non-transitive min-max approximate association rules. Then, from proposition 4 we conclude that all approximate association rules can also be deduced.
Proof. Let $r : g_1 \to (f_n \setminus g_1)$ be an approximate min-max association rule between a generator $g_1$ whose closure is $f_1$ and a frequent closed superset $f_n$ of $f_1$. If $f_1 \lessdot f_n$ then $r$ is non-transitive: $r \in MinMaxReduc$. If $f_1 \not\lessdot f_n$ then $r$ is transitive and there is a sequence $f_1, f_2, \ldots, f_n$ of frequent closed itemsets such that $g_1 \subseteq f_1 \lessdot f_2 \lessdot \ldots \lessdot f_n$ with $n \geq 3$. Each $f_i$ has at least one generator $g_i$ such that $\gamma(g_i) = f_i$ and since $f_1 \lessdot f_2 \lessdot \ldots \lessdot f_n$, there is a sequence of rules $r_i : g_i \to (f_{i+1} \setminus g_i)$ for $i \in [1, n-1]$ that are non-transitive min-max rules. The antecedent of $r$ is the antecedent $g_1$ of the first rule $r_1$ of the sequence. The consequent of $r$ is $(f_n \setminus g_1) = (((f_n \setminus g_{n-1}) \cup g_{n-1}) \setminus g_1)$, i.e. the union of rule $r_{n-1}$'s antecedent and consequent minus rule $r_1$'s antecedent. We now show that the support and confidence of $r$ can be deduced from those of rules $r_i$. We have $supp(r) = supp(g_1 \cup (f_n \setminus g_1)) = supp(f_n) = supp(g_{n-1} \cup (f_n \setminus g_{n-1})) = supp(r_{n-1})$. The support of $r$ is equal to the support of the last rule $r_{n-1}$ of the sequence. We also have:

$$\begin{aligned}
conf(r) &= \frac{supp(f_n)}{supp(g_1)} \\
        &= \frac{supp(f_n)}{supp(g_{n-1})} \times \frac{supp(g_{n-1})}{supp(g_1)} \\
        &= \frac{supp(f_n)}{supp(g_{n-1})} \times \frac{supp(f_{n-1})}{supp(g_{n-2})} \times \ldots \times \frac{supp(f_2)}{supp(g_1)} \\
        &= conf(r_{n-1}) \times conf(r_{n-2}) \times \ldots \times conf(r_1).
\end{aligned}$$

The confidence of $r$ is equal to the product of the confidences of the rules $r_i$ for $i = 1$ to $n - 1$.

4. Deriving association rules from the min-max bases

We introduce in this section simple techniques and algorithms to reconstruct all exact association rules, all approximate association rules and all transitive approximate min-max association rules from the min-max bases.
4.1. Deriving exact association rules

The graph-oriented representations of the exact and the exact min-max association rules extracted from context $D$ for minsupp = 2/6 and minconf = 2/5 are given in figures 7 and 8 respectively. Each vertex $v_l$ represents a frequent itemset $l$ that is a subset of the maximal frequent itemset {ABCE}. Each edge between two vertices $v_a$ and $v_c$ represents the exact association rule $a \Rightarrow c \setminus a$. A closed interval is a sub-graph containing all vertices representing itemsets of the intervals $[g_i, f]$ where each $g_i$ is a generator of the frequent closed itemset $f$. Since all itemsets in a closed interval have the same support, all rules in this interval also have the same support.

[Figure 7: graph over the frequent itemsets A, B, C, E, AB, AC, AE, BC, BE, CE, ABC, ABE, ACE, BCE and ABCE; the legend marks closed intervals and generator itemsets.]

Figure 7. Exact association rules extracted from $D$.

[Figure 8: graph over the same frequent itemsets; the legend marks closed intervals and generator itemsets.]

Figure 8. Exact min-max association rules extracted from $D$.

In the graph representation, deriving all exact rules means adding all possible edges between two vertices of the same closed interval. Each edge in figure 8 between two [...] we add all edges between two vertices, one representing a superset of $g$ and the other a subset of $f$.

The algorithm receives the set $MinMaxExact$ of exact min-max rules as input and it returns the set $AllExact$ containing all exact association rules. Its pseudo-code is presented in figure 9. It considers all exact min-max rules $r_1 : a_1 \Rightarrow c_1$ with $|c_1| > 1$ (steps 2 to 8). For all subsets $c_2$ of $c_1$ (steps 3 to 7), it generates all rules with the form $r_2 : a_1 \Rightarrow c_2$ and $r_3 : a_1 \cup c_2 \Rightarrow c_1 \setminus c_2$ (steps 4 and 6). These rules have the same support as $r_1$. Since rule $r_3$ can be generated several times, the algorithm first tests if it has not already been inserted in $AllExact$ (step 5).

Input : set $MinMaxExact$
Output: set $AllExact$
1) $AllExact \leftarrow \emptyset$
2) forall rule $\{r_1 : a_1 \Rightarrow c_1,\ r_1.supp\} \in MinMaxExact$ with $|c_1| > 1$ do
3)   forall subset $c_2 \subset c_1$ do
4)     insert $\{r_2 : a_1 \Rightarrow c_2,\ r_1.supp\}$ in $AllExact$
5)     if $\{r_3 : a_1 \cup c_2 \Rightarrow c_1 \setminus c_2,\ r_1.supp\} \notin AllExact$
6)     then insert $r_3$ in $AllExact$
7)   end
8) end
9) return $AllExact$

Figure 9. Algorithm for reconstructing all exact association rules.
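The reconstruction of exact rules can also be sketched from the interval view used in the proof of proposition 3: every exact rule $l_1 \Rightarrow (l_2 \setminus l_1)$ satisfies $g \subseteq l_1 \subset l_2 \subseteq f$ for some basis rule $g \Rightarrow (f \setminus g)$. The Python sketch below enumerates these pairs; it is an illustration of that view and not a step-by-step transcription of figure 9, and it also returns the basis rules themselves. The names `MIN_MAX_EXACT`, `interval` and `all_exact` are hypothetical; the data is the min-max exact basis of table III.

```python
from fractions import Fraction
from itertools import combinations

# Min-max exact basis of table III: (antecedent, consequent, support).
MIN_MAX_EXACT = [
    ("A", "C", Fraction(3, 6)),
    ("B", "E", Fraction(5, 6)),
    ("E", "B", Fraction(5, 6)),
    ("AB", "CE", Fraction(2, 6)),
    ("AE", "BC", Fraction(2, 6)),
    ("BC", "E", Fraction(4, 6)),
    ("CE", "B", Fraction(4, 6)),
]


def interval(low, high):
    """Yield every itemset l with low <= l <= high for the inclusion order."""
    extra = sorted(high - low)
    for size in range(len(extra) + 1):
        for added in combinations(extra, size):
            yield low | frozenset(added)


def all_exact(basis):
    """Map (antecedent, consequent) to the support of each derived exact rule."""
    rules = {}
    for antecedent, consequent, supp in basis:
        g = frozenset(antecedent)
        f = g | frozenset(consequent)   # closure of the generator g
        members = list(interval(g, f))
        for l1 in members:
            for l2 in members:
                if l1 < l2:
                    # Dictionary keys make repeated derivations harmless.
                    rules[(l1, l2 - l1)] = supp
    return rules


exact_rules = all_exact(MIN_MAX_EXACT)
for (l1, rest), supp in sorted(exact_rules.items(),
                               key=lambda kv: (sorted(kv[0][0]), sorted(kv[0][1]))):
    print("".join(sorted(l1)), "=>", "".join(sorted(rest)), supp)
```

The rule ABE ⇒ C is reached from both AB ⇒ CE and AE ⇒ BC; storing rules in a dictionary plays the role of the membership test of step 5.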
Example 8. Consider rule AB ⇒ CE represented in figure 8 by the edge between vertices {AB} and {ABCE}. From this rule we deduce rules AB ⇒ C, AB ⇒ E, ABC ⇒ E and ABE ⇒ C and from rule AE ⇒ BC, we deduce rules AE ⇒ B, AE ⇒ C, ABE ⇒ C and ACE ⇒ B. All these rules have the same support.

Remark. For constructing all exact rules using sets $FC_k$ of generators and frequent closed itemsets, we consider each generator $g$ and its closure $f$. We generate all rules $r : g \Rightarrow l \setminus g$ and $r : l \Rightarrow f \setminus l$ for $l \in [g, f[$, i.e. $g \subseteq l \subset f$. For instance, from the generator {AB} and its closure {ABCE}, we generate rules AB ⇒ CE, AB ⇒ C, AB ⇒ E, ABC ⇒ E and ABE ⇒ C. Their support is equal to the support of $g$ and $f$, i.e. the support of {AB} and {ABCE}.
4.2. Deriving approximate association rules

Figures 10 and 11 depict the graph-oriented representations of the approximate and the approximate min-max association rules extracted from context $D$ for minsupp = 2/6 and minconf = 2/5. Each edge between two vertices $v_a$ and $v_c$ represents the approximate rule $a \to c \setminus a$.

In figure 11, each edge between two vertices $v_g$ and $v_f$ represents the min-max approximate rule $g \to f \setminus g$ where $g$ is a generator and $f$ a frequent closed superset of $g$. That is to say an edge between a minimal vertex of a closed interval and the maximal vertex of another closed interval above the first one. For instance, the edge [...]

[Figure 10: graph over the frequent itemsets A, B, C, E, AB, AC, AE, BC, BE, CE, ABC, ABE, ACE, BCE and ABCE; the legend marks generator itemsets.]

Figure 10. Approximate association rules extracted from $D$.

[Figure 11: graph over the same frequent itemsets; the legend marks closed intervals and generator itemsets.]

Figure 11. Approximate min-max association rules extracted from $D$.

To derive all approximate rules, when there is an edge between two vertices of two closed intervals we create all possible edges between each vertex of the first interval and each vertex of the second interval. All these rules have the same support and