
Generating a condensed representation for association rules


HAL Id: hal-00363015

https://hal.archives-ouvertes.fr/hal-00363015

Submitted on 26 Apr 2010


Generating a condensed representation for association rules

Nicolas Pasquier, Rafik Taouil, Yves Bastide, Gerd Stumme, Lotfi Lakhal

To cite this version:

Nicolas Pasquier, Rafik Taouil, Yves Bastide, Gerd Stumme, Lotfi Lakhal. Generating a condensed representation for association rules. Journal of Intelligent Information Systems, Springer Verlag, 2005, 24 (1), pp.29-60. ⟨10.1007/s10844-005-0266-z⟩. ⟨hal-00363015⟩

Nicolas Pasquier (nicolas.pasquier@unice.fr)
I3S (CNRS UMR 6070) - Université de Nice-Sophia Antipolis, 06903 Sophia Antipolis, France

Rafik Taouil (taouil@univ-tours.fr)
LI - Université François Rabelais de Tours, 3 place Jean Jaurès, 41000 Blois, France

Yves Bastide (yves.bastide@irisa.fr)
IRISA-INRIA Rennes, campus universitaire de Beaulieu, 35042 Rennes, France

Gerd Stumme (stumme@uni-kassel.de)
Fachbereich Mathematik/Informatik, Universität Kassel, 34121 Kassel, Germany

Lotfi Lakhal (lotfi.lakhal@lim.univ-mrs.fr)
LIM (CNRS FRE 2246) - Université de la Méditerranée, case 901, 13288 Marseille, France

Abstract. Association rule extraction from operational datasets often produces several tens of thousands, and even millions, of association rules. Moreover, many of these rules are redundant and thus useless. Using a semantic based on the closure of the Galois connection, we define a condensed representation for association rules. This representation is characterized by frequent closed itemsets and their generators. It contains the non-redundant association rules having minimal antecedent and maximal consequent, called min-max association rules. We think that these rules are the most relevant since they are the most general non-redundant association rules. Furthermore, this representation is a basis, i.e., a generating set for all association rules, their supports and their confidences, and all of them can be retrieved without accessing the data. We introduce algorithms for extracting this basis and for reconstructing all association rules. Results of experiments carried out on real datasets show the usefulness of this approach. In order to generate this basis when an algorithm for extracting frequent itemsets, such as Apriori for instance, is used, we also present an algorithm for deriving frequent closed itemsets and their generators from frequent itemsets without using the dataset.

Keywords: Data mining, Galois closure operator, frequent closed itemsets, generators, min-max association rules, basis for association rules, condensed representation.

1. Introduction

The purpose of association rule extraction, introduced in (Agrawal et al., 1993), is to discover significant relations between binary attributes, called items, in large datasets. An example of an association rule extracted from a dataset of supermarket sales is: 'cereals ∧ sugar → milk (support=7%, confidence=67%)'. This rule states that customers who buy cereals and sugar also tend to buy milk. The support measure defines the range of the rule, i.e., the proportion of customers who bought the three items among all customers. The confidence measure defines the precision of the rule, i.e., the proportion of customers who bought milk among those who bought cereals and sugar. Only rules with support and confidence above some minimal support and confidence thresholds, defined by the analyst according to the application, are extracted.

1. Extracting frequent itemsets and their support from the dataset. Frequent itemsets are sets of items contained in a proportion of objects above the minimum support threshold.

2. Generating association rules from frequent itemsets and supports. Only rules with confidence above the minimum confidence threshold are generated.

The first phase is the most computationally intensive, since the number of potential frequent itemsets is exponential in the size of the set of items and several dataset scans, very expensive in execution time, are required to count their supports. Classical approaches can be classified into three main trends. Approaches in the first trend are based on the levelwise extraction of frequent itemsets (Agrawal and Srikant, 1994; Mannila et al., 1994). That is a breadth-first exploration of the search space where all potential frequent itemsets of a given size are considered simultaneously (Mannila and Toivonen, 1997). These approaches are efficient for mining association rules from weakly correlated data, such as market basket data, but performance decreases drastically when data are dense or correlated, such as statistical data for instance. Approaches in the second trend are based on the extraction of maximal frequent itemsets (Bayardo, 1998; Lin and Kedem, 1998; Zaki et al., 1997) to improve the efficiency. Once all maximal frequent itemsets are extracted, all frequent itemsets are derived and their supports are counted in the dataset. In the third trend, approaches are based on the extraction of frequent closed itemsets (Pasquier et al., 1998; Zaki and Ogihara, 1998) defined using the Galois closure operator. These approaches first extract all frequent closed itemsets and then derive both frequent itemsets and their supports from them, without dataset access. In the case of dense or correlated data, there are far fewer frequent closed itemsets than frequent itemsets and thus these approaches improve the extraction efficiency compared to approaches in the first trend. Compared to approaches in the second trend, approaches based on frequent closed itemsets can be more efficient in the case of correlated data due to the cost of generating all subsets of the maximal frequent itemsets and counting their supports in the dataset.

Another major research topic in data mining is the problem of relevance and usefulness of extracted association rules. This problem is related to the number of extracted rules, which is most often very large, and to the important proportion of redundant rules, i.e., rules bringing the same information, among them. This problem becomes crucial when data are dense or correlated, such as statistical data, telecommunication data or nominative market basket data (Bayardo et al., 2000; Brin et al., 1997; Silverstein et al., 1998). For instance, using a census dataset sample constituted of 10,000 objects, each one containing values of 73 binary attributes, more than 2,000,000 association rules with support and confidence above 90% were extracted. The analyst is then confronted with the following problems: How to handle such a list of association rules? Is it possible to reduce its size without losing information? Moreover, the inspection of extracted association rules showed that redundant rules represent the majority of them. Their suppression will thus considerably reduce the number of rules to be handled by the analyst. In the previous example, this suppression reduced the number of rules to a few thousand. In addition, redundant rules can be misleading as discussed in example 1. Thus, the following question arises: How to reduce extracted association rules to a smaller list containing only non-redundant association rules?

Example 1. To illustrate the problem of redundant association rules, we present nine rules extracted from the Mushrooms dataset describing characteristics of 8416 mushrooms (Blake and Merz, 1998) in table I. These rules have identical supports and confidences, of 51% and 54% respectively, and the item free gills in the antecedent.

Table I. Redundant association rules.

1) free gills → edible
2) free gills → edible, partial veil
3) free gills → edible, white veil
4) free gills → edible, partial veil, white veil
5) free gills, partial veil → edible
6) free gills, partial veil → edible, white veil
7) free gills, white veil → edible
8) free gills, white veil → edible, partial veil
9) free gills, partial veil, white veil → edible

Obviously, rules 1 to 3 and 5 to 9 do not add any information to rule 4 since all these rules have identical supports and confidences. We thus say that these rules are redundant compared to rule 4, the most relevant from the analyst's point of view, for it summarizes the nine rules. This rule has a minimal antecedent (left-hand side) and a maximal consequent (right-hand side) among the nine rules. Moreover, examining only one of the eight other rules, say rule 9 for instance, the analyst will believe that a mushroom has a 54% chance of being edible if it has free gills and a partial white veil. As a matter of fact, it has a 54% chance of being edible and having a partial white veil if it has free gills. Redundant rules can therefore be misleading and cause misinterpretations of the results. We believe that extracting only rule 4 will improve the result relevance.

In the rest of the paper, we differentiate exact association rules, noted $l \Rightarrow l'$, that have a 100% confidence, and approximate association rules, noted $l \rightarrow l'$, that have a confidence lower than 100%. Exact association rules are valid for all objects in the dataset whereas approximate association rules are valid for a proportion of objects equal to their confidence.

1.1. Related Work

Approaches addressing this issue can be classified into three main trends. Approaches in the first trend provide mechanisms for filtering extracted association rules. In the two other trends, approaches extend the definition of association rules in order not to extract similar ones.

Approaches in the first trend allow the analyst to define some templates (Baralis and Psaila, 1997; Klemettinen et al., 1994), boolean operators (Bayardo et al., 2000; Ng et al., 1998; Srikant et al., 1997) or SQL-like operators (Meo et al., 1998)

boolean operators are coupled with further measures of usefulness of the rules. By selecting a subset of all extracted association rules, these approaches reduce the number of rules to handle during the visualization, but redundancies are not suppressed.

In the second trend, some approaches use a taxonomy of items to extract generalized association rules (Han and Fu, 1999; Srikant and Agrawal, 1995), i.e., association rules between sets of items that belong to different levels of the taxonomy. Some approaches use statistical measures, such as Pearson's correlation or the $\chi^2$ test for instance, instead of the confidence to determine the precision of the rule (Brin et al., 1997; Morimoto et al., 1998; Silverstein et al., 1998). Other approaches in this trend extract only rules with maximal antecedents among those with the same supports and the same consequents (Srikant and Agrawal, 1996; Toivonen et al., 1995). That is, a rule $r$ will be pruned if another rule $r'$ has the same consequent and an antecedent that is a superset of the one of $r$. In example 1, rules 4, 6, 8 and 9 have maximal antecedents and will be extracted. Finally, the approach proposed in (Bayardo and Agrawal, 1999) identifies optimal rules according to several interestingness metrics (confidence, conviction, lift, Laplace, gain, etc.) and a partial order on the rules.

Approaches in the third trend make use of the closure of the Galois connection to extract bases, or reduced covers, for association rules. Informally, a basis is a non-redundant set that is minimal according to some mathematical property and from which all association rules are deducible, with support and confidence, without accessing the dataset. These bases are adaptations of the Duquenne-Guigues basis for global implications (Duquenne and Guigues, 1986; Ganter and Wille, 1999) and the Luxenburger basis for partial implications (Luxenburger, 1991). They were introduced in Formal Concept Analysis and their adaptation to the association rule framework is studied in (Pasquier et al., 1999c; Taouil et al., 2000; Zaki, 2000). In the Duquenne-Guigues basis for exact association rules, antecedents of rules are frequent pseudo-closed itemsets and consequents are frequent closed itemsets. In the Luxenburger basis for approximate association rules, both antecedents and consequents are frequent closed itemsets: We select approximate rules with both a maximal antecedent and a maximal consequent among rules having identical supports and confidences. In example 1, rule 9 will be the only one extracted. The union of the Duquenne-Guigues and the Luxenburger bases is a basis for all association rules. This basis is minimal with respect to the number of rules and, since for most data types there are far fewer frequent closed and pseudo-closed itemsets than there are frequent itemsets, it is very small. However, it does not contain non-redundant rules with minimal antecedent and maximal consequent.

In previous works about the pruning of redundant implication rules (functional dependencies), such as the canonical and the minimum covers definitions (Beeri and Bernstein, 1979; Maier, 1980), redundant rules are defined according to an inference system based on Armstrong axioms (Armstrong, 1974). However, these results cannot be directly applied to the association rule framework since redundant association rules cannot be defined according to this system: Supports and confidences are important information that must be considered to characterize redundant

The idea behind non-redundant association rules as defined hereafter is to identify the most relevant rules, each one bringing the same information as several others.

1.2. Contribution

Our goal is to improve association rules relevance and usefulness by extracting as few rules as possible without losing information. To achieve this, we propose to generate a condensed representation (Mannila and Toivonen, 1996) by maximizing the information brought by each rule. As pointed out in example 1, we believe that the most relevant association rules are the most general² non-redundant rules: Those with minimal antecedent and maximal consequent. Extracting such rules will improve the result usefulness, while reducing its size. Therefore, in the following:

• We define non-redundant association rules with minimal antecedent and maximal consequent, called min-max association rules. These rules are defined using the semantic for association rule extraction based on the Galois closure. Their antecedents and consequents are characterized by frequent closed itemsets and their generators (Pasquier et al., 1998).

• We show that the min-max association rules constitute a basis, called min-max basis for association rules. All association rules can be deduced by generating all the sub-rules of the min-max association rules, considering their supports and confidences.

• We propose efficient algorithms to generate the min-max basis from frequent closed itemsets and their generators, such as extracted by the Close (Pasquier et al., 1998; Pasquier et al., 1999b) and the A-Close (Pasquier et al., 1999a) algorithms. We also introduce algorithms to reconstruct all association rules, or a part of them, from this basis without having to access the data.

• We present the Close+ algorithm that identifies frequent closed itemsets, their generators and their supports among frequent itemsets and their supports. This algorithm is simple and efficient since it does not require any dataset access. It enables the generation of the min-max basis when an algorithm for extracting all frequent itemsets, such as Apriori (Agrawal and Srikant, 1994) for instance, is used.

Extracting min-max association rules minimizes as much as possible the number of rules while keeping the same information in the result: Only the most general non-overlapping association rules are extracted and therefore redundant rules are pruned. Since for many real datasets redundant rules represent the majority of extracted rules, the reduction will almost always be significant. This reduction will be considerable in the case of dense or correlated data, for which the total number of rules is very large and most are redundant (Bayardo and Agrawal, 1999; Brin et al., 1997; Silverstein et al., 1998).

² We say that a rule $r : a \rightarrow c$ is more general than a rule $r' : a' \rightarrow c'$ if they have identical supports and confidences, the antecedent $a$ of $r$ is a subset of $a'$ and the consequent $c$ of $r$ is a superset of $c'$. $r'$ is then called a sub-rule of $r$, and $r$ a super-rule of $r'$.

With the min-max basis, the analyst is presented a set of rules covering all the attributes of the dataset: All of the data-space is characterized by the min-max rules, overcoming an important deficiency of most reduction methods where large sub-spaces of the data-space may be poorly characterized or even entirely uncharacterized (Bayardo and Agrawal, 1999). This property helps ensure that rules surprising for the analyst, which are important information (Piatetsky and Matheus, 1994; Silberschatz and Tuzhilin, 1996), will be present. Moreover, the min-max basis does not represent any information loss for the analyst: all information brought by the set of all association rules is brought by the min-max basis. This approach does not suffer from the problem of information loss from the analyst's point of view, which is an important drawback in association rule reduction methods (Liu et al., 1999). If the analyst so wishes, it is also possible to efficiently deduce all other association rules, with supports and confidences, from the min-max basis alone.

1.3. Organization

In section 2, we recall the semantic for association rules based on the Galois connection and the Close algorithm for extracting frequent closed itemsets and generators. We also present the Close+ algorithm for efficiently deriving frequent closed itemsets, their generators and their supports from frequent itemsets and their supports. Min-max association rules and the min-max basis for association rules are defined in section 3. Algorithms for generating this basis are also presented. In section 4, we present simple methods and algorithms for deriving all association rules from the min-max basis. Results of experiments conducted to evaluate the usefulness of this approach are given in section 5 and section 6 concludes the paper.

2. Semantic for association rules based on the Galois connection

The association rule extraction is performed from a data mining context, that is a triplet $\mathcal{D} = (\mathcal{O}, \mathcal{I}, \mathcal{R})$, where $\mathcal{O}$ and $\mathcal{I}$ are finite sets of objects and items respectively, and $\mathcal{R} \subseteq \mathcal{O} \times \mathcal{I}$ is a binary relation. Each couple $(o, i) \in \mathcal{R}$ denotes the fact that the object $o \in \mathcal{O}$ is related to the item $i \in \mathcal{I}$. An itemset $l$ is a set of items $l \subseteq \mathcal{I}$, $l \neq \emptyset$.

Example 2. A data mining context $\mathcal{D}$ constituted of six objects, each one identified by its OID, and five items is represented in table II. This context is used as support for the examples in the rest of the paper.

The Galois connection of a finite binary relation (Ganter and Wille, 1999) is a pair of mappings ($\phi$, $\psi$). $\phi$ associates with a set of objects $O \subseteq \mathcal{O}$ the items related to all objects $o \in O$ and $\psi$ associates with an itemset $l \subseteq \mathcal{I}$ the objects related to all items $i \in l$. When an object $o$ is related to all items $i \in l$, we say that $o$ contains $l$. We denote by minsupp and minconf the minimal support and confidence thresholds.

Definition 1. (Frequent itemsets) The support of an itemset $l$ is the proportion of objects in the context containing $l$: $supp(l) = |\psi(l)| / |\mathcal{O}|$. $l$ is a frequent itemset if $supp(l) \geq$ minsupp.

Table II. Data mining context $\mathcal{D}$.

OID   Items
1     A C D
2     B C E
3     A B C E
4     B E
5     A B C E
6     B C E

Definition 2. (Association rules) An association rule $r$ is an implication between two frequent itemsets $l_1, l_2 \subseteq \mathcal{I}$ with the form $l_1 \rightarrow (l_2 \setminus l_1)$ where $l_1 \subset l_2$. The support and confidence of $r$ are defined by: $supp(r) = supp(l_2)$, $conf(r) = supp(l_2) / supp(l_1)$.
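Definitions 1 and 2 can be checked by hand on the context of table II. The following Python sketch is only an illustration: the names `CONTEXT`, `psi`, `supp` and `rule_measures` are chosen here and do not come from the paper, and supports are kept as exact fractions to avoid rounding.

```python
from fractions import Fraction

# Data mining context of table II: object identifier -> set of items.
CONTEXT = {
    1: frozenset("ACD"),
    2: frozenset("BCE"),
    3: frozenset("ABCE"),
    4: frozenset("BE"),
    5: frozenset("ABCE"),
    6: frozenset("BCE"),
}


def psi(itemset, context=CONTEXT):
    """Identifiers of the objects that contain every item of the itemset."""
    items = frozenset(itemset)
    return {oid for oid, obj in context.items() if items <= obj}


def supp(itemset, context=CONTEXT):
    """Support of an itemset: proportion of objects containing it."""
    return Fraction(len(psi(itemset, context)), len(context))


def rule_measures(l1, l2, context=CONTEXT):
    """Support and confidence of the rule l1 -> (l2 minus l1), with l1 a proper subset of l2."""
    l1, l2 = frozenset(l1), frozenset(l2)
    if not l1 < l2:
        raise ValueError("l1 must be a proper subset of l2")
    return supp(l2, context), supp(l2, context) / supp(l1, context)
```

For instance, `rule_measures("C", "BCE")` describes the rule C → BE: four of the six objects contain BCE and five contain C, so the support is 4/6 and the confidence is 4/5.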

The closure operator $\gamma = \phi \circ \psi$ associates with an itemset $l$ the maximal set of items common to all the objects containing $l$: The closure of an itemset is equal to the intersection of all the objects containing it. Using this closure operator, we define the frequent closed itemsets.

Definition 3. (Frequent closed itemsets) A frequent itemset $l \subseteq \mathcal{I}$ is a frequent closed itemset iff $\gamma(l) = l$. The minimal closed itemset containing an itemset $l$ is its closure $\gamma(l)$.
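As a minimal sketch of the closure operator on the context of table II, the two mappings of the Galois connection can be written directly as set operations. The function names `phi`, `psi`, `gamma` and `is_closed` are illustrative choices, and the convention used when no object contains the itemset (return all items) is an assumption made here for completeness.

```python
# Data mining context of table II: object identifier -> set of items.
CONTEXT = {
    1: frozenset("ACD"),
    2: frozenset("BCE"),
    3: frozenset("ABCE"),
    4: frozenset("BE"),
    5: frozenset("ABCE"),
    6: frozenset("BCE"),
}


def psi(itemset, context=CONTEXT):
    """Objects related to all items of the itemset."""
    items = frozenset(itemset)
    return {oid for oid, obj in context.items() if items <= obj}


def phi(objects, context=CONTEXT):
    """Items related to all objects of the given set of identifiers."""
    objects = list(objects)
    if not objects:
        # Assumed convention: with no object, every item is vacuously shared.
        return frozenset().union(*context.values())
    return frozenset.intersection(*(context[oid] for oid in objects))


def gamma(itemset, context=CONTEXT):
    """Closure: intersection of all the objects containing the itemset."""
    return phi(psi(itemset, context), context)


def is_closed(itemset, context=CONTEXT):
    return gamma(itemset, context) == frozenset(itemset)
```

On this context, `gamma("A")` is {A, C} because the three objects containing A (1, 3 and 5) all contain C as well, so {A} is not closed while {A, C} is.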

The set of frequent closed itemsets and their supports is a minimal non-redundant generating set for all frequent itemsets and their supports, and thus for all association rules, their supports and their confidences. This theorem relies on the properties that the support of a frequent itemset is equal to the support of its closure and that maximal frequent itemsets are maximal frequent closed itemsets (Pasquier et al., 1998). In order to improve the efficiency of frequent closed itemset extraction, the Close and A-Close algorithms compute generators of frequent closed itemsets.

Definition 4. (Generators) An itemset $g \subseteq \mathcal{I}$ is a generator of a closed itemset $l$ iff $\gamma(g) = l$ and $\nexists\, g' \subseteq \mathcal{I}$ with $g' \subset g$ such that $\gamma(g') = l$. A generator of cardinality $k$ is a $k$-generator.

Generators are the minimal itemsets to consider for discovering frequent closed itemsets, by computing their closures. Based on the following lemma, Close and A-Close perform a breadth-first search for generators in a levelwise manner.

Lemma 1. All subsets $s \subseteq \mathcal{I}$ of a generator $g \subseteq \mathcal{I}$ are also generators. The closure of $s$ is a closed subset of the closure of $g$: $\gamma(s) \subset \gamma(g)$.
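To make Definition 4 concrete, the generators of a closed itemset of table II can be enumerated by brute force. This is an illustrative sketch only, not the levelwise search used by Close and A-Close; the names `closure` and `generators_of` are introduced here.

```python
from itertools import combinations

# Objects of the context of table II (identifiers are not needed here).
CONTEXT = [frozenset(s) for s in ("ACD", "BCE", "ABCE", "BE", "ABCE", "BCE")]


def closure(itemset):
    """Intersection of all objects containing the itemset (assumed non-empty)."""
    containing = [obj for obj in CONTEXT if itemset <= obj]
    return frozenset.intersection(*containing)


def generators_of(closed):
    """Minimal non-empty subsets of a closed itemset whose closure is that itemset."""
    closed = frozenset(closed)
    found = []
    # Subsets are visited by increasing size, so any minimal generator is
    # found before the larger itemsets that contain it.
    for size in range(1, len(closed) + 1):
        for candidate in combinations(sorted(closed), size):
            candidate = frozenset(candidate)
            if closure(candidate) == closed and not any(g < candidate for g in found):
                found.append(candidate)
    return found
```

For the closed itemset {A, B, C, E} this returns {A, B} and {A, E}; in line with Lemma 1, their subsets {A}, {B} and {E} are generators of the smaller closed itemsets {A, C} and {B, E}.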

2.1. Extracting frequent closed itemsets and generators with Close

The Close algorithm is an iterative algorithm for extracting generators and frequent closed itemsets in a levelwise manner. During an iteration $k$, a list of candidate $k$-generators is considered; their closures and their supports are computed from the dataset and infrequent generators are discarded. Frequent generators are then used to construct candidate ($k$+1)-generators. The closures of frequent generators are the frequent closed itemsets and the support of a generator is also the support of its closure.

During the $k$th iteration, a set $FC_k$ is considered. Each element of this set consists of three pieces of information: a $k$-generator, its closure and their support. The algorithm first initializes the candidate 1-generators in $FC_1$ with the list of 1-itemsets and then carries out some iterations. During each iteration $k$:

1. Closures of all candidate $k$-generators and their supports are computed: The number of objects containing a generator determines its support and their intersection generates its closure. Each object is considered once and this phase requires only one scan of the dataset.

2. Infrequent $k$-generators, i.e., generators with support lower than minsupp, are removed from $FC_k$.

3. The set of candidate ($k$+1)-generators is constructed by joining the frequent $k$-generators in $FC_k$ as follows.

a) Two $k$-generators in $FC_k$ that have the same first $k-1$ items are joined to create a candidate ($k$+1)-generator. For instance, the 3-generators {ABC} and {ABD} will be joined in order to create the candidate 4-generator {ABCD}.

b) Candidate ($k$+1)-generators that are infrequent or non-minimal are removed. One of the $k$-subsets of such a generator is either infrequent or non-minimal and thus does not belong to the set of frequent $k$-generators in $FC_k$.

c) The third phase removes ($k$+1)-generators whose closures were already computed. Such a generator $g$ is easily identified as it is included in the closure of a frequent $k$-generator $g'$ in $FC_k$: We have $g' \subset g \subseteq \gamma(g')$.

The algorithm stops when no new candidate generator can be created. Then, each set $FC_k$ stores the frequent $k$-generators, their closures and their supports.
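The iteration described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `close` and the dictionary layout of the result are chosen here, items are assumed to be sortable, and the dataset is rescanned for each candidate instead of once per iteration.

```python
from fractions import Fraction
from itertools import combinations


def close(context, minsupp):
    """Levelwise sketch of Close.

    context: dict mapping an object identifier to a frozenset of items.
    minsupp: minimal support threshold, as a Fraction.
    Returns {k: {generator: (closure, support)}} for the frequent k-generators.
    """
    objects = list(context.values())
    items = sorted(set().union(*objects))
    candidates = [frozenset([item]) for item in items]
    fc = {}
    k = 1
    while candidates:
        level = {}
        for gen in candidates:
            # Steps 1 and 2: support and closure, infrequent candidates dropped.
            # The paper does this in one dataset scan per iteration; this
            # sketch simply rescans the objects for each candidate.
            containing = [obj for obj in objects if gen <= obj]
            support = Fraction(len(containing), len(objects))
            if containing and support >= minsupp:
                level[gen] = (frozenset.intersection(*containing), support)
        if not level:
            break
        fc[k] = level
        # Step 3a: join k-generators sharing their first k-1 items.
        ordered = sorted(tuple(sorted(gen)) for gen in level)
        candidates = []
        for a, b in combinations(ordered, 2):
            if a[:-1] != b[:-1]:
                continue
            cand = frozenset(a) | frozenset(b)
            subsets = [cand - {item} for item in cand]
            # Step 3b: every k-subset must be a frequent k-generator.
            if any(s not in level for s in subsets):
                continue
            # Step 3c: skip candidates included in the closure of a k-subset.
            if any(cand <= level[s][0] for s in subsets):
                continue
            candidates.append(cand)
        k += 1
    return fc


# Context of table II, with minsupp = 2/6 as in example 3.
CONTEXT = {1: frozenset("ACD"), 2: frozenset("BCE"), 3: frozenset("ABCE"),
           4: frozenset("BE"), 5: frozenset("ABCE"), 6: frozenset("BCE")}
FC = close(CONTEXT, Fraction(2, 6))
```

On this input, `FC[1]` holds the generators {A}, {B}, {C} and {E}, and `FC[2]` holds {AB}, {AE}, {BC} and {CE}; the only candidate 3-generator, {ABE}, is discarded in step 3b because {BE} is not a frequent 2-generator.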

Example 3. Figure 1 shows the execution of the Close algorithm on the context $\mathcal{D}$ for minsupp = 2/6. The set $FC_1$ is initialized with the list of all 1-itemsets. The algorithm computes supports and closures of the 1-generators in $FC_1$ and infrequent ones are discarded. Then, joining the frequent generators in $FC_1$, six new candidate 2-generators are created: {AB}, {AC}, {AE}, {BC}, {BE} and {CE} in $FC_2$. The 2-generators {AC} and {BE} are removed from $FC_2$ because we have {AC} $\subseteq \gamma$({A}) and {BE} $\subseteq \gamma$({B}). The algorithm determines supports and closures of the remaining 2-generators in $FC_2$ and suppresses infrequent ones. Then, the candidate 3-generator {ABE} is created by joining the frequent generators in $FC_2$ but is removed because the 2-generator {BE} $\subset$ {ABE} is not in $FC_2$ and the algorithm stops.

Scan D → FC_1

Generator   Closed itemset   Supp
{A}         {AC}             3/6
{B}         {BE}             5/6
{C}         {C}              5/6
{D}         {ACD}            1/6
{E}         {BE}             5/6

Pruning infrequent itemsets → FC_1

Generator   Closed itemset   Supp
{A}         {AC}             3/6
{B}         {BE}             5/6
{C}         {C}              5/6
{E}         {BE}             5/6

Scan D → FC_2

Generator   Closed itemset   Supp
{AB}        {ABCE}           2/6
{AE}        {ABCE}           2/6
{BC}        {BCE}            4/6
{CE}        {BCE}            4/6

Pruning infrequent itemsets → FC_2

Generator   Closed itemset   Supp
{AB}        {ABCE}           2/6
{AE}        {ABCE}           2/6
{BC}        {BCE}            4/6
{CE}        {BCE}            4/6

Figure 1. Extracting frequent closed itemsets in the context $\mathcal{D}$ with Close.

The A-Close algorithm improves the efficiency of the extraction in case of weakly correlated data. It does not compute closures of candidate generators during the iterations, but during an ultimate scan carried out after the end of these iterations if necessary. Experimental results show that Close and A-Close are particularly efficient for mining association rules from dense or correlated data. On such data, Close outperforms A-Close, and they both outperform algorithms for extracting frequent itemsets and maximal frequent itemsets. In that case, algorithms for extracting maximal frequent itemsets suffer from the cost of the frequent itemset supports computation that requires accessing the dataset. On the contrary, for weakly correlated data, algorithms for extracting maximal frequent itemsets are the most efficient and algorithms for extracting frequent itemsets, as well as A-Close, outperform Close.

The ChARM (Zaki and Hsiao, 1999) and Closet (Pei et al., 2000) algorithms extract frequent closed itemsets. However, neither of these algorithms extracts generators, so neither can be used to generate the min-max basis for association rules. The Pascal (Bastide et al., 2000) algorithm is an optimization of Apriori based on inference counting and equivalence classes defined according to itemset supports. It can easily be extended to generate the min-max basis since generators and closed itemsets are respectively bottom and top patterns of an equivalence class.

2.2. Deriving frequent closed itemsets and generators from frequent itemsets

The Close+ algorithm identifies frequent closed itemsets and generators among frequent itemsets without accessing the dataset. It enables the efficient generation of the min-max basis when an algorithm for extracting frequent itemsets is used. Such an algorithm gives as result the sets $F_k$, each set $F_k$ containing all frequent $k$-itemsets, with $k$ varying from 1 to $\mu$ (the size of the longest maximal frequent itemsets). The frequent closed itemsets and generators are identified among frequent itemsets using propositions 1 and 2 that are derived from the property that an

is ensured by the property that maximal frequent itemsets are maximal frequent closed itemsets (Pasquier et al., 1998).

Proposition 1. The support of a generator is smaller than the supports of all its subsets.

Proof. Let $g$ be a $k$-generator and $s$ a ($k-1$)-subset of $g$. We then have $s \subset g \Rightarrow \psi(s) \supseteq \psi(g)$. If $\psi(s) = \psi(g)$ then $\gamma(s) = \gamma(g)$ and $g$ is not a generator: It is not a minimal itemset whose closure is $\gamma(g)$. It follows that $\psi(s) \supset \psi(g) \Rightarrow supp(g) < supp(s)$.

Proposition 2. The support of a closed itemset is greater than the supports of all its supersets.

Proof. Let $l$ be a closed $k$-itemset and $s$ a superset of $l$. We then have $l \subset s \Rightarrow \psi(l) \supseteq \psi(s)$. If $\psi(l) = \psi(s)$ then $\gamma(l) = \gamma(s) \Rightarrow l = \gamma(s) \Rightarrow s \subseteq l$ (absurd). It follows that $\psi(l) \supset \psi(s) \Rightarrow supp(l) > supp(s)$.
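Both propositions can be checked exhaustively on the small context of table II. The sketch below is a sanity check rather than a proof; the names `count`, `closure`, `is_generator`, `PROP1_HOLDS` and `PROP2_HOLDS` are illustrative, and only itemsets contained in at least one object are enumerated.

```python
from itertools import combinations

# Objects of the context of table II.
CONTEXT = [frozenset(s) for s in ("ACD", "BCE", "ABCE", "BE", "ABCE", "BCE")]
ITEMS = sorted(set().union(*CONTEXT))


def count(itemset):
    """Number of objects containing the itemset (support numerator)."""
    return sum(1 for obj in CONTEXT if itemset <= obj)


def closure(itemset):
    """Intersection of the objects containing the itemset."""
    return frozenset.intersection(*(obj for obj in CONTEXT if itemset <= obj))


# Every non-empty itemset contained in at least one object.
ITEMSETS = [frozenset(c)
            for size in range(1, len(ITEMS) + 1)
            for c in combinations(ITEMS, size)
            if count(frozenset(c)) > 0]


def is_generator(itemset):
    # Checking the immediate subsets is enough: a smaller subset with the
    # same closure would force an immediate subset to share it as well.
    return all(closure(itemset - {i}) != closure(itemset)
               for i in itemset if len(itemset) > 1)


GENERATORS = [l for l in ITEMSETS if is_generator(l)]
CLOSED = [l for l in ITEMSETS if closure(l) == l]

# Proposition 1: a generator has a smaller support than its (k-1)-subsets.
PROP1_HOLDS = all(count(g) < count(g - {i})
                  for g in GENERATORS if len(g) > 1 for i in g)
# Proposition 2: a closed itemset has a greater support than its supersets.
PROP2_HOLDS = all(count(l) > count(s)
                  for l in CLOSED for s in ITEMSETS if l < s)
```

The check covers infrequent itemsets such as {D} and {ACD} too, since neither proposition depends on the minimal support threshold.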

The pseudo-code of the Close+ algorithm is given in figure 2. It examines successively all frequent itemsets in each set $F_k$, with $k$ varying from 1 to $\mu$. It generates the sets $FC_m$, $1 \leq m \leq \nu$, where $\nu$ is the size of the longest generators, containing the $m$-generators, their closures and their supports. It first determines if a frequent $k$-itemset is a generator by examining all its ($k-1$)-subsets' supports; it then determines if it is a closed itemset by examining all its ($k+1$)-supersets' supports and if so, identifies its generators by examining all its subsets' supports. The boolean variables isclosed and isgenerator are used to determine if an itemset $l$ is a closed itemset or is a generator.

At the beginning of the $k$th iteration (steps 1 to 21), the set $FC_k$ is empty (step 2). In steps 3 to 20, frequent itemsets in $F_k$ are considered successively. If an itemset $l$ has the same support as one of its ($k-1$)-subsets $l'$ in $F_{k-1}$ (steps 5 to 7), then $l$ is not a generator (step 6). Otherwise, $l$ and its support are inserted in $FC_k$ (step 8). Then, we test if $l$ has the same support as one of its ($k$+1)-supersets $l''$ in $F_{k+1}$ (steps 10 to 12). If so, we have $l'' \subseteq \gamma(l)$ and then $l \neq \gamma(l)$: $l$ is not closed (step 11). Otherwise, $l$ is a frequent closed itemset and we determine the generators of $l$ (steps 13 to 19) as follows. For each generator $g$ of size $n$, with $1 \leq n \leq k$, that is a subset of $l$ (steps 14 to 18), if the supports of $g$ and $l$ are equal then $g$ is a generator of $l$ and $l$ is inserted in $FC_n$ as the closure of $g$ (step 16). Thus, at the end of the algorithm, each set $FC_k$ contains all frequent $k$-generators, their closures and their supports.

Correctness. The correctness of the computation of sets $FC_k$ for $1 \leq k \leq \mu$ relies on propositions 1 and 2. Using the first one, we determine if a frequent $k$-itemset $l$ is a generator of a closed itemset by comparing its support and the supports of the frequent ($k-1$)-itemsets included in $l$. The second proposition makes it possible to determine if a frequent $k$-itemset $l$ is closed by comparing its support and the supports of the frequent ($k$+1)-itemsets in which $l$ is included. Since a generator has the same support as its closure, the determination of the generators of a closed itemset is

Input : sets F_k of frequent k-itemsets
Output: sets FC_k of frequent k-generators, with closure and support

 1) for k = 1 to µ do
 2)   FC_k ← ∅
 3)   forall itemsets l ∈ F_k do
 4)     isgenerator ← true
 5)     forall subsets l' ∈ F_(k−1) of l do
 6)       if (l'.supp = l.supp) then isgenerator ← false
 7)     end
 8)     if (isgenerator = true) then insert l in FC_k.generators with l.supp
 9)     isclosed ← true
10)     forall supersets l'' ∈ F_(k+1) of l do
11)       if (l''.supp = l.supp) then isclosed ← false
12)     end
13)     if (isclosed = true) then do
14)       for n = k to 0 step −1 do
15)         forall subsets g ∈ FC_n.generators of l do
16)           if (g.supp = l.supp) then insert l in g.closure
17)         end
18)       end
19)     end
20)   end
21) end
22) return ∪ FC_k

Figure 2. Close+ algorithm for deriving frequent closed itemsets and generators.
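One possible Python rendering of the pseudo-code of figure 2 is sketched below. It is an illustration under assumptions rather than the authors' code: the function name `close_plus`, the flat dictionary input and the result layout are chosen here, and the input is assumed to contain every frequent itemset with its support, so that all subsets of a listed itemset are listed as well.

```python
from fractions import Fraction


def close_plus(frequent):
    """Derive generators and their closures from frequent itemsets alone.

    frequent: dict mapping each frequent itemset (frozenset) to its support.
    Returns {k: {generator: (closure, support)}}.
    """
    by_size = {}
    for itemset, support in frequent.items():
        by_size.setdefault(len(itemset), {})[itemset] = support

    closure_of = {}  # generator -> its closure (None until the closure is met)
    for k in sorted(by_size):
        for l, support in by_size[k].items():
            # Proposition 1 (steps 4 to 8): l is a generator when no
            # (k-1)-subset has the same support; 1-itemsets always qualify.
            if all(frequent[l - {i}] != support for i in l if k > 1):
                closure_of[l] = None
            # Proposition 2 (steps 9 to 12): l is closed when no frequent
            # (k+1)-superset has the same support.
            supersets = by_size.get(k + 1, {})
            if all(s != support for sup, s in supersets.items() if l < sup):
                # Steps 13 to 19: l is the closure of every generator it
                # contains that has the same support.
                for g in closure_of:
                    if g <= l and frequent[g] == support:
                        closure_of[g] = l

    fc = {}
    for g, closed in closure_of.items():
        fc.setdefault(len(g), {})[g] = (closed, frequent[g])
    return fc


# Frequent itemsets of the context of table II for minsupp = 2/6,
# given as the number of objects (out of six) containing each of them.
SUPPORT_COUNTS = {"A": 3, "B": 5, "C": 5, "E": 5,
                  "AB": 2, "AC": 3, "AE": 2, "BC": 4, "BE": 5, "CE": 4,
                  "ABC": 2, "ABE": 2, "ACE": 2, "BCE": 4, "ABCE": 2}
FREQUENT = {frozenset(k): Fraction(v, 6) for k, v in SUPPORT_COUNTS.items()}
FC = close_plus(FREQUENT)
```

The dataset itself never appears in this sketch: only the supports stored in `FREQUENT` are compared, which is the point of the algorithm.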

Example 4. Figure 3 shows the execution of the Close+ algorithm using the sets $F_1$ to $F_4$ of frequent itemsets extracted from the context $\mathcal{D}$ with minsupp = 2/6. All frequent 1-itemsets are frequent 1-generators since none of their subsets is a frequent itemset: The empty set is not considered as a frequent itemset. The 1-itemset {C} is also its own closure since all its supersets in $F_2$ have a smaller support. In $F_2$, the 2-itemsets {AC} and {BE} are not generators since they have the same support as itemsets {A}, and {B} and {E} respectively. These two itemsets are closed since their support is greater than those of all their supersets in $F_3$; {AC} is the closure of {A} and {BE} is the closure of {B} and {E}. No frequent 3-itemset in $F_3$ is a generator and {BCE}, that has the same support as {BC} and {CE} and a greater support than {ABCE} in $F_4$, is the closure of {BC} and {CE} in $FC_2$. Finally, the 4-itemset {ABCE} is closed since it is a maximal frequent itemset; it is the closure of {AB} and {AE}, and is inserted in $FC_2$.

Remark. As a simple optimization, the algorithm can stop testing if frequent $k$-itemsets are generators after the first iteration $n$ during which no frequent $n$-itemset examined is a generator. In example 4, the algorithm will not test if 4-itemsets in $F_4$ are generators since no 3-itemset is a generator ($FC_3$ is empty at the end of the third iteration).

Iteration 1

F_1
Itemset   Supp
{A}       3/6
{B}       5/6
{C}       5/6
{E}       5/6

Generators of size 1 → FC_1
Generator   Closed itemset   Supp
{A}         -                3/6
{B}         -                5/6
{C}         -                5/6
{E}         -                5/6

Closures of size 1 → FC_1
Generator   Closed itemset   Supp
{A}         -                3/6
{B}         -                5/6
{C}         {C}              5/6
{E}         -                5/6

Iteration 2

F_2
Itemset   Supp
{AB}      2/6
{AC}      3/6
{AE}      2/6
{BC}      4/6
{BE}      5/6
{CE}      4/6

Generators of size 2 → FC_2
Generator   Closed itemset   Supp
{AB}        -                2/6
{AE}        -                2/6
{BC}        -                4/6
{CE}        -                4/6

Closures of size 2 → FC_1
Generator   Closed itemset   Supp
{A}         {AC}             3/6
{B}         {BE}             5/6
{C}         {C}              5/6
{E}         {BE}             5/6

Iteration 3

F_3
Itemset   Supp
{ABC}     2/6
{ABE}     2/6
{ACE}     2/6
{BCE}     4/6

Generators of size 3 → FC_3 (empty)

Closures of size 3 → FC_2
Generator   Closed itemset   Supp
{AB}        -                2/6
{AE}        -                2/6
{BC}        {BCE}            4/6
{CE}        {BCE}            4/6

Iteration 4

F_4
Itemset   Supp
{ABCE}    2/6

Generators of size 4 → FC_4 (empty)

Closures of size 4 → FC_2
Generator   Closed itemset   Supp
{AB}        {ABCE}           2/6
{AE}        {ABCE}           2/6
{BC}        {BCE}            4/6
{CE}        {BCE}            4/6

Figure 3. Deriving frequent closed itemsets and generators with Close+.

3. Min-max basis for association rules

We first define min-max association rules: the most general non-redundant association rules according to their semantics. Informally, an association rule is redundant if it brings the same information or less information than is brought by another rule of same support and confidence. Then, the min-max association rules are the non-redundant association rules having minimal antecedent and maximal consequent: $r$ is a min-max association rule if no other association rule $r'$ has the same support and confidence, an antecedent that is a subset of the antecedent of $r$ and a consequent that is a superset of the consequent of $r$.

Definition 5. (Min-max association rules) Let $AR$ be the set of association rules extracted. An association rule $r : l_1 \to l_2 \in AR$ is a min-max association rule iff $\nexists\, r' : l'_1 \to l'_2 \in AR$, $r' \neq r$, with $\mathrm{supp}(r') = \mathrm{supp}(r)$, $\mathrm{conf}(r') = \mathrm{conf}(r)$, $l'_1 \subseteq l_1$ and $l_2 \subseteq l'_2$.
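Definition 5 can be read as a simple predicate over a set of rules. The sketch below is only an illustration: the tuple layout `(antecedent, consequent, supp, conf)` and the names `is_min_max` and `make_rule` are assumptions, and the three sample rules are exact rules of the running example.

```python
from fractions import Fraction

def is_min_max(rule, rules):
    """True if no other rule with the same support and confidence has a
    smaller-or-equal antecedent and a larger-or-equal consequent."""
    antecedent, consequent, supp, conf = rule
    return not any(
        other != rule
        and other[2] == supp and other[3] == conf
        and other[0] <= antecedent and consequent <= other[1]
        for other in rules)

def make_rule(antecedent, consequent, supp, conf):
    # Hypothetical helper: itemsets are written as strings of item names.
    return (frozenset(antecedent), frozenset(consequent), supp, conf)

rules = [
    make_rule("AB", "C", Fraction(2, 6), Fraction(1)),
    make_rule("AB", "CE", Fraction(2, 6), Fraction(1)),
    make_rule("ABC", "E", Fraction(2, 6), Fraction(1)),
]
min_max_rules = [r for r in rules if is_min_max(r, rules)]
```

Among these three rules only AB ⇒ CE is kept: AB ⇒ C has the same antecedent and a smaller consequent, and ABC ⇒ E has a larger antecedent and a smaller consequent.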

Based on this definition, we characterize exact and approximate min-max association rules that constitute respectively the min-max exact basis and the min-max approximate basis in the two following sections.

3.1. Exact min-max association rules

First, notice that exact association rules, with the form $r : l_1 \Rightarrow (l_2 \setminus l_1)$, are rules between two frequent itemsets $l_1 \subset l_2$ having the same closure: $\gamma(l_1) = \gamma(l_2)$. Since $\mathrm{conf}(r) = 1$ we have $\mathrm{supp}(l_1) = \mathrm{supp}(l_2)$, and as $l_1 \subset l_2$ we see that $\gamma(l_1) = \gamma(l_2)$. We define min-max association rules among these exact rules.

Let $g$ be a generator of $\gamma(l_1) = \gamma(l_2)$ such that $g \subseteq l_1$. Since $g$ is minimal, we have $g \subseteq l_1 \subset l_2 \subseteq \gamma(l_2)$. Furthermore, all itemsets in the interval $[g, \gamma(l_2)]$, defined by inclusion, have the same closure $\gamma(l_2)$ and thus the same support. The min-max association rule among all rules with the form $r : l_1 \Rightarrow (l_2 \setminus l_1)$ with $l_1, l_2 \in [g, \gamma(l_2)]$ is the rule $g \Rightarrow (\gamma(l_2) \setminus g)$. This rule has a minimal antecedent, $g$, and a maximal consequent, $\gamma(l_2) \setminus g$, among all these rules that have the same support.

We generalize this definition to all generators of the frequent closed itemset $\gamma(l_2)$. Let $Gen_{\gamma(l_2)}$ be the set of these generators. All exact min-max association rules constructed with $\gamma(l_2)$ are rules with the form $g \Rightarrow (\gamma(l_2) \setminus g)$ with $g \in Gen_{\gamma(l_2)}$. The extension of this property to all frequent closed itemsets defines the min-max exact basis containing all exact min-max association rules characterized in definition 5.

Definition 6. (Min-max exact basis) Let $Closed$ be the set of frequent closed itemsets extracted from the context and, for each frequent closed itemset $f$, let us denote $Gen_f$ the set of generators of $f$. The min-max exact basis is:

$$MinMaxExact = \{ r : g \Rightarrow (f \setminus g) \mid f \in Closed \wedge g \in Gen_f \wedge g \neq f \}.$$

The condition $g \neq f$ discards rules with the form $g \Rightarrow \emptyset$; it is equivalent to the condition $l_1 \subset l_2$ in the definition of association rules. We state in the following proposition that the min-max exact basis does not lead to information loss.

The pseudo-code of the algorithm for constructing the min-max exact basis using frequent closed itemsets and their generators is presented in figure 4. Each element of a set $\mathit{FC}_k$ contains three fields: a $k$-generator generator, its closure closure and their support supp. The algorithm returns the set $MinMaxExact$ containing the exact min-max rules.

Input : sets FC_k
Output: set MinMaxExact

1) MinMaxExact ← ∅
2) for k = 1 to ν do
3)   forall k-generator g ∈ FC_k do
4)     if (g ≠ g.closure)
5)     then insert {r : g ⇒ (g.closure \ g), g.supp} in MinMaxExact
6)   end
7) end
8) return MinMaxExact

Figure 4. Algorithm for generating the min-max exact basis.

First, $MinMaxExact$ is initialized with the empty set (step 1). Then, each set $\mathit{FC}_k$ is examined in increasing order of $k$ values (steps 2 to 7). For each $k$-generator $g \in \mathit{FC}_k$ of the frequent closed itemset $\gamma(g)$ (steps 3 to 6), if $g$ is different from its closure $\gamma(g)$ (step 4), the rule $r : g \Rightarrow (\gamma(g) \setminus g)$, whose support is equal to the support of $g$ and $\gamma(g)$, is inserted into $MinMaxExact$ (step 5). Finally, the algorithm returns the set $MinMaxExact$ containing all exact min-max association rules between generators and their closures (step 8).
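The steps above can be sketched in a few lines of Python. The sketch assumes that the sets $\mathit{FC}_k$ are flattened into one list `FC` of `(generator, closure, support count)` triples, taken here from the running example; the names `min_max_exact` and `FC` are illustrative only.

```python
def min_max_exact(fc):
    """Build the min-max exact basis from (generator, closure, supp) triples."""
    basis = []
    # Steps 2-3: examine the generators in increasing order of size.
    for g, closure, supp in sorted(fc, key=lambda entry: len(entry[0])):
        # Step 4: a generator equal to its closure would give the rule g => {}.
        if g != closure:
            # Step 5: rule g => (closure \ g) with the support of g.
            basis.append((g, closure - g, supp))
    return basis

# Generators, closures and support counts (out of 6) of the running example.
FC = [(frozenset(g), frozenset(f), supp) for g, f, supp in [
    ("A", "AC", 3), ("B", "BE", 5), ("C", "C", 5), ("E", "BE", 5),
    ("AB", "ABCE", 2), ("AE", "ABCE", 2), ("BC", "BCE", 4), ("CE", "BCE", 4)]]

exact_basis = min_max_exact(FC)
for antecedent, consequent, supp in exact_basis:
    print("".join(sorted(antecedent)), "=>", "".join(sorted(consequent)), f"{supp}/6")
```

The generator {C}, which is its own closure, produces no rule, so the sketch returns seven rules for these eight generators.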

Example 5. The min-max exact basis extracted from context $D$ for minsupp = 2/6 is presented in table III. It contains seven rules whereas the set of all exact association rules, presented in table IV, contains fourteen rules.

Table III. Min-max exact basis extracted from $D$.

Generator   Closure   Exact rule   Supp
{A}         {AC}      A ⇒ C        3/6
{B}         {BE}      B ⇒ E        5/6
{C}         {C}
{E}         {BE}      E ⇒ B        5/6
{AB}        {ABCE}    AB ⇒ CE      2/6
{AE}        {ABCE}    AE ⇒ BC      2/6
{BC}        {BCE}     BC ⇒ E       4/6
{CE}        {BCE}     CE ⇒ B       4/6

Table IV. Exact association rules extracted from $D$.

Exact rule   Supp      Exact rule   Supp
A ⇒ C        3/6       BC ⇒ E       4/6
B ⇒ E        5/6       CE ⇒ B       4/6
E ⇒ B        5/6       AB ⇒ CE      2/6
AB ⇒ C       2/6       AE ⇒ BC      2/6
AB ⇒ E       2/6       ABC ⇒ E      2/6
AE ⇒ B       2/6       ABE ⇒ C      2/6
AE ⇒ C       2/6       ACE ⇒ B      2/6

Proposition 3. (i) All exact association rules and their supports can be deduced from the min-max exact basis. (ii) All rules in the min-max exact basis are min-max association rules.

Proof. (i) Let $r : l_1 \Rightarrow (l_2 \setminus l_1)$ be an exact association rule between two frequent itemsets with $l_1 \subset l_2$. Since $\mathrm{conf}(r) = 1$, we have $\mathrm{supp}(l_1) = \mathrm{supp}(l_2)$ and as an itemset's support is equal to its closure's support, we deduce that $\mathrm{supp}(\gamma(l_1)) = \mathrm{supp}(\gamma(l_2))$ which, together with $\gamma(l_1) \subseteq \gamma(l_2)$, implies that $\gamma(l_1) = \gamma(l_2) = f$. The itemset $f$ is a frequent closed itemset $f \in \mathit{FC}$ and, obviously, there exists a rule $r' : g \Rightarrow (f \setminus g) \in MinMaxExact$ such that $g$ is a generator of $f$ with $g \subseteq l_1$ and $g \subset l_2$. We show now that the rule $r$ and its support can be deduced from the rule $r'$ and its support. Since $g \subseteq l_1 \subset l_2 \subseteq f$, rule $r$'s antecedent and consequent can be derived from those of rule $r'$. From $\gamma(l_1) = \gamma(l_2) = f$, we deduce that $\mathrm{supp}(r) = \mathrm{supp}(l_2) = \mathrm{supp}(\gamma(l_2)) = \mathrm{supp}(f) = \mathrm{supp}(r')$.

(ii) Let $r : g \Rightarrow (f \setminus g) \in MinMaxExact$. According to definition 6, we have $g \in Gen_f$ and $f \in Closed$. We demonstrate that there is no other rule $r' : l_1 \Rightarrow (l_2 \setminus l_1) \in MinMaxExact$ such that $\mathrm{supp}(r') = \mathrm{supp}(r)$, $\mathrm{conf}(r') = \mathrm{conf}(r)$, $l_1 \subseteq g$ and $f \subseteq l_2$. If $l_1 \subset g$ then, according to definition 4, we have $\gamma(l_1) \subset \gamma(g) = f \Longrightarrow l_1 \notin Gen_f$ and then $r' \notin MinMaxExact$. If $f \subset l_2$ then, according to definition 3, we have $f = \gamma(f) = \gamma(g) \subset l_2 = \gamma(l_2)$. From definition 4 we deduce $g \notin Gen_{l_2}$ and we conclude that $r' \notin MinMaxExact$.

3.2. Approximate min-max association rules

Approximate association rules, with the form $r : l_1 \to (l_2 \setminus l_1)$, are rules between two frequent itemsets $l_1 \subset l_2$ such that $\gamma(l_1) \subset \gamma(l_2)$. Since $\mathrm{conf}(r) < 1$ we have $\mathrm{supp}(l_1) > \mathrm{supp}(l_2)$ and we deduce that $\gamma(l_1) \subset \gamma(l_2)$.

We deduce the definition of approximate min-max association rules. Let $g_1$ be a generator of the frequent closed itemset $f_1$ and $g_2$ be a generator of the frequent closed itemset $f_2$ such that $f_1 \subset g_2 \subseteq l_2 \subseteq f_2$. All rules with the form $r : l_1 \to (l_2 \setminus l_1)$ where $l_1 \in [g_1, f_1]$ and $l_2 \in [g_2, f_2]$ have the same confidence and the same support since $g_1$, $l_1$ and $f_1$ have the same support as well as $g_2$, $l_2$ and $f_2$. We then deduce that the min-max association rule among all these rules is $g_1 \to (f_2 \setminus g_1)$. Indeed, $g_1$ is the minimal itemset in $[g_1, f_1]$ and $f_2$ is the maximal itemset in $[g_2, f_2]$.

The generalization of this property to all couples of frequent itemsets $l_1$ and $l_2$ such that $l_1 \subset l_2$ and $\mathrm{supp}(l_1) \neq \mathrm{supp}(l_2)$ defines the min-max approximate basis containing all approximate min-max association rules characterized in definition 5.

Definition 7. (Min-max approximate basis) We denote $Gen$ the set of generators of the frequent closed itemsets in $Closed$. The min-max approximate basis is:

$$MinMaxApprox = \{ r : g \to (f \setminus g) \mid f \in Closed \wedge g \in Gen \wedge \gamma(g) \subset f \}.$$

The pseudo-code of the algorithm for generating the set $MinMaxApprox$ of approximate min-max rules using frequent closed itemsets and their generators is presented in figure 5.

Input : sets FC_k, confidence threshold minconf
Output: set MinMaxApprox

 1) MinMaxApprox ← ∅
 2) for k = 1 to ν − 1 do
 3)   forall k-generator g ∈ FC_k do
 4)     forall frequent closed itemset f ∈ F_{j>k} | f ⊃ g.closure do
 5)       if (f.supp/g.supp ≥ minconf)
 6)       then insert {r : g → (f \ g), f.supp/g.supp, f.supp} in MinMaxApprox
 7)     end
 8)   end
 9) end
10) return MinMaxApprox

Figure 5. Algorithm for generating the min-max approximate basis.

The algorithm examines the sets $\mathit{FC}_k$ in increasing order of $k$ values (steps 2 to 9). For each $k$-generator $g \in \mathit{FC}_k$ (steps 3 to 8), it considers all closed supersets $f$ of the closure of $g$ (steps 4 to 7). It computes the confidence of the rule $r : g \to (f \setminus g)$ (step 5) and inserts $r$ in $MinMaxApprox$ if its confidence is at least the minconf threshold (step 6).

Example 6. The min-max approximate basis extracted from context $D$ for minsupp = 2/6 and minconf = 2/5 is presented in table V. It contains ten rules whereas the set of all approximate association rules, presented in table VI, contains thirty-six rules.
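A minimal Python sketch of this construction is given below. It assumes the generators and closures of the running example are available as `(generator, closure, support count)` triples; the names `min_max_approx` and `FC` are illustrative, and confidences are kept as exact fractions.

```python
from fractions import Fraction

def min_max_approx(fc, minconf):
    """Build the min-max approximate basis from (generator, closure, supp) triples."""
    # Frequent closed itemsets and their support counts.
    closed = {closure: supp for _, closure, supp in fc}
    basis = []
    for g, closure, g_supp in sorted(fc, key=lambda entry: len(entry[0])):
        for f, f_supp in closed.items():
            # Step 4: f must be a strict closed superset of the closure of g.
            if closure < f:
                conf = Fraction(f_supp, g_supp)  # step 5
                if conf >= minconf:
                    basis.append((g, f - g, f_supp, conf))  # step 6
    return basis

# Generators, closures and support counts (out of 6) of the running example.
FC = [(frozenset(g), frozenset(f), supp) for g, f, supp in [
    ("A", "AC", 3), ("B", "BE", 5), ("C", "C", 5), ("E", "BE", 5),
    ("AB", "ABCE", 2), ("AE", "ABCE", 2), ("BC", "BCE", 4), ("CE", "BCE", 4)]]

approx_basis = min_max_approx(FC, Fraction(2, 5))
for antecedent, consequent, supp, conf in approx_basis:
    print("".join(sorted(antecedent)), "->", "".join(sorted(consequent)),
          f"supp={supp}/6", f"conf={conf}")
```

With minconf = 2/5 the sketch returns the ten rules of Example 6; raising the threshold to 1/2 drops the three rules of confidence 2/5.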

Table V. Min-max approximate basis extracted from $D$.

Generator   Closure   Closed superset   Approximate rule   Supp   Conf
{A}         {AC}      {ABCE}            A → BCE            2/6    2/3
{B}         {BE}      {BCE}             B → CE             4/6    4/5
{B}         {BE}      {ABCE}            B → ACE            2/6    2/5
{C}         {C}       {AC}              C → A              3/6    3/5
{C}         {C}       {BCE}             C → BE             4/6    4/5
{C}         {C}       {ABCE}            C → ABE            2/6    2/5
{E}         {BE}      {BCE}             E → BC             4/6    4/5
{E}         {BE}      {ABCE}            E → ABC            2/6    2/5
{AB}        {ABCE}
{AE}        {ABCE}
{BC}        {BCE}     {ABCE}            BC → AE            2/6    2/4
{CE}        {BCE}     {ABCE}            CE → AB            2/6    2/4

Table VI. Approximate association rules extracted from $D$.

Approximate rule   Supp   Conf      Approximate rule   Supp   Conf      Approximate rule   Supp   Conf
BCE → A            2/6    2/4       B → ACE            2/6    2/5       B → CE             4/6    4/5
AC → BE            2/6    2/3       C → ABE            2/6    2/5       C → BE             4/6    4/5
BC → AE            2/6    2/4       E → ABC            2/6    2/5       E → BC             4/6    4/5
BE → AC            2/6    2/5       A → BC             2/6    2/3       A → B              2/6    2/3
CE → AB            2/6    2/4       B → AC             2/6    2/5       B → A              2/6    2/5
AC → B             2/6    2/3       C → AB             2/6    2/5       C → A              3/6    3/5
BC → A             2/6    2/4       A → BE             2/6    2/3       A → E              2/6    2/3
BE → A             2/6    2/5       B → AE             2/6    2/5       E → A              2/6    2/5
AC → E             2/6    2/3       E → AB             2/6    2/5       B → C              4/6    4/5
CE → A             2/6    2/4       A → CE             2/6    2/3       C → B              4/6    4/5
BE → C             4/6    4/5       C → AE             2/6    2/5       C → E              4/6    4/5
A → BCE            2/6    2/3       E → AC             2/6    2/5       E → C              4/6    4/5

Proposition 4. (i) All approximate association rules can be deduced, with their supports and confidences, from the min-max approximate basis. (ii) All rules in the min-max approximate basis are min-max association rules.

Proof. (i) Let $r : l_1 \to (l_2 \setminus l_1)$ be an association rule between two frequent itemsets with $l_1 \subset l_2$. Since $\mathrm{conf}(r) < 1$ we also have $\gamma(l_1) \subset \gamma(l_2)$. For any frequent itemsets $l_1$ and $l_2$, there is a generator $g_1$ such that $g_1 \subseteq l_1 \subseteq \gamma(l_1) = \gamma(g_1)$ and a generator $g_2$ such that $g_2 \subseteq l_2 \subseteq \gamma(l_2) = \gamma(g_2)$. Since $\gamma(l_1) \subset \gamma(l_2)$, we have $l_1 \subseteq \gamma(g_1) \subset \gamma(g_2)$ and the rule $r' : g_1 \to (\gamma(g_2) \setminus g_1)$ belongs to the min-max approximate basis. We show now that the rule $r$, its support and its confidence can be deduced from the rule $r'$, its support and its confidence. Since $g_1 \subseteq l_1 \subset l_2 \subseteq \gamma(g_2)$, the antecedent and the consequent of $r$ can be rebuilt starting from the rule $r'$. Moreover, we have $\gamma(l_2) = \gamma(g_2)$ and thus $\mathrm{supp}(r) = \mathrm{supp}(l_2) = \mathrm{supp}(\gamma(g_2)) = \mathrm{supp}(r')$. Since $g_1 \subseteq l_1 \subseteq \gamma(g_1)$, we have $\mathrm{supp}(g_1) = \mathrm{supp}(l_1)$ and we thus deduce that:

$$\mathrm{conf}(r) = \frac{\mathrm{supp}(l_2)}{\mathrm{supp}(l_1)} = \frac{\mathrm{supp}(\gamma(g_2))}{\mathrm{supp}(g_1)} = \mathrm{conf}(r').$$

(ii) Let $r : g \to (f \setminus g) \in MinMaxApprox$. According to definition 7, we have $f \in Closed$, $g \in Gen_{f'}$ and $f' \subset f$. We demonstrate that there is no other rule $r' : l_1 \to (l_2 \setminus l_1) \in MinMaxApprox$ such that $\mathrm{supp}(r') = \mathrm{supp}(r)$, $\mathrm{conf}(r') = \mathrm{conf}(r)$, $l_1 \subseteq g$ and $f \subseteq l_2$. If $l_1 \subset g$ then, according to definition 4, we have $\gamma(l_1) \subset \gamma(g) = f'$ and then $l_1 \notin Gen_{f'}$. We deduce that $\mathrm{supp}(l_1) > \mathrm{supp}(g)$ and then $\mathrm{conf}(r') < \mathrm{conf}(r)$.

If $f \subset l_2$ then, according to definition 3, we have $f = \gamma(f) \subset l_2 = \gamma(l_2)$. We deduce that $\mathrm{supp}(f) > \mathrm{supp}(l_2)$ and we conclude that $\mathrm{conf}(r) > \mathrm{conf}(r')$.

3.3. Non-transitive approximate min-max association rules

We can further reduce the number of approximate association rules extracted without losing the ability to deduce all approximate association rules, with support and confidence, by removing transitive min-max association rules.

A min-max association rule $g \to (f \setminus g)$ with $\gamma(g) \subset f$ is transitive if there exists a frequent closed itemset $f'$ such that $\gamma(g) \subset f' \subset f$. Let $g'$ be a generator of $f'$, so that $g' \subseteq f' \subset f$. Then, we have the two following approximate min-max association rules: $g \to (f' \setminus g)$ and $g' \to (f \setminus g')$. The rule $g \to (f \setminus g)$ is the transitive composition of the two previous rules; its support is equal to the second rule's support and, since $\mathrm{supp}(g') = \mathrm{supp}(f')$, its confidence is equal to the product of their confidences.
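As a worked instance on the running example, take $g = B$ with $\gamma(g) = BE$, the intermediate closed itemset $f' = BCE$ with generator $g' = BC$, and $f = ABCE$. The values below are those of the rules B → CE, BC → AE and B → ACE in the tables of the approximate basis:

$$
\begin{aligned}
\mathrm{conf}(B \to ACE) &= \mathrm{conf}(B \to CE) \times \mathrm{conf}(BC \to AE) = \frac{4}{5} \times \frac{2}{4} = \frac{2}{5},\\
\mathrm{supp}(B \to ACE) &= \mathrm{supp}(BC \to AE) = \frac{2}{6}.
\end{aligned}
$$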

We generalize this characterization to all triplets consisting of a generator $g$, its closure $f'$ and a closed superset $f$ of $f'$ to define the non-transitive min-max approximate basis, that is the transitive reduction of the min-max approximate basis. Let us denote $l_1 \lessdot l_2$ when an itemset $l_1$ is an immediate predecessor of an itemset $l_2$ among the frequent closed itemsets, i.e. $l_1 \subset l_2$ and $\nexists\, l_3 \in Closed$ such that $l_1 \subset l_3 \subset l_2$. The non-transitive min-max approximate rules are of the form $g \to (f \setminus g)$ where $f$ is a frequent closed itemset and $g$ a frequent generator such that $\gamma(g)$ is an immediate predecessor of $f$.

Definition 8. (Non-transitive min-max approximate basis) The non-transitive min-max approximate basis is:

$$MinMaxReduc = \{ r : g \to (f \setminus g) \mid f \in Closed \wedge g \in Gen \wedge \gamma(g) \lessdot f \}.$$

Remark. This transitive reduction decreases the number of approximate rules extracted, by selecting the most precise rules, i.e. those with the highest confidences, since transitive rules have lower confidences than non-transitive rules.

The algorithm presented in figure 6 constructs the set $MinMaxReduc$ of non-transitive approximate min-max rules using frequent closed itemsets and their generators. For each generator $g$, it determines all frequent closed itemsets $f$ that are immediate successors of the closure of $g$ and then, it generates all rules between $g$ and $f$ that have a sufficient confidence.

First, $MinMaxReduc$ is initialized with the empty set (step 1) and sets $\mathit{FC}_k$ are successively examined in increasing order of $k$ values (steps 2 to 19). For each $k$-generator $g \in \mathit{FC}_k$ (steps 3 to 18), the set $ImSucc_g$ of immediate successors of the closure of $g$ is initialized with the empty set (step 4). The sets $S_j$ of frequent closed $j$-supersets of $\gamma(g)$ for $|\gamma(g)| < j \leq \mu$ are constructed (steps 5 to 7). Then, sets $S_j$ are considered successively in ascending order of $j$ values (steps 8 to 17). For each itemset $f \in S_j$ that is not a superset of an immediate successor of $\gamma(g)$ in $ImSucc_g$ (step 10), $f$ is inserted in $ImSucc_g$ (step 11) and the confidence of the rule $r : g \to (f \setminus g)$ is computed (step 12). If the confidence of $r$ is at least minconf, the rule $r$ is inserted in $MinMaxReduc$ (steps 13 and 14). When all the generators of size at most $\nu - 1$ have been considered, the algorithm returns the set $MinMaxReduc$ (step 20).

Input : sets FC_k, confidence threshold minconf
Output: set MinMaxReduc

 1) MinMaxReduc ← ∅
 2) for k = 1 to ν − 1 do
 3)   forall k-generator g ∈ FC_k do
 4)     ImSucc_g ← ∅
 5)     for j = |g.closure| to µ do
 6)       S_j ← {f ∈ FC.closure | f ⊃ g.closure ∧ |f| = j}
 7)     end
 8)     for j = |g.closure| to µ do
 9)       forall frequent closed itemset f ∈ S_j do
10)         if (∄ s ∈ ImSucc_g | s ⊂ f) then do
11)           insert f in ImSucc_g
12)           conf ← f.supp/g.supp
13)           if (conf ≥ minconf)
14)           then insert {r : g → (f \ g), conf, f.supp} in MinMaxReduc
15)         end
16)       end
17)     end
18)   end
19) end
20) return MinMaxReduc

Figure 6. Algorithm for generating the non-transitive min-max approximate basis.

Example 7. The non-transitive min-max approximate basis extracted from context $D$ for minsupp = 2/6 and minconf = 2/5 is presented in table VII. It contains only seven rules, that is three rules less than the approximate min-max basis. These three rules are B → ACE, C → ABE and E → ABC, which have minimal support and confidence measures among the ten rules of the approximate min-max basis.
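The immediate-successor test of figure 6 can be sketched in Python as follows. As in the other sketches, the list `FC` of `(generator, closure, support count)` triples and the name `min_max_reduc` are assumptions; the closed supersets are simply sorted by size instead of being grouped into sets $S_j$.

```python
from fractions import Fraction

def min_max_reduc(fc, minconf):
    """Build the non-transitive min-max approximate basis."""
    closed = {closure: supp for _, closure, supp in fc}
    basis = []
    for g, closure, g_supp in sorted(fc, key=lambda entry: len(entry[0])):
        im_succ = []  # immediate successors of the closure of g (step 4)
        # Steps 5-9: closed strict supersets of the closure, smallest first.
        supersets = sorted((f for f in closed if closure < f), key=len)
        for f in supersets:
            # Step 10: skip f if it contains an already found immediate successor.
            if not any(s < f for s in im_succ):
                im_succ.append(f)                    # step 11
                conf = Fraction(closed[f], g_supp)   # step 12
                if conf >= minconf:                  # steps 13-14
                    basis.append((g, f - g, closed[f], conf))
    return basis

# Generators, closures and support counts (out of 6) of the running example.
FC = [(frozenset(g), frozenset(f), supp) for g, f, supp in [
    ("A", "AC", 3), ("B", "BE", 5), ("C", "C", 5), ("E", "BE", 5),
    ("AB", "ABCE", 2), ("AE", "ABCE", 2), ("BC", "BCE", 4), ("CE", "BCE", 4)]]

reduc_basis = min_max_reduc(FC, Fraction(2, 5))
for antecedent, consequent, supp, conf in reduc_basis:
    print("".join(sorted(antecedent)), "->", "".join(sorted(consequent)),
          f"supp={supp}/6", f"conf={conf}")
```

On the running example the sketch keeps seven rules and leaves out the transitive rules B → ACE, C → ABE and E → ABC.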

Table VII. Non-transitive min-max approximate basis extracted from $D$.

Generator   Closure   Closed superset   Approximate rule   Supp   Conf
{A}         {AC}      {ABCE}            A → BCE            2/6    2/3
{B}         {BE}      {BCE}             B → CE             4/6    4/5
{B}         {BE}      {ABCE}
{C}         {C}       {AC}              C → A              3/6    3/5
{C}         {C}       {BCE}             C → BE             4/6    4/5
{C}         {C}       {ABCE}
{E}         {BE}      {BCE}             E → BC             4/6    4/5
{E}         {BE}      {ABCE}
{AB}        {ABCE}
{AE}        {ABCE}
{BC}        {BCE}     {ABCE}            BC → AE            2/6    2/4
{CE}        {BCE}     {ABCE}            CE → AB            2/6    2/4

Proposition 5. All approximate association rules, with support and confidence, can be deduced from the non-transitive min-max approximate basis.

First, we show that all approximate min-max association rules can be derived from the non-transitive min-max approximate association rules. Then, from proposition 4 we conclude that all approximate association rules can also be deduced.

Proof. Let $r : g_1 \to (f_n \setminus g_1)$ be an approximate min-max association rule between a generator $g_1$ whose closure is $f_1$ and a frequent closed superset $f_n$ of $f_1$. If $f_1 \lessdot f_n$ then $r$ is non-transitive: $r \in MinMaxReduc$. If $f_1 \not\lessdot f_n$ then $r$ is transitive and there is a sequence $f_1, f_2, \ldots, f_n$ of frequent closed itemsets such that $g_1 \subseteq f_1 \lessdot f_2 \lessdot \ldots \lessdot f_n$ with $n \geq 3$. Each $f_i$ has at least one generator $g_i$ such that $\gamma(g_i) = f_i$ and since $f_1 \lessdot f_2 \lessdot \ldots \lessdot f_n$, there is a sequence of rules $r_i : g_i \to (f_{i+1} \setminus g_i)$ for $i \in [1, n-1]$ that are non-transitive min-max rules. The antecedent of $r$ is the antecedent $g_1$ of the first rule $r_1$ of the sequence. The consequent of $r$ is $(f_n \setminus g_1) = (((f_n \setminus g_{n-1}) \cup g_{n-1}) \setminus g_1)$, i.e. the union of rule $r_{n-1}$'s antecedent and consequent minus rule $r_1$'s antecedent. We now show that the support and confidence of $r$ can be deduced from those of the rules $r_i$. We have $\mathrm{supp}(r) = \mathrm{supp}(g_1 \cup (f_n \setminus g_1)) = \mathrm{supp}(f_n) = \mathrm{supp}(g_{n-1} \cup (f_n \setminus g_{n-1})) = \mathrm{supp}(r_{n-1})$. The support of $r$ is equal to the support of the last rule $r_{n-1}$ of the sequence. We also have:

$$
\begin{aligned}
\mathrm{conf}(r) &= \frac{\mathrm{supp}(f_n)}{\mathrm{supp}(g_1)}\\
&= \frac{\mathrm{supp}(f_n)}{\mathrm{supp}(g_{n-1})} \times \frac{\mathrm{supp}(g_{n-1})}{\mathrm{supp}(g_1)}\\
&= \frac{\mathrm{supp}(f_n)}{\mathrm{supp}(g_{n-1})} \times \frac{\mathrm{supp}(f_{n-1})}{\mathrm{supp}(g_{n-2})} \times \ldots \times \frac{\mathrm{supp}(f_2)}{\mathrm{supp}(g_1)}\\
&= \mathrm{conf}(r_{n-1}) \times \mathrm{conf}(r_{n-2}) \times \ldots \times \mathrm{conf}(r_1).
\end{aligned}
$$

The confidence of $r$ is equal to the product of the confidences of the rules $r_i$ for $i = 1$ to $n - 1$.
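The reconstruction used in this proof can be sketched as a small function that composes a chain of non-transitive rules $r_1, \ldots, r_{n-1}$. The tuple layout `(antecedent, consequent, supp, conf)` and the name `compose` are assumptions of this sketch; the chain is taken from the running example.

```python
from fractions import Fraction
from math import prod

def compose(chain):
    """Derive the transitive rule r from the chain r_1, ..., r_{n-1}.

    Each rule is an (antecedent, consequent, supp, conf) tuple.
    """
    first, last = chain[0], chain[-1]
    antecedent = first[0]
    # Consequent: antecedent and consequent of the last rule, minus g_1.
    consequent = (last[0] | last[1]) - antecedent
    supp = last[2]  # support of the last rule of the chain
    conf = prod((rule[3] for rule in chain), start=Fraction(1))
    return antecedent, consequent, supp, conf

# Chain C -> A (3/6, 3/5) followed by A -> BCE (2/6, 2/3).
chain = [
    (frozenset("C"), frozenset("A"), Fraction(3, 6), Fraction(3, 5)),
    (frozenset("A"), frozenset("BCE"), Fraction(2, 6), Fraction(2, 3)),
]
derived = compose(chain)
```

Here `derived` is the transitive rule C → ABE with support 2/6 and confidence 2/5, which matches its entry in the min-max approximate basis.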

4. Deriving association rules from the min-max bases

We introduce in this section simple techniques and algorithms to reconstruct all exact association rules, all approximate association rules and all transitive approximate min-max association rules from the min-max bases.

4.1. Deriving exact association rules

The graph-oriented representations of the exact and the exact min-max association rules extracted from context $D$ for minsupp = 2/6 and minconf = 2/5 are given in figures 7 and 8 respectively.

Each vertex $v_l$ represents a frequent itemset $l$ that is a subset of the maximal frequent itemset {ABCE}. Each edge between two vertices $v_a$ and $v_c$ represents the exact association rule $a \Rightarrow c \setminus a$. A closed interval is a sub-graph containing all vertices representing itemsets of the intervals $[g_i, f]$ where each $g_i$ is a generator of the frequent closed itemset $f$. Since all itemsets in a closed interval have the same support, all rules in this interval also have the same support.

In the graph representation, deriving all exact rules means adding all possible edges between two vertices of the same closed interval. Each edge in figure 8 between two vertices $v_g$ and $v_f$ represents an exact min-max rule $g \Rightarrow f \setminus g$; for each such edge, we add all edges between two vertices, one representing a superset of $g$ and the other a subset of $f$.

Figure 7. Exact association rules extracted from $D$ (graph over the frequent itemsets A, B, C, E, AB, AC, AE, BC, BE, CE, ABC, ABE, ACE, BCE and ABCE; the legend distinguishes closed intervals and generator itemsets).

Figure 8. Exact min-max association rules extracted from $D$ (same vertices and legend as figure 7).

The algorithm receives the set $MinMaxExact$ of exact min-max rules as input and it returns the set $AllExact$ containing all exact association rules. Its pseudo-code is presented in figure 9. It considers all exact min-max rules $r_1 : a_1 \Rightarrow c_1$ with $|c_1| > 1$ (steps 2 to 8). For each subset $c_2$ of $c_1$ (steps 3 to 7), it generates all rules with the form $r_2 : a_1 \Rightarrow c_2$ and $r_3 : a_1 \cup c_2 \Rightarrow c_1 \setminus c_2$ (steps 4 and 6). These rules have the same support as $r_1$. Since rule $r_3$ can be generated several times, the algorithm first tests if it has not already been inserted in $AllExact$ (step 5).

Input : set MinMaxExact
Output: set AllExact

1) AllExact ← ∅
2) forall rule {r_1 : a_1 ⇒ c_1, r_1.supp} ∈ MinMaxExact with |c_1| > 1 do
3)   forall subset c_2 ⊂ c_1 do
4)     insert {r_2 : a_1 ⇒ c_2, r_1.supp} in AllExact
5)     if {r_3 : a_1 ∪ c_2 ⇒ c_1 \ c_2, r_1.supp} ∉ AllExact
6)     then insert r_3 in AllExact
7)   end
8) end
9) return AllExact

Figure 9. Algorithm for reconstructing all exact association rules.

Example 8. Consider rule AB ⇒ CE represented in figure 8 by the edge between vertices {AB} and {ABCE}. From this rule we deduce rules AB ⇒ C, AB ⇒ E, ABC ⇒ E and ABE ⇒ C and from rule AE ⇒ BC, we deduce rules AE ⇒ B, AE ⇒ C, ABE ⇒ C and ACE ⇒ B. All these rules have the same support.
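The derivation of figure 9 can be sketched in Python with a set, which makes the duplicate test of step 5 implicit. Two choices here are assumptions of the sketch rather than part of the pseudo-code: the basis rules themselves are copied into the result, and only non-empty strict subsets $c_2$ of $c_1$ are enumerated. The names `all_exact` and `exact_basis` are illustrative.

```python
from itertools import combinations

def all_exact(basis):
    """Derive all exact rules from (antecedent, consequent, supp) basis rules."""
    rules = set()
    for a1, c1, supp in basis:
        rules.add((a1, c1, supp))  # assumption: keep the basis rule itself
        # Non-empty strict subsets c2 of the consequent (none if |c1| = 1).
        for size in range(1, len(c1)):
            for c2 in map(frozenset, combinations(c1, size)):
                rules.add((a1, c2, supp))            # r2 : a1 => c2
                rules.add((a1 | c2, c1 - c2, supp))  # r3 : a1 u c2 => c1 \ c2
    return rules

# The seven rules of the min-max exact basis, with support counts out of 6.
exact_basis = [(frozenset(a), frozenset(c), supp) for a, c, supp in [
    ("A", "C", 3), ("B", "E", 5), ("E", "B", 5), ("AB", "CE", 2),
    ("AE", "BC", 2), ("BC", "E", 4), ("CE", "B", 4)]]

exact_rules = all_exact(exact_basis)
```

The rule ABE ⇒ C is produced twice, once from AB ⇒ CE and once from AE ⇒ BC, but is stored once; the result contains fourteen rules, the number of exact rules given in Example 5.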

Remark. For constructing all exact rules using sets $\mathit{FC}_k$ of generators and frequent closed itemsets, we consider each generator $g$ and its closure $f$. We generate all rules $r : g \Rightarrow l \setminus g$ and $r : l \Rightarrow f \setminus l$ for $l \in [g, f[$. For instance, from the generator {AB} and its closure {ABCE}, we generate rules AB ⇒ CE, AB ⇒ C, AB ⇒ E, ABC ⇒ E and ABE ⇒ C. Their support is equal to the support of $g$ and $f$, i.e. the support of {AB} and {ABCE}.

4.2. Deriving approximate association rules

Figures 10 and 11 depict the graph-oriented representations of the approximate and the approximate min-max association rules extracted from context $D$ for minsupp = 2/6 and minconf = 2/5. Each edge between two vertices $v_a$ and $v_c$ represents the approximate rule $a \to c \setminus a$.

In figure 11, each edge between two vertices $v_g$ and $v_f$ represents the min-max approximate rule $g \to f \setminus g$ where $g$ is a generator and $f$ a frequent closed superset of $g$. That is to say an edge between a minimal vertex of a closed interval and the maximal vertex of another closed interval above the first one. For instance, the edge

Figure 10. Approximate association rules extracted from $D$ (graph over the frequent itemsets; the legend marks generator itemsets).

Figure 11. Approximate min-max association rules extracted from $D$ (graph over the frequent itemsets; the legend distinguishes closed intervals and generator itemsets).

To derive all approximate rules, when there is an edge between two vertices of two closed intervals we create all possible edges between each vertex of the first interval and each vertex of the second interval. All these rules have the same support and
