HAL Id: hal-00381667
https://hal.archives-ouvertes.fr/hal-00381667v2
Submitted on 26 Aug 2014
A Lower Bound on the Sample Size needed to perform a Significant Frequent Pattern Mining Task

Stéphanie Jacquemont, François Jacquenet, Marc Sebban

Laboratoire Hubert Curien, Université de Saint-Etienne, 18 rue du Professeur Lauras, 42000 Saint-Etienne (France)
Abstract

During the past few years, the problem of assessing the statistical significance of frequent patterns extracted from a given set S of data has received much attention. Considering that S always consists of a sample drawn from an unknown underlying distribution, two types of risks can arise during a frequent pattern mining process: accepting a false frequent pattern or rejecting a true one. In this context, many approaches presented in the literature assume that the dataset size is an application-dependent parameter. In this case, there is a trade-off between both errors, leading to solutions that only control one risk to the detriment of the other one. On the other hand, many sampling-based methods have attempted to determine the optimal size of S ensuring a good approximation of the original (potentially infinite) database from which S is drawn. However, these approaches often resort to Chernoff bounds that do not allow the independent control of the two risks. In this paper, we overcome the mentioned drawbacks by providing a lower bound on the sample size required to control both risks and achieve a significant frequent pattern mining task.
1 Introduction
In frequent pattern mining (1; 2), one aims to find interesting patterns from a database in the form of association rules, sequences, episodes, correlations, etc. Many algorithms have been proposed in the literature to deal with association rule mining (3; 4; 5), sequential pattern mining (6; 7; 8; 9), graph mining (10; 11; 12; 13), and tree mining (14; 15; 16). Chao et al. (17) proposed a generic library for dealing with a large family of frequent patterns. This domain of research has many applications, such as the discovery of customers' behavior in supermarkets, the extraction of patterns of alarms in manufacturing supervision, the modeling of web users, etc.

⋆ This work is part of the ongoing BINGO2 research project financed by the French National Research Agency.

A pattern of a sample S is called frequent if its observed frequency is greater than a minimal support threshold. For instance, a sequence mining process consists of detecting frequent subsequences from a dataset of sequences that are made up of possibly non-contiguous symbols. For example, let us assume that the database S is constituted of the following three sequences: S = {ACGT, ATGAT, CAGTA}. By fixing the minimal support threshold to 2/3, the pattern AGT is considered as frequent since it occurs three times out of the three sequences of S, which is actually > 2/3.
During the past decade, the scientific community has mainly concentrated its efforts on the reduction of the complexity of frequent pattern mining methods to deal with large datasets. In this context, the reduction of the search space has constituted one of the main objectives. From an algorithmic point of view, all these approaches have to prove that they are correct and complete, i.e. they must guarantee that (i) all the frequent patterns that are extracted are really frequent in S, and (ii) no frequent pattern of S has been overlooked. However, these properties are not sufficient to guarantee the significance of a frequent pattern mining process. Indeed, S being nothing else but a sample of an unknown target distribution D, mining algorithms often suppose that the distribution over S is the same as the distribution D from which these data have been drawn. In other words, they make no assessment of the likelihood that a frequent pattern extracted from S is an artifact of the sampling rather than a consistent pattern in the target distribution D. In the same way, they do not assess the risk of overlooking a pattern that would be in fact frequent according to D.
More formally, deciding if a pattern in S is frequent or not boils down to comparing its observed proportion p̂ with a given support threshold p_0. If p̂ > p_0, the pattern is considered as frequent by the mining algorithm. However, the true probability p of this pattern comes under the unknown target distribution D. Therefore, when an algorithm takes a decision about the status of a pattern, it takes a risk α ∈ [0, 1] of accepting a false frequent pattern (i.e. one that appears in S due to chance alone), or a risk β ∈ [0, 1] of rejecting a true frequent one. In this context, 1 − α can be called the theoretical precision, whereas 1 − β corresponds to the theoretical recall of the algorithm. It is important to note that most frequent pattern mining algorithms do nothing (or little) to control both α and β. As mentioned before, their main goal only consists of guaranteeing their correctness and their completeness over S, but nothing is ensured over the underlying distribution D. This can be justified by the fact that, statistically, given a constant number of data in S, reducing one of the two risks implies an increase of the second one.
In this paper, we take up this problem differently, by providing a lower bound on the size of S needed to theoretically guarantee a precision of (1 − α) and a recall of (1 − β) according to any distribution D. Therefore, we reject the well-known statement that to increase the recall, we have to accept a decrease of the precision, or vice versa. We rather answer the following question: What is the minimal size of S required to satisfy given risks α and β? We claim that this contribution is novel by comparison with the state of the art. Indeed, we will see that the few approaches attempting to deal with this problem from a theoretical point of view either are based on Chernoff bounds that do not allow the independent control of α and β, or call on statistical tests that require to choose the risk to optimize. Even though our contribution is above all theoretical, we claim that it can provide useful help in many applications. For instance, in domains where the data acquisition is not costly, one can wonder what is the minimal number of examples that are required to optimize the trade-off between the reduction of the algorithmic constraints and the guarantee of a discovery of true knowledge. Therefore, in such cases, our theoretical result provides a bound reachable in practice guaranteeing a significant frequent pattern mining task. This is the case for example in the modeling of web users' behavior, where terabytes of data are available in log files. On the other hand, in domains where the number of available examples is limited (in molecular biology for instance), it enables us to draw the attention of data miners to the fact that some extracted patterns could be the result of false discovery, and some others could have been omitted despite their significance. In this case, the use of the extracted knowledge must be done with caution.
The rest of the paper is organized as follows. In Section 2, we present the state-of-the-art approaches aiming to assess the significance of the extracted patterns. Section 3 is devoted to the presentation of our bound enabling us to fix α and β in advance. A first illustration is presented in Section 4 on a real database. In Section 5, we discuss the valuation of the parameters of our bound, and we present a larger series of experiments.
2 Related Work

2.1 Bottleneck of frequent pattern mining algorithms
Let us suppose we carry out a series of experiments consisting of tossing a coin N = 10 times. Let S be the resulting sample of 10 itemsets constituted in this case of only one item (<tails> or <heads>). Suppose we observe in S respectively 8 <tails> and 2 <heads>. By fixing the support threshold to p_0 = 0.5, the pattern <tails> will be considered as frequent in S, because its observed frequency p̂ = 0.8 is higher than p_0, while the pattern <heads> will be rejected. But can we claim that the fact that <tails> is more frequent than <heads> is significant? In fact, we can easily prove that such a combination of <heads> and <tails> can often occur over only 10 trials, without challenging the balance of the coin itself. We can note that the size of S has a direct impact on the significance of the result. During the past few years, several papers have drawn the attention of data miners to the risks of extracting regularities from data in the form of a random artifact. The previous example is a good illustration of this problem that can arise in a frequent pattern mining process. Let us describe now some possible solutions that have been presented in the literature, and that take into account the size of S to overcome the mentioned drawback.
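How likely is such an 8/2 split under a fair coin? A quick binomial-tail computation (a sketch in Python, not part of the paper's own experiments) makes the point:

```python
from math import comb

def binom_tail(n: int, k: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p): exact upper-tail probability."""
    return sum(comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k, n + 1))

# probability that a fair coin shows at least 8 tails out of 10 tosses
print(round(binom_tail(10, 8), 4))  # -> 0.0547
```

So roughly one run of 10 tosses in 18 from a perfectly balanced coin already produces at least 8 tails: the observed imbalance is weak evidence against the coin's balance.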
2.2 Modifying the support threshold p_0 using Chernoff bounds

Rather than directly comparing the observed frequency p̂ in S with p_0, a first solution consists of bounding p_0 in order to take into account the estimate error |p̂ − p| due to the use of a sample S of finite size N, where p is the true probability of the pattern under the unknown theoretical distribution D. A well-known non-parametric approach that deals with this problem is based on the Chernoff bounds, which bound the probability that the estimate error between a random variable X observed on a sample S and its expected value E(X) according to D exceeds a given ε, such that

∀ε ∈ ]0, 1[,  P(|X − E(X)| ≥ ε) < e^(−2Nε²).   (1)
Eq. 1 states that, obviously, the higher the sample size N, the smaller the estimate error. Chernoff bounds have been widely used in statistical learning theory for many years, and more recently in frequent pattern mining by sampling-based methods (19; 20) to deal with the statistical relevance of the extracted patterns. Basically, sampling-based data mining methods aim to reduce the potentially huge I/O overhead in scanning a database DB (that potentially cannot be stored in memory) for discovering frequent patterns. Their goal consists of sampling the original database into a sample S and extracting regularities from this subset while guaranteeing the accuracy of the extracted knowledge. Even if the sample S is here not drawn from an unknown underlying distribution D (but rather from an existing large database DB), this framework looks like ours, especially since those Chernoff bounds can also be used to provide a theoretical size of S ensuring an upper bound on the estimate error. Indeed, let the observed frequency p̂_S be the random variable X of Eq. 1 computed from S, and p_DB its expected value E(X) over the whole database DB (potentially large); the Chernoff bounds can be rewritten

P(|p̂_S − p_DB| ≥ ε) < e^(−2Nε²).   (2)
Ineq. 2 can be used in different ways. First, given a size N, setting this inequality equal to a given probability and solving for ε provides a slack value for the support threshold p_0. On the other hand, given a value ε, setting Eq. 2 equal to δ and solving for N provides a lower bound on the sample size N satisfying the estimate error ε. Despite its obvious advantages, the use of the Chernoff bounds has a limitation. Indeed, as used in (19; 20), the symmetry due to the absolute value in Eq. 2 indicates that the risk of a bad estimation p̂ is equally distributed around p_DB. In other words, the risk that a pattern occurs in S less often than expected in DB is equal to the risk that a pattern occurs more often than expected. In this context, Chernoff bounds do not allow the distinction between the false positive rate α and the false negative rate β, as defined in the introduction. This can be a problem in domains where α and β have to be independently handled. For instance, suppose that a vaccine is administered to a patient according to the frequent presence or not of a pattern in his DNA. Missing a patient who has the disease (i.e. overlooking a true frequent pattern) would not have the same medical effect as administering the vaccine to a healthy person (i.e. admitting a false frequent pattern).
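To make the "solving for N" use of Ineq. 2 concrete: setting e^(−2Nε²) = δ and solving for N gives N ≥ ln(1/δ)/(2ε²). The sketch below (our code; the function name and example values are illustrative, not from the paper) computes this:

```python
import math

def chernoff_sample_size(epsilon: float, delta: float) -> int:
    """Smallest N such that exp(-2*N*epsilon**2) <= delta (from Ineq. 2)."""
    return math.ceil(math.log(1.0 / delta) / (2.0 * epsilon ** 2))

# e.g. estimate every support within epsilon = 0.01, with failure probability delta = 0.05
print(chernoff_sample_size(0.01, 0.05))  # -> 14979
```

Note that the single parameter δ bounds the symmetric two-sided error, which is precisely the limitation discussed above: δ cannot be split into separate α and β.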
In (21), Toivonen presents another sampling method for discovering relevant association rules. The algorithm also picks a random sample S from the original database DB, then it determines from S all frequent association rules that probably hold in DB; finally it verifies with DB if they are actually frequent. To control the risk of overlooking true frequent patterns, Toivonen replaces the support threshold p_0 by a lower bound based on the Chernoff bounds so that misses are avoided with a high probability. However, Toivonen only deals with β. Indeed, by using DB to verify if the extracted patterns are actually frequent, the risk α of false positives is intrinsically null. However, this way of proceeding is only possible if the original database DB is available. While this condition is fulfilled in Toivonen's framework, it is an unacceptable constraint in ours, which assumes that S has been drawn from an unknown theoretical distribution.
Recently, Laur et al. proposed in (22; 23) an approach that not only makes use of Chernoff bounds but also deals with both risks α and β. Given a sample S, they provide a bound for p_0 that ensures either a precision equal to 1 with a high probability while controlling the recall, or a recall equal to 1 with a high probability while limiting the degradation of the precision. Even if this approach is theoretically well founded, the user has to choose the criterion (recall or precision) he wants to optimize, which can be a tricky task in domains where both errors α and β are definitely undesirable.

2.3 Modifying the support threshold p_0 using statistical tests

A second solution to check the relevance of a discovered pattern is to resort
to statistical tests that involve two hypotheses, a null hypothesis H_0 and an alternative one H_a. Usually, H_a is made to describe an interesting situation (e.g. a frequent pattern), while H_0 characterizes the irrelevant situation (e.g. a non-frequent pattern). When a test is performed, two types of errors can occur: the first one, called Type I error, comes from the rejection of the hypothesis H_0 while it is true (i.e. wrongly accepting H_a); the second one, called Type II error, corresponds to the wrong decision to accept H_0 while H_a is true. Therefore, adapted to the context of frequent pattern mining, the Type I error can be defined as describing the risk α of accepting a false frequent pattern, while the Type II error can be defined as being the risk β of rejecting a true frequent one. In this context, it is important to recall that there exists a statistical trade-off between these two risks. Given a sample size N, β actually increases if one wants to reduce α, and vice versa. In the following, we present some state-of-the-art approaches that deal with the relevance of extracted patterns using such statistical tests.
In (24), Megiddo & Srikant deal with the evaluation of the quality of association rules extracted from a set of data. They present an approach for estimating the number of false discoveries in order to control the precision. Let us consider an association rule X ⇒ Y, where X and Y are sets of items. As a null hypothesis, they assume that X and Y occur in the data independently. Thus, they test the null hypothesis H_0 : p(X ∩ Y) = p(X) × p(Y) against the alternative one H_a : p(X ∩ Y) > p(X) × p(Y), which, roughly speaking, means that a lot of transactions that contain X also contain Y. They run a statistical test exploiting the property that the observed frequency of an itemset asymptotically follows a normal distribution. To reduce the risk of accepting a false frequent pattern, they increase the support threshold p_0 by z_α × σ_p̂, where σ_p̂ is the standard deviation of p̂ and z_α is the (1 − α)-percentile of the normal distribution. Therefore, by tuning the risk α a priori, they can control the precision. Nevertheless, by using a small value for α, bounding p_0 in this way results in a decrease of the recall.
Recently, in (25), Webb presents two new approaches to applying statistical tests in pattern discovery to assess the quality of a pattern. First, he suggests splitting the sample S into an exploratory set, from which a pattern extraction is achieved, and a holdout set used to assess the quality of each pattern. Despite promising experimental results, this approach is above all empirical and does not provide any bound that enables both risks to be reduced. Webb also presents an approach based on the Bonferroni adjustment for multiple hypothesis testing. When many tests are performed, a special problem arises: if α corresponds to the risk of taking a wrong single decision, repeating the test many times globally increases that risk (26). To overcome this drawback, several strategies have been proposed (27). A famous one is the Bonferroni adjustment that uses a risk α/n when performing n hypothesis tests. However, if n is large, such an adjustment turns out to be strict and leads to an increase of the other risk β.
Another solution consists of using the Holm procedure (28) that takes into account the p-value of each test and orders them to tune a less strict risk. Such a strategy is also used in the BH procedure (29) that aims to set α while controlling the so-called false discovery rate. However, both of these adjustments require the computation of the p-values of the n tests, which depend on the current application. In our paper, we will provide a more general tool, whatever the application we deal with. Moreover, note that we aim to determine a relationship between the number N of data to mine and fixed risks α and β. In the adjustment procedures mentioned before, α and β are linked to the number n of statistical tests when testing multiple hypotheses. Therefore, both objectives cannot be directly connected.
In (30), Lee et al. present the DELI algorithm, which is based again on a sampling method which generates a sample S from a database DB. To maintain in S an accurate set of association rules, a confidence interval is built for the true probability p of an association rule in DB, such that

p ∈ p̂ ± z_{α/2} √( p̂(|DB| − p̂) / |S| ),

where p̂ is the support of the rule in S, α is the Type I error, and z_α is the (1 − α)-percentile of the normal distribution. By fixing α, the authors show that one can determine a suitable size of S satisfying the Type I error. As we can note, this approach has two main drawbacks. First, only the Type I error α is used to assess the statistical significance of the patterns. Therefore, the size of S deduced from the confidence interval does not take into account the Type II error β. On the other hand, the computation of this interval requires the use of the size of the original database DB. As we mentioned before, our more general framework does not require to have DB.
Finally, note that other statistical test-based investigations have dealt with the assessment of the significance of patterns in data mining. They use efficient tests (such as the Chi-square test and Fisher's exact test) to statistically measure the level of dependency between the components of a pattern. An often used strategy consists of verifying if the extracted structure would also be discovered from a random sample having the same margins (see (31) for example).

The approaches we presented in this survey either impose a symmetry condition on the estimate error, or minimize only one risk given a sample set size, or require the calculation of p-values of a specific set of statistical tests. None offers theoretical results that provide a bound on the size of S satisfying arbitrarily chosen parameters α and β. In the following, we fill this gap by proposing a statistical approach that exploits the asymptotic convergence of the distribution of frequent patterns. We provide a bound on N, easily computable, allowing the independent control of both risks α and β.

3 A statistical view of the recall and the precision
3.1 Risks of rejecting true frequent patterns and accepting false ones
Let p̂(w) be the proportion of data in the set S that contain a given pattern w. Let us recall that w is called frequent if p̂(w) is higher than a minimal support threshold p_0. In fact, p̂(w) is nothing else but an estimate of the real probability p(w) over D. Since p(w) is unknown, one can formulate a hypothesis on its real value and perform a statistical test. As usually done in the standard approaches, we suggest describing by the null hypothesis H_0 the situation where p̂(w) is not high enough to consider w as being frequent. As done in (24), we suggest keeping the maximal value p_0 that prevents w from being accepted as frequent. Therefore, we test the null hypothesis H_0 : p(w) = p_0 against the alternative one H_a, which describes an interesting discovery, i.e. H_a : p(w) > p_0.
Type I error: α represents the risk of rejecting H_0 while it is true. In our frequent pattern mining context, α corresponds to the risk of accepting a false frequent pattern. Therefore, 1 − α exactly describes the theoretical precision of the algorithm over the distribution D. For instance, with a support threshold p_0 of 10%, observing p̂(w) = 10.2% in S does not mean that w is definitely frequent in the target distribution D. To be able to take a well-founded decision, we can a priori fix α (usually 5%, but it can depend on the application we deal with), and then compute a bound of rejection k satisfying α. More formally,

α = P(p̂(w) > k | H_0 true).   (3)

The number of data of S that contain w is a binomial random variable with success probability p(w). According to the size N of S and the support threshold p_0, we can use either the normal or the Poisson approximation. In our context, we aim to provide a theoretical bound on N that will be by nature quite large. Moreover, since we are looking for frequent patterns, we can assume that p_0 will be chosen sufficiently large; otherwise the framework would be the one of exceptions or rare events, which is the matter of another research domain. Therefore, using the central limit theorem, we will consider in the following that the proportion p̂(w) follows a normal distribution N, such that

p̂(w) ≈ N( p(w), √(p(w)(1 − p(w))/N) ).
Equation 3 can be rewritten

α = P( (p̂(w) − p(w)) / √(p(w)(1 − p(w))/N)  >  (k − p(w)) / √(p(w)(1 − p(w))/N)  |  H_0 true ).   (4)

Since H_0 is true, we have to replace p(w) by its value under H_0. We get

α = P( (p̂(w) − p_0) / √(p_0(1 − p_0)/N)  >  (k − p_0) / √(p_0(1 − p_0)/N) ).   (5)

We can then easily deduce the bound k, which corresponds to the (1 − α)-percentile z_α of the normal distribution:

k = p_0 + z_α √( p_0(1 − p_0) / N ).   (6)

To recap, by fixing a risk α, Equation 6 gives us the bound of rejection of H_0. For example, let us suppose we are mining N = 10000 data. Let us fix the support threshold p_0 = 10% and the risk α = 5% (z_α = 1.645 by reading the table of the normal distribution). Plugging these values in Equation 6, we get k = 0.1 + 1.645 × √(0.1 × 0.9 / 10000) = 0.105. Therefore, a pattern w with a support p̂(w) = 10.2% will in fact be rejected in order to control the risk of accepting false positives.
Type II error: β describes the probability of rejecting a true frequent pattern. In contrast to α, β can be calculated according to the previously computed bound k. Since H_a : p(w) > p_0 is true, we have to set a given value for p(w) satisfying the constraint p(w) > p_0. Let p_a be this value (see Section 5 for a discussion about p_a). We get

β = P(p̂(w) < k | H_a true).   (7)

As previously done for α,

β = P( (p̂(w) − p(w)) / √(p(w)(1 − p(w))/N)  <  (k − p(w)) / √(p(w)(1 − p(w))/N)  |  H_a true ).   (8)

By replacing p(w) by its value under H_a, we get

β = P( (p̂(w) − p_a) / √(p_a(1 − p_a)/N)  <  (k − p_a) / √(p_a(1 − p_a)/N) ).   (9)

Since k is known thanks to Eq. 6, the (1 − β)-percentile z_β is also known, and β can be easily deduced from the normal distribution. To continue with our previous example (assuming N = 10000 data), let us suppose that p_a = 11%; then β = 5.5% (by reading the table of the normal distribution). Therefore, for a true support of 11%, the probability of falsely accepting the null hypothesis based on a finite sample of N = 10000 data is 5.5%.
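Eq. 9 likewise gives β in closed form once k is known. The sketch below (our code) reproduces the worked example with the rounded bound k = 0.105 from above; with the unrounded k ≈ 0.1049, β comes out slightly lower, about 5.3%:

```python
from statistics import NormalDist

def type_two_risk(k: float, pa: float, n: int) -> float:
    """Eq. 9: beta = P(observed support < k | true support pa), normal approximation."""
    sigma = (pa * (1.0 - pa) / n) ** 0.5
    return NormalDist().cdf((k - pa) / sigma)

# with the rejection bound k = 0.105 of the running example and pa = 11%
print(round(type_two_risk(0.105, 0.11, 10000), 3))  # -> 0.055
```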
3.2 Lower bound on N

The ideal objective of a frequent pattern mining process is to reduce not only α but also β. However, as mentioned before, there exists a trade-off between these two risks. With a constant number of data N, β actually increases if one reduces α, and vice versa. A solution to overcome this drawback consists of determining how many data N would be needed to not exceed a priori fixed α and β risks. This is the matter of the next theorem.

Theorem 1 To ensure that the false positive rate and the false negative rate do not exceed respectively fixed risks α and β, the lower bound N_low of the size of the sample S on which the frequent pattern mining algorithm must be run is equal to

N_low = ( ( z_β √(p_a(1 − p_a)) + z_α √(p_0(1 − p_0)) ) / (p_a − p_0) )²,   0 < p_0 < p_a < 1.
Proof 1 The proof is straightforward. We can deduce from Equation 9 that

k = p_a − z_β √( p_a(1 − p_a) / N ),   (10)

where z_β is the (1 − β)-percentile of the normal distribution. Equating Equations 6 and 10, we can deduce that

p_0 + z_α √( p_0(1 − p_0) / N ) = p_a − z_β √( p_a(1 − p_a) / N ).   (11)

Extracting N from Equation 11, we obtain the lower bound. □
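Theorem 1 translates into a few lines of code; the sketch below is ours, rounding the bound up to an integer sample size:

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_low(alpha: float, beta: float, p0: float, pa: float) -> int:
    """Theorem 1: minimal sample size controlling both risks (requires 0 < p0 < pa < 1)."""
    assert 0.0 < p0 < pa < 1.0
    z_a = NormalDist().inv_cdf(1.0 - alpha)
    z_b = NormalDist().inv_cdf(1.0 - beta)
    root = (z_b * sqrt(pa * (1.0 - pa)) + z_a * sqrt(p0 * (1.0 - p0))) / (pa - p0)
    return ceil(root ** 2)

print(n_low(0.05, 0.05, 0.10, 0.11))  # -> 10163
```

With the exact percentile z ≈ 1.6449 this yields 10163; the worked example of Section 4 uses the rounded table value z = 1.645, which gives 10165. The small discrepancy is only percentile rounding.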
Fig. 1. Trade-off between Type I (light grey area) and Type II errors (dark grey area). p_0 (resp. p_a) is the expectation of p̂(w) under H_0 (resp. H_a).

Let us now describe the meaning of this bound. It is important to note that there is a direct relationship between β and p_a given a fixed number of data. Indeed, as described in Figure 1, p_a is the expectation of p̂(w) under the alternative hypothesis H_a. β corresponds to the density of the normal distribution beneath the bound k of rejection of H_0. Therefore, the farther p_a is from p_0, the lower the risk β. Since β and p_a are parameters in our lower bound, reducing both implies an increase of the needed number of data. The same remark can be made about α and β. Reducing α for a given size N implies an increase of β. Therefore, reducing both risks results in an increase of the required number of data.
To illustrate this lower bound, the chart of Figure 2 shows the evolution of N_low according to α, β, p_0 and p_a. For the sake of legibility we choose α = β. We plot two curves with two different values of p_a. We can note that the smaller p_a − p_0, the larger the lower bound. A further discussion about the valuation of p_a is presented in Section 5.

Fig. 2. N_low according to α, β, p_0 and p_a (two curves: p_a = p_0 + 0.01 and p_a = p_0 + 0.005).

4 Illustration on a real world sequence mining task
Let us illustrate the impact of our bound in a real world sequence mining task. We carry out a series of experiments on the Atis (Air Travel Information Service) corpus. This database consists of travel information requests performed in English. We have an original set Ω of 14044 sentences from which we draw samples S_i of increasing size |S_i| (from 10 to 14044) and we extract frequent patterns with a support threshold of 10% with spam (32), which is a well-known sequence mining tool. In this series of experiments, to allow the analysis of the behavior of our bound, we assume that Ω represents the theoretical underlying distribution D from which the samples S_i have been drawn. In order to assess the effect of the size |S_i| on the quality of the extracted knowledge, we have to be able to measure the empirical values of α and β, which we will call α̂ and β̂. α̂ is the observed proportion of patterns that have been extracted as frequent from S_i while they are not frequent in the target population Ω. β̂ corresponds to the observed proportion of patterns that are frequent in Ω but overlooked from S_i.
Figure 3 describes, according to an increasing size |S_i| of the sample set S_i and a support threshold p_0 = 10%, the evolution of 1 − α̂ and 1 − β̂, using a value p_a = 11%. Note that we performed 15 trials for each size |S_i|, and we computed the average in order to reduce the variance of the results. As expected, the higher the number of sequences, the smaller the computed risks α̂ and β̂. We can also note that for small sizes of S_i (< 1000) both risks α̂ and β̂ are high (> 10%), meaning that a lot of extracted patterns are not truly frequent in Ω and many others have been overlooked. This example is a good illustration of the bottleneck of standard mining algorithms.

Since α̂ and β̂ can be empirically measured, they can be compared with the theoretical risks α and β to verify the relevance of our bound. To achieve this task, let us compute N_low for given theoretical parameters α, β, p_a and p_0. For instance, let us set α = β = 5% (p_0 = 10% and p_a = 11% being already fixed). Plugging these values in our bound yields the value N_low = 10165. If we observe from Figure 3 the results obtained from 10165 sequences, we can conclude that our bound is relevant because the two observed errors computed on the ATIS database (α̂ = 3% and β̂ = 2%) actually do not exceed our a priori fixed theoretical risks α and β.

Note that the difference between the observed and the theoretical errors can appear quite substantial in this experiment, even if it is on the safe side. In fact, the distance between the observed and the theoretical errors directly depends on the sample S_i drawn from the unknown target distribution. But since N_low constitutes a lower bound needed, in the worst case, to satisfy α and β, our theorem states that we never fall on the unsafe side.

5 What about the value of p_a?
So far, to illustrate our bound, we used a value of p_a close to p_0 under the alternative hypothesis H_a. As we explained in the previous section, there is a strong relationship between p_a and our lower bound N_low. More precisely, N_low quadratically increases with the drop of the difference between p_a and p_0. Therefore, the choice of a relevant value p_a remains an important problem that deserves special attention. In statistical inference, it is often stated that the valuation of the parameters under H_a has to be fixed according to the considered application. To avoid being dependent on this application, we study in the following of this section two theoretical ways to set the value of p_a.

5.1 A worst case solution
The first solution to tackle the problem of the valuation of p_a is to consider that a pattern w is truly frequent from the moment that its probability p(w) over D is greater than the support threshold p_0. Let N_0 be the number of data such that N_0 / N_low = p_0. Therefore, a pattern w is truly frequent if it occurs at least N_0 + 1 times in the N_low data. So, we get that

p_a = (N_0 + 1) / N_low = p_0 + 1 / N_low.
Plugging this value in Equation 10, and equating Equations 6 and 10, we get the following analytical representation of our lower bound, in a polynomial form of order 4 whose solution gives N_low (this polynomial has been obtained using Maple™):

(−2z_β²z_α²p_0² + z_β⁴p_0⁴ + 4z_β²p_0³z_α² + z_β⁴p_0² − 2z_β²p_0⁴z_α² + z_α⁴p_0⁴ + z_α⁴p_0² − 2z_α⁴p_0³ − 2z_β⁴p_0³)N⁴
+ (2p_0(−z_α² − z_β⁴ + z_β²p_0 − z_β² + z_α²z_β² + z_α²p_0) + 4p_0³(z_β²z_α² − z_β⁴) + 6p_0²(z_β⁴ − z_β²z_α²))N³
+ (1 − 4z_β²p_0 + 2p_0z_α²z_β² + 2z_β² − 2z_β²z_α²p_0² + z_β⁴ − 6z_β⁴p_0 + 6z_β⁴p_0²)N²
+ (2z_β⁴ + 2z_β² − 4z_β⁴p_0)N + z_β⁴ = 0.
We can see that the lower bound now only depends on the risks α and β, and the support threshold p_0. p_a is no longer a parameter of our bound, and therefore the solution of this equation provides the exact lower bound guaranteeing at worst α and β given a support threshold p_0. Nevertheless, this solution constitutes a very pessimistic answer to our problem. For instance, solving this equation with α = β = 5% and p_0 = 0.1, we get N_low = 3 × 10²⁰ sequences!
Fig. 4. Comparison between various empirical precisions and 1 − α, when p_0 = 0.1 (precision rate vs. dataset size; curves: Theoretical, Atis, Firstnames, Towns, French Words, Reber, S31 L10, S41 L10, S51 L12, S1 L10, S2 L10).
Fig. 5. Comparison between various empirical recalls and 1 − β, when p_0 = 0.1 (recall rate vs. dataset size; same curves as Figure 4).

5.2 An average solution
In the previous solution, we assumed that all the patterns that have been overlooked followed a normal distribution of expected value p_a = p_0 + 1/N_low, which is actually the worst situation. In practice, each omitted frequent pattern has its own theoretical support p_a that can belong to the interval ]p_0, 1]. How can we take into account those different possible values of p_a in our bound? We suggest in the following the computation of an average solution N̄_low, which is the expected value of N_low over ]p_0, 1], such that:

N̄_low = 1/(1 − p_0) × ∫ from p_0 to 1 of N_low dp_a.   (12)
(12)This expe ted value does not depend on
p
a
anymore. To assess the relevan eof this strategy, we omputed for dierent values of
α
andβ
(for the sake ofsimpli ity we set
α = β
) the expe ted valueN
¯
low
using Eq.12. Thesetheoret-i alresultsare des ribed inFigures4and 5 inthe formof two urvesinsolid
line. They are ompared with various urves of empiri alre all and pre ision
omputed from 10 dierent datasets. Four of them are real databases: Atis
whi hhasalreadybeenusedinthispaper,andthreeotherdatabasesavailable
at the URL http://abu. nam.fr/. Firstnames is a set of 12437 male and
femalerstnamesofdierentorigins;Townsisasetof39074namesoffren h
towns; Fren hWords isaset of 250750fren h words. We alsobuilt 6
arti- ialdatabasesfromprobabilisti automata:Reber isaset of15000sequen es
generated from the Reber grammar (33) whose target distribution is an
au-tomaton onstituted of 8 states and an alphabet of 7 letters;We generated 5
othersets of15000sequen es from5automata,ea hone omposedof
x
statesand an alphabet of
y
letters and denotedSxLy
(see Fig.6 for an example ofan automaton
S2L2
). Notethat su h an automaton onstitutes a theoreti aldistribution
D
from whi h it is possible to ompute the probabilityp(w)
ofany pattern
w
,using suitable al ulationmethods(see (34)for example).1(0.3)
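The integral in Eq. 12 can be approximated numerically. A minimal sketch follows, where `n_low_of_pa` stands for the bound of Theorem 1 seen as a function of p_a (not reproduced here); the constant function used in the example is a dummy stand-in, not the actual bound.

```python
# Midpoint-rule approximation of Eq. 12: the expected value of N_low
# when p_a ranges over the interval ]p0, 1].
def average_n_low(n_low_of_pa, p0, steps=1000):
    """n_low_of_pa: callable returning N_low for a given p_a (Theorem 1)."""
    h = (1.0 - p0) / steps
    total = sum(n_low_of_pa(p0 + (i + 0.5) * h) for i in range(steps))
    # total * h approximates the integral; dividing by (1 - p0) averages it
    return total * h / (1.0 - p0)

# Sanity check with a dummy stand-in: averaging a constant gives it back.
avg = average_n_low(lambda pa: 500.0, 0.1)  # -> 500.0 (up to rounding)
```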
[Figure 6: a two-state automaton; state labels 1(0.3) and 2(0) give final probabilities, edge labels a(0.41), b(0.7), b(0.59) give transition probabilities]
Fig. 6. Automaton S2L2 viewed as a target distribution D.
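To illustrate how p(w) can be obtained from such an automaton, here is a minimal forward-pass sketch (not necessarily the method of (34)). The topology below is only a guess consistent with the labels of Fig. 6 — final probabilities 0.3 and 0, transition b(0.7) leaving state 1, transitions a(0.41) and b(0.59) leaving state 2 — since the actual edge targets are only shown graphically.

```python
# Toy probabilistic automaton: trans[state][symbol] = (next_state, prob).
# Topology guessed from Fig. 6's labels; edge targets are illustrative.
trans = {
    1: {"b": (2, 0.7)},
    2: {"a": (1, 0.41), "b": (2, 0.59)},
}
final = {1: 0.3, 2: 0.0}  # probability of stopping in each state

def string_prob(w, start=1):
    """Probability that the automaton generates exactly the string w."""
    dist = {start: 1.0}  # state -> probability mass after the read prefix
    for sym in w:
        nxt = {}
        for q, mass in dist.items():
            if sym in trans.get(q, {}):
                q2, tp = trans[q][sym]
                nxt[q2] = nxt.get(q2, 0.0) + mass * tp
        dist = nxt
    return sum(mass * final.get(q, 0.0) for q, mass in dist.items())

print(string_prob(""))    # 0.3: stop immediately in state 1
print(string_prob("ba"))  # 0.7 * 0.41 * 0.3
```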
For each of the databases, we compute with SPAM the set of frequent patterns with a support threshold of 10%. This set will constitute the target distribution. Then we sample sets of growing size (from 10 to 15000) from which we also extract frequent patterns, and we calculate the empirical precision (1 − α̂) and recall (1 − β̂). The results are shown respectively in Figure 4 and Figure 5, and have to be compared with the curves in solid line of those figures. They confirm that we actually provide a relevant lower bound on the number of sequences needed to at least guarantee a priori fixed theoretical recall and precision. Whatever the database, its corresponding curves (1 − α̂ or 1 − β̂) are always over the theoretical one. Note that the distance between the empirical risks and our lower bound is quite large for some databases, meaning that our bound can remain quite pessimistic. However, it does not challenge its relevance since, as shown with the curves obtained from the automata S1L10 and S2L10, it may happen that the empirical risks, due to specific sampling effects, are much closer to the theoretical ones.
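The empirical precision and recall used here reduce to a simple set comparison between the mined patterns and the target frequent patterns; a minimal sketch (function name and example data are illustrative):

```python
# Empirical precision (1 - alpha_hat) and recall (1 - beta_hat) of a
# mined pattern set against the target frequent-pattern set.
def precision_recall(target_patterns, mined_patterns):
    target, mined = set(target_patterns), set(mined_patterns)
    tp = len(target & mined)  # truly frequent patterns actually found
    precision = tp / len(mined) if mined else 1.0
    recall = tp / len(target) if target else 1.0
    return precision, recall

# Two of the three mined patterns are truly frequent, and two of the
# three truly frequent patterns were found:
p, r = precision_recall({"ab", "ba", "aa"}, {"ab", "ba", "bb"})
```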
Note that the theoretical curves described in Figures 4 and 5 only tackle the case of p_0 = 0.1. In order to provide a calculating tool that would make the estimation of the minimal number of data easier, we built the theoretical curves for different values of p_0. Figure 7 describes a set of abaci that directly provide the lower bound N̂_low required to guarantee at least a precision of (1 − α) and a recall of (1 − β) (once again, for the sake of simplicity, we set α = β). We can note that the curves are not the same, which means that the value of p_0 has a direct impact on the lower bound. From a mathematical point of view, this can be easily explained by the fact that p_0 is used in the formulae of N_low (see Theorem 1) within a concave function of the form p_0(1 − p_0), which is maximal for p_0 = 0.5. Therefore, for the same values of α and β, setting p_0 = 0.5 requires more data than for other values. This explains the fact that the curve for p_0 = 0.5 is under the others. Therefore, from a statistical point of view, to avoid having large risks α and β, a good strategy consists of choosing a support threshold p_0 far from 0.5.
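This behaviour is easy to check numerically: the factor p_0(1 − p_0) is symmetric around 0.5 and maximal there, which also explains why the abaci pair p_0 with 1 − p_0. A small sanity check (not from the paper):

```python
# The concave factor p0 * (1 - p0) appearing in N_low: symmetric in
# p0 <-> 1 - p0 and maximal at p0 = 0.5.
def var_factor(p0):
    return p0 * (1 - p0)

values = {p / 10: var_factor(p / 10) for p in range(1, 10)}
# var_factor(0.3) equals var_factor(0.7) up to rounding; maximum at 0.5
```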
[Figure 7: theoretical precision and recall (%) vs. dataset size (0 to 20000), one abacus per support threshold: p_0 = 0.5; 0.4 or 0.6; 0.3 or 0.7; 0.2 or 0.8; 0.1 or 0.9]
Fig. 7. Evolution of the theoretical recall and precision according to the lower bound N_low and the support threshold p_0 (from 0.1 to 0.9).

6 Conclusion and future work
In this paper, we dealt with the assessment of the significance of a frequent pattern mining process. To perform this task, we presented a lower bound on the number of data required to satisfy theoretical precision and recall. As far as we know, this constitutes the first attempt to control both criteria. Despite its theoretical nature, we showed that our bound can be very useful in real-world data mining applications. We empirically tested our bound in the specific case of sequence mining tasks. However, our work can be adapted to other data mining contexts that require a comparison to a minimal support threshold.
Throughout this paper, we wanted to stay in a theoretical framework in order to avoid a dependence on the application we deal with. This explains why we did not use any information about the sample set S. In the future, we plan to reduce the pessimism of our theoretical bound by integrating background knowledge during the computation of the bound. One solution would consist in using the empirical distribution of the patterns in S to weight each value used in the computation of the integral in Eq. 12. However, this deserves further investigation. Indeed, once again, such an empirical distribution is dependent on a finite sample set whose size must be integrated in the model to avoid bad estimates.

Finally, note that our theorem can also constitute a good condition to fulfill in various machine learning domains. Actually, since building a set of N_low data allows us to have a good estimate of any pattern w, it also enables us to correctly estimate the probability of any n-gram, which is a special case of pattern. Since n-grams are used in many probabilistic models in machine learning, such as probabilistic automata, stochastic transducers, or Hidden Markov models, we think that N_low can constitute a good lower bound to efficiently learn such stochastic models.
References
[1] B. Goethals, M. J. Zaki, Advances in frequent itemset mining implementations: report on FIMI'03, SIGKDD Explorations 6 (1) (2004) 109-117.
[2] J. Han, H. Cheng, D. Xin, X. Yan, Frequent pattern mining: current status and future directions, Data Mining and Knowledge Discovery 15 (1) (2007) 55-86.
[3] R. Agrawal, R. Srikant, Fast algorithms for mining association rules in large databases, in: Proceedings of the 20th International Conference on Very Large Data Bases, Morgan Kaufmann, 1994, pp. 487-499.
[4] H. Mannila, H. Toivonen, A. I. Verkamo, Efficient algorithms for discovering association rules, in: KDD Workshop, 1994, pp. 181-192.
[5] A. Ceglar, J. F. Roddick, Association mining, ACM Computing Surveys 38 (2) (2006).
[6] H. Mannila, H. Toivonen, A. I. Verkamo, Discovery of frequent episodes in event sequences, Data Mining and Knowledge Discovery 1 (3) (1997) 259-289.
[7] ...straints, in: Proceedings of the 9th international conference on information and knowledge management, ACM Press, 2000, pp. 422-429.
[8] M. Garofalakis, R. Rastogi, K. Shim, Mining sequential patterns with regular expression constraints, IEEE Transactions on Knowledge and Data Engineering 14 (3) (2002) 530-552.
[9] J. Pei, J. Han, W. Wang, Mining sequential patterns with constraints in large databases, in: Proceedings of the 11th international conference on information and knowledge management, ACM Press, 2002, pp. 18-25.
[10] D. Jiang, J. Pei, Mining frequent cross-graph quasi-cliques, ACM Transactions on Knowledge Discovery from Data 2 (4) (2009) 1-42.
[11] P. Hintsanen, H. Toivonen, Finding reliable subgraphs from large probabilistic graphs, Data Mining and Knowledge Discovery 17 (1) (2008) 3-23.
[12] D. Chakrabarti, C. Faloutsos, Graph mining: Laws, generators, and algorithms, ACM Computing Surveys 38 (1) (2006).
[13] L. Getoor, C. P. Diehl, Link mining: a survey, SIGKDD Explorations 7 (2) (2005) 3-12.
[14] Y. Chi, R. R. Muntz, S. Nijssen, J. N. Kok, Frequent subtree mining - an overview, Fundamenta Informaticae 66 (1-2) (2005) 161-198.
[15] M. J. Zaki, Efficiently mining frequent embedded unordered trees, Fundamenta Informaticae 66 (1-2) (2005) 33-52.
[16] A. Termier, M.-C. Rousset, M. Sebag, K. Ohara, T. Washio, H. Motoda, DryadeParent, an efficient and robust closed attribute tree mining algorithm, IEEE Transactions on Knowledge and Data Engineering 20 (3) (2008) 300-320.
[17] V. Chaoji, M. A. Hasan, S. Salem, M. J. Zaki, An integrated, generic approach to pattern mining: data mining template library, Data Mining and Knowledge Discovery 17 (3) (2008) 457-495.
[18] J. Han, R. B. Altman, V. Kumar, H. Mannila, D. Pregibon, Emerging scientific applications in data mining, Communications of the ACM 45 (8) (2002) 54-58.
[19] M. Zaki, S. Parthasarathy, W. Li, M. Ogihara, Evaluation of sampling for data mining of association rules, in: 7th International Workshop on Research Issues in Data Engineering, 1997, pp. 42-50.
[20] C. Raissi, P. Poncelet, Sampling for sequential pattern mining: from static databases to data streams, IEEE International Conference on Data Mining (2007) 631-636.
[21] H. Toivonen, Sampling large databases for association rules, in: Proceedings of the 22nd International Conference on Very Large Data Bases, Morgan Kaufmann, 1996, pp. 134-145.
[22] P.-A. Laur, R. Nock, J.-E. Symphor, P. Poncelet, Mining evolving data streams for frequent patterns, Pattern Recognition 40 (2) (2007) 492-503.
[23] P.-A. Laur, J.-E. Symphor, R. Nock, P. Poncelet, Statistical supports for mining sequential patterns and improving the incremental update process
[24] N. Megiddo, R. Srikant, Discovering predictive association rules, in: Proceedings of the 4th international conference on knowledge discovery and data mining, 1998, pp. 274-278.
[25] G. I. Webb, Discovering significant patterns, Machine Learning 68 (1) (2007) 1-33.
[26] J. P. Shaffer, Multiple hypothesis-testing, Annual Review of Psychology 46 (1995) 561-584.
[27] G. Gavin, S. Gelly, Y. Guermeur, S. Lallich, J. Mary, M. Sebag, O. Teytaud, Type 1 and type 2 errors for multiple simultaneous hypothesis testing. http://www.lri.fr/teytaud/risq., PASCAL theoretical challenge (2007) 29-47.
[28] S. Holm, A simple sequentially rejective multiple test procedure, Scandinavian Journal of Statistics 6 (1979) 65-70.
[29] Y. Benjamini, Y. Hochberg, Controlling the false discovery rate: a new and powerful approach to multiple testing, Journal of the Royal Statistical Society Series B 57 (1995) 289-300.
[30] S. D. Lee, D. W. Cheung, B. Kao, Is sampling useful in data mining? A case in the maintenance of discovered association rules, Data Mining and Knowledge Discovery 2 (3) (1998) 233-262.
[31] A. Gionis, H. Mannila, T. Mielikäinen, P. Tsaparas, Assessing data mining results via swap randomization, ACM Transactions on Knowledge Discovery from Data 1 (3) (2007).
[32] J. Ayres, J. Flannick, J. Gehrke, T. Yiu, Sequential pattern mining using a bitmap representation, in: Proceedings of the 8th international conference on knowledge discovery and data mining, ACM Press, 2002, pp. 429-435.
[33] A. S. Reber, Implicit learning of artificial grammars, Journal of Verbal Learning and Verbal Behavior 6 (1967) 855-863.
[34] P. Hingston, Using finite state automata for sequence mining, in: Proceedings of the 25th Australasian conference on computer science, Australian Computer Society, Inc., 2002, pp. 105-110.