HAL Id: hal-02341338
https://hal.archives-ouvertes.fr/hal-02341338
Submitted on 31 Oct 2019
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of
sci-entific research documents, whether they are
pub-lished or not. The documents may come from
teaching and research institutions in France or
abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est
destinée au dépôt et à la diffusion de documents
scientifiques de niveau recherche, publiés ou non,
émanant des établissements d’enseignement et de
recherche français ou étrangers, des laboratoires
publics ou privés.
HASAR : mining sequential association rules for
atherosclerosis risk factor analysis
Laurent Brisson, Nicolas Pasquier, Céline Hebert, Martine Collard
To cite this version:
Laurent Brisson, Nicolas Pasquier, Céline Hebert, Martine Collard. HASAR : mining sequential
as-sociation rules for atherosclerosis risk factor analysis. Workshop in 8th PKDD conference (Principles
and Practice of Knowledge Discovery in Databases), Sep 2004, Pisa, Italy. �hal-02341338�
Atherosclerosis Risk Factor Analysis LaurentBrisson
1
, NicolasPasquier1
,CélineHebert2
, MartineCollard1
1
LaboratoireI3S(CNRSUMR-6070),UniversitédeNice Sophia-Antipolis,
2000 routedeslucioles, LesAlgorithmes,06903 Sophia-Antipolis,France;
{brisson,mcollard,pasquier}@i3s.unice.fr
2
LaboratoireGREYC(CNRSUMR-6072),UniversitédeCaenBasse-Normandie,
UFRSciences,CampusII-B.P.5186,14032Caen,France;
clecharp@etu.info.unicaen.fr
Abstract. We present the HASARmethodthat is anhybrid approachfor
ex-tracting adaptive sequentialassociation rules. Thismethod extractsassociation
rulesbetweeneventsoccurringinsubsequenttime-intervalsusingcloseditemsets
extraction and evolutionary techniques.An importantfeature is itscapacity to
considerdierenttime-intervalsdependingontheattributessemantic.Weapplied
this methodfor theanalysisoflong termmedicalobservationsofatherosclerosis
riskfactorsforcardio-vasculardiseasesprevention.Experimentalresultsshowthat
it is well-suited for extracting knowledge from temporal data where interesting
patternshavedierentobservationperiodlength.
1 Introduction
Inthispaper,weconsidersequentialassociationruleswhichexpresscasualtyrelationships
betweensets of events occurring in subsequent time-intervals. We developed a specic
approach, calledHASARfor HybridAdaptiveSequentialAssociationRules, fornding
such patternsofinterest.Ourmethodology isappliedto anhealthcareproblem,but it
is also suited to a broadcollection of data mining problems where data are temporal
observationsonindividuals,incustomerbehaviourorcreditriskpredictionforinstance.
Manystudies havefocused ontheecientminingof sequentialpatterns orpatternsin
time-related data. Most of them are based on extensions of the APRIORI algorithm
[AMS+96]proposedforextractingassociationrules.TheHASARapproachpresentedin
this papercombines techniquesforsearchingclosed itemsets, asdened in the CLOSE
algorithm[PTB+04],andanheuristicapproachrelyingonageneticalgorithm.
WeusedtheHASARapproachfortheanalysisoflongtermobservationdatainthe
STU-LONGdataset.Thisdatasetwasconstitutedintheframeworkofalongitudinalstudyof
atherosclerosisriskfactorsto evaluatetheimpact ofnon-pharmacologicalprescriptions
ontheserisks.Itcontainsdatacollectedbetween1975and2001onapopulationof1417
menbornbetween1926and1937inCzechoslovakia.First,anentryexaminationwas
per-formedanddataconcerningsocialcharacteristics,diet,tobaccoandalcoholconsumption,
physicalactivities,personalanamnesisand,physicalandbiochemicalexaminationswere
collected.Duringthisexamination,patientswereclassiedintothreegroupsaccordingto
principalatherosclerosisriskfactors(RFs):Arterialhypertension,hypercholesterolemia,
hypertriglyceridemia,overweight,smokingandpositivefamilycasehistory.These three
Pathologicalgroup(PG):Cardio-vasculardiseasediagnosed.
During the twenty one years following the patients' entry, control examinations were
performedtorecordchangesindiet,smokinghabits,physicalactivitiesandresponsibility
level in job, sport practicein leisure timeand, physical and biochemicalmeasures.For
each control,the patient id,the date and the control number were recorded. In 2001,
389patientsweredeceasedand,thecauseanddateofdeathwererecorded.Foragroup
of403patientsthat answeredtoapostalquestionnaire,detailed datasimilarto those
gatheredduring controlswerecollected.
TheSTULONGdatasetwascollectedatthe2ndDepartmentofMedicine,1stFacultyof
MedicineofCharlesUniversityandCharlesUniversityHospital,Unemocnice2,Prague2
(head.Prof.M.Aschermann,MD,SDr,FESC),underthesupervisionofProf.F.Boudík,
MD,ScD,withcollaborationofM.Tomeèková,MD,PhDandAss.Prof.J.Bultas,MD,
PhD.ThedataweretransferredtotheelectronicformbytheEuropeanCentreofMedical
Informatics,StatisticsandEpidemiologyofCharlesUniversityandAcademyofSciences
(head.Prof.RNDr.J.Zvárová,DrSc).Atpresenttimethedataanalysisissupportedby
thegrantoftheMinistryofEducationCRNrLN00B107.
Objectives. One of the main objectivesof this study wasto evaluate the impact of
behaviouralchangesstartingorstoppingadietorsportpracticeforinstanceonthe
RFandthedevelopmentofcardio-vasculardiseases(CVDs).Theinitialanalytical
ques-tionsinSTULONGevolvedaftertherstexperimentationsanddiscussionswithmedical
experts, mainly because of the evolutions of medical knowledge sincethe beginning of
thestudyandmissingdata(e.g.,uriacidmeasureisgivenforlessthan10%ofcontrols).
Duringthis work,wehavedevelopedanewmethodthat webelievewillhelp toanswer
thefollowingquestionsdenedthroughoutdiscussionswithmedicalexpertsandrelated
tothelong-termobservations:
Aretheredierencesbetweenmenof thenormal,riskandpathologicalgroupsfrom
theviewpointoftheimpactofbehaviouralchangesonRFandCVDdevelopment?
WhatcharacterizesmenwhodevelopedaCVDandthosewhostayedhealthyonthe
globalpopulationandtheriskgroup?
Are the education level and the responsibility in job good criteria for segmenting
patientswithperilousorsafebehavioursand highorlowRF?
HASARisanassociationrulesextractionapproachincorporatingtemporalrelationships.
Itextractssequentialassociation ruleswhicharewellttedtoanalysecasualty
relation-shipsbetweenbehaviouralchangesandRFevolutionsorCVDdevelopment.Association
ruleextraction,rstintroducedin[AIS93],aimsatdiscoveringcasualtyrelationships
be-tweensetsofattributevalues,calleditemsets,inlargedatasets.Anexampleassociation
rule,ttinginthecontextofmarketbasketanalysisis:
BUY(cereal)
∧
BUY(sugar)→
BUY(milk), support=20%,condence=75%.This rule states that customers who buy cereal and sugar also tend to by milk. The
support measureindicates that 20% of allcustomersbought boththree itemsand the
condence measure shows that 75 % of customers who bought cereal and sugar also
boughtmilk.Informally,thesupportrepresentstherangeoftheruleandthecondence
indicatestheprecisionoftherule. Inorderto extractonlystatistically signicant
plied to the dataset for extracting long-term observation related patterns. Sequential
associationrulesandthetechniquesweusedfortheirextractionaredenedinsection3.
Insection4,weshowexperimentalresultsandsection5concludesthepaper.
2 Data preparation
Datafor thelong-term observation were collectedduring thetwentyoneyears controls
after the patients' entry. We used both entry and control data to generate multiple
datasetsthat areadapted tothekindofknowledge wewere interestedin.Attributesof
interest were selectedand preparedaccording to discussions with medical experts, for
determiningthresholdvaluesofphysicalandbiologicalmeasuresforinstance.
2.1 Sequentialrules and search strategy
Since our main objective concerns the eect of behaviour on risk factor development,
wedecided to look forsequentialrulesinvolvingcasualty relationshipsbetweenpatient
behavioralchangesandriskfactorchangesonsubsequenttimeintervals.
Sequentialrules with the form X
→
Y we search for involve both itemsets andtime-itemsets. Wecall time_itemanattribute valueoccurringin aparticulartemporal
win-dow.Wecall time_itemsetasetof time_items.Oursequentialruleshavethefollowing
structure:
IDE_itemset
∧
BEH_time_itemset→
RF_time_itemsetwherethecomponentsare:
IDE_itemset: anitemset ofstaticidentication attributes,
BEH_time_itemset:atime_itemsetofbehaviouralattributes,
RF_time_item:atime_itemonariskfactorattribute.
Anexamplesequentialrulemaybe:
ALCOHOL=regularly
∧
BEH_PHA=decreased_sits→
RF_CHOLEST=increased.Sucharulemustbeinterpretedasacasualtyrelationshipbetweenchangesonriskfactors
occurring onan observation temporal window of Omonths and inducedby static data
and by changes on behaviour on aprevious action temporal window of A months. We
alsodenedalatency temporalwindowL betweentheactionperiodandtheobservation
periodwhich allowsawaiting timeto observetheimpactof somebehaviouralchanges.
Theexampleruleshouldbeinterpretedasfollows:
if the patient regularlyconsumedalcohol whenhe enteredthe study
and hisphysical activity afterjob decreasedoveraAmonth period
thenhischolesterol rate increasedatacontrol whichoccurredL monthsafter
andoverthesubsequentOmonthobservation time.
AnimportantelementforthemethodexibilityisthattemporalwindowsizesO,Aand
L aredened asparametersofsequentialrules. Thestrategy weapplied for extracting
theseruleswasrstrunningwidedatatransformationstotailordatatothespecictask.
Thisrst stepconsisted in applyingcorrections, replacingmissing values,creatingnew
Statistical measures are presented in sections 3.1 and 3.3. HASAR exibility is also
provided by the evolutionary algorithm involved for searching for rules. We dened a
GeneticAlgorithm(GA)forthistask.GAs[GR87]arewellsuitedtolargecombinatorial
searchspaces.GAsareadaptiveproceduresthatevolveapopulationofstructuresinorder
tondthebest individual.Theevolutionis performedbyspecic geneticoperatorslike
mutationandcrossover.Theyhavealonghistoryofbeingexploitedforrulemanipulation
[Freitas02,SAL03].Asitwassuggested,theyoertechniquessuchasnichingwhichallow
notonlytondthebestrule,butaselectionofgoodrules.
2.2 Patientclasses
Inordertoobservepotentialdierencesbetweenclassesofpatients,wedistinguishedthe
followingpartitions:
GroupsNG,PGandRGassignedtopatientsin theSTULONG study.
ClassesCVDandNCVDwhich respectivelyrepresentpatientswhohadanddidnot
haveacardio-vasculardiseaseduringtheSTULONGstudy.
Clusters ofpatientsbasedontheireducationlevelandjobresponsibilitycriteria.
Classes CVD and NCVD were obtained by splitting the EXPERIMENT table using
at-tributesCONTROL.HODNx,withx
6=
0andx6=
15,thatindicatesifacardio-vasculardiseasewasdiagnosedandDEATH.PRICUMRthatindicatesifhisdeathwasduetoacardio-vascular
disease.
Clustersof patientsbasedon theireducationlevel andjob responsibilitiescriteriawere
obtainedasfollows.Inarstattempt,weextractedallclosedpatternscontainingatleast
thesocialfactorsbutthosegatheringarelevantnumberofpatients(atleast200)didnot
reveal a signicant medical interpretation. This is due to the fact that some attribute
valuescoveralargenumberofpatients(e.g.,1023patientsamongthe1199aremarried).
Aftertalking withaphysician,the`educationlevel (VZDELANI)andresponsibilitiesin
job (ZODPOV)attributes,thataremostlikelytoinuenceatherosclerosis,werechosenas
maincriteriaforbuildingtheclusters.Weinvestigatedtheclosedpatternscontainingthe
itemscomingfromtheseattributes.Wegotthefollowing11closedpatterns(orpotential
clusters):
1. basicschooland others(forresponsibilitiesin job)
2. primaryschoolandmanagerialworker
3. primaryschoolandpartlyindependentworker
4. primaryschoolandothers
5. secondaryschooland managerialworker
6. secondaryschooland partlyindependentworker
7. secondaryschooland others
8. universityandmanagerialworker
9. universityandpartlyindependentworker
10. universityandothers
11. universityandpensioner(not becauseofICHS)
Theclosedpatternsnumber5and8weremergedtoproducetherstcluster.Weperform
asimilarprocesswithclosedpatternsnumber6and9,7and10,1and4.Thefthcluster
Cluster Secondaryschool Responsibilityinjob Healthy Atherosclerosis Total
1 yes managerialworker 150 60 210
2 yes partlyindependentworker 227 82 309
3 yes others 127 59 186
4 no others 221 122 343
5 - - 94 57 151
Totalofpatients 819 380 1199
Table 1.Descriptionofclusters.
2.3 Attributes ofchange and new tables
A preliminary step consisted in correctingsome errors orcontradictions and replacing
missing and not stated values by the same value. We built new tables CHANGES and
EXPERIMENTmorettedto thetask from theinitial tables ENTRY,CONTROL.The DEATH
tablewasusedinordertosplitthepatientsetintotheCVDandNCVDclasses.
Discus-sionswithmedicalexpertsallowedustoidentifysomeguidelinesforbuildingtablesand
to understand which initial attributes wehad to keepanwhich ones wehaveto build.
First,wekeptexisting identication attributes(IDE attributes) about patients.These
attributesarenamedaccordingtotheexpressionIDE_attribute_name.Theyrepresent:
theeducationlevelofthepatient,
theageofthepatientat themomentofthecontrol,
theinitialgroupof thepatientwhenhecameinto thestudy,
thealcoholconsumptionatthebeginningofthestudysinceSTULONGdatadonot
providethisinformationforeachcontrol.
Otherattributesdescribebehaviouralchangesandchangesrelatedtoriskfactorsfromone
controltoanother.Behaviouralchangeattributes(BEHchangeattributes)werenamed
accordingtotheexpressionBEH_attribute_nameandriskfactorchangeattributes(RF
change attributes) were named according to the expression RF_attribute_name.
At-tributesforbehaviouralchangesarerelatedto criteriabelow:
consumptionofcigarettesaday,
physical activityinjob andafterjob,
dierentkindsofdiet,
medicineforcholesterolandforbloodpressure.
Attributesforriskfactorchangesarerelatedtothefollowingcriteria:
global,HDLandLDLcholesterollevels,
triglycerideslevel,
overweightorobesity,
blood pressuremeasures,
glycemialevel.
ThenewCHANGEStableis composedwith IDE,BEHchange andRFchange attributes.
For two subsequent controls number N and N+1 of a patient, a new tuple is created
andthevalueofeachBEHandRFattributeiscomputedbycomparingattributevalues
of the twocontrols. For therst control,a newtuple is created by comparing it with
stay_sits hemainlysits hemainlysits
decreased_sits moderateorgreatactivity hemainlysits
increased_modest hemainlysits moderateactivity
stay_modest moderateactivity moderateactivity
decreased_modestgreatactivity moderateactivity
increased_great hemainlysitsormoderateactivitygreatactivity
stay_great greatactivity greatactivity
Table 2.AttributeBEH_PHA.
Value CONTROL.CHLST(N)CONTROL.CHLST(N+1)
stay_normal <6 <6
decreased
≥
6 <6increased <6
≥
6stay_high
≥
6≥
6Table 3.AttributeRF_CHOLEST.
RF_CHOLESTattributevaluesarecomputedfrom valuesofattributesCONTROL.AKTPOZAM
(whichindicatesthephysicalactivityafterjob)andCONTROL.CHLST(whichindicatesthe
globalcholesterolrate).
Inordertogetatemporaldescriptionofeachpatientallalongthestudy,weattenedthe
CHANGEStable.TheEXPERIMENTtableistheresultoftheatteningoperation;itcontains
asmanytuplesaspatientsinCONTROL.OnetupleinEXPERIMENTcontainsIDEattributes
ofthepatientandchangesattributesforeverycontrolthepatientpassedthrough.
3 Strategies
3.1 Target modeland denitions
Weconsider the time segmentwhich representstheinterventionon apatient from the
timeheenteredthestudyuntilthetimeheleft it.Ourdiscussionswithmedicalexpert
led usto xa monthas the time unit.Let usconsider apatient andthe time-interval
[
in; out
]duringhewasintheSTULONGstudy.C(T, patient)
refers to acontrol occurring at timeT
months afterin
for the patient.Acontrol
C(T, patient)
ischaracterizedbyaBEH_time_itemsetandaRF_time_itemwhichrepresentsbehaviouralandriskfactorschangesobservedforthepatientatthis
con-troltime.Acontrolperiod
CP (T, patient)
isaA-monthsizetimeinterval[T ; T +A
]whereitoccursonecontrolatleastforthepatient.Atemporalconguration
T C(T
1
, T
2
, patient)
isatimeinterval[
T
1
; T
2
]suchas:itexistsacontrolperiod
γ
=CP (T, patient)
withT
1
∈
[T ; T + A
],
T
2
∈
[T
1
+ L; T
1
+ L + O
],itoccurredacontrol
C(T
2
, patient)
,itdidnotoccuranycontrolin theinterval[
T + A; T
2
]forthepatient.Wesaythattwotemporalcongurations
T C(T
1
, T
2
, patient)
andT C(T
3
, T
4
, patient)
arecompatiblesif
T
2
≤
T
3
orT
4
≤
T
1
.Statistical measures. We dene the measure max_support as the whole number of
possible compatible temporal congurations for all records (patients) in the attened
table EXPERIMENT.Forarule antecedent
X
=IDE_itemset∪
BEH_time_itemset, weBEH_time_itemsetisobservedoveracontrolperiod
γ
=CP (T, patient)
.Foraruleconsequent
Y
=RF_time_item,wesaythatatemporalcongurationT C(T
1
,
T
2
, p)
containsY
ifRF_time_itemis observedatcontrolC(T
2
, patient)
.Foraruleantecedent
X
,wedenethecardinalitymeasureofX
,card(X)
,asthenumberofpossiblecompatible
T C(T
1
, T
2
, p)
thatcontainsX
.ForaruleconsequentY
,wedenethecardinalitymeasureof
Y
,card(Y )
,asthenumberofpossibleasthenumberofpossiblecompatible
T C(T
1
, T
2
, p)
whichcontainsY
.ForaruleX → Y
,wedenethecardinalitymeasureof
X ∧ Y
,card(X ∪ Y )
asthenumberofpossibleT C(T
1
, T
2
, p)
thatcontainsX
and
Y
.3.2 Association rulesbased approach
Twomainapproachesforextractingassociationrulescanbedistinguished.
Intherstapproach,allitemsetswithsupport
≥
minsupport,calledfrequentitemsets,areextractedand allassociationruleswith condence
≥
mincondence aregeneratedfromthem. This approach is veryecient when dataare weakly correlated,such asmarket
basket data, but performances drastically decrease when data are denseor correlated,
such as statistical data for instance. A comprehensivesurvey of this approach can be
foundin[AMS+96].
Thesecondapproachisbasedontheextraction ofgeneratorsandfrequent closed
item-sets dened using the Galois closure operator. From these, the informative basis for
association rules containing non-redundant association rules with minimal antecedent
andmaximal consequent.Thisapproachbothimprovestheextractioneciency,by
re-ducingthesearch-space,andtheresultrelevance,bysuppressingredundantrules,inthe
caseofdenseorcorrelateddata.Asummaryofthisapproachcanbefoundin[PTB+04].
Frequentscloseditemsetsandgenerators. Frequentcloseditemsetsandgenerators
aredened accordingtothe closureoperator
φ
oftheGalois connection.This operatorassociateswithanitemset
l
itsclosureφ(l)
thatisthemaximalsetofitemscommontoallobjectscontaining
l
.That is,theclosureofl
istheintersectionofallobjectscontainingl
.Theminimalcloseditemset containinganitemsetl
isitsclosureφ(l)
andwesaythatanitemset
l
is acloseditemsetifφ(l) = l
. Thegeneratorsofaclosed itemsetc
aretheminimal 3
itemsetswhichclosureis
c
.Generatorsaretheminimalitemsetswecanconsiderfordiscoveringfrequentclosed itemsets,bycomputing theirclosures.Sincethesupport
ofafrequentitemsetisequaltoitsclosuresupportandsincemaximalfrequentitemsets
aremaximalfrequentcloseditemsets,thefrequentcloseditemsetsconstituteaminimal
non-redundantgeneratingsetforallfrequentitemsetsandthus,forallassociationrules.
Considerthedataset
D
,constitutedofsixobjectsidentiedbytheirOIDandveitems,representedin gure 1(a).Theeightgeneratorsand vefrequent closeditemsets, with
theirsupports,in
D
forminsupport=2/6aregiveningure1(b).Theitemset{A}isthegeneratorofthefrequentcloseditemsets{AC}:theintersectionof
allobjectscontaining{A},thatareobjects1,3and4,gives{AC}andnosubsetof{A}
has{AC}asclosure.{A}and{AC}bothhaveasupportof
k{1,
3
,
4
}k
kDk
=3/6
.Thefrequentcloseditemset {BCE}hastwogenerators:{BC}and {CE}.{BE} isnotageneratorof
{BCE}since it is a frequent closed itemset, {B}and {E} are generatorsof {BE} and
{C}is itselfitsowngenerator.
OID Items 1 A C D 2 B C E 3 A B C E 4 B E 5 A B C E 6 B C E (a)Dataset
D
. {A} {AC} 3/6 {B} {BE} 5/6 {C} {C} 5/6 {E} {BE} 5/6 {AB} {ABCE} 2/6 {AE} {ABCE} 2/6 {BC} {BCE} 4/6 {CE} {BCE} 4/6 (b)Extractionresult.Fig.1.Generatorsandfrequentcloseditemsets.
3.3 Evolutionaryapproach
GeneticAlgorithmsarerobust,exiblealgorithmswhichtendtocopewellwithattribute
interactioninatherosclerosisdata.Furthermore,thecomprehensibilityofthediscovered
knowledgeisimportantandGAsallowustoextractcomprehensiblerulesevolving
pop-ulations of prediction patterns. Our GA implementation uses EO, a templates-based,
ANSI-C++compliantevolutionarycomputationlibrary.
Genome. The rst issue in designing a GA is how to encode each individual in the
population. To representvariable length ruleweuse axed-length genome which
con-tainsageneforeachIDE_attributeandBEH_attributeandanothergeneforoneofthe
RF_attributes.Each genecontainsthree elements:attributename, attribute valueand
anactivationagindicatingwhetherornotanitemintheruleisassociatedtothegene.
Generationofinitialpopulation. Themethodusedtogeneratetheinitialpopulation
isbasedonCLOSEalgorithm results.CLOSE generatealistwithgeneratorsandtheir
frequent closed itemset associated, that allows us to initialize the population in three
steps:
Rule antecedents are created from generators. Generators containingonly RFs are
skipped.
RuleconsequentsarecreatedwiththerstRFfoundinthefrequentcloseditemset
orthegenerator.IfnoRFisfound,wegeneraterandomlyone.
Foreachattributenotrepresentedingeneratorsthevalueofthecorrespondinggene
israndomlydened andtheactivation agissettofalse.
Genetic operators. We use a tournament selection with size of 2. The selection is
deterministic,starting from thebest ones downto the worse ones.If thetotalnumber
toselectislessthanthesizeof thesourcepopulations,thebestindividualsareselected
once. Ifmoreindividuals are neededafter reaching thebottom ofthe population, then
the selection starts again at top. If the total number required is N times that of the
sourcesize,allindividualsareselectedexactlyNtimes.Forreplacementweusethemost
straightforwardmethod, called generationalreplacement where allospring replaceall
parents;howeverweakelitism isused.
Newpatterns are generated by combining existing patterns using a crossoveroperator
orbymodifyingexistingpatternsviaamutationoperator.Crossoverisarecombination
value (domain of this value is a parameter of the GA) which is randomly added or
subtractedtothecurrentgenevalueandthelastinvertsthecurrentvalueoftheactivation
ag.All ofthemutation ratescanbedeneindependently.
Fitness function. A crucial issue in the design of a GA is the choice of the tness
function.Inarstapproachweonlyconsider supportto selectthemostfrequentrules,
condencetoconsiderreliablerulesandlifttoensureahighlevelofdependencebetween
antecedentand consequentpart of arule. To evaluate the quality of arule r: X
→
Y,ourGAappliesthetnessfunctionontheindividualassociatedtotherule.
f itness(r) = support(r) ∗ conf idence(r) ∗ lif t(r)
(1)support(r) =
card(X ∪ Y )
max
_support
(2)conf idence(r) =
card(X ∪ Y )
card(X)
(3)lif t(r) =
card(X ∪ Y )
card(X) ∗ card(Y )
(4) 4 Experimental results4.1 Patientsclasses comparison
Patientgroups. Theexperienceconsisted inextractingrulesonPGand testingthem
on NG and RG. Dierences for support and tness measures are shown in gure 2.
Onemayobservethat the best rulesfound onPG are notvalid onNG. Most of them
haveagoodsupportbut aweaktnessonRG.Thus,these resultsshowquitedierent
relationshipsbetweenthepatientbehaviourandtheirriskfactorsamonginitialgroups.
0
0.01
0.02
0.03
0.04
Support
PG
NG
RG
1e-05
0.0001
0.001
0.01
0.1
1
10
Fitness
PG
NG
RG
Fig.2.BestrulesonPGversusNGandRG.
Cardiovascular disease. The experience consisted in extracting rules on CVD and
testingthemonNCVD.Resultsingure3showthatthebestrulesfoundonCVDhave
similar support and tness on NCVD. These results seem to show that relationships
0
0.005
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08
0.09
0.1
Support
CVD
NCVD
0
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08
0.09
0.1
0.11
0.12
0.13
Fitness
CVD
NCVD
Fig.3.Bestrules onCVDversusNCVD.
Socialclusters. This experienceconsisted in extractingrules onCluster1and testing
them on Cluster3 and Cluster4. We can see in gure 4 that the best rules found on
Cluster1alsohavegoodmeasuresonCluster4and resultsonCluster3arequitesimilar.
Thusitseemsthatsocialfactorslikeeducationlevelandjobresponsibilitydonotallow
todistinguishdierentbehaviourfromtheviewpointofcardiovascularrisks.
0
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0.16
Support
C1
C3
C4
0
0.002
0.004
0.006
0.008
0.01
0.012
0.014
Fitness
C1
C3
C4
Fig.4.BestrulesonCluster1versusCluster3andCluster4.
4.2 Initialisationmethods
In order to evaluate performances of our initialization method we compare it with a
randominitializationmethod. ResultsforPGandRGgroupsareshownin gure5and
resultsforCVDandNCVDareshowningure6.Therstthreecolumnsshowstatistics
aboutinitialpopulationsandthelastcolumnshowthetnessofbestindividualsaftera
GArun.WecanobservethatmeantnessofpopulationsgeneratedusingCLOSEis8.75
to 400 times better than those of randomly generated population. It is not surprising
since CLOSE optimize rules support and condence, two mains criteria of our tness
function.However, it is interestingto note that after a GArun thetness of the best
ospring of populations generated using CLOSE is 1.55 to 4.79 times better than the
best ospringofrandomlygeneratedpopulation.Furthermore,ananalyzeofrulesshow
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
Mean
St.dev
Initial max Final max
Fitness
CLOSE Initialization
Random Initialization
(a)PG0
0.0005
0.001
0.0015
0.002
0.0025
Mean
St.dev
Initial max Final max
Fitness
CLOSE Initialization
Random Initialization
(b)RG
Fig.5.Comparinginitializationmethodsonpatientgroups.
0
0.001
0.002
0.003
0.004
0.005
0.006
0.007
0.008
Mean
St.dev
Initial max Final max
Fitness
CLOSE Initialization
Random Initialization
(d)CVD0
0.0005
0.001
0.0015
0.002
0.0025
0.003
0.0035
Mean
St.dev
Initial max Final max
Fitness
CLOSE Initialization
Random Initialization
(e)NCVD
Fig.6.Comparinginitializationmethodsonpatientclasses.
4.3 Observation timewindows variations
TimewindowsA,L andO,usedto extractantecedentsandconsequentsofrules,dene
time_itemsets.AnimportantaspectoftheHASARmethodisitscapacitytomakevary
thesetime windows.IntheSTULONG analysis,theobservation timeof RFevolutions
depends on the considered risk. For instance, accordingto physicians' knowledge, the
eectsofadietontheweightare perceptible afterafewmonthswhereastheeects on
thecholesterolmeasuresaremostoftenperceptibleafter alongerperiod.
We evaluated the eect of RFs observation time window variations on the tness of
rulesfor twoRFs: RF_CHOLESTand RF_BLOODPRESS.The resultsare shown in gure7.
For RF_CHOLEST, rules generated for a 60 months window have a much better tness
than those generated for a 15 months window. For RF_BLOODPRESS, the situation is
theopposite: a15 months windowgives better tnessthan a60 monthswindow. This
showsthateectsofnonpharmacologicalprescriptionsonhypercholesterolemiamustbe
observedonmuchlongertimeperiodthaneects onarterialhypertension.
5 Conclusion
0.0025
0.002
0.0015
0.001
0.0005
0
Fitness
60 month
15 month
(a)RF_CHOLEST0.0035
0.003
0.0025
0.002
0.0015
0.001
0.0005
0
Fitness
60 month
15 month
(b)RF_BLOODPRESSFig.7.Eectsofobservationtimewindowlengthvariationsonrulestness.
previousworksonthisdatasetessentiallyfocusedonstaticinformationfrom theENTRY
table, we have investigated the analysis of temporal data in the CONTROL table. Our
approachisbasedontwomainpoints:
asetofdatatransformationswhichis suitedtovarioustemporaldata,
an hybrid strategy which combines advantages from quite dierent techniques: an
exhaustivesearch forfrequentgeneratorsand closeditemsets which areusedasthe
initialpopulationof anevolutionaryalgorithm.
Experimental results allowed to point out dierent tendencies among patient groups
and conrmed prior medical knowledge. In order to answer in-depth to the analytical
questions,furtherinvestigationsofsequentialruleswiththeassistanceofmedicalexperts
arerequired.
Inthefuture,weplanto applytheHASARapproachtoother temporaldatasetswhere
observationtimewindowsarenotuniformonallattributes.Anotherinteresting
perspec-tiveis toextendthe approachby integratingbackgroundknowledge, such asexpressed
inmedicalontologies, inthesearchprocess.
References
[AIS93] AgrawalR., ImielinskiT. and Swami A. Mining AssociationRules Between Sets of
ItemsinLargeDatabases. ProceedingsoftheSIGMODconference,pp207216,May1993.
[AMS+96] AgrawalR.,MannilaH.,SrikantR.,ToivonenH.andVerkamoA.I. FastDiscovery
ofAssociationRules.AdvancesinKnowledgeDiscoveryanddatamining,pp307328,Fayyad
U.M.,Piatetsky-ShapiroG.,SmythP.andUthurusamyR.editors,AAAIPress,1996.
[PTB+04] PasquierN.,Taouil R.,BastideY., StummeG.and LakhalL. Generatinga
Con-densed Representation for Association Rules. Journal of Intelligent Information Systems,
KerschbergL.,RasZ.andZemankovaM.editors,KluwerAcademicPublishers,toappear.
[Freitas02] FreitasA.A. DataMiningandKnowledgeDiscoverywithEvolutionaryAlgorithms.
Springer-Verlag,Berlin,2002.
[GR87] GoldbergD.E.,RichardsonJ. GeneticAlgorithmswithSharingforMultimodal
Func-tionOptimization.Proc.Int.Conf.onGeneticAlgorithms(ICGA-87),1987,41-49.
[SAL03] SebagM.,AzéJ.,LucasN.ROC-basedEvolutionaryLearning:ApplicationtoMedical