HASAR : mining sequential association rules for atherosclerosis risk factor analysis


HAL Id: hal-02341338

https://hal.archives-ouvertes.fr/hal-02341338

Submitted on 31 Oct 2019



Laurent Brisson, Nicolas Pasquier, Céline Hebert, Martine Collard

To cite this version:

Laurent Brisson, Nicolas Pasquier, Céline Hebert, Martine Collard. HASAR: mining sequential association rules for atherosclerosis risk factor analysis. Workshop in 8th PKDD conference (Principles and Practice of Knowledge Discovery in Databases), Sep 2004, Pisa, Italy. ⟨hal-02341338⟩

HASAR: Mining Sequential Association Rules for Atherosclerosis Risk Factor Analysis

Laurent Brisson (1), Nicolas Pasquier (1), Céline Hebert (2), Martine Collard (1)

(1) Laboratoire I3S (CNRS UMR-6070), Université de Nice Sophia-Antipolis, 2000 route des Lucioles, Les Algorithmes, 06903 Sophia-Antipolis, France; {brisson,mcollard,pasquier}@i3s.unice.fr

(2) Laboratoire GREYC (CNRS UMR-6072), Université de Caen Basse-Normandie, UFR Sciences, Campus II - B.P. 5186, 14032 Caen, France; [email protected]

Abstract. We present the HASAR method, a hybrid approach for extracting adaptive sequential association rules. This method extracts association rules between events occurring in subsequent time intervals using closed itemset extraction and evolutionary techniques. An important feature is its capacity to consider different time intervals depending on the semantics of the attributes. We applied this method to the analysis of long-term medical observations of atherosclerosis risk factors for the prevention of cardio-vascular diseases. Experimental results show that it is well suited for extracting knowledge from temporal data where interesting patterns have different observation period lengths.

1 Introduction

In this paper, we consider sequential association rules which express causality relationships between sets of events occurring in subsequent time intervals. We developed a specific approach, called HASAR for Hybrid Adaptive Sequential Association Rules, for finding such patterns of interest. Our methodology is applied to a healthcare problem, but it is also suited to a broad collection of data mining problems where data are temporal observations on individuals, in customer behaviour or credit risk prediction for instance. Many studies have focused on the efficient mining of sequential patterns or patterns in time-related data. Most of them are based on extensions of the APRIORI algorithm [AMS+96] proposed for extracting association rules. The HASAR approach presented in this paper combines techniques for searching closed itemsets, as defined in the CLOSE algorithm [PTB+04], and a heuristic approach relying on a genetic algorithm.

We used the HASAR approach for the analysis of long-term observation data in the STULONG dataset. This dataset was constituted in the framework of a longitudinal study of atherosclerosis risk factors to evaluate the impact of non-pharmacological prescriptions on these risks. It contains data collected between 1975 and 2001 on a population of 1417 men born between 1926 and 1937 in Czechoslovakia. First, an entry examination was performed and data concerning social characteristics, diet, tobacco and alcohol consumption, physical activities, personal anamnesis, and physical and biochemical examinations were collected. During this examination, patients were classified into three groups according to the principal atherosclerosis risk factors (RFs): arterial hypertension, hypercholesterolemia, hypertriglyceridemia, overweight, smoking and positive family case history. These three [...]

Pathological group (PG): cardio-vascular disease diagnosed.

During the twenty-one years following the patients' entry, control examinations were performed to record changes in diet, smoking habits, physical activities, responsibility level in job, sport practice in leisure time, and physical and biochemical measures. For each control, the patient id, the date and the control number were recorded. In 2001, 389 patients were deceased, and the cause and date of death were recorded. For a group of 403 patients who answered a postal questionnaire, detailed data similar to those gathered during controls were collected.

The STULONG dataset was collected at the 2nd Department of Medicine, 1st Faculty of Medicine of Charles University and Charles University Hospital, U nemocnice 2, Prague 2 (head: Prof. M. Aschermann, MD, SDr, FESC), under the supervision of Prof. F. Boudík, MD, ScD, with the collaboration of M. Tomečková, MD, PhD and Ass. Prof. J. Bultas, MD, PhD. The data were transferred to electronic form by the European Centre of Medical Informatics, Statistics and Epidemiology of Charles University and Academy of Sciences (head: Prof. RNDr. J. Zvárová, DrSc). At the present time, the data analysis is supported by the grant of the Ministry of Education CR Nr LN00B107.

Objectives. One of the main objectives of this study was to evaluate the impact of behavioural changes, starting or stopping a diet or sport practice for instance, on the RFs and the development of cardio-vascular diseases (CVDs). The initial analytical questions in STULONG evolved after the first experiments and discussions with medical experts, mainly because of the evolution of medical knowledge since the beginning of the study and because of missing data (e.g., the uric acid measure is given for less than 10% of controls). During this work, we have developed a new method that we believe will help to answer the following questions, defined through discussions with medical experts and related to the long-term observations:

- Are there differences between men of the normal, risk and pathological groups from the viewpoint of the impact of behavioural changes on RF and CVD development?
- What characterizes men who developed a CVD and those who stayed healthy in the global population and in the risk group?
- Are the education level and the responsibility in job good criteria for segmenting patients with perilous or safe behaviours and high or low RF?

HASAR is an association rule extraction approach incorporating temporal relationships. It extracts sequential association rules which are well fitted to analyse causality relationships between behavioural changes and RF evolutions or CVD development. Association rule extraction, first introduced in [AIS93], aims at discovering causality relationships between sets of attribute values, called itemsets, in large datasets. An example association rule, fitting in the context of market basket analysis, is:

BUY(cereal) ∧ BUY(sugar) → BUY(milk), support = 20%, confidence = 75%.

This rule states that customers who buy cereal and sugar also tend to buy milk. The support measure indicates that 20% of all customers bought all three items and the confidence measure shows that 75% of customers who bought cereal and sugar also bought milk. Informally, the support represents the range of the rule and the confidence indicates the precision of the rule. In order to extract only statistically significant [...]

[...]plied to the dataset for extracting long-term observation related patterns. Sequential association rules and the techniques we used for their extraction are defined in section 3. In section 4, we show experimental results and section 5 concludes the paper.
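As a small illustration of the support and confidence measures used in the market basket example of this introduction, the following sketch computes both on a toy list of baskets. The basket contents and the function name `support_confidence` are assumptions made for illustration only; they are not part of the STULONG data or of the HASAR implementation.

```python
def support_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Number of transactions containing the antecedent.
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    # Number of transactions containing antecedent and consequent together.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical toy data: 15 baskets, chosen so that the rule
# BUY(cereal) AND BUY(sugar) -> BUY(milk) has support 20% and confidence 75%.
baskets = (
    [{"cereal", "sugar", "milk"}] * 3
    + [{"cereal", "sugar"}]
    + [{"bread"}] * 11
)

support, confidence = support_confidence(baskets, {"cereal", "sugar"}, {"milk"})
print(f"support={support:.0%}, confidence={confidence:.0%}")
```

Here 3 of the 15 baskets contain all three items, and 3 of the 4 baskets containing cereal and sugar also contain milk.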

2 Data preparation

Data for the long-term observation were collected during the controls performed over the twenty-one years following the patients' entry. We used both entry and control data to generate multiple datasets that are adapted to the kind of knowledge we were interested in. Attributes of interest were selected and prepared according to discussions with medical experts, for determining threshold values of physical and biological measures for instance.

2.1 Sequential rules and search strategy

Since our main objective concerns the effect of behaviour on risk factor development, we decided to look for sequential rules involving causality relationships between patient behavioural changes and risk factor changes on subsequent time intervals.

Sequential rules with the form X → Y we search for involve both itemsets and time_itemsets. We call time_item an attribute value occurring in a particular temporal window. We call time_itemset a set of time_items. Our sequential rules have the following structure:

IDE_itemset ∧ BEH_time_itemset → RF_time_item

where the components are:

- IDE_itemset: an itemset of static identification attributes,
- BEH_time_itemset: a time_itemset of behavioural attributes,
- RF_time_item: a time_item on a risk factor attribute.

An example sequential rule may be:

ALCOHOL=regularly ∧ BEH_PHA=decreased_sits → RF_CHOLEST=increased.

Such a rule must be interpreted as a causality relationship between changes on risk factors occurring on an observation temporal window of O months and induced by static data and by changes on behaviour on a previous action temporal window of A months. We also defined a latency temporal window L between the action period and the observation period, which allows a waiting time to observe the impact of some behavioural changes. The example rule should be interpreted as follows:

if the patient regularly consumed alcohol when he entered the study
and his physical activity after job decreased over an A-month period
then his cholesterol rate increased at a control which occurred L months after
and over the subsequent O-month observation time.
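The rule structure and its three temporal windows can be pictured with a small data structure. This is only a sketch: the class name `SequentialRule`, its field names and the window sizes used in the example (A = 12, L = 6, O = 60 months) are assumptions chosen for illustration, not values prescribed by the method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SequentialRule:
    """IDE_itemset AND BEH_time_itemset -> RF_time_item, with windows A, L, O."""
    ide_itemset: frozenset        # static identification items
    beh_time_itemset: frozenset   # behavioural change items (action window)
    rf_time_item: tuple           # (attribute, value) observed as consequent
    action_months: int            # A: size of the action window
    latency_months: int           # L: waiting time after the action
    observation_months: int       # O: size of the observation window

    def observation_interval(self, t1):
        """Months in which the consequent may be observed for an action at t1."""
        start = t1 + self.latency_months
        return (start, start + self.observation_months)


# The example rule of this section, with hypothetical window sizes.
rule = SequentialRule(
    ide_itemset=frozenset({("ALCOHOL", "regularly")}),
    beh_time_itemset=frozenset({("BEH_PHA", "decreased_sits")}),
    rf_time_item=("RF_CHOLEST", "increased"),
    action_months=12,
    latency_months=6,
    observation_months=60,
)
print(rule.observation_interval(10))
```

Keeping A, L and O as fields of the rule reflects the idea that the window sizes are parameters of each sequential rule rather than global constants.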

An important element for the method's flexibility is that the temporal window sizes O, A and L are defined as parameters of sequential rules. The strategy we applied for extracting these rules was first running wide data transformations to tailor data to the specific task. This first step consisted in applying corrections, replacing missing values, creating new [...]

Statistical measures are presented in sections 3.1 and 3.3. HASAR flexibility is also provided by the evolutionary algorithm used to search for rules. We defined a Genetic Algorithm (GA) for this task. GAs [GR87] are well suited to large combinatorial search spaces. GAs are adaptive procedures that evolve a population of structures in order to find the best individual. The evolution is performed by specific genetic operators like mutation and crossover. They have a long history of being exploited for rule manipulation [Freitas02, SAL03]. As was suggested, they offer techniques such as niching which make it possible to find not only the best rule, but a selection of good rules.

2.2 Patient classes

In order to observe potential differences between classes of patients, we distinguished the following partitions:

- Groups NG, PG and RG assigned to patients in the STULONG study.
- Classes CVD and NCVD, which respectively represent patients who had and did not have a cardio-vascular disease during the STULONG study.
- Clusters of patients based on their education level and job responsibility criteria.

Classes CVD and NCVD were obtained by splitting the EXPERIMENT table using the attributes CONTROL.HODNx, with $x \neq 0$ and $x \neq 15$, which indicate whether a cardio-vascular disease was diagnosed, and DEATH.PRICUMR, which indicates whether the patient's death was due to a cardio-vascular disease.

Clusters of patients based on their education level and job responsibility criteria were obtained as follows. In a first attempt, we extracted all closed patterns containing at least the social factors, but those gathering a relevant number of patients (at least 200) did not reveal a significant medical interpretation. This is due to the fact that some attribute values cover a large number of patients (e.g., 1023 patients among the 1199 are married). After talking with a physician, the education level (VZDELANI) and responsibilities in job (ZODPOV) attributes, which are most likely to influence atherosclerosis, were chosen as the main criteria for building the clusters. We investigated the closed patterns containing the items coming from these attributes. We got the following 11 closed patterns (or potential clusters):

1. basic school and others (for responsibilities in job)
2. primary school and managerial worker
3. primary school and partly independent worker
4. primary school and others
5. secondary school and managerial worker
6. secondary school and partly independent worker
7. secondary school and others
8. university and managerial worker
9. university and partly independent worker
10. university and others
11. university and pensioner (not because of ICHS)

The closed patterns number 5 and 8 were merged to produce the first cluster. We performed a similar process with closed patterns number 6 and 9, 7 and 10, 1 and 4. The fifth cluster [...]

Cluster  Secondary school  Responsibility in job        Healthy  Atherosclerosis  Total
1        yes               managerial worker                150               60    210
2        yes               partly independent worker        227               82    309
3        yes               others                           127               59    186
4        no                others                           221              122    343
5        -                 -                                 94               57    151
Total of patients                                           819              380   1199

Table 1. Description of clusters.

2.3 Attributes of change and new tables

A preliminary step consisted in correcting some errors or contradictions and replacing missing and not stated values by the same value. We built new tables CHANGES and EXPERIMENT, more fitted to the task, from the initial tables ENTRY and CONTROL. The DEATH table was used in order to split the patient set into the CVD and NCVD classes. Discussions with medical experts allowed us to identify some guidelines for building tables and to understand which initial attributes we had to keep and which ones we had to build. First, we kept existing identification attributes (IDE attributes) about patients. These attributes are named according to the expression IDE_attribute_name. They represent:

- the education level of the patient,
- the age of the patient at the moment of the control,
- the initial group of the patient when he came into the study,
- the alcohol consumption at the beginning of the study, since STULONG data do not provide this information for each control.

Other attributes describe behavioural changes and changes related to risk factors from one control to another. Behavioural change attributes (BEH change attributes) were named according to the expression BEH_attribute_name and risk factor change attributes (RF change attributes) were named according to the expression RF_attribute_name. Attributes for behavioural changes are related to the criteria below:

- consumption of cigarettes a day,
- physical activity in job and after job,
- different kinds of diet,
- medicine for cholesterol and for blood pressure.

Attributes for risk factor changes are related to the following criteria:

- global, HDL and LDL cholesterol levels,
- triglycerides level,
- overweight or obesity,
- blood pressure measures,
- glycemia level.

The new CHANGES table is composed of IDE, BEH change and RF change attributes. For two subsequent controls number N and N+1 of a patient, a new tuple is created and the value of each BEH and RF attribute is computed by comparing the attribute values of the two controls. For the first control, a new tuple is created by comparing it with [...]

Value              Activity at control N                  Activity at control N+1
stay_sits          he mainly sits                         he mainly sits
decreased_sits     moderate or great activity             he mainly sits
increased_modest   he mainly sits                         moderate activity
stay_modest        moderate activity                      moderate activity
decreased_modest   great activity                         moderate activity
increased_great    he mainly sits or moderate activity    great activity
stay_great         great activity                         great activity

Table 2. Attribute BEH_PHA.

Value         CONTROL.CHLST(N)   CONTROL.CHLST(N+1)
stay_normal   < 6                < 6
decreased     ≥ 6                < 6
increased     < 6                ≥ 6
stay_high     ≥ 6                ≥ 6

Table 3. Attribute RF_CHOLEST.

BEH_PHA and RF_CHOLEST attribute values are computed, respectively, from the values of attributes CONTROL.AKTPOZAM (which indicates the physical activity after job) and CONTROL.CHLST (which indicates the global cholesterol rate).
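The encodings of Tables 2 and 3 can be sketched as two small functions that compare the values observed at controls N and N+1. The function names and the short activity labels "sits", "modest" and "great" are assumptions for illustration; the threshold of 6 and the resulting attribute values are those of the tables.

```python
CHOLESTEROL_THRESHOLD = 6  # threshold used in Table 3, in the unit of CONTROL.CHLST


def rf_cholest(chlst_n, chlst_n1, threshold=CHOLESTEROL_THRESHOLD):
    """RF_CHOLEST value for controls N and N+1, following Table 3."""
    high_before = chlst_n >= threshold
    high_after = chlst_n1 >= threshold
    if high_before and high_after:
        return "stay_high"
    if high_before:
        return "decreased"
    if high_after:
        return "increased"
    return "stay_normal"


# Ordered activity levels (hypothetical short labels for the rows of Table 2).
ACTIVITY_LEVELS = ("sits", "modest", "great")


def beh_pha(level_n, level_n1):
    """BEH_PHA value for controls N and N+1, following Table 2."""
    before = ACTIVITY_LEVELS.index(level_n)
    after = ACTIVITY_LEVELS.index(level_n1)
    if after == before:
        prefix = "stay"
    elif after > before:
        prefix = "increased"
    else:
        prefix = "decreased"
    # The suffix is always the level reached at control N+1.
    return f"{prefix}_{level_n1}"


print(rf_cholest(5.2, 6.4), beh_pha("great", "sits"))
```

Note that in both tables the value describes the direction of the change together with the state reached at control N+1, which is why `decreased_sits` covers both "moderate" and "great" as starting levels.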

In order to get a temporal description of each patient all along the study, we flattened the CHANGES table. The EXPERIMENT table is the result of the flattening operation; it contains as many tuples as patients in CONTROL. One tuple in EXPERIMENT contains the IDE attributes of the patient and the change attributes for every control the patient passed through.
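A minimal sketch of such a flattening step is given below, assuming that each CHANGES tuple is a dictionary with a patient identifier and a control number. The key names `patient` and `control`, the attribute `IDE_EDUCATION` and the suffixing of change attributes with the control number are illustrative assumptions, not the actual schema used in the study.

```python
def flatten_changes(changes):
    """Group CHANGES tuples into one EXPERIMENT tuple per patient."""
    experiment = {}
    for row in sorted(changes, key=lambda r: (r["patient"], r["control"])):
        record = experiment.setdefault(row["patient"], {"patient": row["patient"]})
        for name, value in row.items():
            if name in ("patient", "control"):
                continue
            if name.startswith("IDE_"):
                # Static identification attributes are stored once per patient.
                record[name] = value
            else:
                # Change attributes are kept per control, e.g. BEH_PHA_2.
                record[f"{name}_{row['control']}"] = value
    return list(experiment.values())


# Hypothetical CHANGES tuples for two patients.
changes = [
    {"patient": 1, "control": 1, "IDE_EDUCATION": "university",
     "BEH_PHA": "stay_sits", "RF_CHOLEST": "increased"},
    {"patient": 1, "control": 2, "IDE_EDUCATION": "university",
     "BEH_PHA": "increased_modest", "RF_CHOLEST": "stay_high"},
    {"patient": 2, "control": 1, "IDE_EDUCATION": "secondary school",
     "BEH_PHA": "stay_great", "RF_CHOLEST": "stay_normal"},
]

experiment = flatten_changes(changes)
print(len(experiment), sorted(experiment[0]))
```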

3 Strategies

3.1 Target model and definitions

We consider the time segment which represents the intervention on a patient from the time he entered the study until the time he left it. Our discussions with medical experts led us to fix a month as the time unit. Let us consider a patient and the time interval $[in; out]$ during which he was in the STULONG study.

$C(T, patient)$ refers to a control occurring at time $T$ months after $in$ for the patient. A control $C(T, patient)$ is characterized by a BEH_time_itemset and a RF_time_item which represent the behavioural and risk factor changes observed for the patient at this control time. A control period $CP(T, patient)$ is an A-month time interval $[T; T + A]$ in which at least one control occurs for the patient. A temporal configuration $TC(T_1, T_2, patient)$ is a time interval $[T_1; T_2]$ such that:

- there exists a control period $\gamma = CP(T, patient)$ with $T_1 \in [T; T + A]$ and $T_2 \in [T_1 + L; T_1 + L + O]$,
- a control $C(T_2, patient)$ occurred,
- no control occurred in the interval $[T + A; T_2]$ for the patient.

We say that two temporal configurations $TC(T_1, T_2, patient)$ and $TC(T_3, T_4, patient)$ are compatible if $T_2 \leq T_3$ or $T_4 \leq T_1$.
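The window constraint and the compatibility condition of these definitions can be sketched as follows. The names `TemporalConfiguration`, `within_windows` and `compatible` are illustrative assumptions; the sketch checks only the interval conditions on $T_1$ and $T_2$ and does not verify that controls actually occurred (or did not occur) at the required times.

```python
from typing import NamedTuple


class TemporalConfiguration(NamedTuple):
    """Interval [t1; t2] in months since the patient entered the study."""
    t1: int
    t2: int


def within_windows(t, tc, a, l, o):
    """Check T1 in [T; T+A] and T2 in [T1+L; T1+L+O] for a control period at T."""
    in_action_window = t <= tc.t1 <= t + a
    in_observation_window = tc.t1 + l <= tc.t2 <= tc.t1 + l + o
    return in_action_window and in_observation_window


def compatible(first, second):
    """Two configurations are compatible if T2 <= T3 or T4 <= T1."""
    return first.t2 <= second.t1 or second.t2 <= first.t1


tc_a = TemporalConfiguration(3, 12)
tc_b = TemporalConfiguration(12, 24)
tc_c = TemporalConfiguration(8, 20)
print(within_windows(0, tc_a, a=6, l=6, o=12), compatible(tc_a, tc_b), compatible(tc_a, tc_c))
```

In other words, compatible configurations may share an endpoint but never overlap, so the same period of a patient's history is not counted twice.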

Statistical measures. We define the measure max_support as the total number of possible compatible temporal configurations for all records (patients) in the flattened table EXPERIMENT. For a rule antecedent $X$ = IDE_itemset ∪ BEH_time_itemset, we [...]

BEH_time_itemset is observed over a control period $\gamma = CP(T, patient)$. For a rule consequent $Y$ = RF_time_item, we say that a temporal configuration $TC(T_1, T_2, p)$ contains $Y$ if RF_time_item is observed at control $C(T_2, patient)$.

For a rule antecedent $X$, we define the cardinality measure of $X$, $card(X)$, as the number of possible compatible $TC(T_1, T_2, p)$ that contain $X$. For a rule consequent $Y$, we define the cardinality measure of $Y$, $card(Y)$, as the number of possible compatible $TC(T_1, T_2, p)$ that contain $Y$. For a rule $X \rightarrow Y$, we define the cardinality measure of $X \wedge Y$, $card(X \cup Y)$, as the number of possible $TC(T_1, T_2, p)$ that contain $X$ and $Y$.

3.2 Association rules based approach

Two main approaches for extracting association rules can be distinguished.

In the first approach, all itemsets with support ≥ minsupport, called frequent itemsets, are extracted and all association rules with confidence ≥ minconfidence are generated from them. This approach is very efficient when data are weakly correlated, such as market basket data, but performance drastically decreases when data are dense or correlated, such as statistical data for instance. A comprehensive survey of this approach can be found in [AMS+96].

The second approach is based on the extraction of generators and frequent closed itemsets defined using the Galois closure operator. From these, the informative basis for association rules, containing non-redundant association rules with minimal antecedent and maximal consequent, is generated. This approach improves both the extraction efficiency, by reducing the search space, and the result relevance, by suppressing redundant rules, in the case of dense or correlated data. A summary of this approach can be found in [PTB+04].

Frequent closed itemsets and generators. Frequent closed itemsets and generators are defined according to the closure operator $\phi$ of the Galois connection. This operator associates with an itemset $l$ its closure $\phi(l)$, which is the maximal set of items common to all objects containing $l$. That is, the closure of $l$ is the intersection of all objects containing $l$. The minimal closed itemset containing an itemset $l$ is its closure $\phi(l)$ and we say that an itemset $l$ is a closed itemset if $\phi(l) = l$. The generators of a closed itemset $c$ are the minimal itemsets whose closure is $c$. Generators are the minimal itemsets we can consider for discovering frequent closed itemsets, by computing their closures. Since the support of a frequent itemset is equal to the support of its closure and since maximal frequent itemsets are maximal frequent closed itemsets, the frequent closed itemsets constitute a minimal non-redundant generating set for all frequent itemsets and thus for all association rules.

Consider the dataset $D$, constituted of six objects identified by their OID and five items, represented in figure 1(a). The eight generators and five frequent closed itemsets, with their supports, in $D$ for minsupport = 2/6 are given in figure 1(b).

The itemset {A} is the generator of the frequent closed itemset {AC}: the intersection of all objects containing {A}, that are objects 1, 3 and 5, gives {AC} and no subset of {A} has {AC} as closure. {A} and {AC} both have a support of $\|\{1, 3, 5\}\| / \|D\| = 3/6$. The frequent closed itemset {BCE} has two generators: {BC} and {CE}. {BE} is not a generator of {BCE} since it is a frequent closed itemset, {B} and {E} are generators of {BE} and {C} is itself its own generator.

(a) Dataset D.

OID  Items
1    A C D
2    B C E
3    A B C E
4    B E
5    A B C E
6    B C E

(b) Extraction result.

Generator  Frequent closed itemset  Support
{A}        {AC}                     3/6
{B}        {BE}                     5/6
{C}        {C}                      5/6
{E}        {BE}                     5/6
{AB}       {ABCE}                   2/6
{AE}       {ABCE}                   2/6
{BC}       {BCE}                    4/6
{CE}       {BCE}                    4/6

Fig. 1. Generators and frequent closed itemsets.
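The content of figure 1(b) can be recomputed from the dataset of figure 1(a) with a brute-force sketch of the closure operator. This is not the CLOSE algorithm itself, which avoids enumerating all itemsets; the function names below are assumptions made for illustration, and exhaustive enumeration is only reasonable on such a tiny example.

```python
from itertools import combinations

# Dataset D of figure 1(a): OID -> items.
D = {1: "ACD", 2: "BCE", 3: "ABCE", 4: "BE", 5: "ABCE", 6: "BCE"}
OBJECTS = {oid: frozenset(items) for oid, items in D.items()}
ITEMS = sorted(set().union(*OBJECTS.values()))


def cover(itemset):
    """OIDs of the objects containing every item of the itemset."""
    return [oid for oid, items in OBJECTS.items() if itemset <= items]


def closure(itemset):
    """Intersection of all objects containing the itemset."""
    covering = cover(itemset)
    if not covering:
        return frozenset(ITEMS)
    return frozenset.intersection(*(OBJECTS[oid] for oid in covering))


def generators_with_closures(min_count):
    """Map each frequent generator to (its closure, its support count)."""
    found = {}
    for size in range(1, len(ITEMS) + 1):
        for combo in combinations(ITEMS, size):
            itemset = frozenset(combo)
            count = len(cover(itemset))
            if count < min_count:
                continue
            closed = closure(itemset)
            # A generator is minimal: no immediate subset has the same closure.
            immediate_subsets = (itemset - {item} for item in itemset)
            if all(closure(sub) != closed for sub in immediate_subsets):
                found[itemset] = (closed, count)
    return found


result = generators_with_closures(min_count=2)  # minsupport = 2/6
for generator, (closed, count) in result.items():
    print("".join(sorted(generator)), "".join(sorted(closed)), f"{count}/6")
```

Checking only the immediate subsets is sufficient because the closure operator is monotone: if any proper subset had the same closure, so would an immediate subset containing it.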

3.3 Evolutionary approach

Genetic Algorithms are robust, flexible algorithms which tend to cope well with attribute interaction in atherosclerosis data. Furthermore, the comprehensibility of the discovered knowledge is important and GAs allow us to extract comprehensible rules by evolving populations of prediction patterns. Our GA implementation uses EO, a templates-based, ANSI-C++ compliant evolutionary computation library.

Genome. The first issue in designing a GA is how to encode each individual in the population. To represent variable-length rules we use a fixed-length genome which contains a gene for each IDE_attribute and BEH_attribute and another gene for one of the RF_attributes. Each gene contains three elements: attribute name, attribute value and an activation flag indicating whether or not an item in the rule is associated with the gene.

Generation of initial population. The method used to generate the initial population is based on the results of the CLOSE algorithm. CLOSE generates a list of generators with their associated frequent closed itemsets, which allows us to initialize the population in three steps:

- Rule antecedents are created from generators. Generators containing only RFs are skipped.
- Rule consequents are created with the first RF found in the frequent closed itemset or the generator. If no RF is found, we randomly generate one.
- For each attribute not represented in generators, the value of the corresponding gene is randomly defined and the activation flag is set to false.
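The genome encoding and the three initialization steps above can be sketched as follows. This is written in Python for illustration and is not the EO-based C++ implementation; the names `Gene` and `init_individual`, the attributes `IDE_ALCOHOL` and `BEH_SMOKING`, and all attribute domains are assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class Gene:
    attribute: str
    value: str
    active: bool  # activation flag: does the gene contribute an item to the rule?


def init_individual(generator, closed_itemset, domains, rf_attributes, rng):
    """Build one fixed-length genome from a generator and its closed itemset.

    generator, closed_itemset: dicts attribute -> value.
    domains: dict attribute -> list of possible values (hypothetical here).
    Returns None when the generator contains only risk factor attributes.
    """
    if all(attr in rf_attributes for attr in generator):
        return None  # step 1: generators containing only RFs are skipped
    genome = []
    for attr in domains:
        if attr in rf_attributes:
            continue
        if attr in generator:
            genome.append(Gene(attr, generator[attr], True))
        else:
            # step 3: random value, activation flag set to false
            genome.append(Gene(attr, rng.choice(domains[attr]), False))
    # step 2: first RF found in the closed itemset or in the generator
    for source in (closed_itemset, generator):
        rf = next((attr for attr in source if attr in rf_attributes), None)
        if rf is not None:
            genome.append(Gene(rf, source[rf], True))
            break
    else:
        rf = rng.choice(sorted(rf_attributes))
        genome.append(Gene(rf, rng.choice(domains[rf]), True))
    return genome


domains = {
    "IDE_ALCOHOL": ["regularly", "occasionally", "no"],
    "BEH_PHA": ["stay_sits", "decreased_sits", "increased_modest"],
    "BEH_SMOKING": ["stopped", "stay_smoker", "started"],
    "RF_CHOLEST": ["stay_normal", "decreased", "increased", "stay_high"],
    "RF_BLOODPRESS": ["stay_normal", "increased"],
}
rf_attributes = {"RF_CHOLEST", "RF_BLOODPRESS"}
individual = init_individual(
    {"BEH_PHA": "decreased_sits"},
    {"BEH_PHA": "decreased_sits", "RF_CHOLEST": "increased"},
    domains, rf_attributes, random.Random(0),
)
print(individual)
```

Every individual has the same length (one gene per IDE and BEH attribute plus one RF gene), and the activation flags determine which genes actually appear in the rule.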

Genetic operators. We use a tournament selection with a size of 2. The selection is deterministic, starting from the best individuals down to the worst ones. If the total number to select is less than the size of the source population, the best individuals are selected once. If more individuals are needed after reaching the bottom of the population, then the selection starts again at the top. If the total number required is N times the source size, all individuals are selected exactly N times. For replacement we use the most straightforward method, called generational replacement, where all offspring replace all parents; however, weak elitism is used.

New patterns are generated by combining existing patterns using a crossover operator or by modifying existing patterns via a mutation operator. Crossover is a recombination [...]

value (domain of this value is a parameter of the GA) which is randomly added to or subtracted from the current gene value and the last inverts the current value of the activation flag. All of the mutation rates can be defined independently.

Fitness function. A crucial issue in the design of a GA is the choice of the fitness function. In a first approach we only consider support, to select the most frequent rules, confidence, to consider reliable rules, and lift, to ensure a high level of dependence between the antecedent and consequent parts of a rule. To evaluate the quality of a rule r: X → Y, our GA applies the fitness function to the individual associated with the rule.

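$$\mathit{fitness}(r) = \mathit{support}(r) \cdot \mathit{confidence}(r) \cdot \mathit{lift}(r) \tag{1}$$

$$\mathit{support}(r) = \frac{card(X \cup Y)}{max\_support} \tag{2}$$

$$\mathit{confidence}(r) = \frac{card(X \cup Y)}{card(X)} \tag{3}$$

$$\mathit{lift}(r) = \frac{card(X \cup Y)}{card(X) \cdot card(Y)} \tag{4}$$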
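Equations (1) to (4) translate directly into a few lines of code. The sketch below takes the cardinality measures of section 3.1 as plain integers; the function name and the example counts are assumptions for illustration. The lift is computed exactly as written in equation (4); the conventional definition of lift would additionally multiply this ratio by max_support.

```python
def rule_fitness(card_xy, card_x, card_y, max_support):
    """Fitness of a rule X -> Y from its cardinality measures, equations (1)-(4)."""
    if card_x == 0 or card_y == 0 or max_support == 0:
        return 0.0  # rule never applicable: treat its fitness as zero
    support = card_xy / max_support          # equation (2)
    confidence = card_xy / card_x            # equation (3)
    lift = card_xy / (card_x * card_y)       # equation (4), as written
    return support * confidence * lift       # equation (1)


# Hypothetical counts: 1000 compatible temporal configurations in total,
# 40 contain X, 50 contain Y and 20 contain both.
print(rule_fitness(card_xy=20, card_x=40, card_y=50, max_support=1000))
```

With these counts the support is 0.02, the confidence 0.5 and the lift 0.01, so the fitness is about 0.0001.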

4 Experimental results

4.1 Patient classes comparison

Patient groups. The experiment consisted in extracting rules on PG and testing them on NG and RG. Differences for support and fitness measures are shown in figure 2. One may observe that the best rules found on PG are not valid on NG. Most of them have a good support but a weak fitness on RG. Thus, these results show quite different relationships between the patients' behaviour and their risk factors among the initial groups.

Fig. 2. Best rules on PG versus NG and RG. (Two panels, Support and Fitness, each comparing PG, NG and RG; the Fitness axis is logarithmic.)

Cardiovascular disease. The experiment consisted in extracting rules on CVD and testing them on NCVD. Results in figure 3 show that the best rules found on CVD have similar support and fitness on NCVD. These results seem to show that relationships [...]

Fig. 3. Best rules on CVD versus NCVD. (Two panels, Support and Fitness, each comparing CVD and NCVD.)

Social clusters. This experiment consisted in extracting rules on Cluster1 and testing them on Cluster3 and Cluster4. We can see in figure 4 that the best rules found on Cluster1 also have good measures on Cluster4 and that results on Cluster3 are quite similar. Thus it seems that social factors like education level and job responsibility do not make it possible to distinguish different behaviours from the viewpoint of cardiovascular risks.

Fig. 4. Best rules on Cluster1 versus Cluster3 and Cluster4. (Two panels, Support and Fitness, each comparing C1, C3 and C4.)

4.2 Initialisation methods

In order to evaluate the performance of our initialization method, we compare it with a random initialization method. Results for the PG and RG groups are shown in figure 5 and results for CVD and NCVD are shown in figure 6. The first three columns show statistics about the initial populations and the last column shows the fitness of the best individuals after a GA run. We can observe that the mean fitness of populations generated using CLOSE is 8.75 to 400 times better than that of randomly generated populations. This is not surprising since CLOSE optimizes rule support and confidence, two main criteria of our fitness function. However, it is interesting to note that after a GA run the fitness of the best offspring of populations generated using CLOSE is 1.55 to 4.79 times better than that of the best offspring of randomly generated populations. Furthermore, an analysis of rules shows [...]

Fig. 5. Comparing initialization methods on patient groups. (Panels (a) PG and (b) RG: mean, standard deviation and initial maximum of the fitness of the initial population, and final maximum fitness, for CLOSE initialization and random initialization.)

Fig. 6. Comparing initialization methods on patient classes. (Panels (d) CVD and (e) NCVD: mean, standard deviation and initial maximum of the fitness of the initial population, and final maximum fitness, for CLOSE initialization and random initialization.)

4.3 Observation time window variations

Time windows A, L and O, used to extract antecedents and consequents of rules, define time_itemsets. An important aspect of the HASAR method is its capacity to vary these time windows. In the STULONG analysis, the observation time of RF evolutions depends on the considered risk. For instance, according to physicians' knowledge, the effects of a diet on weight are perceptible after a few months whereas the effects on cholesterol measures are most often perceptible after a longer period.

We evaluated the effect of RF observation time window variations on the fitness of rules for two RFs: RF_CHOLEST and RF_BLOODPRESS. The results are shown in figure 7. For RF_CHOLEST, rules generated for a 60-month window have a much better fitness than those generated for a 15-month window. For RF_BLOODPRESS, the situation is the opposite: a 15-month window gives better fitness than a 60-month window. This shows that the effects of non-pharmacological prescriptions on hypercholesterolemia must be observed over a much longer time period than the effects on arterial hypertension.

Fig. 7. Effects of observation time window length variations on rule fitness. (Panels (a) RF_CHOLEST and (b) RF_BLOODPRESS, each comparing the fitness of rules for 60-month and 15-month windows.)

5 Conclusion

[...] previous works on this dataset essentially focused on static information from the ENTRY table, we have investigated the analysis of temporal data in the CONTROL table. Our approach is based on two main points:

- a set of data transformations which is suited to various temporal data,
- a hybrid strategy which combines advantages from quite different techniques: an exhaustive search for frequent generators and closed itemsets, which are used as the initial population of an evolutionary algorithm.

Experimental results made it possible to point out different tendencies among patient groups and confirmed prior medical knowledge. In order to answer the analytical questions in depth, further investigations of sequential rules with the assistance of medical experts are required.

In the future, we plan to apply the HASAR approach to other temporal datasets where observation time windows are not uniform on all attributes. Another interesting perspective is to extend the approach by integrating background knowledge, such as that expressed in medical ontologies, in the search process.

References

[AIS93] Agrawal R., Imielinski T. and Swami A. Mining Association Rules Between Sets of Items in Large Databases. Proceedings of the SIGMOD conference, pp 207-216, May 1993.

[AMS+96] Agrawal R., Mannila H., Srikant R., Toivonen H. and Verkamo A.I. Fast Discovery of Association Rules. Advances in Knowledge Discovery and Data Mining, pp 307-328, Fayyad U.M., Piatetsky-Shapiro G., Smyth P. and Uthurusamy R. editors, AAAI Press, 1996.

[PTB+04] Pasquier N., Taouil R., Bastide Y., Stumme G. and Lakhal L. Generating a Condensed Representation for Association Rules. Journal of Intelligent Information Systems, Kerschberg L., Ras Z. and Zemankova M. editors, Kluwer Academic Publishers, to appear.

[Freitas02] Freitas A.A. Data Mining and Knowledge Discovery with Evolutionary Algorithms. Springer-Verlag, Berlin, 2002.

[GR87] Goldberg D.E., Richardson J. Genetic Algorithms with Sharing for Multimodal Function Optimization. Proc. Int. Conf. on Genetic Algorithms (ICGA-87), 1987, 41-49.

[SAL03] Sebag M., Azé J., Lucas N. ROC-based Evolutionary Learning: Application to Medical