Open Archive Toulouse Archive Ouverte (OATAO)
OATAO is an open access repository that collects the work of some Toulouse researchers and makes it freely available over the web where possible.

This is an author's version published in: https://oatao.univ-toulouse.fr/23807
Official URL: https://doi.org/10.1016/j.eswa.2019.04.049

To cite this version: Ratolojanahary, Romy and Houé Ngouna, Raymond and Medjaher, Kamal and Junca-Bourié, Jean and Dauriac, Fabien and Sebilo, Mathieu. Model selection to improve multiple imputation for handling high rate missingness in a water quality dataset. (2019) Expert Systems with Applications (131). 299-307. ISSN 0957-4174

Any correspondence concerning this service should be sent to the repository administrator: tech-oatao@listes-diff.inp-toulouse.fr
Model selection to improve multiple imputation for handling high rate missingness in a water quality dataset
Romy Ratolojanahary a,∗, Raymond Houé Ngouna a, Kamal Medjaher a, Jean Junca-Bourié b, Fabien Dauriac c, Mathieu Sebilo d

a Laboratoire Génie de Production, École Nationale d’Ingénieurs de Tarbes, BP1629, 47 avenue d’Azereix, Tarbes Cedex 16 65016, France
b Agence de l’eau Adour-Garonne, 7 Passage de l’Europe, Pau 64000, France
c Chambre d’Agriculture des Hautes-Pyrénées, 20 Place du Foirail, Tarbes 65000, France
d IEES, Université Pierre et Marie Curie, 4 Place Jussieu, Paris 75005, France
Keywords: Multiple imputation; High missingness; Model selection; Machine learning; Data preprocessing; Water quality

Abstract
In the current era of “information everywhere”, extracting knowledge from a great amount of data is increasingly acknowledged as a promising channel for providing relevant insights to decision makers. One key issue encountered may be the poor quality of the raw data, particularly due to the high missingness, that may affect the quality and the relevance of the results’ interpretation. Automating the exploration of the underlying data with powerful methods, allowing to handle missingness and then perform a learning process to discover relevant knowledge, can then be considered as a successful strategy for systems’ monitoring. Within the context of water quality analysis, the aim of the present study is to propose a robust method for selecting the best algorithm to combine with MICE (Multivariate Imputations by Chained Equations) in order to handle multiple relationships between a high amount of features of interest (more than 200) concerned with a high rate of missingness (more than 80%). The main contribution is to improve MICE, taking advantage of the ability of Machine Learning algorithms to address complex relationships among a large number of parameters. The competing methods that are implemented are Random Forest (RF), Boosted Regression Trees (BRT), K-Nearest Neighbors (KNN) and Support Vector Regression (SVR). The obtained results show that the hybridization of MICE with SVR, KNN, RF and BRT performs better than the original MICE taken alone. Furthermore, MICE-SVR gives a good trade-off in terms of performance and computing time.
1. Introduction
The proliferation of sensing devices has increased the ability of organizations to acquire various and great amounts of data, allowing them to implement real-time monitoring of their systems. This is generally based on the analysis of complex relationships between several factors of interest, such as in water quality analysis. Online monitoring has indeed enabled the development of decision systems that are able to accelerate decision-making and anticipate actions to prevent undesired events or to eradicate critical issues. To achieve such a goal, it is required to pre-process the raw data, especially when some values are missing at a certain level.
∗ Corresponding author.
E-mail addresses: romy-alinoro.ratolojanahary@enit.fr (R. Ratolojanahary), raymond.houe-ngouna@enit.fr (R. Houé Ngouna), kamal.medjaher@enit.fr (K. Medjaher), jean.junca-bourie@eau-adour-garonne.fr (J. Junca-Bourié), fdauriac@hautes-

Missing data is a recurring phenomenon in real-world applications (Sterne et al., 2009; Yang, Liu, Zeng, & Xie, 2019). It may occur due to sensor failures, a bad or non-existing strategy for data acquisition, budget issues, lack of response from a participant in the case of a survey, or various other reasons. If the complete data are representative of the studied phenomenon, this missing information is negligible; otherwise the results may be incorrect and may lead to wrong interpretations. For example, anomalies could go undetected if they happen during a non-monitored period of time.

There are two ways of dealing with missing data: deletion or imputation (Buhi, 2008). Deletion means discarding the observations or the variables with missing data, which is called complete-case analysis, while imputation consists in reconstructing the missing values. Because of its simplicity, deletion is usually the default method used in practice. However, there are many cases in various fields in which this method showed some limitations. Indeed, it decreases the sample size and may lead to a loss of substantial information. In Clark and Altman (2003) for instance, the number of observations dropped from 1189 to 518 (43% of the original data) in an ovarian cancer dataset, which led to biased interpretation.
Another deletion method is pairwise deletion, through which only non-missing values are used for analyses, for instance in correlation score calculations, where the method fails when the two correlated variables are not filled at the same time. Instead of discarding an observation or a variable concerned with missing values, it is preferable to estimate those missing values accurately in order to provide relevant interpretations.
Quoting White and co-authors, “awareness has grown of the need to go beyond complete-case analysis”, and some major improvements of the simplistic methods have been proposed in the literature since Rubin’s innovative proposal for approaching missingness (White, Royston, & Wood, 2010). Among others, Rubin, who is the author of Multiple Imputation (MI), defined a conceptual framework for characterizing missing data that allows to distinguish various types and to determine when missing data can be ignored (Little & Rubin, 1987; Rubin, 1976). The major insight of the proposed imputation method is that it addresses uncertainty and complexity of the data structure, allowing to go beyond deleting or discarding data.

Following Rubin, van Buuren introduced Multiple Imputations by Chained Equations (MICE), an MI technique that requires fewer assumptions on missingness and also handles relationships between variables (van Buuren & Groothuis-Oudshoorn, 2011). However, the original MICE considers only linear relationships and has been successfully applied to datasets with at most 70% of missingness. It may therefore fail in other cases, such as the water quality data considered in the present study, which are characterized by a high rate of missingness and a great amount of factors of interest that are not necessarily linearly related. This suggests the need for an alternative method to improve the imputation mechanism in order to provide relevant interpretation of the results, which is the purpose of this work.
The rest of the paper is organized as follows: the main imputation methods available in the literature are reviewed in Section 2, followed by the presentation of a method to improve MICE for multiple data imputation in Section 3. An application of the proposed method on an experimental dataset, along with the associated results, is described in Section 4, while the last section contains the conclusion and perspectives of the present work.
2. Related work
In order to choose an appropriate method for handling missing data, the underlying cause of the missingness has to be investigated. Indeed, as mentioned in Buhi (2008), each method only works under certain assumptions, namely complete randomness, conditional randomness or systematic reasons.
2.1. Missingness patterns
The conceptual framework allowing to take into account certain assumptions, as noted above, has been defined by Rubin (1976). There are three types of missing data, depending on the missing mechanism: (1) Missing Completely at Random (MCAR), (2) Missing at Random (MAR) and (3) Missing Not at Random (MNAR).

Let R be the locations of the missing data in a dataset X = (X_obs, X_miss), and ψ the parameters of the missing data model, where X_obs and X_miss are respectively the observed and the missing values. The MCAR, MAR and MNAR patterns are formally defined as follows (van Buuren, 2018):
• Data are MCAR if the probability of missingness is independent of both the observed variables and the variables with missing values. This is the case, for example, when people forget to answer a question in a survey. Formally,

P(R = 0 | X_obs, X_miss, ψ) = P(R = 0 | ψ)   (1)

• Data are MAR if the probability of missingness is due entirely to the observed variables and is independent of the unseen data. In other words, the missingness is a function of some other observed variables in the dataset (for example, people of one sex are less likely to disclose their weight):

P(R = 0 | X_obs, X_miss, ψ) = P(R = 0 | X_obs, ψ)   (2)

Therefore, MAR data are a good candidate for data imputation based on observed variables (Buhi, 2008).

• Data are MNAR if the missing value is related to the actual values (for example, people who weigh more are more likely to not disclose their weight): in that case,

P(R = 0 | X_obs, X_miss, ψ)   (3)

depends on all three elements.
When data are MNAR, the missingness process is called non-ignorable, meaning that the cause of the missingness must be included in the model, whereas the MAR and MCAR missingness processes are called ignorable. Following the assumptions behind these three patterns, several methods have been provided in the literature for solving the missingness appropriately.
2.2. Single imputation methods
Methods that compute one single value per missing datum are referred to as single imputation methods. The most common single imputation methods are mean, median or mode imputation, consisting in replacing the missing value with the mean, median or mode of the associated variable (Buhi, 2008). In this case, the missing value is easy to compute, but the method ignores the correlation among the variables and underestimates the standard deviation. If the variable containing missing values is categorical, a simple option is to create a new category for the missing values. This method is suitable for MNAR data, i.e. when the missingness is correlated to the values of the missing data. When a variable of the incomplete dataset is a periodic time series, a more elaborate single imputation technique is to apply a linear interpolation or an Autoregressive Integrated Moving Average (ARIMA) model to fill in the missing values (Shao, Meng, & Sun, 2016). Although those two techniques are simple, the first one is not efficient when the missing gap is large, and the second one requires a periodic time series. Another technique involves predicting the values from the observed variables. For example, K-Nearest Neighbors (KNN) replaces the missing value with a linear combination of the K nearest non-missing observations (Jordanov, Petrov, & Petrozziello, 2018; Tutz & Ramzan, 2015). To use this algorithm, it is necessary to choose the optimal K and define a distance measurement between two observations. A local similarity imputation based on fast clustering was proposed in Zhao, Chen, Yang, Hu, and Obaidat (2018). The authors partition the incomplete data with a fast clustering method (Stacked Autoencoder-based), then fill the missing data within each cluster using a KNN algorithm. The obtained results showed that the proposed method outperformed other local similarity-based methods. Shao and co-authors applied two Single Layer Feed Forward Neural Networks (Extreme Learning Machine and Radial Basis Function Network) on a periodic soil moisture time series (Shao et al., 2016). This method performed better than a linear interpolation and ARIMA in infilling missing segments. However, it requires parameter tuning in order to perform well.
Table 1
Advantages and Drawbacks of the reported single imputation methods.
Method Advantages Drawbacks
Mean Easy to implement - Underestimates standard deviation
- Ignores relationships between variables
Add a category Easy to implement Only works with categorical and MNAR data
Linear Interpolation Takes time into account Does not work when the missing gap is large
ARIMA Takes time into account Requires a periodic time series
Linear Regression Takes into account relationships between variables - Underestimates the variance
- Ignores non-linear relationships between variables
Stochastic linear regression Takes into account relationships between variables Ignores non-linear relationships between variables
KNN Takes into account relationships between variables Requires parameter tuning
ANN Takes into account the time factor Requires parameter tuning
Fig. 1. Overview of the multiple imputation method.
A brief summary of these implementations of single imputation methods is presented in Table 1, which provides the main advantages and drawbacks. A well-known limitation that they have in common is that once a missing value is imputed, it is treated as a non-missing value.
2.3. Multiple imputation methods

In order to overcome the limitations of single imputation, some authors have proposed to take into account the uncertainty of the imputed values (Little & Rubin, 1987; Neter, Maynes, & Ramanathan, 1965). For that purpose, Rubin has developed the Multiple Imputation (MI) method, which combines several single imputations (Little & Rubin, 1987), as described in the following.

2.3.1. Principles of multiple imputation

The principles of MI are illustrated in Fig. 1, based on the following main steps: (1) an imputation phase, where m datasets are produced by drawing them from a distribution, which can be different for each variable (van Buuren, 2018), (2) an analysis phase, in which the m datasets are analyzed, and (3) a pooling phase, that combines the m datasets to produce a final result, for example by calculating the mean of the imputed values for each missing value. The m datasets can be generated in parallel using parametric statistical theory and assuming a joint model for all the variables (van Buuren, 2007; Rubin & Schafer, 1990), such as in Multiple imputAtions of incoMplEte muLtIvariate dAta (AMELIA), which uses expectation-maximization with a bootstrapping algorithm (Honaker, King, & Blackwell, 2011). Such an approach lacks flexibility and may lead to bias (van Buuren, 2007). The other alternative is to generate the m datasets until a stop criterion is met: in Hong and Wu (2011) for instance, the authors iteratively used association rules to successfully estimate the missing values. Although the studied dataset had a high missing rate, it was relatively small (there were only three variables). Some other examples of sequential methods are Sequential Imputation for Missing Value (IMPSEQ) (Betrie, Sadiq, Tesfamariam, & Morin, 2014), a covariance-based imputation method, and MICE, a series of linear regressions that consider a different distribution for each variable (van Buuren, 2007; Raghunathan, Lepkowski, Hoewyk, & Solenberger, 2001). Betrie and co-authors have found that the two sequential methods outperform AMELIA (Betrie et al., 2014). In Stekhoven and Bühlmann (2011), the authors introduced a MI method called MissForest, which is similar to MICE, except that it uses Random Forest instead of Linear Regression in the imputation step. As MissForest yielded a better performance than MICE, that result is encouraging towards tweaking the MICE algorithm, which is the object of the present work. A brief summary of
Fig. 2. Overview of the MICE algorithm.
Table 2
Advantages and Drawbacks of the reported MI methods.
Method Advantages Drawbacks
AMELIA Can be applied to categorical, ordinal or continuous data Assumes a joint model for all the variables
MI using decision rules Works well when the missing-value rate is high Not adapted to data with a large number of variables
IMPSEQ Low time complexity - Lack of robustness toward outliers
- Does not take into account non-linear relationships between variables
MICE Flexibility - Does not take into account non-linear relationships between variables
- Theoretical justification needed
MissForest - Adapted to high dimensional datasets Computation time issue
- Takes into account non-linear relationships between variables
the advantages and drawbacks of the methods presented above is given in Table 2, while the original MICE principles are described in the following.
2.3.2. Main principles of MICE

The main steps of MICE are summarized in Fig. 2 and detailed in Algorithm 1. The MICE algorithm implementation was based on a method described in Azur, Stuart, Frangakis, and Leaf (2011). It assumes that missing data are of MAR type. The first step is to initialize the missing values to the mean of each column. Then the missing values of the first variable are reset to “missing”. After that, a regression model is fitted on the subset of the dataset where the value of this variable is present. Finally, the obtained model is used to fill in the value and update the dataset. This process is repeated for each variable until all the missing data are estimated. The whole process, first step excluded, is reiterated n_cycles times until the estimated data converge. In the literature, it is advised to increase the number of cycles as a function of the size of the dataset and the missingness ratio (Graham, Olchowski, & Gilreath, 2007). Although MICE has been proved efficient in the literature, the trade-off between computational cost and performance becomes imbalanced when dealing with large datasets and/or datasets with a high missingness rate. Indeed, the number of imputed datasets has to be increased, and so does the computational time. Furthermore, a high missingness rate implies high uncertainty. Another key issue is that this form of the algorithm is based on linear regression, which may not reflect the actual relationships between the variables of the current study. To address these issues, an improved version of MICE is proposed and described in the following.
3. The proposed method to improve MICE

As noted above, the water quality dataset considered in this study has a very high missing rate (82%). Besides, there is a great amount of variables (more than 200), each of which is concerned with at least one missing value. The methods mentioned above, including the most performing ones, have been applied in a less constrained context and can therefore fail to provide good results in the specific case of the dataset considered in this paper. It is then proposed to take advantage of the ability of Machine Learning algorithms for handling such issues in order to improve MICE. The two main ideas are: (1) define a set of competing methods, and then (2) replace the Linear Regression in the original MICE by each of these methods in order to select the most performing one that fits the context of the present study.

The competing methods have been chosen among the most performing supervised learning algorithms in the literature, namely Random Forest (RF), Boosted Regression Trees (BRT), and Support Vector Regression (SVR). Besides, K-Nearest Neighbors (KNN), which is commonly used to solve missingness, has also been selected.
The main steps of the proposed method are illustrated in Fig. 3.

1. The first phase of the original MICE is initialized (step 1).
2. A competing method is then chosen, followed by a mechanism for optimally setting its hyperparameters (step 2).
3. Next, phase (II) of the original MICE is modified by replacing Linear Regression with the chosen method, and then launched in a loop that runs for the predefined number of cycles (step 3).
Fig. 3. The proposed method for model selection to improve MICE.
Algorithm 1 MICE.

Input:
• X: incomplete data matrix of size n_obs × n_features
• n_cycles: number of cycles
Output:
• Completed data matrix of size n_obs × n_features

X_full := mean_impute(X)
for i := 1 to n_cycles do
  for j := 1 to n_features do
    y_j := X_j  /* the j-th column */
    X_(j) := X \ X_j  /* all columns except the j-th */
    m := {i | X_j[i] != NaN}, m ⊂ {1, ..., n}  /* m denotes the indices where X_j is not missing */
    regressor := linear_regressor()
    regressor.fit(X_(j)[m], y_j[m])  /* the model is fitted on the subset of the dataset where X_j is not missing */
    y_j[¬m] := regressor.predict(X_(j)[¬m])  /* ¬m denotes the indices where X_j is missing */
  end for
end for
return X_full
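Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `mice_impute` is invented, convergence checking is omitted, and the regressor is passed as a factory so that, as in the proposed method, Linear Regression can be swapped for another estimator.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def mice_impute(X, n_cycles=10, make_regressor=LinearRegression):
    """Minimal MICE sketch: X is a 2-D float array with NaNs for missing values."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)                       # remember where the holes are
    X_full = X.copy()
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):                 # phase I: mean imputation
        X_full[missing[:, j], j] = col_means[j]
    for _ in range(n_cycles):                   # phase II: chained regressions
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue
            obs = ~missing[:, j]                # rows where X_j is observed
            others = np.delete(X_full, j, axis=1)
            reg = make_regressor()
            reg.fit(others[obs], X[obs, j])     # fit on observed rows only
            X_full[~obs, j] = reg.predict(others[~obs])
    return X_full
```

Passing, say, `make_regressor=sklearn.svm.SVR` in place of `LinearRegression` reproduces, at sketch level, the MICE-SVR variant studied in this paper.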
4. After convergence, performance indicators for the current method are computed (step 4).
5. When all the competing methods have been processed according to the four previous steps, a selection mechanism takes place by comparing their performance indicators (step 5).
6. Finally, the best method is applied to solve the missingness (step 6).
Due to the high missingness rate, the optimal choice of the hyperparameters (as considered in step 2) is based on a modified version of the studied dataset, constructed according to the following procedure:

• For each variable, a triangular distribution is simulated with different parameters (min, mode, max). If a variable always has the same value, then that value is replicated in each observation. The triangular distribution has been used because it provides a simple representation of the real distribution of the dataset and allows more flexibility by taking into account the uncertainty of the values.
• The data are scaled so that the units of the variables do not play any role.
• The observations are shuffled and the missingness distribution of the real dataset is reproduced in order to mimic the real problem as accurately as possible.
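The construction above can be sketched as follows. This is a hedged illustration: the helper name `make_tuning_dataset`, the use of the column median as the triangle's mode, and the standardization step are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np

def make_tuning_dataset(X, seed=0):
    """X: 2-D float array with NaNs. Returns a complete synthetic dataset
    plus the real missingness mask, for hyperparameter tuning."""
    rng = np.random.default_rng(seed)
    n_obs, n_feat = X.shape
    sim = np.empty_like(X, dtype=float)
    for j in range(n_feat):
        col = X[~np.isnan(X[:, j]), j]
        lo, hi = col.min(), col.max()
        if lo == hi:
            sim[:, j] = lo                # constant variable: replicate the value
        else:
            # triangular(min, mode, max); the median stands in for the mode
            sim[:, j] = rng.triangular(lo, np.median(col), hi, size=n_obs)
    # scale so that the units of the variables play no role
    sim = (sim - sim.mean(axis=0)) / (sim.std(axis=0) + 1e-12)
    sim = sim[rng.permutation(n_obs)]     # shuffle the observations
    mask = np.isnan(X)                    # the real missingness pattern
    return sim, mask
```

Applying `mask` to `sim` yields an artificially incomplete dataset whose true values are known, so the MSE of each candidate imputation model can be measured.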
Two main performance indicators have been used for the comparison (as realized in step 5), namely processing time and Mean Squared Error (MSE).
3.1. Theoretical background of the competing methods

3.1.1. Random Forest

Random Forest is an ensemble method based on fully grown regression trees. The objective is to build several weak learners (the regression trees) in parallel in order to produce a strong regressor. The main steps are as follows:

1. The observations are sampled with replacement (bootstrap aggregating).
2. A set of variables is selected randomly.
3. The tree is built upon the observations from step (1) and the variables from step (2).
4. The final prediction is made by averaging over the predictions of all decision trees.

In this algorithm, one of the most relevant hyperparameters to set in order to make the model perform well is the number of trees.
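As a small illustration of this hyperparameter with scikit-learn (the toy data and variable names are illustrative, not the paper's dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=200)

# n_estimators is the "number of trees" discussed above: more trees usually
# improve the fit, at the cost of a longer training time
for n_trees in (10, 50, 100):
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    rf.fit(X, y)
    print(n_trees, round(rf.score(X, y), 3))
```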
3.1.2. Boosted Regression Trees

Similarly to the Random Forest algorithm, BRT is an ensemble method based on regression trees. Gradient boosting is used to train the weak learners (shallow regression trees) sequentially. In this algorithm, a higher focus is set on observations that have higher errors on the previous tree, and a gradient descent is used to minimize the loss function (least squared errors) at each step.
Let y_i be the target value and f(x_i) its predictor. The objective function is given in Eq. (4):

L(y, f) = Σ_{i=1..n} l(y_i, f(x_i))   (4)

where l(y_i, f(x_i)) := (y_i − f(x_i))². The algorithm goes as follows:

f_0 is the trivial tree; it returns the mean value of Y.
For k := 1 to m:
• Calculate the negative gradient −∂l(y_i, f(x_i))/∂f(x_i), which corresponds to the residual, for i = 1 to n.
• Fit a regression tree h_k to the residuals.
• Create a model f_k = f_{k−1} + ν γ_k h_k, where γ_k is the step magnitude, found by searching argmin_γ Σ_{i=1..n} l(y_i, f_{k−1}(x_i) + ν γ h_k(x_i)), and ν is the learning rate.
Return f_m.

For this algorithm, the number of trees m, as well as the learning rate ν, are the hyperparameters that need to be set by the user in order for the method to perform well.

3.1.3. K-Nearest Neighbors
Let X and y be the training data, X∗ a new observation and y∗ the associated value to predict. The KNN algorithm goes through the following steps:

1. Calculate the distance between X∗ and each of the observations of the training set;
2. Take the y values of the K closest observations y_{i1}, y_{i2}, ..., y_{iK};
3. Assign to y∗ a linear combination of these values (usually the mean).

Three hyperparameters have to be defined properly so that the algorithm performs well: the distance, the number of neighbors K, and the type of aggregation of the neighbors' values.
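These three hyperparameters map directly onto scikit-learn's `KNeighborsRegressor` (`metric`, `n_neighbors`, `weights`); a small sketch with made-up data:

```python
from sklearn.neighbors import KNeighborsRegressor

# K = 4 neighbors, euclidean distance, distance-weighted aggregation
knn = KNeighborsRegressor(n_neighbors=4, metric="euclidean", weights="distance")

X_train = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y_train = [0.0, 1.0, 2.0, 3.0, 4.0]
knn.fit(X_train, y_train)
y_star = knn.predict([[1.5]])   # weighted combination of the 4 nearest targets
```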
3.1.4. Support Vector Regression

Let X, y be training data. The objective of SVR is to find a function f such that the deviation of f(·) from the real values y is at most ε (Smola & Schölkopf, 2004). If the problem has no solution, slack variables ξ_i, ξ_i∗ are introduced to tolerate part of the error. First, let us consider the case where f is linear, i.e. f(x) = w·x + b. f is then the solution of the following optimization problem (Eq. (5)):

Minimize  (1/2) ||w||² + C Σ_{i=1..n} (ξ_i + ξ_i∗)
s.t.  y_i − ⟨w, x_i⟩ − b ≤ ε + ξ_i
      ⟨w, x_i⟩ + b − y_i ≤ ε + ξ_i∗
      ξ_i, ξ_i∗ ≥ 0   (5)

where C > 0 is the trade-off between the flatness of f and the amount of tolerated deviations larger than ε, and ⟨·,·⟩ is a scalar product. By using the dual representation of the problem based on Lagrange multipliers, we finally get f(x) = Σ_{i=1..n} (α_i − α_i∗) ⟨x_i, x⟩ + b, where α_i, α_i∗ are the Lagrange multipliers. If the adequate f is not linear, we can map the data into a high-dimensional space where the function f becomes linear (Fig. 4). Instead of searching for the expression of the mapping φ, a function k, called a kernel function, which satisfies k(x, x') = ⟨φ(x), φ(x')⟩, is used. The existence of such a function is guaranteed by Mercer's theorem. ε, C and the kernel function are the Support Vector Regression (SVR) hyperparameters that need to be selected properly for the performance of the algorithm.

Fig. 4. Mapping to the feature space in SVR.

4. Application and results

4.1. The context of the study
The incomplete dataset used in this paper is taken from water sample analyses made at Oursbelille, in the Adour plain, South-West of France, from 1991 to 2017. The operational principle of this drinking water collection point is described in Fig. 5. First, the water is pumped, its nitrate rate is measured, and it is conveyed to large aerial tanks in order to be treated by active charcoal. Then, the treated water is stored in a water tank. In a third step, some sensing devices are used to monitor some quality indicators, such as the pH. In a fourth step, on demand, the stored water is chlorinated before being conveyed to another underground well, a few kilometers away from the pumping well. From this second storage tank, water is distributed to the citizens of the Adour region. The region benefits from an oceanic climate, with a rainy winter and an average temperature ranging from 4 to 19 °C.

The acquired data contain 148 observations of 411 water quality indicators, with an overall missingness of 82%. Fig. 6a is an overview of the dataset, where some of the measured water quality indicators are displayed, while Fig. 6b summarizes the missingness distribution per variable in the dataset.

Only the variables that are measured at least 5 times are considered, which reduces the dataset to 257 variables (52% of the 411 variables). It is noted that the removed variables do not restrict the analysis, since they are not among the common parameters for water quality assessment found in the literature.
4.2. Settings and assumptions of the implementation

Based on the presentation of the three missingness patterns, and the nature of the studied dataset (as described in the previous subsection), we can assume that our study falls within the MAR pattern.
Fig. 5. Operational principle of the drinking water well of Oursbelille.
Fig. 6. Description of the dataset: (a) overview of the dataset; (b) number of missing values per variable.

Moreover, the proposed method depends on several factors: (a) the number of cycles to perform the imputations, (b) the number of values defined for each hyperparameter, (c) the size of the dataset, (d) the number of variables of interest, and (e) the complexity of the ML algorithm itself. For these reasons, in order to obtain relevant results in a reasonable running time, and in opposition to what is commonly used in the literature, only one value for the number of cycles (i.e. 10 cycles) is considered in this work.
The implementation of the proposed method was performed using the Python programming language, on a computer with the following main features:

• Operating System: Windows 10;
• Processor: Intel(R) Core(TM) i5-7200U CPU @ 2.50 GHz (2.70 GHz);
• RAM: 8.00 GB.

The corresponding results are described and discussed in the following.
4.3. Implementation of the proposed method

The main steps of the proposed method have been implemented according to the following explanations.

• Step 1. The first phase of the original MICE, that is mean imputation, is launched (initialization step).
• Step 2. The next step concerns the hyperparameter tuning of the Machine Learning algorithms. There is no analytical solution that allows to find the optimal values. Therefore, a cross-validation is performed using the modified dataset, and a mean squared error (MSE) is measured. The optimal hyperparameters are those that yield the lowest MSE. Note that only a limited number of candidate values have been taken into account, because adding more would drastically affect the algorithmic complexity.
The competing methods are MICE, MICE combined with RF (MICE-RF), MICE combined with BRT (MICE-BRT), MICE combined with KNN (MICE-KNN) and MICE combined with SVR (MICE-SVR).
The candidate values for the hyperparameters of the four Machine Learning algorithms (KNN, RF, BRT, SVR) are detailed in Table 3. For KNN, let us note that since the studied dataset contains variables with only five non-missing values, the number of neighbors is at most 4.
• Step 3. Phase (II) of the original MICE is modified by replacing Linear Regression with one of the competing algorithms, each with its optimal hyperparameters (as obtained in step 2).
• Step 4. The performance indicators, namely MSE and processing time, are computed for each algorithm.
• Step 5. The method that performed best in terms of MSE, with a reasonable computing time, is then selected.
• Step 6. Finally, the winning method is used to solve the missingness.
Table 3
Candidate values of the hyperparameters for each Machine Learning method.

Algorithm  Hyperparameter  Candidate values
RF         n_trees         {10, 15, 20, 50, 100}
BRT        m               {30, 50, 100, 150}
           ν               {0.01, 0.1, 0.5}
KNN        K               {2, 3, 4}
           d               {euclidean, manhattan}
           y∗              {uniform, weighted}
SVR        ε               {0.01, 0.1}
           C               {0.01, 0.1, 1, 10, 100}
           kernel          {rbf, poly, sigmoid}
           γ               {1e-3, 0.01, 0.1, 1}
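Step 2's search over such candidate values can be sketched with scikit-learn's `GridSearchCV`, shown here for the SVR row of Table 3 (a hedged illustration: the synthetic data stand in for the modified dataset of Section 3, and `neg_mean_squared_error` is scikit-learn's MSE-based scorer):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=80)

param_grid = {                       # candidate values from Table 3 (SVR row)
    "epsilon": [0.01, 0.1],
    "C": [0.01, 0.1, 1, 10, 100],
    "kernel": ["rbf", "poly", "sigmoid"],
    "gamma": [1e-3, 0.01, 0.1, 1],
}
search = GridSearchCV(SVR(), param_grid, scoring="neg_mean_squared_error", cv=3)
search.fit(X, y)
best = search.best_params_           # the combination with the lowest MSE
```

The grid size (here 2 × 5 × 3 × 4 = 120 combinations, each cross-validated) is why the paper restricts the number of candidate values per hyperparameter.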
The main results of this implementation are presented and discussed in the next subsection.

4.4. Results and discussion

In the following, only steps 2, 4 and 5, which contain the main results of the implementation, are presented.

• Step 2: Hyperparameter tuning.
Random Forest. In this algorithm, the performance increases with the number of trees. However, the algorithm rapidly becomes time consuming. The objective is to find the smallest value for which the performance is good enough. Although it is not the optimal value, the number of trees is set to 15 in order to reduce the computation time. Furthermore, the error does not decrease much between 15 and 100 estimators (see Table 4).

Table 4
Performance indicator (MSE) of the main RF hyperparameter.

n_trees  MSE
10       0.5159
15       0.4943
20       0.4850
50       0.4691
100      0.4653

Fig. 7. Variation of MSE to choose the hyperparameters in BRT.
Fig. 8. Variation of MSE to choose the hyperparameters in KNN: (a) choice of K and the linear combination; (b) choice of K and the distance.
Fig. 9. Variation of MSE to choose the hyperparameters in SVR: (a) choice of ε and the kernel function; (b) choice of C and ε.
BoostedRegression Trees. Similarly to the previous algorithm,the besttrade-off betweencomputingtimeandperformanceissought. Itisnotedthatthenumberoftreesishigher,becauseshallowtrees arebuiltinBRTinsteadoffullygrownonesinRF.
Fig. 7 represents the MSE as a function of the learning rate ν, where the labels represent the number of trees. According to these results, the optimal hyperparameters for this study are ν = 0.01 and m = 150. For the sake of computation time, hyperparameters with a slightly higher mean squared error (a difference of only 0.001) are chosen: ν = 0.1 and m = 30.
K-Nearest Neighbors. For this algorithm, the hyperparameters to tune are the number of neighbors K, the distance d and the method y∗ used to linearly combine the neighbors' values. In this study, the Euclidean distance is chosen, K = 4, and y∗ is the weighted mean of the K nearest neighbors. This choice is illustrated in Fig. 8: the MSE score is lower for these values.
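In scikit-learn terms, the retained BRT and KNN configurations could be instantiated as below. This is a sketch of a plausible mapping, not the paper's code; in particular, `weights="distance"` is assumed to correspond to the weighted mean y∗, and `GradientBoostingRegressor` is taken as the BRT implementation.

```python
# Retained BRT setting: learning rate nu = 0.1, m = 30 shallow trees.
from sklearn.ensemble import GradientBoostingRegressor
# Retained KNN setting: K = 4, Euclidean distance, weighted mean of neighbors.
from sklearn.neighbors import KNeighborsRegressor

brt = GradientBoostingRegressor(learning_rate=0.1, n_estimators=30)
knn = KNeighborsRegressor(n_neighbors=4, metric="euclidean",
                          weights="distance")
```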
Support Vector Regression. For this algorithm, ε, C, the kernel function and the parameter γ associated with the kernel function need to be tuned. In Fig. 9, it can be seen that the MSE is generally lowest for the polynomial kernel and for ε = 0.1. The lowest MSE score is obtained with ε = 0.01, C = 1, kernel = poly and the associated γ = 0.01.

Table 5
Performance indicator scores.

                     MICE     MICE-SVR  MICE-4NN  MICE-RF  MICE-BRT
Processing time (s)  6.87     5.29      8.25      65.18    32.59
MSE                  1.09e24  0.44      0.58      0.55     0.54

• Step 4: Computing the performance indicators.
The results summarized in Table 5 show that MICE-SVR is the best performing method regarding both processing time (5.29 seconds) and MSE (0.44).
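The MICE-SVR hybridization can be sketched with scikit-learn's experimental `IterativeImputer`, a MICE-inspired chained-equations imputer that accepts an arbitrary regressor; this is an illustrative stand-in, not necessarily the implementation used in the study. The SVR hyperparameters are those retained above, and `max_iter=10` mirrors the 10 imputation cycles; the toy matrix is invented for the example.

```python
# Chained-equation (MICE-style) imputation with SVR as the per-variable
# regressor: each variable with missing entries is regressed on the others,
# cycling until max_iter rounds are done.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.svm import SVR

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 6.0, 9.0],
              [7.0, 8.0, 12.0]])

mice_svr = IterativeImputer(
    estimator=SVR(kernel="poly", C=1, epsilon=0.01, gamma=0.01),
    max_iter=10, random_state=0)
X_filled = mice_svr.fit_transform(X)  # no NaN remains after imputation
```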
The processing time is significantly high when combining MICE with RF and BRT. Indeed, all three methods (MICE, RF and BRT) are already computationally expensive by themselves. With the number of estimators set to 15 for Random Forest, the number of cycles set to 10 for MICE and 251 variables to fill, MICE-RF computes 15 × 10 × 251 = 37,650 fully grown regression trees. Similarly, MICE-BRT computes 43,500 shallow regression trees. MICE performed the worst in terms of MSE in the current implementation of the algorithm. Indeed, all the variables were used as predictors in the regression, whereas an intermediate variable selection step would have been appropriate. This result also indicates that the relationships between the variables are not linear.
MICE-KNN performs slightly worse than the other combinations of MICE with Machine Learning algorithms. This is due to the fact that the most similar observations are logically those that are closer in time. However, these values are not systematically filled, and the closest neighbors are only searched among non-missing observations for a given variable.
• Step 5: Selection of the most performing method.
MICE-SVR performed best on both criteria; it is therefore the best performing competing method in this particular case. The proposed methodology can handle datasets with a high missingness rate and is also suitable for high-dimensional data. It is a flexible method that can take into account complex non-linear relationships between variables (provided the competing methods are non-linear). It makes it possible to automate the selection of the best method to handle missingness, which reduces the workload of the data analyst, who can then focus on tasks with higher added value, aiming at extracting knowledge. However, a few limitations are worth noting, particularly the number of cycles preset to 10 and the relatively small number of candidate hyperparameter values (which does not allow a rigorous sensitivity analysis of these hyperparameters). Furthermore, these parameters are tuned on an artificial dataset constructed by modifying the real one. All these limitations are mainly due to algorithmic complexity, which constitutes both a challenge and an important scientific issue.
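The selection step above can be sketched as a simple decision rule: score every candidate on the artificial dataset and keep the one with the lowest MSE, breaking ties by processing time. The scores below are the Table 5 values, reproduced only to illustrate the rule:

```python
# Performance indicators per competing method (values from Table 5).
scores = {
    "MICE":     {"time": 6.87,  "mse": 1.09e24},
    "MICE-SVR": {"time": 5.29,  "mse": 0.44},
    "MICE-4NN": {"time": 8.25,  "mse": 0.58},
    "MICE-RF":  {"time": 65.18, "mse": 0.55},
    "MICE-BRT": {"time": 32.59, "mse": 0.54},
}

# Lowest MSE first, processing time as tie-breaker.
best = min(scores, key=lambda m: (scores[m]["mse"], scores[m]["time"]))
# → "MICE-SVR", matching the study's conclusion.
```

A fuzzy aggregation of the two indicators into a single score, as proposed in the perspectives, would replace this lexicographic tie-breaking.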
5. Conclusion and perspectives

It is widely acknowledged that data-driven methods provide powerful algorithms to analyze any issue of interest to decision-makers. However, performing such analyses with incomplete data may not lead to reliable decisions. In this paper, a methodology for selecting the best algorithm to address the issue of data imputation, in the context of water quality assessment, has been proposed. A benchmark of four of the most powerful and commonly used ML algorithms (Random Forest, Boosted Regression Trees, K-Nearest Neighbors and Support Vector Regression) has been performed for that purpose. The results showed that MICE-SVR is the best in that it converges faster than the three others and provides the best performance (notably in terms of average prediction error). It can then be applied to high missingness datasets, including data for water quality assessment, which are often incomplete, as in the case of the Adour (south-west of France) considered in the present study.
Based on the weaknesses of the proposed method, as mentioned in the discussion of the results, the following improvements are planned for further studies: (1) further automate the model selection mechanism by setting fuzzy rules in an inference engine that will aggregate all the performance indicators into a single indicator; (2) improve, for each competing method, the optimal choice of the hyperparameters using evolutionary algorithms, in order to speed up the computing time and increase the number of values tested for each hyperparameter; (3) automate the choice of the number of cycles needed for the convergence of the imputations by taking into account the size of the data and its missingness rate; (4) introduce the temporal dimension within the imputation process.
Conflict of interest

There is no conflict of interest.
CRediT authorship contribution statement

Romy Ratolojanahary: Conceptualization, Methodology, Software, Formal analysis, Investigation, Writing - original draft. Raymond Houé Ngouna: Conceptualization, Methodology, Investigation, Writing - original draft, Writing - review & editing, Supervision. Kamal Medjaher: Conceptualization, Methodology, Investigation, Writing - original draft, Writing - review & editing, Supervision, Funding acquisition. Jean Junca-Bourié: Investigation, Resources, Supervision, Funding acquisition. Fabien Dauriac: Investigation, Resources, Funding acquisition. Mathieu Sebilo: Investigation, Writing - review & editing, Supervision.