Open Archive Toulouse Archive Ouverte (OATAO)
OATAO is an open access repository that collects the work of some Toulouse researchers and makes it freely available over the web where possible.

This is an author's version published in: https://oatao.univ-toulouse.fr/23807
Official URL: https://doi.org/10.1016/j.eswa.2019.04.049

To cite this version: Ratolojanahary, Romy and Houé Ngouna, Raymond and Medjaher, Kamal and Junca-Bourié, Jean and Dauriac, Fabien and Sebilo, Mathieu. Model selection to improve multiple imputation for handling high rate missingness in a water quality dataset. (2019) Expert Systems with Applications (131). 299-307. ISSN 0957-4174

Any correspondence concerning this service should be sent to the repository administrator: tech-oatao@listes-diff.inp-toulouse.fr
Model selection to improve multiple imputation for handling high rate missingness in a water quality dataset
Romy Ratolojanahary a,∗, Raymond Houé Ngouna a, Kamal Medjaher a, Jean Junca-Bourié b, Fabien Dauriac c, Mathieu Sebilo d

a Laboratoire Génie de Production, École Nationale d’Ingénieurs de Tarbes, BP1629, 47 avenue d’Azereix, Tarbes Cedex 16 65016, France
b Agence de l’eau Adour-Garonne, 7 Passage de l’Europe, Pau 64000, France
c Chambre d’Agriculture des Hautes-Pyrénées, 20 Place du Foirail, Tarbes 65000, France
d IEES, Université Pierre et Marie Curie, 4 Place Jussieu, Paris 75005, France
Keywords: Multiple imputation; High missingness; Model selection; Machine learning; Data preprocessing; Water quality

Abstract
In the current era of “information everywhere”, extracting knowledge from a great amount of data is increasingly acknowledged as a promising channel for providing relevant insights to decision makers. One key issue encountered may be the poor quality of the raw data, particularly due to the high missingness, that may affect the quality and the relevance of the results’ interpretation. Automating the exploration of the underlying data with powerful methods, allowing to handle missingness and then perform a learning process to discover relevant knowledge, can then be considered as a successful strategy for systems’ monitoring. Within the context of water quality analysis, the aim of the present study is to propose a robust method for selecting the best algorithm to combine with MICE (Multivariate Imputations by Chained Equations) in order to handle multiple relationships between a high amount of features of interest (more than 200) concerned with a high rate of missingness (more than 80%). The main contribution is to improve MICE, taking advantage of the ability of Machine Learning algorithms to address complex relationships among a large number of parameters. The competing methods that are implemented are Random Forest (RF), Boosted Regression Trees (BRT), K-Nearest Neighbors (KNN) and Support Vector Regression (SVR). The obtained results show that the hybridization of MICE with SVR, KNN, RF and BRT performs better than the original MICE taken alone. Furthermore, MICE-SVR gives a good trade-off in terms of performance and computing time.
1. Introduction
The proliferation of sensing devices has increased the ability of organizations to acquire various and great amounts of data, allowing them to implement real-time monitoring of their systems. This is generally based on the analysis of complex relationships between several factors of interest, such as in water quality analysis. Online monitoring has indeed enabled the development of decision systems that are able to accelerate decision-making and anticipate actions to prevent undesired events or to eradicate critical issues. To achieve such a goal, it is required to pre-process the raw data, especially when some values are missing at a certain level.
∗ Corresponding author.
E-mail addresses: romy-alinoro.ratolojanahary@enit.fr (R. Ratolojanahary), raymond.houe-ngouna@enit.fr (R. Houé Ngouna), kamal.medjaher@enit.fr (K. Medjaher), jean.junca-bourie@eau-adour-garonne.fr (J. Junca-Bourié), fdauriac@hautes-

Missing data is a recurring phenomenon in real-world applications (Sterne et al., 2009; Yang, Liu, Zeng, & Xie, 2019). It may occur due to sensor failures, a bad or non-existing strategy for data acquisition, budget issues, lack of response from a participant in the case of a survey, or various other reasons. If the complete data are representative of the studied phenomenon, this missing information is negligible; otherwise the results may be incorrect and may lead to wrong interpretations. For example, anomalies could go undetected if they happen during a non-monitored period of time.

There are two ways of dealing with missing data: deletion or imputation (Buhi, 2008). Deletion means discarding the observations or the variables with missing data, which is called complete-case analysis, while imputation consists in reconstructing the missing values. Because of its simplicity, deletion is usually the default method used in practice. However, there are many cases in various fields in which this method showed some limitations. Indeed, it decreases the sample size and may lead to a loss of substantial information. In Clark and Altman (2003) for instance, the number of observations dropped from 1189 to 518 (43% of the original data) in an ovarian cancer dataset, which led to biased interpretation.
Another deletion method is pairwise deletion, through which only non-missing values are used for analyses, for instance in correlation score calculations, where the method fails when the two correlated variables are not filled at the same time. Instead of discarding an observation or a variable concerned with missing values, it is preferable to estimate those missing values accurately in order to provide relevant interpretations.
Quoting White and co-authors, “awareness has grown of the need to go beyond complete-case analysis”, and some major improvements of the simplistic methods have been proposed in the literature since Rubin’s innovative proposal for approaching missingness (White, Royston, & Wood, 2010). Among others, Rubin, who is the author of Multiple Imputation (MI), defined a conceptual framework for characterizing missing data that allows to distinguish various types and to determine when missing data can be ignored (Little & Rubin, 1987; Rubin, 1976). The major insight of the proposed imputation method is that it addresses uncertainty and complexity of the data structure, allowing to go beyond deleting or discarding data.

Following Rubin, van Buuren introduced Multiple Imputations by Chained Equations (MICE), an MI technique that requires fewer assumptions on missingness and also handles relationships between variables (van Buuren & Groothuis-Oudshoorn, 2011). However, the original MICE considers only linear relationships and has been successfully applied to datasets with at most 70% of missingness. It may therefore fail in other cases, such as the water quality data considered in the present study, which are characterized by a high rate of missingness and a great amount of factors of interest that are not necessarily linearly related. This suggests the need for an alternative method to improve the imputation mechanism in order to provide relevant interpretation of the results, which is the purpose of this work.
The rest of the paper is organized as follows: the main imputation methods available in the literature are reviewed in Section 2, followed by the presentation of a method to improve MICE for multiple data imputation in Section 3. An application of the proposed method on an experimental dataset, along with the associated results, is described in Section 4, while the last section contains the conclusion and perspectives of the present work.
2. Related work
In order to choose an appropriate method for handling missing data, the underlying cause of the missingness has to be investigated. Indeed, as mentioned in Buhi (2008), each method only works under certain assumptions, namely complete randomness, conditional randomness or systematic reasons.
2.1. Missingness patterns
The conceptual framework allowing to take into account certain assumptions, as noted above, has been defined by Rubin (1976). There are three types of missing data, depending on the missing mechanism: (1) Missing Completely at Random (MCAR), (2) Missing at Random (MAR) and (3) Missing Not at Random (MNAR).

Let R be the locations of the missing data in a dataset X = (X_obs, X_miss), and ψ the parameters of the missing data model, where X_obs and X_miss are respectively the observed and the missing values. The MCAR, MAR and MNAR patterns are formally defined as follows (van Buuren, 2018):
• Data are MCAR if the probability of missingness is independent of both the observed variables and the variables with missing values. This is the case, for example, when people forget to answer a question in a survey. Formally,

P(R = 0 | X_obs, X_miss, ψ) = P(R = 0 | ψ)   (1)

• Data are MAR if the probability of missingness is due entirely to the observed variables and is independent of the unseen data. In other words, the missingness is a function of some other observed variables in the dataset (for example, people of one sex are less likely to disclose their weight):

P(R = 0 | X_obs, X_miss, ψ) = P(R = 0 | X_obs, ψ)   (2)

Therefore, MAR data are a good candidate for data imputation based on observed variables (Buhi, 2008).

• Data are MNAR if the missing value is related to the actual values (for example, people who weigh more are more likely to not disclose their weight): in that case,

P(R = 0 | X_obs, X_miss, ψ)   (3)

depends on all three elements.
When data are MNAR, the missingness process is called non-ignorable, meaning that the cause of the missingness must be included in the model, whereas the MAR and MCAR missingness processes are called ignorable. Following the assumptions behind these three patterns, several methods have been provided in the literature for solving the missingness appropriately.
2.2. Single imputation methods
Methods that compute one single value per missing datum are referred to as single imputation methods. The most common single imputation methods are mean, median or mode imputation, consisting in replacing the missing value with the mean, median or mode of the associated variable (Buhi, 2008). In this case, the missing value is easy to compute, but the method ignores the correlation among the variables and underestimates the standard deviation. If the variable containing missing values is categorical, a simple option is to create a new category for the missing values. This method is suitable for MNAR data, i.e. when the missingness is correlated to the values of the missing data. When a variable of the incomplete dataset is a periodic time series, a more elaborate single imputation technique is to apply a linear interpolation or an Autoregressive Integrated Moving Average (ARIMA) model to fill in the missing values (Shao, Meng, & Sun, 2016). Although those two techniques are simple, the first one is not efficient when the missing gap is large, and the second one requires a periodic time series. Another technique involves predicting the values from the observed variables. For example, K-Nearest Neighbors (KNN) replaces the missing value with a linear combination of the K nearest non-missing observations (Jordanov, Petrov, & Petrozziello, 2018; Tutz & Ramzan, 2015). To use this algorithm, it is necessary to choose the optimal K and define a distance measurement between two observations. A local similarity imputation based on fast clustering was proposed in Zhao, Chen, Yang, Hu, and Obaidat (2018). The authors partition the incomplete data with a fast clustering method (Stacked Autoencoder-based), then fill the missing data within each cluster using a KNN algorithm. The obtained results showed that the proposed method outperformed other local similarity-based methods. Shao and co-authors applied two Single Layer Feed Forward Neural Networks (Extreme Learning Machine and Radial Basis Function Network) on a periodic soil moisture time series (Shao et al., 2016). This method performed better than a linear interpolation and ARIMA in infilling missing segments. However, it requires parameter tuning in order to perform well.
Table 1
Advantages and Drawbacks of the reported single imputation methods.
Method Advantages Drawbacks
Mean Easy to implement - Underestimates standard deviation
- Ignores relationships between variables
Add a category Easy to implement Only works with categorical and MNAR data
Linear Interpolation Takes time into account Does not work when the missing gap is large
ARIMA Takes time into account Requires a periodic time series
Linear Regression Takes into account relationships between variables - Underestimates the variance
- Ignores non-linear relationships between variables
Stochastic linear regression Takes into account relationships between variables Ignores non-linear relationships between variables
KNN Takes into account relationships between variables Requires parameter tuning
ANN Takes into account the time factor Requires parameter tuning
Fig. 1. Overview of the multiple imputation method.
A brief summary of these implementations of single imputation methods is presented in Table 1, which provides the main advantages and drawbacks. A well-known limitation that they have in common is that once a missing value is imputed, it is treated as a non-missing value.
2.3. Multiple imputation methods

In order to overcome the limitations of single imputation, some authors have proposed to take into account the uncertainty of the imputed values (Little & Rubin, 1987; Neter, Maynes, & Ramanathan, 1965). For that purpose, Rubin has developed the Multiple Imputation (MI) method, which combines several single imputations (Little & Rubin, 1987), as described in the following.

2.3.1. Principles of multiple imputation

The principles of MI are illustrated in Fig. 1, based on the following main steps: (1) an imputation phase, where m datasets are produced by drawing them from a distribution, which can be different for each variable (van Buuren, 2018), (2) an analysis phase, in which the m datasets are analyzed, and (3) a pooling phase, that combines the m datasets to produce a final result, for example by calculating the mean of the imputed values for each missing value. The m datasets can be generated in parallel using parametric statistical theory and assuming a joint model for all the variables (van Buuren, 2007; Rubin & Schafer, 1990), such as in Multiple imputAtions of incoMplEte muLtIvariate dAta (AMELIA), which uses expectation-maximization with a bootstrapping algorithm (Honaker, King, & Blackwell, 2011). Such an approach lacks flexibility and may lead to bias (van Buuren, 2007). The other alternative is to generate the m datasets until a stop criterion is met: in Hong and Wu (2011) for instance, the authors iteratively used association rules to successfully estimate the missing values. Although the studied dataset had a high missing rate, it was relatively small (there were only three variables). Some other examples of sequential methods are Sequential Imputation for Missing Value (IMPSEQ) (Betrie, Sadiq, Tesfamariam, & Morin, 2014), a covariance-based imputation method, and MICE, a series of linear regressions that consider a different distribution for each variable (van Buuren, 2007; Raghunathan, Lepkowski, Hoewyk, & Solenberger, 2001). Betrie and co-authors have found that the two sequential methods outperform AMELIA (Betrie et al., 2014). In Stekhoven and Bühlmann (2011), the authors introduced a MI method called MissForest, which is similar to MICE, except that it uses Random Forest instead of Linear Regression in the imputation step. As MissForest yielded a better performance than MICE, that result is encouraging towards tweaking the MICE algorithm, which is the object of the present work. A brief summary of
Fig. 2. Overview of the MICE algorithm.
Table 2
Advantages and Drawbacks of the reported MI methods.
Method Advantages Drawbacks
AMELIA Can be applied to categorical, ordinal or continuous data Assumes a joint model for all the variables
MI using decision rules Works well when the missing-value rate is high Not adapted to data with a large number of variables
IMPSEQ Low time complexity - Lack of robustness toward outliers
- Does not take into account non-linear relationships between variables
MICE Flexibility - Does not take into account non-linear relationships between variables
- Theoretical justification needed
MissForest - Adapted to high dimensional datasets Computation time issue
- Takes into account non-linear relationships between variables
the advantages and drawbacks of the methods presented above is given in Table 2, while the original MICE principles are described in the following.
2.3.2. Main principles of MICE

The main steps of MICE are summarized in Fig. 2 and detailed in Algorithm 1. The MICE algorithm implementation was based on a method described in Azur, Stuart, Frangakis, and Leaf (2011). It assumes that missing data are of MAR type. The first step is to initialize the missing values to the mean of each column. Then the missing values of the first variable are reset to “missing”. After that, a regression model is fitted on the subset of the dataset where the value of this variable is present. Finally, the obtained model is used to fill in the value and update the dataset. This process is repeated for each variable until all the missing data are estimated. The whole process, first step excluded, is reiterated n_cycles times until the estimated data converge. In the literature, it is advised to increase the number of cycles as a function of the size of the dataset and the missingness ratio (Graham, Olchowski, & Gilreath, 2007). Although MICE has been proved efficient in the literature, the trade-off between computational cost and performance becomes imbalanced when dealing with large datasets and/or datasets with a high missingness rate. Indeed, the number of imputed datasets has to be increased, and so does the computational time. Furthermore, a high missingness rate implies high uncertainty. Another key issue is that this form of the algorithm is based on linear regression, which may not reflect the actual relationships between the variables of the current study. To address these issues, an improved version of MICE is proposed and described in the following.
3. The proposed method to improve MICE

As noted above, the water quality dataset considered in this study has a very high missing rate (82%). Besides, there is a great amount of variables (more than 200), each of which is concerned with at least one missing value. The methods mentioned above, including the most performing ones, have been applied in a less constrained context and can therefore fail to provide good results in the specific case of the dataset considered in this paper. It is then proposed to take advantage of the ability of Machine Learning algorithms for handling such issues in order to improve MICE. The two main ideas are: (1) define a set of competing methods, and then (2) replace the Linear Regression in the original MICE by each of these methods in order to select the most performing one that fits the context of the present study.

The competing methods have been chosen among the most performing supervised learning algorithms in the literature, namely Random Forest (RF), Boosted Regression Trees (BRT), and Support Vector Regression (SVR). Besides, K-Nearest Neighbors (KNN), which is commonly used to solve missingness, has also been selected.
The main steps of the proposed method are illustrated in Fig. 3.

1. The first phase of the original MICE is initialized (step 1).
2. A competing method is then chosen, followed by a mechanism for optimally setting its hyperparameters (step 2).
3. Next, phase (II) of the original MICE is modified by replacing Linear Regression with the chosen method, and then launched in a loop that runs for the predefined number of cycles (step 3).
Fig. 3. The proposed method for model selection to improve MICE.
Algorithm 1 MICE.

Input:
• X: incomplete data matrix of size n_obs × n_features
• n_cycles: number of cycles
Output:
• Completed data matrix of size n_obs × n_features

X_full := mean_impute(X)
for i := 1 to n_cycles do
  for j := 1 to n_features do
    y_j := X_j  /* the j-th column */
    X_(j) := X \ X_j  /* all columns except the j-th */
    m := {i | X_j[i] != NaN}, m ⊂ {1, ..., n}  /* m denotes the indices where X_j is not missing */
    regressor := linear_regressor()
    regressor.fit(X_(j)[m], y_j[m])  /* the model is fitted on the subset of the dataset where X_j is not missing */
    y_j[¬m] := regressor.predict(X_(j)[¬m])  /* ¬m denotes the indices where X_j is missing */
  end for
end for
return X_full
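Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `mice_impute` is invented, convergence checking is omitted, and the regressor is passed as a factory so that, as in the proposed method, Linear Regression can be swapped for another estimator.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def mice_impute(X, n_cycles=10, make_regressor=LinearRegression):
    """Minimal MICE sketch: X is a 2-D float array with NaNs for missing values."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)                       # remember where the holes are
    X_full = X.copy()
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):                 # phase I: mean imputation
        X_full[missing[:, j], j] = col_means[j]
    for _ in range(n_cycles):                   # phase II: chained regressions
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue
            obs = ~missing[:, j]                # rows where X_j is observed
            others = np.delete(X_full, j, axis=1)
            reg = make_regressor()
            reg.fit(others[obs], X[obs, j])     # fit on observed rows only
            X_full[~obs, j] = reg.predict(others[~obs])
    return X_full
```

Passing, say, `make_regressor=sklearn.svm.SVR` in place of `LinearRegression` reproduces, at sketch level, the MICE-SVR variant studied in this paper.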
4. After convergence, performance indicators for the current method are computed (step 4).
5. When all the competing methods have been processed according to the four previous steps, a selection mechanism takes place by comparing their performance indicators (step 5).
6. Finally, the best method is applied to solve the missingness (step 6).
Due to the high missingness rate, the optimal choice of the hyperparameters (as considered in step 2) is based on a modified version of the studied dataset, constructed according to the following procedure:

• For each variable, a triangular distribution is simulated with different parameters (min, mode, max). If a variable always has the same value, then that value is replicated in each observation. The triangular distribution has been used because it provides a simple representation of the real distribution of the dataset and allows more flexibility by taking into account the uncertainty of the values.
• The data are scaled so that the units of the variables do not play any role.
• The observations are shuffled and the missingness distribution of the real dataset is reproduced in order to mimic the real problem as accurately as possible.
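The construction above can be sketched as follows. This is a hedged illustration: the helper name `make_tuning_dataset`, the use of the column median as the triangle's mode, and the standardization step are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np

def make_tuning_dataset(X, seed=0):
    """X: 2-D float array with NaNs. Returns a complete synthetic dataset
    plus the real missingness mask, for hyperparameter tuning."""
    rng = np.random.default_rng(seed)
    n_obs, n_feat = X.shape
    sim = np.empty_like(X, dtype=float)
    for j in range(n_feat):
        col = X[~np.isnan(X[:, j]), j]
        lo, hi = col.min(), col.max()
        if lo == hi:
            sim[:, j] = lo                # constant variable: replicate the value
        else:
            # triangular(min, mode, max); the median stands in for the mode
            sim[:, j] = rng.triangular(lo, np.median(col), hi, size=n_obs)
    # scale so that the units of the variables play no role
    sim = (sim - sim.mean(axis=0)) / (sim.std(axis=0) + 1e-12)
    sim = sim[rng.permutation(n_obs)]     # shuffle the observations
    mask = np.isnan(X)                    # the real missingness pattern
    return sim, mask
```

Applying `mask` to `sim` yields an artificially incomplete dataset whose true values are known, so the MSE of each candidate imputation model can be measured.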
Two main performance indicators have been used for the comparison (as realized in step 5), namely processing time and Mean Squared Error (MSE).
3.1. Theoretical background of the competing methods

3.1.1. Random Forest

Random Forest is an ensemble method based on fully grown regression trees. The objective is to build several weak learners (the regression trees) in parallel in order to produce a strong regressor. The main steps are as follows:

1. The observations are sampled with replacement (bootstrap aggregating).
2. A set of variables is selected randomly.
3. The tree is built upon the observations from step (1) and the variables from step (2).
4. The final prediction is made by averaging over the predictions of all decision trees.

In this algorithm, one of the most relevant hyperparameters to set in order to make the model perform well is the number of trees.
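As a small illustration of this hyperparameter with scikit-learn (the toy data and variable names are illustrative, not the paper's dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=200)

# n_estimators is the "number of trees" discussed above: more trees usually
# improve the fit, at the cost of a longer training time
for n_trees in (10, 50, 100):
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    rf.fit(X, y)
    print(n_trees, round(rf.score(X, y), 3))
```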
3.1.2. Boosted Regression Trees

Similarly to the Random Forest algorithm, BRT is an ensemble method based on regression trees. Gradient boosting is used to train the weak learners (shallow regression trees) sequentially. In this algorithm, a higher focus is set on observations that have higher errors on the previous tree, and a gradient descent is used to minimize the loss function (least squared errors) at each step.
Let y_i be the target value and f(x_i) its predictor. The objective function is given in Eq. (4):

L(y, f) = Σ_{i=1..n} l(y_i, f(x_i))   (4)

where l(y_i, f(x_i)) := (y_i − f(x_i))². The algorithm goes as follows:

f_0 is the trivial tree; it returns the mean value of Y.
For k := 1 to m:
• Calculate the negative gradient −∂l(y_i, f(x_i))/∂f(x_i), which corresponds to the residual, for i = 1 to n.
• Fit a regression tree h_k to the residuals.
• Create a model f_k = f_{k−1} + ν γ_k h_k, where γ_k is the step magnitude, found by searching argmin_γ Σ_{i=1..n} l(y_i, f_{k−1}(x_i) + ν γ h_k(x_i)), and ν is the learning rate.
Return f_m.

For this algorithm, the number of trees m, as well as the learning rate ν, are the hyperparameters that need to be set by the user in order for the method to perform well.

3.1.3. K-Nearest Neighbors
Let X and y be the training data, X∗ a new observation and y∗ the associated value to predict. The KNN algorithm goes through the following steps:

1. Calculate the distance between X∗ and each of the observations of the training set;
2. Take the y values of the K closest observations y_{i1}, y_{i2}, ..., y_{iK};
3. Assign to y∗ a linear combination of these values (usually the mean).

Three hyperparameters have to be defined properly so that the algorithm performs well: the distance, the number of neighbors K, and the type of aggregation of the neighbors' values.
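These three hyperparameters map directly onto scikit-learn's `KNeighborsRegressor` (`metric`, `n_neighbors`, `weights`); a small sketch with made-up data:

```python
from sklearn.neighbors import KNeighborsRegressor

# K = 4 neighbors, euclidean distance, distance-weighted aggregation
knn = KNeighborsRegressor(n_neighbors=4, metric="euclidean", weights="distance")

X_train = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y_train = [0.0, 1.0, 2.0, 3.0, 4.0]
knn.fit(X_train, y_train)
y_star = knn.predict([[1.5]])   # weighted combination of the 4 nearest targets
```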
3.1.4. Support Vector Regression

Let X, y be training data. The objective of SVR is to find a function f such that the deviation of f(·) from the real values y is at most ε (Smola & Schölkopf, 2004). If the problem has no solution, slack variables ξ_i, ξ_i∗ are introduced to tolerate part of the error. First, let us consider the case where f is linear, i.e. f(x) = w·x + b. f is then the solution of the following optimization problem (Eq. (5)):

Minimize  (1/2) ||w||² + C Σ_{i=1..n} (ξ_i + ξ_i∗)
s.t.  y_i − ⟨w, x_i⟩ − b ≤ ε + ξ_i
      ⟨w, x_i⟩ + b − y_i ≤ ε + ξ_i∗
      ξ_i, ξ_i∗ ≥ 0   (5)

where C > 0 is the trade-off between the flatness of f and the amount of tolerated deviations larger than ε, and ⟨·,·⟩ is a scalar product. By using the dual representation of the problem based on Lagrange multipliers, we finally get f(x) = Σ_{i=1..n} (α_i − α_i∗) ⟨x_i, x⟩ + b, where α_i, α_i∗ are the Lagrange multipliers. If the adequate f is not linear, we can map the data into a high-dimensional space where the function f becomes linear (Fig. 4). Instead of searching for the expression of the mapping φ, a function k, called a kernel function, which satisfies k(x, x') = ⟨φ(x), φ(x')⟩, is used. The existence of such a function is guaranteed by Mercer's theorem. ε, C and the kernel function are the Support Vector Regression (SVR) hyperparameters that need to be selected properly for the performance of the algorithm.

Fig. 4. Mapping to the feature space in SVR.

4. Application and results

4.1. The context of the study
The incomplete dataset used in this paper is taken from water sample analyses made at Oursbelille, in the Adour plain, South-West of France, from 1991 to 2017. The operational principle of this drinking water collection point is described in Fig. 5. First, the water is pumped, its nitrate rate is measured, and it is conveyed to large aerial tanks in order to be treated by active charcoal. Then, the treated water is stored in a water tank. In a third step, some sensing devices are used to monitor some quality indicators, such as the pH. In a fourth step, on demand, the stored water is chlorinated before being conveyed to another underground well, a few kilometers away from the pumping well. From this second storage tank, water is distributed to the citizens of the Adour region. The region benefits from an oceanic climate, with a rainy winter and an average temperature ranging from 4 to 19 °C.

The acquired data contain 148 observations of 411 water quality indicators, with an overall missingness of 82%. Fig. 6a is an overview of the dataset, where some of the measured water quality indicators are displayed, while Fig. 6b summarizes the missingness distribution per variable in the dataset.

Only the variables that are measured at least 5 times are considered, which reduces the dataset to 257 variables (52% of the 411 variables). It is noted that the removed variables do not restrict the analysis, since they are not among the common parameters for water quality assessment found in the literature.
4.2. Settings and assumptions of the implementation

Based on the presentation of the three missingness patterns, and the nature of the studied dataset (as described in the previous subsection), we can assume that our study falls within the MAR pattern.
Fig. 5. Operational principle of the drinking water well of Oursbelille.
Fig. 6. Description of the dataset: (a) overview of the dataset; (b) number of missing values per variable.

Moreover, the proposed method depends on several factors: (a) the number of cycles to perform the imputations, (b) the number of values defined for each hyperparameter, (c) the size of the dataset, (d) the number of variables of interest, and (e) the complexity of the ML algorithm itself. For these reasons, in order to obtain relevant results in a reasonable running time, and in opposition to what is commonly used in the literature, only one value for the number of cycles (i.e. 10 cycles) is considered in this work.
The implementation of the proposed method was performed using the Python programming language, on a computer with the following main features:

• Operating System: Windows 10;
• Processor: Intel(R) Core(TM) i5-7200U CPU @ 2.50 GHz (2.70 GHz);
• RAM: 8.00 GB.

The corresponding results are described and discussed in the following.
4.3. Implementation of the proposed method

The main steps of the proposed method have been implemented according to the following explanations.

• Step 1. The first phase of the original MICE, that is mean imputation, is launched (initialization step).
• Step 2. The next step concerns the hyperparameter tuning of the Machine Learning algorithms. There is no analytical solution that allows to find the optimal values. Therefore, a cross-validation is performed using the modified dataset, and a mean squared error (MSE) is measured. The optimal hyperparameters are those that yield the lowest MSE. Note that only a limited number of candidate values have been taken into account, because adding more would drastically affect the algorithmic complexity.
The competing methods are MICE, MICE combined with RF (MICE-RF), MICE combined with BRT (MICE-BRT), MICE combined with KNN (MICE-KNN) and MICE combined with SVR (MICE-SVR).
The candidate values for the hyperparameters of the four Machine Learning algorithms (KNN, RF, BRT, SVR) are detailed in Table 3. For KNN, let us note that since the studied dataset contains variables with only five non-missing values, the number of neighbors is at most 4.
• Step 3. Phase (II) of the original MICE is modified by replacing Linear Regression with one of the competing algorithms, each with its optimal hyperparameters (as obtained in step 2).
• Step 4. The performance indicators, namely MSE and processing time, are computed for each algorithm.
• Step 5. The method that performed best in terms of MSE, with a reasonable computing time, is then selected.
• Step 6. Finally, the winning method is used to solve the missingness.
Table 3
Candidate values of the hyperparameters for each Machine Learning method.

Algorithm  Hyperparameter  Candidate values
RF         n_trees         {10, 15, 20, 50, 100}
BRT        m               {30, 50, 100, 150}
           ν               {0.01, 0.1, 0.5}
KNN        K               {2, 3, 4}
           d               {euclidean, manhattan}
           y∗              {uniform, weighted}
SVR        ε               {0.01, 0.1}
           C               {0.01, 0.1, 1, 10, 100}
           kernel          {rbf, poly, sigmoid}
           γ               {1e-3, 0.01, 0.1, 1}
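Step 2's search over such candidate values can be sketched with scikit-learn's `GridSearchCV`, shown here for the SVR row of Table 3 (a hedged illustration: the synthetic data stand in for the modified dataset of Section 3, and `neg_mean_squared_error` is scikit-learn's MSE-based scorer):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=80)

param_grid = {                       # candidate values from Table 3 (SVR row)
    "epsilon": [0.01, 0.1],
    "C": [0.01, 0.1, 1, 10, 100],
    "kernel": ["rbf", "poly", "sigmoid"],
    "gamma": [1e-3, 0.01, 0.1, 1],
}
search = GridSearchCV(SVR(), param_grid, scoring="neg_mean_squared_error", cv=3)
search.fit(X, y)
best = search.best_params_           # the combination with the lowest MSE
```

The grid size (here 2 × 5 × 3 × 4 = 120 combinations, each cross-validated) is why the paper restricts the number of candidate values per hyperparameter.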
The main results of this implementation are presented and discussed in the next subsection.

4.4. Results and discussion

In the following, only steps 2, 4 and 5, which contain the main results of the implementation, are presented.

• Step 2: Hyperparameter tuning.
Random Forest. In this algorithm, the performance increases with the number of trees. However, the algorithm rapidly becomes time consuming. The objective is to find the smallest value for which the performance is good enough. Although it is not the optimal value, the number of trees is set to 15 in order to reduce the computation time. Furthermore, the error does not decrease much between 15 and 100 estimators (see Table 4).

Table 4
Performance indicator (MSE) of the main RF hyperparameter.

n_trees  MSE
10       0.5159
15       0.4943
20       0.4850
50       0.4691
100      0.4653

Fig. 7. Variation of MSE to choose the hyperparameters in BRT.
Fig. 8. Variation of MSE to choose the hyperparameters in KNN: (a) choice of K and the linear combination; (b) choice of K and the distance.
Fig. 9. Variation of MSE to choose the hyperparameters in SVR: (a) choice of ε and the kernel function; (b) choice of C and ε.
BoostedRegression Trees. Similarly to the previous algorithm,the besttrade-off betweencomputingtimeandperformanceissought. Itisnotedthatthenumberoftreesishigher,becauseshallowtrees arebuiltinBRTinsteadoffullygrownonesinRF.
Fig. 7 represents the MSE as a function of the learning rate ν, where the labels represent the number of trees. According to these results, the optimal hyperparameters for this study are ν = 0.01 and m = 150. For the sake of computation time, hyperparameters with a slightly higher mean squared error (a difference of only 0.001) are chosen: ν = 0.1 and m = 30.
K-Nearest Neighbors. For this algorithm, the hyperparameters to tune are the number of neighbors K, the distance d and the method y∗ used to linearly combine the neighbors' values. In this study, the Euclidean distance is chosen, K = 4, and y∗ is the weighted mean of the K nearest neighbors. This choice is illustrated in Fig. 8: the MSE score is lower for these values.
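In scikit-learn terms, the retained BRT and KNN configurations could be instantiated as below. This is a sketch of a plausible mapping, not the paper's code; in particular, `weights="distance"` is assumed to correspond to the weighted mean y∗, and `GradientBoostingRegressor` is taken as the BRT implementation.

```python
# Retained BRT setting: learning rate nu = 0.1, m = 30 shallow trees.
from sklearn.ensemble import GradientBoostingRegressor
# Retained KNN setting: K = 4, Euclidean distance, weighted mean of neighbors.
from sklearn.neighbors import KNeighborsRegressor

brt = GradientBoostingRegressor(learning_rate=0.1, n_estimators=30)
knn = KNeighborsRegressor(n_neighbors=4, metric="euclidean",
                          weights="distance")
```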
Support Vector Regression. For this algorithm, ε, C, the kernel function and the parameter γ associated with the kernel function need to be tuned. In Fig. 9, it can be seen that the MSE is generally lowest for the polynomial kernel and for ε = 0.1. The lowest MSE score is obtained with ε = 0.01, C = 1, kernel = poly and the associated γ = 0.01.

Table 5
Performance indicator scores.

                     MICE     MICE-SVR  MICE-4NN  MICE-RF  MICE-BRT
Processing time (s)  6.87     5.29      8.25      65.18    32.59
MSE                  1.09e24  0.44      0.58      0.55     0.54

• Step 4: Computing the performance indicators.
The results summarized in Table 5 show that MICE-SVR is the best performing method regarding both processing time (5.29 seconds) and MSE (0.44).
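The MICE-SVR hybridization can be sketched with scikit-learn's experimental `IterativeImputer`, a MICE-inspired chained-equations imputer that accepts an arbitrary regressor; this is an illustrative stand-in, not necessarily the implementation used in the study. The SVR hyperparameters are those retained above, and `max_iter=10` mirrors the 10 imputation cycles; the toy matrix is invented for the example.

```python
# Chained-equation (MICE-style) imputation with SVR as the per-variable
# regressor: each variable with missing entries is regressed on the others,
# cycling until max_iter rounds are done.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.svm import SVR

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 6.0, 9.0],
              [7.0, 8.0, 12.0]])

mice_svr = IterativeImputer(
    estimator=SVR(kernel="poly", C=1, epsilon=0.01, gamma=0.01),
    max_iter=10, random_state=0)
X_filled = mice_svr.fit_transform(X)  # no NaN remains after imputation
```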
The processing time is significantly high when combining MICE with RF and BRT. Indeed, all three methods (MICE, RF and BRT) are already computationally expensive by themselves. With the number of estimators set to 15 for Random Forest, the number of cycles set to 10 for MICE and 251 variables to fill, MICE-RF computes 15 × 10 × 251 = 37,650 fully grown regression trees. Similarly, MICE-BRT computes 43,500 shallow regression trees. MICE performed the worst in terms of MSE in the current implementation of the algorithm. Indeed, all the variables were used as predictors in the regression, whereas an intermediate variable selection step would have been appropriate. This result also indicates that the relationships between the variables are not linear.
MICE-KNN performs slightly worse than the other combinations of MICE with Machine Learning algorithms. This is due to the fact that the most similar observations are logically those that are closer in time. However, these values are not systematically filled, and the closest neighbors are only searched among non-missing observations for a given variable.
• Step 5: Selection of the most performing method.
MICE-SVR performed best on both criteria; it is therefore the best performing competing method in this particular case. The proposed methodology can handle datasets with a high missingness rate and is also suitable for high-dimensional data. It is a flexible method that can take into account complex non-linear relationships between variables (provided the competing methods are non-linear). It makes it possible to automate the selection of the best method to handle missingness, which reduces the workload of the data analyst, who can then focus on tasks with higher added value, aiming at extracting knowledge. However, a few limitations are worth noting, particularly the number of cycles preset to 10 and the relatively small number of candidate hyperparameter values (which does not allow a rigorous sensitivity analysis of these hyperparameters). Furthermore, these parameters are tuned on an artificial dataset constructed by modifying the real one. All these limitations are mainly due to algorithmic complexity, which constitutes both a challenge and an important scientific issue.
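The selection step above can be sketched as a simple decision rule: score every candidate on the artificial dataset and keep the one with the lowest MSE, breaking ties by processing time. The scores below are the Table 5 values, reproduced only to illustrate the rule:

```python
# Performance indicators per competing method (values from Table 5).
scores = {
    "MICE":     {"time": 6.87,  "mse": 1.09e24},
    "MICE-SVR": {"time": 5.29,  "mse": 0.44},
    "MICE-4NN": {"time": 8.25,  "mse": 0.58},
    "MICE-RF":  {"time": 65.18, "mse": 0.55},
    "MICE-BRT": {"time": 32.59, "mse": 0.54},
}

# Lowest MSE first, processing time as tie-breaker.
best = min(scores, key=lambda m: (scores[m]["mse"], scores[m]["time"]))
# → "MICE-SVR", matching the study's conclusion.
```

A fuzzy aggregation of the two indicators into a single score, as proposed in the perspectives, would replace this lexicographic tie-breaking.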
5. Conclusion and perspectives

It is widely acknowledged that data-driven methods provide powerful algorithms to analyze any issue of interest to decision-makers. However, performing such analyses with incomplete data may not lead to reliable decisions. In this paper, a methodology for selecting the best algorithm to address the issue of data imputation, in the context of water quality assessment, has been proposed. A benchmark of four of the most powerful and commonly used ML algorithms (Random Forest, Boosted Regression Trees, K-Nearest Neighbors and Support Vector Regression) has been performed for that purpose. The results showed that MICE-SVR is the best in that it converges faster than the three others and provides the best performance (notably in terms of average prediction error). It can then be applied to high missingness datasets, including data for water quality assessment, which are often incomplete, as in the case of the Adour (south-west of France) considered in the present study.
Based on the weaknesses of the proposed method, as mentioned in the discussion of the results, the following improvements are planned for further studies: (1) further automate the model selection mechanism by setting fuzzy rules in an inference engine that will aggregate all the performance indicators into a single indicator; (2) improve, for each competing method, the optimal choice of the hyperparameters using evolutionary algorithms, in order to speed up the computing time and increase the number of values tested for each hyperparameter; (3) automate the choice of the number of cycles needed for the convergence of the imputations by taking into account the size of the data and its missingness rate; (4) introduce the temporal dimension within the imputation process.
Conflict of interest

There is no conflict of interest.
CRediT authorship contribution statement

Romy Ratolojanahary: Conceptualization, Methodology, Software, Formal analysis, Investigation, Writing - original draft. Raymond Houé Ngouna: Conceptualization, Methodology, Investigation, Writing - original draft, Writing - review & editing, Supervision. Kamal Medjaher: Conceptualization, Methodology, Investigation, Writing - original draft, Writing - review & editing, Supervision, Funding acquisition. Jean Junca-Bourié: Investigation, Resources, Supervision, Funding acquisition. Fabien Dauriac: Investigation, Resources, Funding acquisition. Mathieu Sebilo: Investigation, Writing - review & editing, Supervision.