HAL Id: hal-02134695
https://hal.archives-ouvertes.fr/hal-02134695
Submitted on 20 May 2019
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Model selection to improve multiple imputation for handling high rate missingness in a water quality dataset
Romy Ratolojanahary, Raymond Houé Ngouna, Kamal Medjaher, Jean Junca-Bourié, Fabien Dauriac, Mathieu Sebilo
To cite this version:
Romy Ratolojanahary, Raymond Houé Ngouna, Kamal Medjaher, Jean Junca-Bourié, Fabien Dauriac, et al. Model selection to improve multiple imputation for handling high rate missingness in a water quality dataset. Expert Systems with Applications, Elsevier, 2019, pp.299-307. 10.1016/j.eswa.2019.04.049. hal-02134695.
Open Archive Toulouse Archive Ouverte (OATAO)
OATAO is an open access repository that collects the work of some Toulouse researchers and makes it freely available over the web where possible.
This is an author's version published in: https://oatao.univ-toulouse.fr/23807
Official URL: https://doi.org/10.1016/j.eswa.2019.04.049
To cite this version: Ratolojanahary, Romy and Houé Ngouna, Raymond and Medjaher, Kamal and Junca-Bourié, Jean and Dauriac, Fabien and Sebilo, Mathieu. Model selection to improve multiple imputation for handling high rate missingness in a water quality dataset. (2019) Expert Systems with Applications (131). 299-307. ISSN 0957-4174
Any correspondence concerning this service should be sent to the repository administrator: tech-oatao@listes-diff.inp-toulouse.fr
Model selection to improve multiple imputation for handling high rate missingness in a water quality dataset
Romy Ratolojanahary a,∗, Raymond Houé Ngouna a, Kamal Medjaher a, Jean Junca-Bourié b, Fabien Dauriac c, Mathieu Sebilo d
a Laboratoire Génie de Production, École Nationale d’Ingénieurs de Tarbes, BP1629, 47 avenue d’Azereix, Tarbes Cedex 16 65016, France
b Agence de l’eau Adour-Garonne, Tarbes, 7 Passage de l’Europe, Pau 64000, France
c Chambre d’Agriculture des Hautes-Pyrénées, 20 Place du Foirail, Tarbes 65000, France
d IEES, Université Pierre et Marie Curie, 4 Place Jussieu, Paris 75005, France
Article info
Keywords: Multiple imputation; High missingness; Model selection; Machine learning; Data preprocessing; Water quality

Abstract
In the current era of “information everywhere”, extracting knowledge from a great amount of data is increasingly acknowledged as a promising channel for providing relevant insights to decision makers. One key issue encountered may be the poor quality of the raw data, particularly due to the high missingness, that may affect the quality and the relevance of the results’ interpretation. Automating the exploration of the underlying data with powerful methods, allowing to handle missingness and then perform a learning process to discover relevant knowledge, can then be considered as a successful strategy for systems’ monitoring. Within the context of water quality analysis, the aim of the present study is to propose a robust method for selecting the best algorithm to combine with MICE (Multivariate Imputations by Chained Equations) in order to handle multiple relationships between a high amount of features of interest (more than 200) concerned with a high rate of missingness (more than 80%). The main contribution is to improve MICE, taking advantage of the ability of Machine Learning algorithms to address complex relationships among a large number of parameters. The competing methods that are implemented are Random Forest (RF), Boosted Regression Trees (BRT), K-Nearest Neighbors (KNN) and Support Vector Regression (SVR). The obtained results show that the hybridization of MICE with SVR, KNN, RF and BRT performs better than the original MICE taken alone. Furthermore, MICE-SVR gives a good trade-off in terms of performance and computing time.
1. Introduction
The proliferation of sensing devices has increased the ability of organizations to acquire large and varied amounts of data, allowing them to implement real-time monitoring of their systems. This is generally based on the analysis of complex relationships between several factors of interest, such as in water quality analysis. Online monitoring has indeed enabled the development of decision systems that are able to accelerate decision-making and anticipate actions to prevent undesired events or to eradicate critical issues. To achieve such a goal, it is required to pre-process the raw data, especially when some values are missing to a certain extent.
∗ Corresponding author.
E-mail addresses: romy-alinoro.ratolojanahary@enit.fr (R. Ratolojanahary), raymond.houe-ngouna@enit.fr (R. Houé Ngouna), kamal.medjaher@enit.fr (K. Medjaher), jean.junca-bourie@eau-adour-garonne.fr (J. Junca-Bourié), fdauriac@hautes-pyrenees.chambagri.fr (F. Dauriac), mathieu.sebilo@upmc.fr (M. Sebilo).

Missing data is a recurring phenomenon in real-world applications (Sterne et al., 2009; Yang, Liu, Zeng, & Xie, 2019). It may occur due to sensor failures, bad or non-existing strategy for data acquisition, budget issues, lack of response from a participant in the case of a survey, or various other reasons. If the complete data are representative of the studied phenomenon, this missing information is negligible, otherwise the results may be incorrect and may lead to wrong interpretations. For example, anomalies could go undetected if they happen during a non-monitored period of time.

There are two ways of dealing with missing data: deletion or imputation (Buhi, 2008). Deletion means discarding the observations or the variables with missing data, which is called complete-case analysis, while imputation consists in reconstructing the missing values. Because of its simplicity, deletion is usually the default method used in practice. However, there are many cases in various fields in which this method showed some limitations. Indeed, it decreases the sample size and may lead to a loss of substantial information. In Clark and Altman (2003) for instance, the number of observations dropped from 1189 to 518 (43% of the original data) in an ovarian cancer dataset, which led to biased interpretation.
Another deletion method is pairwise deletion, through which only non-missing values are used for analyses, for instance in the calculation of correlation scores, where the method fails when the two correlated variables are not filled at the same time. Instead of discarding an observation or a variable concerned with missing values, it is preferable to estimate those missing values accurately in order to provide relevant interpretations.

Quoting White and co-authors, “awareness has grown of the need to go beyond complete-analysis” and some major improvements of the simplistic methods have been proposed in the literature, since Rubin’s innovative proposal for approaching missingness (White, Royston, & Wood, 2010). Among others, Rubin, who is the author of Multiple Imputation (MI), defined a conceptual framework for characterizing missing data that makes it possible to distinguish various types and to determine when missing data can be ignored (Little & Rubin, 1987; Rubin, 1976). The major insight of the proposed imputation method is that it addresses uncertainty and complexity of the data structure, making it possible to go beyond deleting or discarding data.

Following Rubin, van Buuren introduced the Multiple Imputations by Chained Equations (MICE), a MI technique that requires fewer assumptions on missingness and also handles relationships between variables (van Buuren & Groothuis-Oudshoorn, 2011). However, original MICE considers only linear relationships and has been successfully applied to datasets with at most 70% of missingness. It may therefore fail in other cases such as in water quality data as considered in the present study, which are characterized by a high rate of missingness and a great amount of factors of interest that are not necessarily linearly related. This suggests the need for an alternative method to improve the imputation mechanism in order to provide relevant interpretation of the results, which is the purpose of this work.

The rest of the paper is organized as follows: the main imputation methods available in the literature are reviewed in Section 2, followed by the presentation of a method to improve MICE for multiple data imputation in Section 3. An application of the proposed method to an experimental dataset, along with associated results, is described in Section 4, while the last section contains the conclusion and perspectives of the present work.
2. Related work

In order to choose an appropriate method for handling missing data, the underlying cause of the missingness has to be investigated. Indeed, as mentioned in Buhi (2008), each method only works under certain assumptions, namely complete randomness, conditional randomness or systematic reasons.
2.1. Missingness patterns

The conceptual framework allowing to take into account certain assumptions, as noted above, has been defined by Rubin (1976). There are three types of missing data, depending on the missing mechanism: (1) Missing Completely at Random (MCAR), (2) Missing at Random (MAR) and (3) Missing Not at Random (MNAR).

Let $R$ be the locations of the missing data in a dataset $X = (X_{obs}, X_{miss})$, and $\psi$ the parameters of the missing data model, where $X_{obs}$ and $X_{miss}$ are respectively the observed and the missing values. MCAR, MAR and MNAR patterns are formally defined as follows (van Buuren, 2018):

• Data are MCAR if the probability of missingness is independent of both the observed variables and the variables with missing values. This is the case, for example, when people forget to answer a question in a survey. Formally,

$$P(R = 0 \mid X_{obs}, X_{miss}, \psi) = P(R = 0 \mid \psi) \qquad (1)$$

• Data are MAR if the probability of missingness is due entirely to the observed variables and is independent of the unseen data. In other words, the missingness is a function of some other observed variables in the dataset (for example, people of one sex are less likely to disclose their weight):

$$P(R = 0 \mid X_{obs}, X_{miss}, \psi) = P(R = 0 \mid X_{obs}, \psi) \qquad (2)$$

Therefore, MAR data are a good candidate for data imputation based on observed variables (Buhi, 2008).

• Data are MNAR if the missing value is related to the actual values (for example, people who weigh more are more likely not to disclose their weight):

$$P(R = 0 \mid X_{obs}, X_{miss}, \psi) \qquad (3)$$

depends on all three elements.

When data are MNAR, the missingness process is called non-ignorable, meaning that the cause of the missingness must be included in the model, whereas MAR and MCAR data missingness processes are called ignorable. Following the assumptions behind these three patterns, several methods have been provided in the literature for handling the missingness appropriately.
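The three mechanisms can be made concrete with a small simulation. The following is a minimal sketch, not taken from the paper: the function names, the threshold rule and the two-level drop probabilities are illustrative assumptions chosen to mirror the survey examples above (weight as the incomplete variable, a fully observed covariate for MAR).

```python
import random

# True marks a missing entry (R = 0 in the notation of this section).

def mcar_mask(values, rate, rng):
    """MCAR: every entry is dropped with the same probability `rate`."""
    return [rng.random() < rate for _ in values]

def mar_mask(observed_covariate, rate_low, rate_high, threshold, rng):
    """MAR: the drop probability depends only on a fully observed covariate."""
    return [rng.random() < (rate_high if c > threshold else rate_low)
            for c in observed_covariate]

def mnar_mask(values, rate_low, rate_high, threshold, rng):
    """MNAR: the drop probability depends on the very value that goes missing."""
    return [rng.random() < (rate_high if v > threshold else rate_low)
            for v in values]

rng = random.Random(42)
weights = [rng.gauss(70, 12) for _ in range(1000)]      # variable that may go missing
group = [rng.randint(0, 1) for _ in range(1000)]        # fully observed covariate

masks = {
    "MCAR": mcar_mask(weights, 0.3, rng),
    "MAR": mar_mask(group, 0.1, 0.5, 0.5, rng),         # group 1 hides weight more often
    "MNAR": mnar_mask(weights, 0.1, 0.5, 80.0, rng),    # heavy weights are hidden more often
}
missing_rate = {name: sum(mask) / len(mask) for name, mask in masks.items()}
```

Under the MNAR mask the observed weights are biased downwards, which is why such a mechanism cannot be ignored, whereas under MAR the bias can be corrected from the observed covariate.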
2.2. Single imputation methods

Methods that compute one single value per missing data are referred to as single imputation methods. The most common single imputation methods are mean, median or mode imputation, consisting in replacing the missing value with the mean, median or mode of the associated variable (Buhi, 2008). In this case, the missing value is easy to compute, but the method ignores the correlation among the variables and underestimates the standard deviation. If the variable containing missing values is categorical, a simple option is to create a new category for the missing values. This method is suitable for MNAR data, i.e. when the missingness is correlated to the values of the missing data. When a variable of the incomplete dataset is a periodic time series, a more elaborate single imputation technique is to apply a linear interpolation or an Autoregressive Integrated Moving Average (ARIMA) model to fill in the missing values (Shao, Meng, & Sun, 2016). Although those two techniques are simple, the first one is not efficient when the missing gap is large, and the second one requires a periodic time series. Another technique involves predicting the values from the observed variables. For example, K-nearest neighbors (KNN) replaces the missing value with a linear combination of the K nearest non-missing observations (Jordanov, Petrov, & Petrozziello, 2018; Tutz & Ramzan, 2015). To use this algorithm, it is necessary to choose the optimal K and define a distance measurement between two observations. A local similarity imputation based on Fast Clustering was proposed in Zhao, Chen, Yang, Hu, and Obaidat (2018). The authors partition the incomplete data with a fast clustering method (Stacked Autoencoder-based), then fill the missing data within each cluster using a KNN algorithm. The obtained results showed that the proposed method outperformed other local similarity-based methods. Shao and co-authors applied two Single Layer Feed Forward Neural Networks (Extreme Learning Machine and Radial Basis Function Network) on a periodic soil moisture time series (Shao et al., 2016). This method produced better predictions than linear interpolation and ARIMA in infilling missing segments. However, it requires parameter tuning in order to perform well.
Table 1
Advantages and drawbacks of the reported single imputation methods.

| Method | Advantages | Drawbacks |
|---|---|---|
| Mean | Easy to implement | Underestimates standard deviation; ignores relationships between variables |
| Add a category | Easy to implement | Only works with categorical and MNAR data |
| Linear interpolation | Takes time into account | Does not work when the missing gap is large |
| ARIMA | Takes time into account | Requires a periodic time series |
| Linear regression | Takes into account relationships between variables | Underestimates the variance; ignores non-linear relationships between variables |
| Stochastic linear regression | Takes into account relationships between variables | Ignores non-linear relationships between variables |
| KNN | Takes into account relationships between variables | Requires parameter tuning |
| ANN | Takes into account the time factor | Requires parameter tuning |
Fig. 1. Overview of the multiple imputation method.
A brief summary of these implementations of single imputation methods is presented in Table 1, which provides the main drawbacks and advantages. A well-known limitation that they have in common is that once a missing value is imputed, it is treated as a non-missing value.
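The underestimation of the standard deviation by mean imputation can be checked on a toy column. This is an illustrative sketch with made-up values (`None` marks a missing entry); it is not data from the study.

```python
import statistics

def mean_impute(column):
    """Replace every missing entry (None) with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in column]

column = [2.0, None, 4.0, None, 6.0, 8.0, None]
imputed = mean_impute(column)

# Spread of the observed values versus spread after imputation.
observed_std = statistics.stdev([v for v in column if v is not None])
imputed_std = statistics.stdev(imputed)
```

Every imputed entry sits exactly at the mean, so the sum of squared deviations is unchanged while the sample size grows: the imputed column looks less dispersed than the observed data, and the imputed values are then indistinguishable from measured ones.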
2.3. Multiple imputation methods

In order to solve the limitations of single imputation, some authors have proposed to take into account the uncertainty of the imputed values (Little & Rubin, 1987; Neter, Maynes, & Ramanathan, 1965). For that purpose, Rubin has developed the Multiple Imputation (MI) method, which combines several single imputations (Little & Rubin, 1987), as described in the following.

2.3.1. Principles of multiple imputation
The principles of MI are illustrated in Fig. 1, based on the following main steps: (1) imputation phase where m datasets are produced by drawing them from a distribution, which can be different for each variable (van Buuren, 2018), (2) analysis phase in which the m datasets are analyzed, and (3) pooling phase that combines the m datasets to produce a final result, for example by calculating the mean of the imputed values for each missing value. The m datasets can be generated in parallel using parametric statistical theory and assuming a joint model for all the variables (van Buuren, 2007; Rubin & Schafer, 1990), such as in Multiple imputAtions of incoMplEte muLtIvariate dAta (AMELIA), which uses expectation-maximization with a bootstrapping algorithm (Honaker, King, & Blackwell, 2011). Such an approach lacks flexibility and may lead to bias (van Buuren, 2007). The other alternative is to generate the m datasets until a stop criterion is met: in Hong and Wu (2011) for instance, the authors iteratively used association rules to successfully estimate the missing values. Although the studied dataset had a high missing rate, it was relatively small (there were only three variables). Some other examples of the sequential methods are Sequential Imputation for Missing Value (IMPSEQ) (Betrie, Sadiq, Tesfamariam, & Morin, 2014), a covariance-based imputation method, and MICE, a series of linear regressions that consider a different distribution for each variable (van Buuren, 2007; Raghunathan, Lepkowski, Hoewyk, & Solenberger, 2001). Betrie and co-authors have found that the two sequential methods outperform AMELIA (Betrie et al., 2014). In Stekhoven and Buhlmann (2011), the authors introduced a MI method called MissForest, which is similar to MICE, except that it uses Random Forest instead of Linear Regression in the imputation step. As MissForest yielded a better performance than MICE, that result is encouraging towards tweaking the MICE algorithm, which is the object of the present work. A brief summary of the advantages and drawbacks of the methods presented above is given in Table 2, while the original MICE principles are described in the following.

[Fig. 1 content: an incomplete dataset (observations obs1, obs2, obs3, ...; variables V1, V2, ...; missing and filled entries) goes through multiple imputations, where V_ij^(k) denotes the data imputed during the k-th imputation process; the imputed datasets are then analysed and pooled, giving an aggregated value V_ij for each imputed entry and the pooled results.]

Fig. 2. Overview of the MICE algorithm.

Table 2
Advantages and drawbacks of the reported MI methods.

| Method | Advantages | Drawbacks |
|---|---|---|
| AMELIA | Can be applied to categorical, ordinal or continuous data | Assumes a joint model for all the variables |
| MI using decision rules | Works well when the missing-value rate is high | Not adapted to data with a large number of variables |
| IMPSEQ | Time complexity | Lack of robustness toward outliers; does not take into account non-linear relationships between variables |
| MICE | Flexibility | Does not take into account non-linear relationships between variables; theoretical justification needed |
| MissForest | Adapted to high dimensional datasets; takes into account linear relationships between variables | Computation time issue |
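The three phases of MI (imputation, analysis, pooling) can be sketched on a single column. This is a simplified illustration under stated assumptions: each missing entry is drawn from a normal distribution fitted to the observed values, which is only one possible choice of imputation distribution, and the analysed quantity is simply the column mean. The function name and the example values are hypothetical.

```python
import random
import statistics

def multiple_imputation(column, m, rng):
    """Return (pooled column, pooled mean estimate, between-imputation variance)."""
    observed = [v for v in column if v is not None]
    mu, sigma = statistics.mean(observed), statistics.stdev(observed)

    # (1) Imputation phase: m completed copies, each missing entry drawn at random.
    completed = [[rng.gauss(mu, sigma) if v is None else v for v in column]
                 for _ in range(m)]

    # (2) Analysis phase: the same analysis is run on each completed dataset.
    estimates = [statistics.mean(c) for c in completed]

    # (3) Pooling phase: average the analyses, and aggregate each imputed entry.
    pooled_estimate = statistics.mean(estimates)
    pooled_column = [statistics.mean([c[i] for c in completed])
                     for i in range(len(column))]
    between_variance = statistics.variance(estimates) if m > 1 else 0.0
    return pooled_column, pooled_estimate, between_variance

column = [2.0, None, 4.0, None, 6.0, 8.0]
pooled_column, pooled_estimate, between_variance = multiple_imputation(
    column, m=20, rng=random.Random(1))
```

Unlike a single imputation, the spread of the m estimates (`between_variance`) keeps a trace of the uncertainty attached to the imputed values.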
2.3.2. Main principles of MICE

The main steps of MICE are summarized in Fig. 2 and detailed in Algorithm 1. The MICE algorithm implementation was based on a method described in Azur, Stuart, Frangakis, and Leaf (2011). It assumes that missing data are of MAR type. The first step is to initialize the missing values to the mean of each column. Then the missing values of the first variable are reset to “missing”. After that, a regression model is fitted on the subset of the dataset where the value of this variable is present. Finally, the obtained model is used to fill in the value and update the dataset. This process is repeated for each variable until all the missing data are estimated. The whole process, first step excluded, is reiterated n_cycles times until the estimated data converge. In the literature, it is advised to increase the number of cycles as a function of the size of the dataset and the missingness ratio (Graham, Olchowski, & Gilreath, 2007). Although MICE has been proved efficient in the literature, the trade-off between computational cost and performance becomes imbalanced when dealing with large datasets and/or datasets with a high missingness rate. Indeed, the number of imputed datasets has to be increased, and so does the computational time. Furthermore, a high missingness rate implies high uncertainty. Another key issue is that this form of the algorithm is based on linear regression, which may not reflect the actual relationships between the variables of the current study. To address these issues, an improved version of MICE is proposed and described in the following.
3. The proposed method to improve MICE

As noted above, the dataset concerned with water quality considered in this study has a very high missing rate (82%). Besides, there is a great amount of variables (more than 200), each of which is concerned with at least one missing value. The methods mentioned above, including the most performing, have been applied in a less constrained context and therefore can fail to provide good results in the specific case of the dataset considered in this paper. It is then proposed to take advantage of the ability of Machine Learning algorithms for handling such issues in order to improve MICE. The two main ideas are: (1) define a set of competing methods, and then (2) replace the Linear Regression in the original MICE by each of these methods in order to select the most performing that fits the context of the present study.

The competing methods have been chosen among the most performing supervised learning algorithms in the literature, namely Random Forest (RF), Boosted Regression Trees (BRT), and Support Vector Regression (SVR). Besides, K-Nearest Neighbors (KNN), which is commonly used to solve missingness, has also been selected.

The main steps of the proposed method are illustrated in Fig. 3.
1. The first phase of the original MICE is initialized (step 1).
2. A competing method is then chosen, followed by a mechanism for optimally setting its hyperparameters (step 2).
3. Next, phase (II) of the original MICE is modified by replacing Linear Regression with the chosen method, and then launched in a loop that goes a number of times corresponding to the predefined number of cycles (step 3).
[Fig. 2 content: (I) Initialization of missing values: each missing entry of a variable V_j is filled with the column mean m_j = mean(V_j). (II) Sequential prediction of missing values: each variable in turn is regressed on the others by linear regression (V_1 ~ (V_2, ...), then V_2 ~ (V_1, ...)) and its missing entries are replaced by the predicted values; phase (II) is repeated until a maximum number of cycles.]
Fig. 3. The proposed method for model selection to improve MICE.
Algorithm 1 MICE.
Input:
• X: incomplete data matrix of size n_obs × n_features
• n_cycles: number of cycles
Output:
• X_full: completed data matrix of size n_obs × n_features

X_full := mean_impute(X)
for i := 1 to n_cycles do
    for j := 1 to n_features do
        y_j := j-th column of X_full
        X^(j) := X_full without its j-th column
        m := {k ∈ {1, ..., n_obs} | X[k, j] != NaN}    /* m denotes the indices where X_j is not missing */
        regressor := linear_regressor()
        regressor.fit(X^(j)[m], y_j[m])    /* the model is fitted on the subset of the dataset where X_j is not missing */
        y_j[¬m] := regressor.predict(X^(j)[¬m])    /* ¬m denotes the indices where X_j is missing */
        X_full[¬m, j] := y_j[¬m]    /* the dataset is updated with the predicted values */
    end for
end for
return X_full
4. After convergence, performance indicators for the current method are computed (step 4).
5. When all the competing methods have been processed according to the four previous steps, a selection mechanism takes place by comparing their performance indicators (step 5).
6. Finally, the best method is applied to solve the missingness (step 6).
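Algorithm 1 and the substitution described in step 3 can be sketched as a chained-equations loop with a pluggable regressor. This is a minimal pure-Python illustration, not the authors' implementation: `LinearRegressor`, `solve`, `mice` and the `make_regressor` argument are names introduced here, missing entries are encoded as `None`, and the small ridge term is an assumption added only to keep the normal equations solvable.

```python
def solve(a, b):
    """Solve the linear system a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


class LinearRegressor:
    """Least-squares regressor with intercept (tiny ridge term: assumption of this sketch)."""

    def __init__(self, ridge=1e-8):
        self.ridge = ridge

    def fit(self, rows, targets):
        design = [[1.0] + list(r) for r in rows]
        p = len(design[0])
        gram = [[sum(d[u] * d[v] for d in design) + (self.ridge if u == v else 0.0)
                 for v in range(p)] for u in range(p)]
        rhs = [sum(d[u] * t for d, t in zip(design, targets)) for u in range(p)]
        self.coef = solve(gram, rhs)
        return self

    def predict(self, rows):
        return [self.coef[0] + sum(c * v for c, v in zip(self.coef[1:], r)) for r in rows]


def without_column(row, j):
    """Return the row with its j-th entry removed (the predictors of variable j)."""
    return row[:j] + row[j + 1:]


def mice(data, n_cycles=10, make_regressor=LinearRegressor):
    """Chained-equations imputation; `make_regressor` builds any object with fit/predict."""
    n_obs, n_features = len(data), len(data[0])
    missing = [[v is None for v in row] for row in data]
    full = [row[:] for row in data]
    for j in range(n_features):                       # phase (I): mean initialization
        observed = [row[j] for row in data if row[j] is not None]
        mean_j = sum(observed) / len(observed)
        for i in range(n_obs):
            if missing[i][j]:
                full[i][j] = mean_j
    for _ in range(n_cycles):                         # phase (II): sequential prediction
        for j in range(n_features):
            miss_idx = [i for i in range(n_obs) if missing[i][j]]
            if not miss_idx:
                continue
            obs_idx = [i for i in range(n_obs) if not missing[i][j]]
            model = make_regressor().fit([without_column(full[i], j) for i in obs_idx],
                                         [full[i][j] for i in obs_idx])
            predictions = model.predict([without_column(full[i], j) for i in miss_idx])
            for i, value in zip(miss_idx, predictions):
                full[i][j] = value                    # update the dataset
    return full


# Toy data in which the third variable equals x1 - 2*x2 + 3.
toy = [[1.0, 0.0, 4.0], [0.0, 1.0, 1.0], [2.0, 2.0, 1.0], [3.0, 1.0, 4.0], [1.0, 3.0, None]]
completed = mice(toy, n_cycles=3)
```

Passing a different factory as `make_regressor` (for instance one returning an RF, BRT, KNN or SVR model exposing `fit` and `predict`) is the only change needed to obtain the hybrid variants compared in this study.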
Due to the high missingness rate, the optimal choice of the hyperparameters (as considered in step 2) is based on a modified version of the studied dataset constructed according to the following procedure:

• For each variable, a triangular distribution is simulated with different parameters (min, mode, max). If a variable always has the same value, then that value is replicated in each observation. The triangular distribution has been used because it provides a simple representation of the real distribution of the dataset and allows more flexibility by taking into account the uncertainty of the values.
• The data are scaled so that the units of the variables do not play any role.
• The observations are shuffled and the missingness distribution of the real dataset is reproduced in order to mimic the real problem as accurately as possible.
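The construction of the modified dataset can be sketched as follows. Several details are assumptions of this illustration rather than statements of the paper: the (min, mode, max) triples are taken as given inputs, the scaling is a z-score, and the missingness distribution is reproduced by copying the boolean mask of the real dataset position by position. The function name `simulate_dataset` is hypothetical.

```python
import random
import statistics

def simulate_dataset(params, n_obs, missing_mask, seed=0):
    """params: one (low, mode, high) triple per variable.
    missing_mask: n_obs x n_features booleans (True = missing in the real dataset).
    Returns the complete simulated data and its masked counterpart."""
    rng = random.Random(seed)
    columns = []
    for low, mode, high in params:
        if low == high:                                   # constant variable: replicate it
            col = [float(low)] * n_obs
        else:
            col = [rng.triangular(low, high, mode) for _ in range(n_obs)]
        mu, sd = statistics.mean(col), statistics.pstdev(col)
        columns.append([(v - mu) / sd if sd > 0 else 0.0 for v in col])  # scaling
    rows = [list(r) for r in zip(*columns)]
    rng.shuffle(rows)                                     # shuffle the observations
    complete = [row[:] for row in rows]
    incomplete = [[None if missing_mask[i][j] else v for j, v in enumerate(row)]
                  for i, row in enumerate(rows)]
    return complete, incomplete

mask = [[False, False], [True, False], [False, True],
        [False, False], [True, False], [False, False]]
complete, incomplete = simulate_dataset([(0.0, 1.0, 4.0), (5.0, 5.0, 5.0)], 6, mask)
```

Because `complete` holds the true values behind every masked entry, the error of an imputation method on `incomplete` can be measured exactly, which is what makes hyperparameter tuning possible despite the missingness of the real data.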
Two main performance indicators have been used for the comparison (as realized in step 5), namely processing time and Mean Squared Error (MSE).

3.1. Theoretical background of the competing methods

3.1.1. Random Forest

Random Forest is an ensemble method based on fully grown regression trees. The objective is to build several weak learners (the regression trees) in parallel in order to produce a strong regressor. The main steps are as follows:
1. The observations are sampled with replacement (bootstrap aggregating).
2. A set of variables is selected randomly.
3. The tree is built upon the observations from step (1) and the variables from step (2).
4. The final prediction is made by averaging over the predictions of all decision trees.

In this algorithm, one of the most relevant hyperparameters to set in order to make the model perform well is the number of trees.
3.1.2. Boosted Regression Trees

Similarly to the Random Forest algorithm, BRT is an ensemble method based on regression trees. Gradient boosting is used to train the weak learners (shallow regression trees) sequentially. In this algorithm, a higher focus is set on observations that have higher errors on the previous tree and a gradient descent is used to minimize the loss function (least squared errors) at each step.
[Fig. 3 content, box labels: raw data preprocessing; supervised learning methods trials; set of selected supervised methods; choose a method and optimize its hyperparameters; modified MICE; compute performance indicators; best learning method selection; solve the missingness.]

Let $y_i$ be the target value and $f(x_i)$ its predictor. The objective function is given in Eq. (4):

$$L(y, f) = \sum_{i=1}^{n} l(y_i, f(x_i)) \qquad (4)$$

where $l(y_i, f(x_i)) := (y_i - f(x_i))^2$. The algorithm goes as follows:

$f_0$ is the trivial tree: it returns the mean value of $Y$.
For $k := 1$ to $m$:
• Calculate the negative gradient $-\partial l(y_i, f_{k-1}(x_i)) / \partial f_{k-1}(x_i)$, which corresponds (up to a constant factor) to the residual $y_i - f_{k-1}(x_i)$, for $i = 1$ to $n$.
• Fit a regression tree $h_k$ for the residuals.
• Create a model $f_k = f_{k-1} + \nu \gamma_k h_k$, where $\gamma_k$ is the step magnitude, found by searching $\arg\min_{\gamma} \sum_{i=1}^{n} l(y_i, f_{k-1}(x_i) + \nu \gamma h_k(x_i))$, and $\nu$ is the learning rate.
Return $f_m$.
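The boosting loop above can be sketched with depth-1 regression trees (stumps) on a single feature. This is an illustrative simplification, not the BRT implementation used in the study: the helper names are hypothetical, and the step magnitude is fixed to 1, which is the least-squares optimum when each leaf of the stump predicts the mean residual, so that each update reduces to f_k = f_{k-1} + ν h_k.

```python
def fit_stump(xs, residuals):
    """Depth-1 regression tree on one feature: pick the threshold minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:            # both sides stay non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:                                  # all x identical: constant stump
        constant = sum(residuals) / len(residuals)
        return lambda x: constant
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boosted_regression(xs, ys, n_trees, learning_rate):
    """Return a predictor f_m built from n_trees stumps fitted on successive residuals."""
    f0 = sum(ys) / len(ys)                            # trivial model: mean of the targets
    stumps = []
    current = [f0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, current)]   # negative gradient, up to a factor
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        current = [p + learning_rate * stump(x) for p, x in zip(current, xs)]
    return lambda x: f0 + learning_rate * sum(stump(x) for stump in stumps)

def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]
errors = {m: mse(boosted_regression(xs, ys, m, 0.1), xs, ys) for m in (0, 5, 50)}
```

With a small learning rate each tree corrects only a fraction of the remaining residual, which is why more trees are needed when ν decreases.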
For this algorithm, the number of trees $m$, as well as the learning rate $\nu$, are the hyperparameters that need to be set by the user in order for the method to perform well.

3.1.3. K-Nearest Neighbors
Let $X$ and $y$ be the training data, $X^*$ a new observation and $y^*$ the associated value to predict. The KNN algorithm goes through the following steps:
1. Calculate the distance between $X^*$ and each of the observations of the training set;
2. Take the $y$ values of the K closest observations $y_{i_1}, y_{i_2}, \ldots, y_{i_K}$;
3. Assign to $y^*$ a linear combination of these values (usually the mean).

Three hyperparameters have to be defined properly so that the algorithm performs well: the distance, the number of neighbors K and the type of aggregation of the neighbors' values.
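The three steps and the three hyperparameters can be sketched in a few lines. The function names are illustrative, and the "weighted" aggregation is assumed here to mean inverse-distance weighting, which is one common convention; the paper does not specify the weighting scheme.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def manhattan(a, b):
    return sum(abs(u - v) for u, v in zip(a, b))

def knn_predict(train_x, train_y, query, k, distance=euclidean, weighting="uniform"):
    """Predict the value at `query` from its k nearest training observations."""
    # Steps 1 and 2: rank the training observations by distance and keep the k closest.
    ranked = sorted((distance(x, query), y) for x, y in zip(train_x, train_y))[:k]
    # Step 3: aggregate the neighbors' values.
    if weighting == "uniform":
        return sum(y for _, y in ranked) / len(ranked)
    exact = [y for d, y in ranked if d == 0.0]
    if exact:                                   # the query coincides with a training point
        return sum(exact) / len(exact)
    weights = [1.0 / d for d, _ in ranked]      # inverse-distance weights (assumption)
    return sum(w * y for w, (_, y) in zip(weights, ranked)) / sum(weights)

train_x = [[0.0], [1.0], [3.0], [10.0]]
train_y = [0.0, 1.0, 3.0, 10.0]
uniform_prediction = knn_predict(train_x, train_y, [2.0], k=2)
weighted_prediction = knn_predict(train_x, train_y, [0.5], k=3, weighting="weighted")
```

For the query 2.0 the two nearest neighbors are the observations at 1.0 and 3.0, so the uniform prediction is their mean; with inverse-distance weights, closer neighbors pull the prediction towards their own values.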
3.1.4. Support Vector Regression

Let $X$, $y$ be the training data. The objective of SVR is to find a function $f$ such that the deviation of $f(\cdot)$ from the real values $y$ is at most $\varepsilon$ (Smola & Schölkopf, 2004). If the problem has no solution, slack variables $\xi_i, \xi_i^*$ are introduced to tolerate part of the error. First, let us consider the case where $f$ is linear, i.e. $f(x) = \langle w, x \rangle + b$. $f$ is then the solution of the following optimization problem (Eq. (5)):

$$
\begin{aligned}
\text{Minimize} \quad & \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \\
\text{s.t.} \quad & y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i \\
& \langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^* \\
& \xi_i, \xi_i^* \ge 0
\end{aligned}
\qquad (5)
$$

where $C > 0$ is the trade-off between the flatness of $f$ and the amount of tolerated deviations larger than $\varepsilon$, and $\langle \cdot, \cdot \rangle$ is a scalar product. By using the dual representation of the problem based on Lagrange multipliers, we finally get $f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) \langle x_i, x \rangle + b$, where $\alpha_i$ and $\alpha_i^*$ are the Lagrange multipliers associated with the first and second constraints, respectively. If the adequate $f$ is not linear, we can map the data into a high dimensional space, through a mapping $\varphi$, where the function $f$ becomes linear (Fig. 4). Instead of searching for the expression of $\varphi$, a function $k$ called a kernel function, which satisfies $k(x, x') = \langle \varphi(x), \varphi(x') \rangle$, is used. The existence of such a function is proved by Mercer's theorem. $\varepsilon$, $C$ and the kernel function are the Support Vector Regression (SVR) hyperparameters that need to be selected properly for the performance of the algorithm.

Fig. 4. Mapping to the feature space in SVR.
4. Application and results

4.1. The context of the study

The incomplete dataset used in this paper is taken from a water sample analysis made at Oursbelille, in the Adour plain, South-West of France, from 1991 to 2017. The operational principle of this drinking water collection point is described in Fig. 5. First, the water is pumped, its nitrate rate is measured and it is conveyed to large aerial tanks in order to be treated by active charcoal. Then, the treated water is stored in a water tank. In a third step, some sensing devices are used to monitor some quality indicators, such as the pH. In a fourth step, on demand, the stored water is chlorinated, before being dragged to another underground well, a few kilometers away from the pumping well. From this second storage tank, water is distributed to the citizens of the Adour region. The region benefits from an oceanic climate, with a rainy winter and an average temperature ranging from 4 to 19 °C.

The acquired data contain 148 observations of 411 water quality indicators, with an overall missingness of 82%. Fig. 6a is an overview of the dataset, where some of the measured water quality indicators are displayed, while Fig. 6b summarizes the missingness distribution per variable in the dataset.

Only the variables that are measured at least 5 times are considered, which reduces the dataset to 257 variables (52% of the 411 variables). It is noted that the removed variables do not restrict the analysis since they are not among the common parameters for water quality assessment found in the literature.
4.2. Settings and assumptions of the implementation

Based on the presentation of the three missingness patterns, and the nature of the studied dataset (as described in the previous subsection), we can assume that our study is within the MAR pattern.
Moreover, the proposed method depends on several factors: (a) the number of cycles to perform the imputations, (b) the number of values defined for each hyperparameter, (c) the size of the dataset, (d) the number of variables of interest, and (e) the complexity of the ML algorithm itself. For these reasons, in order to obtain relevant results in a reasonable running time, and contrary to what is commonly done in the literature, only one value for the number of cycles (i.e. 10 cycles) is considered in this work.

Fig. 5. Operational principle of the drinking water well of Oursbelille.

Fig. 6. Description of the dataset. (a) Overview of the dataset. (b) Number of missing values per variable.
The implementation of the proposed method was performed by using the Python programming language, on a computer with the following main features:
• Operating System: Windows 10;
• Processor: Intel(R) Core(TM) i5-7200U CPU @ 2.50 GHz 2.70 GHz;
• RAM: 8.00 GB.

The corresponding results are described and discussed in the following.
4.3. Implementation of the proposed method

The main steps of the proposed method have been implemented according to the following explanations.
• Step 1. The first phase of the original MICE, that is mean imputation, is launched (initialization step).
• Step 2. The next step concerns the hyperparameter tuning of the Machine Learning algorithms. There is no analytical solution for finding the optimal values. Therefore, a cross-validation is performed using the modified dataset, and a mean squared error (MSE) is measured. The optimal hyperparameters are those that have the lowest MSE. Note that only a limited number of candidate values have been taken into account because adding more would drastically affect the algorithmic complexity.
The competing methods are MICE, MICE combined with RF (MICE-RF), MICE combined with BRT (MICE-BRT), MICE combined with KNN (MICE-KNN) and MICE combined with SVR (MICE-SVR).
The candidate values for the hyperparameters of the four Machine Learning algorithms (KNN, RF, BRT, SVR) are detailed in Table 3. For KNN, let us notice that since the studied dataset contains variables with only five non-missing values, the number of neighbors is at most 4.
• Step 3. Phase (II) of the original MICE is modified by replacing Linear Regression with one of the competing algorithms, each with its optimal hyperparameters (as obtained in step 2).
• Step 4. The performance indicators, namely MSE and processing time, are computed for each algorithm.
• Step 5. The method that performed best in terms of MSE, and with a reasonable computing time, is then selected.
• Step 6. Finally, the winning method is used to solve the missingness.
Table 3
Candidate values of the hyperparameters for each Machine Learning method.

| Algorithm | Hyperparameter | Candidate values |
|---|---|---|
| RF | n_trees | {10, 15, 20, 50, 100} |
| BRT | m | {30, 50, 100, 150} |
| BRT | ν | {0.01, 0.1, 0.5} |
| KNN | K | {2, 3, 4} |
| KNN | d | {euclidean, manhattan} |
| KNN | y∗ | {uniform, weighted} |
| SVR | ε | {0.01, 0.1} |
| SVR | C | {0.01, 0.1, 1, 10, 100} |
| SVR | kernel | {rbf, poly, sigmoid} |
| SVR | γ | {1e-3, 0.01, 0.1, 1} |
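The tuning of step 2 amounts to an exhaustive search over such a grid, scored by cross-validated MSE. The sketch below applies it to the KNN candidates of Table 3 on a toy regression problem. It is a simplified stand-in under stated assumptions: the paper cross-validates on its modified dataset within the imputation procedure, whereas here a plain regression task is used, folds are interleaved, "weighted" is taken as inverse-distance weighting, and all function names are illustrative.

```python
import itertools

def knn_predict(train_x, train_y, query, k, metric, weighting):
    power = 2 if metric == "euclidean" else 1             # Minkowski exponent
    def dist(a):
        return sum(abs(u - v) ** power for u, v in zip(a, query)) ** (1 / power)
    ranked = sorted((dist(x), y) for x, y in zip(train_x, train_y))[:k]
    if weighting == "uniform":
        return sum(y for _, y in ranked) / len(ranked)
    weights = [1.0 / (d + 1e-12) for d, _ in ranked]       # inverse-distance (assumption)
    return sum(w * y for w, (_, y) in zip(weights, ranked)) / sum(weights)

def cross_val_mse(xs, ys, n_folds, **params):
    """Mean squared error over n_folds interleaved folds."""
    errors = []
    for fold in range(n_folds):
        test_idx = set(range(fold, len(xs), n_folds))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        for i in test_idx:
            prediction = knn_predict(train_x, train_y, xs[i], **params)
            errors.append((ys[i] - prediction) ** 2)
    return sum(errors) / len(errors)

def grid_search(xs, ys, grid, n_folds=3):
    """Evaluate every combination of candidate values and keep the lowest MSE."""
    names = sorted(grid)
    scores = {}
    for combo in itertools.product(*(grid[name] for name in names)):
        scores[combo] = cross_val_mse(xs, ys, n_folds, **dict(zip(names, combo)))
    best = min(scores, key=scores.get)
    return dict(zip(names, best)), scores[best], scores

KNN_GRID = {"k": [2, 3, 4],
            "metric": ["euclidean", "manhattan"],
            "weighting": ["uniform", "weighted"]}

xs = [[float(i)] for i in range(12)]
ys = [2.0 * i for i in range(12)]
best_params, best_mse, all_scores = grid_search(xs, ys, KNN_GRID)
```

The number of evaluations is the product of the grid sizes (12 for KNN here, 120 for the SVR grid of Table 3), which illustrates why only a limited number of candidate values is affordable.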
The main results of this implementation are presented and discussed in the next subsection.

4.4. Results and discussion

In the following, only steps 2, 4 and 5, which contain the main results of the implementation, are presented.
• Step 2: Hyperparameter tuning.
Random Forest. In this algorithm, the performance improves as the number of trees increases. However, it rapidly becomes time consuming. The objective is to find the smallest value for which the performance is good enough. Although it is not the optimal value, the number of trees is set to 15 in order to reduce the computation time. Furthermore, the error does not decrease much between 15 and 100 estimators (see Table 4).
Fig. 7. Variation of MSE to choose the hyperparameters in BRT.
[Plot: MSE versus the learning rate ν (0.0 to 0.5), one series per number of trees (n_trees = 30, 50, 100, 150).]
(a) Choice of K and the linear combination. (b) Choice of K and the distance.
Fig. 8. Variation of MSE to choose the hyperparameters in KNN.
[Plots: MSE versus K; panel (a) compares the uniform and weighted combinations, panel (b) the euclidean and manhattan distances.]
(a) Choice of ε and the kernel function. (b) Choice of C and ε.
Fig. 9. Variation of MSE to choose the hyperparameters in SVR.
[Plots: panel titles "MSE for C=1" (kernels rbf, poly, sigmoid) and "MSE for kernel='poly'" (ε = 0.01 and ε = 0.1).]
Table 4
Performance indicator (MSE) of the main RF hyperparameter.
n_trees   MSE
10        0.5159
15        0.4943
20        0.4850
50        0.4691
100       0.4653
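The trade-off behind the choice of 15 trees can be quantified from the MSE values of Table 4. The sketch below is illustrative only: the helper `relative_mse_reduction` is a name chosen here, and the cost factor assumes that training time grows roughly linearly with the number of trees.

```python
# MSE values transcribed from Table 4 (number of trees -> MSE).
TABLE_4_MSE = {10: 0.5159, 15: 0.4943, 20: 0.4850, 50: 0.4691, 100: 0.4653}


def relative_mse_reduction(n_from, n_to, table=TABLE_4_MSE):
    """Fraction by which the MSE drops when going from n_from to n_to trees."""
    return (table[n_from] - table[n_to]) / table[n_from]


gain = relative_mse_reduction(15, 100)
cost_factor = 100 / 15  # assumption: cost is proportional to the number of trees

print(f"MSE reduction from 15 to 100 trees: {gain:.1%}")
print(f"Approximate cost factor: {cost_factor:.1f}x")
```

Under that linear-cost assumption, going from 15 to 100 trees multiplies the work by more than six for an MSE reduction of under 6%.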
Boosted Regression Trees. Similarly to the previous algorithm, the best trade-off between computing time and performance is sought. Note that the number of trees is higher, because shallow trees are built in BRT instead of the fully grown ones in RF.
Fig. 7 represents MSE as a function of the learning rate ν, where the labels represent the number of trees. According to these results, the optimal hyperparameters for this study are ν = 0.01 and m = 150. For the sake of computational time, hyperparameters with a slightly higher mean squared error (a difference of only 0.001) are chosen: ν = 0.1 and m = 30.
K-Nearest Neighbors. For this algorithm, the hyperparameters to tune are the number of neighbors K, the distance d and the linear combination method of the neighbors' values y∗. In this study, the Euclidean distance is chosen, K = 4, and y∗ is the weighted mean of the K nearest neighbors. This choice is illustrated in Fig. 8: the MSE score is lowest for these values.
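The three KNN hyperparameters (K, d and y∗) can be made concrete with a small sketch. The paper does not specify the weighting scheme, so inverse-distance weights are an assumption here, as are the function names and the toy donor values.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_estimate(donors, query, k=4, distance=euclidean, weighted=True):
    """Combine the targets of the k donors closest to query.

    donors: list of (features, target) pairs with observed targets.
    weighted=False gives the uniform mean; weighted=True gives an
    inverse-distance weighted mean (assumed scheme).
    """
    ranked = sorted(((distance(f, query), t) for f, t in donors), key=lambda p: p[0])[:k]
    if not weighted:
        return sum(t for _, t in ranked) / len(ranked)
    for dist, target in ranked:
        if dist == 0:  # exact match: avoid dividing by zero
            return target
    weights = [1 / dist for dist, _ in ranked]
    return sum(w * t for w, (_, t) in zip(weights, ranked)) / sum(weights)


# Toy donors (hypothetical values): one feature, one target.
donors = [((0.0,), 1.0), ((1.0,), 2.0), ((2.0,), 3.0), ((4.0,), 5.0), ((10.0,), 11.0)]
uniform_value = knn_estimate(donors, (1.5,), k=4, weighted=False)
weighted_value = knn_estimate(donors, (1.5,), k=4, weighted=True)
print(uniform_value, weighted_value)
```

With weighting, distant neighbors contribute less, which matters when K is close to the number of available donors, as is the case for variables with only five observed values.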
Support Vector Regression. For this algorithm, ε, C, the kernel function and the parameter γ associated with the kernel function need to be tuned. In Fig. 9, it is seen that the MSE is generally lowest for the polynomial kernel, and for ε = 0.1. The lowest MSE score is obtained with ε = 0.01, C = 1, kernel = poly and the associated γ = 0.01.
Table 5
Performance indicator scores.
                      MICE      MICE-SVR   MICE-4NN   MICE-RF   MICE-BRT
Processing time (s)   6.87      5.29       8.25       65.18     32.59
MSE                   1.09e24   0.44       0.58       0.55      0.54
• Step 4: Computing the performance indicators.
The results summarized in Table 5 show that MICE-SVR is the best-performing method regarding both processing time (5.29 seconds) and MSE (0.44).
The processing time was significantly higher when combining MICE with RF and BRT. Indeed, all three methods, MICE, RF, and BRT, are already computationally expensive by themselves. With the number of estimators set to 15 for Random Forest, the number of cycles set to 10 for MICE and 251 variables to fill, MICE-RF computes 15 × 10 × 251 = 37650 fully grown regression trees.
Similarly, MICE-BRT computes 43500 shallow regression trees.
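The tree count for MICE-RF follows from a simple product, sketched below. The function name is hypothetical, and the formula assumes that one ensemble is fitted per incomplete variable in every MICE cycle, which is what the product in the text implies.

```python
def trees_fitted(n_estimators, n_cycles, n_variables):
    """Total number of trees grown when one ensemble of n_estimators trees
    is fitted for each of n_variables variables in each of n_cycles cycles."""
    return n_estimators * n_cycles * n_variables


# Settings reported for MICE-RF: 15 estimators, 10 cycles, 251 variables to fill.
rf_trees = trees_fitted(n_estimators=15, n_cycles=10, n_variables=251)
print(rf_trees)
```

Because the count scales linearly with each factor, reducing the number of estimators or of cycles is the most direct way to shorten the processing time.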
MICE performed the worst in terms of MSE in the current implementation of the algorithm. Indeed, all the variables were used as predictors in the regression, whereas an intermediate variable selection step would have been appropriate. It also suggests that the relationships between the variables are not linear.
MICE-KNN performs slightly worse than the other combinations of MICE with Machine Learning algorithms. This is due to the fact that the closest resembling observations are logically those that are closer in time. However, these values are not systematically filled and the closest neighbors are only searched among non-missing observations for a given variable.
• Step 5: Selection of the best-performing method.
MICE-SVR performed best on both criteria; it is therefore the best-performing competing method in this particular case.
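The selection rule of Step 5 can be sketched with the values of Table 5. The paper does not quantify what a "reasonable" computing time is, so the `max_time_s` threshold below is a hypothetical parameter, as is the function name `select_method`.

```python
# Performance indicators transcribed from Table 5.
TABLE_5 = {
    "MICE": {"time_s": 6.87, "mse": 1.09e24},
    "MICE-SVR": {"time_s": 5.29, "mse": 0.44},
    "MICE-4NN": {"time_s": 8.25, "mse": 0.58},
    "MICE-RF": {"time_s": 65.18, "mse": 0.55},
    "MICE-BRT": {"time_s": 32.59, "mse": 0.54},
}


def select_method(results, max_time_s):
    """Return the method with the lowest MSE among those within the time budget."""
    eligible = {name: r for name, r in results.items() if r["time_s"] <= max_time_s}
    if not eligible:
        raise ValueError("no method satisfies the time budget")
    return min(eligible, key=lambda name: eligible[name]["mse"])


best_method = select_method(TABLE_5, max_time_s=60.0)  # hypothetical budget
print(best_method)
```

Since MICE-SVR has both the lowest MSE and the shortest processing time in Table 5, the outcome does not depend on the chosen budget as long as that budget is at least 5.29 seconds.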
The proposed methodology can handle datasets with a high missingness rate, and is also suitable for high-dimensional data. It is a flexible method that can take into account complex non-linear relationships between variables (if the competing methods are non-linear). It makes it possible to automate the selection of the best method to solve missingness, which reduces the amount of work of the data analyst, who can focus on tasks with higher added value, aiming at extracting knowledge. However, a few limitations are worth noting, particularly concerning the number of cycles preset to 10, and the relatively low number of potential hyperparameter values (which does not allow a rigorous sensitivity analysis of these hyperparameters). Furthermore, these parameters are tuned using an artificial dataset which has been constructed by modifying the real one. All these limitations are mainly due to algorithmic complexity, which constitutes by itself a challenge as well as a great scientific issue.
5. Conclusion and perspectives
It is widely acknowledged that data-driven methods provide powerful algorithms to analyze any issue that is of interest for decision-makers. However, performing such analyses with incomplete data may not be helpful to take reliable decisions. In this paper, a methodology for selecting the best algorithms to address the issue of data imputation, in the context of water quality assessment, has been proposed. A benchmark of four of the most powerful and commonly used ML algorithms has been performed for that purpose (Random Forest, Boosted Regression Trees, K-Nearest Neighbors, Support Vector Regression). The results showed that MICE-SVR is the best in that it converges faster than the three others, and provides the best performance (notably in terms of average prediction error). It can then be applied to high missingness datasets, including data for water quality assessment that are often incomplete, as in the case of Adour (south-west of France) considered in the present study.
Based on the weaknesses of the proposed method, as mentioned in the discussion of the results, the following improvements are planned for further studies: (1) further automate the mechanism of the model selection by setting fuzzy rules in an inference engine that will aggregate all the performance indicators in a single indicator; (2) improve, for each competing method, the optimal choice of the hyperparameters using evolutionary algorithms in order to speed up the computing time and increase the number of values for each hyperparameter; (3) automate the choice of the number of cycles needed for the convergence of the imputations by taking into account the size of the data and its missingness rate; (4) introduce the temporal dimension within the imputation process.
Conflict of interest
There is no conflict of interest.
Credit authorship contribution statement
Romy Ratolojanahary: Conceptualization, Methodology, Software, Formal analysis, Investigation, Writing - original draft. Raymond Houé Ngouna: Conceptualization, Methodology, Investigation, Writing - original draft, Writing - review & editing, Supervision. Kamal Medjaher: Conceptualization, Methodology, Investigation, Writing - original draft, Writing - review & editing, Supervision, Funding acquisition. Jean Junca-Bourié: Investigation, Resources, Supervision, Funding acquisition. Fabien Dauriac: Investigation, Resources, Funding acquisition. Mathieu Sebilo: Investigation, Writing - review & editing, Supervision.
References
Azur, M. J., Stuart, E. A., Frangakis, C., & Leaf, P. J. (2011). Multiple imputation by chained equations: What is it and how does it work? International Journal of Methods in Psychiatric Research, 20 (1), 40–49. doi: 10.1002/mpr.329 .
Betrie, G. D., Sadiq, R., Tesfamariam, S., & Morin, K. A. (2014). On the issue of incomplete and missing water-quality data in mine site databases: Compar- ing three imputation methods. Mine Water and the Environment, 35 (1), 3–9.
doi: 10.1007/s10230- 014- 0322- 4 .
Buhi, E. (2008). Out of sight, not out of mind: Strategies for handling missing data.
American Journal of Health Behavior, 32 (1). doi: 10.5993/ajhb.32.1.8 .
van Buuren, S. (2007). Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16 (3), 219–242.
doi: 10.1177/0962280206074463 .
van Buuren, S. (2018). Flexible imputation of missing data . Chapman and Hall/CRC . van Buuren, S., & Groothuis-Oudshoorn, K. (2011). Mice: Multivariate imputation by
chained equations inr. Journal of Statistical Software, 45 (3). doi: 10.18637/jss.v045.
i03 .
Clark, T. G., & Altman, D. G. (2003). Developing a prognostic model in the pres- ence of missing data. Journal of Clinical Epidemiology, 56 (1), 28–37. doi: 10.1016/
s0895- 4356(02)00539- 5 .
Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (2007). How many imputations are really needed? some practical clarifications of multiple imputation theory.
Prevention Science, 8 (3), 206–213. doi: 10.1007/s11121- 007- 0070- 9 .
Honaker, J., King, G., & Blackwell, M. (2011). Ameliaii: A program for missing data.
Journal of Statistical Software, 45 (7). doi: 10.18637/jss.v045.i07 .
Hong, T.-P., & Wu, C.-W. (2011). Mining rules from an incomplete dataset with a high missing rate. Expert Systems with Applications, 38 (4), 3931–3936. doi: 10.
1016/j.eswa.2010.09.054 .
Jordanov, I., Petrov, N., & Petrozziello, A. (2018). Classifiers accuracy improvement based on missing data imputation. Journal of Artificial Intelligence and Soft Com- puting Research, 8 (1). doi: 10.1515/jaiscr- 2018- 0 0 02 .
Little, R. J. A., & Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys . John Wiley & Sons, Inc.. doi: 10.1002/9780470316696 .
Neter, J., Maynes, E. S., & Ramanathan, R. (1965). The effect of mismatching on the measurement of response errors. Journal of the American Statistical Association, 60 (312), 1005–1027. doi: 10.1080/01621459.1965.10480846 .
Raghunathan, T. E. , Lepkowski, J. M. , Hoewyk, J. V. , & Solenberger, P. (2001). A mul- tivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology, 27 (1), 85–95 .
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63 (3), 581–592. doi: 10.
1093/biomet/63.3.581 .
Rubin, D. B. , & Schafer, J. L. (1990). Efficiently creating multiple imputations for in- complete multivariate normal data. In Proceedings of the statistical computing section of the American statistical association .
Shao, J., Meng, W., & Sun, G. (2016). Evaluation of missing value imputation meth- ods for wireless soil datasets. Personal and Ubiquitous Computing, 21 (1), 113–123.
doi: 10.10 07/s0 0779- 016- 0978- 9 .
Smola, A. J., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14 (3), 199–222. doi: 10.1023/b:stco.0 0 0 0 035301.4 954 9.88 . Stekhoven, D. J., & Buhlmann, P. (2011). MissForest–non-parametric missing value
imputation for mixed-type data. Bioinformatics, 28 (1), 112–118. doi: 10.1093/
bioinformatics/btr597 .
Sterne, J. A. C., White, I. R., Carlin, J. B., Spratt, M., Royston, P., Kenward, M. G., et al.
(2009). Multiple imputation for missing data in epidemiological and clinical re- search: Potential and pitfalls. BMJ, 338 (jun29 1). doi: 10.1136/bmj.b2393 . b2393–
b2393
Tutz, G., & Ramzan, S. (2015). Improved methods for the imputation of missing data by nearest neighbor methods. Computational Statistics & Data Analysis, 90 , 84–
99. doi: 10.1016/j.csda.2015.04.009 .
White, I. R., Royston, P., & Wood, A. M. (2010). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30 (4), 377–
399. doi: 10.1002/sim.4067 .
Yang, C., Liu, J., Zeng, Y., & Xie, G. (2019). Real-time condition monitoring and fault detection of components based on machine-learning reconstruction model. Re- newable Energy, 133 , 433–441. doi: 10.1016/j.renene.2018.10.062 .
Zhao, L., Chen, Z., Yang, Z., Hu, Y., & Obaidat, M. S. (2018). Local similarity imputa- tion based on fast clustering for incomplete data in cyber-physical systems. IEEE Systems Journal, 12 (2), 1610–1620. doi: 10.1109/jsyst.2016.2576026 .