Model selection to improve multiple imputation for handling high rate missingness in a water quality dataset


HAL Id: hal-02134695

https://hal.archives-ouvertes.fr/hal-02134695

Submitted on 20 May 2019

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Romy Ratolojanahary, Raymond Houé Ngouna, Kamal Medjaher, Jean Junca-Bourié, Fabien Dauriac, Mathieu Sebilo

To cite this version:

Romy Ratolojanahary, Raymond Houé Ngouna, Kamal Medjaher, Jean Junca-Bourié, Fabien Dauriac, et al. Model selection to improve multiple imputation for handling high rate missingness in a water quality dataset. Expert Systems with Applications, Elsevier, 2019, pp. 299-307. ⟨10.1016/j.eswa.2019.04.049⟩. ⟨hal-02134695⟩


Open Archive Toulouse Archive Ouverte (OATAO)

OATAO is an open access repository that collects the work of some Toulouse researchers and makes it freely available over the web where possible.

This is an author's version published in: https://oatao.univ-toulouse.fr/23807

Official URL: https://doi.org/10.1016/j.eswa.2019.04.049

To cite this version:

Ratolojanahary, Romy and Houé Ngouna, Raymond and Medjaher, Kamal and Junca-Bourié, Jean and Dauriac, Fabien and Sebilo, Mathieu. Model selection to improve multiple imputation for handling high rate missingness in a water quality dataset. (2019) Expert Systems with Applications (131). 299-307. ISSN 0957-4174

Any correspondence concerning this service should be sent to the repository administrator: tech-oatao@listes-diff.inp-toulouse.fr


Romy Ratolojanahary^a,∗^, Raymond Houé Ngouna^a^, Kamal Medjaher^a^, Jean Junca-Bourié^b^, Fabien Dauriac^c^, Mathieu Sebilo^d^

^a^ Laboratoire Génie de Production, École Nationale d’Ingénieurs de Tarbes, BP1629, 47 avenue d’Azereix, Tarbes Cedex 16 65016, France

^b^ Agence de l’eau Adour-Garonne, Tarbes, 7 Passage de l’Europe, Pau 64000, France

^c^ Chambre d’Agriculture des Hautes-Pyrénées, 20 Place du Foirail, Tarbes 65000, France

^d^ IEES, Université Pierre et Marie Curie, 4 Place Jussieu, Paris 75005, France

Article info

Keywords: Multiple imputation, High missingness, Model selection, Machine learning, Data preprocessing, Water quality

Abstract

In the current era of “information everywhere”, extracting knowledge from a great amount of data is increasingly acknowledged as a promising channel for providing relevant insights to decision makers. One key issue encountered may be the poor quality of the raw data, particularly due to the high missingness, that may affect the quality and the relevance of the results’ interpretation. Automating the exploration of the underlying data with powerful methods, allowing to handle missingness and then perform a learning process to discover relevant knowledge, can then be considered as a successful strategy for systems’ monitoring. Within the context of water quality analysis, the aim of the present study is to propose a robust method for selecting the best algorithm to combine with MICE (Multivariate Imputations by Chained Equations) in order to handle multiple relationships between a high amount of features of interest (more than 200) concerned with a high rate of missingness (more than 80%). The main contribution is to improve MICE, taking advantage of the ability of Machine Learning algorithms to address complex relationships among a large number of parameters. The competing methods that are implemented are Random Forest (RF), Boosted Regression Trees (BRT), K-Nearest Neighbors (KNN) and Support Vector Regression (SVR). The obtained results show that the hybridization of MICE with SVR, KNN, RF and BRT performs better than the original MICE taken alone. Furthermore, MICE-SVR gives a good trade-off in terms of performance and computing time.

∗ Corresponding author.

E-mail addresses: romy-alinoro.ratolojanahary@enit.fr (R. Ratolojanahary), raymond.houe-ngouna@enit.fr (R. Houé Ngouna), kamal.medjaher@enit.fr (K. Medjaher), jean.junca-bourie@eau-adour-garonne.fr (J. Junca-Bourié), fdauriac@hautes-pyrenees.chambagri.fr (F. Dauriac), mathieu.sebilo@upmc.fr (M. Sebilo).

1. Introduction

The proliferation of sensing devices has increased the ability of organizations to acquire various and great amounts of data, allowing them to implement real-time monitoring of their systems. This is generally based on the analyses of complex relationships between several factors of interest, such as in water quality analysis. Online monitoring has indeed offered the development of decision systems that are able to accelerate decision-making and anticipate actions to prevent undesired events or to eradicate critical issues. To achieve such a goal, it is required to pre-process the raw data, especially when some values are missing on a certain level.

Missing data is a recurring phenomenon in real-world applications (Sterne et al., 2009; Yang, Liu, Zeng, & Xie, 2019). It may occur due to sensor failures, bad or non-existing strategy for data acquisition, budget issues, lack of response from a participant in the case of survey or various other reasons. If the complete data are representative of the studied phenomenon, this missing information is negligible, otherwise the results may be incorrect and may lead to wrong interpretations. For example, anomalies could go undetected if they happen during a non-monitored period of time.

There are two ways of dealing with missing data: deletion or imputation (Buhi, 2008). Deletion means discarding the observations or the variables with missing data, which is called complete-case analysis, while imputation consists in reconstructing the missing values. Because of its simplicity, deletion is usually the default method used in practice. However, there are many cases in various fields in which this method showed some limitations. Indeed, it decreases the sample size and may lead to a loss of substantial information. In Clark and Altman (2003) for instance, the number of observations dropped from 1189 to 518 (43% of the original data) in an ovarian cancer dataset, which led to biased interpretation.

Another deletion method is pairwise deletion, through which only non-missing values are used for analyses, for instance in correlation scores calculation, where the method fails when the two correlated variables are not filled at the same time. Instead of discarding an observation or a variable concerned with missing values, it is preferable to estimate accurately those missing values in order to provide relevant interpretations.

Quoting White and co-authors, “awareness has grown of the need to go beyond complete-analysis” and some major improvements of the simplistic methods have been proposed in the literature, since Rubin’s innovative proposal for approaching missingness (White, Royston, & Wood, 2010). Among others, Rubin, who is the author of Multiple Imputation (MI), defined a conceptual framework for characterizing missing data that allows to distinguish various types and to determine when missing data can be ignored (Little & Rubin, 1987; Rubin, 1976). The major insight of the proposed imputation method is that it addresses uncertainty and complexity of the data structure, allowing to go beyond deleting or discarding data.

Following Rubin, van Buuren introduced the Multiple Imputations by Chained Equations (MICE), an MI technique that requires fewer assumptions on missingness and also handles relationships between variables (van Buuren & Groothuis-Oudshoorn, 2011). However, the original MICE considers only linear relationships and has been successfully applied to datasets with at most 70% of missingness. It may therefore fail in other cases such as in water quality data as considered in the present study, which are characterized by a high rate of missingness and a great amount of factors of interest that are not necessarily linearly related. This suggests the need for an alternative method to improve the imputation mechanism in order to provide relevant interpretation of the results, which is the purpose of this work.

The rest of the paper is organized as follows: the main imputation methods available in the literature are reviewed in Section 2, followed by the presentation of a method to improve MICE for multiple data imputation in Section 3. An application of the proposed method on an experimental dataset, along with associated results, is described in Section 4, while the last section contains the conclusion and perspectives of the present work.

2. Related work

In order to choose an appropriate method for handling missing data, the underlying cause of the missingness has to be investigated. Indeed, as mentioned in Buhi (2008), each method only works under certain assumptions, namely complete randomness, conditional randomness or systematic reasons.

2.1. Missingness patterns

The conceptual framework allowing to take into account certain assumptions, as noted above, has been defined by Rubin (1976). There are three types of missing data, depending on the missing mechanism: (1) Missing Completely at Random (MCAR), (2) Missing at Random (MAR) and (3) Missing Not at Random (MNAR).

Let $R$ be the locations of the missing data in a dataset $X = (X_{obs}, X_{miss})$, and $\psi$ the parameters of the missing data model, where $X_{obs}$ and $X_{miss}$ are respectively the observed and the missing values. MCAR, MAR and MNAR patterns are formally defined as follows (van Buuren, 2018):

• Data are MCAR if the probability of missingness is independent of both the observed variables and the variables with missing values. This is the case, for example, when people forget to answer a question in a survey. Formally,

$$P(R = 0 \mid X_{obs}, X_{miss}, \psi) = P(R = 0 \mid \psi) \tag{1}$$

• Data are MAR if the probability of missingness is due entirely to the observed variables and is independent of the unseen data. In other words, the missingness is a function of some other observed variables in the dataset (for example, people of one sex are less likely to disclose their weight):

$$P(R = 0 \mid X_{obs}, X_{miss}, \psi) = P(R = 0 \mid X_{obs}, \psi) \tag{2}$$

Therefore, MAR data are a good candidate for data imputation based on observed variables (Buhi, 2008).

• Data are MNAR if the missing value is related to the actual values (for example, people who weigh more are most likely to not disclose their weight):

$$P(R = 0 \mid X_{obs}, X_{miss}, \psi) \tag{3}$$

depends on all three elements.

When data are MNAR, the missingness process is called non-ignorable, meaning that the cause of the missingness must be included in the model, whereas MAR and MCAR data missingness processes are called ignorable. Following the assumptions behind these three patterns, several methods have been provided in the literature for solving appropriately the missingness.
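To make the three patterns concrete, the following is a minimal sketch that simulates a missingness mask under each mechanism. It is not taken from the study: the function name `simulate_missingness`, the logistic link and the 30% MCAR rate are illustrative assumptions, and the two variables are drawn independently so that only the MNAR case biases the observed mean.

```python
import numpy as np

def simulate_missingness(x_obs, x_target, mechanism, rng, rate=0.3):
    """Return a boolean mask (True = missing) for x_target.

    Hypothetical helper: the logistic link below is an illustrative choice.
    """
    n = x_target.shape[0]
    if mechanism == "MCAR":
        # Missingness ignores every variable, observed or not.
        return rng.random(n) < rate
    if mechanism == "MAR":
        driver = x_obs        # depends only on a fully observed variable
    elif mechanism == "MNAR":
        driver = x_target     # depends on the unseen value itself
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    z = (driver - driver.mean()) / driver.std()
    prob_missing = 1.0 / (1.0 + np.exp(-2.0 * z))  # larger driver, more gaps
    return rng.random(n) < prob_missing

rng = np.random.default_rng(0)
x_obs = rng.normal(size=5000)      # fully observed covariate
x_target = rng.normal(size=5000)   # variable that will receive gaps

masks = {m: simulate_missingness(x_obs, x_target, m, rng)
         for m in ("MCAR", "MAR", "MNAR")}
# Mean of the values that remain observed under each mechanism.
observed_means = {m: float(x_target[~mask].mean()) for m, mask in masks.items()}
```

In this toy setting the observed mean stays close to the true mean (0) under MCAR and MAR, while under MNAR it is pulled downwards because the large values are the ones that go missing.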

2.2. Single imputation methods

Methods that compute one single value per missing data are referred to as single imputation methods. The most common single imputation methods are mean, median or mode imputation, consisting in replacing the missing value with the mean, median or mode of the associated variable (Buhi, 2008). In this case, the missing value is easy to compute, but the method ignores the correlation among the variables and underestimates the standard deviation. If the variable containing missing values is categorical, a simple option is to create a new category for the missing values. This method is suitable for MNAR data, i.e. when the missingness is correlated to the values of the missing data. When a variable of the incomplete dataset is a periodic time series, a more elaborate single imputation technique is to apply a linear interpolation or an Autoregressive Integrated Moving Average (ARIMA) model to fill in the missing values (Shao, Meng, & Sun, 2016). Although those two techniques are simple, the first one is not efficient when the missing gap is large, and the second one requires a periodic time series. Another technique involves predicting the values from the observed variables. For example, K-nearest neighbors (KNN) replaces the missing value with a linear combination of the K nearest non-missing observations (Jordanov, Petrov, & Petrozziello, 2018; Tutz & Ramzan, 2015). To use this algorithm, it is necessary to choose the optimal K and define a distance measurement between two observations. A local similarity imputation based on Fast Clustering was proposed in Zhao, Chen, Yang, Hu, and Obaidat (2018). The authors partition the incomplete data with a fast clustering method (Stacked Autoencoder-based), then fill the missing data within each cluster using a KNN algorithm. The obtained results showed that the proposed method outperformed other local similarity-based methods. Shao and co-authors applied two Single Layer Feed Forward Neural Networks (Extreme Learning Machine and Radial Basis Function Network) on a periodic soil moisture time series (Shao et al., 2016). This method performed better predictions than a linear interpolation and ARIMA in infilling missing segments. However, it requires parameter tuning in order to perform well.

Table 1
Advantages and drawbacks of the reported single imputation methods.

| Method | Advantages | Drawbacks |
|---|---|---|
| Mean | Easy to implement | Underestimates standard deviation; ignores relationships between variables |
| Add a category | Easy to implement | Only works with categorical and MNAR data |
| Linear interpolation | Takes time into account | Does not work when the missing gap is large |
| ARIMA | Takes time into account | Requires a periodic time series |
| Linear regression | Takes into account relationships between variables | Underestimates the variance; ignores nonlinear relationships between variables |
| Stochastic linear regression | Takes into account relationships between variables | Ignores nonlinear relationships between variables |
| KNN | Takes into account relationships between variables | Requires parameter tuning |
| ANN | Takes into account the time factor | Requires parameter tuning |

Fig. 1. Overview of the multiple imputation method.

A brief summary of these implementations of single imputation methods is presented in Table 1 that provides the main drawbacks and advantages. A well-known limitation that they have in common is that once a missing value is imputed, it is treated as a non-missing value.
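The drawback reported for mean imputation (an underestimated standard deviation) can be checked numerically. The sketch below is only an illustration on simulated data; the sample size, the 50% MCAR rate and the variable names are assumptions, not values from the study.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=2.0, size=1000)   # complete reference variable

missing = rng.random(x.size) < 0.5               # illustrative MCAR mask
x_incomplete = x.copy()
x_incomplete[missing] = np.nan

# Mean imputation: every gap receives the mean of the observed values,
# and is afterwards treated as if it had been measured.
x_imputed = np.where(missing, np.nanmean(x_incomplete), x_incomplete)

std_true = float(x.std(ddof=1))
std_imputed = float(x_imputed.std(ddof=1))
```

Since about half of the values are replaced by a single constant, `std_imputed` comes out clearly smaller than `std_true`, which is the loss of variability that multiple imputation aims to avoid.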

2.3. Multiple imputation methods

In order to solve the limitations of single imputation, some authors have proposed to take into account the uncertainty of the imputed values (Little & Rubin, 1987; Neter, Maynes, & Ramanathan, 1965). For that purpose, Rubin has developed the Multiple Imputation (MI) method, which combines several single imputations (Little & Rubin, 1987), as described in the following.

2.3.1. Principles of multiple imputation

The principles of MI are illustrated in Fig. 1, based on the following main steps: (1) imputation phase where m datasets are produced by drawing them from a distribution, which can be different for each variable (van Buuren, 2018), (2) analysis phase in which the m datasets are analyzed, and (3) pooling phase that combines the m datasets to produce a final result, for example by calculating the mean of the imputed values for each missing value. The m datasets can be generated in parallel using parametric statistical theory and assuming a joint model for all the variables (van Buuren, 2007; Rubin & Schafer, 1990), such as in Multiple imputAtions of incoMplEte muLtIvariate dAta (AMELIA), which uses expectation-maximization with a bootstrapping algorithm (Honaker, King, & Blackwell, 2011). Such an approach lacks flexibility and may lead to bias (van Buuren, 2007). The other alternative is to generate the m datasets until a stop criterion is met: in Hong and Wu (2011) for instance, the authors iteratively used association rules to successfully estimate the missing values. Although the studied dataset had a high missing rate, it was relatively small (there were only three variables). Some other examples of the sequential methods are Sequential Imputation for Missing Value (IMPSEQ) (Betrie, Sadiq, Tesfamariam, & Morin, 2014), a covariance-based imputation method, and MICE, a series of linear regressions that consider a different distribution for each variable (van Buuren, 2007; Raghunathan, Lepkowski, Hoewyk, & Solenberger, 2001). Betrie and co-authors have found that the two sequential methods outperform AMELIA (Betrie et al., 2014). In Stekhoven and Buhlmann (2011), the authors introduced an MI method called MissForest, which is similar to MICE, except that it uses Random Forest instead of Linear Regression in the imputation step. As MissForest yielded a better performance than MICE, that result is encouraging towards tweaking the MICE algorithm, which is the object of the present work. A brief summary of the advantages and drawbacks of the methods presented above is given in Table 2, while the original MICE principles are described in the following.

Table 2
Advantages and drawbacks of the reported MI methods.

| Method | Advantages | Drawbacks |
|---|---|---|
| AMELIA | Can be applied to categorical, ordinal or continuous data | Assumes a joint model for all the variables |
| MI using decision rules | Works well when the missing-value rate is high | Not adapted to data with a large number of variables |
| IMPSEQ | Time complexity | Lack of robustness toward outliers; does not take into account nonlinear relationships between variables |
| MICE | Flexibility | Does not take into account nonlinear relationships between variables; theoretical justification needed |
| MissForest | Adapted to high dimensional datasets; takes into account linear relationships between variables | Computation time issue |

Fig. 2. Overview of the MICE algorithm.

2.3.2. Main principles of MICE

The main steps of MICE are summarized in Fig. 2 and detailed in Algorithm 1. The MICE algorithm implementation was based on a method described in Azur, Stuart, Frangakis, and Leaf (2011). It assumes that missing data are of MAR type. The first step is to initialize the missing values to the mean of each column. Then the missing values of the first variable are reset to “missing”. After that, a regression model is fitted on the subset of the dataset where the value of this variable is present. Finally, the obtained model is used to fill in the value and update the dataset. This process is repeated for each variable until all the missing data are estimated. The whole process, first step excluded, is reiterated n_cycles times until the estimated data converge. In the literature, it is advised to increase the number of cycles as a function of the size of the dataset and the missingness ratio (Graham, Olchowski, & Gilreath, 2007). Although MICE has been proved efficient in the literature, the trade-off between computational cost and performance becomes imbalanced when dealing with large datasets and/or datasets with a high missingness rate. Indeed, the number of imputed datasets has to be increased, and so does the computational time. Furthermore, a high missingness rate implies high uncertainty. Another key issue is that this form of the algorithm is based on linear regression, which may not reflect the actual relationships between the variables of the current study. To address these issues, an improved version of MICE is proposed and described in the following.
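The cycle described above can be sketched in a few lines of Python. This is a simplified single-chain illustration and not the authors' implementation: the function name `mice_linear`, the use of `numpy.linalg.lstsq` as the linear regressor and the synthetic three-variable dataset are assumptions made for the example.

```python
import numpy as np

def mice_linear(X, n_cycles=10):
    """Single-chain MICE sketch with ordinary least squares as the regressor."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initialization: every gap is set to the mean of its column.
    X_full = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_cycles):
        for j in range(X.shape[1]):
            miss_j = missing[:, j]
            if not miss_j.any() or miss_j.all():
                continue  # nothing to impute, or nothing to learn from
            # Predictors: all other columns (with their current imputations)
            # plus an intercept column.
            others = np.delete(X_full, j, axis=1)
            A = np.column_stack([np.ones(len(others)), others])
            # Fit only on the rows where variable j was actually observed.
            coef, *_ = np.linalg.lstsq(A[~miss_j], X_full[~miss_j, j], rcond=None)
            # Update the dataset with the predictions for the missing rows.
            X_full[miss_j, j] = A[miss_j] @ coef
    return X_full

# Illustrative data: three linearly related variables, 30% of cells hidden.
rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
X_true = np.column_stack([x1,
                          2.0 * x1 + 1.0 + rng.normal(scale=0.1, size=300),
                          -x1 + rng.normal(scale=0.1, size=300)])
hidden = rng.random(X_true.shape) < 0.3
X_incomplete = np.where(hidden, np.nan, X_true)

X_mice = mice_linear(X_incomplete, n_cycles=10)
X_mean = np.where(hidden, np.nanmean(X_incomplete, axis=0), X_true)
mse_mice = float(np.mean((X_mice[hidden] - X_true[hidden]) ** 2))
mse_mean = float(np.mean((X_mean[hidden] - X_true[hidden]) ** 2))
```

On this linear toy problem the chained regressions recover the hidden cells far better than plain mean imputation, which is the behaviour expected when the linearity assumption holds.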

3. The proposed method to improve MICE

As noted above, the dataset concerned with water quality considered in this study has a very high missing rate (82%). Besides, there is a great amount of variables (more than 200) in which each is concerned with at least one missing value. The methods mentioned above, including the most performing, have been applied in a less constrained context and therefore can fail to provide good results in the specific case of the dataset considered in this paper. It is then proposed to take advantage of the ability of Machine Learning algorithms for handling such issues in order to improve MICE. The two main ideas are: (1) define a set of competing methods, and then (2) replace the Linear Regression in the original MICE by each of these methods in order to select the most performing that fits the context of the present study.

The competing methods have been chosen among the most performing supervised learning algorithms in the literature, namely Random Forest (RF), Boosted Regression Trees (BRT), and Support Vector Regression (SVR). Besides, K-Nearest Neighbors (KNN), which is commonly used to solve missingness, has also been selected.

The main steps of the proposed method are illustrated in Fig. 3.

1. The first phase of the original MICE, phase (I), the initialization of missing values, is launched (step 1).
2. A competing method is then chosen, followed by a mechanism for optimally setting its hyperparameters (step 2).
3. Next, phase (II) of the original MICE, the sequential prediction of missing values, is modified by replacing Linear Regression with the chosen method, and then launched in a loop that goes a number of times corresponding to the predefined number of cycles (step 3).
4. After convergence, performance indicators for the current method are computed (step 4).
5. When all the competing methods have been processed according to the four previous steps, a selection mechanism takes place by comparing their performance indicators (step 5).
6. Finally, the best method is applied to solve the missingness (step 6).

Fig. 3. The proposed method for model selection to improve MICE.

Algorithm 1 MICE.

Input:
• X: incomplete data matrix of size n_obs × n_features
• n_cycles: number of cycles

Output:
• Completed data matrix of size n_obs × n_features

X_full := mean_impute(X)
for i := 1 to n_cycles do
    for j := 1 to n_features do
        y_j := X_j    /* the j-th column */
        X_{−j} := X \ X_j
        m ⊂ {1, ..., n} = {i | X_j^i != NaN}    /* m denotes the indices where X_j is not missing */
        regressor := linear_regressor()
        regressor.fit(X_{−j}^m, y_j^m)    /* the model is fitted on the subset of the dataset where X_j is not missing */
        y_j^{¬m} := regressor.predict(X_{−j}^{¬m})    /* ¬m denotes the indices where X_j is missing */
    end for
end for
return X_full
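The six steps of the proposed method can be sketched as a selection loop around a MICE routine whose regressor is pluggable. The code below is a hedged illustration only: the two candidates (`LeastSquares` and `KNN`) are small NumPy stand-ins chosen to keep the example self-contained, not the RF, BRT or SVR models of the study, hyperparameter tuning (step 2) is omitted, and the evaluation on artificially hidden cells of a synthetic dataset is an assumption about how the MSE can be measured.

```python
import time
import numpy as np

class LeastSquares:
    """Stand-in for the linear regression of the original MICE."""
    def fit(self, A, y):
        design = np.column_stack([np.ones(len(A)), A])
        self.coef_, *_ = np.linalg.lstsq(design, y, rcond=None)
        return self
    def predict(self, A):
        return np.column_stack([np.ones(len(A)), A]) @ self.coef_

class KNN:
    """Tiny K-nearest-neighbours regressor (Euclidean distance, plain mean)."""
    def __init__(self, k=3):
        self.k = k
    def fit(self, A, y):
        self.A_, self.y_ = A, y
        return self
    def predict(self, A):
        dist = np.linalg.norm(A[:, None, :] - self.A_[None, :, :], axis=2)
        nearest = np.argsort(dist, axis=1)[:, :self.k]
        return self.y_[nearest].mean(axis=1)

def mice(X, make_regressor, n_cycles=10):
    missing = np.isnan(X)
    X_full = np.where(missing, np.nanmean(X, axis=0), X)   # phase (I)
    for _ in range(n_cycles):                              # phase (II)
        for j in range(X.shape[1]):
            obs = ~missing[:, j]
            if obs.all() or not obs.any():
                continue
            others = np.delete(X_full, j, axis=1)
            model = make_regressor().fit(others[obs], X_full[obs, j])
            X_full[~obs, j] = model.predict(others[~obs])
    return X_full

# Synthetic complete data with one nonlinear relationship, 20% of cells hidden.
rng = np.random.default_rng(0)
x1 = rng.uniform(-1.0, 1.0, size=200)
X_true = np.column_stack([x1, x1 ** 2,
                          2.0 * x1 + rng.normal(scale=0.05, size=200)])
hidden = rng.random(X_true.shape) < 0.2
X_incomplete = np.where(hidden, np.nan, X_true)

candidates = {"MICE-LS": LeastSquares, "MICE-KNN": lambda: KNN(k=3)}
scores = {}
for name, factory in candidates.items():                   # steps 3 and 4
    start = time.perf_counter()
    X_hat = mice(X_incomplete, factory, n_cycles=10)
    elapsed = time.perf_counter() - start
    mse = float(np.mean((X_hat[hidden] - X_true[hidden]) ** 2))
    scores[name] = (mse, elapsed)

best = min(scores, key=lambda name: scores[name][0])       # step 5
X_mean = np.where(hidden, np.nanmean(X_incomplete, axis=0), X_true)
baseline_mse = float(np.mean((X_mean[hidden] - X_true[hidden]) ** 2))
```

Here the winner is picked on MSE alone and the processing time is only recorded; weighing the two indicators against each other, as done in the study, is left to the reader of `scores`.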

Due to the high missingness rate, the optimal choice of the hyperparameters (as considered in step 2) is based on a modified version of the studied dataset constructed according to the following procedure:

• For each variable, a triangular distribution is simulated with different parameters (min, mode, max). If a variable always has the same value, then that value is replicated in each observation. The triangular distribution has been used because it provides a simple representation of the real distribution of the dataset and allows more flexibility by taking into account the uncertainty of the values.

• The data are scaled so that the units of the variables do not play any role.

• The observations are shuffled and the missingness distribution of the real dataset is reproduced in order to mimic the real problem as accurately as possible.

Two main performance indicators have been used for the comparison (as realized in step 5), namely processing time and Mean Squared Error (MSE).
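A possible reading of this procedure is sketched below. It is an assumption-laden illustration rather than the code used in the study: the function name `build_tuning_dataset` is hypothetical, the mode of each triangular distribution is approximated by the median of the observed values, and scaling is done by standardization.

```python
import numpy as np

def build_tuning_dataset(X_real, rng):
    """Simulate a complete dataset and re-apply the real missingness pattern.

    Assumes every column of X_real has at least one observed value.
    """
    X_real = np.asarray(X_real, dtype=float)
    n_obs, n_features = X_real.shape
    mask = np.isnan(X_real)
    X_sim = np.empty_like(X_real)
    for j in range(n_features):
        observed = X_real[~mask[:, j], j]
        lo, hi = observed.min(), observed.max()
        if lo == hi:
            X_sim[:, j] = lo  # constant variable: replicate its value
        else:
            # Assumption: the median stands in for the mode.
            mode = float(np.clip(np.median(observed), lo, hi))
            X_sim[:, j] = rng.triangular(lo, mode, hi, size=n_obs)
    # Scale each variable so that units do not play any role.
    std = X_sim.std(axis=0)
    std[std == 0] = 1.0
    X_sim = (X_sim - X_sim.mean(axis=0)) / std
    # Shuffle the observations, then reproduce the real missingness pattern.
    X_sim = X_sim[rng.permutation(n_obs)]
    return X_sim, np.where(mask, np.nan, X_sim)

rng = np.random.default_rng(0)
X_real = np.array([[1.0, np.nan, 5.0],
                   [2.0, 7.0, 5.0],
                   [np.nan, 9.0, np.nan],
                   [4.0, np.nan, 5.0]])
X_complete, X_incomplete = build_tuning_dataset(X_real, rng)
```

Because `X_complete` is fully known, the cells hidden in `X_incomplete` can serve as ground truth when the MSE of each candidate hyperparameter setting is measured.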

3.1. Theoretical background of the competing methods

3.1.1. Random Forest

Random Forest is an ensemble method based on fully grown regression trees. The objective is to build several weak learners (the regression trees) in parallel in order to produce a strong regressor. The main steps are as follows:

1. The observations are sampled with replacement (bootstrap aggregating).
2. A set of variables is selected randomly.
3. The tree is built upon the observations from step (1) and the variables from step (2).
4. The final prediction is made by averaging over the predictions of all decision trees.

In this algorithm, one of the most relevant hyperparameters to set in order to make the model perform well is the number of trees.

3.1.2. Boosted Regression Trees

Similarly to the Random Forest algorithm, BRT is an ensemble method based on regression trees. Gradient boosting is used to train the weak learners (shallow regression trees) sequentially. In this algorithm, a higher focus is set on observations that have higher errors on the previous tree and a gradient descent is used to minimize the loss function (least squared errors) at each step.

Let $y_i$ be the target value and $f(x_i)$ its predictor. The objective function is given as in Eq. (4):

$$L(y, f) = \sum_{i=1}^{n} l(y_i, f(x_i)) \tag{4}$$

where $l(y_i, f(x_i)) := (y_i - f(x_i))^2$. The algorithm goes as follows:

$f_0$ is the trivial tree, it returns the mean value of $Y$.

For $k := 1$ to $m$:

• Calculate the negative gradient $-\nabla l(y_i, f(x_i))$, which corresponds to the residual, for $i = 1$ to $n$.
• Fit a regression tree $h_k$ for the residuals.
• Create a model $f_k = f_{k-1} + \nu \gamma_k h_k$, where $\gamma_k$ is the step magnitude, found by searching $\arg\min_{\gamma} \sum_{i=1}^{n} l(y_i, f_{k-1}(x_i) + \nu \gamma h_k(x_i))$, and $\nu$ is the learning rate.

Return $f_m$.

For this algorithm, the number of trees $m$, as well as the learning rate $\nu$, are the hyperparameters that need to be set by the user in order for the method to perform well.
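The boosting loop can be illustrated with one-split regression trees (stumps) on a single feature. This is a didactic sketch under stated assumptions, not the BRT implementation used in the study: `fit_stump` and `boosted_stumps` are hypothetical helpers, and the line search for the step magnitude is carried out on the un-shrunk step, a common convention in gradient boosting that may differ from the exact formula given in the text.

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares regression stump (one split) on a 1-D feature."""
    best = None
    for t in np.unique(x)[:-1]:          # candidate thresholds
        left = x <= t
        v_left, v_right = r[left].mean(), r[~left].mean()
        sse = ((r - np.where(left, v_left, v_right)) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, v_left, v_right)
    _, t, v_left, v_right = best
    return lambda z: np.where(z <= t, v_left, v_right)

def boosted_stumps(x, y, m=100, nu=0.1):
    """Gradient boosting for squared loss with m stumps and learning rate nu."""
    f0 = float(y.mean())                 # trivial model: the mean of y
    pred = np.full(y.shape, f0)
    trees = []
    for _ in range(m):
        residual = y - pred              # negative gradient, up to a factor 2
        h = fit_stump(x, residual)
        hx = h(x)
        norm = float(hx @ hx)
        if norm == 0.0:
            break                        # residuals already fitted exactly
        # Closed-form line search for squared loss; for a least-squares
        # stump fitted on the residuals this evaluates to about 1.
        gamma = float(residual @ hx) / norm
        trees.append((gamma, h))
        pred = pred + nu * gamma * hx
    return lambda z: f0 + nu * sum(g * h(z) for g, h in trees)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.1, size=x.size)

model = boosted_stumps(x, y, m=100, nu=0.1)
mse_boosted = float(np.mean((model(x) - y) ** 2))
mse_constant = float(np.mean((y.mean() - y) ** 2))
```

Lowering `nu` makes each tree contribute less, so more trees `m` are needed for the same fit; this interplay is why both hyperparameters are tuned jointly.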

3.1.3. K-Nearest Neighbors

Let $X$ and $y$ be the training data, $X^*$ a new observation and $y^*$ the associated value to predict. The KNN algorithm goes through the following steps:

1. Calculate the distance between $X^*$ and each of the observations of the training set;
2. Take the $y$ values of the $K$ closest observations $y_{i_1}, y_{i_2}, \ldots, y_{i_K}$;
3. Assign to $y^*$ a linear combination of these values (usually the mean).

Three hyperparameters have to be defined properly so that the algorithm performs well: the distance, the number of neighbors $K$ and the type of aggregation of the neighbors' values.
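The three steps and the three hyperparameters map directly onto a short function. The sketch below is illustrative: the name `knn_predict` is hypothetical, and the "weighted" aggregation is assumed to mean inverse-distance weighting, which the text does not specify.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3, metric="euclidean", weights="uniform"):
    """Predict y* for one new observation from its k nearest neighbours."""
    diff = X_train - x_new
    # Step 1: distance between the new observation and every training row.
    if metric == "euclidean":
        dist = np.sqrt((diff ** 2).sum(axis=1))
    elif metric == "manhattan":
        dist = np.abs(diff).sum(axis=1)
    else:
        raise ValueError(f"unknown metric: {metric}")
    # Step 2: indices of the k closest observations.
    idx = np.argsort(dist)[:k]
    # Step 3: linear combination of their y values.
    if weights == "uniform":
        return float(y_train[idx].mean())
    if weights == "weighted":
        w = 1.0 / (dist[idx] + 1e-12)   # inverse-distance weights (assumption)
        return float((w * y_train[idx]).sum() / w.sum())
    raise ValueError(f"unknown weights: {weights}")

X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0])
x_new = np.array([1.2])

y_uniform = knn_predict(X_train, y_train, x_new, k=2, weights="uniform")
y_weighted = knn_predict(X_train, y_train, x_new, k=2, weights="weighted")
```

With the two nearest neighbours at distances 0.2 and 0.8, the uniform mean gives 1.5 while inverse-distance weighting gives about 1.2, closer to the nearest point.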

3.1.4. Support Vector Regression

Let $X$, $y$ be the training data. The objective of SVR is to find a function $f$ such that the deviation of $f(\cdot)$ from the real values $y$ is at most $\varepsilon$ (Smola & Schölkopf, 2004). If the problem has no solution, slack variables $\xi_i$, $\xi_i^*$ are introduced to tolerate part of the error. First, let us consider the case where $f$ is linear, i.e. $f(x) = \langle w, x \rangle + b$. $f$ is then the solution of the following optimization problem (Eq. (5)):

$$
\begin{aligned}
\text{Minimize} \quad & \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \\
\text{s.t.} \quad & y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i \\
& \langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^* \\
& \xi_i, \xi_i^* \ge 0
\end{aligned}
\tag{5}
$$

where $C > 0$ is the trade-off between the flatness of $f$ and the amount of tolerated deviations larger than $\varepsilon$, and $\langle \cdot, \cdot \rangle$ is a scalar product. By using the dual representation of the problem based on Lagrange multipliers, we finally get $f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) \langle x_i, x \rangle + b$, where $\alpha_i$ and $\alpha_i^*$ are the Lagrange multipliers. If the adequate $f$ is not linear, we can map the data, through a function $\varphi$, into a high dimensional space where the function $f$ becomes linear (Fig. 4). Instead of searching for the expression of $\varphi$, a function $k$ called a kernel function, which satisfies $k(x, x') = \langle \varphi(x), \varphi(x') \rangle$, is used. The existence of such a function is proved by Mercer’s theorem. $\varepsilon$, $C$ and the kernel function are the Support Vector Regression (SVR) hyperparameters that need to be selected properly for the performance of the algorithm.

Fig. 4. Mapping to the feature space in SVR.
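Two ingredients of SVR lend themselves to a small worked example: the kernel identity and the deviation tolerated by the ε-tube. The sketch below uses a homogeneous polynomial kernel of degree 2 in two dimensions because its feature map can be written explicitly; this particular kernel, the helper names and the numeric values are illustrative assumptions, not choices made in the study.

```python
import math

def phi(x):
    """Explicit feature map of the degree-2 homogeneous polynomial kernel (2-D input)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def poly2_kernel(x, z):
    """k(x, z) = <x, z>^2, computed without ever building phi."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def eps_insensitive(y, f_x, eps):
    """Deviation beyond the eps-tube, i.e. what the slack variables absorb."""
    return max(0.0, abs(y - f_x) - eps)

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))   # <phi(x), phi(z)>
implicit = poly2_kernel(x, z)                           # k(x, z)

inside_tube = eps_insensitive(1.0, 1.05, eps=0.1)       # error below eps
outside_tube = eps_insensitive(1.0, 1.5, eps=0.1)       # error above eps
```

Both `explicit` and `implicit` evaluate to 1 (up to rounding), which is the identity $k(x, x') = \langle \varphi(x), \varphi(x') \rangle$ in action; a prediction off by 0.05 costs nothing with ε = 0.1, whereas one off by 0.5 leaves a deviation of 0.4 to be paid for through the slack variables.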

4. Application and results

4.1. The context of the study

The incomplete dataset used in this paper is taken from a water sample analysis made at Oursbelille, in the Adour plain, South-West of France, from 1991 to 2017. The operational principle of this drinking water collection point is described in Fig. 5. First, the water is pumped, its nitrate rate is measured, and it is conveyed to large aerial tanks in order to be treated by active charcoal. Then, the treated water is stored in a water tank. In a third step, some sensing devices are used to monitor some quality indicators, such as the pH. In a fourth step, on demand, the stored water is chlorinated, before being carried to another underground well, a few kilometers away from the pumping well. From this second storage tank, water is distributed to the citizens of the Adour region. The region benefits from an oceanic climate, with a rainy winter and an average temperature ranging from 4 to 19 °C.

The acquired data contain 148 observations of 411 water quality indicators, with an overall missingness of 82%. Fig. 6a is an overview of the dataset, where some of the measured water quality indicators are displayed, while Fig. 6b summarizes the missingness distribution per variable in the dataset.

Only the variables that are measured at least 5 times are considered, which reduces the dataset to 257 variables (52% of the 411 variables). It is noted that the removed variables do not restrict the analysis since they are not among the common parameters for water quality assessment found in the literature.
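The variable filter described above amounts to counting the non-missing entries per column. The following sketch shows one way to do it with NumPy on a toy matrix; the function name `filter_sparse_variables` and the small example array are assumptions, not the study's data.

```python
import numpy as np

def filter_sparse_variables(X, min_measurements=5):
    """Keep the columns measured at least `min_measurements` times."""
    X = np.asarray(X, dtype=float)
    counts = (~np.isnan(X)).sum(axis=0)        # measurements per variable
    keep = counts >= min_measurements
    X_kept = X[:, keep]
    missing_rate = float(np.isnan(X_kept).mean())  # overall share of gaps
    return X_kept, keep, missing_rate

# Toy matrix: 6 observations of 3 variables measured 6, 5 and 2 times.
X = np.full((6, 3), np.nan)
X[:, 0] = np.arange(6.0)
X[:5, 1] = 1.0
X[:2, 2] = 2.0

X_kept, keep, missing_rate = filter_sparse_variables(X, min_measurements=5)
```

In the toy case the third variable is dropped and one cell out of twelve remains missing in the retained columns.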

4.2. Settings and assumptions of the implementation

Based on the presentation of the three missingness patterns, and the nature of the studied dataset (as described in the previous subsection), we can assume that our study is within the MAR pattern.

Moreover, the proposed method depends on several factors: (a) the number of cycles to perform the imputations, (b) the number of values defined for each hyperparameter, (c) the size of the dataset, (d) the number of variables of interest, and (e) the complexity of the ML algorithm itself. For these reasons, in order to obtain relevant results in a reasonable running time, and contrary to what is commonly done in the literature, only one value for the number of cycles (i.e. 10 cycles) is considered in this work.

Fig. 5. Operational principle of the drinking water well of Oursbelille.

Fig. 6. Description of the dataset. (a) Overview of the dataset. (b) Number of missing values per variable.

The implementation of the proposed method was performed by using the Python programming language, on a computer with the following main features:

• Operating System: Windows 10;

• Processor: Intel(R) Core(TM) i5-7200U CPU @ 2.50 GHz 2.70 GHz;

• RAM: 8.00 GB.

The corresponding results are described and discussed in the following.

4.3. Implementationoftheproposedmethod

The main steps of the proposed method have been implemented as follows.

• Step 1. The first phase of the original MICE, that is mean imputation, is launched (initialization step).

• Step 2. The next step concerns the hyperparameter tuning of the Machine Learning algorithms. There is no analytical solution for finding the optimal values. Therefore, a cross-validation is performed using the modified dataset, and a mean squared error (MSE) is measured. The optimal hyperparameters are those that yield the lowest MSE. Note that only a limited number of candidate values have been taken into account, because adding more would drastically increase the computational cost.

The competing methods are MICE, MICE combined with RF (MICE-RF), MICE combined with BRT (MICE-BRT), MICE combined with KNN (MICE-KNN) and MICE combined with SVR (MICE-SVR).

The candidate values for the hyperparameters of the four Machine Learning algorithms (KNN, RF, BRT, SVR) are detailed in Table 3. For KNN, note that since the studied dataset contains variables with only five non-missing values, the number of neighbors is at most 4.

• Step 3. Phase (II) of the original MICE is modified by replacing Linear Regression with one of the competing algorithms, each with its optimal hyperparameters (as obtained in step 2).

• Step 4. The performance indicators, namely MSE and processing time, are computed for each algorithm.

• Step 5. The method that performed best in terms of MSE, with a reasonable computing time, is then selected.

• Step 6. Finally, the winning method is used to impute the missing values.
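The six steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`chained_imputation`, `masked_mse`), the two toy regressors standing in for RF, BRT, KNN and SVR, and the small dataset are all hypothetical, and the cross-validation of Step 2 is reduced to a single set of artificially hidden cells.

```python
from statistics import mean

def chained_imputation(rows, fit_predict, n_cycles=10):
    """Mean-initialise missing cells (None), then refine them variable by variable."""
    n_rows, n_cols = len(rows), len(rows[0])
    missing = [[value is None for value in row] for row in rows]
    # Step 1: mean imputation (assumes every column has at least one observed value).
    col_means = [mean(row[j] for row in rows if row[j] is not None) for j in range(n_cols)]
    filled = [[col_means[j] if row[j] is None else row[j] for j in range(n_cols)]
              for row in rows]

    def features(i, j):
        # Predictors for variable j: every other variable of observation i.
        return [filled[i][k] for k in range(n_cols) if k != j]

    # Step 3: the regression phase, with a pluggable regressor instead of linear regression.
    for _ in range(n_cycles):
        for j in range(n_cols):
            observed = [i for i in range(n_rows) if not missing[i][j]]
            to_fill = [i for i in range(n_rows) if missing[i][j]]
            if not to_fill:
                continue
            predictions = fit_predict(
                [features(i, j) for i in observed],
                [filled[i][j] for i in observed],
                [features(i, j) for i in to_fill],
            )
            for i, value in zip(to_fill, predictions):
                filled[i][j] = value
    return filled

def masked_mse(rows, hidden_cells, fit_predict):
    """Steps 2 and 4: hide known cells, impute them, and measure the MSE on those cells."""
    masked = [list(row) for row in rows]
    for i, j in hidden_cells:
        masked[i][j] = None
    imputed = chained_imputation(masked, fit_predict)
    return mean((imputed[i][j] - rows[i][j]) ** 2 for i, j in hidden_cells)

# Two toy regressors (hypothetical stand-ins for the competing ML algorithms).
def mean_regressor(x_train, y_train, x_new):
    return [mean(y_train)] * len(x_new)

def nearest_neighbour_regressor(x_train, y_train, x_new):
    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [y_train[min(range(len(x_train)), key=lambda i: sqdist(x_train[i], x))]
            for x in x_new]

CANDIDATES = {"mean": mean_regressor, "1-NN": nearest_neighbour_regressor}

# Illustrative complete data (second column is twice the first) and hidden cells.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0], [5.0, 10.0], [6.0, 12.0]]
hidden = [(1, 1), (4, 1)]

scores = {name: masked_mse(data, hidden, model) for name, model in CANDIDATES.items()}
best = min(scores, key=scores.get)  # Step 5: lowest MSE wins

# Step 6: apply the winning method to data that really contain missing values.
incomplete = [[1.0, 2.0], [2.0, None], [3.0, 6.0]]
completed = chained_imputation(incomplete, CANDIDATES[best])
```

The regressor is passed as a plain function so that swapping one competing method for another does not change the imputation loop, which mirrors the idea of replacing only the regression phase of MICE.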

Table 3

Candidate values of the hyperparameters for each Machine Learning method.

| Algorithm | Hyperparameter | Candidate values |
|---|---|---|
| RF | n_trees | {10, 15, 20, 50, 100} |
| BRT | m | {30, 50, 100, 150} |
| BRT | ν | {0.01, 0.1, 0.5} |
| KNN | K | {2, 3, 4} |
| KNN | d | {euclidean, manhattan} |
| KNN | y* | {uniform, weighted} |
| SVR | ε | {0.01, 0.1} |
| SVR | C | {0.01, 0.1, 1, 10, 100} |
| SVR | kernel | {rbf, poly, sigmoid} |
| SVR | γ | {1e-3, 0.01, 0.1, 1} |
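To give an idea of the search effort implied by Table 3, the sketch below enumerates the candidate combinations per algorithm. It assumes an exhaustive grid (every value combined with every other), which the text does not state explicitly. The dictionary layout, the key names and the helper `max_neighbours` are illustrative only.

```python
from itertools import product

# Candidate values transcribed from Table 3; key names are illustrative.
GRID = {
    "RF": {"n_trees": [10, 15, 20, 50, 100]},
    "BRT": {"m": [30, 50, 100, 150], "nu": [0.01, 0.1, 0.5]},
    "KNN": {"K": [2, 3, 4],
            "d": ["euclidean", "manhattan"],
            "y_star": ["uniform", "weighted"]},
    "SVR": {"epsilon": [0.01, 0.1],
            "C": [0.01, 0.1, 1, 10, 100],
            "kernel": ["rbf", "poly", "sigmoid"],
            "gamma": [1e-3, 0.01, 0.1, 1]},
}

def combinations(space):
    """Return every hyperparameter combination of one algorithm as a dict."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*(space[n] for n in names))]

def max_neighbours(n_non_missing):
    """One reading of the KNN bound: an observed value has at most n - 1 observed neighbours."""
    return n_non_missing - 1

sizes = {algorithm: len(combinations(space)) for algorithm, space in GRID.items()}
total = sum(sizes.values())
```

Under this assumption, SVR dominates the search with 120 combinations, against 5 for RF and 12 each for BRT and KNN, and each combination has to be evaluated by a full imputation run.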

The main results of this implementation are presented and discussed in the next subsection.

4.4. Results and discussion

In the following, only steps 2, 4 and 5, which contain the main results of the implementation, are presented.

• Step 2: Hyperparameter tuning.

Random Forest. In this algorithm, the performance improves as the number of trees grows. However, it rapidly becomes time consuming. The objective is to find the smallest value for which the performance is good enough. Although it is not the optimal value, the number of trees is set to 15 in order to reduce the computation time. Furthermore, the error does not decrease much between 15 and 100 estimators (see Table 4).

Fig. 7. Variation of MSE to choose the hyperparameters in BRT. [Plot of MSE against the learning rate ν, one series per number of trees: 30, 50, 100, 150.]

(a) Choice of K and the linear combination. (b) Choice of K and the distance.

Fig. 8. Variation of MSE to choose the hyperparameters in KNN.

(a) Choice of ε and the kernel function (MSE for C = 1). (b) Choice of C and ε (MSE for kernel = poly).

Fig. 9. Variation of MSE to choose the hyperparameters in SVR.

Table 4

Performance indicator (MSE) of the main RF hyperparameter.

| n_trees | MSE |
|---|---|
| 10 | 0.5159 |
| 15 | 0.4943 |
| 20 | 0.4850 |
| 50 | 0.4691 |
| 100 | 0.4653 |
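The trade-off behind the choice of 15 trees can be made explicit from the values of Table 4. The short sketch below computes the relative MSE reduction between two forest sizes. The helper name `relative_gain` is hypothetical, and the paper does not state a numerical threshold for what counts as "good enough".

```python
# MSE values transcribed from Table 4.
TABLE_4 = {10: 0.5159, 15: 0.4943, 20: 0.4850, 50: 0.4691, 100: 0.4653}

def relative_gain(n_from, n_to, table=TABLE_4):
    """Relative MSE reduction obtained by growing the forest from n_from to n_to trees."""
    return (table[n_from] - table[n_to]) / table[n_from]

for n_from, n_to in [(10, 15), (15, 100)]:
    print(f"{n_from} -> {n_to} trees: MSE reduced by {relative_gain(n_from, n_to):.1%}")
```

Going from 15 to 100 trees, i.e. fitting more than six times as many trees, lowers the MSE by a little under 6%, which is consistent with the remark that the error does not decrease much over that range.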

Boosted Regression Trees. Similarly to the previous algorithm, the best trade-off between computing time and performance is sought. Note that the number of trees is higher, because shallow trees are built in BRT instead of the fully grown ones in RF.

Fig. 7 represents MSE as a function of the learning rate ν, where the labels represent the number of trees. According to these results, the optimal hyperparameters for this study are ν = 0.01 and m = 150. For the sake of computational time, hyperparameters with a slightly higher mean squared error (a difference of only 0.001) are chosen: ν = 0.1 and m = 30.

K-Nearest Neighbors. For this algorithm, the hyperparameters to tune are the number of neighbors K, the distance d and the linear combination method of the neighbors' values y*. In this study, the Euclidean distance is chosen, K = 4, and y* is the weighted mean of the K nearest neighbors. This choice is illustrated in Fig. 8: the MSE score is lowest for these values.
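As an illustration of the retained configuration (Euclidean distance, K = 4, weighted mean), the sketch below predicts a value from its nearest neighbors. The text only says "weighted mean"; the inverse-distance weighting used here is an assumption, as are the function names.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def knn_weighted_mean(x_train, y_train, x_new, k=4):
    """Predict y for x_new as the inverse-distance weighted mean of its k nearest neighbours."""
    ranked = sorted(zip(x_train, y_train), key=lambda pair: euclidean(pair[0], x_new))[:k]
    # An exact match would receive an infinite weight, so its value is returned directly.
    for x, y in ranked:
        if euclidean(x, x_new) == 0.0:
            return y
    weights = [1.0 / euclidean(x, x_new) for x, _ in ranked]
    return sum(w * y for w, (_, y) in zip(weights, ranked)) / sum(weights)

# Illustrative one-dimensional data with five observed values, hence K <= 4.
x_train = [[0.0], [1.0], [3.0], [4.0], [10.0]]
y_train = [0.0, 1.0, 3.0, 4.0, 10.0]
prediction = knn_weighted_mean(x_train, y_train, [2.0], k=4)
```

With uniform weights, the four neighbors would contribute equally; with inverse-distance weights, the two closest observations count twice as much as the two farther ones in this example.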

Table 5

Performance indicator scores.

| Indicator | MICE | MICE-SVR | MICE-4NN | MICE-RF | MICE-BRT |
|---|---|---|---|---|---|
| Processing time (s) | 6.87 | 5.29 | 8.25 | 65.18 | 32.59 |
| MSE | 1.09e24 | 0.44 | 0.58 | 0.55 | 0.54 |

Support Vector Regression. For this algorithm, ε, C, the kernel function and the parameter γ associated with the kernel function need to be tuned. In Fig. 9, it is seen that the MSE is generally lowest for the polynomial kernel, and for ε = 0.1. The lowest MSE score is obtained with ε = 0.01, C = 1, kernel = poly and the associated γ = 0.01.
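To make the role of γ and ε in the SVR tuning more concrete, the sketch below writes out the three kernel functions and the ε-insensitive loss in their usual textbook form. The polynomial degree (3) and the offset `coef0` (0) are assumptions, since the text does not report them. Nothing here is taken from the authors' code.

```python
import math

def dot(a, b):
    """Inner product of two equally sized vectors."""
    return sum(u * v for u, v in zip(a, b))

def rbf_kernel(a, b, gamma):
    """exp(-gamma * ||a - b||^2): similarity decays with squared distance."""
    return math.exp(-gamma * sum((u - v) ** 2 for u, v in zip(a, b)))

def poly_kernel(a, b, gamma, degree=3, coef0=0.0):
    """(gamma * <a, b> + coef0) ** degree; degree and coef0 are assumed values."""
    return (gamma * dot(a, b) + coef0) ** degree

def sigmoid_kernel(a, b, gamma, coef0=0.0):
    """tanh(gamma * <a, b> + coef0)."""
    return math.tanh(gamma * dot(a, b) + coef0)

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Errors smaller than epsilon are ignored; larger ones are penalised linearly."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

a, b = [1.0, 2.0], [3.0, 4.0]
values = {
    "rbf": rbf_kernel(a, b, gamma=0.01),
    "poly": poly_kernel(a, b, gamma=0.01),
    "sigmoid": sigmoid_kernel(a, b, gamma=0.01),
}
```

In this formulation, γ scales the similarity between two observations in all three kernels, ε sets the width of the tube inside which prediction errors are not penalised, and C (not shown) weights the errors that fall outside it.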

• Step 4: Computing the performance indicators

The results summarized in Table 5 show that MICE-SVR is the best-performing method regarding both processing time (5.29 seconds) and MSE (0.44).

The processing time was significantly higher when combining MICE with RF and BRT. Indeed, all three methods, MICE, RF, and BRT, are already computationally expensive by themselves. With a number of estimators set to 15 for Random Forest, a number of cycles set to 10 for MICE and 251 variables to fill, MICE-RF computes 15 × 10 × 251 = 37650 fully grown regression trees.

Similarly, MICE-BRT computes 43500 shallow regression trees.
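The tree count for MICE-RF follows from a simple product: one ensemble is fitted per variable and per cycle. The sketch below only restates that arithmetic. The function name is hypothetical, and the MICE-BRT figure is quoted from the text rather than recomputed, since the text does not give the factors behind it.

```python
def trees_fitted(n_estimators, n_cycles, n_variables):
    """Number of regression trees built when one ensemble is fitted per variable and per cycle."""
    return n_estimators * n_cycles * n_variables

# Values stated in the text for MICE-RF: 15 estimators, 10 cycles, 251 variables.
rf_trees = trees_fitted(n_estimators=15, n_cycles=10, n_variables=251)
print(rf_trees)
```

For the settings above, the product evaluates to 37650 trees.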

MICE performed the worst in terms of MSE in the current implementation of the algorithm. Indeed, all the variables were used as predictors in the regression, whereas an intermediate variable selection step would have been appropriate. It also indicates that the relationships between the variables are not linear.

MICE-KNN performs slightly worse than the other combinations of MICE with Machine Learning algorithms. This is due to the fact that the closest resembling observations are logically those that are closer in time. However, these values are not systematically filled, and the closest neighbors are only searched among non-missing observations for a given variable.

• Step 5: Selection of the best-performing method.

MICE-SVR performed best on both criteria; it is therefore the best-performing competing method in this particular case.
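The selection rule of Step 5 can be expressed as a small function applied to the scores of Table 5. This is a sketch under assumptions: the text asks for a "reasonable" computing time without giving a threshold, so the optional `max_time` budget is a hypothetical parameter, and ties on MSE are broken by processing time as an illustrative choice.

```python
# Scores transcribed from Table 5 (processing time in seconds).
TABLE_5 = {
    "MICE": {"time": 6.87, "mse": 1.09e24},
    "MICE-SVR": {"time": 5.29, "mse": 0.44},
    "MICE-4NN": {"time": 8.25, "mse": 0.58},
    "MICE-RF": {"time": 65.18, "mse": 0.55},
    "MICE-BRT": {"time": 32.59, "mse": 0.54},
}

def select_method(scores, max_time=None):
    """Return the method with the lowest MSE among those within the time budget."""
    eligible = {name: s for name, s in scores.items()
                if max_time is None or s["time"] <= max_time}
    if not eligible:
        raise ValueError("no method satisfies the time budget")
    return min(eligible, key=lambda name: (eligible[name]["mse"], eligible[name]["time"]))

def dominates(first, second):
    """True when `first` is at least as good on both indicators and better on one."""
    no_worse = first["time"] <= second["time"] and first["mse"] <= second["mse"]
    better = first["time"] < second["time"] or first["mse"] < second["mse"]
    return no_worse and better

winner = select_method(TABLE_5)
winner_dominates_all = all(
    dominates(TABLE_5[winner], score) for name, score in TABLE_5.items() if name != winner
)
```

On these scores the winner is better than every other method on both indicators, so the outcome does not depend on how the two criteria are weighted or on the time budget chosen.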

The proposed methodology can handle datasets with a high missingness rate, and is also suitable for high-dimensional data. It is a flexible method that can take into account complex non-linear relationships between variables (if the competing methods are non-linear). It makes it possible to automate the selection of the best method to solve missingness, which reduces the workload of the data analyst, who can focus on tasks with higher added value, aiming at extracting knowledge. However, a few limitations are worth noting, particularly concerning the number of cycles preset to 10, and the relatively low number of candidate hyperparameter values (which does not allow a rigorous sensitivity analysis of these hyperparameters). Furthermore, these parameters are tuned using an artificial dataset which has been constructed by modifying the real one. All these limitations are mainly due to algorithmic complexity, which constitutes by itself a challenge as well as a great scientific issue.

5. Conclusion and perspectives

It is widely acknowledged that data-driven methods provide powerful algorithms to analyze any issue that is of interest for decision-makers. However, performing such analyses with incomplete data may not support reliable decisions. In this paper, a methodology for selecting the best algorithms to address the issue of data imputation, in the context of water quality assessment, has been proposed. A benchmark of four of the most powerful and commonly used ML algorithms has been performed for that purpose (Random Forest, Boosted Regression Trees, K-Nearest Neighbors, Support Vector Regression). The results showed that MICE-SVR is the best in that it converges faster than the three others, and provides the best performance (notably in terms of average prediction error). It can then be applied to datasets with high missingness, including data for water quality assessment that are often incomplete, as in the case of Adour (south-west of France) considered in the present study.

Based on the weaknesses of the proposed method, as mentioned in the discussion of the results, the following improvements are planned for further studies: (1) further automate the mechanism of the model selection by setting fuzzy rules in an inference engine that will aggregate all the performance indicators into a single indicator; (2) improve, for each competing method, the optimal choice of the hyperparameters using evolutionary algorithms in order to speed up the computing time and increase the number of values for each hyperparameter; (3) automate the choice of the number of cycles needed for the convergence of the imputations by taking into account the size of the data and its missingness rate; (4) introduce the temporal dimension within the imputation process.

Conflict of interest

There is no conflict of interest.

Credit authorship contribution statement

Romy Ratolojanahary: Conceptualization, Methodology, Software, Formal analysis, Investigation, Writing - original draft. Raymond Houé Ngouna: Conceptualization, Methodology, Investigation, Writing - original draft, Writing - review & editing, Supervision. Kamal Medjaher: Conceptualization, Methodology, Investigation, Writing - original draft, Writing - review & editing, Supervision, Funding acquisition. Jean Junca-Bourié: Investigation, Resources, Supervision, Funding acquisition. Fabien Dauriac: Investigation, Resources, Funding acquisition. Mathieu Sebilo: Investigation, Writing - review & editing, Supervision.

References

Azur, M. J., Stuart, E. A., Frangakis, C., & Leaf, P. J. (2011). Multiple imputation by chained equations: What is it and how does it work? International Journal of Methods in Psychiatric Research, 20(1), 40–49. doi: 10.1002/mpr.329.

Betrie, G. D., Sadiq, R., Tesfamariam, S., & Morin, K. A. (2014). On the issue of incomplete and missing water-quality data in mine site databases: Comparing three imputation methods. Mine Water and the Environment, 35(1), 3–9. doi: 10.1007/s10230-014-0322-4.

Buhi, E. (2008). Out of sight, not out of mind: Strategies for handling missing data. American Journal of Health Behavior, 32(1). doi: 10.5993/ajhb.32.1.8.

van Buuren, S. (2007). Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16(3), 219–242. doi: 10.1177/0962280206074463.

van Buuren, S. (2018). Flexible imputation of missing data. Chapman and Hall/CRC.

van Buuren, S., & Groothuis-Oudshoorn, K. (2011). Mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3). doi: 10.18637/jss.v045.i03.

Clark, T. G., & Altman, D. G. (2003). Developing a prognostic model in the presence of missing data. Journal of Clinical Epidemiology, 56(1), 28–37. doi: 10.1016/s0895-4356(02)00539-5.

Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (2007). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science, 8(3), 206–213. doi: 10.1007/s11121-007-0070-9.

Honaker, J., King, G., & Blackwell, M. (2011). Amelia II: A program for missing data. Journal of Statistical Software, 45(7). doi: 10.18637/jss.v045.i07.

Hong, T.-P., & Wu, C.-W. (2011). Mining rules from an incomplete dataset with a high missing rate. Expert Systems with Applications, 38(4), 3931–3936. doi: 10.1016/j.eswa.2010.09.054.

Jordanov, I., Petrov, N., & Petrozziello, A. (2018). Classifiers accuracy improvement based on missing data imputation. Journal of Artificial Intelligence and Soft Computing Research, 8(1). doi: 10.1515/jaiscr-2018-0002.

Little, R. J. A., & Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. John Wiley & Sons, Inc. doi: 10.1002/9780470316696.

Neter, J., Maynes, E. S., & Ramanathan, R. (1965). The effect of mismatching on the measurement of response errors. Journal of the American Statistical Association, 60(312), 1005–1027. doi: 10.1080/01621459.1965.10480846.

Raghunathan, T. E., Lepkowski, J. M., Hoewyk, J. V., & Solenberger, P. (2001). A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology, 27(1), 85–95.

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592. doi: 10.1093/biomet/63.3.581.

Rubin, D. B., & Schafer, J. L. (1990). Efficiently creating multiple imputations for incomplete multivariate normal data. In Proceedings of the statistical computing section of the American statistical association.

Shao, J., Meng, W., & Sun, G. (2016). Evaluation of missing value imputation methods for wireless soil datasets. Personal and Ubiquitous Computing, 21(1), 113–123. doi: 10.1007/s00779-016-0978-9.

Smola, A. J., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14(3), 199–222. doi: 10.1023/b:stco.0000035301.49549.88.

Stekhoven, D. J., & Buhlmann, P. (2011). MissForest–non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112–118. doi: 10.1093/bioinformatics/btr597.

Sterne, J. A. C., White, I. R., Carlin, J. B., Spratt, M., Royston, P., Kenward, M. G., et al. (2009). Multiple imputation for missing data in epidemiological and clinical research: Potential and pitfalls. BMJ, 338(jun29 1), b2393–b2393. doi: 10.1136/bmj.b2393.

Tutz, G., & Ramzan, S. (2015). Improved methods for the imputation of missing data by nearest neighbor methods. Computational Statistics & Data Analysis, 90, 84–99. doi: 10.1016/j.csda.2015.04.009.

White, I. R., Royston, P., & Wood, A. M. (2010). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30(4), 377–399. doi: 10.1002/sim.4067.

Yang, C., Liu, J., Zeng, Y., & Xie, G. (2019). Real-time condition monitoring and fault detection of components based on machine-learning reconstruction model. Renewable Energy, 133, 433–441. doi: 10.1016/j.renene.2018.10.062.

Zhao, L., Chen, Z., Yang, Z., Hu, Y., & Obaidat, M. S. (2018). Local similarity imputation based on fast clustering for incomplete data in cyber-physical systems. IEEE Systems Journal, 12(2), 1610–1620. doi: 10.1109/jsyst.2016.2576026.