
5.2 Materials and methods

5.2.2 Imputation techniques

Four single-value imputation techniques were selected for this study: (i) mean imputation (mean), (ii) iterative least squares (ls), (iii) k-nearest neighbours (kNN), and (iv) a random forest-based algorithm, missForest (mF). All techniques were initially applied with their default settings and, if applicable, tested for potential optimisation via (i) the inclusion of additional information and (ii) iterative hyperparameter tuning.

Imputation of the mean is the simplest approach and can be applied at the instance or the variable level. Despite its application within microarray research (Troyanskaya et al., 2001), instance-wise imputation of the mean is not considered appropriate for environmental data; hence, a variable-wise imputation is applied. Imputation is performed via the Hmisc package (Harrell, 2018).
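As a minimal sketch (the data frame and variable names are illustrative, not from this study), variable-wise mean imputation with Hmisc could look as follows:

```r
library(Hmisc)

# Illustrative data: each column is an environmental variable with gaps
df <- data.frame(
  temperature  = c(12.1, NA, 14.3, 13.0),
  conductivity = c(450, 480, NA, 465)
)

# Hmisc::impute() operates on a single vector, so it is applied column by
# column; each missing value is replaced by its own column's mean.
df_imputed <- as.data.frame(lapply(df, function(x) Hmisc::impute(x, fun = mean)))
```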

The iterative least squares method assumes an underlying linear relationship among the variables within the data set, which explains its successful application in microarray analysis (Bø et al., 2004; Brock et al., 2008; Zhang et al., 2008) and suggests its potential for environmental data. Imputation follows the description by Bø et al. (2004): it starts with the imputation of the variable-wise mean, after which the covariance matrices (S) are determined and used to solve Equation 5.1. Following the first imputation, the means and covariance matrices are updated and new imputation values are determined until convergence. Here, a maximum of 10 iterations was run, as additional iterations resulted in only minor changes within the covariance matrix.

$$\hat{y}_i = \bar{y}_i + \boldsymbol{S}_{y_i\boldsymbol{x}}\,\boldsymbol{S}_{\boldsymbol{xx}}^{-1}(\boldsymbol{x} - \bar{\boldsymbol{x}}) \qquad \text{(Equation 5.1)}$$

with $\hat{y}_i$ the estimated value (to be imputed), $\bar{y}_i$ the average value over $y_1, \dots, y_n$, $\boldsymbol{S}_{y_i\boldsymbol{x}}$ the covariance matrix (vector) between the variable with the missing value and the remaining variables, $\boldsymbol{S}_{\boldsymbol{xx}}$ the covariance matrix among the remaining variables, $\boldsymbol{x} = [x_1, x_2, \dots, x_k]^T$ the variables' values for the considered instance and $\bar{\boldsymbol{x}} = [\bar{x}_1, \bar{x}_2, \dots, \bar{x}_k]^T$ the variables' average values.
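This scheme can be sketched in a few lines of R (a simplified illustration under the description above, not the exact implementation used in this chapter; function and variable names are ours):

```r
# Iterative least squares imputation after Bø et al. (2004), for a numeric
# matrix X containing missing values
impute_ls <- function(X, max_iter = 10) {
  miss <- is.na(X)
  # Step 1: baseline variable-wise mean imputation
  for (j in seq_len(ncol(X))) X[miss[, j], j] <- mean(X[, j], na.rm = TRUE)
  # Step 2: re-estimate means and covariances, then re-impute via Equation 5.1
  for (iter in seq_len(max_iter)) {
    mu <- colMeans(X)
    S  <- cov(X)
    for (i in seq_len(nrow(X))) {
      for (j in which(miss[i, ])) {
        x_idx <- setdiff(seq_len(ncol(X)), j)  # the remaining variables
        X[i, j] <- mu[j] +
          S[j, x_idx] %*% solve(S[x_idx, x_idx], X[i, x_idx] - mu[x_idx])
      }
    }
  }
  X
}
```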


The kNN approach is a distance-based method and uses the information of the knn closest neighbours of the instance with a missing value. Subsequently, the mean (or median) of these knn neighbours is used to replace the missing value, optionally weighted by the neighbours' distance from the instance. Within this study, imputation is based on the Gower distance and the distance-weighted average of the knn neighbours. At first, the default value of knn = 5 is considered for imputation, followed by an assessment of how NRMSE-based optimisation of knn can improve imputation performance. This optimisation is conducted for each combination in Table 4.1 at six levels of missing data and two repetitions (i.e. N = 144, see Appendix B.2). Imputation via kNN is applied via the VIM package (Kowarik and Templ, 2016).
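A hedged example of the corresponding VIM call (argument names follow the package documentation; the data frame df is illustrative):

```r
library(VIM)

# VIM's kNN() uses the Gower distance by default for mixed-type data;
# weightDist = TRUE aggregates the k nearest donors with distance-based
# weights, here via a weighted mean for numeric variables.
df_imputed <- kNN(df, k = 5, numFun = weighted.mean, weightDist = TRUE)

# Note: kNN() appends a logical indicator column per variable flagging the
# imputed cells; pass imp_var = FALSE to suppress these columns.
```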

Lastly, the missForest algorithm was introduced by Stekhoven and Bühlmann (2012) and relies on the random forest technique (see also Box 3.1). This technique belongs to the family of data-driven supervised machine learning methods based on classification and regression trees (CARTs) and has been reported to outperform more traditional methods, as it creates an ensemble of independent trees rather than a single tree (Stekhoven and Bühlmann, 2012; Waljee et al., 2013). As such, it can be considered a multiple-value imputation technique, although only a single imputed data set is obtained.

Imputation via random forest works iteratively, comparing each imputed value with its previous value and combining these in an overall difference. Baseline imputation is performed via variable-wise mean imputation, and the algorithm stops when the calculated difference, given by Equation 5.2 for continuous variables (see Stekhoven and Bühlmann (2012) for discrete variables), starts to increase again. Alternatively, the number of iterations can be defined a priori to avoid non-convergence errors.

$$\Delta_{\boldsymbol{X}} = \frac{\sum_{j=1}^{k}\left(\boldsymbol{D}^{new}_{imp} - \boldsymbol{D}^{old}_{imp}\right)^2}{\sum_{j=1}^{k}\left(\boldsymbol{D}^{new}_{imp}\right)^2} \qquad \text{(Equation 5.2)}$$

with $\boldsymbol{X}$ the set of $k$ continuous variables and $\boldsymbol{D}$ the data matrix.
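As a minimal sketch of Equation 5.2 (function and argument names are ours, not from the text), the difference between two successive imputations can be computed as:

```r
# Stopping criterion of Equation 5.2, evaluated over the continuous
# variables only; D_new and D_old are two successive imputed data matrices.
stopping_difference <- function(D_new, D_old, continuous_vars) {
  num <- sum((D_new[, continuous_vars] - D_old[, continuous_vars])^2)
  den <- sum(D_new[, continuous_vars]^2)
  num / den
}
```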

Within this chapter, random data sampling within missForest was performed without replacement and three hyperparameters were selected for optimisation: ntree, mtry and nodesize. At first, the hyperparameters were set at their default values (i.e. ntree = 100, mtry = √Nvar and nodesize = 1), with a maximum of 10 iterations. Subsequently, these hyperparameters were iteratively altered for each combination mentioned in Table 4.1 at all six levels of missing data and two repetitions (i.e. N = 144, see Appendix, Section B.2.2), followed by an analysis of the difference in performance. The missForest algorithm was implemented as part of the missForest package (Stekhoven, 2013).
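An illustrative call mirroring these settings (df is assumed to be a data frame with missing values; argument names follow the missForest documentation):

```r
library(missForest)

set.seed(1)
imp <- missForest(df,
                  maxiter  = 10,
                  ntree    = 100,
                  mtry     = floor(sqrt(ncol(df))),
                  replace  = FALSE,      # sampling without replacement
                  nodesize = c(1, 1))    # continuous, categorical minimum node size

imp$ximp      # the single imputed data set
imp$OOBerror  # out-of-bag error estimate (NRMSE for continuous variables)
```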


5.3 Results

All imputation methods obtained an NRMSE value lower than 1 in at least 94 % of the cases. The ranges differed, with ls showing the narrowest range (0.03 up to 2.36) and kNN the widest (0.05 up to 3.73). Both mean and mF scored in between, ranging from 0.89 up to 4.10 and from 0.06 up to 3.63, respectively (Figure 5.1). The best overall performance was obtained by mF (0.45 ± 0.27) and ls (0.47 ± 0.26), followed by kNN (0.53 ± 0.31), reflecting a clear difference from mean (0.97 ± 0.12).
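For reference, assuming the NRMSE definition of Stekhoven and Bühlmann (2012), the score can be sketched as follows (variable names are illustrative; the missForest package offers an equivalent helper, nrmse(ximp, xmis, xtrue)):

```r
# NRMSE over the originally missing entries only: the imputation error is
# normalised by the variance of the true values, so scores near 1 indicate
# performance comparable to mean imputation.
nrmse_manual <- function(x_imp, x_mis, x_true) {
  m <- is.na(x_mis)  # positions that were missing before imputation
  sqrt(mean((x_true[m] - x_imp[m])^2) / var(x_true[m]))
}
```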

Indeed, higher NRMSE values were observed for mean, as reflected by the scores of ls, kNN and mF being mostly situated underneath the agreement line (Figure 5.1). Moreover, the majority of the kNN results are positioned above the mF-based agreement line and, vice versa, the majority of the mF results are situated below the kNN-based agreement line (Figure 5.1). No clear difference is observed between the results for ls and mF, as indicated by NRMSE values on both sides of the ls- and mF-based agreement lines (Figure 5.1). These observations are confirmed by the adjusted Tukey test, showing that mean performed significantly worse than ls, kNN and mF (p < 0.001 for all pairwise tests), while the differences among the latter three methods were non-significant (p > 0.05).
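One possible base-R implementation of such a pairwise comparison (a sketch under the assumption of a long-format results table with columns nrmse and method; the exact adjustment procedure used in this chapter may differ):

```r
# One-way ANOVA of the NRMSE scores by imputation method, followed by
# Tukey's honestly significant difference test with adjusted p-values
fit <- aov(nrmse ~ method, data = results)
TukeyHSD(fit)
```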

Figure 5.1: General overview of the NRMSE scores for each imputation approach, conditional on the other methods. To improve visualisation, the y-axis range was chosen to match the x-axis range. Values below the agreement line indicate better performance of the method on the y-axis, while values above the agreement line indicate better performance of the method on the x-axis. Methods: mean: mean imputation; ls: iterative least squares; kNN: k-nearest neighbours; mF: the missForest algorithm. NRMSE: Normalised Root Mean Squared Error.


In the following sections, more specific results are presented, focusing on the methods' variability in performance and required computation time for (i) a fixed number of both variables and instances (i.e. Dopt), (ii) a varying number of instances, given a fixed number of variables (Nvar,opt and flexible Ninst) and (iii) a variety in dimensionality (flexible Nvar). A detailed overview of performance scores can be found in Table B.3.

Moreover, in order to support the obtained NRMSE scores with a variable- and technique-specific accuracy assessment, two case studies are provided in Appendix B.4: (i) a small data set (5 variables, 5385 instances) with 1 % missing data and (ii) the optimal data set (10 variables, 17 264 instances) with 50 % missing data. The latter is based on the description of the common data in Section 4.2.1.3. Based on these results, mF seemed to perform best for imputing both extensive and confined variables, while kNN and ls proved less applicable for extensive and confined variables, respectively.