


5.3.3 Dimensionality variability

Inclusion of additional variables was expected to increase imputation performance, despite the accompanying reduction in sample size and an increased risk of model overfitting. Overfitting arises when a high number of variables is used to explain or describe the patterns within the data, and is characterised by reduced accuracy outside the training range. Hence, despite the increased explanatory power of additional variables, imputation accuracy can decrease. Still, dimensionality clearly affected imputation performance, with error generally decreasing as dimensionality increased (Figure 5.4). Only kNN did not show a monotonic increase in performance when 50 % or more of the data was missing, but instead performed worst at intermediate dimensionality (0.97 ± 0.06 at Nvar = 5 versus 1.05 ± 0.06 at Nvar = 10, with 75 % missing).
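The NRMSE values reported above can be illustrated with a short sketch. The chapter does not restate its exact normalisation, so the version below assumes the missForest convention (RMSE over the imputed entries, divided by the standard deviation of the corresponding true values); under that convention, mean imputation scores close to 1 by construction, which matches the magnitudes quoted for kNN at high missingness.

```python
import numpy as np

def nrmse(x_true, x_imp, mask):
    """NRMSE over the imputed entries only, normalised by the variance of
    the true values (missForest convention -- an assumption here, as the
    chapter does not restate its exact formula)."""
    diff = x_true[mask] - x_imp[mask]
    return np.sqrt(np.mean(diff ** 2) / np.var(x_true[mask]))

# Toy example: hide 20 % of a small matrix, then impute with column means.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 5))
mask = rng.random(x.shape) < 0.2
x_mean = x.copy()
x_mean[mask] = np.nan
col_means = np.nanmean(x_mean, axis=0)
x_mean[mask] = np.take(col_means, np.where(mask)[1])
print(round(nrmse(x, x_mean, mask), 2))  # close to 1 for mean imputation
```

Values below 1 therefore indicate an imputation that recovers more structure than simply replacing missing entries with the column mean.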

The saturated model indicated a significant overall interaction and showed that including a main effect and an interaction with imputation method significantly improved model fit (p < 0.001). Interaction coefficients indicated a stronger effect of dimensionality on mF (β_mF:Var = −0.067) than on ls and kNN (β_ls:Var = −0.044 and β_kNN:Var = −0.025), causing the overall significant differences between ls, kNN and mF in baseline performance (see earlier section) to disappear. Still, they performed significantly better than mean imputation, regardless of the degree of missing data and dimensionality (all p < 0.001), except for kNN at 75 % missing values and only 5 variables. In contrast, with 50 % or less of the data missing and only 5 variables, kNN performed similarly to ls and mF, yet the performance gap widened when 10 (mF) or 15 (ls) variables were available (p < 0.05). Differences between ls and mF were mostly non-significant, except at increased dimensionality (≥ 10 variables) and elevated degrees of missing data (≥ 20 %).

Both mF and kNN required maximal computation time at intermediate dimensionality (Nvar = 10, up to 2100 ± 400 s for mF) and minimal computation time at the highest dimensionality (Nvar = 15, 220 ± 60 s for mF at 1 % missing) (Figure 5.4). Surprisingly, at high dimensionality the computation time of kNN and mF did not change clearly along the range of missing data, while reduced dimensionality showed a pattern similar to that in Figure 5.2. In contrast, neither mean nor ls was clearly affected by dimensionality or the degree of missing data (Figure 5.4).
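Computation-time comparisons of this kind can be reproduced with simple wall-clock timing. The sketch below is generic (the chapter's own benchmarking setup and tooling are not described in this excerpt) and uses scikit-learn's mean and kNN imputers as stand-ins for the methods compared in Figure 5.4.

```python
import time
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

def timed_impute(imputer, x_obs):
    """Impute and return (result, wall-clock seconds). A generic sketch for
    comparing the computational cost of imputation methods; not the
    chapter's own benchmarking code."""
    t0 = time.perf_counter()
    out = imputer.fit_transform(x_obs)
    return out, time.perf_counter() - t0

rng = np.random.default_rng(0)
x_obs = rng.normal(size=(500, 10))
x_obs[rng.random(x_obs.shape) < 0.2] = np.nan

_, t_mean = timed_impute(SimpleImputer(strategy="mean"), x_obs)
_, t_knn = timed_impute(KNNImputer(n_neighbors=5), x_obs)
print(f"mean: {t_mean:.4f} s, kNN: {t_knn:.4f} s")
```

Repeating such runs over a grid of sample sizes, dimensionalities and missing-data rates yields timing curves of the shape shown in Figure 5.4.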

IMPUTATION OF MISSING DATA

Figure 5.4: Effect of dimensionality on performance and required time of four imputation methods. Not only the number of variables, but also the sample size differs among data sets. The top row (NRMSE) represents performance, while the second row represents the required computation time. Columns represent the different degrees of missing data used. Data of 10 repetitions (maximum number of instances for a specific number of variables, different missing values) are reported. Symbols represent the average for each combination, while vertical lines represent the standard deviation. Methods: mean: mean imputation; mF: the missForest algorithm; kNN: k nearest neighbours and ls: iterative least squares. NRMSE: Normalised Root Mean Squared Error.

5.3.4 Optimisation

Preliminary assessment showed that additional typological information did not improve imputation performance (see Appendix, Figure B.1); hence, this was not considered for further elaboration. In contrast, altering hyperparameter settings often improved performance and was therefore included in subsequent analyses. Specific effects of each individual hyperparameter were considered beyond the current scope and merit additional study.

By default, kNN considers five neighbours, yet the optimised knn value ranged from 1 up to 47, with a median value of 9. Almost 33 % of the knn values were equal to or lower than 5, while another 33 % ranged from 15 up to 47. In general, data sets with low rates of missing data (fMD ≤ 10 %) supported improved imputation when low knn values were applied (knn ≤ 10) and vice versa for data sets with elevated rates of missing data (Figure 5.5).
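The tuning of knn described above can be sketched as a simple hold-out procedure: hide an additional fraction of the observed entries, impute with each candidate k, and keep the value that best recovers the hidden entries. This is a generic illustration of the idea, not the chapter's exact optimisation loop; the candidate grid and hold-out fraction below are assumptions.

```python
import numpy as np
from sklearn.impute import KNNImputer

def choose_k(x_obs, ks=(1, 5, 10, 25, 47), holdout_frac=0.1, seed=0):
    """Pick the number of neighbours by hiding a further fraction of the
    observed entries and scoring each k on how well it recovers them.
    (A generic tuning sketch, not the chapter's optimisation procedure.)"""
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(x_obs)
    hide = observed & (rng.random(x_obs.shape) < holdout_frac)
    x_train = x_obs.copy()
    x_train[hide] = np.nan
    best_k, best_err = None, np.inf
    for k in ks:
        x_imp = KNNImputer(n_neighbors=k).fit_transform(x_train)
        err = np.sqrt(np.mean((x_imp[hide] - x_obs[hide]) ** 2))
        if err < best_err:
            best_k, best_err = k, err
    return best_k

# Correlated toy data with 5 % missing values.
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 1)) + rng.normal(scale=0.1, size=(200, 6))
x_obs = x.copy()
x_obs[rng.random(x.shape) < 0.05] = np.nan
print(choose_k(x_obs))
```

Consistent with the pattern in Figure 5.5, such a procedure tends to select small k at low missingness (many reliable close neighbours) and larger k when much of the data is missing.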

CHAPTER 5


Figure 5.5: Selected number of neighbours to be considered after optimisation. Optimised knn values were determined for the six classes of missing data (0.01, 0.05, 0.1, 0.2, 0.5 and 0.75), represented by two repetitions of each possible combination of sample size (number of instances) and dimensionality (number of variables), hence a total of 144 data sets. Values range from 1 up to 47, with low values being selected when imputing data sets with limited rates of missing values and vice versa for data sets with high rates of missing values. Boxes span the central 50 % of the values around the median, while whiskers extend from the first and third quartiles to the last case within 1.5 times the interquartile range.

Similar patterns could not be identified for mF due to the simultaneous alteration of three hyperparameters during the iteration process, yet observations suggested that the majority of the data sets were imputed with higher accuracy when fewer individual trees were constructed and more variables were randomly selected at each split (see Table 5.2 and Appendix, Figure B.5 and Figure B.7). For instance, ntree ranged from 5 up to 225, with the majority of the data sets requiring fewer than the default number of trees (i.e. ntree = 100).

Indeed, 75 % of the data sets required 84 trees or less to improve imputation accuracy, while median values for mtry were similar to (Nvar = 5) or higher than (Nvar > 5) the default value (Table 5.2). Quantitative improvements in NRMSE values were, in general, smaller than 0.25 and relatively unaffected by both the rate of missing data and the default performance for both methods (Figure 5.6). On average, the absolute decrease in NRMSE values between the default and optimised imputation settings was 0.05 ± 0.05 (N = 144) for both methods.
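A missForest-style imputation with tunable ntree, mtry and nodesize can be approximated in Python as below. This is a sketch, not the R missForest implementation the hyperparameter names suggest: scikit-learn's IterativeImputer with a random-forest estimator plays the role of the iterative forest-based imputation, and ntree/mtry/nodesize are mapped onto n_estimators, max_features and min_samples_leaf.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

def mf_impute(x_obs, ntree=100, mtry=None, nodesize=1):
    """missForest-style imputation via scikit-learn's IterativeImputer with
    a random-forest estimator. An approximation of the missForest
    algorithm, not a drop-in replacement; ntree, mtry and nodesize map
    onto n_estimators, max_features and min_samples_leaf."""
    rf = RandomForestRegressor(
        n_estimators=ntree,
        max_features=mtry if mtry is not None else "sqrt",
        min_samples_leaf=nodesize,
        random_state=0,
    )
    return IterativeImputer(estimator=rf, max_iter=5,
                            random_state=0).fit_transform(x_obs)

# Toy run with fewer trees and a larger mtry, mirroring the tendency in
# Table 5.2 (fewer trees, more candidate variables per split).
rng = np.random.default_rng(2)
x = rng.normal(size=(80, 5))
x_obs = x.copy()
x_obs[rng.random(x.shape) < 0.2] = np.nan
x_imp = mf_impute(x_obs, ntree=25, mtry=2)
print(x_imp.shape)
```

Wrapping such a call in a small grid search over (ntree, mtry, nodesize) and scoring each setting on held-out entries reproduces the type of optimisation summarised in Table 5.2.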


Table 5.2: Summarising statistics for the optimised hyperparameter values of mF.

Optimised values were determined for the six classes of missing data (0.01, 0.05, 0.1, 0.2, 0.5 and 0.75), represented by two repetitions of each possible combination of sample size (number of instances) and dimensionality (number of variables), hence a total of 144 data sets. In general, the majority of the data sets benefit when imputation is performed with fewer individual trees (ntree) and more variables to be considered for each split (mtry).

Parameter          Default  Min  Q1  Median  Mean  Q3  Max
ntree              100      5    25  50      62    84  225
mtry (Nvar = 5)    2        1    1   2       2     3   4
mtry (Nvar = 10)   3        1    2   3       4     5   9
mtry (Nvar = 15)   3        2    4   6       7     9   14
nodesize           1        1    1   2       2     2   6

Figure 5.6: Error reduction following optimisation of hyperparameter settings of the kNN and missForest algorithm. The difference is calculated as the NRMSE value after optimisation minus the NRMSE value under default settings (Baseline NRMSE). The horizontal grey dotted line represents the reference condition (i.e. no change in NRMSE), with symbols below it reflecting improved performance and symbols on it reflecting no change. Selected hyperparameters included the number of neighbours (kNN) and, for missForest (mF), the number of individual trees, the number of variables to be considered for each split and the final node size.



5.4 Discussion