Individual pre-processing .1 Instance-based removal .1 Instance-based removal

data pre-processing 4

6.2 Materials and method

6.3.2 Individual pre-processing .1 Instance-based removal .1 Instance-based removal

Excessively deviating instances were removed from the dataset for τo values ranging between 0 and 15. Generally, a decrease in model performance is obtained by outlier removal, yet shows to be relatively stable as soon as the most excessive outliers are removed (i.e. 15 < τo < 10). Further threshold reduction (τo → 5) considers more instances to be outliers, though causes only limited reduction of model performance for P.

australis, L. minor and C. demersum, while models for M. aquatica and L. minuta already indicate a performance decrease when τo drops below 7. Overall, the effect of outlier removal on model performance is relatively limited, with a maximum decrease in AUC of 0.05 (C. demersum, see Figure 6.4).

In contrast, required computation time continuously decreases over the applied range for τo, showing a larger initial effect for P. australis compared to the other macrophytes.

Moreover, time reduction shows a dependency on data availability with a gradual reduction for P. australis and a more abrupt reduction for L. minuta for τo-values smaller than 5. Similar patterns are observed for overall computation time, including the dependency on data availability (see Appendix, Figure C.6). For instance, an overall beneficial effect of outlier removal is observed for P. australis, while model development for L. minuta indicates to be negatively affected. In order to avoid an excessive performance decrease for low data-availability species, while already providing a 10-30

% reduction in computation time for high data-availability species, a threshold value of τo = 3 can be derived, resulting in a removal of about 760 instances (see Appendix, Figure C.2).

CHAPTER 6

140

Figure 6.4: Effect of outlier-based instance removal on model performance and computation time. Removal of outliers has, at first, a limited effect on performance and computation time (except for P. australis). A slight decrease in performance is observed when more deviating values are considered as outliers (τo → 0), while causing the required computation time to decrease. A visual trade-off between performance and computation time supports a threshold of τo = 3 (dashed grey line). Performance analyses for all 58 species can be found in Appendix, Figure C.7 and Figure C.8.

The removal of false absences provides a positive effect on model performance, with AUC values increasing as τa decreases, without reaching a plateau (Figure 6.5). As the threshold becomes more strict (i.e. τa → 0 %), performance keeps increasing up to net AUC improvements of 0.2 (L. minor). In general, patterns among macrophytes are relatively similar and show performance improvements for conservative threshold values (i.e. τa = 15 %), causing AUC scores to increase with about 0.05 (Figure 6.5).

Similar analyses can be performed for the remaining macrophytes.

In contrast, computation time assessment indicates the existence of a species-specific tipping point for τa, below which computation time decreases drastically. These tipping points are related to overall data availability after false absence removal. For instance, data for P. australis originally represents about 1700 presences and around 2600 absences. As the threshold becomes stricter, more absences are removed, rising to 1000 at τa = 7 % and 2000 at τa = 0 % (see Appendix, Figure C.3), which results in only 1600 and 600 absences remaining, respectively. These absences are lower than the number of presences, which requires subsampling of the latter to create a balanced training set for model development. The resulting decrease in data size reduces the required computation time as less instances need to be classified.

DATA PRE-PROCESSING

141 A similar pattern is present in overall computation time, additionally showing an increase when data availability is too low to support faster model development (e.g. L.

minuta) (see Appendix, Figure C.6). Consequently, any threshold value will affect model performance positively, yet selecting low values for τa (e.g. τa < 5 %) not only improves performance, but also causes high numbers of instances to be eliminated (up to 1200 instances, see Appendix, Figure C.3). A trade-off threshold value of τa = 5 % is suggested to avoid excessive removal.

Figure 6.5: Effect of false-absence-based instance removal on model performance and computation time. A continuous increase in performance is observed for each macrophyte as the threshold becomes more strict. In contrast, computation time remains relatively stable at first, while a sharp decrease is observed for some macrophytes as τa decreases. A trade-off between model performance, computation time and the consequences of removing too much ambiguous instances provides a compromise at τa = 5 % (dashed grey line). Performance analyses for all 58 species can be found in Appendix, Figure C.9 and Figure C.10.

6.3.2.2 Variable-based removal

Correlated variables provide similar information, yet removal of these variables goes along with removal of information, illustrated by a decrease in model performance for decreasing correlation thresholds (Figure 6.6). At high threshold values (i.e. τc > 0.85), the reduction in performance remains relatively limited while at extremely low threshold values (i.e. τc < 0.40) a clear decrease in AUC values is observed due to the limited amount of shared information between limitedly-correlated variables.

CHAPTER 6

142

Simultaneously, however, a gain in computation time is observed, following an overall dimensionality reduction within the search space caused by a decreased number of variables. The different plateaus observed within the time-specific graphs illustrate the inherent characteristics of the algorithm, selecting only a subset of all variables for each split within the tree. This number is based on the number of available variables and defined as 𝑚𝑡𝑟𝑦 = √𝑁_𝑣𝑎𝑟, being rounded to the lower integer. Hence, as soon as Nvar

decreases sufficiently, mtry will drop with 1 unit, causing less variables to be selected and, consequently, less potential splitting points to be considered. Therefore, plateaus exist between each drop, as mtry does not change with every variable being removed.

Similar patterns are observed for overall computation time, showing generally faster data pre-processing and model development, though patterns become less clear as data availability decreases (see Appendix, Figure C.6). Threshold selection based on these results is not straightforward, yet was chosen at τc = 0.70 to avoid removal of more than 10 variables (see Appendix, Figure C.4).

Figure 6.6: Effect of correlation-based variable removal on model performance and computation time. Removal of correlated variables has no straightforward effect on performance, though a limited effect on performance and computation time is observed at first (τc

> 0.9). Required computation time decreases with variable removal and illustrates the characteristic plateaus related with algorithm settings. Selection of an intermediate threshold value (i.e. τc = 0.7) considers variables with a relatively high correlation. Performance analyses for all 58 species can be found in Appendix, Figure C.11 and Figure C.12.

DATA PRE-PROCESSING

143 A reduction in performance is also observed following the removal of irrelevant variables. As the required contribution of each variable increases (i.e. τi → 50 %) AUC values decrease to reach a macrophyte-specific plateau (Figure 6.7) caused by many variables being removed (see Appendix, Figure C.5). Nevertheless, at low threshold values (i.e. τi < 10 %) model performance is hardly affected due to the removal of mostly irrelevant variables, while providing a minor decrease in computation time (1 to 6 %).

Similar to correlation-based variable removal, computation time decreases when more variables are eliminated, reaching a species-specific plateau. However, overall computation time tends to increase as the calculation of variable importance requires an additional model to be developed, being the main contributor to the overall required time (see Appendix, Figure C.6). Threshold setting at τi = 10 % was supported by visual assessment of performance, computation time and number of variables being removed.

Figure 6.7: Effects of importance-based variable removal on model performance and computation time. Removal of irrelevant variables has, at first, limited effect on performance and computation time. A clear decrease in performance can be observed as soon as relative importance scores exceed 20 %. In contrast, effects on computation time are already visible when removing the most irrelevant variables (τi < 15 %). Threshold selection at τi (10 %, dashed grey line) illustrates the technique-specific trade-off between performance and speed. Performance analyses for all 58 species can be found in Appendix, Figure C.13 and Figure C.14.

CHAPTER 6

144

Dans le document Data-driven models and trait-oriented experiments of aquatic macrophytes to support freshwater management (Page 169-174)