6.4.2 Implications for environmental research

Raw environmental data harbours a wealth of information, hidden in complex patterns and a significant amount of noise. Eliminating this noise simplifies pattern discovery and the development of species distribution hypotheses. The qualitative trade-off analyses performed here provided threshold values for the identification and elimination of outliers (τo = 3), false absences (τa = 5 %), correlated variables (τc = 0.7) and irrelevant variables (τi = 10 %). Although such thresholds are frequently applied within correlative ecological modelling, their values are rarely reported and often case-specific, underlining the need for a solid conceptual framework that ensures sound and comparable results and conclusions to support decision-making (Kotsiantis et al., 2006; Zhang et al., 2003).
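As an illustration of how two of these thresholds could act as filters, the following minimal sketch applies τo and τc to a small data set. It rests on assumptions that are not confirmed by this section: τo is read as a maximum number of standard deviations from the column mean, τc as a maximum absolute Pearson correlation, and correlated variables are dropped greedily in the order in which they are listed. The function names are hypothetical and do not reproduce the implementation used in this chapter.

```python
from statistics import mean, pstdev

TAU_O = 3.0  # outlier threshold from the text; the z-score reading is an assumption
TAU_C = 0.7  # correlation threshold from the text; the |Pearson r| reading is an assumption


def drop_outliers(rows, tau_o=TAU_O):
    """Keep rows whose values all lie within tau_o standard deviations of their column mean."""
    columns = list(zip(*rows))
    centres = [mean(col) for col in columns]
    spreads = [pstdev(col) for col in columns]
    kept = []
    for row in rows:
        # Constant columns (spread of zero) cannot flag an outlier and are skipped.
        if all(sd == 0 or abs(value - mu) / sd <= tau_o
               for value, mu, sd in zip(row, centres, spreads)):
            kept.append(row)
    return kept


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 when either series is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def drop_correlated(columns, tau_c=TAU_C):
    """Greedy filter: keep a variable only if |r| <= tau_c with every variable kept so far.

    columns maps a variable name to its list of values.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[other])) <= tau_c for other in kept):
            kept.append(name)
    return kept
```

Because the greedy filter keeps whichever variable of a correlated pair comes first, the order of the variables matters. In practice that order would have to be chosen deliberately, for example by ecological relevance.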

Unfortunately, data collection and cleaning remain expensive steps within species distribution studies (Zhang et al., 2003). To start, data collection by means of field campaigns is time-, energy- and budget-intensive, causing researchers to refrain from data removal and data sharing, which increases the need for thorough data cleaning (Catalano et al., 2019). Recent movements towards open data and uniform databases (e.g. the Global Biodiversity Information Facility, GBIF) have eased the process of gathering occurrence information, thereby causing an exponential growth in occurrence-based modelling of habitat suitability and species distributions (Peterson et al., 2015). Yet, the available data should be used with care, as its quality depends on the preferences of the original data owner (Maldonado et al., 2015), which makes data reliability an additional aspect to consider within correlative habitat suitability and species distribution modelling. For instance, herbaria and museums are increasingly improving data availability by digitising their collections, though these observations often bias results as they lack detailed georeferencing (Maldonado et al., 2015; Peterson et al., 2015). In addition, due to the high variability in data quality, data cleaning can take up to 80 % of all time spent on a research project (Zhang et al., 2003). Even when cleaning is automated, further tuning remains necessary to find appropriate threshold values.

Here, the selected techniques have been tuned manually to act as a filter for the data to be used, although they could also be included in the model development algorithm as wrapper functions with tuneable hyperparameters (e.g. Boets et al. (2013a), Gobeyn et al. (2017)). Moreover, alternative approaches exist, including visual outlier identification (Gobeyn et al., 2017), distance-based pseudo-absence selection, input variable selection by means of Genetic Algorithms (D'Heygere et al., 2003; Gobeyn et al., 2017), variable transformation (Kotsiantis et al., 2006) and variable construction (Kotsiantis et al., 2006). Each of these techniques includes some kind of user-dependent threshold selection and influences model performance and output (including decision-making) differently. This underlines the need for a well-developed framework to support sound model development.
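To make the wrapper idea concrete, the sketch below treats a pre-processing threshold as a hyperparameter that is selected inside the model development loop instead of being fixed beforehand. It uses a plain exhaustive search over candidate values as a stand-in and is not the Genetic Algorithm approach of the cited studies. The names `tune_threshold`, `apply_filter` and `fit_and_score` are hypothetical placeholders for a filter and a model-fitting routine supplied by the user.

```python
def tune_threshold(data, candidates, apply_filter, fit_and_score):
    """Return (best_threshold, best_score) over the candidate thresholds.

    apply_filter(data, tau) returns the filtered data for threshold tau.
    fit_and_score(filtered) fits a model and returns a score (higher is better).
    Returns None when every candidate removes all data.
    """
    best = None
    for tau in candidates:
        filtered = apply_filter(data, tau)
        if not filtered:
            continue  # this threshold discards everything, so nothing can be fitted
        score = fit_and_score(filtered)
        if best is None or score > best[1]:
            best = (tau, score)
    return best
```

A call such as `tune_threshold(rows, [2, 3, 4], my_filter, my_model_score)` would then return the candidate with the highest score. Selecting on score alone ignores computation time and data loss, both of which were weighed in the manual tuning described in this chapter.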

6.4.3 Contribution to the study objective

The aim of this chapter was to assess the effects of technique-specific threshold selection on model performance and the required computation time in order to provide guidelines for further pre-processing of the adopted Limnodata Neerlandica. Throughout the chapter, threshold values were altered to infer their effect on model performance and to allow a trade-off between model performance, computation time and data loss. Considering these ranges provided a firmer basis for proposing a set of threshold values that support after-imputation data cleaning within the overall study objective (see Section 1.2.1). As in Chapter 5, this chapter contributes mostly to the overall study objective, while providing suggestions for application outside the considered framework. More specifically, it is recommended to perform similar analyses with different combinations of environmental variables and species occurrences to support empirical threshold selection.
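The range assessment described above can be sketched as a simple sweep that records, for each candidate threshold, the three quantities entering the trade-off: model performance, time needed for model development and the fraction of data retained. This is a minimal sketch under stated assumptions. The names `sweep_threshold`, `apply_filter` and `fit_and_score` are hypothetical, and only the model fitting step is timed here, not the pre-processing itself.

```python
import time


def sweep_threshold(rows, thresholds, apply_filter, fit_and_score):
    """Record score, fitting time and retained fraction for each candidate threshold."""
    results = []
    for tau in thresholds:
        kept = apply_filter(rows, tau)
        started = time.perf_counter()
        # A threshold that removes all rows leaves nothing to fit, hence score None.
        score = fit_and_score(kept) if kept else None
        elapsed = time.perf_counter() - started
        results.append({
            "threshold": tau,
            "retained": len(kept) / len(rows),  # 1.0 means no data loss
            "score": score,
            "seconds": elapsed,
        })
    return results
```

The resulting list can be plotted per column against the threshold value, which gives the kind of visual overview from which a compromise threshold is then picked by eye.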

The chapter complies with the recommendation to perform data pre-processing prior to data-driven model development in order to eliminate noise within publicly available data (Maldonado et al., 2015). Noise was expected to be present in the Limnodata Neerlandica, as the data was collected by various companies and institutions over a period of thirty years (see Section 4.2.1). More specifically, this noise was expected to be present in the instances (i.e. extremely deviating values, recording of false absences) and among the variables (i.e. correlations and non-influential variables), with a potential to negatively affect model performance (Murphy et al., 2010). In literature, noise elimination through data pre-processing is often done in a partial and subjective manner (e.g. Forio et al. (2018), Fox et al. (2017), Gobeyn et al. (2017)), though it deserves more scrutiny due to its negative effect on data availability.

In general, the removal of noise (outliers, false absences, correlated and irrelevant variables) supported the expected changes in model performance, although three out of four methods caused a decrease in the performance metric score (see Section 6.3.2). Only the removal of false absences affected model performance positively, mainly due to a clearer delineation of the realised niche. Owing to the performed range assessment, threshold values for the pre-processing of the imputed Limnodata Neerlandica could be defined via a visual trade-off between model performance, computation time and data availability, resulting in thresholds for the elimination of outliers (τo = 3), false absences (τa = 5 %), correlated variables (τc = 0.7) and irrelevant variables (τi = 10 %). Performing such a visual trade-off introduces a certain degree of subjectivity, yet this is considered to be lower than when simply adopting thresholds from similar studies. More importantly, the implementation of these pre-processing thresholds creates species-specific data sets, which support the construction of qualitative models to describe the abiotic suitability of wetland habitats for specific aquatic macrophytes.

6.5 Conclusion

Occurrence data contain valuable information on species distribution patterns and dynamics, but require data cleaning prior to pattern inference. During cleaning, data is unavoidably lost as environmental domains become more strictly delineated. Identification and elimination of outliers and of variables that are correlated or irrelevant inherently increase the potential overlap of presence and background domains, while discarding potential false absences supports the identification of more distinct (yet less detailed) environmental niches. Accordingly, a decrease or increase in model performance is observed whenever data quality improvement leaves the environmental domains of presences and absences with, respectively, more or less relative overlap. In contrast, the computation time required for model development decreases for each type of data cleaning, although once the pre-processing step itself is included, overall computation time is either lower or higher than without data pre-processing, depending on the applied technique. A visual trade-off analysis of performance and computation time, supplemented with the effects of threshold selection on the sample size or dimensionality of the data, identifies thresholds for the elimination of outliers (τo = 3), false absences (τa = 5 %), correlated variables (τc = 0.7) and irrelevant variables (τi = 10 %), while supporting improved model performance following combinatory data pre-processing. The increased data quality and resulting decreased model complexity underline the added value of data pre-processing within the framework of species distribution modelling and model transferability.

7

Abiotic habitat suitability models to